Skip to content

STF-322: Bounded transport-failure retry in WebServiceClient#619

Open
oschwald wants to merge 4 commits intomainfrom
greg/stf-322
Open

STF-322: Bounded transport-failure retry in WebServiceClient#619
oschwald wants to merge 4 commits intomainfrom
greg/stf-322

Conversation

@oschwald
Copy link
Copy Markdown
Member

@oschwald oschwald commented May 2, 2026

Summary

  • Adds WebServiceClient.Builder.maxRetries(int) (defaults to 1) so a single transient transport failure (Connection reset, Broken pipe, EOF, closed channel, etc.) is retried transparently.
  • Predicate is permissive by exclusion: any IOException from httpClient.send() is retried EXCEPT HttpTimeoutException (which covers both request-phase and connect-phase timeouts, since HttpConnectTimeoutException extends it) and InterruptedIOException. Both timeouts are customer-set budgets that retrying would silently extend; InterruptedIOException is a user-cancellation signal we honor. HTTP 4xx/5xx come back as HttpResponse objects (not IOExceptions) and are surfaced through the existing exception hierarchy, so the predicate is structurally unable to retry them.
  • Restores the interrupt flag in both InterruptedException rewrap paths. Updates README and CHANGELOG; opt out via .maxRetries(0).

Fixes STF-322. Parallel PR for geoip2-java uses a byte-identical predicate (maxmind/GeoIP2-java#693).

Test plan

  • mvn checkstyle:check clean
  • mvn verify — 235/235 pass (24 existing + 8 new WebServiceClientTest cases)
  • Reviewer confirms README guidance points customers at jdk.httpclient.keepalive.timeout as the complementary workaround

🤖 Generated with Claude Code

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a transport-failure retry mechanism to the WebServiceClient, defaulting to one retry for specific IOExceptions such as connection resets, broken pipes, and connect timeouts. The implementation includes a new maxRetries configuration in the WebServiceClient.Builder, updated documentation in the README and CHANGELOG, and comprehensive test coverage for various retry scenarios. Feedback was provided regarding a potential integer overflow in the sendWithRetry method; if maxRetries is set to Integer.MAX_VALUE, the loop condition fails and results in a NullPointerException because lastException remains null. A refactor was suggested to use a while loop to avoid this overflow and simplify the logic.

Comment on lines +390 to +403
IOException lastException = null;
int attempts = maxRetries + 1;
for (int i = 0; i < attempts; i++) {
try {
return httpClient.send(request, HttpResponse.BodyHandlers.ofInputStream());
} catch (IOException e) {
if (!isRetriableTransportFailure(e) || i == attempts - 1) {
throw e;
}
lastException = e;
}
}
// Unreachable: loop either returns or throws.
throw lastException;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation of sendWithRetry has a potential NullPointerException due to integer overflow and contains redundant logic.

If maxRetries is set to Integer.MAX_VALUE, the expression maxRetries + 1 overflows to Integer.MIN_VALUE. Consequently, the loop condition i < attempts (i.e., 0 < -2147483648) evaluates to false, the loop is skipped entirely, and the method proceeds to throw lastException;. Since lastException is initialized to null, a NullPointerException is thrown instead of the intended transport error.

Additionally, the lastException variable is redundant because the loop is designed to either return a successful response or throw the current exception e on the final attempt or upon encountering a non-retriable failure. Refactoring to a while loop avoids the overflow risk and simplifies the logic.

    private HttpResponse<InputStream> sendWithRetry(HttpRequest request)
        throws IOException, InterruptedException {
        int attempts = 0;
        while (true) {
            try {
                return httpClient.send(request, HttpResponse.BodyHandlers.ofInputStream());
            } catch (IOException e) {
                if (!isRetriableTransportFailure(e) || attempts >= maxRetries) {
                    throw e;
                }
                attempts++;
            }
        }
    }

@oschwald
Copy link
Copy Markdown
Member Author

oschwald commented May 6, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a configurable retry mechanism for transport-level failures in the WebServiceClient. It introduces a maxRetries setting in the WebServiceClient.Builder, defaulting to one retry for issues like connection resets, broken pipes, and connection timeouts, while specifically excluding request-phase timeouts and HTTP 4xx/5xx responses. The changes include updates to the documentation, logic to handle retries within the client, and a comprehensive suite of tests to verify the new behavior and ensure correct thread interruption handling. I have no feedback to provide.

@oschwald
Copy link
Copy Markdown
Member Author

oschwald commented May 6, 2026

@gemini-code-assist review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a transport-failure retry mechanism to the WebServiceClient, defaulting to one retry for transient errors such as connection resets or broken pipes. It adds a maxRetries(int) configuration to the Builder and includes comprehensive documentation and tests. Review feedback identifies a discrepancy between the broad implementation of the retry predicate and the narrower description in the PR, as well as an inconsistency regarding the handling of HttpConnectTimeoutException.

I am having trouble creating individual review comments. Click here to see my feedback.

src/main/java/com/maxmind/minfraud/WebServiceClient.java (401-422)

medium

There is a discrepancy between the PR description and the implementation of this predicate. The description mentions a "narrow" predicate that "walks the cause chain" to match specific errors like Connection reset or Broken pipe. However, the current implementation is broad, retrying any IOException that isn't a timeout or interruption.

If the intent was to limit retries to specific transport failures as described, the implementation should be updated to inspect the exception's cause or message. If the broad approach is intentional (which is often acceptable for a single retry on a stale connection), the PR description should be updated to avoid confusion.

src/main/java/com/maxmind/minfraud/WebServiceClient.java (409-411)

medium

The PR summary states that HttpConnectTimeoutException is retried, but this implementation returns false for all HttpTimeoutException instances, which includes HttpConnectTimeoutException. The README and CHANGELOG also state that it is excluded.

If the intention was indeed to retry connection timeouts while excluding request-phase timeouts (as suggested by the PR summary), you should check for HttpConnectTimeoutException specifically before the general HttpTimeoutException check.

        if (e instanceof java.net.http.HttpConnectTimeoutException) {
            return true;
        }
        if (e instanceof HttpTimeoutException) {
            return false;
        }

oschwald and others added 4 commits May 6, 2026 19:44
When the JDK HttpClient pool reuses an idle connection that an intermediary
(load balancer, proxy, NAT) has silently closed, the next send() fails with
"Connection reset", "Broken pipe", or related transport errors. A single
retry recovers transparently without exposing this race to callers. The
default JDK keep-alive timeout exceeds many intermediaries' idle timeout,
making this mismatch the common case.

The retry predicate is permissive by exclusion: any IOException from
httpClient.send() is retried EXCEPT HttpTimeoutException (covering both
request-phase and connect-phase timeouts, since HttpConnectTimeoutException
is a subclass) and InterruptedIOException. Both timeouts are customer-set
budgets that retrying would silently extend; InterruptedIOException is a
user-cancellation signal. HTTP 4xx and 5xx responses are surfaced as
HttpException (and subclasses) from a separate code path -- they come back
as HttpResponse objects rather than IOExceptions, so the predicate is
structurally unable to retry them.

Customers can opt out via .maxRetries(0). Default is 1 (one retry, two total
attempts). The interrupt flag is restored before rewrapping
InterruptedException, and a pre-set interrupt short-circuits the predicate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing catch (InterruptedException) blocks in reportTransaction()
and responseFor() rewrap into MinFraudException without restoring the
thread's interrupt status, silently swallowing the cancellation signal.
Per Java's interruption protocol, code that catches InterruptedException
without rethrowing it should re-set the flag so callers up the stack can
observe the cancellation.

This is an independent bug fix bundled into the STF-322 retry work because
the retry feature exposes the path more often. Per project commit hygiene
it lands as a separate commit so it can be cherry-picked or reverted on
its own.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cover all 8 scenarios: connection-reset retry on score and reportTransaction
endpoints, no retry on HttpTimeoutException, retry on connect timeout
(deterministic via a closed local ServerSocket), no retry on 4xx/5xx,
.maxRetries(0) opt-out, and pre-interrupt short-circuit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

1 participant