feat: Add exponential backoff and log deduplication to Spotlight #5025

Open

mattico wants to merge 4 commits into getsentry:main from mattico:spotlight-backoff
Conversation

@mattico mattico commented Mar 13, 2026

Closes #3481.

Implements the error handling behavior from the SDK docs Spotlight spec. When the Spotlight server is unreachable, the SDK now logs the error only once (resetting after success) and implements exponential backoff (1s initial, 60s max) before retrying.
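As a rough illustration of the schedule described above (1 s initial, doubling, capped at 60 s), here is a hypothetical Python sketch; the constant names are illustrative and not from the PR's actual C# implementation:

```python
# Hypothetical sketch of the backoff schedule described above; the names
# INITIAL_RETRY_DELAY / MAX_RETRY_DELAY are illustrative, not from the PR.
INITIAL_RETRY_DELAY = 1.0
MAX_RETRY_DELAY = 60.0

def backoff_delay(consecutive_failures: int) -> float:
    """Delay to wait after the given number of consecutive failures."""
    return min(INITIAL_RETRY_DELAY * (2 ** consecutive_failures), MAX_RETRY_DELAY)

print([backoff_delay(n) for n in range(8)])
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```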

mattico and others added 2 commits March 13, 2026 11:53
…sport

When the Spotlight server is unreachable, the SDK now logs the error
only once and implements exponential backoff (1s initial, 60s max)
before retrying, per the Spotlight error handling spec.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mattico (Author) commented Mar 13, 2026

The SDK docs suggest this retry logic:

    def send(self, envelope):
        try:
            # Attempt to send
            self._send_envelope(envelope)
            # Reset retry delay on success
            self.retry_delay = 1.0
            self.error_logged = False
        except ConnectionError as e:
            # Exponential backoff
            if not self.error_logged:
                logger.error(f"Spotlight server unreachable at {self.url}: {e}")
                self.error_logged = True

            # Wait before retry
            time.sleep(self.retry_delay)
            self.retry_delay = min(self.retry_delay * 2, self.max_retry_delay)

            # Retry once, then give up for this envelope
            try:
                self._send_envelope(envelope)
                self.retry_delay = 1.0
            except ConnectionError:
                # Silently drop envelope after retry
                pass

While the actual sentry-python SDK implements this retry logic:

https://github.com/getsentry/sentry-python/blob/6c6705a3d990559a80a48477a873d3171b928b12/sentry_sdk/spotlight.py#L59-L97

    def capture_envelope(self, envelope):
        # type: (Envelope) -> None

        # Check if we're in backoff period - skip sending to avoid blocking
        if self._last_error_time > 0:
            time_since_error = time.time() - self._last_error_time
            if time_since_error < self._retry_delay:
                # Still in backoff period, skip this envelope
                return

        body = io.BytesIO()
        envelope.serialize_into(body)
        try:
            req = self.http.request(
                url=self.url,
                body=body.getvalue(),
                method="POST",
                headers={
                    "Content-Type": "application/x-sentry-envelope",
                },
            )
            req.close()
            # Success - reset backoff state
            self._retry_delay = self.INITIAL_RETRY_DELAY
            self._last_error_time = 0.0
        except Exception as e:
            self._last_error_time = time.time()

            # Increase backoff delay exponentially first, so logged value matches actual wait
            self._retry_delay = min(self._retry_delay * 2, self.MAX_RETRY_DELAY)

            # Log error once per backoff cycle (we skip sends during backoff, so only one failure per cycle)
            sentry_logger.warning(
                "Failed to send envelope to Spotlight at %s: %s. "
                "Will retry after %.1f seconds.",
                self.url,
                e,
                self._retry_delay,
            )

There are some differences:

  • The docs only catch ConnectionError rather than all exceptions.
  • The docs attempt to send every envelope up to twice; the backoff only delays the second attempt. The SDK instead drops envelopes during the backoff period and does not retry failed envelopes.

I suppose both the example code and the SDK do fit the stated requirements, just differently.

I implemented behavior closer to the Python SDK than to the docs suggestion, which I think is better anyhow. The one difference I can see between my implementation and the Python SDK is that the SDK logs one warning per backoff cycle, while mine logs only the first error (resetting after a success).
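For concreteness, the log-once-until-success variant described above could be sketched like this in Python (hypothetical names; the actual change is in C#):

```python
import time
import logging

logger = logging.getLogger("spotlight")

class SpotlightBackoff:
    """Illustrative sketch (not the PR's C# code): drop envelopes during
    the backoff window, double the delay on each failure, and log only
    the first error until a send succeeds again."""

    def __init__(self, initial=1.0, maximum=60.0):
        self.initial = initial
        self.maximum = maximum
        self.delay = initial
        self.last_error_time = 0.0
        self.error_logged = False

    def should_skip(self):
        # Inside the backoff window: drop the envelope without sending.
        if self.last_error_time <= 0:
            return False
        return (time.time() - self.last_error_time) < self.delay

    def record_success(self):
        self.delay = self.initial
        self.last_error_time = 0.0
        self.error_logged = False  # the next failure will be logged again

    def record_failure(self, exc):
        self.last_error_time = time.time()
        self.delay = min(self.delay * 2, self.maximum)
        if not self.error_logged:
            logger.warning("Spotlight unreachable: %s (retrying in %.1fs)",
                           exc, self.delay)
            self.error_logged = True
```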

@mattico mattico marked this pull request as ready for review March 13, 2026 18:34
codecov bot commented Mar 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.10%. Comparing base (abf4d67) to head (2df3b7b).
⚠️ Report is 35 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5025      +/-   ##
==========================================
+ Coverage   73.91%   74.10%   +0.19%     
==========================================
  Files         497      499       +2     
  Lines       17974    18079     +105     
  Branches     3517     3514       -3     
==========================================
+ Hits        13285    13397     +112     
+ Misses       3833     3826       -7     
  Partials      856      856              

    using var response = await _httpClient.SendAsync(request, cancellationToken).ConfigureAwait(false);
    await HandleResponseAsync(response, processedEnvelope, cancellationToken).ConfigureAwait(false);

    _backoff.RecordSuccess();
Bug: The backoff mechanism in SpotlightHttpTransport is incorrectly reset on HTTP error responses (4xx/5xx) because _backoff.RecordSuccess() is called regardless of the response status.
Severity: HIGH

Suggested Fix

The HandleResponseAsync method should return a status indicating success. SpotlightHttpTransport should then check this status. Call _backoff.RecordSuccess() only if the response was successful (e.g., 200 OK). Otherwise, call _backoff.RecordFailure() to correctly trigger the exponential backoff logic for HTTP errors.

Location: src/Sentry/Http/SpotlightHttpTransport.cs#L55

Potential issue: In `SpotlightHttpTransport.SendEnvelopeAsync`, the
`_backoff.RecordSuccess()` method is called after `HandleResponseAsync`. However,
`HandleResponseAsync` does not throw an exception for non-200 HTTP status codes. If the
Spotlight server returns an HTTP error (e.g., 500, 503), the backoff state is
incorrectly reset, and the retry delay is not increased. This defeats the exponential
backoff mechanism for HTTP-level failures, causing the SDK to spam a struggling server
instead of waiting. The intended error logging for this failure path is also skipped.


@mattico (Author)
To my knowledge, the Python SDK does not consider an HTTP error response a failure for backoff purposes (urllib3 doesn't throw for those), and the SDK docs don't either (they only catch ConnectionError).

I haven't thought deeply about it otherwise.

@cursor cursor bot left a comment
Cursor Bugbot has reviewed your changes and found 1 potential issue.


@jamescrosswell (Collaborator)

Thanks heaps for the PR @mattico ! It looks pretty good to me. There are some issues with a couple of our CI checks when running against forks of the repo, which I'm asking for advice on. Once we've worked through those though, I think we should be able to get this reviewed and merged in.

Apologies for the delay...



Development

Successfully merging this pull request may close these issues.

Support Spotlight without sending to Sentry
