fix(monitor): durable sample-upload retries with bounded backlog (#2425)
Conversation
A transient R2 PUT failure during `_upload_to_r2` dropped sample data for up to 10 steps, because the previous code only retried 3 times within ~40 s and then discarded the parquet. With `log_extras.interval=10`, a single hiccup blanked the dashboard data tab for an entire run prefix.
- Replace the hand-rolled per-call retry loops in `_upload_to_r2`, `_request_presigned_url`, and `_confirm_samples_upload` with tenacity `AsyncRetrying`: 6 attempts capped at 120 s wall-clock, exponential backoff with jitter, retrying only `TransportError`, `RemoteProtocolError`, and 408/429/5xx. Non-retryable errors fail fast.
- Move sample uploads through a bounded FIFO backlog (max 5) on the monitor's bg event loop. `log_samples` now enqueues the parquet bytes and triggers a serialized drain that processes oldest-first. Failed steps stay in the backlog and are reattempted on the next `log_samples` tick instead of being silently dropped.
- Log the full exception via loguru `opt(exception=True)` when an upload is parked in the backlog, so the actual error type is visible instead of an opaque `OSError: [Errno 0]`.
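The retry policy described above can be sketched with the stdlib. This is not the PR's code (which uses `tenacity.AsyncRetrying`); it only illustrates the stated semantics: 6 attempts, a 120 s wall-clock cap, exponential backoff with jitter, and a fail-fast path for non-retryable errors. The `.retryable` attribute convention is an assumption for the sketch.

```python
import asyncio
import random
import time

def is_retryable_status(code: int) -> bool:
    # Per the policy above: retry 408, 429, and any 5xx; other statuses fail fast.
    return code in (408, 429) or 500 <= code < 600

async def retry_async(op, *, attempts: int = 6, budget_s: float = 120.0,
                      base_s: float = 1.0, cap_s: float = 30.0):
    """Minimal stand-in for tenacity.AsyncRetrying: bounded attempts,
    wall-clock budget, exponential backoff with full jitter.
    `op` raises exceptions carrying a truthy `.retryable` attribute
    (hypothetical marker) when a retry is worthwhile."""
    deadline = time.monotonic() + budget_s
    for attempt in range(1, attempts + 1):
        try:
            return await op()
        except Exception as exc:
            if not getattr(exc, "retryable", False) or attempt == attempts:
                raise  # non-retryable, or out of attempts: fail fast
            delay = random.uniform(0, min(cap_s, base_s * 2 ** (attempt - 1)))
            if time.monotonic() + delay > deadline:
                raise  # sleeping would blow the wall-clock budget
            await asyncio.sleep(delay)
```

A transient error that clears after two failures succeeds on the third attempt without surfacing to the caller.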
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 72990286c8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Codex review on #2425 caught a race: backlog eviction runs without the upload lock, so a concurrent log_samples() call could popleft the head that the drain coroutine was peeking at and currently uploading. The drain's own success-path popleft would then remove a different (un-uploaded) step, silently dropping its data. Fix: popleft the in-flight item into a local variable before awaiting the HTTP call; since the in-flight item is no longer in the queue, concurrent eviction cannot touch it. On failure, appendleft it to preserve ordering for the next drain tick.
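The fixed drain shape can be sketched as follows (names are illustrative, not the PR's actual identifiers): the in-flight item is removed from the queue before the await, so a concurrent eviction can only touch items that are genuinely still queued.

```python
import asyncio
from collections import deque

async def drain_backlog(backlog: deque, upload) -> None:
    # Sketch of the race fix described above.
    while backlog:
        item = backlog.popleft()        # in-flight item now lives in a local
        try:
            await upload(item)          # concurrent eviction may run here
        except Exception:
            backlog.appendleft(item)    # failed head: preserve ordering
            return                      # retried on the next tick
```

A failed head lands back at position 0, so the next drain pass reattempts steps oldest-first.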
@codex review
Two new tests:
1. Failure-then-retry: a step that fails upload stays at the head of the backlog and is reattempted on the next log_samples tick.
2. Concurrent-eviction race (the Codex P1 from the PR review): start a drain that pops the in-flight item into a local var and awaits, then fire 5 concurrent log_samples calls that each evict from the queue. Verify that the in-flight step survives and that the oldest queued step is the one evicted on backlog overflow.
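The core of test 2 can be illustrated with a toy scenario (all names hypothetical, simplified to a single eviction): while the drain is mid-await on the in-flight step, an eviction removes the oldest still-queued entry, and the in-flight step is unaffected.

```python
import asyncio
from collections import deque

async def run_scenario():
    # Steps 10/20/30 queued oldest-first; drain picks up 10 first.
    backlog = deque([10, 20, 30])
    uploaded = []

    async def upload(item):
        await asyncio.sleep(0.05)       # stands in for the HTTP round trip
        uploaded.append(item)

    async def drain():
        while backlog:
            item = backlog.popleft()    # removed before awaiting
            await upload(item)

    task = asyncio.create_task(drain())
    await asyncio.sleep(0.02)           # drain is now mid-await on step 10
    backlog.popleft()                   # overflow eviction hits step 20
    await task
    return uploaded, list(backlog)
```

Because step 10 was popped into a local before the await, the eviction lands on step 20 (the oldest queued entry), never on the in-flight upload.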
@codex review
Codex Review: Didn't find any major issues. Another round soon, please!
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 5bd8639.
Cursor Bugbot on #2425 caught that the outer drain's `except Exception` re-queued every failure, including the non-retryable 4xx errors that the inner tenacity layer (correctly) fails fast on. A permanently-failing step (e.g. a 400 from /samples/presign, or an expired API key) would sit at the queue head and block every subsequent upload until backlog overflow eventually evicted it. Fix: split the drain's failure path on `_is_retryable_upload_error(e)`. Retryable → appendleft and stop draining (existing behavior; retried next tick). Non-retryable → drop the step, clear it from `_pending_sample_steps`, and `continue` so later items aren't blocked. Adds `test_non_retryable_sample_upload_failure_does_not_block_queue`, which preloads the queue with a step that raises a 400 plus three healthy steps, and verifies that all four are attempted and the queue ends up empty (the 400 step is dropped rather than re-enqueued).
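A sketch of the split failure path (the predicate and exception shape are assumptions; only `_is_retryable_upload_error` is named in the PR): retryable heads are re-queued and the drain stops, while poisoned steps are dropped so they cannot block later items.

```python
import asyncio
from collections import deque

class UploadError(Exception):
    # Hypothetical exception carrying an optional HTTP status.
    def __init__(self, status=None):
        self.status = status

def is_retryable_upload_error(exc: Exception) -> bool:
    status = getattr(exc, "status", None)
    if status is None:
        return True                      # transport-level error: retryable
    return status in (408, 429) or 500 <= status < 600

async def drain(backlog: deque, pending_steps: set, upload) -> None:
    while backlog:
        step, payload = backlog.popleft()
        try:
            await upload(payload)
        except Exception as exc:
            if is_retryable_upload_error(exc):
                backlog.appendleft((step, payload))   # retry next tick
                return
            pending_steps.discard(step)               # drop the poisoned step
            continue                                  # don't block the rest
        pending_steps.discard(step)
```

With a 400-raising head and three healthy steps queued, all four are attempted and the queue ends up empty, mirroring the new test's assertions.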
@codex review
💡 Codex Review
Reviewed commit: dff33d5cc8
Codex P2 on #2425: when R2/API is down, multiple log_samples() calls queue drain coroutines on the upload lock. After the first one exhausts its full ~120 s tenacity retry budget and appendlefts the failed head back, the next waiter immediately reacquires the lock and retries the same head, burning another full budget. With sample interval=10, this ties up the monitor event loop for several full retry budgets in a row during a sustained outage and delays distribution/metric uploads on the same client. Fix: arm a `_retryable_cooldown_until` timestamp (60 s) on retryable failure; the drain checks it at the top of each iteration and bails while it is active, and it is cleared on success. Updates `test_sample_upload_failure_keeps_step_in_backlog_and_retries_next_tick` to advance the fake clock past the cooldown before the second tick, and adds `test_retryable_failure_arms_cooldown_to_prevent_back_to_back_retries`, which asserts that three rapid-fire failed log_samples calls produce only one upload attempt, then verifies a successful tick after the cooldown.
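The cooldown gate can be sketched as a small state machine; only the `_retryable_cooldown_until` field is named in the PR, the surrounding class and an injectable clock (matching how the tests fake time) are assumptions.

```python
import time

class SampleUploader:
    """Sketch: armed on retryable failure, checked at the top of each
    drain iteration, cleared on a successful upload."""

    def __init__(self, cooldown_s: float = 60.0, clock=time.monotonic):
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._retryable_cooldown_until = 0.0

    def _arm_cooldown(self) -> None:        # called on retryable failure
        self._retryable_cooldown_until = self._clock() + self._cooldown_s

    def _clear_cooldown(self) -> None:      # called on successful upload
        self._retryable_cooldown_until = 0.0

    def _cooling_down(self) -> bool:        # checked before each drain pass
        return self._clock() < self._retryable_cooldown_until
```

While `_cooling_down()` is true, queued drain waiters bail out immediately instead of burning another full retry budget on the same failed head.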
@codex review
💡 Codex Review
Reviewed commit: 8e88ee14aa
Codex P2 on #2425: a retryable failure on the last log_samples tick (or just before close) leaves the parquet bytes appendleft'd back into the backlog with the upload future already complete. _flush() only waits on pending futures, not on the queue, so close() proceeds to aclose the HTTP client and stop the loop while the queued samples sit in memory and are never retried — losing the very data the backlog was meant to preserve. Fix: extract the drain loop into _drain_sample_backlog and call it once more from _flush() with ignore_cooldown=True, bounded by the existing flush timeout. Any items still in the queue after that drain are logged as lost.
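The close-path fix can be sketched like this (signatures are illustrative; only `_drain_sample_backlog`, `_flush()`, and `ignore_cooldown=True` are named above): one final drain bypassing the cooldown, bounded by the flush timeout, with anything still queued afterwards reported as lost.

```python
import asyncio
from collections import deque

async def flush(backlog: deque, drain, timeout_s: float = 10.0) -> int:
    """Sketch of _flush(): final drain, then report leftovers."""
    try:
        await asyncio.wait_for(drain(ignore_cooldown=True), timeout_s)
    except asyncio.TimeoutError:
        pass                              # respect the existing flush bound
    if backlog:
        # The real code logs a loguru warning that these samples are lost.
        print(f"dropping {len(backlog)} queued sample step(s) on close")
    return len(backlog)
```

On a healthy final drain the backlog empties and nothing is reported lost; only items that still fail within the flush timeout are surrendered.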
@codex review

Why
A transient R2 PUT failure during `_upload_to_r2` drops sample data for ~10 steps. The previous code retried 3 times within ~40 s, then discarded the parquet bytes; with `log_extras.interval=10`, a single network blip blanks the dashboard data tab for an entire run prefix.
What
- Replaces the hand-rolled retry loops in `_upload_to_r2` / `_request_presigned_url` / `_confirm_samples_upload` with `tenacity.AsyncRetrying`: 6 attempts capped at 120 s wall-clock, exponential backoff + jitter. Retries only `TransportError`, `RemoteProtocolError`, and 408/429/5xx — 4xx fails fast.
- Queues `(step, parquet_bytes)` into a bounded FIFO backlog on the monitor's bg loop. `log_samples` enqueues + triggers a serialized drain (oldest-first). Failed steps stay in the backlog and are reattempted on the next `log_samples` tick instead of being silently dropped.
- Logs parked uploads via `logger.opt(exception=True)` so the actual httpx error type is visible instead of an opaque `OSError: [Errno 0]`.
Note
Medium Risk
Changes the sample-upload pipeline (async queueing, retry/cooldown, and shutdown flushing), so bugs could affect monitoring reliability or stall uploads, but scope is limited to PrimeMonitor sample logging.
Overview
Improves `PrimeMonitor.log_samples` durability by queueing sample parquet payloads into a bounded FIFO backlog and draining them serially on the background event loop, so transient upload failures no longer drop steps.
Replaces ad-hoc retry loops in the presign/R2 PUT/confirm flow with a shared `tenacity` retry policy that only retries retryable `httpx` transport errors and 408/429/5xx responses, with capped exponential backoff + jitter and clearer retry/error logging.
Enhances shutdown behavior by attempting a final backlog drain in `_flush()` (bypassing the cooldown) and warning if samples remain queued and will be lost.
Reviewed by Cursor Bugbot for commit 5c96d98.