Skip to content

fix(monitor): durable sample-upload retries with bounded backlog#2425

Open
JannikSt wants to merge 7 commits intomainfrom
fix/sample-upload-retry-queue
Open

fix(monitor): durable sample-upload retries with bounded backlog#2425
JannikSt wants to merge 7 commits intomainfrom
fix/sample-upload-retry-queue

Conversation

@JannikSt
Copy link
Copy Markdown
Member

@JannikSt JannikSt commented May 5, 2026

Why

A transient R2 PUT failure during _upload_to_r2 drops sample data for ~10 steps. The previous code retried 3 times within ~40 s, then discarded the parquet bytes; with log_extras.interval=10 a single network blip blanks the dashboard data tab for an entire run prefix.

What

  • Inner retry: replace hand-rolled loops in _upload_to_r2 / _request_presigned_url / _confirm_samples_upload with tenacity.AsyncRetrying. 6 attempts capped at 120 s wall-clock, exponential backoff + jitter. Retries only TransportError, RemoteProtocolError and 408/429/5xx — 4xx fails fast.
  • Outer retry: bounded FIFO backlog (max 5) of (step, parquet_bytes) on the monitor's bg loop. log_samples enqueues + triggers a serialized drain (oldest-first). Failed steps stay in the backlog and are reattempted on the next log_samples tick instead of being silently dropped.
  • Better error logging: full traceback via logger.opt(exception=True) so the actual httpx error type is visible instead of opaque OSError: [Errno 0].

Note

Medium Risk
Changes the sample-upload pipeline (async queueing, retry/cooldown, and shutdown flushing), so bugs could affect monitoring reliability or stall uploads, but scope is limited to PrimeMonitor sample logging.

Overview
Improves PrimeMonitor.log_samples durability by queueing sample parquet payloads into a bounded FIFO backlog and draining them serially on the background event loop, so transient upload failures no longer drop steps.

Replaces ad-hoc retry loops in the presign/R2 PUT/confirm flow with a shared tenacity retry policy that only retries retryable httpx transport errors and 408/429/5xx responses, with capped exponential backoff + jitter and clearer retry/error logging.

Enhances shutdown behavior by attempting a final backlog drain in _flush() (bypassing cooldown) and warning if samples remain queued and will be lost.

Reviewed by Cursor Bugbot for commit 5c96d98. Bugbot is set up for automated code reviews on this repo. Configure here.

A transient R2 PUT failure during _upload_to_r2 dropped sample data for
up to 10 steps because the previous code only retried 3 times within ~40s
and then discarded the parquet. With log_extras.interval=10 this meant a
single hiccup blanked the dashboard data tab for an entire run prefix.

- Replace the hand-rolled per-call retry loops in _upload_to_r2,
  _request_presigned_url and _confirm_samples_upload with tenacity
  AsyncRetrying. 6 attempts capped at 120s wall-clock, exponential
  backoff with jitter, retrying only TransportError, RemoteProtocolError
  and 408/429/5xx. Non-retryable errors fail fast.
- Move sample uploads through a bounded FIFO backlog (max 5) on the
  monitor's bg event loop. log_samples now enqueues the parquet bytes
  and triggers a serialized drain that processes oldest-first. Failed
  steps stay in the backlog and are reattempted on the next log_samples
  tick instead of being silently dropped.
- Log the full exception via loguru opt(exception=True) when an upload
  is parked in the backlog so the actual error type is visible instead
  of an opaque OSError Errno 0.
@JannikSt
Copy link
Copy Markdown
Member Author

JannikSt commented May 5, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 72990286c8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/prime_rl/utils/monitor/prime.py
Codex review on #2425 caught a race: backlog eviction runs without the
upload lock, so a concurrent log_samples() call could popleft the head
the drain coroutine was peeking at and currently uploading. The drain's
own success-path popleft would then remove a different (un-uploaded)
step, silently dropping its data.

Fix: popleft the in-flight item into a local variable before awaiting
the HTTP call. The in-flight item is no longer in the queue, so concurrent
eviction cannot touch it. On failure we appendleft to preserve ordering
for the next drain tick.
@JannikSt
Copy link
Copy Markdown
Member Author

JannikSt commented May 6, 2026

@codex review

Two new tests:

1. Failure-then-retry: a step that fails upload stays at the head of
   the backlog and is reattempted on the next log_samples tick.

2. Concurrent-eviction race (the codex P1 from PR review): start a
   drain that pops the in-flight item into a local var and awaits,
   then fire 5 concurrent log_samples calls that each evict from the
   queue. Verify the in-flight step survives and the correctly-oldest
   queued step is the one evicted on backlog overflow.
@JannikSt
Copy link
Copy Markdown
Member Author

JannikSt commented May 6, 2026

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Another round soon, please!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 5bd8639. Configure here.

Comment thread src/prime_rl/utils/monitor/prime.py Outdated
…enqueuing

Cursor Bugbot on #2425 caught that the outer drain's `except Exception`
re-queued every failure, including the non-retryable 4xx errors that
the inner tenacity layer (correctly) fails fast on. A permanently-
failing step (e.g. 400 from /samples/presign, expired API key) would
sit at the queue head and block every subsequent upload until backlog
overflow eventually evicted it.

Fix: split the drain's failure path by `_is_retryable_upload_error(e)`.
Retryable → appendleft and stop draining (existing behavior, will retry
next tick). Non-retryable → drop the step, clear from
`_pending_sample_steps`, and `continue` so later items aren't blocked.

Adds `test_non_retryable_sample_upload_failure_does_not_block_queue`
which preloads the queue with a step that raises a 400, plus three
healthy steps, and verifies that all four are attempted and the queue
ends up empty (the 400 step dropped rather than re-enqueued).
@JannikSt
Copy link
Copy Markdown
Member Author

JannikSt commented May 6, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dff33d5cc8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/prime_rl/utils/monitor/prime.py
…o-back retries

Codex P2 on #2425: when R2/API is down, multiple log_samples() calls
queue drain coroutines on the upload lock. After the first one finishes
its full ~120s tenacity retry budget and appendlefts the failed head
back, the next waiter immediately reacquires the lock and retries the
same head, burning another full budget. With sample interval=10, this
ties up the monitor event loop for several full retry budgets in a row
during a sustained outage and delays distributions/metrics uploads on
the same client.

Fix: arm a `_retryable_cooldown_until` timestamp (60s) on retryable
failure. Drain checks it at the top of each iteration and bails when
active. Cleared on success.

Updates `test_sample_upload_failure_keeps_step_in_backlog_and_retries_next_tick`
to advance the fake clock past the cooldown before the second tick, and
adds `test_retryable_failure_arms_cooldown_to_prevent_back_to_back_retries`
which asserts three rapid-fire failed log_samples calls produce only one
upload attempt, then verifies a successful tick after the cooldown.
@JannikSt
Copy link
Copy Markdown
Member Author

JannikSt commented May 6, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8e88ee14aa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/prime_rl/utils/monitor/prime.py
JannikSt added 2 commits May 5, 2026 22:33
Removes the tests added in 5bd8639, dff33d5, and 8e88ee1. The
behavior they covered (failure-retry persistence, in-flight survives
concurrent eviction, non-retryable drop, retryable cooldown) is still
implemented and was verified locally; just keeping the diff small.
Codex P2 on #2425: a retryable failure on the last log_samples tick (or
just before close) leaves the parquet bytes appendleft'd back into the
backlog with the upload future already complete. _flush() only waits on
pending futures, not on the queue, so close() proceeds to aclose the
HTTP client and stop the loop while the queued samples sit in memory
and are never retried — losing the very data the backlog was meant to
preserve.

Fix: extract the drain loop into _drain_sample_backlog and call it once
more from _flush() with ignore_cooldown=True, bounded by the existing
flush timeout. Any items still in the queue after that drain are logged
as lost.
@JannikSt
Copy link
Copy Markdown
Member Author

JannikSt commented May 6, 2026

@codex review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant