fix(monitor): durable sample-upload retries with bounded backlog by JannikSt · Pull Request #2425 · PrimeIntellect-ai/prime-rl

JannikSt · 2026-05-05T23:50:11Z

Why

A transient R2 PUT failure during _upload_to_r2 drops sample data for ~10 steps. The previous code retried 3 times within ~40 s, then discarded the parquet bytes; with log_extras.interval=10 a single network blip blanks the dashboard data tab for an entire run prefix.

What

Inner retry: replace hand-rolled loops in _upload_to_r2 / _request_presigned_url / _confirm_samples_upload with tenacity.AsyncRetrying. 6 attempts capped at 120 s wall-clock, exponential backoff + jitter. Retries only TransportError, RemoteProtocolError and 408/429/5xx — 4xx fails fast.
Outer retry: bounded FIFO backlog (max 5) of (step, parquet_bytes) on the monitor's bg loop. log_samples enqueues + triggers a serialized drain (oldest-first). Failed steps stay in the backlog and are reattempted on the next log_samples tick instead of being silently dropped.
Better error logging: full traceback via logger.opt(exception=True) so the actual httpx error type is visible instead of opaque OSError: [Errno 0].

Note

Medium Risk
Changes the sample-upload pipeline (async queueing, retry/cooldown, and shutdown flushing), so bugs could affect monitoring reliability or stall uploads, but scope is limited to PrimeMonitor sample logging.

Overview
Improves PrimeMonitor.log_samples durability by queueing sample parquet payloads into a bounded FIFO backlog and draining them serially on the background event loop, so transient upload failures no longer drop steps.

Replaces ad-hoc retry loops in the presign/R2 PUT/confirm flow with a shared tenacity retry policy that only retries retryable httpx transport errors and 408/429/5xx responses, with capped exponential backoff + jitter and clearer retry/error logging.

Enhances shutdown behavior by attempting a final backlog drain in _flush() (bypassing cooldown) and warning if samples remain queued and will be lost.

^{Reviewed by Cursor Bugbot for commit 5c96d98. Bugbot is set up for automated code reviews on this repo. Configure here.}

A transient R2 PUT failure during _upload_to_r2 dropped sample data for up to 10 steps because the previous code only retried 3 times within ~40s and then discarded the parquet. With log_extras.interval=10 this meant a single hiccup blanked the dashboard data tab for an entire run prefix. - Replace the hand-rolled per-call retry loops in _upload_to_r2, _request_presigned_url and _confirm_samples_upload with tenacity AsyncRetrying. 6 attempts capped at 120s wall-clock, exponential backoff with jitter, retrying only TransportError, RemoteProtocolError and 408/429/5xx. Non-retryable errors fail fast. - Move sample uploads through a bounded FIFO backlog (max 5) on the monitor's bg event loop. log_samples now enqueues the parquet bytes and triggers a serialized drain that processes oldest-first. Failed steps stay in the backlog and are reattempted on the next log_samples tick instead of being silently dropped. - Log the full exception via loguru opt(exception=True) when an upload is parked in the backlog so the actual error type is visible instead of an opaque OSError Errno 0.

JannikSt · 2026-05-05T23:51:37Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 72990286c8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Codex review on #2425 caught a race: backlog eviction runs without the upload lock, so a concurrent log_samples() call could popleft the head the drain coroutine was peeking at and currently uploading. The drain's own success-path popleft would then remove a different (un-uploaded) step, silently dropping its data. Fix: popleft the in-flight item into a local variable before awaiting the HTTP call. The in-flight item is no longer in the queue, so concurrent eviction cannot touch it. On failure we appendleft to preserve ordering for the next drain tick.

JannikSt · 2026-05-06T00:01:06Z

@codex review

Two new tests: 1. Failure-then-retry: a step that fails upload stays at the head of the backlog and is reattempted on the next log_samples tick. 2. Concurrent-eviction race (the codex P1 from PR review): start a drain that pops the in-flight item into a local var and awaits, then fire 5 concurrent log_samples calls that each evict from the queue. Verify the in-flight step survives and the correctly-oldest queued step is the one evicted on backlog overflow.

JannikSt · 2026-05-06T00:04:22Z

@codex review

chatgpt-codex-connector · 2026-05-06T00:06:42Z

Codex Review: Didn't find any major issues. Another round soon, please!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 5bd8639. Configure here.}

…enqueuing Cursor Bugbot on #2425 caught that the outer drain's `except Exception` re-queued every failure, including the non-retryable 4xx errors that the inner tenacity layer (correctly) fails fast on. A permanently- failing step (e.g. 400 from /samples/presign, expired API key) would sit at the queue head and block every subsequent upload until backlog overflow eventually evicted it. Fix: split the drain's failure path by `_is_retryable_upload_error(e)`. Retryable → appendleft and stop draining (existing behavior, will retry next tick). Non-retryable → drop the step, clear from `_pending_sample_steps`, and `continue` so later items aren't blocked. Adds `test_non_retryable_sample_upload_failure_does_not_block_queue` which preloads the queue with a step that raises a 400, plus three healthy steps, and verifies that all four are attempted and the queue ends up empty (the 400 step dropped rather than re-enqueued).

JannikSt · 2026-05-06T00:14:23Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dff33d5cc8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…o-back retries Codex P2 on #2425: when R2/API is down, multiple log_samples() calls queue drain coroutines on the upload lock. After the first one finishes its full ~120s tenacity retry budget and appendlefts the failed head back, the next waiter immediately reacquires the lock and retries the same head, burning another full budget. With sample interval=10, this ties up the monitor event loop for several full retry budgets in a row during a sustained outage and delays distributions/metrics uploads on the same client. Fix: arm a `_retryable_cooldown_until` timestamp (60s) on retryable failure. Drain checks it at the top of each iteration and bails when active. Cleared on success. Updates `test_sample_upload_failure_keeps_step_in_backlog_and_retries_next_tick` to advance the fake clock past the cooldown before the second tick, and adds `test_retryable_failure_arms_cooldown_to_prevent_back_to_back_retries` which asserts three rapid-fire failed log_samples calls produce only one upload attempt, then verifies a successful tick after the cooldown.

JannikSt · 2026-05-06T05:30:57Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8e88ee14aa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Removes the tests added in 5bd8639, dff33d5, and 8e88ee1. The behavior they covered (failure-retry persistence, in-flight survives concurrent eviction, non-retryable drop, retryable cooldown) is still implemented and was verified locally; just keeping the diff small.

Codex P2 on #2425: a retryable failure on the last log_samples tick (or just before close) leaves the parquet bytes appendleft'd back into the backlog with the upload future already complete. _flush() only waits on pending futures, not on the queue, so close() proceeds to aclose the HTTP client and stop the loop while the queued samples sit in memory and are never retried — losing the very data the backlog was meant to preserve. Fix: extract the drain loop into _drain_sample_backlog and call it once more from _flush() with ignore_cooldown=True, bounded by the existing flush timeout. Any items still in the queue after that drain are logged as lost.

JannikSt · 2026-05-06T05:52:29Z

Conversation

JannikSt commented May 5, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What

Uh oh!

JannikSt commented May 5, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

JannikSt commented May 6, 2026

Uh oh!

JannikSt commented May 6, 2026

Uh oh!

chatgpt-codex-connector Bot commented May 6, 2026

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

JannikSt commented May 6, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

JannikSt commented May 6, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

JannikSt commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

JannikSt commented May 5, 2026 •

edited by cursor Bot

Loading