Fix weight update race condition between trainer cleanup and orchestrator #1857
Open
Conversation
The trainer's `maybe_clean` deletes broadcast checkpoint directories on a timer, but the orchestrator may not have loaded them yet (its event loop can be blocked for 70+ seconds during rollout generation). This caused a `FileNotFoundError` on the inference server and crashed the run.

Fix:
- Orchestrator writes a `LOADED_STEP` marker after each successful weight update so the trainer knows which checkpoints have been consumed.
- Trainer only cleans broadcast directories the orchestrator has moved past (`candidate_step < loaded_step`).
- Orchestrator catches `update_weights` failures and retries on the next poll as defense-in-depth.
- Fix `DefaultModelLoader` import for vllm >= 0.16 (module renamed from `loader` to `default_loader`).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
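The marker protocol described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual prime-rl code: the `LOADED_STEP` filename comes from the PR description, but the helper names, paths, and the temp-file-plus-rename write are assumptions.

```python
import os
from pathlib import Path

MARKER_NAME = "LOADED_STEP"  # marker filename from the PR; layout is assumed


def write_loaded_step(broadcast_dir: Path, step: int) -> None:
    """Orchestrator side: record the last step whose weights were loaded.

    Written via a temp file + os.replace so the trainer never observes a
    partially written marker.
    """
    tmp = broadcast_dir / f".{MARKER_NAME}.tmp"
    tmp.write_text(str(step))
    os.replace(tmp, broadcast_dir / MARKER_NAME)


def read_loaded_step(broadcast_dir: Path) -> int:
    """Trainer side: -1 means the orchestrator has not loaded anything yet."""
    marker = broadcast_dir / MARKER_NAME
    try:
        return int(marker.read_text())
    except (FileNotFoundError, ValueError):
        return -1
```

`os.replace` is atomic on POSIX filesystems for same-directory renames, which is why the sketch writes to a hidden temp file first instead of writing the marker in place.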
samsja reviewed on Feb 23, 2026
Cursor Bugbot has reviewed your changes and found 2 potential issues.
mikasenghaas (Member) left a comment:
i think it should be fine? im a little lost with our ckpt logic (ie which ckpt goes where) since multi tenant but overall looks sensible
```diff
 )

-# Update weights on inference servers
+# Update weights on inference servers.
```

```diff
 return

+# Sweep all eligible historical steps so skipped candidates are eventually cleaned.
 for step_dir in path.glob("step_*"):
```
Member
this will break if we ever rename get_step_path but ig we wont haha so ig we're good?
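As a rough sketch of the sweep the diff above introduces: the trainer globs all `step_*` directories and removes only those the orchestrator has moved past. The function shape, the `interval_to_keep` handling, and the directory-name parsing are assumptions; prime-rl's actual `maybe_clean` may differ.

```python
import shutil
from pathlib import Path


def maybe_clean(path: Path, loaded_step: int, interval_to_keep: int = 1) -> None:
    """Illustrative sketch (not the real prime-rl implementation).

    Deletes broadcast step directories that (a) the orchestrator has already
    consumed (step < loaded_step) and (b) are not among the most recent
    `interval_to_keep` steps. Sweeping *all* eligible historical steps means
    candidates skipped on earlier timer ticks are eventually cleaned too.
    """
    steps = sorted(int(d.name.removeprefix("step_")) for d in path.glob("step_*"))
    keep = set(steps[-interval_to_keep:]) if interval_to_keep > 0 else set()
    for step in steps:
        if step < loaded_step and step not in keep:
            shutil.rmtree(path / f"step_{step}", ignore_errors=True)
```

Note the sweep is keyed off the glob pattern `step_*`, which is the coupling the reviewer comment above is pointing at: if the step-path naming ever changed, this loop would silently stop matching.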
Current bug in prime-rl: under excessive event loop lag, cleanup of an old `broadcast/step_n` directory does not happen immediately. Inference then checks `step_n` and sees it present, the delete runs, and the subsequent load fails because `step_n` is missing by the time it tries to load.
Note
Medium Risk
Touches orchestrator↔trainer synchronization around checkpoint lifecycle; mistakes could still lead to missing checkpoints or disk bloat. Also downgrades core GPU/ML dependencies (Torch/Triton/NVIDIA libs), which can impact runtime stability and compatibility.
Overview
Prevents a race where the trainer deletes broadcast `step_*` directories before the orchestrator/inference pool has finished loading them. The orchestrator now atomically writes a `LOADED_STEP` marker after a successful `update_weights`, and the trainer's `maybe_clean` only deletes `step_*` directories up to `min(candidate_step, loaded_step - 1)` (sweeping all eligible historical steps while respecting `interval_to_keep`).

Separately, `uv.lock` is updated to pin older versions of `torch`/`torchaudio`/`torchvision` and Linux `triton`, along with an `nvidia-nvshmem-cu12` downgrade.

Written by Cursor Bugbot for commit 221cf70.
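The defense-in-depth retry mentioned in the PR (catch `update_weights` failures and retry on the next poll) can be sketched as a single poll step. This is a hypothetical shape: `update_weights` and `write_marker` stand in for the real orchestrator calls, and the actual code likely lives inside an async loop.

```python
def try_update(step: int, loaded_step: int, update_weights, write_marker) -> int:
    """One poll iteration of the orchestrator's weight-update loop (sketch).

    Returns the new loaded step. The marker is only advanced after a
    successful load, so the trainer never cleans a checkpoint the
    orchestrator still needs.
    """
    if step <= loaded_step:
        return loaded_step  # nothing new to load
    try:
        update_weights(step)
    except FileNotFoundError:
        # Checkpoint directory vanished or isn't ready; retry on the next poll
        # instead of crashing the run.
        return loaded_step
    write_marker(step)
    return step
```

A transient failure therefore costs at most one poll interval, while the marker guarantees the trainer cannot delete the checkpoint out from under the retry.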