Fix weight update race condition between trainer cleanup and orchestrator #1857
Open
Conversation
The trainer's `maybe_clean` deletes broadcast checkpoint directories on a timer, but the orchestrator may not have loaded them yet (its event loop can be blocked for 70+ seconds during rollout generation). This caused a `FileNotFoundError` on the inference server and crashed the run.

Fix:
- Orchestrator writes a `LOADED_STEP` marker after each successful weight update so the trainer knows which checkpoints have been consumed.
- Trainer only cleans broadcast directories the orchestrator has moved past (`candidate_step < loaded_step`).
- Orchestrator catches `update_weights` failures and retries on the next poll as defense-in-depth.
- Fix `DefaultModelLoader` import for vllm >= 0.16 (module renamed from `loader` to `default_loader`).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
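The marker protocol described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual prime-rl code: the `LOADED_STEP` filename comes from the PR description, but the helper names, paths, and the temp-file-plus-rename write are assumptions.

```python
import os
from pathlib import Path

MARKER_NAME = "LOADED_STEP"  # marker filename from the PR; layout is assumed


def write_loaded_step(broadcast_dir: Path, step: int) -> None:
    """Orchestrator side: record the last step whose weights were loaded.

    Written via a temp file + os.replace so the trainer never observes a
    partially written marker.
    """
    tmp = broadcast_dir / f".{MARKER_NAME}.tmp"
    tmp.write_text(str(step))
    os.replace(tmp, broadcast_dir / MARKER_NAME)


def read_loaded_step(broadcast_dir: Path) -> int:
    """Trainer side: -1 means the orchestrator has not loaded anything yet."""
    marker = broadcast_dir / MARKER_NAME
    try:
        return int(marker.read_text())
    except (FileNotFoundError, ValueError):
        return -1
```

`os.replace` is atomic on POSIX filesystems for same-directory renames, which is why the sketch writes to a hidden temp file first instead of writing the marker in place.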
samsja reviewed on Feb 23, 2026
Cursor Bugbot has reviewed your changes and found 2 potential issues.
mikasenghaas (Member) left a comment:
i think it should be fine? im a little lost with our ckpt logic (ie which ckpt goes where) since multi tenant but overall looks sensible
```diff
 )

-# Update weights on inference servers
+# Update weights on inference servers.
```

```diff
 return

+# Sweep all eligible historical steps so skipped candidates are eventually cleaned.
 for step_dir in path.glob("step_*"):
```
Member
this will break if we ever rename get_step_path but ig we wont haha so ig we're good?
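As a rough sketch of the sweep the diff above introduces: the trainer globs all `step_*` directories and removes only those the orchestrator has moved past. The function shape, the `interval_to_keep` handling, and the directory-name parsing are assumptions; prime-rl's actual `maybe_clean` may differ.

```python
import shutil
from pathlib import Path


def maybe_clean(path: Path, loaded_step: int, interval_to_keep: int = 1) -> None:
    """Illustrative sketch (not the real prime-rl implementation).

    Deletes broadcast step directories that (a) the orchestrator has already
    consumed (step < loaded_step) and (b) are not among the most recent
    `interval_to_keep` steps. Sweeping *all* eligible historical steps means
    candidates skipped on earlier timer ticks are eventually cleaned too.
    """
    steps = sorted(int(d.name.removeprefix("step_")) for d in path.glob("step_*"))
    keep = set(steps[-interval_to_keep:]) if interval_to_keep > 0 else set()
    for step in steps:
        if step < loaded_step and step not in keep:
            shutil.rmtree(path / f"step_{step}", ignore_errors=True)
```

Note the sweep is keyed off the glob pattern `step_*`, which is the coupling the reviewer comment above is pointing at: if the step-path naming ever changed, this loop would silently stop matching.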
Current bug in prime-rl: under excessive event loop lag, cleanup of an old `broadcast/step_n` directory does not happen immediately. Inference then checks `step_n` and sees it present, the delete runs, and the subsequent load fails because `step_n` is missing by the time it tries to load.
Note
Medium Risk
Touches orchestrator↔trainer synchronization around checkpoint lifecycle; mistakes could still lead to missing checkpoints or disk bloat. Also downgrades core GPU/ML dependencies (Torch/Triton/NVIDIA libs), which can impact runtime stability and compatibility.
Overview
Prevents a race where the trainer deletes broadcast `step_*` directories before the orchestrator/inference pool has finished loading them. The orchestrator now atomically writes a `LOADED_STEP` marker after a successful `update_weights`, and the trainer's `maybe_clean` only deletes `step_*` directories up to `min(candidate_step, loaded_step - 1)` (sweeping all eligible historical steps while respecting `interval_to_keep`).

Separately, `uv.lock` is updated to pin older versions of `torch`/`torchaudio`/`torchvision` and Linux `triton`, along with an `nvidia-nvshmem-cu12` downgrade.

Written by Cursor Bugbot for commit 221cf70.
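The defense-in-depth retry mentioned in the PR (catch `update_weights` failures and retry on the next poll) can be sketched as a single poll step. This is a hypothetical shape: `update_weights` and `write_marker` stand in for the real orchestrator calls, and the actual code likely lives inside an async loop.

```python
def try_update(step: int, loaded_step: int, update_weights, write_marker) -> int:
    """One poll iteration of the orchestrator's weight-update loop (sketch).

    Returns the new loaded step. The marker is only advanced after a
    successful load, so the trainer never cleans a checkpoint the
    orchestrator still needs.
    """
    if step <= loaded_step:
        return loaded_step  # nothing new to load
    try:
        update_weights(step)
    except FileNotFoundError:
        # Checkpoint directory vanished or isn't ready; retry on the next poll
        # instead of crashing the run.
        return loaded_step
    write_marker(step)
    return step
```

A transient failure therefore costs at most one poll interval, while the marker guarantees the trainer cannot delete the checkpoint out from under the retry.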