mikasenghaas
left a comment
nice, yeah, let's do some testing on this to verify that we don't have any async race conditions anywhere, but directionally looks good.
i think mid-term we want to move away from the verifiers env group for training envs and make our abstractions "multi-env" aware by default, e.g. something like having a buffer and scheduler per env, governed by a "scheduling" component on top. i think we will want more and more fine-grained control over how each env behaves (e.g. here, whether or not to use group scheduling), and it's always awkward to code this into an abstraction that handles multiple cases, where you need conditionals everywhere
b33268a to f788c40
let's wait for @mikasenghaas's review before merging
Co-authored-by: faresobeid <faresobeid@users.noreply.github.com>
raise ValueError(
    "max_inflight_rollouts conflicts with oversampling_factor * batch_size"
)
raise ValueError("max_inflight_rollouts conflicts with oversampling_factor * batch_size")
imo, could just deprecate oversampling_factor, no?
i agree, but @samsja mentioned keeping it for backwards compatibility (although i'm hating this lol)
Signed-off-by: faresobeid <111092724+faresobeid@users.noreply.github.com>
Resolve conflicts in scheduler.py.

Adopt from main:
- safe_cancel/safe_cancel_all for proper async task cleanup
- update_policy_task management inside generate_batch()
- stop() method for clean shutdown
- maybe_update_policy rename
- compute_eval_ckpt_step, prev_ckpt_step tracking in orchestrator

Preserve branch behavior:
- Individual rollouts via run_rollout (not run_group)
- GroupState, groups dict, drop_group for deferred group scoring
- _fill_inflight_requests / _schedule_next_request loop

Made-with: Cursor
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Co-authored-by: faresobeid <faresobeid@users.noreply.github.com>
Rewrite the scheduler to schedule at the rollout level instead of the group level. This handles long tails within groups: when one rollout in a group finishes, its slot can be used for a rollout from another group instead of waiting for the rest of the group.
Currently, if any env needs group scoring, we fall back to the previous behaviour of group rollouts, although ideally verifiers moves closer to this so that we can do the scoring within prime-rl.
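To illustrate the scheduling change, here is a minimal sketch under stated assumptions, not the actual prime-rl scheduler. The names run_rollout_stub and schedule_rollouts, and the sleep durations, are hypothetical; the only things taken from this PR are the idea of a per-rollout concurrency cap (max_inflight_rollouts) and per-example groups.

```python
import asyncio
from collections import defaultdict


async def run_rollout_stub(example_id: int, duration: float) -> tuple[int, float]:
    """Hypothetical stand-in for one rollout; sleeps to simulate generation time."""
    await asyncio.sleep(duration)
    return example_id, duration


async def schedule_rollouts(
    durations_by_example: dict[int, list[float]],
    max_inflight_rollouts: int,
) -> list[int]:
    """Run every rollout as its own task; return example ids in group-completion order."""
    slots = asyncio.Semaphore(max_inflight_rollouts)  # cap counts rollouts, not groups
    groups: dict[int, list[float]] = defaultdict(list)
    completed: list[int] = []

    async def worker(example_id: int, duration: float) -> None:
        async with slots:
            _, result = await run_rollout_stub(example_id, duration)
        # The slot is released here, so a rollout from any group can take it.
        groups[example_id].append(result)
        if len(groups[example_id]) == len(durations_by_example[example_id]):
            completed.append(example_id)

    tasks = [
        asyncio.create_task(worker(example_id, duration))
        for example_id, durations in durations_by_example.items()
        for duration in durations
    ]
    await asyncio.gather(*tasks)
    return completed


if __name__ == "__main__":
    # Example 0 has one slow rollout; example 1 still finishes first.
    order = asyncio.run(schedule_rollouts({0: [0.01, 0.2], 1: [0.01, 0.01]}, 2))
    print(order)
```

With a cap of two, the slot freed by the fast rollout of example 0 is reused by example 1, so the slow rollout delays only its own group.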
Note
Medium Risk
Rewrites the training rollout scheduler and changes verification/scoring configuration, which can affect throughput, reward computation, and cancellation behavior during training.
Overview
Refactors rollout scheduling from group-level to individual rollout tasks to reduce long-tail stalls: the scheduler now tracks per-example “groups” internally while keeping max_inflight_rollouts as the primary concurrency limit, and updates batch inflight/off-policy metrics accordingly.
Adds deferred group scoring support for tasks whose rubrics require group-level reward functions: training envs run with score_rollouts=False and the scheduler calls rubric.score_group() once all rollouts for an example complete.
Replaces orchestrator.buffer.skip_verification with a new top-level orchestrator.verification.enabled switch (with validation preventing reward-dependent buffer features when disabled), updates docs/changelog, and adds a unit test covering off-policy update behavior during interleaved group cancellation.
Written by Cursor Bugbot for commit 72d5b87. This will update automatically on new commits.
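The deferred group scoring described in the overview could look roughly like the following. This is a sketch under assumptions: DeferredGroupScorer, add_rollout, and the score_group callable signature are hypothetical and are not the verifiers or prime-rl API; GroupState is named in the merge commit, but its fields here are guesses.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class GroupState:
    """Assumed bookkeeping: expected rollout count and the rollouts received so far."""
    expected: int
    rollouts: list[str] = field(default_factory=list)


class DeferredGroupScorer:
    def __init__(self, score_group: Callable[[list[str]], list[float]]) -> None:
        # Hypothetical stand-in for a group-level reward function.
        self.score_group = score_group
        self.groups: dict[int, GroupState] = {}

    def add_rollout(
        self, example_id: int, rollout: str, rollouts_per_example: int
    ) -> Optional[list[float]]:
        """Record one finished rollout; score the group only when it is complete."""
        state = self.groups.setdefault(example_id, GroupState(expected=rollouts_per_example))
        state.rollouts.append(rollout)
        if len(state.rollouts) < state.expected:
            return None  # still waiting on other rollouts for this example
        del self.groups[example_id]
        return self.score_group(state.rollouts)


def length_advantage(rollouts: list[str]) -> list[float]:
    """Toy group-level reward: each rollout's length relative to the group mean."""
    mean = sum(len(r) for r in rollouts) / len(rollouts)
    return [len(r) - mean for r in rollouts]
```

Rollouts from different examples can arrive interleaved; each group is scored exactly once, when its last rollout lands, which is why per-rollout scoring has to be switched off for such envs.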