
[DYNAMO] feat: weight_broadcast.keep_recent retention knob for LoRA#2395

Draft
AmeenP wants to merge 1 commit into main from feat/lora-broadcast-keep-recent

Conversation


@AmeenP commented May 2, 2026

Summary

Adds an optional `weight_broadcast.keep_recent` knob that decouples broadcast-dir retention from `max_async_level`. Default behavior is unchanged: when unset, retention falls back to `max_async_level` exactly as before.

Problem

`maybe_clean` previously deleted `step_{N - max_async_level - 1}`. That's the right bound for full-weight broadcasts, but not for LoRA: vLLM lazy-loads the adapter inside `_prepare_inputs`, so a generate request admitted under an older adapter can page that adapter dir in after the orchestrator has moved on. The trainer can race ahead and `rmtree` the adapter directory just before the inference worker tries to read it.
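Concretely, a worked example of the old bound (the step values below are illustrative, not from this PR):

```python
# Worked example of the old retention rule. Values are illustrative,
# not code from this PR.
max_async_level = 2
N = 10  # current training step

# Old rule: delete step_{N - max_async_level - 1}.
deleted = N - max_async_level - 1  # -> 7
surviving = [f"step_{s}" for s in range(deleted + 1, N + 1)]

# A generate request admitted while step_7's adapter was current may
# still lazy-load the step_7 dir inside _prepare_inputs -- but step_7
# has just been rmtree'd, hence LoRAAdapterNotFoundError.
print(deleted)    # 7
print(surviving)  # ['step_8', 'step_9', 'step_10']
```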

Fix

New optional `weight_broadcast.keep_recent: int | None` (default `None`):

  • `None` → fall back to `max_async_level` (no behavior change for existing users).
  • Set to `max_async_level + 1` or `+2` for LoRA + filesystem broadcast.

A `model_validator` rejects `keep_recent < max_async_level` with a message that names the failure mode (`LoRAAdapterNotFoundError`), since a smaller-than-staleness retention window is always a misconfiguration.

`maybe_clean`'s int parameter is renamed `async_level` -> `retention` to reflect that it can now come from either knob.
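A hedged sketch of the renamed cleanup, assuming a flat `step_{N}` directory layout; the function body and signature are illustrative, not the PR's code:

```python
# Illustrative sketch only -- directory layout and signature are
# assumptions, not this PR's actual maybe_clean.
import shutil
from pathlib import Path


def maybe_clean(broadcast_dir: Path, step: int, retention: int) -> None:
    """Delete the broadcast dir that has aged out of the retention window.

    `retention` comes from weight_broadcast.keep_recent when set,
    otherwise from max_async_level (the previous behavior).
    """
    stale = broadcast_dir / f"step_{step - retention - 1}"
    if stale.is_dir():
        shutil.rmtree(stale)
```

Passing `retention = max_async_level + 1` keeps one extra step directory alive, which is exactly the slack a late `_prepare_inputs` lazy-load needs.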

Relationship to other PRs

A companion PR (#2391) addresses a different LoRA failure mode on the same code path — an NFS visibility race where the `STABLE` sentinel becomes visible before the adapter files it points to. The two are independent and can land in either order. Together they cover both classes of `LoRAAdapterNotFoundError` we've seen on shared-filesystem deployments.

@AmeenP AmeenP changed the title feat: weight_broadcast.keep_recent retention knob for LoRA [DYNAMO] feat: weight_broadcast.keep_recent retention knob for LoRA May 2, 2026
