[DYNAMO] fix: NFS-safe filesystem broadcast for LoRA adapters by AmeenP · Pull Request #2391 · PrimeIntellect-ai/prime-rl

AmeenP · 2026-05-02T21:09:59Z

Summary

Fixes a visibility-ordering race in the filesystem-based weight broadcast that surfaces on shared filesystems (NFS / CephFS / Kubernetes PVC volumes). On a shared filesystem, a write on the trainer node is not immediately visible on the orchestrator/inference node. The previous code touched STABLE immediately after save_state_dict, so the orchestrator could observe the sentinel before adapter_model.safetensors / adapter_config.json were readable cluster-wide — causing LoRAAdapterNotFoundError when /v1/rl/load_lora_adapter was called.

Fix: fsync each adapter file, fsync the directory's dentries, touch STABLE, then fsync the directory once more so the sentinel itself is durable.

_fsync_path is a small helper. OSError is swallowed deliberately — fsync is a hint to the kernel, not the correctness gate on its own. The actual correctness gate is the ordering: write all data → flush → publish sentinel. On filesystems that don't support directory fsync, the OSError is harmless; on filesystems that do, the call is load-bearing.

Scope

One file changed, 32 lines added. No new config knobs, no behavior change for local-filesystem deployments.

A companion PR (#2395) adds the weight_broadcast.keep_recent config knob that addresses a separate LoRA failure mode on the same code path (the trainer racing ahead and rmtree-ing an adapter dir while a vLLM generate request is still lazily loading it). The two are independent and can land in either order. Together they cover both classes of LoRAAdapterNotFoundError we've seen on shared-filesystem deployments.


AmeenP mentioned this pull request May 2, 2026

[DYNAMO] feat: example k8s manifests + local smoke tests #2394

Draft

AmeenP force-pushed the fix/lora-broadcast-fsync-and-retention branch from be602ad to 60be9e3 Compare May 2, 2026 21:17

AmeenP changed the title ~~fix: NFS-safe filesystem broadcast + keep_recent retention for LoRA~~ fix: NFS-safe filesystem broadcast for LoRA adapters May 2, 2026

AmeenP mentioned this pull request May 2, 2026

[DYNAMO] feat: weight_broadcast.keep_recent retention knob for LoRA #2395

Draft

AmeenP changed the title ~~fix: NFS-safe filesystem broadcast for LoRA adapters~~ [DYNAMO] fix: NFS-safe filesystem broadcast for LoRA adapters May 2, 2026

AmeenP wants to merge 1 commit into main from fix/lora-broadcast-fsync-and-retention
