[DYNAMO] fix: NFS-safe filesystem broadcast for LoRA adapters #2391

Draft: AmeenP wants to merge 1 commit into main from fix/lora-broadcast-fsync-and-retention

AmeenP (Contributor) commented May 2, 2026

Summary

Fixes a visibility-ordering race in the filesystem-based weight broadcast that surfaces on shared filesystems (NFS / CephFS / Kubernetes PVC volumes). On a shared filesystem, a write on the trainer node is not guaranteed to be visible immediately on the orchestrator/inference node. The previous code touched STABLE immediately after save_state_dict, so the orchestrator could observe the sentinel before adapter_model.safetensors and adapter_config.json were readable cluster-wide, which caused LoRAAdapterNotFoundError when /v1/rl/load_lora_adapter was called.
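
To make the failure mode concrete, here is a minimal reader-side sketch, assuming the orchestrator polls for the STABLE sentinel before loading an adapter. The function names, the timeout, and the polling interval are hypothetical and are not taken from this PR; only the file names come from the description above.

```python
import time
from pathlib import Path

# File names taken from the PR description.
REQUIRED_FILES = ("adapter_model.safetensors", "adapter_config.json")


def wait_for_stable(adapter_dir: Path, timeout_s: float = 30.0, poll_s: float = 0.1) -> bool:
    """Poll until the STABLE sentinel is visible; return False on timeout (hypothetical helper)."""
    deadline = time.monotonic() + timeout_s
    while True:
        if (adapter_dir / "STABLE").exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


def adapter_is_loadable(adapter_dir: Path) -> bool:
    """Check that every adapter file is visible from this node (hypothetical helper)."""
    return all((adapter_dir / name).is_file() for name in REQUIRED_FILES)
```

In the race described above, `wait_for_stable` can return True on the reader while `adapter_is_loadable` is still False, because the sentinel became visible before the adapter files did.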

Fix: fsync each adapter file, fsync the directory's dentries, touch STABLE, then fsync the directory once more so the sentinel itself is durable.

_fsync_path is a small helper. OSError is swallowed deliberately: each fsync call is best-effort and is not the correctness gate on its own. The actual correctness gate is the ordering: write all data → flush → publish sentinel. On filesystems that don't support directory fsync, the resulting OSError is harmless; on filesystems that do, the call is load-bearing.

Scope

One file changed, 32 lines added. No new config knobs, no behavior change for local-filesystem deployments.

A companion PR (#2395) adds the weight_broadcast.keep_recent config knob that addresses a separate LoRA failure mode on the same code path (the trainer racing ahead and rmtree-ing an adapter dir while a vLLM generate request is still lazily loading it). The two are independent and can land in either order. Together they cover both classes of LoRAAdapterNotFoundError we've seen on shared-filesystem deployments.
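
For readers unfamiliar with the retention failure mode, here is a rough sketch of what a keep-the-most-recent-N policy could look like. Everything in it is an assumption made for illustration: the `step_<n>` directory layout, the `prune_old_adapters` name, and the exact semantics of `keep_recent` are not taken from #2395.

```python
import shutil
from pathlib import Path


def prune_old_adapters(broadcast_dir: Path, keep_recent: int) -> list[Path]:
    """Delete all but the newest `keep_recent` step directories (hypothetical helper).

    Assumes adapter directories are named step_<n>; returns the removed paths.
    """
    step_dirs = sorted(
        (
            p
            for p in broadcast_dir.iterdir()
            if p.is_dir() and p.name.startswith("step_") and p.name[5:].isdigit()
        ),
        key=lambda p: int(p.name[5:]),  # numeric sort, so step_10 follows step_9
    )
    # Slicing with [:-0] would select nothing, so handle keep_recent <= 0 explicitly.
    stale = step_dirs[:-keep_recent] if keep_recent > 0 else step_dirs
    for p in stale:
        shutil.rmtree(p, ignore_errors=True)
    return stale
```

Keeping a few recent directories gives an inference worker that is still lazily loading an older adapter a window in which its files continue to exist.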

On shared filesystems (NFS / CephFS / PVC-backed volumes), a write on
one node isn't immediately visible on another. The previous code touched
STABLE immediately after `save_state_dict`, so the orchestrator could
observe the sentinel before the adapter files
(`adapter_model.safetensors`, `adapter_config.json`) were readable
cluster-wide -- causing `LoRAAdapterNotFoundError` when
`/v1/rl/load_lora_adapter` was called.

Fix: fsync each adapter file, fsync the directory's dentries, touch
STABLE, then fsync the directory once more so the sentinel itself is
durable. `_fsync_path` is a small helper; OSError is swallowed because
fsync is a hint to the kernel -- STABLE ordering is the actual
correctness gate.

Default behavior is unchanged for local-filesystem deployments.
AmeenP force-pushed the fix/lora-broadcast-fsync-and-retention branch from be602ad to 60be9e3 on May 2, 2026 21:17
AmeenP changed the title from "fix: NFS-safe filesystem broadcast + keep_recent retention for LoRA" to "fix: NFS-safe filesystem broadcast for LoRA adapters" on May 2, 2026
AmeenP changed the title from "fix: NFS-safe filesystem broadcast for LoRA adapters" to "[DYNAMO] fix: NFS-safe filesystem broadcast for LoRA adapters" on May 2, 2026