[DYNAMO] fix: NFS-safe filesystem broadcast for LoRA adapters#2391
Draft
[DYNAMO] fix: NFS-safe filesystem broadcast for LoRA adapters#2391
Conversation
On shared filesystems (NFS / CephFS / PVC-backed volumes), a write on one node isn't immediately visible on another. The previous code touched STABLE immediately after `save_state_dict`, so the orchestrator could observe the sentinel before the adapter files (`adapter_model.safetensors`, `adapter_config.json`) were readable cluster-wide -- causing `LoRAAdapterNotFoundError` when `/v1/rl/load_lora_adapter` was called. Fix: fsync each adapter file, fsync the directory's dentries, touch STABLE, then fsync the directory once more so the sentinel itself is durable. `_fsync_path` is a small helper; OSError is swallowed because fsync is a hint to the kernel -- STABLE ordering is the actual correctness gate. Default behavior is unchanged for local-filesystem deployments.
be602ad to
60be9e3
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a visibility-ordering race in the filesystem-based weight broadcast that surfaces on shared filesystems (NFS / CephFS / Kubernetes PVC volumes). On a shared filesystem, a write on the trainer node is not immediately visible on the orchestrator/inference node. The previous code touched
STABLEimmediately aftersave_state_dict, so the orchestrator could observe the sentinel beforeadapter_model.safetensors/adapter_config.jsonwere readable cluster-wide — causingLoRAAdapterNotFoundErrorwhen/v1/rl/load_lora_adapterwas called.Fix: fsync each adapter file, fsync the directory's dentries, touch
STABLE, then fsync the directory once more so the sentinel itself is durable._fsync_pathis a small helper.OSErroris swallowed deliberately — fsync is a hint to the kernel, not the correctness gate on its own. The actual correctness gate is the ordering: write all data → flush → publish sentinel. On filesystems that don't support directory fsync, theOSErroris harmless; on filesystems that do, the call is load-bearing.Scope
One file changed, 32 lines added. No new config knobs, no behavior change for local-filesystem deployments.
A companion PR (#2395) adds the
weight_broadcast.keep_recentconfig knob that addresses a separate LoRA failure mode on the same code path (the trainer racing ahead andrmtree-ing an adapter dir while a vLLM generate request is still lazily loading it). The two are independent and can land in either order. Together they cover both classes ofLoRAAdapterNotFoundErrorwe've seen on shared-filesystem deployments.