The existing JSON read-modify-write patterns had two problems:

1. `conda_environment.py` used `fcntl.flock` on the data file + seek/truncate. This serializes correctly but isn't crash-safe: a crash between `seek(0)` and finishing `json.dump` leaves a truncated/corrupt file.
2. `production_token.py` had no locking at all — concurrent step-functions deploys for different flows could clobber each other's tokens.

New approach: `atomic_json_update()` in `metaflow/util.py` uses a separate `.lock` file for flock (stable inode, so the lock actually serializes) combined with temp file + `os.replace` for the data (crash-safe: readers never see a half-written file). Both callers are updated to use the shared utility.
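A minimal sketch of what such a utility could look like, assuming the semantics described above; the actual signature and error handling in `metaflow/util.py` may differ:

```python
import json
import os
import tempfile


def atomic_json_update(path, updater_fn):
    """Read-modify-write a JSON dict at `path`, serialized and crash-safe.

    The flock is taken on a separate `path + ".lock"` file whose inode is
    stable across updates; the data file itself is swapped via os.replace,
    so readers never observe a partially written file.
    """
    import fcntl  # POSIX-only; imported lazily so this module stays importable elsewhere

    dirname = os.path.dirname(path) or "."  # guard against bare filenames
    os.makedirs(dirname, exist_ok=True)
    with open(path + ".lock", "a+") as lock_f:
        fcntl.flock(lock_f, fcntl.LOCK_EX)  # blocks until the lock is held
        try:
            try:
                with open(path) as f:
                    d = json.load(f)
            except (FileNotFoundError, json.JSONDecodeError):
                d = {}
            d = updater_fn(d)
            # Write to a temp file in the same directory, then atomically swap.
            fd, tmp = tempfile.mkstemp(dir=dirname)
            try:
                with os.fdopen(fd, "w") as f:
                    json.dump(d, f)
                    f.flush()
                    os.fsync(f.fileno())
                os.replace(tmp, path)  # atomic on POSIX
            except BaseException:
                os.unlink(tmp)
                raise
            return d
        finally:
            fcntl.flock(lock_f, fcntl.LOCK_UN)
```

The `.lock` file is deliberately left in place after the update; deleting it would let a later writer create a fresh inode and lock independently of an in-flight one.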
Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller
    participant atomic_json_update
    participant LockFile as ".lock file (stable inode)"
    participant TmpFile as "tmp file (mkstemp)"
    participant DataFile as "JSON data file"
    Caller->>atomic_json_update: atomic_json_update(path, updater_fn)
    atomic_json_update->>LockFile: open(path + ".lock", "a+")
    atomic_json_update->>LockFile: fcntl.flock(LOCK_EX) [blocks until held]
    atomic_json_update->>DataFile: open(path, "r") — read current dict
    atomic_json_update->>atomic_json_update: d = updater_fn(d)
    atomic_json_update->>TmpFile: mkstemp() → write JSON + flush + fsync
    atomic_json_update->>DataFile: os.replace(tmp, path) [atomic swap]
    atomic_json_update->>LockFile: fcntl.flock(LOCK_UN)
    atomic_json_update-->>Caller: return updated dict
```
Add parameterized tests that run the old broken patterns (flock on data file + `os.replace`, and no locking at all) under high contention and verify they lose writes or corrupt data. These contrast with the `atomic_json_update` and flock+seek/truncate tests, which must always preserve all writes. The broken-pattern tests skip (rather than fail) if the race doesn't trigger, since thread scheduling is non-deterministic.
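A self-contained illustration of the lost-write race in the no-locking pattern; the names `unsafe_update` and `count_surviving_writes` are hypothetical, for demonstration only, and the surviving-write count varies from run to run:

```python
import json
import os
import tempfile
import threading


def unsafe_update(path, updater_fn):
    """The old production_token.py style: no lock, plain in-place rewrite."""
    try:
        with open(path) as f:
            d = json.load(f)
    except (FileNotFoundError, ValueError):
        d = {}
    d = updater_fn(d)
    with open(path, "w") as f:  # truncate + rewrite: racy and not crash-safe
        json.dump(d, f)


def count_surviving_writes(n_threads=50, writes_per_thread=20):
    """Hammer one JSON file from many threads; return how many keys survived."""
    path = os.path.join(tempfile.mkdtemp(), "state.json")

    def worker(tid):
        for i in range(writes_per_thread):
            # Bind the key as a default arg so each lambda keeps its own value.
            unsafe_update(path, lambda d, k="%d-%d" % (tid, i): {**d, k: True})

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    try:
        with open(path) as f:
            return len(json.load(f))
    except ValueError:
        return -1  # interleaved writers left the file as invalid JSON
```

Two threads that read the same snapshot and then write back each overwrite the other's key, so the returned count is typically well below `n_threads * writes_per_thread`; because scheduling is non-deterministic, a test built on this must skip when the race happens not to trigger.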
- Move `fcntl` import inside function body so `util.py` stays importable on non-POSIX platforms (`fcntl` is Linux/macOS only)
- Guard against empty dirname when path has no directory component (e.g. bare "config.json") — both `makedirs` and `mkstemp` would fail
- Document that `.lock` files are intentionally never deleted (their stable inode is required for correct flock serialization)
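The empty-dirname guard can be sketched like this (an assumed shape of the guard, not the exact code in `util.py`):

```python
import os
import tempfile

# For a bare filename like "config.json", os.path.dirname() returns "",
# which os.makedirs rejects; falling back to "." keeps both the makedirs
# call and mkstemp's dir argument valid.
path = "config.json"
dirname = os.path.dirname(path) or "."  # "" -> "."

os.makedirs(dirname, exist_ok=True)      # makedirs("") would raise
fd, tmp = tempfile.mkstemp(dir=dirname)  # temp file lands next to the data file
os.close(fd)
os.unlink(tmp)
```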
One test that failed with the old code (lost writes under contention) and now passes with `atomic_json_update`. 50 threads x 20 writes = 1000 entries must all survive. Confirmed: the old pattern loses 970/1000 writes; the new pattern loses 0.
Netflix internal testing[1699] @ d4fe183
Summary
- Adds an `atomic_json_update()` utility to `metaflow/util.py` that combines lock-file-based flock serialization with atomic temp-file + `os.replace` for crash safety
- Migrates `conda_environment.py` manifest writes (was flock on data file + seek/truncate — serialized but not crash-safe)
- Migrates `production_token.py` token writes (had no locking at all — concurrent deploys could clobber each other)

Problem
Two distinct issues with JSON read-modify-write in the codebase:
1. **flock on data file + seek/truncate** (conda manifest): Serializes writers correctly, but a crash between `seek(0)` and finishing `json.dump()` leaves a truncated/corrupt file.
2. **No locking at all** (production token): Concurrent `step-functions create` calls for different flows sharing the same token prefix silently lose writes.

Both stem from the same root cause: there was no shared utility for safe concurrent JSON updates, so each callsite rolled its own pattern (or didn't).
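For reference, the first pattern looks roughly like this (an illustrative reconstruction, not the actual `conda_environment.py` code):

```python
import fcntl
import json


def flock_seek_truncate_update(path, updater_fn):
    """Old-style in-place update: serialized by flock, but not crash-safe."""
    with open(path, "r+") as f:
        # Correct serialization: every writer opens the same inode and
        # therefore contends on the same lock.
        fcntl.flock(f, fcntl.LOCK_EX)
        d = json.load(f)
        d = updater_fn(d)
        f.seek(0)
        f.truncate()   # crash window opens here: file is now empty
        json.dump(d, f)  # ...and closes only once this write completes
        fcntl.flock(f, fcntl.LOCK_UN)
```

A crash (or SIGKILL) anywhere between `truncate()` and the end of `json.dump()` leaves the file truncated or half-written, which is exactly the failure the temp-file + `os.replace` approach removes.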
Fix
`atomic_json_update(path, updater_fn)` uses a separate `.lock` file for flock (stable inode ensures serialization) + temp file + `os.replace` for the data file (crash-safe: readers never see a half-written file).

Key insight: you can't flock the data file and `os.replace` it — replacing the file creates a new inode, so other processes opening the file get a different inode and their own independent lock. The lock must be on a separate stable file.
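The inode behavior behind this insight can be checked directly:

```python
import os
import tempfile

# os.replace renames a different file onto the data path, so the path now
# refers to a new inode. A process that flocked the old inode still holds
# its lock, but on an orphaned file nobody else will ever open again.
d = tempfile.mkdtemp()
path = os.path.join(d, "data.json")
with open(path, "w") as f:
    f.write("{}")
inode_before = os.stat(path).st_ino

tmp = os.path.join(d, "tmp.json")
with open(tmp, "w") as f:
    f.write("{}")
os.replace(tmp, path)
inode_after = os.stat(path).st_ino

assert inode_before != inode_after  # the lock target has silently changed
```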
Test plan
`test_concurrent_writes_no_corruption` runs 50 threads x 20 writes each (1000 total) against a single JSON file and asserts all 1000 entries survive.

Before (old code): This test fails — the old patterns lose writes under contention. Confirmed locally: 970 of 1000 writes lost with the no-locking pattern, and the flock+replace pattern also loses writes due to the inode bug.

After (this PR): This test passes — `atomic_json_update` preserves all 1000 writes. 10/10 consecutive runs pass with zero lost writes.

Additional unit tests:

- `test_creates_file_if_missing` — creates new file from empty dict
- `test_updates_existing_file` — merges into existing data
- `test_crash_safety_no_partial_writes` — updater raises, original file untouched