fix: crash-safe atomic JSON read-modify-write#3006

Open
npow wants to merge 5 commits into master from npow/atomic-json-update

Conversation

@npow (Collaborator) commented Mar 11, 2026

Summary

  • Adds atomic_json_update() utility to metaflow/util.py that combines lock-file-based flock serialization with atomic temp-file + os.replace for crash safety
  • Fixes conda_environment.py manifest writes (was flock on data file + seek/truncate — serialized but not crash-safe)
  • Fixes production_token.py token writes (had no locking at all — concurrent deploys could clobber each other)

Problem

Two distinct issues with JSON read-modify-write in the codebase:

  1. flock on data file + seek/truncate (conda manifest): Serializes writers correctly but a crash between seek(0) and finishing json.dump() leaves a truncated/corrupt file.

  2. No locking at all (production token): Concurrent step-functions create calls for different flows sharing the same token prefix silently lose writes.

Both stem from the same root cause: there was no shared utility for safe concurrent JSON updates, so each callsite rolled its own pattern (or didn't).
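For illustration, the flock-on-data-file + seek/truncate pattern from issue 1 looks roughly like this (a sketch of the broken approach, not the actual conda_environment.py code):

```python
import fcntl
import json

def unsafe_update(path, updater_fn):
    # BROKEN pattern (illustrative): lock the data file itself, then
    # truncate and rewrite it in place.
    with open(path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # correctly serializes writers...
        data = json.load(f)
        data = updater_fn(data)
        f.seek(0)
        f.truncate()
        # ...but a crash here, or partway through json.dump, leaves the
        # file truncated or half-written -- readers see corrupt JSON.
        json.dump(data, f)
```

The lock works, but the write itself is not atomic: there is a window between truncate() and the final byte of json.dump() where the on-disk file is invalid.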

Fix

atomic_json_update(path, updater_fn) uses a separate .lock file for flock (stable inode ensures serialization) + temp file + os.replace for the data file (crash-safe: readers never see a half-written file).

Key insight: you can't flock the data file and os.replace it — replacing the file creates a new inode, so other processes opening the file get a different inode and their own independent lock. The lock must be on a separate stable file.
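Such a helper might look like the following minimal sketch (the function name and behavior follow the PR description above; the merged implementation may differ in its details):

```python
import json
import os
import tempfile

def atomic_json_update(path, updater_fn):
    # Illustrative reimplementation of the scheme described in this PR.
    import fcntl  # imported in-function: fcntl is POSIX-only

    dir_name = os.path.dirname(path) or "."
    os.makedirs(dir_name, exist_ok=True)

    # Lock a SEPARATE .lock file: its inode is stable across updates, so
    # every process contends on the same lock. (The .lock file is
    # intentionally never deleted -- deleting it would break this.)
    with open(path + ".lock", "a+") as lock_f:
        fcntl.flock(lock_f, fcntl.LOCK_EX)
        try:
            try:
                with open(path) as f:
                    data = json.load(f)
            except FileNotFoundError:
                data = {}
            data = updater_fn(data)
            if not isinstance(data, dict):
                raise TypeError("updater_fn must return a dict")
            # Write to a temp file in the same directory, then atomically
            # swap it in: readers never observe a half-written file.
            fd, tmp = tempfile.mkstemp(dir=dir_name)
            with os.fdopen(fd, "w") as f:
                json.dump(data, f)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, path)
        finally:
            fcntl.flock(lock_f, fcntl.LOCK_UN)
        return data
```

Note that os.replace swaps in a new inode for the data file on every update, which is exactly why the flock must live on the separate, never-replaced .lock file.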

Test plan

test_concurrent_writes_no_corruption runs 50 threads x 20 writes each (1000 total) against a single JSON file and asserts all 1000 entries survive.

Before (old code): This test fails — the old patterns lose writes under contention. Confirmed locally: 970 of 1000 writes lost with the no-locking pattern, and the flock+replace pattern also loses writes due to the inode bug.

After (this PR): This test passes — atomic_json_update preserves all 1000 writes. 10/10 consecutive runs pass with zero lost writes.

Additional unit tests:

  • test_creates_file_if_missing — creates new file from empty dict
  • test_updates_existing_file — merges into existing data
  • test_crash_safety_no_partial_writes — updater raises, original file untouched
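The shape of the concurrency test can be sketched roughly like this (a simplified illustration of the test described above, not the actual test file; hammer is a hypothetical helper name):

```python
import json
import threading

def hammer(path, atomic_update, n_threads=50, writes_per_thread=20):
    # Each thread adds unique keys via the update function under test;
    # afterwards all n_threads * writes_per_thread entries must survive.
    def worker(tid):
        for i in range(writes_per_thread):
            # Default arg pins the key at lambda-creation time.
            atomic_update(path, lambda d, k="%d-%d" % (tid, i): {**d, k: 1})

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    with open(path) as f:
        return len(json.load(f))  # number of surviving entries
```

With a correct atomic_update, the return value equals n_threads * writes_per_thread; the lossy patterns return far fewer under contention.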

The existing JSON read-modify-write patterns had two problems:

1. conda_environment.py used fcntl.flock on the data file + seek/truncate.
   This serializes correctly but isn't crash-safe: a crash between seek(0)
   and finishing json.dump leaves a truncated/corrupt file.

2. production_token.py had no locking at all — concurrent step-functions
   deploys for different flows could clobber each other's tokens.

New approach: atomic_json_update() in metaflow/util.py uses a separate
.lock file for flock (stable inode, so the lock actually serializes)
combined with temp file + os.replace for the data (crash-safe: readers
never see a half-written file).

Both callers are updated to use the shared utility.
greptile-apps bot (Contributor) commented Mar 11, 2026

Greptile Summary

This PR introduces atomic_json_update() in metaflow/util.py and migrates two unsafe JSON write patterns onto it: the flock-on-data-file + seek/truncate in conda_environment.py (serialized but not crash-safe) and the completely unlocked write in production_token.py (lost writes under concurrent deploys). The new implementation is correct — a separate .lock file is used for flock so the data inode can be atomically replaced via os.replace, the fcntl import is scoped inside the function to avoid breaking non-POSIX importers of util.py, and a return-type guard on updater_fn prevents silent null corruption.

  • metaflow/util.py — atomic_json_update: clean implementation; separate lock file, temp-file/os.replace, dir_name guard, in-function fcntl import, and updater_fn type check are all correctly handled.
  • metaflow/plugins/aws/step_functions/production_token.py — store_token is now safe, but the _makedirs private helper (lines 20–29) is no longer called and should be removed.
  • metaflow/plugins/pypi/conda_environment.py — old flock pattern is gone, but import errno and import fcntl at the top of the file were not removed and are now unused.
  • test/plugins/conda/test_conda_environment_unit.py — 50-thread × 20-write concurrency test plus create/update/crash-safety unit tests provide solid coverage of the new utility.

Confidence Score: 4/5

  • Safe to merge; the core atomic-write logic is correct and the only issues are two trivial cleanup items (dead helper function, orphaned imports).
  • The fundamental correctness of the separate-lockfile + os.replace pattern is sound, all previously flagged issues have been addressed, and the concurrency tests validate the fix end-to-end. A score of 4 rather than 5 reflects the two small leftover hygiene issues (_makedirs dead code and stale fcntl/errno imports) that could confuse future readers.
  • Remaining cleanup: production_token.py (dead _makedirs helper) and conda_environment.py (unused fcntl/errno imports).

Important Files Changed

  • metaflow/util.py — Adds atomic_json_update(): correct separate-lockfile + temp-file/os.replace pattern with dir guard, in-function fcntl import, and updater_fn return-type check; no new issues found here.
  • metaflow/plugins/aws/step_functions/production_token.py — Unsafe unguarded write replaced with atomic_json_update; _makedirs helper is now dead code and should be removed.
  • metaflow/plugins/pypi/conda_environment.py — flock+seek/truncate replaced with atomic_json_update; import errno and import fcntl at the top of the file are now unused and should be removed.
  • test/plugins/conda/test_conda_environment_unit.py — Good coverage of atomic_json_update (concurrent writes, create-if-missing, update-existing, crash safety); concurrency test correctly validates all 1000 writes survive.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant atomic_json_update
    participant LockFile as ".lock file (stable inode)"
    participant TmpFile as "tmp file (mkstemp)"
    participant DataFile as "JSON data file"

    Caller->>atomic_json_update: atomic_json_update(path, updater_fn)
    atomic_json_update->>LockFile: open(path + ".lock", "a+")
    atomic_json_update->>LockFile: fcntl.flock(LOCK_EX)  [blocks until held]
    atomic_json_update->>DataFile: open(path, "r") — read current dict
    atomic_json_update->>atomic_json_update: d = updater_fn(d)
    atomic_json_update->>TmpFile: mkstemp() → write JSON + flush + fsync
    atomic_json_update->>DataFile: os.replace(tmp, path)  [atomic swap]
    atomic_json_update->>LockFile: fcntl.flock(LOCK_UN)
    atomic_json_update-->>Caller: return updated dict

Comments Outside Diff (2)

  1. metaflow/plugins/aws/step_functions/production_token.py, lines 20–28

    _makedirs is now dead code

    After migrating store_token to atomic_json_update, the _makedirs helper defined here is no longer called anywhere in this file. Directory creation is now handled inside atomic_json_update via os.makedirs(dir_name, exist_ok=True). Leaving the function in place is misleading for future maintainers and also preserves a Python 2 compatibility comment (# this is for python2 compatibility) that no longer applies.

    (Remove lines 20–29 entirely.)

  2. metaflow/plugins/pypi/conda_environment.py, lines 2–3

    Unused imports left behind after refactor

    errno and fcntl were exclusively used in the old write_to_environment_manifest implementation (for errno.EEXIST, errno.EAGAIN, and fcntl.flock). Both of those usages were removed as part of this PR, so these two imports are now orphaned. Most linters will flag them, and they give readers a false impression that platform-specific locking still happens in this file.

Last reviewed commit: d4fe183

Nissan Pow added 2 commits March 11, 2026 12:49
Add parameterized tests that run the old broken patterns (flock on
data file + os.replace, and no locking at all) under high contention
and verify they lose writes or corrupt data. These contrast with the
atomic_json_update and flock+seek/truncate tests which must always
preserve all writes.

The broken-pattern tests skip (rather than fail) if the race doesn't
trigger, since thread scheduling is non-deterministic.
- Move fcntl import inside function body so util.py stays importable on
  non-POSIX platforms (fcntl is Linux/macOS only)
- Guard against empty dirname when path has no directory component
  (e.g. bare "config.json") — both makedirs and mkstemp would fail
- Document that .lock files are intentionally never deleted (their
  stable inode is required for correct flock serialization)
Nissan Pow added 2 commits March 11, 2026 13:07
One test that failed with the old code (lost writes under contention)
and now passes with atomic_json_update. 50 threads x 20 writes = 1000
entries must all survive. Confirmed: old pattern loses 970/1000 writes,
new pattern loses 0.
@metaflow-oss-trigger

Netflix internal testing[1699] @ d4fe183
