fix: crash-safe atomic JSON read-modify-write#3006

Open
npow wants to merge 5 commits into master from npow/atomic-json-update

Conversation

@npow (Collaborator) commented Mar 11, 2026

Summary

  • Adds atomic_json_update() utility to metaflow/util.py that combines lock-file-based flock serialization with atomic temp-file + os.replace for crash safety
  • Fixes conda_environment.py manifest writes (was flock on data file + seek/truncate — serialized but not crash-safe)
  • Fixes production_token.py token writes (had no locking at all — concurrent deploys could clobber each other)

Problem

Two distinct issues with JSON read-modify-write in the codebase:

  1. flock on data file + seek/truncate (conda manifest): Serializes writers correctly but a crash between seek(0) and finishing json.dump() leaves a truncated/corrupt file.

  2. No locking at all (production token): Concurrent step-functions create calls for different flows sharing the same token prefix silently lose writes.

Both stem from the same root cause: there was no shared utility for safe concurrent JSON updates, so each callsite rolled its own pattern (or didn't).
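For illustration, the flock-on-data-file + seek/truncate pattern from issue 1 looks roughly like this (a sketch of the broken approach, not the actual conda_environment.py code):

```python
import fcntl
import json

def unsafe_update(path, updater_fn):
    # BROKEN pattern (illustrative): lock the data file itself, then
    # truncate and rewrite it in place.
    with open(path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # correctly serializes writers...
        data = json.load(f)
        data = updater_fn(data)
        f.seek(0)
        f.truncate()
        # ...but a crash here, or partway through json.dump, leaves the
        # file truncated or half-written -- readers see corrupt JSON.
        json.dump(data, f)
```

The lock works, but the write itself is not atomic: there is a window between truncate() and the final byte of json.dump() where the on-disk file is invalid.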

Fix

atomic_json_update(path, updater_fn) uses a separate .lock file for flock (stable inode ensures serialization) + temp file + os.replace for the data file (crash-safe: readers never see a half-written file).

Key insight: you can't flock the data file and os.replace it — replacing the file creates a new inode, so other processes opening the file get a different inode and their own independent lock. The lock must be on a separate stable file.
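Such a helper might look like the following minimal sketch (the function name and behavior follow the PR description above; the merged implementation may differ in its details):

```python
import json
import os
import tempfile

def atomic_json_update(path, updater_fn):
    # Illustrative reimplementation of the scheme described in this PR.
    import fcntl  # imported in-function: fcntl is POSIX-only

    dir_name = os.path.dirname(path) or "."
    os.makedirs(dir_name, exist_ok=True)

    # Lock a SEPARATE .lock file: its inode is stable across updates, so
    # every process contends on the same lock. (The .lock file is
    # intentionally never deleted -- deleting it would break this.)
    with open(path + ".lock", "a+") as lock_f:
        fcntl.flock(lock_f, fcntl.LOCK_EX)
        try:
            try:
                with open(path) as f:
                    data = json.load(f)
            except FileNotFoundError:
                data = {}
            data = updater_fn(data)
            if not isinstance(data, dict):
                raise TypeError("updater_fn must return a dict")
            # Write to a temp file in the same directory, then atomically
            # swap it in: readers never observe a half-written file.
            fd, tmp = tempfile.mkstemp(dir=dir_name)
            with os.fdopen(fd, "w") as f:
                json.dump(data, f)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, path)
        finally:
            fcntl.flock(lock_f, fcntl.LOCK_UN)
        return data
```

Note that os.replace swaps in a new inode for the data file on every update, which is exactly why the flock must live on the separate, never-replaced .lock file.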

Test plan

test_concurrent_writes_no_corruption runs 50 threads x 20 writes each (1000 total) against a single JSON file and asserts all 1000 entries survive.

Before (old code): This test fails — the old patterns lose writes under contention. Confirmed locally: 970 of 1000 writes lost with the no-locking pattern, and the flock+replace pattern also loses writes due to the inode bug.

After (this PR): This test passes — atomic_json_update preserves all 1000 writes. 10/10 consecutive runs pass with zero lost writes.

Additional unit tests:

  • test_creates_file_if_missing — creates new file from empty dict
  • test_updates_existing_file — merges into existing data
  • test_crash_safety_no_partial_writes — updater raises, original file untouched
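The shape of the concurrency test can be sketched roughly like this (a simplified illustration of the test described above, not the actual test file; hammer is a hypothetical helper name):

```python
import json
import threading

def hammer(path, atomic_update, n_threads=50, writes_per_thread=20):
    # Each thread adds unique keys via the update function under test;
    # afterwards all n_threads * writes_per_thread entries must survive.
    def worker(tid):
        for i in range(writes_per_thread):
            # Default arg pins the key at lambda-creation time.
            atomic_update(path, lambda d, k="%d-%d" % (tid, i): {**d, k: 1})

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    with open(path) as f:
        return len(json.load(f))  # number of surviving entries
```

With a correct atomic_update, the return value equals n_threads * writes_per_thread; the lossy patterns return far fewer under contention.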

The existing JSON read-modify-write patterns had two problems:

1. conda_environment.py used fcntl.flock on the data file + seek/truncate.
   This serializes correctly but isn't crash-safe: a crash between seek(0)
   and finishing json.dump leaves a truncated/corrupt file.

2. production_token.py had no locking at all — concurrent step-functions
   deploys for different flows could clobber each other's tokens.

New approach: atomic_json_update() in metaflow/util.py uses a separate
.lock file for flock (stable inode, so the lock actually serializes)
combined with temp file + os.replace for the data (crash-safe: readers
never see a half-written file).

Both callers are updated to use the shared utility.
greptile-apps bot (Contributor) commented Mar 11, 2026

Greptile Summary

This PR introduces atomic_json_update() in metaflow/util.py and migrates two unsafe JSON write patterns onto it: the flock-on-data-file + seek/truncate in conda_environment.py (serialized but not crash-safe) and the completely unlocked write in production_token.py (lost writes under concurrent deploys). The new implementation is correct — a separate .lock file is used for flock so the data inode can be atomically replaced via os.replace, the fcntl import is scoped inside the function to avoid breaking non-POSIX importers of util.py, and a return-type guard on updater_fn prevents silent null corruption.

  • metaflow/util.py — atomic_json_update: clean implementation; separate lock file, temp-file/os.replace, dir_name guard, in-function fcntl import, and updater_fn type check are all correctly handled.
  • metaflow/plugins/aws/step_functions/production_token.py — store_token is now safe, but the _makedirs private helper (lines 20–29) is no longer called and should be removed.
  • metaflow/plugins/pypi/conda_environment.py — old flock pattern is gone, but import errno and import fcntl at the top of the file were not removed and are now unused.
  • test/plugins/conda/test_conda_environment_unit.py — 50-thread × 20-write concurrency test plus create/update/crash-safety unit tests provide solid coverage of the new utility.

Confidence Score: 4/5

  • Safe to merge; the core atomic-write logic is correct and the only issues are two trivial cleanup items (dead helper function, orphaned imports).
  • The fundamental correctness of the separate-lockfile + os.replace pattern is sound, all previously flagged issues have been addressed, and the concurrency tests validate the fix end-to-end. A score of 4 rather than 5 reflects the two small leftover hygiene issues (_makedirs dead code and stale fcntl/errno imports) that could confuse future readers.
  • Remaining cleanup: production_token.py (dead _makedirs helper) and conda_environment.py (unused fcntl/errno imports).

Important Files Changed

  • metaflow/util.py — Adds atomic_json_update(): correct separate-lockfile + temp-file/os.replace pattern with dir guard, in-function fcntl import, and updater_fn return-type check; no new issues found here.
  • metaflow/plugins/aws/step_functions/production_token.py — Unsafe unguarded write replaced with atomic_json_update; _makedirs helper is now dead code and should be removed.
  • metaflow/plugins/pypi/conda_environment.py — flock+seek/truncate replaced with atomic_json_update; import errno and import fcntl at the top of the file are now unused and should be removed.
  • test/plugins/conda/test_conda_environment_unit.py — Good coverage of atomic_json_update (concurrent writes, create-if-missing, update-existing, crash safety); concurrency test correctly validates all 1000 writes survive.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant atomic_json_update
    participant LockFile as ".lock file (stable inode)"
    participant TmpFile as "tmp file (mkstemp)"
    participant DataFile as "JSON data file"

    Caller->>atomic_json_update: atomic_json_update(path, updater_fn)
    atomic_json_update->>LockFile: open(path + ".lock", "a+")
    atomic_json_update->>LockFile: fcntl.flock(LOCK_EX)  [blocks until held]
    atomic_json_update->>DataFile: open(path, "r") — read current dict
    atomic_json_update->>atomic_json_update: d = updater_fn(d)
    atomic_json_update->>TmpFile: mkstemp() → write JSON + flush + fsync
    atomic_json_update->>DataFile: os.replace(tmp, path)  [atomic swap]
    atomic_json_update->>LockFile: fcntl.flock(LOCK_UN)
    atomic_json_update-->>Caller: return updated dict

Comments Outside Diff (2)

  1. metaflow/plugins/aws/step_functions/production_token.py, lines 20–28

    _makedirs is now dead code

    After migrating store_token to atomic_json_update, the _makedirs helper defined here is no longer called anywhere in this file. Directory creation is now handled inside atomic_json_update via os.makedirs(dir_name, exist_ok=True). Leaving the function in place is misleading for future maintainers and also preserves a Python 2 compatibility comment (# this is for python2 compatibility) that no longer applies.

    (Remove lines 20–29 entirely.)

  2. metaflow/plugins/pypi/conda_environment.py, lines 2–3

    Unused imports left behind after refactor

    errno and fcntl were exclusively used in the old write_to_environment_manifest implementation (for errno.EEXIST, errno.EAGAIN, and fcntl.flock). Both of those usages were removed as part of this PR, so these two imports are now orphaned. Most linters will flag them, and they give readers a false impression that platform-specific locking still happens in this file.

Last reviewed commit: d4fe183

Nissan Pow added 2 commits March 11, 2026 12:49
Add parameterized tests that run the old broken patterns (flock on
data file + os.replace, and no locking at all) under high contention
and verify they lose writes or corrupt data. These contrast with the
atomic_json_update and flock+seek/truncate tests which must always
preserve all writes.

The broken-pattern tests skip (rather than fail) if the race doesn't
trigger, since thread scheduling is non-deterministic.
- Move fcntl import inside function body so util.py stays importable on
  non-POSIX platforms (fcntl is Linux/macOS only)
- Guard against empty dirname when path has no directory component
  (e.g. bare "config.json") — both makedirs and mkstemp would fail
- Document that .lock files are intentionally never deleted (their
  stable inode is required for correct flock serialization)
Nissan Pow added 2 commits March 11, 2026 13:07
One test that failed with the old code (lost writes under contention)
and now passes with atomic_json_update. 50 threads x 20 writes = 1000
entries must all survive. Confirmed: old pattern loses 970/1000 writes,
new pattern loses 0.
@metaflow-oss-trigger

Netflix internal testing[1699] @ d4fe183
