
Bug fixes and performance enhancements: object storage, checkpointing, Parquet loading #374

Closed
russfellows wants to merge 28 commits into mlcommons:main from russfellows:main

Conversation

@russfellows
Contributor

PR Summary: branch-3-0-2/bug-fixes-perf-enhancements

Branch: branch-3-0-2/bug-fixes-perf-enhancements
Base: main (mlcommons/storage)
Date: May 13, 2026
Tests: 127 passed, 0 failed (was 112 passed, 13 failed on clean main)


Issues Addressed

Of the 8 issues listed below, 7 are fixed by this branch (the fix for #372 is
still pending commit in the dlio_benchmark fork). Issue #369 was determined to
be an environment/OpenMPI configuration problem with no code fix applicable.

| Issue | Title | Status | Fix location |
|-------|-------|--------|--------------|
| #362 | Training stuck at epoch 1, no NVMe reads | ✅ Fixed | dlio_benchmark: reader_factory.py |
| #363 | collect_cluster_info() missing required results_dir | ✅ Fixed | benchmarks/base.py |
| #364 | Flux AU limited by Parquet deserialization throughput | ✅ Fixed | dlio_benchmark: reader_factory.py + s3dlio |
| #365 | Checkpointing split-phase reports wrong operation counts | ✅ Fixed | benchmarks/base.py |
| #367 | reportgen crashes with AttributeError on Namespace.file | ✅ Fixed | cli_parser.py |
| #369 | orte_init failed — No permission (-17) | ⚪ Not a code bug | OpenMPI environment/permissions issue |
| #371 | --params storage.storage_type=direct_fs silently uses pagecache | ✅ Fixed | dlio_benchmark: pytorch_checkpointing.py |
| #372 | 32 GB hard cap blocks large-memory runs | ✅ Fixed (pending commit) | dlio_benchmark: utils/config.py |

Commit History (above main)

Commit 1 — 022820b

Author: Devasena Inupakutika <devasena.i@samsung.com>
Message: cli_parser: guard --file/--object consolidation for non-benchmark subcommands
Fixes: #367
Cherry-picked from: PR #368

Problem: The reportgen, history, and lockfile subcommands do not call
add_storage_type_arguments(), so their Namespace objects have no .file or
.object attribute. The unconditional read and del in parse_arguments()
crashed with AttributeError.

Changes in mlpstorage_py/cli_parser.py:

  • Guard the --file/--object consolidation block with
    if hasattr(parsed_args, "file") or hasattr(parsed_args, "object"):
  • Use getattr(parsed_args, "file", False) instead of direct attribute access
  • Replace bare del parsed_args.file / del parsed_args.object with a
    for _attr in ("file", "object"): if hasattr(...): delattr(...) loop
    so neither attribute is required to be present

Also includes new unit tests in tests/unit/test_cli.py covering the
parser behaviour for all subcommand types.
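The guard described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual parse_arguments() code: the helper name consolidate_storage_flags and the "file"/"object" values written to data_access_protocol are hypothetical and chosen only for the example.

```python
import argparse


def consolidate_storage_flags(parsed_args):
    """Fold --file/--object into one attribute, tolerating parsers that lack them.

    Hypothetical helper: the real consolidation lives in parse_arguments().
    """
    # Only benchmark subcommands register --file/--object; skip the rest.
    if hasattr(parsed_args, "file") or hasattr(parsed_args, "object"):
        use_file = getattr(parsed_args, "file", False)
        use_object = getattr(parsed_args, "object", False)
        if use_object:
            parsed_args.data_access_protocol = "object"  # assumed value
        elif use_file:
            parsed_args.data_access_protocol = "file"  # assumed value
        # Remove whichever of the two attributes happens to be present.
        for _attr in ("file", "object"):
            if hasattr(parsed_args, _attr):
                delattr(parsed_args, _attr)
    return parsed_args


# A reportgen-style Namespace without .file/.object passes through untouched.
report_args = consolidate_storage_flags(argparse.Namespace(command="reportgen"))
train_args = consolidate_storage_flags(argparse.Namespace(file=True, object=False))
```

With this shape, downstream code that reads getattr(args, 'data_access_protocol', None) sees None for non-benchmark subcommands instead of triggering an AttributeError.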


Commit 2 — 03765a2

Author: Devasena Inupakutika <devasena.i@samsung.com>
Message: Remove unwanted file
Cherry-picked from: PR #368

Removes a requirements.txt that was accidentally included in the
previous commit.


Commit 3 — 7e4245b

Author: Devasena Inupakutika <devasena.i@samsung.com>
Message: Fix #363: pass results_dir to collect_cluster_info
Fixes: #363
Cherry-picked from: PR #366

Problem: Benchmark._collect_cluster_information() called
collect_cluster_info() without the required positional argument
results_dir. This caused a TypeError at runtime:

WARNING: MPI cluster info collection failed: collect_cluster_info()
missing 1 required positional argument: 'results_dir'

The missing cluster info then propagated as None into reportgen,
causing a downstream crash:

[INVALID] None: Check check_num_files_train failed with error:
'NoneType' object has no attribute 'total_memory_bytes'

Changes in mlpstorage_py/benchmarks/base.py:

  • Extract ssh_username and shared_staging_dir from self.args via
    getattr(..., None) before the call
  • Pass results_dir=self.run_result_output (the benchmark's computed
    output directory) to collect_cluster_info()
  • Pass shared_staging_dir=shared_staging_dir and
    ssh_username=ssh_username so SSH-based collection uses the correct
    credentials and staging path

Changes in mlpstorage_py/tests/test_benchmarks.py:

  • Set benchmark.run_result_output = '/tmp/results/run-001' in the
    test fixture (previously missing; the call site needs this attribute)
  • Update assert_called_once_with to expect results_dir,
    shared_staging_dir, and ssh_username
  • Add TestCollectClusterInfoSignatureBinding regression test class (2
    new tests) that binds the actual kwargs against inspect.signature()
    of the real collect_cluster_info function, so future signature drift
    is caught at unit-test time rather than at runtime
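The signature-binding idea can be illustrated with a short sketch. It is not the TestCollectClusterInfoSignatureBinding code itself: the stand-in collect_cluster_info below, its hosts parameter, and the helper call_binds are assumptions made for the example. Only inspect.signature() and Signature.bind() are standard-library calls.

```python
import inspect


def collect_cluster_info(hosts, results_dir, shared_staging_dir=None, ssh_username=None):
    """Stand-in with an assumed signature; the real function lives in mlpstorage_py."""
    return {"hosts": hosts, "results_dir": results_dir}


def call_binds(func, *args, **kwargs):
    """Return True if the arguments would bind to func's signature without calling it."""
    try:
        inspect.signature(func).bind(*args, **kwargs)
        return True
    except TypeError:
        # bind() raises TypeError for missing, extra, or misnamed arguments.
        return False


# The fixed call site binds; the pre-fix call (no results_dir) does not.
fixed_ok = call_binds(
    collect_cluster_info,
    ["node1"],
    results_dir="/tmp/results/run-001",
    shared_staging_dir=None,
    ssh_username=None,
)
old_ok = call_binds(collect_cluster_info, ["node1"])
```

Because bind() never executes the function, a check like this stays fast and needs no MPI or SSH environment.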

Commit 4 — 2431011

Author: Russell Fellows
Co-authored-by: Devasena Inupakutika <devasena.i@samsung.com>
Message: Fix #365, #372: metadata override propagation, test suite fixes, env lock
Fixes: #365

Fix #365 — CLI override_parameters not reflected in metadata.json

Problem: The submission checker reads num_checkpoints_write /
num_checkpoints_read from metadata['parameters'] (the YAML
defaults). For split-phase submissions (write-only or read-only runs),
the correct counts are passed as CLI overrides such as:

override_parameters.num_checkpoints_write=10

These overrides landed in metadata['override_parameters'] only, which
the checker ignores. As a result, a 10-write + 10-read split-phase run
would aggregate to 20 writes + 20 reads and be marked INVALID.

Changes in mlpstorage_py/benchmarks/base.py:

  • Add _apply_dotted_overrides(params, overrides) static method that
    deep-copies params and merges dotted-key overrides into the nested
    dict structure
  • In the metadata property, call _apply_dotted_overrides() so
    metadata['parameters'] reflects the effective runtime configuration
  • metadata['override_parameters'] is still emitted unchanged for a
    full audit trail
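A dotted-key merge of this kind might look like the sketch below. It is an illustration under assumptions rather than the code in base.py: the function is written as a free function, and the key layout in the usage lines ("checkpoint.num_checkpoints_read") is hypothetical.

```python
import copy


def apply_dotted_overrides(params, overrides):
    """Return a deep copy of params with dotted-key overrides merged in."""
    merged = copy.deepcopy(params)  # never mutate the caller's YAML defaults
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = merged
        for part in parents:
            child = node.get(part)
            if not isinstance(child, dict):
                # Create (or replace a scalar with) a nested dict on the way down.
                child = {}
                node[part] = child
            node = child
        node[leaf] = value
    return merged


defaults = {"checkpoint": {"num_checkpoints_write": 10, "num_checkpoints_read": 10}}
effective = apply_dotted_overrides(defaults, {"checkpoint.num_checkpoints_read": 0})
```

Returning a copy keeps the original defaults intact, so the unmerged overrides can still be written out separately for auditing.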

Note: PR #370 (crossmeta/zettalane) addresses the same root cause.
That PR is blocked pending CLA signature from @zettalane. This
implementation is carried independently; the two fixes are
functionally equivalent.

Fix — DLIOResultParser system info fallback

Problem: When a DLIO summary.json does not contain a system_info
block (e.g. runs from older DLIO versions), DLIOResultParser.parse()
returned None for ClusterInformation, breaking BenchmarkRun
validation.

Changes in mlpstorage_py/rules/models.py:

  • DLIOResultParser.parse() now accepts an optional metadata kwarg
  • When ClusterInformation.from_dlio_summary_json() returns None,
    fall back to metadata['cluster_information'] if present and
    reconstruct via ClusterInformation.from_dict()
  • BenchmarkRun.__init__ passes the run's metadata object to
    parser.parse() to enable the fallback
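The fallback order can be sketched as follows. The class ClusterInfoSketch, the function resolve_cluster_info, and the layout of the system_info block are hypothetical stand-ins; only the method names from_dlio_summary_json() and from_dict(), the cluster_information key, and the total_memory_bytes field come from the description above.

```python
from dataclasses import dataclass


@dataclass
class ClusterInfoSketch:
    """Hypothetical stand-in for ClusterInformation."""

    total_memory_bytes: int

    @classmethod
    def from_dlio_summary_json(cls, summary):
        # Assumed layout: older DLIO summaries simply omit "system_info".
        system_info = summary.get("system_info")
        if not system_info:
            return None
        return cls(total_memory_bytes=system_info["total_memory_bytes"])

    @classmethod
    def from_dict(cls, data):
        return cls(total_memory_bytes=data["total_memory_bytes"])


def resolve_cluster_info(summary, metadata=None):
    """Prefer the DLIO summary; fall back to run metadata when it has no system info."""
    info = ClusterInfoSketch.from_dlio_summary_json(summary)
    if info is None and metadata and metadata.get("cluster_information"):
        info = ClusterInfoSketch.from_dict(metadata["cluster_information"])
    return info


old_summary = {"metric": {}}  # no system_info block
run_metadata = {"cluster_information": {"total_memory_bytes": 64 * 1024**3}}
recovered = resolve_cluster_info(old_summary, run_metadata)
```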

Fix — 13 pre-existing test failures

mlpstorage_py/tests/test_cluster_collector.py (10 tests):

  • All MPIClusterCollector(...) constructor calls and
    collect_cluster_info(...) call sites in failing tests were missing
    the now-required results_dir argument — added results_dir='/tmp'
    to all 10 affected call sites
  • test_collector_returns_valid_data_without_error_marker: rewrote to
    use the current shared_staging_dir=tmpdir pattern instead of the
    obsolete UUID-based staging directory approach

mlpstorage_py/tests/test_rules.py (3 tests):

  • TestBenchmarkRunSystemInfoFallback tests were failing with
    ValueError: No summary.json found in /tmp/test_run because they
    attempted real filesystem I/O
  • Patched DLIOResultParser._load_summary and
    DLIOResultParser._load_hydra_configs to return in-memory mock data,
    removing the filesystem dependency
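The patching approach can be illustrated with a self-contained sketch. ResultParserSketch is a hypothetical stand-in for DLIOResultParser, and the summary contents are invented for the example; unittest.mock.patch.object is the standard-library call being demonstrated.

```python
import json
import os
from unittest import mock


class ResultParserSketch:
    """Hypothetical stand-in for a parser that normally reads summary.json from disk."""

    def __init__(self, run_dir):
        self.run_dir = run_dir

    def _load_summary(self):
        path = os.path.join(self.run_dir, "summary.json")
        if not os.path.exists(path):
            raise ValueError(f"No summary.json found in {self.run_dir}")
        with open(path) as handle:
            return json.load(handle)

    def parse(self):
        return self._load_summary()["metric"]


def parse_with_mock_summary(run_dir, summary):
    """Run parse() with the loader patched so no file is ever opened."""
    with mock.patch.object(ResultParserSketch, "_load_summary", return_value=summary):
        return ResultParserSketch(run_dir).parse()


metrics = parse_with_mock_summary("/tmp/test_run", {"metric": {"au": 99.0}})
```

Patching at the loader boundary keeps the parsing logic under test while removing any dependence on what happens to exist under /tmp.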

pyproject.toml / uv.lock

  • Add [tool.uv] environments = ["sys_platform == 'linux'"] to
    pyproject.toml so uv lock does not attempt to resolve non-Linux
    platform markers (s3dlio only publishes Linux wheels)
  • Regenerate uv.lock accordingly

dlio_benchmark Fixes (russfellows/dlio_benchmark — feat/parquet-dgen-streaming)

The following fixes are in the dlio_benchmark fork that is pinned by this
branch's pyproject.toml. They are already committed in the fork; issue #372
has an additional local change that is pending commit/push.


Fix #362 / #364 — Training stuck at epoch 1; Flux AU limited by CPU Parquet deserialization

Files: dlio_benchmark/reader/reader_factory.py,
dlio_benchmark/reader/parquet_reader_file_iterable.py (new),
dlio_benchmark/reader/parquet_reader_s3dlio.py
Commit: 1635b79 (feat: s3dlio-gen streaming, iterable dataloader, file iterable reader)

Issue #362 — Stuck at epoch 1, no NVMe reads:
reader_factory.py routed LOCAL_FS + Parquet to the legacy ParquetReader,
which calls pf.read_row_group() — full PyArrow deserialization on every read.
This is entirely CPU-bound and saturates the Python GIL, starving DLIO's
DataLoader workers of CPU time. Observed symptom: benchmark reaches
"Starting epoch 1" and then makes no measurable NVMe I/O while CPU pegs at
88-95%.

Issue #364 — Flux AU limited by per-process Parquet deserialization:
Same root cause. Even on a 192-vCPU Zen 4 machine, PyArrow's
read_row_group(use_threads=True) spawns additional decode threads per call.
Under DLIO's model (e.g. 4 MPI × 8 read_threads = 32 workers), hundreds of
threads contend on the GIL. AU on Skylake with data in tmpfs (zero I/O latency):
21% — storage is provably not the bottleneck; CPU decode is.

Fix: reader_factory.py now routes LOCAL_FS + Parquet to
ParquetReaderFileIterable — a new reader that performs raw byte-range reads
via a 64-thread ThreadPoolExecutor without any PyArrow decode. Data is
returned as raw bytes to the training loop. For S3/object storage, the s3dlio
Rust-based reader (ParquetReaderS3dlio) is used, which similarly bypasses
Python-side decode.

# Before (reader_factory.py):
# LOCAL_FS + Parquet → ParquetReader → pf.read_row_group() — full PyArrow decode

# After:
elif _args.storage_type in (StorageType.LOCAL_FS,):
    from dlio_benchmark.reader.parquet_reader_file_iterable import ParquetReaderFileIterable
    return ParquetReaderFileIterable(dataset_type, thread_index, epoch_number)
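As a rough illustration of the "raw byte-range reads via a thread pool, no decode" idea, the sketch below reads (offset, length) ranges from a local file with os.pread. It is not the ParquetReaderFileIterable implementation: the function name read_byte_ranges and the way ranges are supplied are assumptions, and a real reader would presumably derive the ranges from the Parquet footer.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_byte_ranges(path, ranges, max_workers=64):
    """Read (offset, length) ranges from path concurrently and return raw bytes.

    No Parquet decoding happens here; callers receive undecoded bytes.
    """
    fd = os.open(path, os.O_RDONLY)
    try:

        def read_one(byte_range):
            offset, length = byte_range
            # pread does not move a shared file offset, so threads can share one fd.
            return os.pread(fd, length, offset)

        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # map() preserves the order of the input ranges.
            return list(pool.map(read_one, ranges))
    finally:
        os.close(fd)
```

Blocking reads issued this way spend their time in the kernel rather than in Python-level decode work, which is the property the fix relies on to keep worker threads from contending for CPU.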

Result (from issue #364 testing, c6in.16xlarge, data on tmpfs):

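| Accelerators | use_threads | AU | Throughput | Result |
|--------------|-------------|----|------------|--------|
| 4 | True (before) | 54.38% | ~77 MB/s | ❌ FAIL |
| 4 | False (workaround) | 99.79% | 141.80 MB/s | ✅ PASS |
| 8 | False (workaround) | 99.68% | 283.07 MB/s | ✅ PASS |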

The ParquetReaderFileIterable path goes further — no decode at all — giving
even better scaling on older CPU generations (Skylake, Cascade Lake) that lack
AVX-512 Parquet acceleration.


Fix #371: --params storage.storage_type=direct_fs silently uses page cache

File: dlio_benchmark/checkpointing/pytorch_checkpointing.py
Commit: present in fork on branch feat/parquet-dgen-streaming

Problem: After PR #359 renamed the Python package from mlpstorage to
mlpstorage_py, one import path in dlio_benchmark was missed:

# Before (bug — old package name):
try:
    from mlpstorage.checkpointing import StreamingCheckpointing as _SC  # always fails
except ImportError:
    from dlio_benchmark.checkpointing.simple_streaming_checkpointing import (
        SimpleStreamingCheckpointing as _SC,   # silently falls back here
    )

SimpleStreamingCheckpointing ignores the backend='direct_fs' argument
entirely and uses plain open(path, "wb"). The result: when a user passes
--params storage.storage_type=direct_fs, page cache is never bypassed.
This was confirmed with free -h showing page cache growing during the write
phase and Lustre client cache filling up on a Lustre-backed mount.

Fix (one line):

# After:
from mlpstorage_py.checkpointing import StreamingCheckpointing as _SC

This ensures direct_fs checkpointing correctly uses O_DIRECT via s3dlio's
direct:// URI scheme, bypassing the page cache as intended.
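The failure mode here is a try/except ImportError that hides a broken preferred import. The sketch below is not part of the fix (which simply corrects the import path); it only illustrates, under assumptions, how a fallback can be made visible instead of silent. The helper name import_first_available and the module names in the usage lines are hypothetical.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def import_first_available(module_names):
    """Return (module, used_fallback) for the first importable name in order."""
    for index, name in enumerate(module_names):
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue
        used_fallback = index > 0
        if used_fallback:
            # Surface the downgrade so a stale package name cannot go unnoticed.
            logger.warning(
                "Preferred module %s unavailable; using fallback %s",
                module_names[0],
                name,
            )
        return module, used_fallback
    raise ImportError(f"None of {module_names!r} could be imported")


# "json" stands in for a fallback implementation; the first name is deliberately bogus.
module, used_fallback = import_first_available(
    ["package_that_does_not_exist_sketch.checkpointing", "json"]
)
```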


Fix #372 — 32 GB hard cap blocks large-memory runs

File: dlio_benchmark/utils/config.py
Status: Modified locally in russfellows/dlio_benchmark, pending commit/push

Problem: BUDGET_MB was hard-coded to 32 * 1024 (32 GB). On hosts with
more than 32 GB of RAM this cap artificially constrains the number of DataLoader
workers. The error manifests as:

Exception: Memory budget exceeded: reader.read_threads=2 x comm_size=64 = 128
worker processes, estimated ~64 GB (hard cap: 32 GB). Reduce reader.read_threads
to at most 1 for this run.

On a 377 GB host trying to run 64 accelerators × 2 read_threads, the cap
prevents any run above 32 B200 ranks × 2 threads = 32 GB, limiting throughput
to ~2.3 GB/s regardless of storage capability (well below a Gen5 NVMe's 14 GB/s).

Fix:

# Before:
BUDGET_MB = 32 * 1024  # 32 GB hard cap

# After:
BUDGET_MB = psutil.virtual_memory().total // (1024 * 1024)  # actual host RAM

The budget now scales with actual installed RAM, which is the correct
upper bound for in-memory dataset caching.
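The arithmetic behind the error message can be reproduced with a small sketch. The 512 MB per-worker estimate is an assumption inferred from the message above (128 workers estimated at ~64 GB), and the function names are hypothetical; the actual check in config.py may be structured differently. In the fixed code the budget would come from psutil.virtual_memory().total // (1024 * 1024); here it is passed in explicitly so the example has no third-party dependency.

```python
PER_WORKER_MB = 512  # assumption: 128 workers ~ 64 GB in the error message above


def estimated_mb(read_threads, comm_size, per_worker_mb=PER_WORKER_MB):
    """Estimated memory for read_threads x comm_size worker processes."""
    return read_threads * comm_size * per_worker_mb


def fits_budget(read_threads, comm_size, budget_mb, per_worker_mb=PER_WORKER_MB):
    return estimated_mb(read_threads, comm_size, per_worker_mb) <= budget_mb


def max_read_threads(comm_size, budget_mb, per_worker_mb=PER_WORKER_MB):
    """Largest read_threads value that still fits the budget."""
    return budget_mb // (comm_size * per_worker_mb)


old_budget_mb = 32 * 1024    # former hard cap
host_budget_mb = 377 * 1024  # the 377 GB host from the example above

blocked_before = not fits_budget(2, 64, old_budget_mb)
allowed_after = fits_budget(2, 64, host_budget_mb)
```

Under these assumptions the old cap allows at most one read thread at comm_size=64, matching the advice in the error text, while the RAM-sized budget admits the 64 x 2 configuration.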


Issue #369: orte_init failed: No permission (-17) (No code fix)

Problem: OpenMPI orte_init fails with getting local rank failed → Returned value No permission (-17). This occurs when MPI processes are
launched as root without passing --allow-run-as-root to mpirun, or
when running inside a container with restricted Linux namespaces that
prevent OpenMPI's process management layer from initializing.

Assessment: This is an environment and OpenMPI configuration issue,
not a bug in mlpstorage or dlio_benchmark. The fix is to add
--allow-run-as-root to the mpirun invocation, or to configure the
container/namespace permissions to allow OpenMPI's process manager. No
code change is warranted.


Test Results

Before (clean main):  112 passed, 13 failed
After  (this branch): 127 passed,  0 failed

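The net gain of 15 passing tests breaks down as 13 previously failing tests
fixed plus 2 new regression tests added with the #363 fix
(TestCollectClusterInfoSignatureBinding).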

russfellows and others added 28 commits April 27, 2026 15:37
- Fix all from/import statements: mlpstorage.X -> mlpstorage_py.X (33 py files)
- Fix all mock.patch() string paths: mlpstorage.X -> mlpstorage_py.X (~16 files)
- Replace 4 library-specific YAML configs with 1 workload-only s3_workload_unet3d.yaml
  (runtime params such as bucket, endpoint, storage_library belong in .env, not YAML)
- Add .env.example documenting all runtime parameters
- Update 22 shell scripts: pip/venv setup -> uv sync pattern
- Update tests/README.md: pip/venv -> uv, mlpstorage -> mlpstorage_py imports
- Update tests/object-store/README.md:
  - Replace 'cd mlp-storage && source .venv/bin/activate' with 'uv run python ...'
  - Update Library Selection section: YAML key -> runtime --param approach
  - Remove s3torchconnector from library selection table (keep historical results)
  - Update prerequisites: source .venv + source .env -> uv sync

Unit tests: 763 pass (previously 0 due to ModuleNotFoundError: mlpstorage)
- Extract --file/--object from add_universal_arguments into new
  add_storage_type_arguments() function; VectorDB/KVCache parsers
  no longer require it; training/checkpointing parsers call it
- Update training/checkpointing tests to pass --file in parse_args
- Wrap _collect_cluster_start/_collect_cluster_end with
  progress_context to show spinner during SSH/MPI collection
- Pass validate_structure=False to ReportGenerator in test fixtures
  that use empty temporary directories
- Change logger.error -> logger.warning for nonexistent results dir
  in get_runs_files; skip dirs with multiple metadata files
- Add _uri_for_filename alias to ParquetReaderS3Iterable
- Make --file/--object optional (required=False) so ALL benchmark
  parsers can carry the flag; VectorDB and KV-cache parsers now
  include it so the argument is available everywhere
- Fix progress.py: replace logger.status() (non-existent Logger
  method) with logger.info() in both progress_context and
  create_stage_progress non-interactive fallback paths
- Update tests to assert logger.info() instead of logger.status()

dlio_benchmark changes (local fork + installed venv):
- Replace broken \r-in-logger progress() with a Rich-based
  implementation using SpinnerColumn + BarColumn; falls back
  to plain stdout writes if Rich is unavailable
…rams

Reduce tests/object-store/ from 30+ files to 4 clean tests:
  - run_training.sh      — datagen + training via mlpstorage CLI
  - run_checkpointing.sh — checkpoint write + read via dlio_benchmark
  - test_s3lib_get_bench.py      — GET throughput benchmark (updated)
  - test_direct_write_comparison.py — native write/read benchmark (updated)

All runtime parameters (bucket, endpoint, storage library, credentials)
now come exclusively from environment variables or .env — no hardcoded
site-specific values remain in any test script or config file.

Changes:
- Archive 26 per-library scripts and result docs to old-archive/
- Archive 3 per-library checkpoint YAMLs to old-archive/
- Add configs/dlio/workload/llama3_8b_checkpoint.yaml: clean model-only
  YAML with all storage runtime params supplied via Hydra CLI overrides
- run_training.sh: BUCKET, STORAGE_LIBRARY, MODEL, NP all overridable
- run_checkpointing.sh: BUCKET, STORAGE_LIBRARY, NP, CHECKPOINTS all overridable
- test_s3lib_get_bench.py: use BUCKET env var (was hardcoded mlp-s3dlio);
  fail fast with clear error if bucket not set
- test_direct_write_comparison.py: use BUCKET env var as shared default;
  add validation error if required buckets not set
- Rewrite README.md: concise, accurate, uv-based instructions for all 4 tests

Unit tests: 905 passed, 4 skipped (no regressions)
…ore tests

- pyproject.toml: point dlio-benchmark at russfellows/dlio_benchmark@dev,
  which contains minio connection-pool fix and s3torchconnector bool fix
- uv.lock: regenerated after pyproject.toml change (resolved b1696e1)
- configs/dlio/workload: remove 17 library-specific YAML files (minio,
  s3dlio, s3torch variants) — all storage params are now supplied via
  --params CLI overrides from .env; generic YAMLs remain
- configs/dlio/workload/*.yaml (4 files): remove spurious 'region' field
- tests/object-store/README.md: complete rewrite with accurate instructions
- tests/object-store/run_training.sh: add s3torchconnector support,
  spawn multiprocessing, disable checkpoint in training tests
- tests/object-store/run_checkpointing.sh: set NP=4, add s3torchconnector
- tests/object-store/run_datagen.sh: new helper script
- tests/object-store/run_cleanup.sh: new helper script
- tests/object-store/old-archive/: archive stale test utility files
…d parquet loading

Object storage (dlio.py):
- _apply_object_storage_params() now logs the .env file path it loads
- Raises FileNotFoundError with actionable message if --object mode finds no .env

Config (config.py):
- DEFAULT_RESULTS_DIR reads MLPERF_RESULTS_DIR env var, falls back to tempdir

Main (main.py):
- Add import os (was missing after tempdir warning addition)
- Warn at startup when results will be written to system temp dir

Checkpointing (streaming_checkpoint.py):
- IPC Queue/Event created from same multiprocessing context as child process
- Fixes SemLock fork/spawn mismatch on non-fork start methods

MPI (utils.py):
- Add --mca btl ^vader to single-host MPI flags to prevent VADER segfaults

Dependencies (pyproject.toml, uv.lock):
- s3dlio >= 0.9.95
- python-dotenv >= 1.0.0
- dlio-benchmark pinned to russfellows/dlio_benchmark feat/parquet-dgen-streaming

Security (.gitignore):
- Block .env.* credential files; keep .env.example

Unit tests (933 passing, 4 skipped):
- tests/unit/test_config.py: 4 tests for DEFAULT_RESULTS_DIR env-var / tempdir behavior
- tests/unit/test_main_warnings.py: 4 tests for tempdir warning in run_benchmark()
- tests/unit/test_dlio_object_storage.py: 20 tests for _apply_object_storage_params()
- tests/unit/test_parquet_reader.py: updated 7 tests for new dlio-benchmark API
  (cache stores int byte-count not Table; no LRU eviction; close() is no-op)

Docs:
- docs/OBJECT_STORAGE_GUIDE.md moved from .github/ to docs/
- README.md, docs/README.md, tests/README.md: cross-reference links updated

Benchmark results and analysis (new in tests/):
- tests/benchmarks/: bench_*.py scripts (concurrency, phases, put_bytes, rt_switch, write_sizes, zerocopy)
- tests/object-store/: NPZ analysis, RetinaNet bench results, s3ultra results, scaling analysis, multi-endpoint test
- tests/Checkpoint_test_results.md, DLRM_test_results.md, Flux_test_results.md
- tests/RetinaNet_test_results.md, Parquet_dataloading.md, TEST-PLAN-2026-04-25.md
- tests/DLIO-optimization-analysis-2026-04-25.md
…iles

tests/unit/test_benchmarks_vectordb.py:
- Fix all patch() paths and inline imports (mlpstorage.* → mlpstorage_py.*)
- Add _validate_vdb_dependencies mock to all 14 tests that instantiate
  VectorDBBenchmark; that method runs in __init__ before verify_benchmark
  and raises DependencyError when optional packages (pymilvus, tabulate)
  are not installed in the base uv env

tests/unit/test_cli.py:
- Fix three import blocks (mlpstorage.cli, mlpstorage.cli_parser,
  mlpstorage.config → mlpstorage_py.*)
- Fix bare Namespace → argparse.Namespace in test_num_client_hosts_zero_is_preserved

All 15 previously-failing upstream tests now pass.
Full suite: 949 passed, 4 skipped.
…nhancements

Bug fixes and performance enhancements: object storage, checkpointing, Parquet loading
- flux_datagen.yaml: add use_s3dlio_gen: true, row_group_size: 48
- dlrm_b200.yaml: tune prefetch_size/read_threads for benchmark accuracy
- pyproject.toml: s3dlio>=0.9.100; dlio-benchmark from russfellows fork
  (feat/parquet-dgen-streaming); local s3dlio wheel NOTE comment
- tests/DLRM_test_results.md: direct DLIO benchmark reader comparison results
- docs/Flux_NP_ReadThreads_Scaling_Results.md: new -- NP in {1,2,4,8} x
  RT in {1,2,4,8} scaling sweep results, CPU threshold analysis,
  computation_time impact at 0.5s and 1.35s, samp/s/GPU column
- tests/object-store/: add bench/gen/run scripts for Flux and DLRM workloads
- .gitignore: ignore sweep_logs/, sweep_*.sh, sim_*.tsv*, results/
…ecture docs

── 1. DLRM workload config fixes (configs/dlio/workload/) ───────────────

dlrm_b200.yaml, dlrm_datagen.yaml:
  Reduce num_samples_per_file from 4,718,592 to 1,536,000.
  1,536,000 = 250 row groups x 6,144 rows/RG. This keeps the Parquet
  footer under the s3-ultra 4 MiB single-object GET limit. The previous
  value produced a footer exceeding 4 MiB, causing s3-ultra to reject
  the GET and fall back to a multi-part read, distorting latency.
  Also enables use_s3dlio_gen: true and aligns row_group_size to
  batch_size (6,144) for optimal row-group cache hit rate.

── 2. UNet3D B200 workload config (configs/dlio/workload/unet3d_b200.yaml) ─

New config for UNet3D benchmarking on B200-class hardware.
  - computation_time: 0.162 s (H100 baseline / 2 for B200 throughput target)
  - 7,200 NPZ files, ~140 MiB each, s3dlio storage library
  - batch_size: 4, read_threads: 4

── 3. UNet3D NP sweep scripts (tests/object-store/) ─────────────────────

sweep_unet3d_np.sh:
  Automated NP=1/2/4 scaling sweep for the UNet3D B200 workload.
  Each run writes results to results/unet3d_np_sweep/<timestamp>/.
  Appends a TSV summary row and auto-generates docs/UNet3D_NP_Scaling_Results.md
  at sweep completion. NP=8 excluded -- s3-ultra saturates at NP>=4.

gen_unet3d_npz.sh:
  Generates the 984 GiB UNet3D NPZ dataset on s3-ultra (mlp-unet3d bucket)
  using dlio_benchmark's NPZGenerator fast path (s3dlio generate_npz_bytes(),
  zero Python-side copies, hardware CRC32, Rayon parallel fill).

test_unet3d.sh:
  Single-run smoke test for the UNet3D B200 config (NP=1, 1 epoch).

── 4. DLRM sweep scripts (tests/object-store/) ──────────────────────────

sweep_dlrm_np.sh:      NP=1/2/4 scaling sweep for DLRM Parquet workload.
sweep_dlrm_compute.sh: Compute-time sensitivity sweep for DLRM.

── 5. DataLoader architecture documentation (docs/) ─────────────────────

docs/DATALOADER_ARCHITECTURE.md (new):
  Comprehensive reference covering two major topics:

  Part 1 -- Map-style vs. iterable DataLoaders on S3:
    Why "iterable is better for large datasets" originates from HDD seek
    patterns and does not apply to object storage. The real argument for
    iterable is pipeline depth: TorchIterableDatasetSimple achieves
    64 x num_workers in-flight GETs (vs 1 x num_workers with map-style).
    Covers TorchIterableDatasetSimple implementation mechanics, known
    limitations (per-epoch shuffle propagation, prefetch memory bounds,
    drop-last), and a summary comparison table.

  Part 2 -- O_DIRECT on local NVMe (two independent paths):
    Why O_DIRECT is required for accurate NVMe benchmarking (page cache
    problem). Detailed description and comparison of both available paths:
      - odirect: true  -- Python os.open+os.readv, map-style, 1 read/worker
      - storage_library: direct -- Rust/Tokio O_DIRECT, iterable, 64/worker
    12-property comparison table. Guidance on using both paths together
    to isolate I/O concurrency depth and GIL contention as independent
    variables. Includes TOC with anchor links to all sections.

docs/UNet3D_NP_Scaling_Results.md (new):
  NP=1/2/4 benchmark results for UNet3D B200 on s3-ultra.
  Generated by sweep_unet3d_np.sh.

docs/DLRM_NP_Scaling_Results.md (new):
  NP=1/2/4 benchmark results for DLRM Parquet on s3-ultra.

docs/Flux_NP_ReadThreads_Scaling_Results.md (updated):
  Additional read_threads sweep results appended.

docs/README.md (updated):
  - New "Where to Start" row: Benchmark NVMe with O_DIRECT pointing to
    DATALOADER_ARCHITECTURE.md#o_direct-local-storage-two-independent-paths
  - DATALOADER_ARCHITECTURE.md entry expanded to summarise both parts
    (S3 iterable DataLoader and O_DIRECT NVMe paths) with anchor link.

── 6. pyproject.toml / uv.lock ──────────────────────────────────────────

Switch dlio-benchmark dependency from git branch reference to local
editable path (../dlio_benchmark). Allows iterating on dlio_benchmark
and mlp-storage together without tagging intermediate git commits.
uv.lock updated accordingly.

── 7. .gitignore additions ──────────────────────────────────────────────

Add patterns for runtime artifacts that should never be committed:
  hydra_log/          -- Hydra config output written to cwd during runs
  sweep_unet3d_*.log  -- Timestamped sweep run logs written to repo root
  sweep_dlrm_*.log    -- Timestamped sweep run logs written to repo root
  sweep_flux_*.log    -- Timestamped sweep run logs written to repo root
uv.lock: bump s3dlio wheel to 0.9.100 (skip_head HEAD optimisation,
  PyDataset.from_uris(), items(), collect_batch())

tests/object-store/test_retinanet.sh: end-to-end retinanet 3-epoch benchmark
tests/object-store/gen_retinanet_jpeg.sh: generate retinanet JPEG dataset
tests/object-store/sweep_retinanet_np.sh: sweep concurrency parameters for NP workload
…3dlio 0.9.100)

Benchmark results from 2026-05-12 sweep on co-located 24 vCPU / 48 GB host.
50,000 JPEG files × ~315 KiB/file, 8 epochs, batch=24, read_threads=8.
DataLoader: TorchIterableDatasetSimple + _s3_stream_next() pipelined chunking.
dlio_benchmark commit: fc92d7f (feat/parquet-dgen-streaming).
pyproject.toml:
- dlio-benchmark: local editable -> GitHub rev 3667a0e (v3.0.2)
- s3dlio: local wheel source removed (now resolves from PyPI via >=0.9.100 pin)
- [tool.uv] environments = ["sys_platform == 'linux'"] added (s3dlio Linux-only)

uv.lock:
- dlio-benchmark 3.0.1 -> 3.0.2 from russfellows/dlio_benchmark@3667a0e
- s3dlio 0.9.100 from local wheel -> pypi.org/simple
- mlpstorage 2.0.0b1 -> 3.0.2
- Removed colorama + tzdata (Windows-only, no longer resolved)
…ve historical analysis

Deleted from old-archive/ (31 files):
- All per-library dlio_minio_*.sh, dlio_s3dlio_*.sh, dlio_s3torch_*.sh
  (superseded by unified run_datagen/training/checkpointing/cleanup.sh)
- demo_streaming_checkpoint.sh, test_minio_checkpoint.py,
  test_s3dlio_checkpoint.py, test_s3torch_checkpoint.py
  (superseded by run_checkpointing.sh)
- test_dlio_direct_s3dlio.sh, test_dlio_multilib_demo.py,
  test_mlp_minio/s3dlio/s3torch.sh, test_s3dlio_multilib.sh,
  test_training_mpi_sweep.py (superseded by sweep_*.sh)
- llama3_8b_checkpoint_*.yaml (configs now in configs/dlio/)
- dlio_mpi_object_results.md, Object_Perf_Results.md,
  s3dlio_performance_analysis.md (stale; issues since resolved)

Moved from top-level to old-archive/ (historical reference):
- bench_npz_build.py, bench_parquet_rg_flux.py, bench_wholefile_get.py
- bench-results-retinanet-20260425.md

Remaining old-archive/ contains 10 reference files:
- test_direct_write_comparison.py, test_s3dlio_direct.py,
  test_s3dlio_formats.py/.sh, test_s3lib_get_bench.py,
  S3library_review_21-Mar.md (library API/concurrency reference)
- bench_npz_build.py, bench_parquet_rg_flux.py, bench_wholefile_get.py
  (historical optimization analysis)
- bench-results-retinanet-20260425.md (historical benchmark results)
…ts, add sweeps/

Deleted:
- test_dlrm.sh, test_flux.sh — redundant one-liners; run_dlrm_bench.sh and
  run_flux_bench.sh are the proper scripts (full result parsing, env handling)
- gen_flux_parquet.py — non-standard one-off that bypassed mlpstorage datagen;
  confusing next to the .sh generators; can be replaced with gen_flux_parquet.sh

Moved to old-archive/ (Apr-27, ~16 days old, superseded):
- run_datagen.sh, run_training.sh — generic multi-model wrappers replaced by
  model-specific run_*_bench.sh scripts
- test_multi_endpoint_s3dlio.py — demo script, not a test

New sweeps/ subdirectory:
- sweep_dlrm_compute.sh, sweep_dlrm_np.sh, sweep_flux.sh,
  sweep_retinanet_np.sh, sweep_unet3d_np.sh

Also removed sweep_flux.sh from .gitignore (it was excluded as a scratch
script; now tracked properly under sweeps/)
Replace old run_datagen/run_training-centric docs with:
- Structure diagram showing 4 model types × 1 generator + 1 benchmark each
- Quick Start showing the 3-command flow per model
- Table mapping model → format → generator → benchmark script
- Updated Archived Tests section listing what's in old-archive/

Removed: detailed parameter tables for run_datagen.sh and run_training.sh
(both scripts moved to old-archive in previous commit)
Deleted (superseded by May 12 sweep results in docs/):
- tests/object-store/NPZ-OPTIMIZATION-ANALYSIS.md  (bug now fixed, stale)
- tests/object-store/scaling-analysis-2026-04-25.md (s3dlio v0.9.86 era)
- tests/object-store/s3ultra-test-results-20260425.md (s3dlio v0.9.86 era)

README.md: added Performance Results section linking to current docs/:
- docs/DLRM_NP_Scaling_Results.md
- docs/Flux_NP_ReadThreads_Scaling_Results.md
- docs/RetinaNet_NP_Scaling_Results.md
- docs/UNet3D_NP_Scaling_Results.md
…commands

   reports/history/lockfile subparsers do not call add_storage_type_arguments(),
   so their Namespace has no .file or .object attribute. The unconditional
   read and delete in parse_arguments() crashed with AttributeError. Gate the
   consolidation on attribute presence; downstream code already uses
   getattr(args, 'data_access_protocol', None).

   Fixes mlcommons#367

Signed-off-by: Devasena Inupakutika <devasena.i@samsung.com>
Signed-off-by: Devasena Inupakutika <devasena.i@samsung.com>
… suite fixes, env lock

Fix mlcommons#365: apply CLI override_parameters into metadata.json parameters
  Add _apply_dotted_overrides() static method to Benchmark base class.
  At metadata serialization time, dotted-key CLI overrides are merged into
  the nested parameters dict so the submission checker sees the effective
  config (e.g. split-phase num_checkpoints_write/read). override_parameters
  is still emitted unchanged for full audit trail.
  This addresses the same root cause as PR mlcommons#370 (crossmeta/zettalane);
  that PR is pending CLA so this implementation is carried here independently.

Fix rules/models.py: system info fallback in DLIOResultParser
  When a DLIO summary.json lacks system_info, fall back to
  cluster_information from the run metadata dict. Fixes the
  TestBenchmarkRunSystemInfoFallback test class (3 tests).

Fix test suite: resolve 13 pre-existing test failures
  test_cluster_collector.py: add missing results_dir argument to all
    MPIClusterCollector constructor and collect_cluster_info() call sites
    (10 tests). Update test_collector_returns_valid_data_without_error_marker
    to use current shared_staging_dir=tmpdir pattern.
  test_rules.py: patch DLIOResultParser._load_summary and
    _load_hydra_configs in TestBenchmarkRunSystemInfoFallback tests so
    they use in-memory mock data instead of hitting /tmp/test_run (3 tests).
  All 127 tests now pass (125 pre-existing + 2 added by PR mlcommons#366).

pyproject.toml/uv.lock: pin uv environments to Linux
  s3dlio only publishes Linux wheels; lock the uv environment selector to
  sys_platform == 'linux' so cross-platform lock generation does not fail.

Co-authored-by: Devasena Inupakutika <devasena.i@samsung.com>
…nhancements

Branch 3-0-2/bug fixes perf enhancements
@russfellows russfellows requested a review from a team May 13, 2026 16:32
@github-actions

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@russfellows
Contributor Author

Closing — opened in error, not ready for upstream review.

@github-actions github-actions Bot locked and limited conversation to collaborators May 13, 2026