Bug fixes and performance enhancements: object storage, checkpointing, Parquet loading by russfellows · Pull Request #374 · mlcommons/storage

russfellows · 2026-05-13T16:32:07Z

PR Summary: branch-3-0-2/bug-fixes-perf-enhancements

Branch: branch-3-0-2/bug-fixes-perf-enhancements
Base: main (mlcommons/storage)
Date: May 13, 2026
Tests: 127 passed, 0 failed (was 112 passed, 13 failed on clean main)

Issues Addressed

Of the 8 most recent open issues on mlcommons/storage, 7 are fixed by this branch
(the fix for #372 is still pending commit). Issue #369 was determined to be an
environment/OpenMPI configuration problem with no code fix applicable.

| Issue | Title | Status | Fix location |
|---|---|---|---|
| #362 | Training stuck at epoch 1, no NVMe reads | ✅ Fixed | `dlio_benchmark` (`reader_factory.py`) |
| #363 | `collect_cluster_info()` missing required `results_dir` | ✅ Fixed | `benchmarks/base.py` |
| #364 | Flux AU limited by Parquet deserialization throughput | ✅ Fixed | `dlio_benchmark` (`reader_factory.py`) + s3dlio |
| #365 | Checkpointing split-phase reports wrong operation counts | ✅ Fixed | `benchmarks/base.py` |
| #367 | `reportgen` crashes with `AttributeError` on `Namespace.file` | ✅ Fixed | `cli_parser.py` |
| #369 | `orte_init` failed: No permission (-17) | ⚪ Not a code bug | OpenMPI environment/permissions issue |
| #371 | `--params storage.storage_type=direct_fs` silently uses pagecache | ✅ Fixed | `dlio_benchmark` (`pytorch_checkpointing.py`) |
| #372 | 32 GB hard cap blocks large-memory runs | ✅ Fixed (pending commit) | `dlio_benchmark` (`utils/config.py`) |

Commit History (above `main`)

Commit 1 — `022820b`

Author: Devasena Inupakutika <devasena.i@samsung.com>
Message: cli_parser: guard --file/--object consolidation for non-benchmark subcommands
Fixes: #367
Cherry-picked from: PR #368

Problem: The reportgen, history, and lockfile subcommands do not call
add_storage_type_arguments(), so their Namespace objects have no .file or
.object attribute. The unconditional read and del in parse_arguments()
crashed with AttributeError.

Changes — mlpstorage_py/cli_parser.py:

- Guard the --file/--object consolidation block with
  if hasattr(parsed_args, "file") or hasattr(parsed_args, "object"):
- Use getattr(parsed_args, "file", False) instead of direct attribute access
- Replace bare del parsed_args.file / del parsed_args.object with a
  for _attr in ("file", "object"): if hasattr(...): delattr(...) loop
  so neither attribute is required to be present

Also includes new unit tests in tests/unit/test_cli.py covering the
parser behaviour for all subcommand types.
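
The guard described above can be sketched roughly as follows. This is a minimal illustration, not the code in cli_parser.py: the function name `consolidate_storage_flags` and the `"file"` / `"object"` values written to `data_access_protocol` are assumptions made for the example.

```python
import argparse


def consolidate_storage_flags(parsed_args: argparse.Namespace) -> argparse.Namespace:
    """Fold --file/--object into one attribute, tolerating parsers that lack them.

    Hypothetical helper: the real logic lives inline in parse_arguments().
    """
    # Only parsers that registered the storage-type flags carry these attributes,
    # so non-benchmark subcommands skip this block entirely.
    if hasattr(parsed_args, "file") or hasattr(parsed_args, "object"):
        use_file = getattr(parsed_args, "file", False)
        use_object = getattr(parsed_args, "object", False)
        # The stored values are illustrative assumptions.
        if use_object:
            parsed_args.data_access_protocol = "object"
        elif use_file:
            parsed_args.data_access_protocol = "file"

    # Remove whichever raw flags exist; neither is required to be present.
    for _attr in ("file", "object"):
        if hasattr(parsed_args, _attr):
            delattr(parsed_args, _attr)
    return parsed_args
```

A Namespace produced by a subcommand without the storage flags passes through unchanged, which is the case that previously raised AttributeError.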

Commit 2 — `03765a2`

Author: Devasena Inupakutika <devasena.i@samsung.com>
Message: Remove unwanted file
Cherry-picked from: PR #368

Removes a requirements.txt that was accidentally included in the
previous commit.

Commit 3 — `7e4245b`

Author: Devasena Inupakutika <devasena.i@samsung.com>
Message: Fix #363: pass results_dir to collect_cluster_info
Fixes: #363
Cherry-picked from: PR #366

Problem: Benchmark._collect_cluster_information() called
collect_cluster_info() without the required positional argument
results_dir. This caused a TypeError at runtime:

WARNING: MPI cluster info collection failed: collect_cluster_info()
missing 1 required positional argument: 'results_dir'

The missing cluster info then propagated as None into reportgen,
causing a downstream crash:

[INVALID] None: Check check_num_files_train failed with error:
'NoneType' object has no attribute 'total_memory_bytes'

Changes — mlpstorage_py/benchmarks/base.py:

- Extract ssh_username and shared_staging_dir from self.args via
  getattr(..., None) before the call
- Pass results_dir=self.run_result_output (the benchmark's computed
  output directory) to collect_cluster_info()
- Pass shared_staging_dir=shared_staging_dir and
  ssh_username=ssh_username so SSH-based collection uses the correct
  credentials and staging path

Changes — mlpstorage_py/tests/test_benchmarks.py:

- Set benchmark.run_result_output = '/tmp/results/run-001' in the
  test fixture (previously missing; the call site needs this attribute)
- Update assert_called_once_with to expect results_dir,
  shared_staging_dir, and ssh_username
- Add TestCollectClusterInfoSignatureBinding regression test class (2
  new tests) that binds the actual kwargs against inspect.signature()
  of the real collect_cluster_info function, so future signature drift
  is caught at unit-test time rather than at runtime
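
The signature-binding idea can be sketched as below. The `collect_cluster_info` defined here is a hypothetical stand-in, since the real function's full signature is not shown in this PR, and `kwargs_bind` is an illustrative helper rather than the test class itself.

```python
import inspect


def collect_cluster_info(hosts, results_dir, shared_staging_dir=None, ssh_username=None):
    """Hypothetical stand-in; parameter names other than results_dir,
    shared_staging_dir, and ssh_username are assumptions."""
    return {"hosts": hosts, "results_dir": results_dir}


def kwargs_bind(func, **kwargs) -> bool:
    """Return True if `func` would accept `kwargs`, without actually calling it."""
    try:
        # Signature.bind raises TypeError for missing or unexpected arguments,
        # the same condition that would otherwise only surface at runtime.
        inspect.signature(func).bind(**kwargs)
    except TypeError:
        return False
    return True
```

Because the check binds against the real callable rather than a mock, a later change to the function's required parameters fails the test immediately.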

Commit 4 — `2431011`

Author: Russell Fellows
Co-authored-by: Devasena Inupakutika <devasena.i@samsung.com>
Message: Fix #365, #372: metadata override propagation, test suite fixes, env lock
Fixes: #365

Fix #365 — CLI override_parameters not reflected in metadata.json

Problem: The submission checker reads num_checkpoints_write /
num_checkpoints_read from metadata['parameters'] (the YAML
defaults). For split-phase submissions (write-only or read-only runs),
the correct counts are passed as CLI overrides such as:

override_parameters.num_checkpoints_write=10

These overrides landed in metadata['override_parameters'] only, which
the checker ignores. As a result, a 10-write + 10-read split-phase run
would aggregate to 20 writes + 20 reads and be marked INVALID.

Changes — mlpstorage_py/benchmarks/base.py:

- Add _apply_dotted_overrides(params, overrides) static method that
  deep-copies params and merges dotted-key overrides into the nested
  dict structure
- In the metadata property, call _apply_dotted_overrides() so
  metadata['parameters'] reflects the effective runtime configuration
- metadata['override_parameters'] is still emitted unchanged for a
  full audit trail
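
A minimal sketch of a dotted-key merge of this kind is shown below. It is written as a free function for brevity and is not the method from base.py; the `checkpoint.*` key layout used in the usage note is an assumption for illustration.

```python
import copy


def apply_dotted_overrides(params: dict, overrides: dict) -> dict:
    """Return a deep copy of `params` with dotted-key overrides merged in."""
    merged = copy.deepcopy(params)  # never mutate the caller's defaults
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = merged
        for part in parents:
            child = node.get(part)
            if not isinstance(child, dict):
                # Create (or replace a scalar with) an intermediate dict.
                child = {}
                node[part] = child
            node = child
        node[leaf] = value
    return merged
```

For example, merging `{"checkpoint.num_checkpoints_read": 0}` into `{"checkpoint": {"num_checkpoints_write": 10, "num_checkpoints_read": 10}}` yields a copy whose read count is 0 while the original dict is left untouched.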

Note: PR #370 (crossmeta/zettalane) addresses the same root cause.
That PR is blocked pending CLA signature from @zettalane. This
implementation is carried independently; the two fixes are
functionally equivalent.

Fix — DLIOResultParser system info fallback

Problem: When a DLIO summary.json does not contain a system_info
block (e.g. runs from older DLIO versions), DLIOResultParser.parse()
returned None for ClusterInformation, breaking BenchmarkRun
validation.

Changes — mlpstorage_py/rules/models.py:

- DLIOResultParser.parse() now accepts an optional metadata kwarg
- When ClusterInformation.from_dlio_summary_json() returns None,
  fall back to metadata['cluster_information'] if present and
  reconstruct via ClusterInformation.from_dict()
- BenchmarkRun.__init__ passes the run's metadata object to
  parser.parse() to enable the fallback
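
The fallback order can be illustrated with a small stand-in model. `ClusterInfo`, `from_summary`, and `resolve_cluster_info` are hypothetical names, and the single `total_memory_bytes` field is an assumed layout chosen only because that attribute appears in the error message quoted earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterInfo:
    """Hypothetical stand-in for ClusterInformation."""
    total_memory_bytes: int

    @classmethod
    def from_summary(cls, summary: dict) -> Optional["ClusterInfo"]:
        # Older summaries may have no system_info block at all.
        block = summary.get("system_info")
        return cls(block["total_memory_bytes"]) if block else None

    @classmethod
    def from_dict(cls, data: dict) -> "ClusterInfo":
        return cls(data["total_memory_bytes"])


def resolve_cluster_info(summary: dict, metadata: Optional[dict] = None) -> Optional[ClusterInfo]:
    """Prefer the summary's system info; fall back to the run metadata."""
    info = ClusterInfo.from_summary(summary)
    if info is None and metadata and metadata.get("cluster_information"):
        info = ClusterInfo.from_dict(metadata["cluster_information"])
    return info
```

The summary remains the primary source; the metadata is consulted only when the summary yields nothing.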

Fix — 13 pre-existing test failures

mlpstorage_py/tests/test_cluster_collector.py (10 tests):

- All MPIClusterCollector(...) constructor calls and
  collect_cluster_info(...) call sites in failing tests were missing
  the now-required results_dir argument; added results_dir='/tmp'
  to all 10 affected call sites
- test_collector_returns_valid_data_without_error_marker: rewrote to
  use the current shared_staging_dir=tmpdir pattern instead of the
  obsolete UUID-based staging directory approach

mlpstorage_py/tests/test_rules.py (3 tests):

- TestBenchmarkRunSystemInfoFallback tests were failing with
  ValueError: No summary.json found in /tmp/test_run because they
  attempted real filesystem I/O
- Patched DLIOResultParser._load_summary and
  DLIOResultParser._load_hydra_configs to return in-memory mock data,
  removing the filesystem dependency

pyproject.toml / uv.lock

- Add [tool.uv] environments = ["sys_platform == 'linux'"] to
  pyproject.toml so uv lock does not attempt to resolve non-Linux
  platform markers (s3dlio only publishes Linux wheels)
- Regenerate uv.lock accordingly

dlio_benchmark Fixes (russfellows/dlio_benchmark — feat/parquet-dgen-streaming)

The following fixes are in the dlio_benchmark fork that is pinned by this
branch's pyproject.toml. They are already committed in the fork; issue #372
has an additional local change that is pending commit/push.

Fix #362 / #364 — Training stuck at epoch 1; Flux AU limited by CPU Parquet deserialization

Files: dlio_benchmark/reader/reader_factory.py,
dlio_benchmark/reader/parquet_reader_file_iterable.py (new),
dlio_benchmark/reader/parquet_reader_s3dlio.py
Commit: 1635b79 (feat: s3dlio-gen streaming, iterable dataloader, file iterable reader)

Issue #362 — Stuck at epoch 1, no NVMe reads:
reader_factory.py routed LOCAL_FS + Parquet to the legacy ParquetReader,
which calls pf.read_row_group() — full PyArrow deserialization on every read.
This is entirely CPU-bound and saturates the Python GIL, starving DLIO's
DataLoader workers of CPU time. Observed symptom: benchmark reaches
"Starting epoch 1" and then makes no measurable NVMe I/O while CPU pegs at
88-95%.

Issue #364 — Flux AU limited by per-process Parquet deserialization:
Same root cause. Even on a 192-vCPU Zen 4 machine, PyArrow's
read_row_group(use_threads=True) spawns additional decode threads per call.
Under DLIO's model (e.g. 4 MPI × 8 read_threads = 32 workers), hundreds of
threads contend on the GIL. AU on Skylake with data in tmpfs (zero I/O latency):
21% — storage is provably not the bottleneck; CPU decode is.

Fix: reader_factory.py now routes LOCAL_FS + Parquet to
ParquetReaderFileIterable — a new reader that performs raw byte-range reads
via a 64-thread ThreadPoolExecutor without any PyArrow decode. Data is
returned as raw bytes to the training loop. For S3/object storage, the s3dlio
Rust-based reader (ParquetReaderS3dlio) is used, which similarly bypasses
Python-side decode.

# Before (reader_factory.py):
# LOCAL_FS + Parquet → ParquetReader → pf.read_row_group() — full PyArrow decode

# After:
elif _args.storage_type in (StorageType.LOCAL_FS,):
    from dlio_benchmark.reader.parquet_reader_file_iterable import ParquetReaderFileIterable
    return ParquetReaderFileIterable(dataset_type, thread_index, epoch_number)
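
To illustrate what "raw byte-range reads via a thread pool without any decode" can look like, here is a minimal sketch. It is not the ParquetReaderFileIterable implementation: the function name `read_ranges` and the (offset, length) tuple format are assumptions, and only the 64-thread default is taken from the description above.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_ranges(path: str, ranges: list[tuple[int, int]], max_workers: int = 64) -> list[bytes]:
    """Read (offset, length) ranges from `path` concurrently and return raw bytes.

    No Parquet decoding happens here; callers receive the bytes as stored.
    A single os.pread may return fewer bytes than requested near end of file.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # os.pread takes an explicit offset, so threads can share one
            # descriptor without racing on a file position. The read itself
            # is a blocking system call, so the GIL is not held while it waits.
            return list(pool.map(lambda r: os.pread(fd, r[1], r[0]), ranges))
    finally:
        os.close(fd)
```

Results come back in the same order as the requested ranges because `pool.map` preserves input order.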

Result (from issue #364 testing, c6in.16xlarge, data on tmpfs):

| Accelerators | `use_threads` | AU | Throughput | Result |
|---|---|---|---|---|
| 4 | True (before) | 54.38% | ~77 MB/s | ❌ FAIL |
| 4 | False (workaround) | 99.79% | 141.80 MB/s | ✅ PASS |
| 8 | False (workaround) | 99.68% | 283.07 MB/s | ✅ PASS |

The ParquetReaderFileIterable path goes further (no decode at all), giving
even better scaling on older CPU generations (Skylake, Cascade Lake), where
Parquet decode is comparatively expensive.

Fix #371 — `--params storage.storage_type=direct_fs` silently uses page cache

File: dlio_benchmark/checkpointing/pytorch_checkpointing.py
Commit: present in fork on branch feat/parquet-dgen-streaming

Problem: After PR #359 renamed the Python package from mlpstorage →
mlpstorage_py, one import path in dlio_benchmark was missed:

# Before (bug — old package name):
try:
    from mlpstorage.checkpointing import StreamingCheckpointing as _SC  # always fails
except ImportError:
    from dlio_benchmark.checkpointing.simple_streaming_checkpointing import (
        SimpleStreamingCheckpointing as _SC,   # silently falls back here
    )

SimpleStreamingCheckpointing ignores the backend='direct_fs' argument
entirely and uses plain open(path, "wb"). The result: when a user passes
--params storage.storage_type=direct_fs, page cache is never bypassed.
This was confirmed with free -h showing page cache growing during the write
phase and Lustre client cache filling up on a Lustre-backed mount.

Fix (one line):

# After:
from mlpstorage_py.checkpointing import StreamingCheckpointing as _SC

This ensures direct_fs checkpointing correctly uses O_DIRECT via s3dlio's
direct:// URI scheme, bypassing the page cache as intended.
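
The bug above was hard to spot because the `except ImportError` fallback was silent. As a hedged illustration only (this helper is not part of the PR, and `import_first` is a hypothetical name), a fallback import can log a warning whenever the preferred module is unavailable:

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def import_first(candidates: list[str]):
    """Import the first available module from `candidates`, logging any fallback."""
    last_error = None
    for index, name in enumerate(candidates):
        try:
            module = importlib.import_module(name)
        except ImportError as exc:
            last_error = exc
            continue
        if index > 0:
            # Make the degraded path visible instead of silently accepting it.
            logger.warning("Preferred module %r unavailable; using %r (last error: %s)",
                           candidates[0], name, last_error)
        return module
    raise ImportError(f"None of {candidates} could be imported") from last_error
```

With a warning like this, a renamed package would show up in the logs on the first run rather than as unexplained page-cache growth.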

Fix #372 — 32 GB hard cap blocks large-memory runs

File: dlio_benchmark/utils/config.py
Status: Modified locally in russfellows/dlio_benchmark — pending commit/push

Problem: BUDGET_MB was hard-coded to 32 * 1024 (32 GB). On hosts with
more than 32 GB of RAM this cap artificially constrains the number of DataLoader
workers. The error manifests as:

Exception: Memory budget exceeded: reader.read_threads=2 x comm_size=64 = 128
worker processes, estimated ~64 GB (hard cap: 32 GB). Reduce reader.read_threads
to at most 1 for this run.

On a 377 GB host trying to run 64 accelerators × 2 read_threads, the cap
blocks any run larger than 32 B200 ranks × 2 threads (64 workers, an estimated
32 GB), limiting throughput to ~2.3 GB/s regardless of storage capability
(well below the 14 GB/s of a Gen5 NVMe drive).

Fix:

# Before:
BUDGET_MB = 32 * 1024  # 32 GB hard cap

# After (psutil must be imported in config.py):
import psutil
BUDGET_MB = psutil.virtual_memory().total // (1024 * 1024)  # actual host RAM, in MB

The budget now scales with actual installed RAM, which is the correct
upper bound for in-memory dataset caching.
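
The arithmetic behind the error message can be worked through as follows. The per-worker figure of 512 MB is an assumption inferred from the message itself (128 workers estimated at ~64 GB); the actual estimate used in config.py is not shown in this PR, and the function names are illustrative.

```python
PER_WORKER_MB = 512  # assumption: inferred from "128 workers ~ 64 GB" in the error text


def estimate_mb(read_threads: int, comm_size: int, per_worker_mb: int = PER_WORKER_MB) -> int:
    """Estimated memory for read_threads x comm_size worker processes, in MB."""
    return read_threads * comm_size * per_worker_mb


def max_read_threads(comm_size: int, budget_mb: int, per_worker_mb: int = PER_WORKER_MB) -> int:
    """Largest read_threads value whose estimate still fits within budget_mb."""
    return budget_mb // (comm_size * per_worker_mb)


old_budget_mb = 32 * 1024    # the former hard cap
host_budget_mb = 377 * 1024  # a 377 GB host, treated here as 377 * 1024 MB

print(estimate_mb(2, 64))                     # 65536 MB, i.e. ~64 GB
print(max_read_threads(64, old_budget_mb))    # 1
print(max_read_threads(64, host_budget_mb))   # 11
```

Under these assumptions the old cap allows only one read thread at comm_size=64, matching the "at most 1" advice in the exception, while the host-sized budget allows the requested two with room to spare.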

Issue #369 — `orte_init` failed: No permission (-17) (No code fix)

Problem: OpenMPI orte_init fails with getting local rank failed → Returned value No permission (-17). This occurs when MPI processes are
launched as root without passing --allow-run-as-root to mpirun, or
when running inside a container with restricted Linux namespaces that
prevent OpenMPI's process management layer from initializing.

Assessment: This is an environment and OpenMPI configuration issue,
not a bug in mlpstorage or dlio_benchmark. The fix is to add
--allow-run-as-root to the mpirun invocation, or to configure the
container/namespace permissions to allow OpenMPI's process manager. No
code change is warranted.

Test Results

Before (clean main):  112 passed, 13 failed
After  (this branch): 127 passed,  0 failed

The net gain of 15 passing tests breaks down as:

- +13 pre-existing failures fixed (test_cluster_collector: 10, test_rules: 3)
- +2 new regression tests added by PR #366 (TestCollectClusterInfoSignatureBinding)

- Fix all from/import statements: mlpstorage.X -> mlpstorage_py.X (33 py files)
- Fix all mock.patch() string paths: mlpstorage.X -> mlpstorage_py.X (~16 files)
- Replace 4 library-specific YAML configs with 1 workload-only s3_workload_unet3d.yaml (runtime params such as bucket, endpoint, storage_library belong in .env, not YAML)
- Add .env.example documenting all runtime parameters
- Update 22 shell scripts: pip/venv setup -> uv sync pattern
- Update tests/README.md: pip/venv -> uv, mlpstorage -> mlpstorage_py imports
- Update tests/object-store/README.md:
  - Replace 'cd mlp-storage && source .venv/bin/activate' with 'uv run python ...'
  - Update Library Selection section: YAML key -> runtime --param approach
  - Remove s3torchconnector from library selection table (keep historical results)
  - Update prerequisites: source .venv + source .env -> uv sync

Unit tests: 763 pass (previously 0 due to ModuleNotFoundError: mlpstorage)

- Extract --file/--object from add_universal_arguments into new add_storage_type_arguments() function; VectorDB/KVCache parsers no longer require it; training/checkpointing parsers call it
- Update training/checkpointing tests to pass --file in parse_args
- Wrap _collect_cluster_start/_collect_cluster_end with progress_context to show spinner during SSH/MPI collection
- Pass validate_structure=False to ReportGenerator in test fixtures that use empty temporary directories
- Change logger.error -> logger.warning for nonexistent results dir in get_runs_files; skip dirs with multiple metadata files
- Add _uri_for_filename alias to ParquetReaderS3Iterable

- Make --file/--object optional (required=False) so ALL benchmark parsers can carry the flag; VectorDB and KV-cache parsers now include it so the argument is available everywhere
- Fix progress.py: replace logger.status() (non-existent Logger method) with logger.info() in both progress_context and create_stage_progress non-interactive fallback paths
- Update tests to assert logger.info() instead of logger.status()

dlio_benchmark changes (local fork + installed venv):

- Replace broken \r-in-logger progress() with a Rich-based implementation using SpinnerColumn + BarColumn; falls back to plain stdout writes if Rich is unavailable

…rams

Reduce tests/object-store/ from 30+ files to 4 clean tests:

- run_training.sh: datagen + training via mlpstorage CLI
- run_checkpointing.sh: checkpoint write + read via dlio_benchmark
- test_s3lib_get_bench.py: GET throughput benchmark (updated)
- test_direct_write_comparison.py: native write/read benchmark (updated)

All runtime parameters (bucket, endpoint, storage library, credentials) now come exclusively from environment variables or .env; no hardcoded site-specific values remain in any test script or config file.

Changes:

- Archive 26 per-library scripts and result docs to old-archive/
- Archive 3 per-library checkpoint YAMLs to old-archive/
- Add configs/dlio/workload/llama3_8b_checkpoint.yaml: clean model-only YAML with all storage runtime params supplied via Hydra CLI overrides
- run_training.sh: BUCKET, STORAGE_LIBRARY, MODEL, NP all overridable
- run_checkpointing.sh: BUCKET, STORAGE_LIBRARY, NP, CHECKPOINTS all overridable
- test_s3lib_get_bench.py: use BUCKET env var (was hardcoded mlp-s3dlio); fail fast with clear error if bucket not set
- test_direct_write_comparison.py: use BUCKET env var as shared default; add validation error if required buckets not set
- Rewrite README.md: concise, accurate, uv-based instructions for all 4 tests

Unit tests: 905 passed, 4 skipped (no regressions)

…ore tests

- pyproject.toml: point dlio-benchmark at russfellows/dlio_benchmark@dev, which contains minio connection-pool fix and s3torchconnector bool fix
- uv.lock: regenerated after pyproject.toml change (resolved b1696e1)
- configs/dlio/workload: remove 17 library-specific YAML files (minio, s3dlio, s3torch variants); all storage params are now supplied via --params CLI overrides from .env; generic YAMLs remain
- configs/dlio/workload/*.yaml (4 files): remove spurious 'region' field
- tests/object-store/README.md: complete rewrite with accurate instructions
- tests/object-store/run_training.sh: add s3torchconnector support, spawn multiprocessing, disable checkpoint in training tests
- tests/object-store/run_checkpointing.sh: set NP=4, add s3torchconnector
- tests/object-store/run_datagen.sh: new helper script
- tests/object-store/run_cleanup.sh: new helper script
- tests/object-store/old-archive/: archive stale test utility files

…d parquet loading

Object storage (dlio.py):
- _apply_object_storage_params() now logs the .env file path it loads
- Raises FileNotFoundError with actionable message if --object mode finds no .env

Config (config.py):
- DEFAULT_RESULTS_DIR reads MLPERF_RESULTS_DIR env var, falls back to tempdir

Main (main.py):
- Add import os (was missing after tempdir warning addition)
- Warn at startup when results will be written to system temp dir

Checkpointing (streaming_checkpoint.py):
- IPC Queue/Event created from same multiprocessing context as child process
- Fixes SemLock fork/spawn mismatch on non-fork start methods

MPI (utils.py):
- Add --mca btl ^vader to single-host MPI flags to prevent VADER segfaults

Dependencies (pyproject.toml, uv.lock):
- s3dlio >= 0.9.95
- python-dotenv >= 1.0.0
- dlio-benchmark pinned to russfellows/dlio_benchmark feat/parquet-dgen-streaming

Security (.gitignore):
- Block .env.* credential files; keep .env.example

Unit tests (933 passing, 4 skipped):
- tests/unit/test_config.py: 4 tests for DEFAULT_RESULTS_DIR env-var / tempdir behavior
- tests/unit/test_main_warnings.py: 4 tests for tempdir warning in run_benchmark()
- tests/unit/test_dlio_object_storage.py: 20 tests for _apply_object_storage_params()
- tests/unit/test_parquet_reader.py: updated 7 tests for new dlio-benchmark API (cache stores int byte-count not Table; no LRU eviction; close() is no-op)

Docs:
- docs/OBJECT_STORAGE_GUIDE.md moved from .github/ to docs/
- README.md, docs/README.md, tests/README.md: cross-reference links updated

Benchmark results and analysis (new in tests/):
- tests/benchmarks/: bench_*.py scripts (concurrency, phases, put_bytes, rt_switch, write_sizes, zerocopy)
- tests/object-store/: NPZ analysis, RetinaNet bench results, s3ultra results, scaling analysis, multi-endpoint test
- tests/Checkpoint_test_results.md, DLRM_test_results.md, Flux_test_results.md
- tests/RetinaNet_test_results.md, Parquet_dataloading.md, TEST-PLAN-2026-04-25.md
- tests/DLIO-optimization-analysis-2026-04-25.md
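
The DEFAULT_RESULTS_DIR behaviour listed under Config (config.py) in this commit message can be sketched as below. Only the MLPERF_RESULTS_DIR variable name comes from the commit message; the function name, the returned flag, and the `mlperf_storage_results` sub-directory are assumptions for illustration.

```python
import os
import tempfile


def default_results_dir(environ=None) -> tuple[str, bool]:
    """Return (results_dir, is_temp_fallback).

    Uses MLPERF_RESULTS_DIR when set; otherwise falls back to a directory
    under the system temp dir so the caller can warn about it at startup.
    """
    env = os.environ if environ is None else environ
    configured = env.get("MLPERF_RESULTS_DIR")
    if configured:
        return configured, False
    # The sub-directory name is an illustrative assumption.
    return os.path.join(tempfile.gettempdir(), "mlperf_storage_results"), True
```

Returning the fallback flag alongside the path lets the startup code decide whether to emit the temp-directory warning.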

…iles

tests/unit/test_benchmarks_vectordb.py:
- Fix all patch() paths and inline imports (mlpstorage.* → mlpstorage_py.*)
- Add _validate_vdb_dependencies mock to all 14 tests that instantiate VectorDBBenchmark; that method runs in __init__ before verify_benchmark and raises DependencyError when optional packages (pymilvus, tabulate) are not installed in the base uv env

tests/unit/test_cli.py:
- Fix three import blocks (mlpstorage.cli, mlpstorage.cli_parser, mlpstorage.config → mlpstorage_py.*)
- Fix bare Namespace → argparse.Namespace in test_num_client_hosts_zero_is_preserved

All 15 previously-failing upstream tests now pass. Full suite: 949 passed, 4 skipped.

…nhancements Bug fixes and performance enhancements: object storage, checkpointing, Parquet loading

- flux_datagen.yaml: add use_s3dlio_gen: true, row_group_size: 48
- dlrm_b200.yaml: tune prefetch_size/read_threads for benchmark accuracy
- pyproject.toml: s3dlio>=0.9.100; dlio-benchmark from russfellows fork (feat/parquet-dgen-streaming); local s3dlio wheel NOTE comment
- tests/DLRM_test_results.md: direct DLIO benchmark reader comparison results
- docs/Flux_NP_ReadThreads_Scaling_Results.md: new -- NP in {1,2,4,8} x RT in {1,2,4,8} scaling sweep results, CPU threshold analysis, computation_time impact at 0.5s and 1.35s, samp/s/GPU column
- tests/object-store/: add bench/gen/run scripts for Flux and DLRM workloads
- .gitignore: ignore sweep_logs/, sweep_*.sh, sim_*.tsv*, results/

…ecture docs

1. DLRM workload config fixes (configs/dlio/workload/)

dlrm_b200.yaml, dlrm_datagen.yaml: Reduce num_samples_per_file from 4,718,592 to 1,536,000. 1,536,000 = 250 row groups x 6,144 rows/RG. This keeps the Parquet footer under the s3-ultra 4 MiB single-object GET limit. The previous value produced a footer exceeding 4 MiB, causing s3-ultra to reject the GET and fall back to a multi-part read, distorting latency. Also enables use_s3dlio_gen: true and aligns row_group_size to batch_size (6,144) for optimal row-group cache hit rate.

2. UNet3D B200 workload config (configs/dlio/workload/unet3d_b200.yaml)

New config for UNet3D benchmarking on B200-class hardware.
- computation_time: 0.162 s (H100 baseline / 2 for B200 throughput target)
- 7,200 NPZ files, ~140 MiB each, s3dlio storage library
- batch_size: 4, read_threads: 4

3. UNet3D NP sweep scripts (tests/object-store/)

sweep_unet3d_np.sh: Automated NP=1/2/4 scaling sweep for the UNet3D B200 workload. Each run writes results to results/unet3d_np_sweep/<timestamp>/. Appends a TSV summary row and auto-generates docs/UNet3D_NP_Scaling_Results.md at sweep completion. NP=8 excluded -- s3-ultra saturates at NP>=4.

gen_unet3d_npz.sh: Generates the 984 GiB UNet3D NPZ dataset on s3-ultra (mlp-unet3d bucket) using dlio_benchmark's NPZGenerator fast path (s3dlio generate_npz_bytes(), zero Python-side copies, hardware CRC32, Rayon parallel fill).

test_unet3d.sh: Single-run smoke test for the UNet3D B200 config (NP=1, 1 epoch).

4. DLRM sweep scripts (tests/object-store/)

sweep_dlrm_np.sh: NP=1/2/4 scaling sweep for DLRM Parquet workload.
sweep_dlrm_compute.sh: Compute-time sensitivity sweep for DLRM.

5. DataLoader architecture documentation (docs/)

docs/DATALOADER_ARCHITECTURE.md (new): Comprehensive reference covering two major topics:

Part 1 -- Map-style vs. iterable DataLoaders on S3: Why "iterable is better for large datasets" originates from HDD seek patterns and does not apply to object storage. The real argument for iterable is pipeline depth: TorchIterableDatasetSimple achieves 64 x num_workers in-flight GETs (vs 1 x num_workers with map-style). Covers TorchIterableDatasetSimple implementation mechanics, known limitations (per-epoch shuffle propagation, prefetch memory bounds, drop-last), and a summary comparison table.

Part 2 -- O_DIRECT on local NVMe (two independent paths): Why O_DIRECT is required for accurate NVMe benchmarking (page cache problem). Detailed description and comparison of both available paths:
- odirect: true -- Python os.open+os.readv, map-style, 1 read/worker
- storage_library: direct -- Rust/Tokio O_DIRECT, iterable, 64/worker

12-property comparison table. Guidance on using both paths together to isolate I/O concurrency depth and GIL contention as independent variables. Includes TOC with anchor links to all sections.

docs/UNet3D_NP_Scaling_Results.md (new): NP=1/2/4 benchmark results for UNet3D B200 on s3-ultra. Generated by sweep_unet3d_np.sh.

docs/DLRM_NP_Scaling_Results.md (new): NP=1/2/4 benchmark results for DLRM Parquet on s3-ultra.

docs/Flux_NP_ReadThreads_Scaling_Results.md (updated): Additional read_threads sweep results appended.

docs/README.md (updated):
- New "Where to Start" row: Benchmark NVMe with O_DIRECT pointing to DATALOADER_ARCHITECTURE.md#o_direct-local-storage-two-independent-paths
- DATALOADER_ARCHITECTURE.md entry expanded to summarise both parts (S3 iterable DataLoader and O_DIRECT NVMe paths) with anchor link.

6. pyproject.toml / uv.lock

Switch dlio-benchmark dependency from git branch reference to local editable path (../dlio_benchmark). Allows iterating on dlio_benchmark and mlp-storage together without tagging intermediate git commits. uv.lock updated accordingly.

7. .gitignore additions

Add patterns for runtime artifacts that should never be committed:
- hydra_log/ -- Hydra config output written to cwd during runs
- sweep_unet3d_*.log -- Timestamped sweep run logs written to repo root
- sweep_dlrm_*.log -- Timestamped sweep run logs written to repo root
- sweep_flux_*.log -- Timestamped sweep run logs written to repo root
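
Two of the figures quoted in this commit message can be checked with simple arithmetic: the DLRM row-group count and the UNet3D dataset size. The helper names below are illustrative, and the ~140 MiB file size is the approximate value stated in the message, so the dataset size is likewise approximate.

```python
def row_groups(num_samples: int, rows_per_group: int) -> tuple[int, int]:
    """Return (full row groups, leftover rows) for one Parquet file."""
    return divmod(num_samples, rows_per_group)


def dataset_gib(num_files: int, mib_per_file: float) -> float:
    """Total dataset size in GiB given a file count and per-file size in MiB."""
    return num_files * mib_per_file / 1024


# DLRM: 1,536,000 samples per file at 6,144 rows per row group.
print(row_groups(1_536_000, 6_144))  # (250, 0): exactly 250 row groups

# UNet3D: 7,200 NPZ files at roughly 140 MiB each.
print(dataset_gib(7_200, 140))       # 984.375, consistent with the quoted 984 GiB
```

Both results agree with the numbers in the message: the new sample count divides evenly into 250 row groups, and the file count and size reproduce the stated dataset size.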

uv.lock: bump s3dlio wheel to 0.9.100 (skip_head HEAD optimisation, PyDataset.from_uris(), items(), collect_batch())

- tests/object-store/test_retinanet.sh: end-to-end retinanet 3-epoch benchmark
- tests/object-store/gen_retinanet_jpeg.sh: generate retinanet JPEG dataset
- tests/object-store/sweep_retinanet_np.sh: sweep concurrency parameters for NP workload

…3dlio 0.9.100) Benchmark results from 2026-05-12 sweep on co-located 24 vCPU / 48 GB host. 50,000 JPEG files × ~315 KiB/file, 8 epochs, batch=24, read_threads=8. DataLoader: TorchIterableDatasetSimple + _s3_stream_next() pipelined chunking. dlio_benchmark commit: fc92d7f (feat/parquet-dgen-streaming).

pyproject.toml:
- dlio-benchmark: local editable -> GitHub rev 3667a0e (v3.0.2)
- s3dlio: local wheel source removed (now resolves from PyPI via >=0.9.100 pin)
- [tool.uv] environments = ['sys_platform == linux'] added (s3dlio Linux-only)

uv.lock:
- dlio-benchmark 3.0.1 -> 3.0.2 from russfellows/dlio_benchmark@3667a0e
- s3dlio 0.9.100 from local wheel -> pypi.org/simple
- mlpstorage 2.0.0b1 -> 3.0.2
- Removed colorama + tzdata (Windows-only, no longer resolved)

…ve historical analysis

Deleted from old-archive/ (31 files):
- All per-library dlio_minio_*.sh, dlio_s3dlio_*.sh, dlio_s3torch_*.sh (superseded by unified run_datagen/training/checkpointing/cleanup.sh)
- demo_streaming_checkpoint.sh, test_minio_checkpoint.py, test_s3dlio_checkpoint.py, test_s3torch_checkpoint.py (superseded by run_checkpointing.sh)
- test_dlio_direct_s3dlio.sh, test_dlio_multilib_demo.py, test_mlp_minio/s3dlio/s3torch.sh, test_s3dlio_multilib.sh, test_training_mpi_sweep.py (superseded by sweep_*.sh)
- llama3_8b_checkpoint_*.yaml (configs now in configs/dlio/)
- dlio_mpi_object_results.md, Object_Perf_Results.md, s3dlio_performance_analysis.md (stale; issues since resolved)

Moved from top-level to old-archive/ (historical reference):
- bench_npz_build.py, bench_parquet_rg_flux.py, bench_wholefile_get.py
- bench-results-retinanet-20260425.md

Remaining old-archive/ contains 10 reference files:
- test_direct_write_comparison.py, test_s3dlio_direct.py, test_s3dlio_formats.py/.sh, test_s3lib_get_bench.py, S3library_review_21-Mar.md (library API/concurrency reference)
- bench_npz_build.py, bench_parquet_rg_flux.py, bench_wholefile_get.py (historical optimization analysis)
- bench-results-retinanet-20260425.md (historical benchmark results)

…ts, add sweeps/

Deleted:
- test_dlrm.sh, test_flux.sh: redundant one-liners; run_dlrm_bench.sh and run_flux_bench.sh are the proper scripts (full result parsing, env handling)
- gen_flux_parquet.py: non-standard one-off that bypassed mlpstorage datagen; confusing next to the .sh generators; can be replaced with gen_flux_parquet.sh

Moved to old-archive/ (Apr-27, ~16 days old, superseded):
- run_datagen.sh, run_training.sh: generic multi-model wrappers replaced by model-specific run_*_bench.sh scripts
- test_multi_endpoint_s3dlio.py: demo script, not a test

New sweeps/ subdirectory:
- sweep_dlrm_compute.sh, sweep_dlrm_np.sh, sweep_flux.sh, sweep_retinanet_np.sh, sweep_unet3d_np.sh

Also removed sweep_flux.sh from .gitignore (it was excluded as a scratch script; now tracked properly under sweeps/)

Replace old run_datagen/run_training-centric docs with:
- Structure diagram showing 4 model types × 1 generator + 1 benchmark each
- Quick Start showing the 3-command flow per model
- Table mapping model → format → generator → benchmark script
- Updated Archived Tests section listing what's in old-archive/

Removed: detailed parameter tables for run_datagen.sh and run_training.sh (both scripts moved to old-archive in previous commit)

Deleted (superseded by May 12 sweep results in docs/):
- tests/object-store/NPZ-OPTIMIZATION-ANALYSIS.md (bug now fixed, stale)
- tests/object-store/scaling-analysis-2026-04-25.md (s3dlio v0.9.86 era)
- tests/object-store/s3ultra-test-results-20260425.md (s3dlio v0.9.86 era)

README.md: added Performance Results section linking to current docs/:
- docs/DLRM_NP_Scaling_Results.md
- docs/Flux_NP_ReadThreads_Scaling_Results.md
- docs/RetinaNet_NP_Scaling_Results.md
- docs/UNet3D_NP_Scaling_Results.md

…commands reports/history/lockfile subparsers do not call add_storage_type_arguments(), so their Namespace has no .file or .object attribute. The unconditional read and delete in parse_arguments() crashed with AttributeError. Gate the consolidation on attribute presence; downstream code already uses getattr(args, 'data_access_protocol', None). Fixes mlcommons#367 Signed-off-by: Devasena Inupakutika <devasena.i@samsung.com>

Signed-off-by: Devasena Inupakutika <devasena.i@samsung.com>

… suite fixes, env lock

Fix mlcommons#365: apply CLI override_parameters into metadata.json parameters

Add _apply_dotted_overrides() static method to Benchmark base class. At metadata serialization time, dotted-key CLI overrides are merged into the nested parameters dict so the submission checker sees the effective config (e.g. split-phase num_checkpoints_write/read). override_parameters is still emitted unchanged for full audit trail. This addresses the same root cause as PR mlcommons#370 (crossmeta/zettalane); that PR is pending CLA so this implementation is carried here independently.

Fix rules/models.py: system info fallback in DLIOResultParser

When a DLIO summary.json lacks system_info, fall back to cluster_information from the run metadata dict. Fixes the TestBenchmarkRunSystemInfoFallback test class (3 tests).

Fix test suite: resolve 13 pre-existing test failures

test_cluster_collector.py: add missing results_dir argument to all MPIClusterCollector constructor and collect_cluster_info() call sites (10 tests). Update test_collector_returns_valid_data_without_error_marker to use current shared_staging_dir=tmpdir pattern.

test_rules.py: patch DLIOResultParser._load_summary and _load_hydra_configs in TestBenchmarkRunSystemInfoFallback tests so they use in-memory mock data instead of hitting /tmp/test_run (3 tests).

All 127 tests now pass (125 pre-existing + 2 added by PR mlcommons#366).

pyproject.toml/uv.lock: pin uv environments to Linux

s3dlio only publishes Linux wheels; lock the uv environment selector to sys_platform == 'linux' so cross-platform lock generation does not fail.

Co-authored-by: Devasena Inupakutika <devasena.i@samsung.com>

…nhancements Branch 3-0-2/bug fixes perf enhancements

…hanges

….S3DLIO not in installed package)

github-actions · 2026-05-13T16:32:18Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

russfellows · 2026-05-13T16:33:52Z

Closing — opened in error, not ready for upstream review.

russfellows and others added 28 commits April 27, 2026 15:37

fix: switch dlio-benchmark ref from deleted dev branch to main

aa8de4b

chore: update uv.lock to dlio_benchmark f58903c (PRs #9 and #10)

217ac6e

Merge pull request #28 from russfellows/branch-3-0-1/bug-fixes-perf-e…

64165f7

…nhancements Bug fixes and performance enhancements: object storage, checkpointing, Parquet loading

chore: bump version to 3.0.2

b1dc6e0

docs: add Recommended Hardware section to tests/object-store/README.md

08eb039

Remove unwanted file

03765a2

Fix mlcommons#363: pass results_dir to collect_cluster_info

7e4245b

Signed-off-by: Devasena Inupakutika <devasena.i@samsung.com>

Merge pull request #29 from russfellows/branch-3-0-2/bug-fixes-perf-e…

4534ae4

…nhancements Branch 3-0-2/bug fixes perf enhancements

chore: merge upstream main (39e657d) — our code supersedes upstream c…

fa55107

…hanges

fix: exclude test_dlio_storage.py from pytest collection (StorageType…

3a5195e

….S3DLIO not in installed package)

russfellows requested a review from a team May 13, 2026 16:32

russfellows closed this May 13, 2026

github-actions Bot locked and limited conversation to collaborators May 13, 2026


russfellows wants to merge 28 commits into mlcommons:main from russfellows:main
