Bug fixes and performance enhancements: object storage, checkpointing, Parquet loading #374
Closed
russfellows wants to merge 28 commits into
- Fix all from/import statements: mlpstorage.X -> mlpstorage_py.X (33 py files)
- Fix all mock.patch() string paths: mlpstorage.X -> mlpstorage_py.X (~16 files)
- Replace 4 library-specific YAML configs with 1 workload-only s3_workload_unet3d.yaml (runtime params such as bucket, endpoint, storage_library belong in .env, not YAML)
- Add .env.example documenting all runtime parameters
- Update 22 shell scripts: pip/venv setup -> uv sync pattern
- Update tests/README.md: pip/venv -> uv, mlpstorage -> mlpstorage_py imports
- Update tests/object-store/README.md:
  - Replace 'cd mlp-storage && source .venv/bin/activate' with 'uv run python ...'
  - Update Library Selection section: YAML key -> runtime --param approach
  - Remove s3torchconnector from library selection table (keep historical results)
  - Update prerequisites: source .venv + source .env -> uv sync

Unit tests: 763 pass (previously 0 due to ModuleNotFoundError: mlpstorage)
- Extract --file/--object from add_universal_arguments into new add_storage_type_arguments() function; VectorDB/KVCache parsers no longer require it; training/checkpointing parsers call it
- Update training/checkpointing tests to pass --file in parse_args
- Wrap _collect_cluster_start/_collect_cluster_end with progress_context to show spinner during SSH/MPI collection
- Pass validate_structure=False to ReportGenerator in test fixtures that use empty temporary directories
- Change logger.error -> logger.warning for nonexistent results dir in get_runs_files; skip dirs with multiple metadata files
- Add _uri_for_filename alias to ParquetReaderS3Iterable
- Make --file/--object optional (required=False) so ALL benchmark parsers can carry the flag; VectorDB and KV-cache parsers now include it so the argument is available everywhere
- Fix progress.py: replace logger.status() (non-existent Logger method) with logger.info() in both progress_context and create_stage_progress non-interactive fallback paths
- Update tests to assert logger.info() instead of logger.status()

dlio_benchmark changes (local fork + installed venv):
- Replace broken \r-in-logger progress() with a Rich-based implementation using SpinnerColumn + BarColumn; falls back to plain stdout writes if Rich is unavailable
…rams

Reduce tests/object-store/ from 30+ files to 4 clean tests:
- run_training.sh -- datagen + training via mlpstorage CLI
- run_checkpointing.sh -- checkpoint write + read via dlio_benchmark
- test_s3lib_get_bench.py -- GET throughput benchmark (updated)
- test_direct_write_comparison.py -- native write/read benchmark (updated)

All runtime parameters (bucket, endpoint, storage library, credentials) now come exclusively from environment variables or .env; no hardcoded site-specific values remain in any test script or config file.

Changes:
- Archive 26 per-library scripts and result docs to old-archive/
- Archive 3 per-library checkpoint YAMLs to old-archive/
- Add configs/dlio/workload/llama3_8b_checkpoint.yaml: clean model-only YAML with all storage runtime params supplied via Hydra CLI overrides
- run_training.sh: BUCKET, STORAGE_LIBRARY, MODEL, NP all overridable
- run_checkpointing.sh: BUCKET, STORAGE_LIBRARY, NP, CHECKPOINTS all overridable
- test_s3lib_get_bench.py: use BUCKET env var (was hardcoded mlp-s3dlio); fail fast with clear error if bucket not set
- test_direct_write_comparison.py: use BUCKET env var as shared default; add validation error if required buckets not set
- Rewrite README.md: concise, accurate, uv-based instructions for all 4 tests

Unit tests: 905 passed, 4 skipped (no regressions)
…ore tests

- pyproject.toml: point dlio-benchmark at russfellows/dlio_benchmark@dev, which contains minio connection-pool fix and s3torchconnector bool fix
- uv.lock: regenerated after pyproject.toml change (resolved b1696e1)
- configs/dlio/workload: remove 17 library-specific YAML files (minio, s3dlio, s3torch variants); all storage params are now supplied via --params CLI overrides from .env; generic YAMLs remain
- configs/dlio/workload/*.yaml (4 files): remove spurious 'region' field
- tests/object-store/README.md: complete rewrite with accurate instructions
- tests/object-store/run_training.sh: add s3torchconnector support, spawn multiprocessing, disable checkpoint in training tests
- tests/object-store/run_checkpointing.sh: set NP=4, add s3torchconnector
- tests/object-store/run_datagen.sh: new helper script
- tests/object-store/run_cleanup.sh: new helper script
- tests/object-store/old-archive/: archive stale test utility files
…d parquet loading

Object storage (dlio.py):
- _apply_object_storage_params() now logs the .env file path it loads
- Raises FileNotFoundError with actionable message if --object mode finds no .env

Config (config.py):
- DEFAULT_RESULTS_DIR reads MLPERF_RESULTS_DIR env var, falls back to tempdir

Main (main.py):
- Add import os (was missing after tempdir warning addition)
- Warn at startup when results will be written to system temp dir

Checkpointing (streaming_checkpoint.py):
- IPC Queue/Event created from same multiprocessing context as child process
- Fixes SemLock fork/spawn mismatch on non-fork start methods

MPI (utils.py):
- Add --mca btl ^vader to single-host MPI flags to prevent VADER segfaults

Dependencies (pyproject.toml, uv.lock):
- s3dlio >= 0.9.95
- python-dotenv >= 1.0.0
- dlio-benchmark pinned to russfellows/dlio_benchmark feat/parquet-dgen-streaming

Security (.gitignore):
- Block .env.* credential files; keep .env.example

Unit tests (933 passing, 4 skipped):
- tests/unit/test_config.py: 4 tests for DEFAULT_RESULTS_DIR env-var / tempdir behavior
- tests/unit/test_main_warnings.py: 4 tests for tempdir warning in run_benchmark()
- tests/unit/test_dlio_object_storage.py: 20 tests for _apply_object_storage_params()
- tests/unit/test_parquet_reader.py: updated 7 tests for new dlio-benchmark API (cache stores int byte-count not Table; no LRU eviction; close() is no-op)

Docs:
- docs/OBJECT_STORAGE_GUIDE.md moved from .github/ to docs/
- README.md, docs/README.md, tests/README.md: cross-reference links updated

Benchmark results and analysis (new in tests/):
- tests/benchmarks/: bench_*.py scripts (concurrency, phases, put_bytes, rt_switch, write_sizes, zerocopy)
- tests/object-store/: NPZ analysis, RetinaNet bench results, s3ultra results, scaling analysis, multi-endpoint test
- tests/Checkpoint_test_results.md, DLRM_test_results.md, Flux_test_results.md
- tests/RetinaNet_test_results.md, Parquet_dataloading.md, TEST-PLAN-2026-04-25.md
- tests/DLIO-optimization-analysis-2026-04-25.md
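The results-directory behaviour described in the Config and Main items above can be sketched roughly as follows. This is a minimal illustration, not the code from config.py or main.py: only the MLPERF_RESULTS_DIR variable name comes from the commit message, while the function names, the fallback subdirectory name, and the warning text are assumptions.

```python
import os
import tempfile
import warnings

ENV_VAR = "MLPERF_RESULTS_DIR"  # variable name taken from the commit message


def resolve_results_dir(environ=None):
    """Return (path, used_fallback) for the results directory.

    Hypothetical helper: prefers the environment variable and falls back
    to a directory under the system temp dir when it is unset or empty.
    """
    environ = os.environ if environ is None else environ
    configured = environ.get(ENV_VAR, "").strip()
    if configured:
        return configured, False
    # The subdirectory name is an assumption made for this sketch.
    return os.path.join(tempfile.gettempdir(), "mlperf_storage_results"), True


def warn_if_tempdir(path, used_fallback):
    """Emit a startup warning when results land in the system temp dir."""
    if used_fallback:
        warnings.warn(
            f"Results will be written to the system temp dir ({path}); "
            f"set {ENV_VAR} to keep them.",
            stacklevel=2,
        )
```

Passing the environment as a mapping keeps the lookup easy to unit-test without mutating os.environ.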
…iles

tests/unit/test_benchmarks_vectordb.py:
- Fix all patch() paths and inline imports (mlpstorage.* → mlpstorage_py.*)
- Add _validate_vdb_dependencies mock to all 14 tests that instantiate VectorDBBenchmark; that method runs in __init__ before verify_benchmark and raises DependencyError when optional packages (pymilvus, tabulate) are not installed in the base uv env

tests/unit/test_cli.py:
- Fix three import blocks (mlpstorage.cli, mlpstorage.cli_parser, mlpstorage.config → mlpstorage_py.*)
- Fix bare Namespace → argparse.Namespace in test_num_client_hosts_zero_is_preserved

All 15 previously-failing upstream tests now pass. Full suite: 949 passed, 4 skipped.
…nhancements Bug fixes and performance enhancements: object storage, checkpointing, Parquet loading
- flux_datagen.yaml: add use_s3dlio_gen: true, row_group_size: 48
- dlrm_b200.yaml: tune prefetch_size/read_threads for benchmark accuracy
- pyproject.toml: s3dlio>=0.9.100; dlio-benchmark from russfellows fork
(feat/parquet-dgen-streaming); local s3dlio wheel NOTE comment
- tests/DLRM_test_results.md: direct DLIO benchmark reader comparison results
- docs/Flux_NP_ReadThreads_Scaling_Results.md: new -- NP in {1,2,4,8} x
RT in {1,2,4,8} scaling sweep results, CPU threshold analysis,
computation_time impact at 0.5s and 1.35s, samp/s/GPU column
- tests/object-store/: add bench/gen/run scripts for Flux and DLRM workloads
- .gitignore: ignore sweep_logs/, sweep_*.sh, sim_*.tsv*, results/
…ecture docs
── 1. DLRM workload config fixes (configs/dlio/workload/) ───────────────
dlrm_b200.yaml, dlrm_datagen.yaml:
Reduce num_samples_per_file from 4,718,592 to 1,536,000.
1,536,000 = 250 row groups x 6,144 rows/RG. This keeps the Parquet
footer under the s3-ultra 4 MiB single-object GET limit. The previous
value produced a footer exceeding 4 MiB, causing s3-ultra to reject
the GET and fall back to a multi-part read, distorting latency.
Also enables use_s3dlio_gen: true and aligns row_group_size to
batch_size (6,144) for optimal row-group cache hit rate.
── 2. UNet3D B200 workload config (configs/dlio/workload/unet3d_b200.yaml) ─
New config for UNet3D benchmarking on B200-class hardware.
- computation_time: 0.162 s (H100 baseline / 2 for B200 throughput target)
- 7,200 NPZ files, ~140 MiB each, s3dlio storage library
- batch_size: 4, read_threads: 4
── 3. UNet3D NP sweep scripts (tests/object-store/) ─────────────────────
sweep_unet3d_np.sh:
Automated NP=1/2/4 scaling sweep for the UNet3D B200 workload.
Each run writes results to results/unet3d_np_sweep/<timestamp>/.
Appends a TSV summary row and auto-generates docs/UNet3D_NP_Scaling_Results.md
at sweep completion. NP=8 excluded -- s3-ultra saturates at NP>=4.
gen_unet3d_npz.sh:
Generates the 984 GiB UNet3D NPZ dataset on s3-ultra (mlp-unet3d bucket)
using dlio_benchmark's NPZGenerator fast path (s3dlio generate_npz_bytes(),
zero Python-side copies, hardware CRC32, Rayon parallel fill).
test_unet3d.sh:
Single-run smoke test for the UNet3D B200 config (NP=1, 1 epoch).
── 4. DLRM sweep scripts (tests/object-store/) ──────────────────────────
sweep_dlrm_np.sh: NP=1/2/4 scaling sweep for DLRM Parquet workload.
sweep_dlrm_compute.sh: Compute-time sensitivity sweep for DLRM.
── 5. DataLoader architecture documentation (docs/) ─────────────────────
docs/DATALOADER_ARCHITECTURE.md (new):
Comprehensive reference covering two major topics:
Part 1 -- Map-style vs. iterable DataLoaders on S3:
Why "iterable is better for large datasets" originates from HDD seek
patterns and does not apply to object storage. The real argument for
iterable is pipeline depth: TorchIterableDatasetSimple achieves
64 x num_workers in-flight GETs (vs 1 x num_workers with map-style).
Covers TorchIterableDatasetSimple implementation mechanics, known
limitations (per-epoch shuffle propagation, prefetch memory bounds,
drop-last), and a summary comparison table.
Part 2 -- O_DIRECT on local NVMe (two independent paths):
Why O_DIRECT is required for accurate NVMe benchmarking (page cache
problem). Detailed description and comparison of both available paths:
- odirect: true -- Python os.open+os.readv, map-style, 1 read/worker
- storage_library: direct -- Rust/Tokio O_DIRECT, iterable, 64/worker
12-property comparison table. Guidance on using both paths together
to isolate I/O concurrency depth and GIL contention as independent
variables. Includes TOC with anchor links to all sections.
docs/UNet3D_NP_Scaling_Results.md (new):
NP=1/2/4 benchmark results for UNet3D B200 on s3-ultra.
Generated by sweep_unet3d_np.sh.
docs/DLRM_NP_Scaling_Results.md (new):
NP=1/2/4 benchmark results for DLRM Parquet on s3-ultra.
docs/Flux_NP_ReadThreads_Scaling_Results.md (updated):
Additional read_threads sweep results appended.
docs/README.md (updated):
- New "Where to Start" row: Benchmark NVMe with O_DIRECT pointing to
DATALOADER_ARCHITECTURE.md#o_direct-local-storage-two-independent-paths
- DATALOADER_ARCHITECTURE.md entry expanded to summarise both parts
(S3 iterable DataLoader and O_DIRECT NVMe paths) with anchor link.
── 6. pyproject.toml / uv.lock ──────────────────────────────────────────
Switch dlio-benchmark dependency from git branch reference to local
editable path (../dlio_benchmark). Allows iterating on dlio_benchmark
and mlp-storage together without tagging intermediate git commits.
uv.lock updated accordingly.
── 7. .gitignore additions ──────────────────────────────────────────────
Add patterns for runtime artifacts that should never be committed:
hydra_log/ -- Hydra config output written to cwd during runs
sweep_unet3d_*.log -- Timestamped sweep run logs written to repo root
sweep_dlrm_*.log -- Timestamped sweep run logs written to repo root
sweep_flux_*.log -- Timestamped sweep run logs written to repo root
uv.lock: bump s3dlio wheel to 0.9.100 (skip_head HEAD optimisation, PyDataset.from_uris(), items(), collect_batch())
tests/object-store/test_retinanet.sh: end-to-end retinanet 3-epoch benchmark
tests/object-store/gen_retinanet_jpeg.sh: generate retinanet JPEG dataset
tests/object-store/sweep_retinanet_np.sh: sweep concurrency parameters for NP workload
…3dlio 0.9.100)

Benchmark results from 2026-05-12 sweep on co-located 24 vCPU / 48 GB host.
50,000 JPEG files × ~315 KiB/file, 8 epochs, batch=24, read_threads=8.
DataLoader: TorchIterableDatasetSimple + _s3_stream_next() pipelined chunking.
dlio_benchmark commit: fc92d7f (feat/parquet-dgen-streaming).
pyproject.toml:
- dlio-benchmark: local editable -> GitHub rev 3667a0e (v3.0.2)
- s3dlio: local wheel source removed (now resolves from PyPI via >=0.9.100 pin)
- [tool.uv] environments = ['sys_platform == linux'] added (s3dlio Linux-only)

uv.lock:
- dlio-benchmark 3.0.1 -> 3.0.2 from russfellows/dlio_benchmark@3667a0e
- s3dlio 0.9.100 from local wheel -> pypi.org/simple
- mlpstorage 2.0.0b1 -> 3.0.2
- Removed colorama + tzdata (Windows-only, no longer resolved)
…ve historical analysis

Deleted from old-archive/ (31 files):
- All per-library dlio_minio_*.sh, dlio_s3dlio_*.sh, dlio_s3torch_*.sh (superseded by unified run_datagen/training/checkpointing/cleanup.sh)
- demo_streaming_checkpoint.sh, test_minio_checkpoint.py, test_s3dlio_checkpoint.py, test_s3torch_checkpoint.py (superseded by run_checkpointing.sh)
- test_dlio_direct_s3dlio.sh, test_dlio_multilib_demo.py, test_mlp_minio/s3dlio/s3torch.sh, test_s3dlio_multilib.sh, test_training_mpi_sweep.py (superseded by sweep_*.sh)
- llama3_8b_checkpoint_*.yaml (configs now in configs/dlio/)
- dlio_mpi_object_results.md, Object_Perf_Results.md, s3dlio_performance_analysis.md (stale; issues since resolved)

Moved from top-level to old-archive/ (historical reference):
- bench_npz_build.py, bench_parquet_rg_flux.py, bench_wholefile_get.py
- bench-results-retinanet-20260425.md

Remaining old-archive/ contains 10 reference files:
- test_direct_write_comparison.py, test_s3dlio_direct.py, test_s3dlio_formats.py/.sh, test_s3lib_get_bench.py, S3library_review_21-Mar.md (library API/concurrency reference)
- bench_npz_build.py, bench_parquet_rg_flux.py, bench_wholefile_get.py (historical optimization analysis)
- bench-results-retinanet-20260425.md (historical benchmark results)
…ts, add sweeps/

Deleted:
- test_dlrm.sh, test_flux.sh -- redundant one-liners; run_dlrm_bench.sh and run_flux_bench.sh are the proper scripts (full result parsing, env handling)
- gen_flux_parquet.py -- non-standard one-off that bypassed mlpstorage datagen; confusing next to the .sh generators; can be replaced with gen_flux_parquet.sh

Moved to old-archive/ (Apr-27, ~16 days old, superseded):
- run_datagen.sh, run_training.sh -- generic multi-model wrappers replaced by model-specific run_*_bench.sh scripts
- test_multi_endpoint_s3dlio.py -- demo script, not a test

New sweeps/ subdirectory:
- sweep_dlrm_compute.sh, sweep_dlrm_np.sh, sweep_flux.sh, sweep_retinanet_np.sh, sweep_unet3d_np.sh

Also removed sweep_flux.sh from .gitignore (it was excluded as a scratch script; now tracked properly under sweeps/)
Replace old run_datagen/run_training-centric docs with:
- Structure diagram showing 4 model types × 1 generator + 1 benchmark each
- Quick Start showing the 3-command flow per model
- Table mapping model → format → generator → benchmark script
- Updated Archived Tests section listing what's in old-archive/

Removed: detailed parameter tables for run_datagen.sh and run_training.sh (both scripts moved to old-archive in previous commit)
Deleted (superseded by May 12 sweep results in docs/):
- tests/object-store/NPZ-OPTIMIZATION-ANALYSIS.md (bug now fixed, stale)
- tests/object-store/scaling-analysis-2026-04-25.md (s3dlio v0.9.86 era)
- tests/object-store/s3ultra-test-results-20260425.md (s3dlio v0.9.86 era)

README.md: added Performance Results section linking to current docs/:
- docs/DLRM_NP_Scaling_Results.md
- docs/Flux_NP_ReadThreads_Scaling_Results.md
- docs/RetinaNet_NP_Scaling_Results.md
- docs/UNet3D_NP_Scaling_Results.md
…commands

reports/history/lockfile subparsers do not call add_storage_type_arguments(), so their Namespace has no .file or .object attribute. The unconditional read and delete in parse_arguments() crashed with AttributeError. Gate the consolidation on attribute presence; downstream code already uses getattr(args, 'data_access_protocol', None).

Fixes mlcommons#367

Signed-off-by: Devasena Inupakutika <devasena.i@samsung.com>
Signed-off-by: Devasena Inupakutika <devasena.i@samsung.com>
… suite fixes, env lock

Fix mlcommons#365: apply CLI override_parameters into metadata.json parameters
Add _apply_dotted_overrides() static method to Benchmark base class. At metadata serialization time, dotted-key CLI overrides are merged into the nested parameters dict so the submission checker sees the effective config (e.g. split-phase num_checkpoints_write/read). override_parameters is still emitted unchanged for full audit trail. This addresses the same root cause as PR mlcommons#370 (crossmeta/zettalane); that PR is pending CLA so this implementation is carried here independently.

Fix rules/models.py: system info fallback in DLIOResultParser
When a DLIO summary.json lacks system_info, fall back to cluster_information from the run metadata dict. Fixes the TestBenchmarkRunSystemInfoFallback test class (3 tests).

Fix test suite: resolve 13 pre-existing test failures
test_cluster_collector.py: add missing results_dir argument to all MPIClusterCollector constructor and collect_cluster_info() call sites (10 tests). Update test_collector_returns_valid_data_without_error_marker to use current shared_staging_dir=tmpdir pattern.
test_rules.py: patch DLIOResultParser._load_summary and _load_hydra_configs in TestBenchmarkRunSystemInfoFallback tests so they use in-memory mock data instead of hitting /tmp/test_run (3 tests).
All 127 tests now pass (125 pre-existing + 2 added by PR mlcommons#366).

pyproject.toml/uv.lock: pin uv environments to Linux
s3dlio only publishes Linux wheels; lock the uv environment selector to sys_platform == 'linux' so cross-platform lock generation does not fail.

Co-authored-by: Devasena Inupakutika <devasena.i@samsung.com>
…nhancements Branch 3-0-2/bug fixes perf enhancements
….S3DLIO not in installed package)
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Comment from the PR author (Contributor): Closing; opened in error, not ready for upstream review.
PR Summary: branch-3-0-2/bug-fixes-perf-enhancements
Branch: branch-3-0-2/bug-fixes-perf-enhancements
Base: main (mlcommons/storage)
Date: May 13, 2026
Tests: 127 passed, 0 failed (was 112 passed, 13 failed on clean main)
Issues Addressed
Of the 7 most recent open issues on mlcommons/storage, 6 are fixed by this branch.
Issue #369 was determined to be an environment/OpenMPI configuration problem with
no code fix applicable.
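- #362: Training stuck at epoch 1 (dlio_benchmark: reader_factory.py)
- #363: collect_cluster_info() missing required results_dir (benchmarks/base.py)
- #364: Flux AU limited by CPU Parquet deserialization (dlio_benchmark: reader_factory.py + s3dlio)
- #365: CLI override_parameters not reflected in metadata.json (benchmarks/base.py)
- #367: reportgen crashes with AttributeError on Namespace.file (cli_parser.py)
- #369: orte_init failed, No permission (-17) (no code fix)
- #371: --params storage.storage_type=direct_fs silently uses page cache (dlio_benchmark: pytorch_checkpointing.py)
- #372: 32 GB hard cap blocks large-memory runs (dlio_benchmark: utils/config.py)
Commit History (above main)
Commit 1: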
022820b
Author: Devasena Inupakutika <devasena.i@samsung.com>
Message: cli_parser: guard --file/--object consolidation for non-benchmark subcommands
Fixes: #367
Cherry-picked from: PR #368
Problem: The reportgen, history, and lockfile subcommands do not call add_storage_type_arguments(), so their Namespace objects have no .file or .object attribute. The unconditional read and del in parse_arguments() crashed with AttributeError.
Changes (mlpstorage_py/cli_parser.py):
- Wrap the --file/--object consolidation block with if hasattr(parsed_args, "file") or hasattr(parsed_args, "object"):
- Use getattr(parsed_args, "file", False) instead of direct attribute access
- Replace del parsed_args.file / del parsed_args.object with a for _attr in ("file", "object"): if hasattr(...): delattr(...) loop so neither attribute is required to be present
Also includes new unit tests in tests/unit/test_cli.py covering the parser behaviour for all subcommand types.
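The guard described for Commit 1 can be sketched as below. This is an illustration under assumptions, not the code in cli_parser.py: the function name is hypothetical, and the "file" / "object" values written to data_access_protocol are guesses (only the attribute name itself appears in the commit message).

```python
import argparse


def consolidate_storage_type(parsed_args):
    """Fold --file/--object into one attribute, if the subparser defined them.

    Hypothetical helper: subcommands that never registered the flags pass
    through untouched instead of raising AttributeError.
    """
    if hasattr(parsed_args, "file") or hasattr(parsed_args, "object"):
        use_file = getattr(parsed_args, "file", False)
        use_object = getattr(parsed_args, "object", False)
        if use_file:
            parsed_args.data_access_protocol = "file"  # assumed value
        elif use_object:
            parsed_args.data_access_protocol = "object"  # assumed value
        # Remove whichever flags exist; neither is required to be present.
        for _attr in ("file", "object"):
            if hasattr(parsed_args, _attr):
                delattr(parsed_args, _attr)
    return parsed_args
```

With this shape, a Namespace from a non-benchmark subcommand simply has no data_access_protocol attribute, which matches downstream reads via getattr(args, 'data_access_protocol', None).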
Commit 2: 03765a2
Author: Devasena Inupakutika <devasena.i@samsung.com>
Message: Remove unwanted file
Cherry-picked from: PR #368
Removes a requirements.txt that was accidentally included in the previous commit.
Commit 3: 7e4245b
Author: Devasena Inupakutika <devasena.i@samsung.com>
Message: Fix #363: pass results_dir to collect_cluster_info
Fixes: #363
Cherry-picked from: PR #366
Problem: Benchmark._collect_cluster_information() called collect_cluster_info() without the required positional argument results_dir. This caused a TypeError at runtime. The missing cluster info then propagated as None into reportgen, causing a downstream crash.
Changes (mlpstorage_py/benchmarks/base.py):
- Read ssh_username and shared_staging_dir from self.args via getattr(..., None) before the call
- Pass results_dir=self.run_result_output (the benchmark's computed output directory) to collect_cluster_info()
- Pass shared_staging_dir=shared_staging_dir and ssh_username=ssh_username so SSH-based collection uses the correct credentials and staging path
Changes (mlpstorage_py/tests/test_benchmarks.py):
- Set benchmark.run_result_output = '/tmp/results/run-001' in the test fixture (previously missing; the call site needs this attribute)
- Update assert_called_once_with to expect results_dir, shared_staging_dir, and ssh_username
- Add TestCollectClusterInfoSignatureBinding regression test class (2 new tests) that binds the actual kwargs against inspect.signature() of the real collect_cluster_info function, so future signature drift is caught at unit-test time rather than at runtime
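The signature-binding idea behind that regression test can be sketched as follows. The collect_cluster_info below is a stand-in with a hypothetical signature (the real one is not shown in this summary), and kwargs_bind is an illustrative helper name.

```python
import inspect


def collect_cluster_info(hosts, results_dir, shared_staging_dir=None, ssh_username=None):
    """Stand-in for the real function; this signature is an assumption."""
    return {"hosts": hosts, "results_dir": results_dir}


def kwargs_bind(func, *args, **kwargs):
    """Return True if the arguments satisfy func's signature, without calling it."""
    try:
        inspect.signature(func).bind(*args, **kwargs)
    except TypeError:
        return False
    return True
```

A test can then assert that the exact kwargs used at the call site bind successfully. Because nothing is executed, the check stays fast and still fails as soon as a required parameter is added to the function or dropped from the call.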
Commit 4: 2431011
Author: Russell Fellows
Co-authored-by: Devasena Inupakutika <devasena.i@samsung.com>
Message: Fix #365, #372: metadata override propagation, test suite fixes, env lock
Fixes: #365
Fix #365 — CLI override_parameters not reflected in metadata.json
Problem: The submission checker reads num_checkpoints_write / num_checkpoints_read from metadata['parameters'] (the YAML defaults). For split-phase submissions (write-only or read-only runs), the correct counts are passed as CLI overrides. These overrides landed in metadata['override_parameters'] only, which the checker ignores. As a result, a 10-write + 10-read split-phase run would aggregate to 20 writes + 20 reads and be marked INVALID.
Changes (mlpstorage_py/benchmarks/base.py):
- Add _apply_dotted_overrides(params, overrides) static method that deep-copies params and merges dotted-key overrides into the nested dict structure
- In the metadata property, call _apply_dotted_overrides() so metadata['parameters'] reflects the effective runtime configuration
- metadata['override_parameters'] is still emitted unchanged for a full audit trail
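A dotted-key merge of the kind described can be sketched as below. It is a minimal sketch, not the method from base.py: the function is written as a free function for brevity, and the "checkpoint." nesting used in the usage note is an assumed example key.

```python
import copy


def apply_dotted_overrides(params, overrides):
    """Return a deep copy of params with dotted-key overrides merged in.

    A key such as "a.b.c" sets merged["a"]["b"]["c"]; missing or non-dict
    intermediate levels are replaced by new dicts. The input is not mutated.
    """
    merged = copy.deepcopy(params)
    for dotted_key, value in (overrides or {}).items():
        node = merged
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            child = node.get(part)
            if not isinstance(child, dict):
                child = {}
                node[part] = child
            node = child
        node[leaf] = value
    return merged
```

For a write-only run, calling apply_dotted_overrides(defaults, {"checkpoint.num_checkpoints_read": 0}) would yield parameters with the read count set to 0 while leaving the original defaults dict untouched.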
Fix — DLIOResultParser system info fallback
Problem: When a DLIO summary.json does not contain a system_info block (e.g. runs from older DLIO versions), DLIOResultParser.parse() returned None for ClusterInformation, breaking BenchmarkRun validation.
Changes (mlpstorage_py/rules/models.py):
- DLIOResultParser.parse() now accepts an optional metadata kwarg
- When ClusterInformation.from_dlio_summary_json() returns None, fall back to metadata['cluster_information'] if present and reconstruct via ClusterInformation.from_dict()
- BenchmarkRun.__init__ passes the run's metadata object to parser.parse() to enable the fallback
Fix: 13 pre-existing test failures
mlpstorage_py/tests/test_cluster_collector.py (10 tests):
- MPIClusterCollector(...) constructor calls and collect_cluster_info(...) call sites in failing tests were missing the now-required results_dir argument; added results_dir='/tmp' to all 10 affected call sites
- test_collector_returns_valid_data_without_error_marker: rewrote to use the current shared_staging_dir=tmpdir pattern instead of the obsolete UUID-based staging directory approach
mlpstorage_py/tests/test_rules.py (3 tests):
- TestBenchmarkRunSystemInfoFallback tests were failing with ValueError: No summary.json found in /tmp/test_run because they attempted real filesystem I/O
- Patched DLIOResultParser._load_summary and DLIOResultParser._load_hydra_configs to return in-memory mock data, removing the filesystem dependency
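The patching approach used for the test_rules.py fixes can be illustrated with a small stand-in. ResultParser here is a hypothetical class that only mimics the shape of a parser with a disk-reading _load_summary method; it is not DLIOResultParser.

```python
from unittest import mock


class ResultParser:
    """Hypothetical stand-in for a parser that normally reads from disk."""

    def __init__(self, run_dir):
        self.run_dir = run_dir

    def _load_summary(self):
        # The real-I/O path: fails when the directory has no summary file.
        raise ValueError(f"No summary.json found in {self.run_dir}")

    def parse(self):
        summary = self._load_summary()
        return summary.get("metric", {})


def parse_with_mock_summary(summary):
    """Run parse() with _load_summary patched to return in-memory data."""
    with mock.patch.object(ResultParser, "_load_summary", return_value=summary):
        return ResultParser("/tmp/test_run").parse()
```

Patching at class level means every instance created inside the with block sees the mock, and the original method is restored on exit.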
pyproject.toml / uv.lock
- Add [tool.uv] environments = ["sys_platform == 'linux'"] to pyproject.toml so uv lock does not attempt to resolve non-Linux platform markers (s3dlio only publishes Linux wheels)
- Update uv.lock accordingly
dlio_benchmark Fixes (russfellows/dlio_benchmark, feat/parquet-dgen-streaming)
The following fixes are in the dlio_benchmark fork that is pinned by this branch's pyproject.toml. They are already committed in the fork; issue #372 has an additional local change that is pending commit/push.
Fix #362 / #364 — Training stuck at epoch 1; Flux AU limited by CPU Parquet deserialization
Files: dlio_benchmark/reader/reader_factory.py, dlio_benchmark/reader/parquet_reader_file_iterable.py (new), dlio_benchmark/reader/parquet_reader_s3dlio.py
Commit: 1635b79 (feat: s3dlio-gen streaming, iterable dataloader, file iterable reader)
Issue #362: Stuck at epoch 1, no NVMe reads:
reader_factory.py routed LOCAL_FS + Parquet to the legacy ParquetReader, which calls pf.read_row_group(), a full PyArrow deserialization on every read. This is entirely CPU-bound and saturates the Python GIL, starving DLIO's DataLoader workers of CPU time. Observed symptom: the benchmark reaches "Starting epoch 1" and then makes no measurable NVMe I/O while CPU pegs at 88-95%.
Issue #364 — Flux AU limited by per-process Parquet deserialization:
Same root cause. Even on a 192-vCPU Zen 4 machine, PyArrow's read_row_group(use_threads=True) spawns additional decode threads per call. Under DLIO's model (e.g. 4 MPI × 8 read_threads = 32 workers), hundreds of threads contend on the GIL. AU on Skylake with data in tmpfs (zero I/O latency) was 21%: storage is provably not the bottleneck; CPU decode is.
Fix: reader_factory.py now routes LOCAL_FS + Parquet to ParquetReaderFileIterable, a new reader that performs raw byte-range reads via a 64-thread ThreadPoolExecutor without any PyArrow decode. Data is returned as raw bytes to the training loop. For S3/object storage, the s3dlio Rust-based reader (ParquetReaderS3dlio) is used, which similarly bypasses Python-side decode.
Result (from issue #364 testing, c6in.16xlarge, data on tmpfs): [results table by use_threads setting; values not preserved]
The ParquetReaderFileIterable path goes further (no decode at all), giving even better scaling on older CPU generations (Skylake, Cascade Lake) that lack AVX-512 Parquet acceleration.
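The general technique of decode-free, concurrent byte-range reads can be sketched as follows. This is not the ParquetReaderFileIterable implementation; the function name and the (offset, length) range format are assumptions, and only the 64-thread pool size comes from the description above.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_byte_ranges(path, ranges, max_workers=64):
    """Read (offset, length) ranges from a local file concurrently.

    Returns raw bytes in the same order as `ranges`; nothing is decoded.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # os.pread takes an explicit offset, so all threads can share
            # one descriptor without seeking. It may return fewer bytes
            # than requested if a range extends past end of file.
            futures = [
                pool.submit(os.pread, fd, length, offset)
                for offset, length in ranges
            ]
            return [future.result() for future in futures]
    finally:
        os.close(fd)
```

Since os.pread releases the GIL while the kernel services the read, the threads overlap their I/O instead of competing for CPU the way Python-side decode work does.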
Fix #371: --params storage.storage_type=direct_fs silently uses page cache
File: dlio_benchmark/checkpointing/pytorch_checkpointing.py
Commit: present in fork on branch feat/parquet-dgen-streaming
Problem: After PR #359 renamed the Python package from mlpstorage → mlpstorage_py, one import path in dlio_benchmark was missed. SimpleStreamingCheckpointing ignores the backend='direct_fs' argument entirely and uses plain open(path, "wb"). The result: when a user passes --params storage.storage_type=direct_fs, page cache is never bypassed. This was confirmed with free -h showing page cache growing during the write phase and Lustre client cache filling up on a Lustre-backed mount.
Fix (one line): the missed import is corrected to the mlpstorage_py package.
This ensures direct_fs checkpointing correctly uses O_DIRECT via s3dlio's direct:// URI scheme, bypassing the page cache as intended.
Fix #372: 32 GB hard cap blocks large-memory runs
File: dlio_benchmark/utils/config.py
Status: Modified locally in russfellows/dlio_benchmark, pending commit/push
Problem: BUDGET_MB was hard-coded to 32 * 1024 (32 GB). On hosts with more than 32 GB of RAM this cap artificially constrains the number of DataLoader workers, and the cap surfaces as an error at run time. On a 377 GB host trying to run 64 accelerators × 2 read_threads, the cap prevents any run above 32 B200 ranks × 2 threads = 32 GB, limiting throughput to ~2.3 GB/s regardless of storage capability (well below a Gen5 NVMe's 14 GB/s).
Fix: The budget now scales with actual installed RAM, which is the correct upper bound for in-memory dataset caching.
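One way a RAM-scaled budget could be computed is sketched below. The actual change to utils/config.py is not shown in this summary, so everything here is an assumption: the helper names, the use of POSIX sysconf, and in particular the 0.8 fraction are illustrative only.

```python
import os

LEGACY_CAP_MB = 32 * 1024  # the hard-coded value described above


def installed_ram_mb():
    """Total physical memory in MiB, via POSIX sysconf (Linux and similar)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    page_count = os.sysconf("SC_PHYS_PAGES")
    return (page_size * page_count) // (1024 * 1024)


def budget_mb(ram_mb=None, fraction=0.8):
    """Hypothetical policy: the budget is a fixed fraction of installed RAM."""
    ram_mb = installed_ram_mb() if ram_mb is None else ram_mb
    return int(ram_mb * fraction)
```

Under this assumed policy, a 377 GB host gets a budget well above the old 32 GB ceiling, while a small host still gets a budget below its physical memory.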
Issue #369: orte_init failed: No permission (-17) (No code fix)
Problem: OpenMPI orte_init fails with getting local rank failed → Returned value No permission (-17). This occurs when MPI processes are launched as root without passing --allow-run-as-root to mpirun, or when running inside a container with restricted Linux namespaces that prevent OpenMPI's process management layer from initializing.
Assessment: This is an environment and OpenMPI configuration issue, not a bug in mlpstorage or dlio_benchmark. The fix is to add --allow-run-as-root to the mpirun invocation, or to configure the container/namespace permissions to allow OpenMPI's process manager. No code change is warranted.
Test Results
The net gain of 15 passing tests breaks down as: