mlpstorage training run with --client-host-memory-in-gb 256 fails due to hard memory cap of 32 GB in dlio_benchmark #372

@litianqi00315

Description

When running mlpstorage training run with --client-host-memory-in-gb 256 and --num-accelerators 64 (comm_size=64), the underlying dlio_benchmark validation fails. It estimates ~256 GB of memory for the 512 reader worker processes (reader.read_threads=8 x comm_size=64) and compares that estimate against a hard-coded 32 GB cap. The error suggests reducing reader.read_threads from 8 to at most 1, which would severely impact performance and ignores the user-supplied memory limit.
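As a sanity check, the numbers in the error message can be reproduced with simple arithmetic. This is only a sketch: the per-worker figure is inferred from the message (~256 GB for 512 workers) on the assumption that the estimate scales linearly with the worker count. It is not read from the dlio_benchmark source.

```python
import math

# Values copied from the error message.
read_threads = 8
comm_size = 64
estimated_gb = 256   # "estimated ~256 GB"
hard_cap_gb = 32     # "hard cap: 32 GB"

# Total reader worker processes across all ranks.
workers = read_threads * comm_size

# Assumption: the estimate is linear in the number of workers.
per_worker_gb = estimated_gb / workers

# Largest read_threads whose estimate still fits under the cap.
max_threads = math.floor(hard_cap_gb / (comm_size * per_worker_gb))

print(workers, per_worker_gb, max_threads)
```

Under that assumption the result matches the "at most 1" in the error, which suggests the cap is applied to the total across all 64 ranks rather than per host.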

Steps to Reproduce
Run the following command (IPs and paths obfuscated slightly but structure preserved):

bash
mlpstorage training run \
  --hosts 10.1.100.117 ... 10.1.100.132 \
  --model flux \
  --loops 1 \
  --exec-type=mpi \
  --param dataset.num_files_train=35328 \
  --client-host-memory-in-gb 256 \
  --num-accelerators 64 \
  --accelerator-type b200 \
  --num-client-hosts 16 \
  --data-dir /mnt/perf_urfuse/flux_f35328 \
  --results-dir /path/to/results \
  --closed --file --oversubscribe --allow-run-as-root
Environment
MLPerf Storage tool version: MLPerfStorageV3_main (most recent commit on main branch)

Python: 3.12

OS: Ubuntu (assumed, based on paths and root user)

Number of hosts: 16, each with 4 accelerators (total 64 accelerators)

Client host memory: 256 GB (as passed via --client-host-memory-in-gb 256)

Error Log
text
Exception: Memory budget exceeded: reader.read_threads=8 x comm_size=64 = 512 worker processes, estimated ~256 GB (hard cap: 32 GB). Reduce reader.read_threads to at most 1 for this run.
Full stack trace:

python
Traceback (most recent call last):
  File "/root/.venvs/MLPerfStorageV3_main/lib/python3.12/site-packages/dlio_benchmark/main.py", line 517, in run_benchmark
    benchmark.initialize()
  ...
  File "/root/.venvs/MLPerfStorageV3_main/lib/python3.12/site-packages/dlio_benchmark/utils/config.py", line 401, in validate
    raise Exception("Memory budget exceeded: reader.read_threads=8 x comm_size=64 = 512 worker processes, estimated ~256 GB (hard cap: 32 GB). Reduce reader.read_threads to at most 1 for this run.")
Expected Behavior
The --client-host-memory-in-gb 256 argument should be respected, allowing the memory budget calculation to use 256 GB instead of a fixed 32 GB.

Alternatively, if the 32 GB hard cap is intentional, it should be configurable via a command‑line option or parameter override.

The run should proceed without forcing an unreasonably low reader.read_threads (e.g., 1).
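To illustrate what respecting the host memory could look like, here is a minimal sketch of a per-host check. The function name `max_read_threads` is hypothetical, and the 0.5 GB per worker default is an assumption inferred from the error message (~256 GB for 512 workers). The sketch also ignores every other consumer of host memory.

```python
def max_read_threads(host_memory_gb: float, ranks_per_host: int,
                     per_worker_gb: float = 0.5) -> int:
    """Largest read_threads whose workers fit in one host's memory.

    per_worker_gb=0.5 is an assumption inferred from the error message,
    not a value taken from dlio_benchmark.
    """
    if host_memory_gb <= 0 or ranks_per_host <= 0 or per_worker_gb <= 0:
        raise ValueError("all arguments must be positive")
    return int(host_memory_gb // (ranks_per_host * per_worker_gb))


# Setup from this report: 64 accelerators on 16 hosts, 256 GB per host.
ranks_per_host = 64 // 16
limit = max_read_threads(256, ranks_per_host)

# Estimated reader memory on one host with read_threads=8.
per_host_estimate_gb = 8 * ranks_per_host * 0.5

print(limit, per_host_estimate_gb)
```

With these assumptions, read_threads=8 needs roughly 16 GB per host, far below the 256 GB passed on the command line.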

Additional Context
The same command passed the environment validation step of mlpstorage itself.

A separate warning about collect_cluster_info() missing results_dir also appears, but it does not seem to be the root cause of the failure.

Possible Root Cause
dlio_benchmark's configuration validation (the validate step in utils/config.py, per the stack trace) appears to contain a hard-coded memory cap of 32 GB that does not take into account the --client-host-memory-in-gb value passed to mlpstorage training run. The mlpstorage wrapper should either:

Propagate the host memory limit to dlio_benchmark (e.g., via an override like ++workload.reader.read_threads or ++workload.memory_cap), or

Provide a mechanism to disable or adjust the memory cap from the command line.
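The first option could be sketched roughly as follows on the wrapper side. The helper `memory_cap_override` and the key `workload.memory_cap` are hypothetical; the key is only the example name suggested above, not an existing dlio_benchmark option. Whether the cap should be the total across hosts (as assumed here) or a per-host value is also an open question.

```python
def memory_cap_override(client_host_memory_in_gb: int,
                        num_client_hosts: int,
                        key: str = "workload.memory_cap") -> str:
    """Format a ++key=value override carrying the total memory limit.

    The key name is hypothetical; it is not a known dlio_benchmark option.
    """
    if client_host_memory_in_gb <= 0 or num_client_hosts <= 0:
        raise ValueError("memory and host count must be positive")
    # Assumption: the cap is compared against an estimate for all hosts,
    # so the per-host memory is multiplied by the number of hosts.
    total_gb = client_host_memory_in_gb * num_client_hosts
    return f"++{key}={total_gb}"


# Values from the command in this report.
override = memory_cap_override(256, 16)
print(override)
```

The wrapper would append the resulting string to the dlio_benchmark invocation, and the validation would compare its estimate against that value instead of the fixed 32 GB.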
