mlpstorage training run with --client-host-memory-in-gb 256 fails due to hard memory cap of 32 GB in dlio_benchmark #372

Description
When running mlpstorage training run with --client-host-memory-in-gb 256 and --num-accelerators 64 (comm_size=64), the underlying dlio_benchmark validation fails: it estimates ~256 GB of memory for the 512 reader worker processes (reader.read_threads=8 x comm_size=64) and compares that estimate against a hard-coded 32 GB cap. The error suggests reducing reader.read_threads from 8 to at most 1, which would severely impact performance and ignores the user-supplied memory limit.
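
The figures in the error message are mutually consistent, which helps show where the suggested limit of 1 comes from. The sketch below only reproduces that arithmetic from the values quoted in the error. The per-worker figure of 0.5 GB is inferred (256 GB / 512 workers) and is an assumption, not something read from the dlio_benchmark source.

```python
# Values quoted in the error message.
read_threads = 8
comm_size = 64
estimated_total_gb = 256
hard_cap_gb = 32

# 8 threads x 64 ranks = 512 worker processes.
workers = read_threads * comm_size

# Assumption: the estimate scales linearly with the worker count,
# which implies 0.5 GB per worker.
per_worker_gb = estimated_total_gb / workers

# Largest read_threads value whose estimate still fits under the cap.
max_threads_under_cap = int(hard_cap_gb // (comm_size * per_worker_gb))

print(workers, per_worker_gb, max_threads_under_cap)
```

Under that assumption, one read thread per rank already accounts for 64 x 0.5 GB = 32 GB, which is why the validator proposes "at most 1".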

Steps to Reproduce
Run the following command (IPs and paths are lightly obfuscated; the structure is preserved):

```bash
mlpstorage training run \
  --hosts 10.1.100.117 ... 10.1.100.132 \
  --model flux \
  --loops 1 \
  --exec-type=mpi \
  --param dataset.num_files_train=35328 \
  --client-host-memory-in-gb 256 \
  --num-accelerators 64 \
  --accelerator-type b200 \
  --num-client-hosts 16 \
  --data-dir /mnt/perf_urfuse/flux_f35328 \
  --results-dir /path/to/results \
  --closed --file --oversubscribe --allow-run-as-root
```

Environment
MLPerf Storage tool version: MLPerfStorageV3_main (most recent commit on main branch)

Python: 3.12

OS: Ubuntu (assumed, based on paths and root user)

Number of hosts: 16, each with 4 accelerators (total 64 accelerators)

Client host memory: 256 GB (as passed via --client-host-memory-in-gb 256)

Error Log
```text
Exception: Memory budget exceeded: reader.read_threads=8 x comm_size=64 = 512 worker processes, estimated ~256 GB (hard cap: 32 GB). Reduce reader.read_threads to at most 1 for this run.
```

Full stack trace:

```python
Traceback (most recent call last):
  File "/root/.venvs/MLPerfStorageV3_main/lib/python3.12/site-packages/dlio_benchmark/main.py", line 517, in run_benchmark
    benchmark.initialize()
  ...
  File "/root/.venvs/MLPerfStorageV3_main/lib/python3.12/site-packages/dlio_benchmark/utils/config.py", line 401, in validate
    raise Exception("Memory budget exceeded: reader.read_threads=8 x comm_size=64 = 512 worker processes, estimated ~256 GB (hard cap: 32 GB). Reduce reader.read_threads to at most 1 for this run.")
```

Expected Behavior
The --client-host-memory-in-gb 256 argument should be respected, allowing the memory budget calculation to use 256 GB instead of a fixed 32 GB.

Alternatively, if the 32 GB hard cap is intentional, it should be configurable via a command-line option or parameter override.

The run should proceed without forcing an unreasonably low reader.read_threads (e.g., 1).

Additional Context
The same command passed the environment validation step of mlpstorage itself.

A separate warning about collect_cluster_info() missing results_dir also appears, but it does not seem to be the root cause of the failure.

Possible Root Cause
dlio_benchmark's configuration validation (validate in dlio_benchmark/utils/config.py, per the stack trace) appears to contain a hard-coded memory cap of 32 GB that does not take into account the --client-host-memory-in-gb value passed to mlpstorage training run. The mlpstorage wrapper should either:

Propagate the host memory limit to dlio_benchmark (e.g., via an override like ++workload.reader.read_threads or ++workload.memory_cap), or

Provide a mechanism to disable or adjust the memory cap from the command line.
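
As a rough illustration of the first option, a check that is aware of host memory could compare the estimated reader memory on each host against the value of --client-host-memory-in-gb. This is a minimal sketch, not dlio_benchmark or mlpstorage code: check_memory_budget, its parameters, and the 0.5 GB per-worker estimate (inferred from the error message) are all hypothetical, and it assumes ranks are spread evenly across hosts.

```python
import math


def check_memory_budget(read_threads, comm_size, num_hosts, host_memory_gb,
                        per_worker_gb=0.5):
    """Hypothetical per-host check; returns the estimated reader memory per host in GB."""
    # Assumes MPI ranks are distributed evenly across the client hosts.
    ranks_per_host = math.ceil(comm_size / num_hosts)
    estimated_gb = read_threads * ranks_per_host * per_worker_gb
    if estimated_gb > host_memory_gb:
        # Largest thread count whose estimate still fits in host memory.
        max_threads = int(host_memory_gb // (ranks_per_host * per_worker_gb))
        raise ValueError(
            f"Memory budget exceeded: estimated ~{estimated_gb:g} GB per host "
            f"(limit: {host_memory_gb} GB). "
            f"Reduce reader.read_threads to at most {max_threads}."
        )
    return estimated_gb


# Values from this report: 64 ranks on 16 hosts with 256 GB each.
print(check_memory_budget(read_threads=8, comm_size=64, num_hosts=16, host_memory_gb=256))
```

With the values from this report the estimate is 16 GB per host, so the check passes with 8 read threads. Treating the whole job as a single 32 GB host (num_hosts=1, host_memory_gb=32) raises and reproduces the "at most 1" suggestion from the original error.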
