Skip to content

Fixes #365 Apply CLI override_parameters into metadata.json parameters#370

Open
crossmeta wants to merge 1 commit into
mlcommons:mainfrom
zettalane-systems:fix/365-metadata-override-propagation
Open

Fixes #365 Apply CLI override_parameters into metadata.json parameters#370
crossmeta wants to merge 1 commit into
mlcommons:mainfrom
zettalane-systems:fix/365-metadata-override-propagation

Conversation

@crossmeta
Copy link
Copy Markdown

Fixes #365 (Checkpointing Benchmark: Invalid Submission Due to Incorrect Operation Count)

Two-phase checkpoint runs (separate write-only and read-only mlpstorage invocations) are flagged INVALID because the submission_checker counts each phase as 10W+10R and aggregates to 20W+20R.

Root cause:

    mlpstorage_py/benchmarks/dlio.py:339-340
      self.params_dict['checkpoint.num_checkpoints_read']  =
  args.num_checkpoints_read
      self.params_dict['checkpoint.num_checkpoints_write'] =
  args.num_checkpoints_write
      # CLI args correctly captured here

    mlpstorage_py/benchmarks/base.py
      metadata['parameters']          = self.combined_params   # YAML defaults only
      metadata['override_parameters'] = self.params_dict        # CLI args (correct)
      # Two separate fields - parameters does NOT include the CLI overrides

    mlpstorage_py/rules/submission_checkers/checkpointing.py:34-36
      checkpoint_params = run.parameters.get('checkpoint', {})
      num_writes += checkpoint_params.get('num_checkpoints_write', 0)
      # Reads from `parameters` only - sees YAML defaults (10/10) regardless of
      # what the CLI flags actually set

So even though --num-checkpoints-write=10 --num-checkpoints-read=0 drove the actual run correctly, metadata.json's parameters block recorded 10/10 and the checker tallied 20W+20R.

Fix

At metadata serialization time, apply override_parameters (dotted keys) into the nested parameters dict, so parameters reflects the effective run configuration. override_parameters is still emitted unchanged for audit.

Verification

Pre-patch on the test script from #365:

$ mlpstorage checkpointing run ... --num-checkpoints-write=10 --num-checkpoints-read=0
$ # cache clear between phases
$ mlpstorage checkpointing run ... --num-checkpoints-write=0 --num-checkpoints-read=10
$ mlpstorage reports reportgen --results-dir <dir> --file --closed

INVALID: Expected 10 total read operations, but found 20
INVALID: Expected 10 total write operations, but found 20

Post-patch:

    Summary: 2 runs analyzed
      CLOSED:  2 runs
      INVALID: 0 runs
    [CLOSED] Submissions (1) - No actionable issues

Fixes mlcommons#365 (Checkpointing Benchmark: Invalid Submission Due to Incorrect
Operation Count)

The submission_checker reads num_checkpoints_write/read from
metadata['parameters'] (YAML defaults), but split-phase submissions need
the CLI overrides reflected there. Previously the overrides only landed in
metadata['override_parameters'] which the checker ignores - causing 10W+10R
per phase to aggregate to 20W+20R and INVALID.

Fix: at metadata serialization time, apply override_parameters (dotted
keys) into the nested parameters dict, so parameters reflects the effective
config. override_parameters is still emitted unchanged for full audit.

Signed-off-by: sam sammandam <suprasam@zettalane.com>
@crossmeta crossmeta requested a review from a team May 12, 2026 05:53
@github-actions
Copy link
Copy Markdown

MLCommons CLA bot:
Thank you very much for your submission; we really appreciate it. Before we can accept your contribution,
we ask that you sign the MLCommons CLA (Apache 2). Please submit your GitHub ID to our onboarding form to initiate
authorization. If you are from a MLCommons member organization, we will request that you be added to the CLA.
If you are not from a member organization, we will email you a CLA to sign. For any questions, please contact
support@mlcommons.org.
0 out of 1 committers have signed the MLCommons CLA.
@zettalane
You can retrigger this bot by commenting recheck in this Pull Request

Copy link
Copy Markdown
Contributor

@idevasena idevasena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good. Thank you!

@russfellows
Copy link
Copy Markdown
Contributor

russfellows commented May 12, 2026 via email

@FileSystemGuy
Copy link
Copy Markdown
Contributor

Did you provide your github ID (crossmeta) to the "join" workflow when you joined the MLPerf Storage WG? If not, could you please go through the "join" workflow again (providing all the same inputs) and add crossmeta as your github ID?

One of the things that the "join" workflow does is to ask you to sign a "contributor license agreement" that grants to MLCommons a license to use any code contributions you make. With that complete and on file, we add your github ID to the whitelist of ID's that are permitted to check code into our github repo. Without that, we cannot accept your code change, and we'd really like to accept it, so...

@crossmeta
Copy link
Copy Markdown
Author

@FileSystemGuy Yes, crossmeta handle was provided while joining MLPerf Storage WG.
We do see the CLA form in the email. We will sign the CLA form and submit it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Checkpointing Benchmark: Invalid Submission Due to Incorrect Operation Count

5 participants