fix: gpt-oss ckpt saving #1501

Draft
akoumpa wants to merge 6 commits into main from akoumparouli/fix_gptoss_safetensors_dtype

akoumpa (Contributor) commented Mar 9, 2026

What does this PR do?

  1. Root-cause fix in _maybe_build_consolidated_index: the final fqn_to_file_index_mapping is now filtered to include only keys actually present in the state_dict being saved. This removes phantom _blocks/_scales keys inherited from the base checkpoint index when saving dequantized bf16 weights.
  2. Defensive fix in _parse_input_metadata: after populating output file data from the input shards, any FQN that still has an empty dtype_str (meaning it was not found in any input shard) is removed, and a warning is logged to make the mismatch visible.
  3. Four new tests:
  • TestBuildConsolidatedIndexQuantizedBase::test_stale_quantized_keys_are_filtered: verifies that _blocks/_scales keys do not leak into the mapping.
  • TestBuildConsolidatedIndexQuantizedBase::test_new_keys_assigned_to_last_shard: verifies that new keys get the default shard index.
  • TestConsolidationWithPhantomKeys::test_consolidated_safetensors_loadable_with_phantom_keys: end-to-end test that creates DCP shard files, runs consolidation with phantom keys, and verifies that the output is loadable with safe_open and contains the correct tensors.
  • TestConsolidationWithPhantomKeys::test_consolidated_safetensors_no_phantom_keys: regression test confirming that consolidation without phantom keys works as before.
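
The two fixes in items 1 and 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the function names `filter_index_to_state_dict` and `drop_unresolved_fqns`, the plain-dict data shapes, and the example key names are all hypothetical. Only `fqn_to_file_index_mapping`, `dtype_str`, and the `_blocks`/`_scales` suffixes come from the description above.

```python
import logging

logger = logging.getLogger(__name__)


def filter_index_to_state_dict(fqn_to_file_index_mapping, state_dict_keys, default_index):
    """Hypothetical sketch of fix 1.

    Keep only the FQNs that are actually being saved, and give any key that
    the base checkpoint index does not know about the default shard index.
    """
    filtered = {}
    for fqn in state_dict_keys:
        # Known key: reuse its shard index. New key: fall back to the default.
        filtered[fqn] = fqn_to_file_index_mapping.get(fqn, default_index)
    return filtered


def drop_unresolved_fqns(fqn_to_dtype_str):
    """Hypothetical sketch of fix 2.

    Remove, in place, every FQN whose dtype_str is still empty after the
    input shards were scanned, and log a warning listing them.
    """
    missing = [fqn for fqn, dtype_str in fqn_to_dtype_str.items() if not dtype_str]
    for fqn in missing:
        del fqn_to_dtype_str[fqn]
    if missing:
        logger.warning(
            "Dropping %d FQN(s) not found in any input shard: %s",
            len(missing),
            missing,
        )
    return missing


# Illustrative data: the base index still lists quantized keys, while the
# state_dict being saved holds a single dequantized tensor instead.
base_index = {"experts.w_blocks": 1, "experts.w_scales": 1, "embed.weight": 2}
saved_keys = ["embed.weight", "experts.w"]
mapping = filter_index_to_state_dict(base_index, saved_keys, default_index=2)

metadata = {"embed.weight": "BF16", "experts.w_blocks": ""}
dropped = drop_unresolved_fqns(metadata)
```

In this sketch the stale `experts.w_blocks` and `experts.w_scales` entries never reach `mapping`, and `experts.w` lands on the default shard, which mirrors what the first two tests are described as checking.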

Changelog

  • Add specific line by line info of high level changes in this PR.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

  • Related to # (issue)

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

copy-pr-bot (bot) commented Mar 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

akoumpa (Contributor, Author) commented Mar 9, 2026

/ok to test 4d4fc31

akoumpa added 2 commits March 9, 2026 16:06
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

akoumpa (Contributor, Author) commented Mar 9, 2026

/ok to test 8c6826b

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

akoumpa (Contributor, Author) commented Mar 9, 2026

/ok to test b5f2606

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

akoumpa (Contributor, Author) commented Mar 10, 2026

/ok to test f281b65

Labels

r0.3.0 Add for cherry-pick into release branch r0.3.0
