
[dev] feat(mHC): Add basic PyTorch implementation of manifold hyper connection (mHC). #2943

Merged
Victarry merged 85 commits into NVIDIA:dev from jingqiny-99:jingqiny/feature-mHC
Mar 6, 2026

Conversation


@jingqiny-99 jingqiny-99 commented Jan 14, 2026

What does this PR do ?

Phase 1 of design proposal in #2919

Main PR #3430

Tested with Qwen3-30B-A3B, TP=1, PP=4, on 8 nodes of 8×H100 GPUs.

The loss curve below compares runs with and without mHC. The loss with mHC is slightly lower than without mHC, and the magnitude of the reduction is close to the results reported in the paper.

[image] Figure 1: Two runs.

[image: loss_diff_mHC_vs_nonmHC] Figure 2: Our results.

[image] Figure 3: Main results in the mHC paper.
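For readers unfamiliar with the technique, the hyper-connection idea behind this PR can be sketched in a few lines of PyTorch. This is purely illustrative: the class name, tensor shapes, static read/write weights, and the mean-based discard below are assumptions for exposition, not the API this PR adds. The gist is that the residual stream is expanded into `n` parallel streams, each layer reads a learned mix of them and writes its output back with learned per-stream weights, and the streams are collapsed again at the end of the block.

```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """Illustrative static hyper-connection over n parallel residual streams."""
    def __init__(self, n_streams: int = 4):
        super().__init__()
        # read weights: how the layer input is mixed from the streams
        self.read = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))
        # write weights: how the layer output is added back to each stream
        self.write = nn.Parameter(torch.ones(n_streams))

    def pre(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (n, batch, seq, hidden) -> layer input (batch, seq, hidden)
        return torch.einsum('n,nbsh->bsh', self.read, streams)

    def post(self, streams: torch.Tensor, layer_out: torch.Tensor) -> torch.Tensor:
        # broadcast the layer output back onto every stream with its own weight
        return streams + self.write.view(-1, 1, 1, 1) * layer_out

# "expand" at the start of the block: replicate the hidden state across streams
h = torch.randn(2, 8, 16)                       # (batch, seq, hidden)
streams = h.unsqueeze(0).repeat(4, 1, 1, 1)     # (n_streams, batch, seq, hidden)
hc = HyperConnection(n_streams=4)
x = hc.pre(streams)                             # mixed input for the layer
streams = hc.post(streams, x * 2.0)             # x * 2.0 stands in for a transformer layer
out = streams.mean(dim=0)                       # "discard" back to a single stream
```

The mHC variant additionally constrains the mixing weights (the "manifold" part); see the design proposal in #2919 for the actual formulation used here.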
⚠️ For major changes (either in lines of code or in impact), please make sure to first share a design doc with the team. If you're unsure of the best way to do so, contact the @mcore-oncall.

TODO:

  • Move hyper connection expand to the embedding layer?
  • Trigger discard and output in TransformerBlock
  • Consolidate mHC recompute logic in TransformerBlock in one place
  • Refactor random.py for the new save/load logic

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see Typing guidelines)
  • I have added relevant documentation
  • I have run autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr or core-nemo can merge your PR.


copy-pr-bot bot commented Jan 14, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jingqiny-99 jingqiny-99 marked this pull request as ready for review January 14, 2026 09:28
@jingqiny-99 jingqiny-99 requested review from a team as code owners January 14, 2026 09:28
@Victarry Victarry linked an issue Jan 14, 2026 that may be closed by this pull request
@jingqiny-99 jingqiny-99 marked this pull request as draft January 14, 2026 09:49
@jingqiny-99
Author

/ok to test d86d741

@jingqiny-99
Author

/ok to test 18f66d1

@jingqiny-99
Author

/ok to test 90f26ec


with self.bias_dropout_add_exec_handler():
    hidden_states = self.cross_attn_bda(
        self.training, self.config.bias_dropout_fusion, mhc_recompute_manager

Bug: mhc_recompute_manager is passed to cross_attn_bda even though cross-attention is explicitly not supported by hyper connections. This incorrectly enrols the cross-attention BDA into the mHC CheckpointManager, so it will be recomputed as part of the mHC recompute plan even though no HC pre/post ops wrap it. Should pass None here (or just omit the third argument to match the base-class call):

Suggested change
-        self.training, self.config.bias_dropout_fusion, mhc_recompute_manager
+    hidden_states = self.cross_attn_bda(
+        self.training, self.config.bias_dropout_fusion
+    )(attention_output_with_bias, residual, self.hidden_dropout)
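The failure mode can be illustrated with a standalone sketch. The `CheckpointManager`, `get_bda` factory, and the registration side effect below are all invented for illustration (they are not the Megatron-LM API); the point is that merely forwarding the manager to an op enrols it, so passing it to an op with no HC wrappers silently grows the recompute plan:

```python
class CheckpointManager:
    """Hypothetical stand-in for the mHC recompute manager."""
    def __init__(self):
        self.registered = []

    def register(self, name):
        self.registered.append(name)

def get_bda(name, training, fusion, manager=None):
    # Passing a manager has a side effect: the op joins the recompute plan.
    if manager is not None:
        manager.register(name)
    def bias_dropout_add(x, residual, dropout_p):
        return x + residual  # stand-in for the fused bias-dropout-add
    return bias_dropout_add

mgr = CheckpointManager()
get_bda("self_attn_bda", True, False, mgr)   # intended: wrapped by HC pre/post ops
get_bda("cross_attn_bda", True, False, mgr)  # bug: no HC ops wrap cross-attention
# mgr.registered now contains both ops, so cross-attn is recomputed too
get_bda("cross_attn_bda", True, False)       # fix: omit the manager argument
```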


# MTP does not support hyper connections yet; strip HC modules and
# downgrade the layer class to plain TransformerLayer.
transformer_layer_spec.submodules.self_attention_hyper_connection = IdentityOp

The spec submodules are mutated in-place here. If the caller holds a reference to the same spec / spec.layer_specs[-1] and reuses it (e.g., to build both the decoder block and the MTP block), the HC module slots and the layer class will have been silently stripped from the shared object. Consider doing a shallow copy of the submodules before patching:

import copy
transformer_layer_spec = copy.copy(transformer_layer_spec)
transformer_layer_spec.submodules = copy.copy(transformer_layer_spec.submodules)
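A self-contained illustration of the hazard, with dataclass names invented for the example (these are not the actual Megatron-LM spec classes): an in-place patch leaks into every holder of the shared spec, while shallow-copying the spec and its submodules first keeps the original intact.

```python
import copy
from dataclasses import dataclass, field

@dataclass
class Submodules:
    self_attention_hyper_connection: str = "HyperConnection"

@dataclass
class LayerSpec:
    submodules: Submodules = field(default_factory=Submodules)

# In-place mutation: the decoder's spec is silently stripped too.
shared = LayerSpec()
mtp_spec_bad = shared
mtp_spec_bad.submodules.self_attention_hyper_connection = "IdentityOp"
# shared.submodules.self_attention_hyper_connection is now "IdentityOp" -- corrupted

# Shallow-copy the spec and its submodules before patching.
shared2 = LayerSpec()
mtp_spec = copy.copy(shared2)
mtp_spec.submodules = copy.copy(shared2.submodules)
mtp_spec.submodules.self_attention_hyper_connection = "IdentityOp"
# shared2.submodules.self_attention_hyper_connection is still "HyperConnection"
```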

@claude

claude bot commented Mar 5, 2026

Review summary

Two bugs found, one pre-existing bug fixed.

1. Pre-existing bug fixed: _set_warmup_end never reset the flag
cuda_graphs.py: _set_warmup_end() previously had an empty body so _IS_GRAPH_WARMUP was never reset to False. Fixed correctly in this PR.

2. Bug: cross-attention BDA incorrectly enrolled in mHC recompute plan
See inline comment on transformer_layer.py. In HyperConnectionTransformerLayer._forward_attention, cross_attn_bda is called with mhc_recompute_manager as the third positional argument. Cross-attention has no HC pre/post ops wrapping it, so this silently registers the cross-attn BDA into the CheckpointManager and includes it in unified recompute at block boundaries. The bug is latent for decoder-only GPT (cross_attn_bda is IdentityFuncOp) but will misfire for encoder-decoder models.

3. In-place spec mutation for MTP
See inline comment on gpt_layer_specs.py. get_gpt_mtp_block_spec_for_backend strips HC submodules in-place. If the caller reuses the same spec object for both decoder and MTP block construction, the HC slots and layer class will have been silently removed.
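Bug class (1) above reduces to a tiny standalone reproduction. The flag and setter names mirror the description, but this is a sketch, not the actual cuda_graphs.py:

```python
# Module-level flag tracking whether CUDA-graph warmup is in progress.
_IS_GRAPH_WARMUP = False

def _set_warmup_start():
    global _IS_GRAPH_WARMUP
    _IS_GRAPH_WARMUP = True

def _set_warmup_end_buggy():
    # The pre-existing bug: an effectively empty body, so the flag never resets.
    pass

def _set_warmup_end():
    # The fix in this PR: actually reset the flag.
    global _IS_GRAPH_WARMUP
    _IS_GRAPH_WARMUP = False

_set_warmup_start()
_set_warmup_end_buggy()
still_warming = _IS_GRAPH_WARMUP   # True: warmup state leaks past warmup
_set_warmup_end()
fixed = _IS_GRAPH_WARMUP           # False: flag correctly reset
```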

@Victarry
Contributor

Victarry commented Mar 5, 2026

Hi @ericharper @jaredcasper, we have derived a new TransformerLayer class for hyper connections and left the original TransformerLayer clean, as we discussed before. There is still some duplicated code between the base TransformerLayer and HyperConnectionTransformerLayer, and we will continue refactoring it in the PR to the main branch.

We hope to merge this dev branch PR first to unblock the release of this feature to customers. What do you think of this?

@jingqiny-99
Author

/ok to test dc917a6

@jingqiny-99
Author

/ok to test e7e1a13

@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/22748741214


Labels

Expert Review Apply this label to indicate that your PR is ready for expert review.

Development

Successfully merging this pull request may close these issues.

Design Proposal: mHC (Manifold-Constrained Hyper-Connections)

9 participants