
[dev] feat(mHC): Add basic PyTorch implementation of manifold hyper connection (mHC). #2943

Merged
Victarry merged 85 commits into NVIDIA:dev from jingqiny-99:jingqiny/feature-mHC
Mar 6, 2026

Conversation


@jingqiny-99 jingqiny-99 commented Jan 14, 2026

What does this PR do ?

Phase 1 of design proposal in #2919

Main PR #3430

Tested with Qwen3-30B-A3B, TP=1, PP=4, on 8 nodes of 8×H100 GPUs.

The loss curve below compares runs with and without mHC. The loss with mHC is slightly lower than without mHC, and the magnitude of the reduction is close to the results reported in the paper.

[image] Figure 1: Two runs.

[image: loss_diff_mHC_vs_nonmHC] Figure 2: Our results.

[image] Figure 3: Main results in the mHC paper.
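For readers unfamiliar with the technique, the hyper-connection idea behind this PR can be sketched in a few lines of PyTorch. This is purely illustrative: the class name, tensor shapes, static read/write weights, and the mean-based discard below are assumptions for exposition, not the API this PR adds. The gist is that the residual stream is expanded into `n` parallel streams, each layer reads a learned mix of them and writes its output back with learned per-stream weights, and the streams are collapsed again at the end of the block.

```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """Illustrative static hyper-connection over n parallel residual streams."""
    def __init__(self, n_streams: int = 4):
        super().__init__()
        # read weights: how the layer input is mixed from the streams
        self.read = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))
        # write weights: how the layer output is added back to each stream
        self.write = nn.Parameter(torch.ones(n_streams))

    def pre(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (n, batch, seq, hidden) -> layer input (batch, seq, hidden)
        return torch.einsum('n,nbsh->bsh', self.read, streams)

    def post(self, streams: torch.Tensor, layer_out: torch.Tensor) -> torch.Tensor:
        # broadcast the layer output back onto every stream with its own weight
        return streams + self.write.view(-1, 1, 1, 1) * layer_out

# "expand" at the start of the block: replicate the hidden state across streams
h = torch.randn(2, 8, 16)                       # (batch, seq, hidden)
streams = h.unsqueeze(0).repeat(4, 1, 1, 1)     # (n_streams, batch, seq, hidden)
hc = HyperConnection(n_streams=4)
x = hc.pre(streams)                             # mixed input for the layer
streams = hc.post(streams, x * 2.0)             # x * 2.0 stands in for a transformer layer
out = streams.mean(dim=0)                       # "discard" back to a single stream
```

The mHC variant additionally constrains the mixing weights (the "manifold" part); see the design proposal in #2919 for the actual formulation used here.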
⚠️ For major changes (either in lines of code or in impact), please make sure to first share a design doc with the team. If you're unsure of the best way to do so, contact the @mcore-oncall.

TODO:

  • Move hyper connection expand to the embedding layer?
  • Trigger discard and output in TransformerBlock
  • Consolidate mHC recompute logic in TransformerBlock in one place
  • Refactor random.py for the new save/load logic

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see Typing guidelines)
  • I have added relevant documentation
  • I have run autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr or core-nemo can merge your PR.


copy-pr-bot bot commented Jan 14, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jingqiny-99 jingqiny-99 marked this pull request as ready for review January 14, 2026 09:28
@jingqiny-99 jingqiny-99 requested review from a team as code owners January 14, 2026 09:28
@Victarry Victarry linked an issue Jan 14, 2026 that may be closed by this pull request
@jingqiny-99 jingqiny-99 marked this pull request as draft January 14, 2026 09:49
@jingqiny-99
Author

/ok to test d86d741

@jingqiny-99
Author

/ok to test 18f66d1

@jingqiny-99
Author

/ok to test 90f26ec


with self.bias_dropout_add_exec_handler():
    hidden_states = self.cross_attn_bda(
        self.training, self.config.bias_dropout_fusion, mhc_recompute_manager

Bug: mhc_recompute_manager is passed to cross_attn_bda even though cross-attention is explicitly not supported by hyper connections. This incorrectly enrols the cross-attention BDA into the mHC CheckpointManager, so it will be recomputed as part of the mHC recompute plan even though no HC pre/post ops wrap it. Should pass None here (or just omit the third argument to match the base-class call):

Suggested change
-        self.training, self.config.bias_dropout_fusion, mhc_recompute_manager
+    hidden_states = self.cross_attn_bda(
+        self.training, self.config.bias_dropout_fusion
+    )(attention_output_with_bias, residual, self.hidden_dropout)
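The failure mode can be illustrated with a standalone sketch. The `CheckpointManager`, `get_bda` factory, and the registration side effect below are all invented for illustration (they are not the Megatron-LM API); the point is that merely forwarding the manager to an op enrols it, so passing it to an op with no HC wrappers silently grows the recompute plan:

```python
class CheckpointManager:
    """Hypothetical stand-in for the mHC recompute manager."""
    def __init__(self):
        self.registered = []

    def register(self, name):
        self.registered.append(name)

def get_bda(name, training, fusion, manager=None):
    # Passing a manager has a side effect: the op joins the recompute plan.
    if manager is not None:
        manager.register(name)
    def bias_dropout_add(x, residual, dropout_p):
        return x + residual  # stand-in for the fused bias-dropout-add
    return bias_dropout_add

mgr = CheckpointManager()
get_bda("self_attn_bda", True, False, mgr)   # intended: wrapped by HC pre/post ops
get_bda("cross_attn_bda", True, False, mgr)  # bug: no HC ops wrap cross-attention
# mgr.registered now contains both ops, so cross-attn is recomputed too
get_bda("cross_attn_bda", True, False)       # fix: omit the manager argument
```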


# MTP does not support hyper connections yet; strip HC modules and
# downgrade the layer class to plain TransformerLayer.
transformer_layer_spec.submodules.self_attention_hyper_connection = IdentityOp

The spec submodules are mutated in-place here. If the caller holds a reference to the same spec / spec.layer_specs[-1] and reuses it (e.g., to build both the decoder block and the MTP block), the HC module slots and the layer class will have been silently stripped from the shared object. Consider doing a shallow copy of the submodules before patching:

import copy
transformer_layer_spec = copy.copy(transformer_layer_spec)
transformer_layer_spec.submodules = copy.copy(transformer_layer_spec.submodules)
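A self-contained illustration of the hazard, with dataclass names invented for the example (these are not the actual Megatron-LM spec classes): an in-place patch leaks into every holder of the shared spec, while shallow-copying the spec and its submodules first keeps the original intact.

```python
import copy
from dataclasses import dataclass, field

@dataclass
class Submodules:
    self_attention_hyper_connection: str = "HyperConnection"

@dataclass
class LayerSpec:
    submodules: Submodules = field(default_factory=Submodules)

# In-place mutation: the decoder's spec is silently stripped too.
shared = LayerSpec()
mtp_spec_bad = shared
mtp_spec_bad.submodules.self_attention_hyper_connection = "IdentityOp"
# shared.submodules.self_attention_hyper_connection is now "IdentityOp" -- corrupted

# Shallow-copy the spec and its submodules before patching.
shared2 = LayerSpec()
mtp_spec = copy.copy(shared2)
mtp_spec.submodules = copy.copy(shared2.submodules)
mtp_spec.submodules.self_attention_hyper_connection = "IdentityOp"
# shared2.submodules.self_attention_hyper_connection is still "HyperConnection"
```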

@claude

claude bot commented Mar 5, 2026

Review summary

Two bugs found, one pre-existing bug fixed.

1. Pre-existing bug fixed: _set_warmup_end never reset the flag
cuda_graphs.py: _set_warmup_end() previously had an empty body so _IS_GRAPH_WARMUP was never reset to False. Fixed correctly in this PR.

2. Bug: cross-attention BDA incorrectly enrolled in mHC recompute plan
See inline comment on transformer_layer.py. In HyperConnectionTransformerLayer._forward_attention, cross_attn_bda is called with mhc_recompute_manager as the third positional argument. Cross-attention has no HC pre/post ops wrapping it, so this silently registers the cross-attn BDA into the CheckpointManager and includes it in unified recompute at block boundaries. The bug is latent for decoder-only GPT (cross_attn_bda is IdentityFuncOp) but will misfire for encoder-decoder models.

3. In-place spec mutation for MTP
See inline comment on gpt_layer_specs.py. get_gpt_mtp_block_spec_for_backend strips HC submodules in-place. If the caller reuses the same spec object for both decoder and MTP block construction, the HC slots and layer class will have been silently removed.
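Bug class (1) above reduces to a tiny standalone reproduction. The flag and setter names mirror the description, but this is a sketch, not the actual cuda_graphs.py:

```python
# Module-level flag tracking whether CUDA-graph warmup is in progress.
_IS_GRAPH_WARMUP = False

def _set_warmup_start():
    global _IS_GRAPH_WARMUP
    _IS_GRAPH_WARMUP = True

def _set_warmup_end_buggy():
    # The pre-existing bug: an effectively empty body, so the flag never resets.
    pass

def _set_warmup_end():
    # The fix in this PR: actually reset the flag.
    global _IS_GRAPH_WARMUP
    _IS_GRAPH_WARMUP = False

_set_warmup_start()
_set_warmup_end_buggy()
still_warming = _IS_GRAPH_WARMUP   # True: warmup state leaks past warmup
_set_warmup_end()
fixed = _IS_GRAPH_WARMUP           # False: flag correctly reset
```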

@Victarry
Contributor

Victarry commented Mar 5, 2026

Hi @ericharper @jaredcasper, we have derived a new TransformerLayer class for hyper connections and left the original TransformerLayer clean, as we discussed before. There is still some duplicated code between the base TransformerLayer and HyperConnectionTransformerLayer, and we will continue refactoring it in the PR to the main branch.

We hope to merge this dev branch PR first to unblock the release of this feature to customers. What do you think of this?

@jingqiny-99
Author

/ok to test dc917a6

@jingqiny-99
Author

/ok to test e7e1a13

@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/22748741214


Labels

Expert Review Apply this label to indicate that your PR is ready for expert review.

Development

Successfully merging this pull request may close these issues.

Design Proposal: mHC (Manifold-Constrained Hyper-Connections)

9 participants