[dev] feat(mHC): Add basic PyTorch implementation of manifold hyper connection (mHC). #2943
Conversation
```python
with self.bias_dropout_add_exec_handler():
    hidden_states = self.cross_attn_bda(
        self.training, self.config.bias_dropout_fusion, mhc_recompute_manager
```

Bug: `mhc_recompute_manager` is passed to `cross_attn_bda` even though cross-attention is explicitly not supported by hyper connections. This incorrectly enrols the cross-attention BDA into the mHC `CheckpointManager`, so it will be recomputed as part of the mHC recompute plan even though no HC pre/post ops wrap it. Should pass `None` here (or just omit the third argument to match the base-class call):

```python
hidden_states = self.cross_attn_bda(
    self.training, self.config.bias_dropout_fusion
)(attention_output_with_bias, residual, self.hidden_dropout)
```
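To illustrate the failure mode, here is a minimal self-contained sketch of the enrollment pattern. `CheckpointManager`, `register`, and `get_bda` below are toy stand-ins, not Megatron-LM's actual classes or signatures; the point is only that supplying a manager to the BDA builder puts that op on the recompute plan as a side effect, so the fix is simply to not pass it:

```python
class CheckpointManager:
    """Toy stand-in: records every op built with a manager for later recompute."""
    def __init__(self):
        self.recompute_plan = []

    def register(self, name):
        self.recompute_plan.append(name)


def get_bda(training, fusion, manager=None):
    # Hypothetical builder: passing a manager enrols the op as a side effect.
    if manager is not None:
        manager.register("cross_attn_bda")

    def bda(x, residual, dropout_p):
        return x + residual  # stand-in for the fused bias-dropout-add

    return bda


buggy = CheckpointManager()
get_bda(True, False, buggy)       # buggy call: cross-attention BDA is enrolled
print(buggy.recompute_plan)       # ['cross_attn_bda']

fixed = CheckpointManager()
get_bda(True, False)              # fixed call: manager omitted, nothing enrolled
print(fixed.recompute_plan)       # []
```

With the manager omitted, the cross-attention BDA behaves exactly as in the base class and stays out of the mHC recompute plan.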
```python
# MTP does not support hyper connections yet; strip HC modules and
# downgrade the layer class to plain TransformerLayer.
transformer_layer_spec.submodules.self_attention_hyper_connection = IdentityOp
```

The spec submodules are mutated in place here. If the caller holds a reference to the same spec / `spec.layer_specs[-1]` and reuses it (e.g., to build both the decoder block and the MTP block), the HC module slots and the layer class will have been silently stripped from the shared object. Consider doing a shallow copy of the submodules before patching:

```python
import copy

transformer_layer_spec = copy.copy(transformer_layer_spec)
transformer_layer_spec.submodules = copy.copy(transformer_layer_spec.submodules)
```
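A runnable sketch of why both copies are needed. `Submodules` and `LayerSpec` below are toy dataclasses standing in for Megatron-LM's spec objects, and the string values stand in for module classes; note that copying the spec alone is not enough, because the copy still points at the same nested `submodules` object:

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Submodules:
    # Stand-in for TransformerLayerSubmodules; the string represents a module class.
    self_attention_hyper_connection: str = "MHCModule"


@dataclass
class LayerSpec:
    # Stand-in for a layer ModuleSpec shared between decoder and MTP builders.
    submodules: Submodules = field(default_factory=Submodules)


shared = LayerSpec()

# Insufficient: copying only the spec still shares the nested submodules object,
# so patching the copy silently strips HC from the shared spec too.
naive = copy.copy(shared)
naive.submodules.self_attention_hyper_connection = "IdentityOp"
print(shared.submodules.self_attention_hyper_connection)   # IdentityOp  (corrupted!)

# Correct: shallow-copy the spec AND its submodules before patching.
shared = LayerSpec()
patched = copy.copy(shared)
patched.submodules = copy.copy(patched.submodules)
patched.submodules.self_attention_hyper_connection = "IdentityOp"
print(shared.submodules.self_attention_hyper_connection)   # MHCModule  (untouched)
```

Two shallow copies suffice here because only one level of nesting is patched; if deeper fields of the spec were mutated as well, `copy.deepcopy` would be the safer choice.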
Review summary: two bugs found, one pre-existing bug fixed.

1. Pre-existing bug fixed:
2. Bug: cross-attention BDA incorrectly enrolled in mHC recompute plan
3. In-place spec mutation for MTP
Hi @ericharper @jaredcasper, we have derived a new dev branch PR from the main PR (#3430). We hope to merge this dev branch PR first to unblock the release of this feature to customers. What do you think?
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/22748741214 |
What does this PR do?
Phase 1 of design proposal in #2919
Main PR #3430
Tested with Qwen3-30B-A3B, TP=1, PP=4 on 8 nodes of 8×H100 GPUs.
The loss curve below compares runs with and without mHC. The loss with mHC is slightly lower than without mHC, and the magnitude of the reduction is close to the results reported in the paper.
Figure 1: Two runs.
Figure 2: Our results.
Figure 3: Main results from the mHC paper.
⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.
TODO:
Contribution process

```mermaid
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]
```

Pre-checks

Core 0.8)

Code review

The following process is enforced via the CODEOWNERS file for changes into `megatron/core`. For changes outside of `megatron/core`, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label. Add the `Expert Review` label when your PR is ready for review.

(Step 2): Collect the expert reviewers' reviews. Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review. Add the `Final Review` label.

(Optional Step 4): Cherry-pick into release branch. If this PR also needs to be merged into `core_r*` release branches, after this PR has been merged, select `Cherry-pick` to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.