[Community][Dev] feat(moe): Adding context parallel support to eager attention implementation #1859
Merged
ko3n1g merged 13 commits into NVIDIA:dev on Nov 18, 2025
Conversation
Force-pushed from 10ddbd8 to 5c981b7
Contributor
/ok to test c2410fc
Contributor
Thank you for your contribution! NVIDIA Megatron-LM is currently transitioning to development on GitHub. We will aim to review your PR after we complete our transition and stabilize our GitHub development process. Thank you for your understanding.
Contributor
/ok to test 158e53b
hxbai
reviewed
Nov 4, 2025
        raise AssertionError("use_te_activation_func not compatible with using kitchen.")
    else:
-       backend = TESpecProvider()
+       backend = TESpecProvider(fallback_to_eager_attn=fallback_to_eager_attn)
Contributor
I think it is better to handle this in get_attention_module_spec_for_backend rather than modifying TESpecProvider, since we have other backends like Kitchen. Similar to the code here.
hxbai
reviewed
Nov 4, 2025
            f"the number of layers ({self.num_layers})"
        )

+   if self.fallback_to_eager_attn:
Contributor
Please also check that --cp-comm-type matches the implementation when CP is enabled.
hxbai
reviewed
Nov 4, 2025
        return attn_output


+   def test_eager_attention_function():
Contributor
Please modify this test to a parallel version that covers the CP and TPxCP cases.
fix pipeline
yanring
approved these changes
Nov 18, 2025
For some attention and mask variants it is difficult to write a fused, optimized implementation on short notice, yet we still need to run experiments to verify their effectiveness. In those cases we need to fall back to eager mode, so I added a switch that falls back to the eager implementation of attention.
Additionally, since Megatron Core's eager attention does not support context parallelism, I provided a distributed attention implementation similar to the one described in the Llama 3 paper.
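A minimal single-process sketch of the two ideas above, not the PR's actual Megatron code (function names here are hypothetical): eager attention is plain unfused scaled-dot-product attention, and Llama 3-style context parallelism shards the queries across CP ranks, all-gathers K and V, and runs eager attention locally on each shard. Causal masking is omitted for brevity, and the all-gather is simulated by simply passing the already-global K/V to every shard.

```python
import numpy as np

def eager_attention(q, k, v):
    """Unfused scaled-dot-product attention; q, k, v are [seq, head_dim]."""
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    # Numerically stable softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def cp_eager_attention(q, k, v, cp_size):
    """Simulated context parallelism: each 'rank' holds a shard of q,
    all-gathers the full k/v (already global here), attends locally,
    and the per-rank outputs concatenate to the full-sequence result."""
    shards = np.array_split(np.arange(q.shape[0]), cp_size)
    return np.concatenate([eager_attention(q[idx], k, v) for idx in shards])

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
full = eager_attention(q, k, v)
sharded = cp_eager_attention(q, k, v, cp_size=2)
assert np.allclose(full, sharded)  # CP result matches single-device attention
```

Because attention is row-wise independent in the queries, sharding q while keeping K/V global is exact, which is why the eager fallback composes cleanly with this form of context parallelism.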