
[Community][Dev] feat(moe): Adding context parallel support to eager attention implementation #1859

Merged
ko3n1g merged 13 commits into NVIDIA:dev from nrailg:nrwu/eagercp on Nov 18, 2025
Conversation

@nrailg

@nrailg nrailg commented Oct 13, 2025

For some attention and mask variants, it is difficult to write a fused, optimized implementation on short notice, yet we still need to run experiments to verify their effectiveness. In those cases we need to fall back to eager mode, so I added a switch that falls back to the eager implementation of attention:

--fallback-to-eager-attn

Additionally, since Megatron Core's eager attention does not support context parallelism, I provided a distributed attention implementation similar to the one described in the Llama 3 paper.
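The Llama 3 style of context parallelism mentioned above can be sketched in a single process: K and V are all-gathered across CP ranks while each rank attends only with its local query shard. This is a minimal NumPy simulation of that scheme, not the PR's actual PyTorch/Megatron code; all names are illustrative.

```python
# Single-process sketch of all-gather context parallelism (Llama 3 style):
# each "rank" keeps its own Q shard but sees the full, gathered K/V.
import numpy as np

def eager_attention(q, k, v):
    # q: [sq, d]; k, v: [skv, d] -- plain softmax attention, no masking.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v

def cp_eager_attention(q_shards, k_shards, v_shards):
    # Simulated all-gather: concatenate the K/V shards from every rank,
    # then each rank runs eager attention on its local Q shard only.
    k_full = np.concatenate(k_shards, axis=0)
    v_full = np.concatenate(v_shards, axis=0)
    return [eager_attention(q, k_full, v_full) for q in q_shards]

rng = np.random.default_rng(0)
seq, dim, cp = 8, 4, 2
q, k, v = (rng.standard_normal((seq, dim)) for _ in range(3))
ref = eager_attention(q, k, v)
out = np.concatenate(
    cp_eager_attention(np.split(q, cp), np.split(k, cp), np.split(v, cp)), axis=0
)
assert np.allclose(ref, out)  # sharded result matches the unsharded reference
```

Because softmax attention for a given query row depends only on that row and the full K/V, sharding queries while gathering K/V reproduces the unsharded result exactly.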

@copy-pr-bot

copy-pr-bot bot commented Oct 13, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nrailg nrailg changed the title Add context parallel support to eager attention implementation Adding context parallel support to eager attention implementation Oct 13, 2025
@nrailg nrailg force-pushed the nrwu/eagercp branch 2 times, most recently from 10ddbd8 to 5c981b7 on October 14, 2025 08:48
@sbhavani sbhavani added the enhancement New feature or request label Oct 21, 2025
@yuzhongw-nvidia yuzhongw-nvidia changed the title Adding context parallel support to eager attention implementation [Dev] feat(moe): Adding context parallel support to eager attention implementation Oct 27, 2025
@yuzhongw-nvidia yuzhongw-nvidia requested review from a team as code owners October 27, 2025 07:11
@yanring yanring requested a review from hxbai October 27, 2025 07:12
@yuzhongw-nvidia yuzhongw-nvidia self-assigned this Oct 27, 2025
@yuzhongw-nvidia yuzhongw-nvidia removed their request for review October 27, 2025 07:36
@yuzhongw-nvidia
Contributor

/ok to test c2410fc

@github-actions
Contributor

Thank you for your contribution!

NVIDIA Megatron-LM is currently transitioning to development on GitHub. We will aim to review your PR after we complete our transition and stabilize our GitHub development process.

Thank you for your understanding.

@yuzhongw-nvidia
Contributor

/ok to test 158e53b

@NVIDIA NVIDIA deleted a comment from copy-pr-bot bot Oct 31, 2025
    raise AssertionError("use_te_activation_func not compatible with using kitchen.")
else:
-    backend = TESpecProvider()
+    backend = TESpecProvider(fallback_to_eager_attn=fallback_to_eager_attn)
Contributor

@hxbai hxbai Nov 4, 2025

I think it is better to handle this in get_attention_module_spec_for_backend rather than modify the TESpecProvider since we have other backends like Kitchen.

Similar to the code here

module = TEFusedMLP if use_te_op_fuser else MLP
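hxbai's suggestion could look something like the following: dispatch on the flag in a backend-agnostic helper, mirroring the `module = TEFusedMLP if use_te_op_fuser else MLP` pattern, rather than threading the flag through `TESpecProvider`. This is a hypothetical sketch; `EagerDotProductAttention` and the helper's exact signature are assumptions, not the merged code.

```python
# Hypothetical dispatch sketch: one helper picks the attention module for
# every backend (TE, Kitchen, ...), so TESpecProvider stays unchanged.
class TEDotProductAttention:       # stand-in for the TE fused attention module
    ...

class EagerDotProductAttention:    # stand-in for the eager fallback module
    ...

def get_attention_module_spec_for_backend(fallback_to_eager_attn: bool):
    # Mirrors `module = TEFusedMLP if use_te_op_fuser else MLP`.
    return EagerDotProductAttention if fallback_to_eager_attn else TEDotProductAttention

assert get_attention_module_spec_for_backend(True) is EagerDotProductAttention
assert get_attention_module_spec_for_backend(False) is TEDotProductAttention
```

Centralizing the choice here means a future backend inherits the eager fallback for free instead of each spec provider growing its own flag.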

f"the number of layers ({self.num_layers})"
)

if self.fallback_to_eager_attn:
Contributor

@hxbai hxbai Nov 4, 2025

Please also check --cp-comm-type to match the implementation if CP is enabled.
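The requested argument check could be sketched as a fail-fast validation: when eager attention is combined with CP, reject a `--cp-comm-type` that does not match the all-gather implementation. The flag names follow the PR discussion, but the accepted value and function name are assumptions for illustration.

```python
# Sketch of the review request: validate --cp-comm-type against the eager
# CP implementation before training starts. "all_gather" is an assumed value.
def validate_cp_args(fallback_to_eager_attn: bool,
                     context_parallel_size: int,
                     cp_comm_type: str) -> None:
    if fallback_to_eager_attn and context_parallel_size > 1:
        # The eager CP path gathers K/V, so only an all-gather comm type fits.
        if cp_comm_type != "all_gather":
            raise ValueError(
                f"--fallback-to-eager-attn with context parallelism requires "
                f"--cp-comm-type all_gather, got {cp_comm_type!r}"
            )

validate_cp_args(True, 2, "all_gather")   # OK: matching comm type
validate_cp_args(False, 2, "p2p")         # OK: eager fallback disabled
```

Raising at argument-parsing time surfaces the mismatch immediately instead of letting an incompatible communication pattern fail mid-run.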

return attn_output


def test_eager_attention_function():
Contributor

@hxbai hxbai Nov 4, 2025

Please modify this to a parallel version to test the CP and TPxCP cases.
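A parameterized version of that test could, in spirit, look like the following single-process simulation: sweep several (TP, CP) sizes, shard heads across TP ranks and the query sequence across CP ranks, and compare against the unsharded reference. A real Megatron test would launch actual `torch.distributed` ranks; this NumPy sketch only illustrates the parameterization and the expected invariant.

```python
# Single-process sketch of a (tp, cp)-parameterized eager attention test:
# TP shards attention heads, CP shards the query sequence, K/V stay gathered.
import numpy as np

def reference_attention(q, k, v):
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def run_case(tp: int, cp: int) -> None:
    rng = np.random.default_rng(tp * 10 + cp)
    heads, seq, dim = 4, 8, 4
    q, k, v = (rng.standard_normal((heads, seq, dim)) for _ in range(3))
    ref = np.stack([reference_attention(q[h], k[h], v[h]) for h in range(heads)])
    out = np.empty_like(ref)
    chunk = seq // cp
    for head_block in np.split(np.arange(heads), tp):   # TP: shard heads
        for h in head_block:
            for i, qs in enumerate(np.split(q[h], cp)): # CP: shard queries
                out[h, i * chunk:(i + 1) * chunk] = reference_attention(qs, k[h], v[h])
    assert np.allclose(out, ref)

# The cases a parameterized test would enumerate, including TPxCP.
for tp, cp in [(1, 1), (1, 2), (2, 2), (2, 4)]:
    run_case(tp, cp)
```

In a pytest-based suite, the `(tp, cp)` pairs would become `pytest.mark.parametrize` arguments and each case would spawn the corresponding process grid.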

@yanring yanring changed the title [Dev] feat(moe): Adding context parallel support to eager attention implementation [Community][Dev] feat(moe): Adding context parallel support to eager attention implementation Nov 5, 2025
@Victarry Victarry added the dev branch Dev branch related issues and development label Nov 7, 2025

Labels

community-request, dev branch (Dev branch related issues and development), enhancement (New feature or request)


7 participants