Merged
Conversation
Signed-off-by: Li Tao <lit@nvidia.com>
Signed-off-by: lit <lit@nvidia.com>
yanring approved these changes on Nov 25, 2025
```python
    return rolled_tensor, rolled_tensor.sum()


def _roll_tensor_packed_seq(tensor, shifts, dims, packed_seq_params, cp_group=None):
```
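For context on why this helper exists: with packed sequences, a plain `torch.roll` over the whole buffer leaks tokens across sequence boundaries. A tiny demo (values are illustrative):

```python
import torch

# Two sequences packed together: [1, 2, 3] and [4, 5].
packed = torch.tensor([1, 2, 3, 4, 5])

# A naive global roll shifts everything left by one...
naive = torch.roll(packed, shifts=-1, dims=-1)
print(naive.tolist())  # [2, 3, 4, 5, 1]
# ...so 4 (the start of sequence 2) lands inside sequence 1 as a
# prediction target, and 1 wraps around to the end -- exactly the
# cross-sequence contamination the helper is written to avoid.
```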
Contributor
Good addition! Please add a more detailed docstring explaining the CP data layout assumptions and why sequences must be processed independently. Future maintainers will thank you.
Contributor
```python
"""Roll packed sequences left while respecting sequence boundaries.

In Multi-Token Prediction (MTP), labels and loss_mask must be shifted left by one
position to align prediction targets. For packed sequences, a naive torch.roll would
let the last token of one sequence "wrap around" to the beginning of the next
sequence, creating cross-sequence contamination. This function avoids that by zeroing
out values at each sequence boundary.

Data Layout (CP mode):
    When Context Parallelism (CP) > 1, sequences are split across ranks following
    the get_batch_on_this_cp_rank() pattern. For example, with CP=2 and seq_len=8:

        Original sequence:   [t0, t1, t2, t3, t4, t5, t6, t7]
        Split into 4 chunks: [t0,t1] [t2,t3] [t4,t5] [t6,t7]
                               c0      c1      c2      c3

    Mirrored distribution (for load balancing):
        - Rank 0 holds: [c0, c3] = [t0, t1, t6, t7]
        - Rank 1 holds: [c1, c2] = [t2, t3, t4, t5]

    This distribution balances the causal-attention workload across ranks.

Algorithm:
    1. When CP == 1:
        - Execute torch.roll(shifts=-1) independently for each sequence.
        - Zero out the last position of each sequence (the token that would wrap).
    2. When CP > 1:
        - Split local data into 2 chunks (the mirrored front/back halves).
        - Roll each chunk independently.
        - Exchange boundary tokens via P2P communication:
            * Non-first rank: send the chunk0 boundary to the previous rank; receive
              the chunk1 fill from the previous rank.
            * Non-last rank: receive the chunk0 fill from the next rank; send the
              chunk1 boundary to the next rank.
            * First rank: the chunk1 fill value is set to 0 (sequence start boundary).
            * Last rank: the chunk0 fill value comes from chunk1 (intra-sequence
              continuity).
        - Fill the received values into the appropriate positions.
        - The last rank must zero out the final position of each sequence.

Why sequences must be processed independently:
    A packed sequence contains multiple sequences of varying lengths, each with its
    own semantic boundary. For example, given
    [seq1_tok0, seq1_tok1, seq1_tok2, seq2_tok0, seq2_tok1]:
        - Correct left-shift: [seq1_tok1, seq1_tok2, 0, seq2_tok1, 0]
        - Wrong global shift: [seq1_tok1, seq1_tok2, seq2_tok0, seq2_tok1, seq1_tok0]
    The latter incorrectly makes seq2_tok0 a prediction target for seq1.

Notes:
    - cu_seqlens must contain global (pre-CP-partition) cumulative sequence lengths.
    - The current implementation processes sequences one by one; a future
      optimization could batch the boundary communication across all sequences.

Args:
    tensor (Tensor): Input tensor with shape [..., seq_len] or [batch, seq_len].
    shifts (int): Roll displacement; must be -1 (shift left by one).
    dims (int): Dimension to roll; must be -1 (last dim, i.e., the sequence
        dimension).
    packed_seq_params (PackedSeqParams): Contains the cu_seqlens_q field holding
        cumulative sequence lengths. E.g., cu_seqlens_q = [0, 3, 5] indicates two
        sequences with lengths 3 and 2, respectively.
    cp_group (ProcessGroup, optional): Context Parallel process group.
        When None or of size 1, falls back to single-GPU mode.

Returns:
    tuple[Tensor, Tensor]:
        - rolled_tensor: Rolled tensor with sequence boundaries zeroed out.
        - sum_val: Sum of all elements in rolled_tensor (for loss normalization).

Example:
    >>> # Two sequences, [1, 2, 3] and [4, 5], packed as [1, 2, 3, 4, 5]
    >>> tensor = torch.tensor([1, 2, 3, 4, 5], dtype=torch.float32)
    >>> cu_seqlens = torch.tensor([0, 3, 5], dtype=torch.int32)
    >>> packed_seq_params = PackedSeqParams(cu_seqlens_q=cu_seqlens, ...)
    >>> rolled, _ = _roll_tensor_packed_seq(tensor, -1, -1, packed_seq_params)
    >>> # Result: [2, 3, 0, 5, 0]
    >>> # Seq1: [1, 2, 3] -> [2, 3, 0] (the 3 shifts out; boundary zeroed)
    >>> # Seq2: [4, 5]    -> [5, 0]   (the 5 shifts out; boundary zeroed)
"""
```
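As a companion to the docstring's CP=1 description, here is a minimal runnable sketch of the per-sequence roll. This is illustrative only: the function name and structure are assumed, not the PR's exact implementation.

```python
import torch

def roll_packed_left_cp1(tensor, cu_seqlens):
    """CP=1 sketch: roll the last dim left by 1 within each [start, end) sub-sequence,
    zeroing the position that would otherwise wrap around."""
    rolled = tensor.clone()
    for i in range(len(cu_seqlens) - 1):
        start, end = int(cu_seqlens[i]), int(cu_seqlens[i + 1])
        seq = torch.roll(tensor[..., start:end], shifts=-1, dims=-1)
        seq[..., -1] = 0  # zero the sequence boundary (the token that wrapped)
        rolled[..., start:end] = seq
    return rolled, rolled.sum()

# The docstring's own example: sequences [1, 2, 3] and [4, 5] packed together.
tensor = torch.tensor([1, 2, 3, 4, 5], dtype=torch.float32)
cu_seqlens = torch.tensor([0, 3, 5], dtype=torch.int32)
rolled, total = roll_packed_left_cp1(tensor, cu_seqlens)
print(rolled.tolist())  # [2.0, 3.0, 0.0, 5.0, 0.0]
```

The CP>1 path would additionally split each rank's buffer into its two mirrored chunks and exchange boundary tokens via P2P, as the docstring outlines; that part is omitted here because it needs a live process group.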
yanring reviewed on Nov 25, 2025
```python
    assert (
        dims == -1 or dims == tensor.dim() - 1
    ), "Packed sequence roll only supports the last dimension."
    assert shifts == -1, "Packed sequence roll only supports a single-token left shift."
    cu_seqlens = packed_seq_params.cu_seqlens_q
```
Contributor
The division by cp_size assumes cu_seqlens contains global (pre-CP-partition) sequence lengths. Please add an assertion and/or comment documenting this.
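A hedged sketch of the check the reviewer is asking for (the function and names below are hypothetical, not from the PR): since the code divides cu_seqlens entries by cp_size, each entry must be a global, pre-partition offset, and with the 2-chunk mirrored layout every boundary must divide evenly by 2 * cp_size.

```python
def check_global_cu_seqlens(cu_seqlens, cp_size):
    """Illustrative assertion: cu_seqlens must hold GLOBAL (pre-CP-partition)
    cumulative lengths, each divisible by 2 * cp_size to match the mirrored
    2-chunk-per-rank CP layout."""
    for boundary in cu_seqlens:
        assert boundary % (2 * cp_size) == 0, (
            "cu_seqlens must contain global (pre-CP-partition) cumulative "
            "sequence lengths divisible by 2 * cp_size"
        )

# Global offsets [0, 8, 16] with CP=2 pass (all divisible by 4);
# local, already-partitioned offsets such as [0, 3, 5] would fail.
check_global_cu_seqlens([0, 8, 16], cp_size=2)
```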
What does this PR do ?
Add packed-sequence support to the MTP module.
PR for Main branch: PR2173