
[Dev] Support packed seq in MTP#2043

Merged
yanring merged 11 commits into NVIDIA:dev from BestJuly:lit/mtp_packed_seq_dev
Dec 1, 2025

Conversation

@BestJuly
Contributor

@BestJuly BestJuly commented Oct 30, 2025

What does this PR do ?

Support packed sequence in MTP module.
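For context, a torch-free sketch (plain Python lists; function names are illustrative, not from this PR) of why packed sequences need a boundary-aware left shift rather than a naive global roll:

```python
# Illustrative only: shows why a naive global left-roll corrupts packed
# sequences. The PR operates on torch tensors; plain lists are used here.

def naive_roll_left(tokens):
    # Equivalent of torch.roll(x, shifts=-1): the first element wraps to the end.
    return tokens[1:] + tokens[:1]

def packed_roll_left(tokens, cu_seqlens):
    # Shift each packed sequence independently and zero its last position,
    # so no token leaks across a sequence boundary.
    out = []
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        seq = tokens[start:end]
        out.extend(seq[1:] + [0])
    return out

packed = [1, 2, 3, 4, 5]      # two sequences [1, 2, 3] and [4, 5]
cu_seqlens = [0, 3, 5]

naive_roll_left(packed)               # [2, 3, 4, 5, 1] -- seq2's 4 leaks into seq1
packed_roll_left(packed, cu_seqlens)  # [2, 3, 0, 5, 0] -- boundaries zeroed
```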

PR for Main branch: PR2173

⚠️ For major changes (either in lines of code or in impact), please make sure to first share and discuss a design doc with the team.

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see the Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers' reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

Signed-off-by: Li Tao <lit@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Oct 30, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@BestJuly BestJuly requested a review from Victarry October 30, 2025 06:40
@BestJuly BestJuly self-assigned this Oct 30, 2025
Signed-off-by: Li Tao <lit@nvidia.com>
@BestJuly BestJuly marked this pull request as ready for review October 31, 2025 06:39
@BestJuly BestJuly requested review from a team as code owners October 31, 2025 06:39
@ko3n1g ko3n1g added this to the Core 0.16 milestone Oct 31, 2025
Signed-off-by: Li Tao <lit@nvidia.com>
Signed-off-by: lit <lit@nvidia.com>
Signed-off-by: lit <lit@nvidia.com>
return rolled_tensor, rolled_tensor.sum()


def _roll_tensor_packed_seq(tensor, shifts, dims, packed_seq_params, cp_group=None):
Contributor

Good addition! Please add a more detailed docstring explaining the CP data layout assumptions and why sequences must be processed independently. Future maintainers will thank you.

Contributor
"""Roll packed sequences left while respecting sequence boundaries.

In Multi-Token Prediction (MTP), we need to shift labels and loss_mask left by one
position to align prediction targets. For packed sequences, a naive torch.roll would
cause the last token of one sequence to "wrap around" to the beginning of the next
sequence, creating cross-sequence contamination. This function avoids this by zeroing
out values at each sequence boundary.

Data Layout (CP mode):
    When Context Parallelism (CP) > 1, sequences are split across ranks following
    the get_batch_on_this_cp_rank() pattern. For example, with CP=2 and seq_len=8:

    Original sequence: [t0, t1, t2, t3, t4, t5, t6, t7]
    Split into 4 chunks: [t0,t1] [t2,t3] [t4,t5] [t6,t7]
                           c0      c1      c2      c3

    Mirrored distribution (for load balancing):
      - Rank 0 holds: [c0, c3] = [t0, t1, t6, t7]
      - Rank 1 holds: [c1, c2] = [t2, t3, t4, t5]

    This distribution balances causal attention workload across ranks.

Algorithm:
    1. When CP=1:
       - Execute torch.roll(shifts=-1) independently for each sequence
       - Zero out the last position of each sequence (the token that would wrap)

    2. When CP>1:
       - Split local data into 2 chunks (corresponding to mirrored front/back halves)
       - Roll each chunk independently
       - Exchange boundary tokens via P2P communication:
         * Non-first rank: send chunk0 boundary to prev rank, recv chunk1 fill from prev rank
         * Non-last rank: recv chunk0 fill from next rank, send chunk1 boundary to next rank
         * First rank: chunk1 fill value is set to 0 (sequence start boundary)
         * Last rank: chunk0 fill value comes from chunk1 (intra-sequence continuity)
       - Fill received values into appropriate positions
       - Last rank must zero out the final position of each sequence

Why sequences must be processed independently:
    A packed sequence contains multiple sequences of varying lengths, each with its
    own semantic boundary. For example, [seq1_tok0, seq1_tok1, seq1_tok2, seq2_tok0, seq2_tok1]:
    - Correct left-shift: [seq1_tok1, seq1_tok2, 0, seq2_tok1, 0]
    - Wrong global shift: [seq1_tok1, seq1_tok2, seq2_tok0, seq2_tok1, seq1_tok0]
    The latter incorrectly makes seq2_tok0 a prediction target for seq1.

Notes:
    - cu_seqlens must contain global (pre-CP-partition) cumulative sequence lengths
    - Current implementation processes sequences one-by-one; future optimization could
      batch boundary communication across all sequences

Args:
    tensor (Tensor): Input tensor with shape [..., seq_len] or [batch, seq_len].
    shifts (int): Roll displacement, must be -1 (shift left by one).
    dims (int): Dimension to roll, must be -1 (last dim, i.e., sequence dimension).
    packed_seq_params (PackedSeqParams): Contains cu_seqlens_q field representing
        cumulative sequence lengths. E.g., cu_seqlens_q = [0, 3, 5] indicates two
        sequences with lengths 3 and 2 respectively.
    cp_group (ProcessGroup, optional): Context Parallel process group.
        When None or size=1, falls back to single-GPU mode.

Returns:
    tuple[Tensor, Tensor]:
        - rolled_tensor: Rolled tensor with sequence boundaries zeroed out
        - sum_val: Sum of all elements in rolled_tensor (for loss normalization)

Example:
    >>> # Two sequences: [1,2,3] and [4,5], packed as [1,2,3,4,5]
    >>> tensor = torch.tensor([1, 2, 3, 4, 5], dtype=torch.float32)
    >>> cu_seqlens = torch.tensor([0, 3, 5], dtype=torch.int32)
    >>> packed_seq_params = PackedSeqParams(cu_seqlens_q=cu_seqlens, ...)
    >>> rolled, _ = _roll_tensor_packed_seq(tensor, -1, -1, packed_seq_params)
    >>> # Result: [2, 3, 0, 5, 0]
    >>> # Seq1: [1,2,3] -> [2,3,0]  (3 shifts out, boundary zeroed)
    >>> # Seq2: [4,5] -> [5,0]      (5 shifts out, boundary zeroed)
"""

assert (
    dims == -1 or dims == tensor.dim() - 1
), "Packed sequence roll only supports the last dimension."
assert shifts == -1, "Packed sequence roll only supports a single-token left shift."
cu_seqlens = packed_seq_params.cu_seqlens_q
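The CP=1 branch described in the docstring (step 1 of the algorithm) can be sketched without torch; `roll_packed_cp1` is an illustrative name, and the real implementation works on tensors and also handles the CP>1 P2P exchange:

```python
# Torch-free sketch of the CP=1 path: roll each packed sequence left by one
# and zero the position that would otherwise wrap into the next sequence.

def roll_packed_cp1(values, cu_seqlens):
    rolled = list(values)
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        rolled[start:end - 1] = values[start + 1:end]  # shift left within the sequence
        rolled[end - 1] = 0                            # zero the sequence boundary
    # The real function also returns the sum for loss normalization.
    return rolled, sum(rolled)

rolled, total = roll_packed_cp1([1, 2, 3, 4, 5], [0, 3, 5])
# rolled == [2, 3, 0, 5, 0], total == 10, matching the docstring example
```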
Contributor
The division by cp_size assumes cu_seqlens contains global (pre-CP-partition) sequence lengths. Please add an assertion and/or comment documenting this.
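One way to encode that assumption, sketched here with plain Python (the helper name is hypothetical; the real code would assert on the torch tensor, and the exact divisibility requirement may be stricter, e.g. 2 * cp_size for the mirrored chunking):

```python
# Sketch of the suggested guard: cu_seqlens must hold global
# (pre-CP-partition) cumulative lengths, so every per-sequence length
# must divide evenly by cp_size before the per-rank division.

def assert_global_cu_seqlens(cu_seqlens, cp_size):
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        length = end - start
        assert length % cp_size == 0, (
            f"sequence length {length} is not divisible by cp_size={cp_size}; "
            "cu_seqlens must contain global (pre-CP-partition) lengths"
        )

assert_global_cu_seqlens([0, 4, 8], cp_size=2)  # passes
# assert_global_cu_seqlens([0, 3, 8], cp_size=2) would raise AssertionError
```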


Labels

dev branch — Dev branch related issues and development
Expert Review — Apply this label to indicate that your PR is ready for expert review.
module: moe

4 participants