Merged
Conversation
Signed-off-by: Li Tao <lit@nvidia.com>
Signed-off-by: lit <lit@nvidia.com>
yanring approved these changes on Nov 25, 2025
```python
    return rolled_tensor, rolled_tensor.sum()


def _roll_tensor_packed_seq(tensor, shifts, dims, packed_seq_params, cp_group=None):
```
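For context on why this helper exists: with packed sequences, a plain `torch.roll` over the whole buffer leaks tokens across sequence boundaries. A tiny demo (values are illustrative):

```python
import torch

# Two sequences packed together: [1, 2, 3] and [4, 5].
packed = torch.tensor([1, 2, 3, 4, 5])

# A naive global roll shifts everything left by one...
naive = torch.roll(packed, shifts=-1, dims=-1)
print(naive.tolist())  # [2, 3, 4, 5, 1]
# ...so 4 (the start of sequence 2) lands inside sequence 1 as a
# prediction target, and 1 wraps around to the end -- exactly the
# cross-sequence contamination the helper is written to avoid.
```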
Contributor
Good addition! Please add a more detailed docstring explaining the CP data layout assumptions and why sequences must be processed independently. Future maintainers will thank you.
Contributor
```python
"""Roll packed sequences left while respecting sequence boundaries.

In Multi-Token Prediction (MTP), labels and loss_mask must be shifted left by one
position to align prediction targets. For packed sequences, a naive torch.roll would
let the last token of one sequence "wrap around" to the beginning of the next
sequence, creating cross-sequence contamination. This function avoids that by zeroing
out values at each sequence boundary.

Data Layout (CP mode):
    When Context Parallelism (CP) > 1, sequences are split across ranks following
    the get_batch_on_this_cp_rank() pattern. For example, with CP=2 and seq_len=8:

        Original sequence:   [t0, t1, t2, t3, t4, t5, t6, t7]
        Split into 4 chunks: [t0,t1] [t2,t3] [t4,t5] [t6,t7]
                               c0      c1      c2      c3

    Mirrored distribution (for load balancing):
        - Rank 0 holds: [c0, c3] = [t0, t1, t6, t7]
        - Rank 1 holds: [c1, c2] = [t2, t3, t4, t5]

    This distribution balances the causal-attention workload across ranks.

Algorithm:
    1. When CP == 1:
        - Execute torch.roll(shifts=-1) independently for each sequence.
        - Zero out the last position of each sequence (the token that would wrap).
    2. When CP > 1:
        - Split local data into 2 chunks (the mirrored front/back halves).
        - Roll each chunk independently.
        - Exchange boundary tokens via P2P communication:
            * Non-first rank: send the chunk0 boundary to the previous rank; receive
              the chunk1 fill from the previous rank.
            * Non-last rank: receive the chunk0 fill from the next rank; send the
              chunk1 boundary to the next rank.
            * First rank: the chunk1 fill value is set to 0 (sequence start boundary).
            * Last rank: the chunk0 fill value comes from chunk1 (intra-sequence
              continuity).
        - Fill the received values into the appropriate positions.
        - The last rank must zero out the final position of each sequence.

Why sequences must be processed independently:
    A packed sequence contains multiple sequences of varying lengths, each with its
    own semantic boundary. For example, given
    [seq1_tok0, seq1_tok1, seq1_tok2, seq2_tok0, seq2_tok1]:
        - Correct left-shift: [seq1_tok1, seq1_tok2, 0, seq2_tok1, 0]
        - Wrong global shift: [seq1_tok1, seq1_tok2, seq2_tok0, seq2_tok1, seq1_tok0]
    The latter incorrectly makes seq2_tok0 a prediction target for seq1.

Notes:
    - cu_seqlens must contain global (pre-CP-partition) cumulative sequence lengths.
    - The current implementation processes sequences one by one; a future
      optimization could batch the boundary communication across all sequences.

Args:
    tensor (Tensor): Input tensor with shape [..., seq_len] or [batch, seq_len].
    shifts (int): Roll displacement; must be -1 (shift left by one).
    dims (int): Dimension to roll; must be -1 (last dim, i.e., the sequence
        dimension).
    packed_seq_params (PackedSeqParams): Contains the cu_seqlens_q field holding
        cumulative sequence lengths. E.g., cu_seqlens_q = [0, 3, 5] indicates two
        sequences with lengths 3 and 2, respectively.
    cp_group (ProcessGroup, optional): Context Parallel process group.
        When None or of size 1, falls back to single-GPU mode.

Returns:
    tuple[Tensor, Tensor]:
        - rolled_tensor: Rolled tensor with sequence boundaries zeroed out.
        - sum_val: Sum of all elements in rolled_tensor (for loss normalization).

Example:
    >>> # Two sequences, [1, 2, 3] and [4, 5], packed as [1, 2, 3, 4, 5]
    >>> tensor = torch.tensor([1, 2, 3, 4, 5], dtype=torch.float32)
    >>> cu_seqlens = torch.tensor([0, 3, 5], dtype=torch.int32)
    >>> packed_seq_params = PackedSeqParams(cu_seqlens_q=cu_seqlens, ...)
    >>> rolled, _ = _roll_tensor_packed_seq(tensor, -1, -1, packed_seq_params)
    >>> # Result: [2, 3, 0, 5, 0]
    >>> # Seq1: [1, 2, 3] -> [2, 3, 0] (the 3 shifts out; boundary zeroed)
    >>> # Seq2: [4, 5]    -> [5, 0]   (the 5 shifts out; boundary zeroed)
"""
```
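As a companion to the docstring's CP=1 description, here is a minimal runnable sketch of the per-sequence roll. This is illustrative only: the function name and structure are assumed, not the PR's exact implementation.

```python
import torch

def roll_packed_left_cp1(tensor, cu_seqlens):
    """CP=1 sketch: roll the last dim left by 1 within each [start, end) sub-sequence,
    zeroing the position that would otherwise wrap around."""
    rolled = tensor.clone()
    for i in range(len(cu_seqlens) - 1):
        start, end = int(cu_seqlens[i]), int(cu_seqlens[i + 1])
        seq = torch.roll(tensor[..., start:end], shifts=-1, dims=-1)
        seq[..., -1] = 0  # zero the sequence boundary (the token that wrapped)
        rolled[..., start:end] = seq
    return rolled, rolled.sum()

# The docstring's own example: sequences [1, 2, 3] and [4, 5] packed together.
tensor = torch.tensor([1, 2, 3, 4, 5], dtype=torch.float32)
cu_seqlens = torch.tensor([0, 3, 5], dtype=torch.int32)
rolled, total = roll_packed_left_cp1(tensor, cu_seqlens)
print(rolled.tolist())  # [2.0, 3.0, 0.0, 5.0, 0.0]
```

The CP>1 path would additionally split each rank's buffer into its two mirrored chunks and exchange boundary tokens via P2P, as the docstring outlines; that part is omitted here because it needs a live process group.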
yanring reviewed on Nov 25, 2025
```python
    assert (
        dims == -1 or dims == tensor.dim() - 1
    ), "Packed sequence roll only supports the last dimension."
    assert shifts == -1, "Packed sequence roll only supports a single-token left shift."
    cu_seqlens = packed_seq_params.cu_seqlens_q
```
Contributor
The division by cp_size assumes cu_seqlens contains global (pre-CP-partition) sequence lengths. Please add an assertion and/or comment documenting this.
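A hedged sketch of the check the reviewer is asking for (the function and names below are hypothetical, not from the PR): since the code divides cu_seqlens entries by cp_size, each entry must be a global, pre-partition offset, and with the 2-chunk mirrored layout every boundary must divide evenly by 2 * cp_size.

```python
def check_global_cu_seqlens(cu_seqlens, cp_size):
    """Illustrative assertion: cu_seqlens must hold GLOBAL (pre-CP-partition)
    cumulative lengths, each divisible by 2 * cp_size to match the mirrored
    2-chunk-per-rank CP layout."""
    for boundary in cu_seqlens:
        assert boundary % (2 * cp_size) == 0, (
            "cu_seqlens must contain global (pre-CP-partition) cumulative "
            "sequence lengths divisible by 2 * cp_size"
        )

# Global offsets [0, 8, 16] with CP=2 pass (all divisible by 4);
# local, already-partitioned offsets such as [0, 3, 5] would fail.
check_global_cu_seqlens([0, 8, 16], cp_size=2)
```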
What does this PR do ?
Add packed-sequence support to the MTP module.
PR for Main branch: PR2173