
Support multimodule pipelining in 1F1B schedule #3129

Open

yashaswikarnati wants to merge 17 commits into NVIDIA:main from yashaswikarnati:yash/1f1b_changes

Conversation

@yashaswikarnati
Contributor

@yashaswikarnati yashaswikarnati commented Jan 28, 2026

Summary

Adds support for multi-module pipeline parallelism (encoder + LLM) in the 1F1B schedule.

Changes:

  • Add MultiModuleProcessGroupCollection for managing process groups across modules
  • Support dict-based tensor format {module_name: tensor} in forward/backward
  • Handle 2D/3D tensor conversion for P2P and bridge communication
  • Add backward_step_multimodule to handle backward for multimodule cases
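
The dict-based activation format can be illustrated with a minimal sketch; the helper name `forward_step_multimodule` and the module keys `"encoder"`/`"llm"` are hypothetical stand-ins, not the PR's actual API:

```python
def forward_step_multimodule(modules, input_tensors):
    """Run one forward micro-step where activations are keyed by module name.

    `modules` and `input_tensors` are dicts {module_name: callable/tensor};
    plain callables stand in here for the encoder and LLM pipeline modules.
    """
    return {name: fn(input_tensors[name]) for name, fn in modules.items()}


# Toy usage: two callables play the role of the encoder and LLM stages.
modules = {"encoder": lambda x: x * 2, "llm": lambda x: x + 1}
inputs = {"encoder": 3, "llm": 10}
outputs = forward_step_multimodule(modules, inputs)
# outputs == {"encoder": 6, "llm": 11}
```

The point of keying by module name is that the schedule can route each module's activations and gradients through its own process group without changing the shape of the 1F1B loop.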

⚠️ For major changes (either in lines of code or in impact), please make sure to first share a design doc with the team. If you're unsure of the best way to do so, contact the @mcore-oncall.

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

- Rename ProcessGroupCollectionWrapper to MultiModuleProcessGroupCollection
- Rename language_model field to language_model_module_name for clarity
- Add language_model_module_name param to backward_step_multimodule
- Use functools.partial to bind param, keeping signature consistent
- Add type hints to _ensure_3d_tensor and _restore_tensor_shape
- Move is_multimodule check earlier for validation and backward selection
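
The `functools.partial` change in the commit list above can be sketched as follows; the signature of `backward_step_multimodule` shown here is simplified and hypothetical, only the binding pattern is the point:

```python
from functools import partial


def backward_step_multimodule(input_tensor, output_tensor, output_tensor_grad,
                              language_model_module_name):
    """Hypothetical shape of the multimodule backward step: the extra
    module-name parameter identifies which dict entry is the LLM."""
    return f"backward through {language_model_module_name}"


# Bind the extra argument up front so the resulting callable matches the
# 3-argument backward_step signature the 1F1B schedule already expects.
backward_step = partial(backward_step_multimodule,
                        language_model_module_name="llm")
result = backward_step(None, None, None)
```

Binding the module name with `partial` keeps the schedule's call sites unchanged: they invoke `backward_step(...)` the same way whether or not the multimodule path is active.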
@yashaswikarnati yashaswikarnati requested review from a team as code owners January 28, 2026 22:53
@copy-pr-bot

copy-pr-bot bot commented Jan 28, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ko3n1g ko3n1g requested a review from a team January 28, 2026 22:54
@dimapihtar dimapihtar added the complexity: high and Expert Review labels Jan 29, 2026
@dimapihtar
Contributor

/ok to test 2d7c176

Returns:
3D tensor (with singleton last dim if input was 2D), list of 3D tensors, or None.
"""
if tensor is None:
Contributor

Can you assert-fail if a 3D tensor is passed in and its last dim size != 1?

Contributor Author

I think for a 3D tensor it's a no-op; any 3D tensor is fine (the last dim can be 1 and no assert is needed). I'll make that clear in the docstring, thanks.

Contributor

If the original tensor is 3D and its last dim == 1, will _restore_tensor_from_comm make it 2D?
I.e., dims in = [a, b, 1], dims out = [a, b]?

Contributor Author

Good call-out! I added an assert in the comm-preparation step to prevent this: ee189df
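
The shape round-trip under discussion can be sketched at the shape level; the helper names here are hypothetical simplifications of `_ensure_3d_tensor` / `_restore_tensor_shape`, and the assert mirrors the ambiguity the reviewers identified:

```python
def ensure_3d_shape(shape):
    """2D shapes gain a singleton last dim for P2P/bridge communication;
    3D shapes pass through unchanged. A genuine 3D tensor ending in 1
    would be indistinguishable from a promoted 2D one after the round
    trip, so that case is asserted away before communication."""
    if len(shape) == 2:
        return shape + (1,)  # promoted: caller remembers it was 2D
    if len(shape) == 3:
        assert shape[-1] != 1, "ambiguous: 3D tensor with singleton last dim"
        return shape
    raise ValueError(f"expected 2D or 3D, got {len(shape)}D")


def restore_shape(shape, was_2d):
    # Inverse applied after communication: drop the singleton dim only
    # when the tensor was originally 2D.
    return shape[:-1] if was_2d else shape


promoted = ensure_3d_shape((8, 512))      # (8, 512, 1)
restored = restore_shape(promoted, was_2d=True)  # (8, 512)
```

Without the assert, `restore_shape` would silently turn a real `[a, b, 1]` tensor into `[a, b]`, which is exactly the failure mode raised in the thread above.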

input_tensors.append(input_tensor)
output_tensors.append(output_tensor)
deallocate_output_tensor(output_tensor[0], config.deallocate_pipeline_outputs)
deallocate_output_tensor(output_tensor, config.deallocate_pipeline_outputs)
Contributor

Why is the [0] removed here?

Contributor Author

To handle individual tensors, lists, and dicts consistently, those checks now live inside deallocate_output_tensor, so passing just the first element is no longer needed.
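
The type dispatch described in the reply above can be sketched as follows; `deallocate_output_tensor_any` and `FakeTensor` are illustrative names, and setting a flag stands in for shrinking the tensor's storage as Megatron's real deallocation does:

```python
def deallocate_output_tensor_any(output, deallocate):
    """Accept a single tensor-like object, a list of them, or a
    {module_name: tensor} dict, recursing into containers so call
    sites no longer need to index output_tensor[0]."""
    if output is None or not deallocate:
        return
    if isinstance(output, dict):
        for t in output.values():
            deallocate_output_tensor_any(t, deallocate)
    elif isinstance(output, list):
        for t in output:
            deallocate_output_tensor_any(t, deallocate)
    else:
        output.freed = True  # stand-in for freeing the tensor's storage


class FakeTensor:
    freed = False


outs = {"encoder": FakeTensor(), "llm": [FakeTensor(), FakeTensor()]}
deallocate_output_tensor_any(outs, deallocate=True)
```

After the call, every leaf tensor in the dict (including those nested in lists) is marked freed, which is why the `[0]` indexing at the call site became unnecessary.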

@dimapihtar dimapihtar requested a review from erhoo82 February 4, 2026 15:10
@shifangx
Contributor

shifangx commented Feb 14, 2026

Hi, @yaoyu-33, @yashaswikarnati, could you help increase the priority of this PR?
As far as I know, #3129 is the last functionality PR for M4, and it is fundamental for DistTrain.

There are also other leftover PRs for M4, but #3129 is the most important one.

@shifangx
Contributor

/ok to test 597862e

return block


def create_module_with_grid(tp, pp, dp, grid_offset, hidden_size):
Contributor

Maybe create_module_and_grid is more appropriate, because this function creates the grid; it does not use a grid to create the model.

Signed-off-by: ykarnati <ykarnati@nvidia.com>
@yashaswikarnati
Contributor Author

/ok to test 5f941d1

@shifangx
Contributor

shifangx commented Feb 23, 2026

Hi, @yaoyu-33, @yashaswikarnati, could you help increase the priority of this PR? As far as I know, #3129 is the last functionality PR for M4, and it is fundamental for DistTrain.

There are also other leftover PRs for M4, but #3129 is the most important one.

Hi, @yashaswikarnati, what is the next step for this PR?
@NVIDIA/core-adlr, @NVIDIA/pipeline-parallelism, @NVIDIA/mcore-oncall, @erhoo82, can you help review this PR?

@shifangx
Contributor

/ok to test 908ea5f

@shifangx
Contributor

Hi, @yashaswikarnati, can you help address the CI test issues?
Some test cases are failing.


# Apply grad scaling if needed (for last stage only)
for module_name in output_tensor.keys():
if output_tensor_grad[module_name] is None and config.grad_scale_func is not None:
Contributor

Just out of curiosity, why is the scaling only applied when the gradient is None?

Contributor

Doesn't this break when using gradient accumulation?
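
The pattern in the snippet under review can be sketched as follows; the function name and simplified signature are hypothetical, and the reading it illustrates is an assumption from the surrounding context: `output_tensor_grad[name] is None` marks the last pipeline stage, where `output_tensor` is the loss and `grad_scale_func` (e.g. fp16 loss scaling) is applied before backward, since earlier stages receive already-scaled gradients from downstream:

```python
def scale_for_backward(output_tensor, output_tensor_grad, grad_scale_func):
    """Apply grad scaling per module, but only where no incoming gradient
    exists (i.e. the module's output is a loss at the last stage)."""
    scaled = {}
    for name, out in output_tensor.items():
        if output_tensor_grad.get(name) is None and grad_scale_func is not None:
            scaled[name] = grad_scale_func(out)  # scale the loss itself
        else:
            scaled[name] = out  # gradient already scaled upstream
    return scaled


# Toy usage: the "llm" module sits on the last stage (no incoming grad),
# so its loss value is multiplied by a scale factor of 128.
scaled = scale_for_backward(
    {"llm": 2.0}, {"llm": None}, grad_scale_func=lambda x: x * 128
)
```

Under that reading, scaling only on `None` gradients is deliberate rather than a gradient-accumulation bug, but the reviewers' questions above remain the authoritative thread for confirming it.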

@yashaswikarnati
Contributor Author

/ok to test ee189df

@yashaswikarnati
Contributor Author

Hi, @yashaswikarnati, can you help to address CI test issue? Some test cases failed.

The test failures look unrelated and all tests pass locally; retriggered the CI.
