Support multimodule pipelining in 1F1B schedule #3129

yashaswikarnati wants to merge 17 commits into NVIDIA:main from
Conversation
- Rename `ProcessGroupCollectionWrapper` to `MultiModuleProcessGroupCollection`
- Rename `language_model` field to `language_model_module_name` for clarity
- Add `language_model_module_name` param to `backward_step_multimodule`
- Use `functools.partial` to bind the param, keeping the signature consistent
- Add type hints to `_ensure_3d_tensor` and `_restore_tensor_shape`
- Move the `is_multimodule` check earlier for validation and backward selection
/ok to test 2d7c176
```python
    Returns:
        3D tensor (with singleton last dim if input was 2D), list of 3D tensors, or None.
    """
    if tensor is None:
```
Can you assert/fail if a 3D tensor is passed in and its last dim size != 1?
I think for a 3D tensor it's a no-op. Any 3D tensor is fine (the last dim can be 1 and no assert is needed). Will make it clear in the docstring. thx
If the original tensor is 3D and its last dim == 1, won't `_restore_tensor_from_comm` make it 2D? dim in = [a, b, 1], dim out would be [a, b]?
Good callout! I added an assert in the comm prep to prevent this: ee189df
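The round-trip the thread is worried about can be sketched with plain shape tuples (hypothetical helper names standing in for `_ensure_3d_tensor`/`_restore_tensor_shape`, which operate on real tensors): the restore step must only drop a singleton last dim that the prepare step itself added, never one that was already part of a genuinely 3D tensor.

```python
def ensure_3d_shape(shape):
    # Hypothetical sketch: pad a 2-D shape to 3-D for P2P comm and
    # remember whether we added the singleton dim; 3-D passes through.
    if shape is None:
        return None, False
    if len(shape) == 2:
        return shape + (1,), True
    return shape, False

def restore_shape(shape, was_2d):
    # Only drop the singleton dim if *we* added it. A genuinely 3-D
    # tensor whose last dim happens to be 1 must not collapse to 2-D.
    if shape is None:
        return None
    if was_2d:
        assert shape[-1] == 1, "padded dim was altered during comm"
        return shape[:-1]
    return shape

s, added = ensure_3d_shape((4, 8))
assert restore_shape(s, added) == (4, 8)        # 2-D round-trips to 2-D

s, added = ensure_3d_shape((4, 8, 1))
assert restore_shape(s, added) == (4, 8, 1)     # 3-D with last dim 1 is preserved
```

Tracking the "was 2D" flag (or an equivalent assert at comm prep, as the fix in ee189df does) is what prevents the [a, b, 1] → [a, b] collapse the reviewer describes.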
```diff
     input_tensors.append(input_tensor)
     output_tensors.append(output_tensor)
-    deallocate_output_tensor(output_tensor[0], config.deallocate_pipeline_outputs)
+    deallocate_output_tensor(output_tensor, config.deallocate_pipeline_outputs)
```
To consistently handle individual tensors, lists, or dicts we have these checks inside `deallocate_output_tensor`; passing just the first element is indeed not needed.
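A minimal sketch of that dispatch (hypothetical names, not the actual Megatron implementation; `release` stands in for the real storage-resize trick), showing why call sites can pass the container directly instead of `output_tensor[0]`:

```python
released = []

def release(t):
    # Stand-in for the real per-tensor deallocation (storage resize to 0).
    released.append(t)

def deallocate_outputs(out, deallocate_pipeline_outputs):
    # Walk a single tensor, a list of tensors, or a {module_name: tensor}
    # dict, releasing each leaf, so callers never special-case the shape.
    if not deallocate_pipeline_outputs or out is None:
        return
    if isinstance(out, dict):              # multimodule: {module_name: tensor}
        for t in out.values():
            deallocate_outputs(t, deallocate_pipeline_outputs)
    elif isinstance(out, (list, tuple)):   # list of per-chunk tensors
        for t in out:
            deallocate_outputs(t, deallocate_pipeline_outputs)
    else:                                  # a single tensor
        release(out)

deallocate_outputs({"encoder": "enc_out", "language_model": ["t1", "t2"]}, True)
assert released == ["enc_out", "t1", "t2"]
```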
Hi, @yaoyu-33, @yashaswikarnati, could you help increase the priority of this PR? There are also other leftover PRs for M4, but #3129 is the most important one.
/ok to test 597862e
```python
    return block


def create_module_with_grid(tp, pp, dp, grid_offset, hidden_size):
```
Maybe `create_module_and_grid` is more appropriate, because this function creates a grid rather than using a grid to create the model.
Signed-off-by: ykarnati <ykarnati@nvidia.com>
/ok to test 5f941d1
Hi, @yashaswikarnati, what is the next step for this PR?
/ok to test 908ea5f
Hi, @yashaswikarnati, can you help address the CI test issues?
```python
    # Apply grad scaling if needed (for last stage only)
    for module_name in output_tensor.keys():
        if output_tensor_grad[module_name] is None and config.grad_scale_func is not None:
```
Just out of curiosity, why is the scaling only applied when the gradient is None?
Doesn't this break when using gradient accumulation?
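One plausible reading of that condition, sketched below with hypothetical names (this is an illustration of the common pattern, not a confirmed description of the PR's code): the incoming gradient is `None` only on the last pipeline stage, where the module's output is the loss itself, so the scale function is applied once per microbatch to the backward seed rather than to already-accumulated gradients.

```python
def backward_seed(output, incoming_grad, grad_scale_func=None):
    # Hypothetical sketch: pick the tensor to seed backward with.
    # incoming_grad is None only on the last stage, where `output` is
    # the loss; scaling there applies the loss scale exactly once per
    # microbatch, so gradient accumulation across microbatches still
    # sums consistently scaled gradients.
    if incoming_grad is None:
        loss = output
        if grad_scale_func is not None:
            loss = grad_scale_func(loss)
        return loss           # real code would then run backward from this
    return incoming_grad      # intermediate stage: use the received grad as-is

scale = lambda loss: loss * 1024.0   # stand-in loss-scale function
assert backward_seed(2.0, None, scale) == 2048.0   # last stage: loss is scaled
assert backward_seed(2.0, 0.5, scale) == 0.5       # mid stage: grad passed through
```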
/ok to test ee189df
test failures look unrelated and all tests pass locally, retriggered again |
Summary
Adds support for multi-module pipeline parallelism (encoder + LLM) in the 1F1B schedule.
Changes:
- `MultiModuleProcessGroupCollection` for managing process groups across modules
- `{module_name: tensor}` in forward/backward
- `backward_step_multimodule` to handle backward for multimodule cases
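The `{module_name: tensor}` convention can be illustrated with a toy forward step (hypothetical function and module callables, not the PR's actual code): each schedule step carries a dict keyed by module name, e.g. `"encoder"` and `"language_model"`, instead of a single tensor, so one 1F1B step can drive several modules.

```python
def forward_step_multimodule(modules, inputs):
    # Hypothetical sketch: run each named module on its own input slot
    # and return outputs under the same {module_name: tensor} keys.
    return {name: fn(inputs.get(name)) for name, fn in modules.items()}

# Toy stand-ins for the encoder and LLM forward functions.
modules = {
    "encoder": lambda x: x + 1,
    "language_model": lambda x: x * 2,
}

out = forward_step_multimodule(modules, {"encoder": 3, "language_model": 5})
assert out == {"encoder": 4, "language_model": 10}
```

Keeping the dict keys stable across forward and backward is what lets the backward pass look up the matching gradient per module, as `backward_step_multimodule` does.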