Support multimodule pipelining in 1F1B schedule #3129
Merged
Commits
24 commits
c601de4  add pp stage checkers to p2p communicator (yashaswikarnati)
84ae4f0  add process group collection wrapper (yashaswikarnati)
0fa3dd8  support multimodule pipelining in 1f1b schedule (yashaswikarnati)
b22f638  fix dim mapping in torch cat bridge comm (yashaswikarnati)
3badf57  handle 3d 2d tensor conversion in multimodule comm (yashaswikarnati)
20d03f5  add unit tests for multimodule pipeline schedules (yashaswikarnati)
a6606d8  refactor multimodule pg collection and backward step (yashaswikarnati)
b102eb7  rename module_collections to module_pgs for clarity (yashaswikarnati)
ebbb509  rename tensor conversion functions for clarity (yashaswikarnati)
2d7c176  Merge branch 'main' into yash/1f1b_changes (dimapihtar)
0b6cefd  Fix linting issues: format code and remove unused imports (yashaswikarnati)
597862e  Merge branch 'main' into yash/1f1b_changes (shifangx)
5f941d1  test: fix isort formatting in multimodule schedule test (yashaswikarnati)
b1db431  handle encoder only ranks (yashaswikarnati)
5846567  cache PGs across bridge communicators (yashaswikarnati)
908ea5f  Merge branch 'main' into yash/1f1b_changes (shifangx)
ee189df  Guard ambiguous multimodule comm tensor shape (yashaswikarnati)
81cf623  move backward_step_dict to schedules.py (yashaswikarnati)
6542743  Merge branch 'main' into yash/1f1b_changes (shifangx)
4f01712  Refactor: expose total_stages/current_stage on communicators (yashaswikarnati)
738db94  Merge remote-tracking branch 'upstream/main' into yash/1f1b_changes (yashaswikarnati)
edc8159  Fix test isolation: destroy leaked NCCL process groups in multimodule… (yashaswikarnati)
92b65d1  Remove redundant pg_collection asserts from schedules.py (yashaswikarnati)
78ee58c  Add missing copyright header to test_bridge_communicator.py (yashaswikarnati)