[Dev] fix(moe): Support HybridEP and reduce memory overhead for 1F1B A2A overlap #2201
yanring merged 47 commits into NVIDIA:dev from
Conversation
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Thanks for the PR. Please mark the title with [Dev] fix(moe): xxx and label this PR with
/ok to test 32fc988

/ok to test 776d224
/ok to test 487eea9

/ok to test c568c37
@lhb8125 could you fix the API checks?

/ok to test 36648e3
```python
if g is not None:
    g.record_stream(self.stream)
    if not self.delay_grads_release:
        g.untyped_storage().resize_(0)
```
Could you add some explanation here?
Fixed in lhb8125#50, @lhb8125 can you help take a look~
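The reviewed snippet frees gradient storage eagerly rather than waiting for Python garbage collection. A minimal sketch of that pattern is below; the `release_grad_storage` helper and its parameter names are hypothetical, and only the `record_stream` / `untyped_storage().resize_(0)` calls come from the quoted diff:

```python
import torch

def release_grad_storage(g, stream=None, delay_grads_release=False):
    """Hypothetical helper mirroring the reviewed snippet."""
    if g is None:
        return
    if stream is not None and g.is_cuda:
        # Record the tensor on the communication stream so the caching
        # allocator does not hand its memory to another allocation while
        # that stream may still be reading it.
        g.record_stream(stream)
    if not delay_grads_release:
        # Shrinking the untyped storage to zero bytes releases the memory
        # immediately instead of waiting for the tensor object to die.
        g.untyped_storage().resize_(0)

g = torch.randn(1024)
release_grad_storage(g)
print(g.untyped_storage().size())  # → 0
```

On CPU tensors the `record_stream` branch is skipped, which is why the storage release itself can be exercised without a GPU.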
```python
"""Delay the weight gradient computation to improve batch-level communication overlapping"""
```
```python
ep_overlap_early_attn_memory_release: bool = False
"""Release the memory of the attention module early in EP overlap. Note this flag has
```
This description is a bit vague: when exactly should users enable or disable this feature? Also, the connection to `overlap_moe_expert_parallel_comm` isn't clear here, which will likely confuse users.
Fixed in lhb8125#50, @lhb8125 can you help take a look~
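One way to make the relationship between these flags explicit is to validate it at construction time. The sketch below is hypothetical: the field names follow the quoted diff, but the dependency between `ep_overlap_early_attn_memory_release` and `overlap_moe_expert_parallel_comm` is an assumption drawn from this review thread, not from the actual Megatron config:

```python
from dataclasses import dataclass

@dataclass
class MoEOverlapConfig:
    """Hypothetical sketch of the overlap-related flags discussed above."""

    overlap_moe_expert_parallel_comm: bool = False
    """Enable 1F1B A2A expert-parallel communication overlap."""

    delay_wgrad_compute: bool = False
    """Delay the weight gradient computation to improve batch-level
    communication overlapping."""

    ep_overlap_early_attn_memory_release: bool = False
    """Release the memory of the attention module early in EP overlap.
    Assumed to be meaningful only when overlap_moe_expert_parallel_comm
    is enabled."""

    def __post_init__(self):
        # Fail fast on the assumed dependency instead of silently ignoring
        # the flag, so users learn when the option actually applies.
        if (self.ep_overlap_early_attn_memory_release
                and not self.overlap_moe_expert_parallel_comm):
            raise ValueError(
                "ep_overlap_early_attn_memory_release requires "
                "overlap_moe_expert_parallel_comm=True"
            )
```

Validation like this also doubles as documentation: the error message answers the reviewer's question about when the flag should be enabled.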
/ok to test 0708cc1

/ok to test 2cfaec1

/ok to test 0f8663b
fix comments of dev 2201
/ok to test 97de523

/ok to test 12a2a22
What does this PR do?

PR for main:

- `enable_deepep` with `use_flex_dispatcher`, so that DeepEP and HybridEP will be treated in the same way in 1F1B A2A overlap.

Contribution process
```mermaid
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]
```

Pre-checks

Core 0.8)

Code review

The following process is enforced via the CODEOWNERS file for changes into `megatron/core`. For changes outside of `megatron/core`, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

(Step 1): Add PR label `Expert Review`

(Step 2): Collect the expert reviewers' reviews

Add the `Expert Review` label when your PR is ready for review. Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

Add the `Final Review` label.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into `core_r*` release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.