
[Dev] feat(moe): Fine-grained activation offloading#1912

Merged
lhb8125 merged 4 commits into NVIDIA:dev from lhb8125:hongbinl/activation_offloading_github
Oct 27, 2025

Conversation

@lhb8125 (Contributor) commented Oct 24, 2025

What does this PR do?

PR for main branch

Memory capacity is increasingly important with the rise of extremely sparse MoE models such as DeepSeek-V3 and Qwen3-235B. Fine-grained recomputation reduces the memory footprint at the cost of extra recomputation, while offloading can exploit host-device bandwidth to achieve nearly zero overhead.

The current CPU offloading strategy in TE is layer-level: it offloads activations at the granularity of a whole transformer layer, which is too coarse to single out the most prominent activations.

Fine-grained activation offloading instead targets activations at the granularity of specific modules, so that the amount of offloaded activation can be calibrated to maximize training throughput.

Design Doc

Compared with the current CPU offloading strategy provided by TE, this PR has several advantages:

  • Supports PP=1/PP/VPP;
  • Supports MoE models;
  • Allows manually specifying which large-memory-footprint modules to offload;
  • Works with fine-grained recomputation to reduce the total activation memory as much as possible.
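The module-granularity idea can be sketched with PyTorch's saved-tensor hooks. This is a hypothetical illustration only, not the PR's actual implementation (which integrates with Megatron-Core/TE internals); the `fc1` module and tensor shapes are made up for the example:

```python
import torch

class OffloadActivations(torch.autograd.graph.saved_tensors_hooks):
    """Illustrative sketch: copy activations saved for backward to host
    memory during forward, and reload them to their original device when
    backward needs them. Wrapping only a chosen module gives the
    module-level (rather than layer-level) granularity described above."""
    def __init__(self):
        def pack(t: torch.Tensor):
            # Remember the source device so the tensor can be reloaded to it.
            return (t.device, t.to("cpu", non_blocking=True))

        def unpack(packed):
            device, t = packed
            return t.to(device, non_blocking=True)

        super().__init__(pack, unpack)

# Wrap only a memory-heavy module (e.g. an expert FC1 in an MoE layer).
fc1 = torch.nn.Linear(64, 256)
x = torch.randn(8, 64, requires_grad=True)
with OffloadActivations():
    y = fc1(x)
y.sum().backward()  # activations are reloaded on demand
```

A production version would additionally use pinned host buffers and a side CUDA stream so the device-to-host copies overlap with compute, which is what makes the overhead close to zero.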

How does fine-grained offloading work with fine-grained recomputation?

  • For modules with minor performance overhead, such as layernorm or moe_act, use recomputation to reduce the memory footprint;
  • For other modules, use offloading to reduce the memory footprint;
  • Ensure that offloading/reloading can be overlapped with compute;
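The recomputation half of this split can be illustrated with PyTorch's `torch.utils.checkpoint` (again, a minimal sketch rather than the PR's implementation): a cheap module's activations are dropped in forward and recomputed in backward, while expensive modules would be offloaded instead.

```python
import torch
from torch.utils.checkpoint import checkpoint

# A cheap-to-recompute module: recomputing LayerNorm in backward costs
# little compute but avoids storing its intermediate activations.
ln = torch.nn.LayerNorm(64)
x = torch.randn(4, 64, requires_grad=True)

# use_reentrant=False selects the recommended non-reentrant checkpoint.
y = checkpoint(ln, x, use_reentrant=False)
y.sum().backward()
```

Gradients are identical to the non-checkpointed computation; only the memory/compute trade-off changes.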

Benchmark

DeepSeek-V3-proxy on H100

Setup
  • Layer parameters are the same as the DeepSeek-V3 model
  • The layer count is cut down to 14 layers
  • The first 3 dense layers are replaced with 3 MoE layers
  • TP1PP4EP16VPP1CP1-MBS1GBS512, bf16 training
  • Offload expert_fc1, moe_act, act_norm and mlp_norm
Results

| Configuration | Throughput (TFLOPS) | Max reserved memory (MB) |
|---|---|---|
| Baseline | 321 | 74306 |
| Offload expert_fc1, moe_act, layernorm | 315 | 61046 |

DeepSeek-V3 on GB200 (from @hongxiaob )

Setup
  • Same model structure as DeepSeek-V3, but without MTP
  • TP1PP8EP32CP1VPP4-MBS1GBS2048, mxfp8
  • Offload moe_act
Results

| Configuration | Throughput (TFLOPS) | Max reserved memory (MB) |
|---|---|---|
| Baseline | 945 | 169094 |
| Offload moe_act | 930 | 151054 |

@lhb8125 lhb8125 requested review from a team as code owners October 24, 2025 02:41
copy-pr-bot (bot) commented Oct 24, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@lhb8125 lhb8125 requested review from hxbai and yanring October 24, 2025 02:43
@yanring added the `module: moe` and `Expert Review` (ready for expert review) labels Oct 24, 2025
@yanring yanring added this to the Core 0.15 milestone Oct 24, 2025
@yanring (Contributor) commented Oct 24, 2025

LGTM. We've already done one round of review on GitLab.

@lhb8125 (Contributor, Author) commented Oct 24, 2025

/ok to test 336feef

@lhb8125 (Contributor, Author) commented Oct 24, 2025

/ok to test ada9057

lhb8125 and others added 4 commits October 26, 2025 18:29
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
@lhb8125 lhb8125 force-pushed the hongbinl/activation_offloading_github branch from 2be5f41 to 50b5f36 Compare October 27, 2025 01:32
@lhb8125 (Contributor, Author) commented Oct 27, 2025

/ok to test

copy-pr-bot (bot) commented Oct 27, 2025

/ok to test

@lhb8125, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@lhb8125 (Contributor, Author) commented Oct 27, 2025

/ok to test 50b5f36

@ko3n1g (Contributor) commented Oct 27, 2025

This PR was failing on the internal CI with a few issues. Let's sync offline on it.

@lhb8125 (Contributor, Author) commented Oct 27, 2025

/ok to test 50b5f36

@elliottnv
AI: @ko3n1g to verify the fixed version of the PR.


Labels

community-request, core_dev_r0.15.0, Expert Review, module: moe, Run tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants