[Dev] feat(moe): Fine-grained activation offloading by lhb8125 · Pull Request #1912 · NVIDIA/Megatron-LM

lhb8125 · 2025-10-24T02:41:49Z

What does this PR do ?

Memory capacity are more and more important with the rising of extreme sparse MoE models like DeepSeek-V3 and Qwen3-235B. Fine-grained recomputing reduces the memory footprint at the cost of extra recomputation, while offloading could utilize the host-device bandwidth to achieve nearly zero-overhead.

The current CPU offloading strategy from TE is a layer-level strategy, which offloads the activations in a granularity of the transformer layer, which is coarse-level and hard to highlight the most prominent activations.

Fine-grained Activation Offloading targets at offloading the activation at the granularity of specific modules, so that we can calibrate the amount of offloading activation to maximize the training throughput.

Design Doc

Compared with the current cpu offloading strategy provided by TE, this PR has several advantages:

support PP=1/PP/VPP;
support MoE models;
manually specify offloading the modules with a large memory footprint;
Work with fine-grained recomputation to reduce the total activations as much as possible;

How does fine-grained offloading work with fine-grained recomputing?

For modules with minor perf overhead like layernorm or moe_act, use recomputing to reduce memory footprint;
For other modules, use offloading to reduce memory footprint;
Make sure the offloading/reloading could be overlapped with computing;

Benchmark

DeepSeek-V3-proxy on H100

Setup

Layer parameters are same as DeepSeek-V3 model
Layer number is cut off to 14 layers
Replace the fisrt 3 dense layers with 3 moe layers
TP1PP4EP16VPP1CP1-MBS1GBS512, bf16 training
Offload expert_fc1, moe_act, act_norm and mlp_norm

Results

	Throughput (TFlops)	Max reserved memory (MB)
Baseline	321	74306
Offload expert_fc1,moe_act,layernorm	315	61046

DeepSeek-V3 on GB200 (from @hongxiaob )

Setup

Same model structure with DeepSeek-V3 but no mtp
TP1PP8EP32CP1VPP4-MBS1GBS2048, mxfp8
Offload moe_act

Results

	Throughput (TFlops)	Max reserved memory (MB)
Baseline	945	169094
Offload moe_act	930	151054

copy-pr-bot · 2025-10-24T02:41:53Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yanring · 2025-10-24T02:45:05Z

LGTM. We've already done one round of review on GitLab.

lhb8125 · 2025-10-24T02:46:24Z

/ok to test 336feef

lhb8125 · 2025-10-24T03:29:13Z

/ok to test ada9057

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

lhb8125 · 2025-10-27T01:32:53Z

/ok to test

copy-pr-bot · 2025-10-27T01:32:55Z

/ok to test

@lhb8125, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

lhb8125 · 2025-10-27T01:33:24Z

/ok to test 50b5f36

ko3n1g · 2025-10-27T13:39:02Z

This PR was failing on the internal CI with a few issues. Let's sync offline on it.

lhb8125 · 2025-10-27T13:46:44Z

/ok to test 50b5f36

elliottnv · 2025-12-09T16:56:04Z

AI: @ko3n1g to verify the fixed version of the PR.

lhb8125 requested review from a team as code owners October 24, 2025 02:41

lhb8125 requested review from hxbai and yanring October 24, 2025 02:43

yanring assigned lhb8125 Oct 24, 2025

yanring added module: moe Expert Review Apply this label to indicate that your PR is ready for expert review. labels Oct 24, 2025

yanring added this to the Core 0.15 milestone Oct 24, 2025

copy-pr-bot bot temporarily deployed to nemo-ci October 24, 2025 02:46 Inactive

copy-pr-bot bot temporarily deployed to test October 24, 2025 02:47 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci October 24, 2025 02:49 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci October 24, 2025 03:04 Error

copy-pr-bot bot temporarily deployed to nemo-ci October 24, 2025 03:04 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci October 24, 2025 03:29 Inactive

copy-pr-bot bot temporarily deployed to test October 24, 2025 03:30 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci October 24, 2025 04:02 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci October 24, 2025 04:18 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci October 26, 2025 00:33 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci October 26, 2025 00:53 Inactive

lhb8125 and others added 4 commits October 26, 2025 18:29

support fine-grained activation offloading

5c209c3

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

update years in copyright

784a34c

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

update copyright

6e5c29e

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

Merge branch 'dev' into hongbinl/activation_offloading_github

50b5f36

lhb8125 force-pushed the hongbinl/activation_offloading_github branch from 2be5f41 to 50b5f36 Compare October 27, 2025 01:32

yanring approved these changes Oct 27, 2025

View reviewed changes

chtruong814 approved these changes Oct 27, 2025

View reviewed changes

pablo-garay approved these changes Oct 27, 2025

View reviewed changes

erhoo82 mentioned this pull request Nov 17, 2025

[25.11 patch] PRs to add NVIDIA-NeMo/Megatron-Bridge#1372

Open

yanring mentioned this pull request Dec 3, 2025

[ROADMAP][Updated on Jan 26] Megatron Core MoE Roadmap #1729

Open

44 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Dev] feat(moe): Fine-grained activation offloading#1912

[Dev] feat(moe): Fine-grained activation offloading#1912
lhb8125 merged 4 commits intoNVIDIA:devfrom
lhb8125:hongbinl/activation_offloading_github

lhb8125 commented Oct 24, 2025 •

edited

Loading

Uh oh!

copy-pr-bot bot commented Oct 24, 2025

Uh oh!

yanring commented Oct 24, 2025

Uh oh!

lhb8125 commented Oct 24, 2025

Uh oh!

lhb8125 commented Oct 24, 2025

Uh oh!

lhb8125 commented Oct 27, 2025

Uh oh!

copy-pr-bot bot commented Oct 27, 2025

Uh oh!

lhb8125 commented Oct 27, 2025

Uh oh!

ko3n1g commented Oct 27, 2025

Uh oh!

lhb8125 commented Oct 27, 2025

Uh oh!

elliottnv commented Dec 9, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

Conversation

lhb8125 commented Oct 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do ?

Compared with the current cpu offloading strategy provided by TE, this PR has several advantages:

How does fine-grained offloading work with fine-grained recomputing?

Benchmark

DeepSeek-V3-proxy on H100

Setup

Results

DeepSeek-V3 on GB200 (from @hongxiaob )

Setup

Results

Uh oh!

copy-pr-bot bot commented Oct 24, 2025

Uh oh!

yanring commented Oct 24, 2025

Uh oh!

lhb8125 commented Oct 24, 2025

Uh oh!

lhb8125 commented Oct 24, 2025

Uh oh!

lhb8125 commented Oct 27, 2025

Uh oh!

copy-pr-bot bot commented Oct 27, 2025

Uh oh!

lhb8125 commented Oct 27, 2025

Uh oh!

ko3n1g commented Oct 27, 2025

Uh oh!

lhb8125 commented Oct 27, 2025

Uh oh!

elliottnv commented Dec 9, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

lhb8125 commented Oct 24, 2025 •

edited

Loading