# Fine-grained Activation Offloading (in collaboration with rednote)
Memory capacity has become increasingly important with the rise of extremely sparse MoE models such as DeepSeek-V3 and Qwen3-235B. Fine-grained recomputation reduces the memory footprint at the cost of extra recomputation, while offloading can exploit host-device bandwidth to achieve nearly zero overhead. Fine-grained Activation Offloading offloads activations at the granularity of specific modules, so the amount of offloaded activation can be calibrated to maximize training throughput.
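The sketch below illustrates one way module-granularity offloading can be realized in PyTorch: tensors saved for backward inside selected modules are copied to pinned host memory during the forward pass and copied back to the device when backward needs them. The class name `offload_saved_activations`, the size threshold, and the use of `torch.autograd.graph.saved_tensors_hooks` are illustrative assumptions, not this repository's implementation; a production version would overlap the copies with compute on a dedicated CUDA stream.

```python
# Illustrative sketch only (assumed PyTorch usage, not this repo's code):
# offload tensors saved for backward to pinned CPU memory at module
# granularity, and bring them back to the GPU for the backward pass.
import torch
from torch.autograd.graph import saved_tensors_hooks


class offload_saved_activations(saved_tensors_hooks):
    def __init__(self, min_bytes: int = 1 << 20):
        def pack(t: torch.Tensor):
            # Offload only sizeable device tensors; keep small ones on GPU.
            if t.is_cuda and t.numel() * t.element_size() >= min_bytes:
                host = torch.empty(t.shape, dtype=t.dtype,
                                   device="cpu", pin_memory=True)
                host.copy_(t)  # D2H copy (blocking here for clarity)
                return ("cpu", host, t.device)
            return ("gpu", t, None)

        def unpack(saved):
            kind, t, device = saved
            if kind == "cpu":
                # H2D copy when backward first needs this activation.
                return t.to(device, non_blocking=True)
            return t

        super().__init__(pack, unpack)


# Wrap only the modules whose activations should be offloaded,
# e.g. the MoE expert MLPs, while attention stays on device.
def forward_with_offload(module: torch.nn.Module, x: torch.Tensor):
    with offload_saved_activations():
        return module(x)
```

In the actual feature, the set of offloading modules is selected through the command-line flags shown in the Usage section below.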
**Features**
* Support PP=1/PP/Interleaved PP
* Compatible with fine-grained recomputation
* Support FP8
* Support MTP
* Support mixed dense & MoE layers
* Support A2A Overlap
* Support CUDA Graph
  * (Temporary) The CUDA graph scope cannot contain the offloading modules
**Usage**
```bash
# Enable fine-grained activation offloading
--fine-grained-activation-offloading
# Specify which modules should offload their inputs