Add MiniMax-M3 NVFP4 variant (MTP + non-MTP)#577
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
Code Review
This pull request adds the new nvfp4 quantized variant (nvidia/MiniMax-M3-NVFP4) for the MiniMax-M3 model, updating the configuration and documentation to support running it on Blackwell GPUs with vLLM. The review feedback recommends defining and prefixing commands with the VLLM_USE_FLASHINFER_MOE_FP4=1 environment variable to ensure the model utilizes optimized FlashInfer kernels for FP4 MoE layers on Blackwell.
|
@faradawn can you please review this one? |
|
LGTM. I've resolved the Gemini comments since the NVFP4 MOE env var is not needed. Below is the full serving command that supports this recipes |
| ## Quantized Variant (NVFP4, Blackwell) | ||
|
|
||
| [`nvidia/MiniMax-M3-NVFP4`](https://huggingface.co/nvidia/MiniMax-M3-NVFP4) is | ||
| an NVFP4 checkpoint quantized by NVIDIA — roughly **1/4 the VRAM** of the BF16 | ||
| release, so the 427B model fits comfortably on a single Blackwell node (B200 / | ||
| B300) with KV-cache headroom. Select the **nvfp4** variant above, or pass the | ||
| repo id directly to `vllm serve`. | ||
|
|
||
| > **vLLM support is in-flight.** MiniMax-M3 NVFP4 needs the modelopt NVFP4 path | ||
| > added in [vLLM PR #46380](https://github.com/vllm-project/vllm/pull/46380), | ||
| > which is not yet merged. Until it lands in a release, build vLLM from that | ||
| > branch (or a nightly once merged); a stock build will not recognise the NVFP4 | ||
| > quant config. | ||
|
|
||
| ```bash | ||
| vllm serve nvidia/MiniMax-M3-NVFP4 \ | ||
| --tensor-parallel-size 8 \ | ||
| --block-size 128 \ | ||
| --tool-call-parser minimax_m3 \ | ||
| --reasoning-parser minimax_m3 \ | ||
| --enable-auto-tool-choice | ||
| ``` | ||
|
|
||
| Add `--enable-expert-parallel` (TP+EP) or `--data-parallel-size 8 | ||
| --enable-expert-parallel` (DP+EP) to scale across the node, exactly as for the | ||
| BF16/MXFP8 commands above. For text-only serving, add `--language-model-only` | ||
| to skip the vision encoder and free VRAM for KV cache. | ||
|
|
||
| ### NVFP4 + EAGLE3 spec decoding (MTP) | ||
|
|
||
| The NVFP4 target pairs with the same EAGLE3 draft head as the other variants. | ||
| Enable the **Spec decoding** feature above, or append the draft config to the | ||
| command: | ||
|
|
||
| ```bash | ||
| vllm serve nvidia/MiniMax-M3-NVFP4 \ | ||
| --tensor-parallel-size 8 \ | ||
| --block-size 128 \ | ||
| --tool-call-parser minimax_m3 \ | ||
| --reasoning-parser minimax_m3 \ | ||
| --enable-auto-tool-choice \ | ||
| --speculative-config '{"method": "eagle3", "model": "Inferact/MiniMax-M3-EAGLE3", "num_speculative_tokens": 3, "attention_backend": "FLASH_ATTN"}' | ||
| ``` | ||
|
|
There was a problem hiding this comment.
| ## Quantized Variant (NVFP4, Blackwell) | |
| [`nvidia/MiniMax-M3-NVFP4`](https://huggingface.co/nvidia/MiniMax-M3-NVFP4) is | |
| an NVFP4 checkpoint quantized by NVIDIA — roughly **1/4 the VRAM** of the BF16 | |
| release, so the 427B model fits comfortably on a single Blackwell node (B200 / | |
| B300) with KV-cache headroom. Select the **nvfp4** variant above, or pass the | |
| repo id directly to `vllm serve`. | |
| > **vLLM support is in-flight.** MiniMax-M3 NVFP4 needs the modelopt NVFP4 path | |
| > added in [vLLM PR #46380](https://github.com/vllm-project/vllm/pull/46380), | |
| > which is not yet merged. Until it lands in a release, build vLLM from that | |
| > branch (or a nightly once merged); a stock build will not recognise the NVFP4 | |
| > quant config. | |
| ```bash | |
| vllm serve nvidia/MiniMax-M3-NVFP4 \ | |
| --tensor-parallel-size 8 \ | |
| --block-size 128 \ | |
| --tool-call-parser minimax_m3 \ | |
| --reasoning-parser minimax_m3 \ | |
| --enable-auto-tool-choice | |
| ``` | |
| Add `--enable-expert-parallel` (TP+EP) or `--data-parallel-size 8 | |
| --enable-expert-parallel` (DP+EP) to scale across the node, exactly as for the | |
| BF16/MXFP8 commands above. For text-only serving, add `--language-model-only` | |
| to skip the vision encoder and free VRAM for KV cache. | |
| ### NVFP4 + EAGLE3 spec decoding (MTP) | |
| The NVFP4 target pairs with the same EAGLE3 draft head as the other variants. | |
| Enable the **Spec decoding** feature above, or append the draft config to the | |
| command: | |
| ```bash | |
| vllm serve nvidia/MiniMax-M3-NVFP4 \ | |
| --tensor-parallel-size 8 \ | |
| --block-size 128 \ | |
| --tool-call-parser minimax_m3 \ | |
| --reasoning-parser minimax_m3 \ | |
| --enable-auto-tool-choice \ | |
| --speculative-config '{"method": "eagle3", "model": "Inferact/MiniMax-M3-EAGLE3", "num_speculative_tokens": 3, "attention_backend": "FLASH_ATTN"}' | |
| ``` |
There was a problem hiding this comment.
The information correctly matches the selection UI's command. Do we want to keep this? Otherwise, it looks good to me. @esmeetu
Signed-off-by: Ankur-singh <ankusingh@nvidia.com>
931c5b2 to
3418cc7
Compare
functionstackx
left a comment
There was a problem hiding this comment.
following this recipe, it passes evals https://github.com/SemiAnalysisAI/InferenceX/actions/runs/28197637552/job/83557638525
Adds an NVFP4 variant (
nvidia/MiniMax-M3-NVFP4) to the MiniMax-M3 recipe — NVIDIA-quantized 4-bit weights, ~1/4 the BF16 VRAM (vram_minimum_gb: 257), fitting a single Blackwell (B200/B300) node with KV-cache headroom.Both MTP and non-MTP are covered: the recipe's existing opt-in
spec_decodingfeature already wires the EAGLE3 draft head (Inferact/MiniMax-M3-EAGLE3, 3 speculative tokens), so the NVFP4 target runs with spec decoding on (MTP) or off (non-MTP) via the command builder — no new feature needed.Changes to
models/MiniMaxAI/MiniMax-M3.yaml:nvfp4variant (model_id: nvidia/MiniMax-M3-NVFP4), following the existingMiniMax-M2.7NVFP4 precedent.+EAGLE3(MTP) serve commands.metadescription/date refresh; NVFP4 / EAGLE3 / vLLM PR links added to References.NVFP4 support is in-flight in vLLM (PR #46380); the guide notes the build requirement until it merges.
Validated with
node scripts/build-recipes-api.mjs(✓ JSON API parses;nvidia/MiniMax-M3-NVFP4renders as a promoted variant, Blackwell-only).