
Add MiniMax-M3 NVFP4 variant (MTP + non-MTP)#577

Open
Ankur-singh wants to merge 1 commit into
vllm-project:mainfrom
Ankur-singh:minimax-m3-nvfp4

Conversation

@Ankur-singh


Adds an NVFP4 variant (nvidia/MiniMax-M3-NVFP4) to the MiniMax-M3 recipe — NVIDIA-quantized 4-bit weights, ~1/4 the BF16 VRAM (vram_minimum_gb: 257), fitting a single Blackwell (B200/B300) node with KV-cache headroom.
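
As a rough sanity check on the "~1/4" figure, the sketch below estimates weight-only footprints from the 427B parameter count quoted in the guide text. It assumes 16 bits per BF16 weight and about 4.5 bits per NVFP4 weight (4-bit values plus one 8-bit scale per 16-element block). The `weight_gb` helper is hypothetical, and the estimate ignores any layers left unquantized as well as KV cache and activations, so it is expected to land below the recipe's `vram_minimum_gb: 257`.

```bash
#!/usr/bin/env bash
# Weight-only footprint estimate using integer math and decimal GB (1 GB = 1e9 bytes).
# PARAMS_B (billions of parameters) is taken from the guide text; the bit widths are assumptions.
PARAMS_B=427

# weight_gb BITS_X10: GB of weights at BITS_X10/10 bits per parameter.
weight_gb() {
  local bits_x10=$1
  # params_B * 1e9 params * (bits / 8) bytes / 1e9 = params_B * bits / 8
  echo $(( PARAMS_B * bits_x10 / 80 ))
}

bf16_gb=$(weight_gb 160)   # 16 bits per weight
nvfp4_gb=$(weight_gb 45)   # 4 bits + one 8-bit scale per 16 weights = 4.5 bits

echo "BF16 weights:  ~${bf16_gb} GB"
echo "NVFP4 weights: ~${nvfp4_gb} GB"
```

Under these assumptions the NVFP4 weights come to a little over a quarter of the BF16 weights, which is consistent with the claim above.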

Both MTP and non-MTP are covered: the recipe's existing opt-in spec_decoding feature already wires the EAGLE3 draft head (Inferact/MiniMax-M3-EAGLE3, 3 speculative tokens), so the NVFP4 target runs with spec decoding on (MTP) or off (non-MTP) via the command builder — no new feature needed.

Changes to models/MiniMaxAI/MiniMax-M3.yaml:

  • New nvfp4 variant (model_id: nvidia/MiniMax-M3-NVFP4), following the existing MiniMax-M2.7 NVFP4 precedent.
  • Guide section "Quantized Variant (NVFP4, Blackwell)" with TP8 and +EAGLE3 (MTP) serve commands.
  • meta description/date refresh; NVFP4 / EAGLE3 / vLLM PR links added to References.

NVFP4 support is in-flight in vLLM (PR #46380); the guide notes the build requirement until it merges.

Validated with node scripts/build-recipes-api.mjs (✓ JSON API parses; nvidia/MiniMax-M3-NVFP4 renders as a promoted variant, Blackwell-only).

@vercel

vercel Bot commented Jun 25, 2026

Contributor

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| vllm-recipes | Ready | Preview, Comment | Jun 25, 2026 8:15pm |


@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request adds the new nvfp4 quantized variant (nvidia/MiniMax-M3-NVFP4) for the MiniMax-M3 model, updating the configuration and documentation to support running it on Blackwell GPUs with vLLM. The review feedback recommends defining and prefixing commands with the VLLM_USE_FLASHINFER_MOE_FP4=1 environment variable to ensure the model utilizes optimized FlashInfer kernels for FP4 MoE layers on Blackwell.

Comment thread models/MiniMaxAI/MiniMax-M3.yaml
Comment thread models/MiniMaxAI/MiniMax-M3.yaml
Comment thread models/MiniMaxAI/MiniMax-M3.yaml
@Ankur-singh

Author

@faradawn can you please review this one?

@faradawn

Collaborator

LGTM.

I've resolved the Gemini comments, since the NVFP4 MoE env var is not needed. Below is the full serving command that supports this recipe:

vllm serve nvidia/MiniMax-M3-NVFP4 \
$PARALLEL_ARGS \
--gpu-memory-utilization 0.90 \
--max-model-len $MAX_MODEL_LEN \ >>> not needed. 
--block-size 128 \ >>> not needed
--language-model-only \ >>> not needed
--max-cudagraph-capture-size 2048 \ >>> not needed
--max-num-batched-tokens "$((ISL * 2 ))" \ >>> not needed
--stream-interval 20 --no-enable-prefix-caching \ >>> not needed
--trust-remote-code > $SERVER_LOG 2>&1 &
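
Read literally, dropping every flag annotated `>>> not needed` leaves a much shorter command. The sketch below prints that command rather than launching a server, and it omits the log redirection. The comment does not say what `$PARALLEL_ARGS` expands to, so the default used here is a placeholder assumption, and `serve_cmd` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Print (do not run) the serve command that remains once the flags
# annotated ">>> not needed" are removed.
# Assumption: PARALLEL_ARGS is not defined in the comment; fall back to TP8.
if [ -z "${PARALLEL_ARGS:-}" ]; then
  PARALLEL_ARGS="--tensor-parallel-size 8"
fi

serve_cmd() {
  # Word-splitting of $PARALLEL_ARGS is intentional: it may hold several flags.
  # shellcheck disable=SC2086
  echo vllm serve nvidia/MiniMax-M3-NVFP4 \
    $PARALLEL_ARGS \
    --gpu-memory-utilization 0.90 \
    --trust-remote-code
}

serve_cmd
```

Setting `PARALLEL_ARGS` before calling `serve_cmd` swaps in a different parallelism layout without touching the other flags.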

Comment on lines +434 to +477
## Quantized Variant (NVFP4, Blackwell)

[`nvidia/MiniMax-M3-NVFP4`](https://huggingface.co/nvidia/MiniMax-M3-NVFP4) is
an NVFP4 checkpoint quantized by NVIDIA — roughly **1/4 the VRAM** of the BF16
release, so the 427B model fits comfortably on a single Blackwell node (B200 /
B300) with KV-cache headroom. Select the **nvfp4** variant above, or pass the
repo id directly to `vllm serve`.

> **vLLM support is in-flight.** MiniMax-M3 NVFP4 needs the modelopt NVFP4 path
> added in [vLLM PR #46380](https://github.com/vllm-project/vllm/pull/46380),
> which is not yet merged. Until it lands in a release, build vLLM from that
> branch (or a nightly once merged); a stock build will not recognise the NVFP4
> quant config.

```bash
vllm serve nvidia/MiniMax-M3-NVFP4 \
--tensor-parallel-size 8 \
--block-size 128 \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice
```

Add `--enable-expert-parallel` (TP+EP) or `--data-parallel-size 8
--enable-expert-parallel` (DP+EP) to scale across the node, exactly as for the
BF16/MXFP8 commands above. For text-only serving, add `--language-model-only`
to skip the vision encoder and free VRAM for KV cache.
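
The three layouts mentioned above can be sketched as a small lookup. The mode names and the `parallel_args` helper are hypothetical labels for illustration; the flags themselves are the ones named in the text. One assumption is flagged in the code: for DP+EP on a single 8-GPU node, the sketch leaves tensor parallelism at its default rather than combining it with `--tensor-parallel-size 8`.

```bash
#!/usr/bin/env bash
# parallel_args MODE: print the parallelism flags for one 8-GPU node.
# Mode names (tp, tp-ep, dp-ep) are hypothetical labels, not vLLM options.
parallel_args() {
  case "$1" in
    tp)    echo "--tensor-parallel-size 8" ;;
    tp-ep) echo "--tensor-parallel-size 8 --enable-expert-parallel" ;;
    # Assumption: with 8 data-parallel ranks on 8 GPUs, tensor parallelism stays at its default.
    dp-ep) echo "--data-parallel-size 8 --enable-expert-parallel" ;;
    *)     echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# Print one candidate command per layout without running anything.
for mode in tp tp-ep dp-ep; do
  echo "$mode: vllm serve nvidia/MiniMax-M3-NVFP4 $(parallel_args "$mode")"
done
```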

### NVFP4 + EAGLE3 spec decoding (MTP)

The NVFP4 target pairs with the same EAGLE3 draft head as the other variants.
Enable the **Spec decoding** feature above, or append the draft config to the
command:

```bash
vllm serve nvidia/MiniMax-M3-NVFP4 \
--tensor-parallel-size 8 \
--block-size 128 \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice \
--speculative-config '{"method": "eagle3", "model": "Inferact/MiniMax-M3-EAGLE3", "num_speculative_tokens": 3, "attention_backend": "FLASH_ATTN"}'
```
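
Since the two commands above differ only in the trailing `--speculative-config` flag, the toggle can be sketched as one helper that prints either form. `build_cmd` and `SPEC_CONFIG` are hypothetical names introduced for illustration; the flags and the JSON payload are copied from the commands above, and nothing is executed.

```bash
#!/usr/bin/env bash
# Draft-model settings, copied verbatim from the EAGLE3 command above.
SPEC_CONFIG='{"method": "eagle3", "model": "Inferact/MiniMax-M3-EAGLE3", "num_speculative_tokens": 3, "attention_backend": "FLASH_ATTN"}'

# build_cmd [on|off]: print the NVFP4 serve command with or without spec decoding.
build_cmd() {
  local spec=${1:-off}
  local cmd="vllm serve nvidia/MiniMax-M3-NVFP4 --tensor-parallel-size 8 --block-size 128"
  cmd="$cmd --tool-call-parser minimax_m3 --reasoning-parser minimax_m3 --enable-auto-tool-choice"
  if [ "$spec" = "on" ]; then
    # Single quotes keep the JSON intact when the printed line is pasted into a shell.
    cmd="$cmd --speculative-config '$SPEC_CONFIG'"
  fi
  printf '%s\n' "$cmd"
}

build_cmd off
build_cmd on
```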

Collaborator


Suggested change
## Quantized Variant (NVFP4, Blackwell)
[`nvidia/MiniMax-M3-NVFP4`](https://huggingface.co/nvidia/MiniMax-M3-NVFP4) is
an NVFP4 checkpoint quantized by NVIDIA — roughly **1/4 the VRAM** of the BF16
release, so the 427B model fits comfortably on a single Blackwell node (B200 /
B300) with KV-cache headroom. Select the **nvfp4** variant above, or pass the
repo id directly to `vllm serve`.
> **vLLM support is in-flight.** MiniMax-M3 NVFP4 needs the modelopt NVFP4 path
> added in [vLLM PR #46380](https://github.com/vllm-project/vllm/pull/46380),
> which is not yet merged. Until it lands in a release, build vLLM from that
> branch (or a nightly once merged); a stock build will not recognise the NVFP4
> quant config.
```bash
vllm serve nvidia/MiniMax-M3-NVFP4 \
--tensor-parallel-size 8 \
--block-size 128 \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice
```
Add `--enable-expert-parallel` (TP+EP) or `--data-parallel-size 8
--enable-expert-parallel` (DP+EP) to scale across the node, exactly as for the
BF16/MXFP8 commands above. For text-only serving, add `--language-model-only`
to skip the vision encoder and free VRAM for KV cache.
### NVFP4 + EAGLE3 spec decoding (MTP)
The NVFP4 target pairs with the same EAGLE3 draft head as the other variants.
Enable the **Spec decoding** feature above, or append the draft config to the
command:
```bash
vllm serve nvidia/MiniMax-M3-NVFP4 \
--tensor-parallel-size 8 \
--block-size 128 \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice \
--speculative-config '{"method": "eagle3", "model": "Inferact/MiniMax-M3-EAGLE3", "num_speculative_tokens": 3, "attention_backend": "FLASH_ATTN"}'
```

Collaborator


The information correctly matches the selection UI's command. Do we want to keep this? Otherwise, it looks good to me. @esmeetu

Member


It doesn't hurt to have this.

Signed-off-by: Ankur-singh <ankusingh@nvidia.com>

@functionstackx functionstackx left a comment

Contributor



4 participants