Add MiniMax-M3 NVFP4 variant (MTP + non-MTP) by Ankur-singh · Pull Request #577 · vllm-project/recipes

Ankur-singh · 2026-06-25T16:48:22Z

Adds an NVFP4 variant (nvidia/MiniMax-M3-NVFP4) to the MiniMax-M3 recipe — NVIDIA-quantized 4-bit weights, ~1/4 the BF16 VRAM (vram_minimum_gb: 257), fitting a single Blackwell (B200/B300) node with KV-cache headroom.

Both MTP and non-MTP are covered: the recipe's existing opt-in spec_decoding feature already wires the EAGLE3 draft head (Inferact/MiniMax-M3-EAGLE3, 3 speculative tokens), so the NVFP4 target runs with spec decoding on (MTP) or off (non-MTP) via the command builder — no new feature needed.

Changes to models/MiniMaxAI/MiniMax-M3.yaml:

- New nvfp4 variant (model_id: nvidia/MiniMax-M3-NVFP4), following the existing MiniMax-M2.7 NVFP4 precedent.
- Guide section "Quantized Variant (NVFP4, Blackwell)" with TP8 and +EAGLE3 (MTP) serve commands.
- Meta description/date refresh; NVFP4 / EAGLE3 / vLLM PR links added to References.

NVFP4 support is in-flight in vLLM (PR #46380); the guide notes the build requirement until it merges.

Validated with node scripts/build-recipes-api.mjs (✓ JSON API parses; nvidia/MiniMax-M3-NVFP4 renders as a promoted variant, Blackwell-only).
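
The "~1/4 the BF16 VRAM" figure can be sanity-checked with rough arithmetic. The sketch below is only an estimate under stated assumptions: it takes the 427B parameter count quoted in the guide text, assumes every parameter is stored at the stated width, and models NVFP4 as 4-bit values plus one 8-bit scale per 16-value block (about 4.5 bits per parameter). Real checkpoints usually keep some layers at higher precision, so these are not measured numbers.

```bash
#!/bin/sh
# Back-of-the-envelope weight footprint for a 427B-parameter model.
# Assumption: all parameters use the stated width; treat results as rough estimates.
params_b=427   # parameters, in billions (from the guide text)

weight_gb() {  # $1 = bits per parameter -> decimal GB of weights
  awk -v p="$params_b" -v b="$1" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

bf16_gb=$(weight_gb 16)     # 2 bytes per parameter
fp4_gb=$(weight_gb 4)       # 4-bit values only
nvfp4_gb=$(weight_gb 4.5)   # assumption: plus one 8-bit scale per 16-value block

echo "BF16:  ${bf16_gb} GB"
echo "FP4:   ${fp4_gb} GB"
echo "NVFP4: ${nvfp4_gb} GB"
```

Under these assumptions the 4-bit weights come to roughly a quarter of the BF16 figure, and both 4-bit estimates fall below the recipe's `vram_minimum_gb: 257`. The thread does not break down how that minimum was derived.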

vercel · 2026-06-25T16:48:28Z

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| vllm-recipes | Ready | Preview, Comment | Jun 25, 2026 8:15pm |

gemini-code-assist

Code Review

This pull request adds the new nvfp4 quantized variant (nvidia/MiniMax-M3-NVFP4) for the MiniMax-M3 model, updating the configuration and documentation to support running it on Blackwell GPUs with vLLM. The review feedback recommends defining and prefixing commands with the VLLM_USE_FLASHINFER_MOE_FP4=1 environment variable to ensure the model utilizes optimized FlashInfer kernels for FP4 MoE layers on Blackwell.

Ankur-singh · 2026-06-25T16:59:16Z

@faradawn can you please review this one?

faradawn · 2026-06-25T17:19:19Z

LGTM.

I've resolved the Gemini comments, since the NVFP4 MoE env var is not needed. Below is the full serving command that supports this recipe; flags marked ">>> not needed" can be dropped:

vllm serve nvidia/MiniMax-M3-NVFP4 \
$PARALLEL_ARGS \
--gpu-memory-utilization 0.90 \
--max-model-len $MAX_MODEL_LEN \ >>> not needed
--block-size 128 \ >>> not needed
--language-model-only \ >>> not needed
--max-cudagraph-capture-size 2048 \ >>> not needed
--max-num-batched-tokens "$((ISL * 2 ))" \ >>> not needed
--stream-interval 20 --no-enable-prefix-caching \ >>> not needed
--trust-remote-code > $SERVER_LOG 2>&1 &
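
Read literally, the annotations above leave only a handful of flags. A minimal sketch of the trimmed command follows; the default for `PARALLEL_ARGS` is a placeholder assumption (the thread does not show its real value), and the sketch only prints the command rather than launching it.

```bash
#!/bin/sh
# Trimmed form of the annotated command: every flag marked "not needed" is dropped.
# Assumption: this PARALLEL_ARGS default is a placeholder, not the value used in the thread.
PARALLEL_ARGS="${PARALLEL_ARGS:---tensor-parallel-size 8}"

# $PARALLEL_ARGS is left unquoted on purpose so it splits into separate flags.
set -- vllm serve nvidia/MiniMax-M3-NVFP4 \
  $PARALLEL_ARGS \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code

serve_cmd="$*"
echo "$serve_cmd"
# To launch in the background with logging, as in the original command:
#   "$@" > "$SERVER_LOG" 2>&1 &
```

Printing first makes it easy to confirm which flags survive before starting a long model load.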

faradawn · 2026-06-25T17:22:56Z

+  ## Quantized Variant (NVFP4, Blackwell)
+
+  [`nvidia/MiniMax-M3-NVFP4`](https://huggingface.co/nvidia/MiniMax-M3-NVFP4) is
+  an NVFP4 checkpoint quantized by NVIDIA — roughly **1/4 the VRAM** of the BF16
+  release, so the 427B model fits comfortably on a single Blackwell node (B200 /
+  B300) with KV-cache headroom. Select the **nvfp4** variant above, or pass the
+  repo id directly to `vllm serve`.
+
+  > **vLLM support is in-flight.** MiniMax-M3 NVFP4 needs the modelopt NVFP4 path
+  > added in [vLLM PR #46380](https://github.com/vllm-project/vllm/pull/46380),
+  > which is not yet merged. Until it lands in a release, build vLLM from that
+  > branch (or a nightly once merged); a stock build will not recognise the NVFP4
+  > quant config.
+
+  ```bash
+  vllm serve nvidia/MiniMax-M3-NVFP4 \
+    --tensor-parallel-size 8 \
+    --block-size 128 \
+    --tool-call-parser minimax_m3 \
+    --reasoning-parser minimax_m3 \
+    --enable-auto-tool-choice
+  ```
+
+  Add `--enable-expert-parallel` (TP+EP) or `--data-parallel-size 8
+  --enable-expert-parallel` (DP+EP) to scale across the node, exactly as for the
+  BF16/MXFP8 commands above. For text-only serving, add `--language-model-only`
+  to skip the vision encoder and free VRAM for KV cache.
+
+  ### NVFP4 + EAGLE3 spec decoding (MTP)
+
+  The NVFP4 target pairs with the same EAGLE3 draft head as the other variants.
+  Enable the **Spec decoding** feature above, or append the draft config to the
+  command:
+
+  ```bash
+  vllm serve nvidia/MiniMax-M3-NVFP4 \
+    --tensor-parallel-size 8 \
+    --block-size 128 \
+    --tool-call-parser minimax_m3 \
+    --reasoning-parser minimax_m3 \
+    --enable-auto-tool-choice \
+    --speculative-config '{"method": "eagle3", "model": "Inferact/MiniMax-M3-EAGLE3", "num_speculative_tokens": 3, "attention_backend": "FLASH_ATTN"}'
+  ```
+


Suggested change

## Quantized Variant (NVFP4, Blackwell)

[`nvidia/MiniMax-M3-NVFP4`](https://huggingface.co/nvidia/MiniMax-M3-NVFP4) is
an NVFP4 checkpoint quantized by NVIDIA — roughly **1/4 the VRAM** of the BF16
release, so the 427B model fits comfortably on a single Blackwell node (B200 /
B300) with KV-cache headroom. Select the **nvfp4** variant above, or pass the
repo id directly to `vllm serve`.

> **vLLM support is in-flight.** MiniMax-M3 NVFP4 needs the modelopt NVFP4 path
> added in [vLLM PR #46380](https://github.com/vllm-project/vllm/pull/46380),
> which is not yet merged. Until it lands in a release, build vLLM from that
> branch (or a nightly once merged); a stock build will not recognise the NVFP4
> quant config.

```bash
vllm serve nvidia/MiniMax-M3-NVFP4 \
  --tensor-parallel-size 8 \
  --block-size 128 \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice
```

Add `--enable-expert-parallel` (TP+EP) or `--data-parallel-size 8
--enable-expert-parallel` (DP+EP) to scale across the node, exactly as for the
BF16/MXFP8 commands above. For text-only serving, add `--language-model-only`
to skip the vision encoder and free VRAM for KV cache.

### NVFP4 + EAGLE3 spec decoding (MTP)

The NVFP4 target pairs with the same EAGLE3 draft head as the other variants.
Enable the **Spec decoding** feature above, or append the draft config to the
command:

```bash
vllm serve nvidia/MiniMax-M3-NVFP4 \
  --tensor-parallel-size 8 \
  --block-size 128 \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice \
  --speculative-config '{"method": "eagle3", "model": "Inferact/MiniMax-M3-EAGLE3", "num_speculative_tokens": 3, "attention_backend": "FLASH_ATTN"}'
```
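
The two guide commands differ only in the trailing `--speculative-config` argument, which is how one recipe covers both MTP and non-MTP. A minimal sketch of that toggle follows; `build_serve_cmd` is a hypothetical helper written for illustration, not the recipe site's actual command builder, and all flag values are copied from the guide text.

```bash
#!/bin/sh
# Hypothetical helper: assemble the guide's NVFP4 serve command with spec
# decoding on (MTP) or off (non-MTP). It prints arguments; it does not run vLLM.
build_serve_cmd() {
  spec="${1:-off}"   # "on" appends the EAGLE3 draft config
  set -- vllm serve nvidia/MiniMax-M3-NVFP4 \
    --tensor-parallel-size 8 \
    --block-size 128 \
    --tool-call-parser minimax_m3 \
    --reasoning-parser minimax_m3 \
    --enable-auto-tool-choice
  if [ "$spec" = "on" ]; then
    set -- "$@" --speculative-config \
      '{"method": "eagle3", "model": "Inferact/MiniMax-M3-EAGLE3", "num_speculative_tokens": 3, "attention_backend": "FLASH_ATTN"}'
  fi
  # Print one argument per line so the JSON value stays visibly intact.
  printf '%s\n' "$@"
}

build_serve_cmd off
build_serve_cmd on
```

Keeping the JSON as a single quoted argument avoids shell word-splitting inside the draft config.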

The information correctly matches the selection UI's command. Do we want to keep this? Otherwise, it looks good to me. @esmeetu

esmeetu · 2026-06-26

It doesn't hurt to have this.

functionstackx

Following this recipe, it passes evals: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/28197637552/job/83557638525

gemini-code-assist Bot reviewed Jun 25, 2026

vercel Bot deployed to Preview June 25, 2026 16:49

This was referenced Jun 25, 2026

Add MiniMax-M3 NVFP4 B300 single-node vLLM benchmark (EAGLE3 spec decode) SemiAnalysisAI/InferenceX#1929

Merged

Add MiniMax-M3 NVFP4 B300 single-node aggregated vLLM benchmark SemiAnalysisAI/InferenceX#1928

Merged

faradawn approved these changes Jun 25, 2026

Add MiniMax-M3 NVFP4 variant (MTP + non-MTP)

3418cc7

Signed-off-by: Ankur-singh <ankusingh@nvidia.com>

Ankur-singh force-pushed the minimax-m3-nvfp4 branch from 931c5b2 to 3418cc7 on June 25, 2026 20:14

vercel Bot deployed to Preview June 25, 2026 20:15

functionstackx approved these changes Jun 26, 2026

Ankur-singh wants to merge 1 commit into vllm-project:main from Ankur-singh:minimax-m3-nvfp4
