
Bug fix: disable weight quantizer rotation after weight fold during vLLM fakequant export #1143

Open
kinjalpatel27 wants to merge 3 commits into main from kinjal/vllm_export_rotation_fix

Conversation

@kinjalpatel27
Contributor

@kinjalpatel27 kinjalpatel27 commented Mar 30, 2026

What does this PR do?

Type of change: Bug fix

When export_hf_vllm_fq_checkpoint folds fake-quantized weights into the HF checkpoint, it applies the weight quantizer's full forward pass — including any Hadamard rotation — to produce fake_quant(rotate(W)). The weight quantizer is then disabled in the saved ModelOpt state so vLLM reload skips re-quantization.

However, if fold_weight is called after reload (e.g. in fakequant_worker.py), QuantModule.fold_weight checks fake_quant but not is_enabled, so it calls the disabled weight quantizer. With rotate=True, rotation runs before the _disabled early-return in TensorQuantizer.forward, corrupting the already-folded weight with a second rotation.
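To make the failure mode concrete, here is a toy numeric sketch of fake_quant(rotate(W)) being folded a second time. The normalized Hadamard matrix and rounding grid below are illustrative stand-ins, not the actual ModelOpt rotation kernel or NVFP4 quantizer:

```python
import numpy as np

def hadamard(n):
    # Sylvester construction, normalized so that H @ H.T == I
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def fake_quant(w, step=0.1):
    # toy fake quantization: snap values to a coarse grid
    return np.round(w / step) * step

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
H = hadamard(4)

folded = fake_quant(W @ H)         # what export writes into the checkpoint
refolded = fake_quant(folded @ H)  # buggy second fold after reload re-rotates
# refolded differs from folded: the checkpoint weight has been corrupted
```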

This fix clears _rotate on weight quantizers alongside disable() before saving the ModelOpt state. Both are restored on the in-memory model after saving, so the export remains non-mutating.
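The save/clear/restore sequence described above can be sketched with toy objects. Attribute and method names (`_rotate`, `disable()`, `enable()`) mirror the PR discussion, but this is an illustrative stand-in, not the actual code in vllm_fakequant_hf.py:

```python
class ToyQuantizer:
    """Stand-in for a weight quantizer with only the fields this sketch needs."""
    def __init__(self):
        self._rotate = {"enable": True}  # may be a bool, dict, or config object
        self._disabled = False

    def disable(self):
        self._disabled = True

    def enable(self):
        self._disabled = False

def snapshot_modelopt_state(weight_quantizers):
    """Disable quantizers and clear rotation while snapshotting, then restore."""
    saved = []
    try:
        for q in weight_quantizers:
            saved.append((q, q._rotate))
            q.disable()
            q._rotate = False  # rotation must not survive into the saved state
        # stand-in for serializing the real ModelOpt state dict
        return [{"disabled": q._disabled, "rotate": q._rotate} for q, _ in saved]
    finally:
        for q, orig_rotate in saved:  # restore so the export stays non-mutating
            q.enable()
            q._rotate = orig_rotate
```

The try/finally guarantees the in-memory model is restored even if snapshotting raises partway through.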

Usage

Testing

cd <MODELOPT_PATH>/examples/vllm_serve

Add the following config to modelopt/torch/quantization/config.py:
NVFP4_ROTATE_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {
            "num_bits": (2, 1),
            "block_sizes": {-1: 16, "type": "dynamic", "scale_bits": (4, 3)},
            "enable": True,
            "rotate": True,
        },
        "*input_quantizer": {
            "num_bits": (2, 1),
            "block_sizes": {-1: 16, "type": "dynamic", "scale_bits": (4, 3)},
            "enable": True,
            "rotate": True,
        },
        **_default_disabled_quantizer_cfg,
    },
    "algorithm": {
        "method": "max",
    },
}

python ../llm_ptq/hf_ptq.py \
    --pyt_ckpt_path Qwen/Qwen3-0.6B \
    --quant_cfg NVFP4_ROTATE_CFG \
    --dataset cnn_dailymail \
    --export_path qwen3-rotate \
    --trust_remote_code \
    --batch_size 4 \
    --calib_size 512 \
    --vllm_fakequant_export

MODELOPT_STATE_PATH=qwen3-rotate/vllm_fq_modelopt_state.pth python vllm_serve_fakequant.py qwen3-rotate/ \
    -tp 1 \
    --served-model-name Qwen3-0.6B \
    --host 0.0.0.0 --port 8001 \
    --trust-remote-code \
    --disable-custom-all-reduce \
    --enforce-eager \
    --gpu-memory-utilization 0.8 \
    --max-model-len 32768

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A
  • Did you update Changelog?: N/A

Additional Information

Summary by CodeRabbit

  • Bug Fixes
    • Fixed export flow to correctly preserve rotation configurations for quantized weights during model export, ensuring accurate quantization behavior in exported models.

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Mar 30, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Mar 30, 2026

📝 Walkthrough

Walkthrough

The export flow was modified to preserve each weight quantizer's rotation configuration during the temporary disable/enable window used for snapshotting ModelOpt quantizer state. Rotation settings are now captured before disabling, restored to a disabled form while snapshotting, and then reset to their original values after completion.

Changes

Cohort / File(s) Summary
Rotation Configuration Preservation
modelopt/torch/export/plugins/vllm_fakequant_hf.py
Modified export flow to capture and restore weight quantizer rotation configuration. When a quantizer's rotate_is_enabled is true, the code now stores the original _rotate value and mutates it to a type-preserving disabled form (handling both RotateConfig and legacy dict shapes). After snapshotting, the original rotation setting is restored.
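A type-preserving disabled form that handles both shapes the walkthrough mentions could look like the following sketch. The RotateConfig here is a hypothetical dataclass stand-in with an enable field; the real class may differ:

```python
import dataclasses

@dataclasses.dataclass
class RotateConfig:  # hypothetical stand-in for the real config class
    enable: bool = True
    dim: int = -1

def disabled_rotate_form(rotate):
    """Return a disabled rotation setting, preserving the original shape."""
    if isinstance(rotate, bool):
        return False
    if isinstance(rotate, dict):  # legacy dict shape
        return {**rotate, "enable": False}
    # assume a dataclass-like config object with an `enable` field
    return dataclasses.replace(rotate, enable=False)
```

Using dataclasses.replace keeps the original object untouched, so restoring it after the snapshot is just assigning the saved reference back.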

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically identifies the main change: a bug fix addressing weight quantizer rotation during vLLM fakequant export, which is the core issue described in the PR objectives.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Security Anti-Patterns ✅ Passed The modified vllm_fakequant_hf.py file contains no security anti-patterns including unsafe torch.load, numpy.load, hardcoded trust_remote_code, eval/exec, or # nosec comments.


@kinjalpatel27 kinjalpatel27 marked this pull request as ready for review March 30, 2026 20:11
@kinjalpatel27 kinjalpatel27 requested a review from a team as a code owner March 30, 2026 20:11
@kinjalpatel27 kinjalpatel27 requested a review from cjluo-nv March 30, 2026 20:11
@github-actions
Contributor

github-actions bot commented Mar 30, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1143/

Built to branch gh-pages at 2026-03-31 23:17 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)

107-122: Please add a rotate=True export/reload regression test.

This only fails when export, reload, and a later fold_weight() interact, so it is easy to regress silently without an end-to-end check.

As per coding guidelines, "Maintain minimum 70% code coverage on modelopt/* modules (enforced via coverage configuration)."
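Such a regression test could be sketched with toy stand-ins. The class below mimics the rotate-before-early-return behavior of TensorQuantizer.forward; it is not the real ModelOpt class:

```python
class ToyQuantizer:
    """Toy quantizer whose __call__ mimics rotation running before the
    _disabled early-return in TensorQuantizer.forward."""
    def __init__(self):
        self.fake_quant = True
        self._disabled = False
        self._rotate = True

    def disable(self):
        self._disabled = True

    def enable(self):
        self._disabled = False

    def __call__(self, w):
        if self._rotate:
            w = [-x for x in w]  # toy "rotation" (runs even when disabled)
        if self._disabled:
            return w             # early-return skips only the quantization step
        return [float(round(x)) for x in w]  # toy fake quantization

def test_refold_after_export_is_noop():
    q = ToyQuantizer()
    folded = q([1.2, -3.7])  # fake_quant(rotate(w))
    # export-side fix: clear rotation alongside disable() before saving state
    orig_rotate, q._rotate = q._rotate, False
    q.disable()
    assert q(folded) == folded  # reload-side fold is now a true no-op
    # restore the in-memory model so the export stays non-mutating
    q.enable()
    q._rotate = orig_rotate
```

Without the fix (leaving _rotate set), the second call would return the re-rotated weight and the assertion would fail, which is exactly the silent regression this test guards against.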

Inline comment (modelopt/torch/export/plugins/vllm_fakequant_hf.py, lines 118-122): The code disables quantizers and mutates quantizer._rotate but does not guarantee restoration on error. Wrap the disable/mutate section and the subsequent processing in a try/finally so restoration always runs: in the finally block, iterate wqs_to_restore, call quantizer.enable(), and restore quantizer._rotate to orig_rotate, logging rather than re-raising any exception during restore so the original exception is not masked.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8f9e9ddd-4ce2-43db-b2d3-5afccc6c397a

📥 Commits

Reviewing files that changed from the base of the PR and between f04e106 and 19563b0.

📒 Files selected for processing (1)
  • modelopt/torch/export/plugins/vllm_fakequant_hf.py

@codecov

codecov bot commented Mar 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.20%. Comparing base (ada1e26) to head (83b8ee6).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1143      +/-   ##
==========================================
+ Coverage   70.18%   70.20%   +0.02%     
==========================================
  Files         230      230              
  Lines       26080    26080              
==========================================
+ Hits        18304    18310       +6     
+ Misses       7776     7770       -6     

☔ View full report in Codecov by Sentry.

Collaborator

@cjluo-nv cjluo-nv left a comment

LGTM. The bug is real — TensorQuantizer.forward() applies Hadamard rotation before the _disabled early-return, so when fold_weight is called on vLLM reload (which checks fake_quant but not is_enabled), the already-folded weight gets a second rotation. The fix correctly clears _rotate alongside disable() before saving the modelopt state, and restores both afterward to keep the export non-mutating.

One minor observation: _rotate can be a bool, dict, or RotateConfig — setting it to False works correctly for the serialized state (since rotate_is_enabled handles all three types), but preserving the original type (e.g. RotateConfig(enable=False)) would be slightly cleaner. Not a blocker.

Worth noting that the deeper root cause is QuantModule.fold_weight not guarding on is_enabled / _disabled before invoking the quantizer, but that's a broader change — this targeted fix is appropriate.
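The broader guard the reviewer alludes to could be sketched as follows, using toy classes; the real QuantModule.fold_weight and quantizer API may differ:

```python
class ToyQuantizer:
    def __init__(self, enabled=True):
        self.fake_quant = True
        self.is_enabled = enabled

    def __call__(self, w):
        return [round(x, 1) for x in w]  # toy fake quantization

class ToyModule:
    def __init__(self, enabled=True):
        self.weight = [0.123, 0.456]
        self.weight_quantizer = ToyQuantizer(enabled)

def fold_weight(module):
    q = module.weight_quantizer
    # guard on is_enabled in addition to fake_quant, so a disabled quantizer
    # never re-touches an already-folded weight
    if q.fake_quant and q.is_enabled:
        module.weight = q(module.weight)
```

With this guard in place, a reload-side fold_weight call on a checkpoint whose quantizers were disabled at export becomes a no-op, which would fix the bug at its root rather than at the export boundary.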

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment
🧹 Nitpick comments (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)

128-132: Logic is correct; consider optional try/finally for robustness.

The sequence properly saves orig_rotate before mutation and correctly uses rotate_is_enabled to avoid unnecessary changes. The fix addresses the double-rotation bug effectively.

As noted in past review, wrapping lines 128-165 in a try/finally would guarantee restoration even if an exception occurs mid-export (relevant when callers catch exceptions and continue). This is optional given your deployment context but would make the function safer as reusable library code.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6c0d7633-e5eb-4ca7-8003-e0d0f2fa9fe6

📥 Commits

Reviewing files that changed from the base of the PR and between 19563b0 and f56eecc.

📒 Files selected for processing (1)
  • modelopt/torch/export/plugins/vllm_fakequant_hf.py
