
Bug fix: disable weight quantizer rotation after weight fold during vLLM fakequant export #1143

Open
kinjalpatel27 wants to merge 3 commits into main from kinjal/vllm_export_rotation_fix

Conversation

@kinjalpatel27
Contributor

@kinjalpatel27 kinjalpatel27 commented Mar 30, 2026

What does this PR do?

Type of change: Bug fix

When export_hf_vllm_fq_checkpoint folds fake-quantized weights into the HF checkpoint, it applies the weight quantizer's full forward pass — including any Hadamard rotation — to produce fake_quant(rotate(W)). The weight quantizer is then disabled in the saved ModelOpt state so vLLM reload skips re-quantization.

However, if fold_weight is called after reload (e.g. in fakequant_worker.py), QuantModule.fold_weight checks fake_quant but not is_enabled, so it calls the disabled weight quantizer. With rotate=True, rotation runs before the _disabled early-return in TensorQuantizer.forward, corrupting the already-folded weight with a second rotation.
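To make the failure mode concrete, here is a toy numeric sketch of fake_quant(rotate(W)) being folded a second time. The normalized Hadamard matrix and rounding grid below are illustrative stand-ins, not the actual ModelOpt rotation kernel or NVFP4 quantizer:

```python
import numpy as np

def hadamard(n):
    # Sylvester construction, normalized so that H @ H.T == I
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def fake_quant(w, step=0.1):
    # toy fake quantization: snap values to a coarse grid
    return np.round(w / step) * step

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
H = hadamard(4)

folded = fake_quant(W @ H)         # what export writes into the checkpoint
refolded = fake_quant(folded @ H)  # buggy second fold after reload re-rotates
# refolded differs from folded: the checkpoint weight has been corrupted
```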

This fix clears _rotate on weight quantizers alongside disable() before saving the ModelOpt state. Both are restored on the in-memory model after saving, so the export remains non-mutating.
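The save/clear/restore sequence described above can be sketched with toy objects. Attribute and method names (`_rotate`, `disable()`, `enable()`) mirror the PR discussion, but this is an illustrative stand-in, not the actual code in vllm_fakequant_hf.py:

```python
class ToyQuantizer:
    """Stand-in for a weight quantizer with only the fields this sketch needs."""
    def __init__(self):
        self._rotate = {"enable": True}  # may be a bool, dict, or config object
        self._disabled = False

    def disable(self):
        self._disabled = True

    def enable(self):
        self._disabled = False

def snapshot_modelopt_state(weight_quantizers):
    """Disable quantizers and clear rotation while snapshotting, then restore."""
    saved = []
    try:
        for q in weight_quantizers:
            saved.append((q, q._rotate))
            q.disable()
            q._rotate = False  # rotation must not survive into the saved state
        # stand-in for serializing the real ModelOpt state dict
        return [{"disabled": q._disabled, "rotate": q._rotate} for q, _ in saved]
    finally:
        for q, orig_rotate in saved:  # restore so the export stays non-mutating
            q.enable()
            q._rotate = orig_rotate
```

The try/finally guarantees the in-memory model is restored even if snapshotting raises partway through.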

Usage

Testing

cd <MODELOPT_PATH>/examples/vllm_serve

Add the following config to modelopt/torch/quantization/config.py:
NVFP4_ROTATE_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {
            "num_bits": (2, 1),
            "block_sizes": {-1: 16, "type": "dynamic", "scale_bits": (4, 3)},
            "enable": True,
            "rotate": True,
        },
        "*input_quantizer": {
            "num_bits": (2, 1),
            "block_sizes": {-1: 16, "type": "dynamic", "scale_bits": (4, 3)},
            "enable": True,
            "rotate": True,
        },
        **_default_disabled_quantizer_cfg,
    },
    "algorithm": {
        "method": "max",
    },
}

python ../llm_ptq/hf_ptq.py \
    --pyt_ckpt_path Qwen/Qwen3-0.6B \
    --quant_cfg NVFP4_ROTATE_CFG \
    --dataset cnn_dailymail \
    --export_path qwen3-rotate \
    --trust_remote_code \
    --batch_size 4 \
    --calib_size 512 \
    --vllm_fakequant_export

MODELOPT_STATE_PATH=qwen3-rotate/vllm_fq_modelopt_state.pth python vllm_serve_fakequant.py qwen3-rotate/ \
    -tp 1 \
    --served-model-name Qwen3-0.6B \
    --host 0.0.0.0 --port 8001 \
    --trust-remote-code \
    --disable-custom-all-reduce \
    --enforce-eager \
    --gpu-memory-utilization 0.8 \
    --max-model-len 32768

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A
  • Did you update Changelog?: N/A

Additional Information

Summary by CodeRabbit

  • Bug Fixes
    • Fixed export flow to correctly preserve rotation configurations for quantized weights during model export, ensuring accurate quantization behavior in exported models.

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Mar 30, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Mar 30, 2026

📝 Walkthrough

Walkthrough

The export flow was modified to preserve each weight quantizer's rotation configuration during the temporary disable/enable window used for snapshotting ModelOpt quantizer state. Rotation settings are now captured before disabling, restored to a disabled form while snapshotting, and then reset to their original values after completion.

Changes

Cohort / File(s) Summary
Rotation Configuration Preservation
modelopt/torch/export/plugins/vllm_fakequant_hf.py
Modified export flow to capture and restore weight quantizer rotation configuration. When a quantizer's rotate_is_enabled is true, the code now stores the original _rotate value and mutates it to a type-preserving disabled form (handling both RotateConfig and legacy dict shapes). After snapshotting, the original rotation setting is restored.
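A type-preserving disabled form that handles both shapes the walkthrough mentions could look like the following sketch. The RotateConfig here is a hypothetical dataclass stand-in with an enable field; the real class may differ:

```python
import dataclasses

@dataclasses.dataclass
class RotateConfig:  # hypothetical stand-in for the real config class
    enable: bool = True
    dim: int = -1

def disabled_rotate_form(rotate):
    """Return a disabled rotation setting, preserving the original shape."""
    if isinstance(rotate, bool):
        return False
    if isinstance(rotate, dict):  # legacy dict shape
        return {**rotate, "enable": False}
    # assume a dataclass-like config object with an `enable` field
    return dataclasses.replace(rotate, enable=False)
```

Using dataclasses.replace keeps the original object untouched, so restoring it after the snapshot is just assigning the saved reference back.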

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically identifies the main change: a bug fix addressing weight quantizer rotation during vLLM fakequant export, which is the core issue described in the PR objectives.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Security Anti-Patterns ✅ Passed The modified vllm_fakequant_hf.py file contains no security anti-patterns including unsafe torch.load, numpy.load, hardcoded trust_remote_code, eval/exec, or # nosec comments.


@kinjalpatel27 kinjalpatel27 marked this pull request as ready for review March 30, 2026 20:11
@kinjalpatel27 kinjalpatel27 requested a review from a team as a code owner March 30, 2026 20:11
@kinjalpatel27 kinjalpatel27 requested a review from cjluo-nv March 30, 2026 20:11
@github-actions
Contributor

github-actions bot commented Mar 30, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1143/

Built to branch gh-pages at 2026-03-31 23:17 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)

107-122: Please add a rotate=True export/reload regression test.

This only fails when export, reload, and a later fold_weight() interact, so it is easy to regress silently without an end-to-end check.

As per coding guidelines, "Maintain minimum 70% code coverage on modelopt/* modules (enforced via coverage configuration)."
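Such a regression test could be sketched with toy stand-ins. The class below mimics the rotate-before-early-return behavior of TensorQuantizer.forward; it is not the real ModelOpt class:

```python
class ToyQuantizer:
    """Toy quantizer whose __call__ mimics rotation running before the
    _disabled early-return in TensorQuantizer.forward."""
    def __init__(self):
        self.fake_quant = True
        self._disabled = False
        self._rotate = True

    def disable(self):
        self._disabled = True

    def enable(self):
        self._disabled = False

    def __call__(self, w):
        if self._rotate:
            w = [-x for x in w]  # toy "rotation" (runs even when disabled)
        if self._disabled:
            return w             # early-return skips only the quantization step
        return [float(round(x)) for x in w]  # toy fake quantization

def test_refold_after_export_is_noop():
    q = ToyQuantizer()
    folded = q([1.2, -3.7])  # fake_quant(rotate(w))
    # export-side fix: clear rotation alongside disable() before saving state
    orig_rotate, q._rotate = q._rotate, False
    q.disable()
    assert q(folded) == folded  # reload-side fold is now a true no-op
    # restore the in-memory model so the export stays non-mutating
    q.enable()
    q._rotate = orig_rotate
```

Without the fix (leaving _rotate set), the second call would return the re-rotated weight and the assertion would fail, which is exactly the silent regression this test guards against.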

Inline comment (modelopt/torch/export/plugins/vllm_fakequant_hf.py, lines 118-122): The code disables quantizers and mutates quantizer._rotate but does not guarantee restoration on error. Wrap the disable/mutate section and the subsequent processing in a try/finally so restoration always runs: in the finally block, iterate wqs_to_restore, call quantizer.enable(), and restore quantizer._rotate to orig_rotate, logging rather than re-raising any exception during restore so the original exception is not masked.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8f9e9ddd-4ce2-43db-b2d3-5afccc6c397a

📥 Commits

Reviewing files that changed from the base of the PR and between f04e106 and 19563b0.

📒 Files selected for processing (1)
  • modelopt/torch/export/plugins/vllm_fakequant_hf.py

@codecov

codecov bot commented Mar 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.20%. Comparing base (ada1e26) to head (83b8ee6).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1143      +/-   ##
==========================================
+ Coverage   70.18%   70.20%   +0.02%     
==========================================
  Files         230      230              
  Lines       26080    26080              
==========================================
+ Hits        18304    18310       +6     
+ Misses       7776     7770       -6     

☔ View full report in Codecov by Sentry.

Collaborator

@cjluo-nv cjluo-nv left a comment

LGTM. The bug is real — TensorQuantizer.forward() applies Hadamard rotation before the _disabled early-return, so when fold_weight is called on vLLM reload (which checks fake_quant but not is_enabled), the already-folded weight gets a second rotation. The fix correctly clears _rotate alongside disable() before saving the modelopt state, and restores both afterward to keep the export non-mutating.

One minor observation: _rotate can be a bool, dict, or RotateConfig — setting it to False works correctly for the serialized state (since rotate_is_enabled handles all three types), but preserving the original type (e.g. RotateConfig(enable=False)) would be slightly cleaner. Not a blocker.

Worth noting that the deeper root cause is QuantModule.fold_weight not guarding on is_enabled / _disabled before invoking the quantizer, but that's a broader change — this targeted fix is appropriate.
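The broader guard the reviewer alludes to could be sketched as follows, using toy classes; the real QuantModule.fold_weight and quantizer API may differ:

```python
class ToyQuantizer:
    def __init__(self, enabled=True):
        self.fake_quant = True
        self.is_enabled = enabled

    def __call__(self, w):
        return [round(x, 1) for x in w]  # toy fake quantization

class ToyModule:
    def __init__(self, enabled=True):
        self.weight = [0.123, 0.456]
        self.weight_quantizer = ToyQuantizer(enabled)

def fold_weight(module):
    q = module.weight_quantizer
    # guard on is_enabled in addition to fake_quant, so a disabled quantizer
    # never re-touches an already-folded weight
    if q.fake_quant and q.is_enabled:
        module.weight = q(module.weight)
```

With this guard in place, a reload-side fold_weight call on a checkpoint whose quantizers were disabled at export becomes a no-op, which would fix the bug at its root rather than at the export boundary.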

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment
🧹 Nitpick comments (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)

128-132: Logic is correct; consider optional try/finally for robustness.

The sequence properly saves orig_rotate before mutation and correctly uses rotate_is_enabled to avoid unnecessary changes. The fix addresses the double-rotation bug effectively.

As noted in past review, wrapping lines 128-165 in a try/finally would guarantee restoration even if an exception occurs mid-export (relevant when callers catch exceptions and continue). This is optional given your deployment context but would make the function safer as reusable library code.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6c0d7633-e5eb-4ca7-8003-e0d0f2fa9fe6

📥 Commits

Reviewing files that changed from the base of the PR and between 19563b0 and f56eecc.

📒 Files selected for processing (1)
  • modelopt/torch/export/plugins/vllm_fakequant_hf.py
