Make the LLaVA / LLaVA-Next test guard explicit#5778
Conversation
…=-2`, and turn the test guard into an assertion
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 84e9ccc368
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
| "num_key_value_heads": 2, | ||
| "embed_dim": 64, |
There was a problem hiding this comment.
these keys aren't used, we can safely drop them
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
…ount for LLaVA design constraints
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default mode and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 79ac139. Configure here.
| new_param = trainer.model.get_parameter(n) | ||
| # LLaVA & LLaVA-Next: vision_feature_layer=-2 leaves the last encoder layer (layers.1 and post_layernorm | ||
| # without gradient by design. Assert they stay frozen — if they ever start training, the feature-selection | ||
| # plumbing has likely regressed. |
There was a problem hiding this comment.
Inconsistent comments across trainers violate AGENTS.md consistency rule
Low Severity
The LLaVA frozen-param comment in DPO/SFT tests differs from the identical logic in GRPO/RLOO tests. DPO/SFT uses "(layers.1 and post_layernorm" (with an unclosed parenthesis and extra trailing text about regression), while GRPO/RLOO uses "(layers.1) and post_layernorm (pooler-only path)" (properly punctuated). AGENTS.md requires word-for-word identical comments when the logic is identical.
Additional Locations (2)
Triggered by project rule: ../.ai/AGENTS.md
Reviewed by Cursor Bugbot for commit 79ac139. Configure here.


The VLM trainer tests assert every trainable parameter changes after
trainer.train(). For LLaVA / LLaVA-Next some params are skipped with a vague:We investigated. Those params are not bugs: they are unreachable from the loss by design.
This PR turns the skip into a positive assertion that they stay frozen, and cleans up two spurious keys in the tiny-model generation scripts.
Why those params are frozen
LLaVA stitches three off-the-shelf pieces together:
Linearlayers that map vision features into the LM's embedding dim.At forward time: image → CLIP → pick one intermediate encoder layer's hidden states (
vision_feature_layer=-2, the second-to-last) → drop the CLS token → projector → splice the result into the LM's input embeddings at the<image>placeholder.Taking features from an intermediate layer rather than the final one is the reason the last encoder layer (and
post_layernorm) never reach the loss.flowchart LR img[Image] --> pe[patch_embed] pe --> l0[layers.0] l0 --> l1[layers.1] l1 --> dots["…"] dots --> l22[layers.22] l22 --> l23[layers.23] l23 --> pln[post_layernorm] pln -.->|"pooler_output<br/>(unused)"| X([ ]) l22 ==>|"hidden_states[-2]"| drop["drop CLS token"] drop ==> proj[multi_modal_projector] proj ==> lm text[Text tokens] --> lm[Language Model] lm --> out[Output] subgraph vt[Vision tower — CLIP-ViT-Large, 24 layers] pe l0 l1 dots l22 l23 pln end style l23 stroke:#c33,stroke-dasharray:4 3 style pln stroke:#c33,stroke-dasharray:4 3 style X stroke:#c33,stroke-dasharray:4 3Bold arrows = the actual path features take to the LLM.
Red dashed = computed every forward, but the result never reaches the loss (
vision_feature_layer = -2, pooler unused).Note
Low Risk
Low risk: updates are limited to test expectations and tiny-model generation configs, with no production training logic changes. Main risk is making CI stricter for LLaVA/LLaVA-Next, which could surface latent upstream model/config differences.
Overview
Updates VLM trainer tests (
SFT/DPO/GRPO/RLOO) to stop vaguely skipping certain LLaVA/LLaVA-Next vision-tower parameters and instead positively assert that the last vision encoder layer andpost_layernormremain unchanged (frozen) aftertrainer.train().Cleans up tiny LLaVA/LLaVA-Next model generation scripts by removing unused/invalid
vision_configkeys, and pins test runs to specific hub PR revisions for the updated tiny LLaVA models viatests/conftest.py’sMODEL_REVISIONS.Reviewed by Cursor Bugbot for commit 79ac139. Bugbot is set up for automated code reviews on this repo. Configure here.