Skip to content

Make the LLaVA / LLaVA-Next test guard explicit#5778

Open
qgallouedec wants to merge 4 commits into
mainfrom
fix-llava
Open

Make the LLaVA / LLaVA-Next test guard explicit#5778
qgallouedec wants to merge 4 commits into
mainfrom
fix-llava

Conversation

@qgallouedec
Copy link
Copy Markdown
Member

@qgallouedec qgallouedec commented May 15, 2026

The VLM trainer tests assert every trainable parameter changes after trainer.train(). For LLaVA / LLaVA-Next some params are skipped with a vague:

# For some reason, these params are not updated. … We should investigate this further, but for now we just skip these params.

We investigated. Those params are not bugs: they are unreachable from the loss by design.

This PR turns the skip into a positive assertion that they stay frozen, and cleans up two spurious keys in the tiny-model generation scripts.

Why those params are frozen

LLaVA stitches three off-the-shelf pieces together:

  1. Vision tower: a frozen, pretrained CLIP vision transformer (24 encoder layers in real LLaVA, 2 in our tiny version).
  2. Multi-modal projector: two small Linear layers that map vision features into the LM's embedding dim.
  3. Language model: a pretrained LLM (Vicuna for LLaVA-1.5, Mistral for LLaVA-Next).

At forward time: image → CLIP → pick one intermediate encoder layer's hidden states (vision_feature_layer=-2, the second-to-last) → drop the CLS token → projector → splice the result into the LM's input embeddings at the <image> placeholder.

Taking features from an intermediate layer rather than the final one is the reason the last encoder layer (and post_layernorm) never reach the loss.

flowchart LR
    img[Image] --> pe[patch_embed]
    pe --> l0[layers.0]
    l0 --> l1[layers.1]
    l1 --> dots["…"]
    dots --> l22[layers.22]
    l22 --> l23[layers.23]
    l23 --> pln[post_layernorm]
    pln -.->|"pooler_output<br/>(unused)"| X([ ])

    l22 ==>|"hidden_states[-2]"| drop["drop CLS token"]
    drop ==> proj[multi_modal_projector]
    proj ==> lm

    text[Text tokens] --> lm[Language Model]
    lm --> out[Output]

    subgraph vt[Vision tower — CLIP-ViT-Large, 24 layers]
        pe
        l0
        l1
        dots
        l22
        l23
        pln
    end

    style l23 stroke:#c33,stroke-dasharray:4 3
    style pln stroke:#c33,stroke-dasharray:4 3
    style X stroke:#c33,stroke-dasharray:4 3
Loading

Bold arrows = the actual path features take to the LLM.
Red dashed = computed every forward, but the result never reaches the loss (vision_feature_layer = -2, pooler unused).


Note

Low Risk
Low risk: updates are limited to test expectations and tiny-model generation configs, with no production training logic changes. Main risk is making CI stricter for LLaVA/LLaVA-Next, which could surface latent upstream model/config differences.

Overview
Updates VLM trainer tests (SFT/DPO/GRPO/RLOO) to stop vaguely skipping certain LLaVA/LLaVA-Next vision-tower parameters and instead positively assert that the last vision encoder layer and post_layernorm remain unchanged (frozen) after trainer.train().

Cleans up tiny LLaVA/LLaVA-Next model generation scripts by removing unused/invalid vision_config keys, and pins test runs to specific hub PR revisions for the updated tiny LLaVA models via tests/conftest.py’s MODEL_REVISIONS.

Reviewed by Cursor Bugbot for commit 79ac139. Bugbot is set up for automated code reviews on this repo. Configure here.

…=-2`, and turn the test guard into an assertion
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 84e9ccc368

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tests/conftest.py
@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment thread tests/conftest.py
Comment on lines -39 to -40
"num_key_value_heads": 2,
"embed_dim": 64,
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these keys aren't used, we can safely drop them

Comment thread tests/test_dpo_trainer.py Outdated
Comment thread tests/test_sft_trainer.py Outdated
Comment thread tests/test_sft_trainer.py Outdated
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default mode and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 79ac139. Configure here.

Comment thread tests/test_dpo_trainer.py
new_param = trainer.model.get_parameter(n)
# LLaVA & LLaVA-Next: vision_feature_layer=-2 leaves the last encoder layer (layers.1 and post_layernorm
# without gradient by design. Assert they stay frozen — if they ever start training, the feature-selection
# plumbing has likely regressed.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent comments across trainers violate AGENTS.md consistency rule

Low Severity

The LLaVA frozen-param comment in DPO/SFT tests differs from the identical logic in GRPO/RLOO tests. DPO/SFT uses "(layers.1 and post_layernorm" (with an unclosed parenthesis and extra trailing text about regression), while GRPO/RLOO uses "(layers.1) and post_layernorm (pooler-only path)" (properly punctuated). AGENTS.md requires word-for-word identical comments when the logic is identical.

Additional Locations (2)
Fix in Cursor Fix in Web

Triggered by project rule: ../.ai/AGENTS.md

Reviewed by Cursor Bugbot for commit 79ac139. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants