fix(model): convert bool mask_cache to float additive mask for softcapping#2235

Open
nuthalapativarun wants to merge 3 commits into Lightning-AI:main from nuthalapativarun:fix/attention-mask-softcapping-kv-cache

Conversation

@nuthalapativarun

What does this PR do?

When the KV cache is active, build_mask_cache() returns a torch.bool tensor where True indicates a position that should be attended to (lower triangle). In scaled_dot_product_attention, for models that use attention_logit_softcapping (e.g. Gemma 2), this boolean mask was added directly to the softcapped scores:

scores = scores + mask  # mask is torch.bool → adds 0 or 1, NOT 0 or -inf

This breaks causal masking: attended positions received a spurious +1 while future positions received 0 instead of -inf, so softmax assigned non-zero attention weight to tokens that should be completely masked out.
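
For illustration, a minimal standalone sketch of the failure mode (toy tensors, not the litgpt code path): adding a boolean mask to float scores promotes it to 0/1, so the masked position still receives non-zero softmax weight.

```python
import torch

# Toy attention scores for one query over three key positions; the last one is "future".
scores = torch.tensor([0.5, 0.2, 0.3])
mask = torch.tensor([True, True, False])  # True = attend, False = mask out

# Buggy path: the bool promotes to 0/1, so kept positions gain +1 and the masked one gains 0.
buggy = torch.softmax(scores + mask, dim=-1)
print(buggy)  # ~[0.49, 0.36, 0.15] -> the masked position still gets ~15% of the attention

# Intended behaviour: an additive float mask drives the masked score to (near) -inf.
additive = torch.zeros_like(scores).masked_fill_(~mask, torch.finfo(scores.dtype).min)
fixed = torch.softmax(scores + additive, dim=-1)
print(fixed)  # ~[0.57, 0.43, 0.00] -> no weight on the masked position
```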

Fix

Add an elif branch that converts the incoming boolean mask to a float additive mask before the addition (True → 0.0, False → -inf). The same fix is applied to both CausalSelfAttention and MultiheadLatentAttention.

elif mask.dtype == torch.bool:
    # build_mask_cache returns a boolean mask (True=keep); convert to additive float mask
    mask = torch.zeros_like(mask, dtype=q.dtype).masked_fill_(~mask, torch.finfo(q.dtype).min)
scores = scores + mask
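
For a concrete (toy) picture of what that conversion yields, with the dtype taken from a stand-in query tensor:

```python
import torch

q = torch.randn(1, 1, 2, 4)          # stand-in query tensor; only its dtype matters here
mask = torch.tensor([[True, False],  # True = attend (lower triangle)
                     [True, True]])

additive = torch.zeros_like(mask, dtype=q.dtype).masked_fill_(~mask, torch.finfo(q.dtype).min)
# additive:
# [[ 0.0000e+00, -3.4028e+38],
#  [ 0.0000e+00,  0.0000e+00]]
```

A side note on the fill value: using torch.finfo(q.dtype).min rather than a literal float('-inf') behaves the same for masking purposes, but in the degenerate case where every position in a row is masked the softmax still produces finite (uniform) output instead of NaNs.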

Testing

Added test_attention_mask_bool_to_float_with_softcapping (a condensed sketch follows the list), which:

  1. Verifies mask_cache is indeed torch.bool (pre-condition of the bug)
  2. Runs a prefill forward pass with KV cache enabled
  3. Runs the same forward pass without KV cache
  4. Asserts the two outputs are numerically close — they diverge without the fix because the bool mask corrupts the attention distribution
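
A condensed sketch of the test's shape; the config values, helper names, and tolerances below are illustrative assumptions, not the exact test code:

```python
import torch
from litgpt import GPT, Config  # assumed public imports; the real test lives in litgpt's test suite

def test_attention_mask_bool_to_float_with_softcapping():
    # Tiny Gemma-2-style config; attention_logit_softcapping triggers the affected code path.
    config = Config(
        block_size=16, vocab_size=64, n_layer=1, n_head=2, n_embd=32,
        attention_logit_softcapping=50.0,
    )
    model = GPT(config).eval()
    idx = torch.randint(0, config.vocab_size, (1, 8))
    input_pos = torch.arange(idx.size(1))

    # 1. Pre-condition of the bug: the cached mask really is boolean.
    model.set_kv_cache(batch_size=1)
    assert model.mask_cache.dtype == torch.bool

    # 2. Prefill forward pass with the KV cache enabled.
    with torch.no_grad():
        out_cached = model(idx, input_pos)

    # 3. Same forward pass without the KV cache.
    model.clear_kv_cache()
    with torch.no_grad():
        out_plain = model(idx)

    # 4. Without the fix these diverge, because the bool mask corrupts the attention weights.
    torch.testing.assert_close(out_cached, out_plain, atol=1e-5, rtol=1e-5)
```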

Fixes #1672

fix(model): convert bool mask_cache to float additive mask for softcapping

When KV cache is active, build_mask_cache() returns a torch.bool tensor
(True=keep). In scaled_dot_product_attention the bool mask was added
directly to scores, contributing 0 or 1 instead of 0 or -inf, which
breaks causal masking for models that use attention_logit_softcapping
(e.g. Gemma 2).

Add an elif branch that converts the boolean mask to an additive float
mask (True→0.0, False→-inf) before the scores addition. The fix is
applied to both CausalSelfAttention and MultiheadLatentAttention.

Fixes Lightning-AI#1672
@azure-pipelines

Azure Pipelines:
4 pipeline(s) require an authorized user to comment /azp run to run.

@nuthalapativarun
Author

Hi! Just checking in — CI appears to be waiting on an authorized /azp run trigger. Happy to make any changes needed to move this forward. Thanks!
