Fix Gemma softcap F16 overflow NaN and scheduler hang (#2058) #2076
Open
glaziermag wants to merge 1 commit into EricLBuehler:master from
Conversation
Fixes an inference hang specific to Gemma models (#2058) caused by NaN propagation during the softcapping operation.
Cause
Gemma architectures multiply attention scores by a `softcap` scaling factor (50.0). When running with `f16` precision, this multiplication can exceed `f16::MAX` (65504), which overflows and produces `NaN` values. Once a sequence started producing `NaN`s, it was not fully cleaned up by the scheduler: its `SequenceState::Error` state caused the scheduler state machine to enter an infinite retry loop instead of fully dropping the sequence.
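For illustration, a minimal sketch of the overflow using the `half` crate (the score value is made up; real scores come from the `QK^T` product):

```rust
use half::f16; // assumes the `half` crate (2.x) is available

fn main() {
    // Hypothetical pre-softcap attention score; real values come from QK^T.
    let score: f32 = 1400.0;
    let softcap: f32 = 50.0;

    // The product stays finite in f32...
    let scaled_f32 = score * softcap; // 70000.0

    // ...but exceeds f16::MAX (65504), so the f16 result saturates to infinity
    // and propagates as NaN through the rest of the attention computation.
    let scaled_f16 = f16::from_f32(scaled_f32);
    assert!(scaled_f16.is_infinite());
    println!("f32: {scaled_f32}, f16: {scaled_f16}");
}
```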
Changes
- `mistralrs-core/src/attention/backends/naive.rs`: Temporarily casts intermediate tensors to `f32` exclusively during the `softcap` scaling step, giving enough numerical headroom for the scaling/tanh phase before returning cleanly to the target dtype (`f16` or `bf16`). This avoids any regression on CPU or standard models (a sketch of the pattern follows this list).
- `mistralrs-core/src/sequence.rs`: Added an explicit `SequenceState::Error` check to `is_finished_paged_attn()` so that errored sequences are recognized as finished and properly garbage-collected by the engine (also sketched below).
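A minimal sketch of the f32 round trip in the softcap step, written against the candle tensor API that mistral.rs uses (the function name and shape of the call site are illustrative, not the exact code in `naive.rs`):

```rust
use candle_core::{DType, Result, Tensor};

/// Sketch only: widen to f32 for the softcap scaling so the intermediate
/// values cannot overflow f16, then return to the caller's dtype.
fn softcap_in_f32(att: &Tensor, softcap: f64) -> Result<Tensor> {
    let in_dtype = att.dtype();          // e.g. F16 or BF16
    let att = att.to_dtype(DType::F32)?; // headroom for the scaling/tanh phase
    let att = ((att / softcap)?.tanh()? * softcap)?;
    att.to_dtype(in_dtype)               // back to the target dtype
}
```

And a schematic of the scheduler-side change; apart from the `Error`/`Done` distinction and the `is_finished_paged_attn` name, the types below are placeholders rather than the real `sequence.rs` definitions:

```rust
// Schematic placeholder types, not the actual mistral.rs definitions.
#[derive(Clone, Copy, PartialEq)]
enum SequenceState {
    Running,
    Done,
    Error,
}

struct Sequence {
    state: SequenceState,
}

impl Sequence {
    /// Under PagedAttention, an errored sequence must also count as finished
    /// so the scheduler frees it instead of retrying it forever.
    fn is_finished_paged_attn(&self) -> bool {
        matches!(self.state, SequenceState::Done | SequenceState::Error)
    }
}
```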
Testing
- Verified generations with `hf-internal-testing/tiny-random-Gemma2ForCausalLM`. CPU execution times showed no regression from the added `f32` cast block (average batch latency ~7.9 s in the tested CPU environment).
- Gemma completions now return a standard `200 OK` without triggering infinite retries on the console.

Before
```bash
$ curl -s -X POST http://localhost:1234/v1/completions \
  -d '{"model": "gemma-2-2b-it", "prompt": "Explain gravity.", "max_tokens": 20}'
```
(Hangs: the scheduler retries the errored sequence indefinitely and no response is returned)

After
```bash
$ curl -s -X POST http://localhost:1234/v1/completions \
  -d '{"model": "gemma-2-2b-it", "prompt": "Explain gravity.", "max_tokens": 20}'
```
(Successfully completes generation without errors or retries)