Commit 3b86b96
fix(deepseek-v4): keep MoE routing scores and attention softmax in fp32
Two precision issues that compound across 61 layers and degrade
backbone parity vs reference (observed during MTP parity testing
in #2191):
1. The sqrtsoftplus Gate cast routing scores back to bf16 immediately
after computing sqrt(softplus(x.float())), losing precision for
expert selection. Its HashGate counterpart stays in fp32. Remove
the .to(scores.dtype) cast so non-hash layers match (see the gate
sketch after this list).
2. eager_attention_with_sink ran softmax in the input dtype (bf16
under autocast). Force the softmax to fp32 for numerical stability,
matching standard practice (see the softmax sketch after this list).
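A minimal sketch of fix 1, assuming hypothetical names (route_scores, logits); the sqrt(softplus(x.float())) expression and the removed .to(scores.dtype) cast are from the commit message, the rest is illustrative:

```python
import torch
import torch.nn.functional as F

def route_scores(logits: torch.Tensor) -> torch.Tensor:
    # Before the fix, the result was cast straight back to the input dtype:
    #   return torch.sqrt(F.softplus(logits.float())).to(logits.dtype)
    # which rounded the routing scores to bf16 before top-k expert
    # selection. Keeping them in fp32 matches the HashGate path.
    return torch.sqrt(F.softplus(logits.float()))
```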
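And a sketch of fix 2 in the same spirit; the function name eager_attention_with_sink is from the commit, but this reduced helper and its signature are assumptions:

```python
import torch

def softmax_fp32(attn_scores: torch.Tensor) -> torch.Tensor:
    # Under autocast the scores arrive in bf16; run the normalization in
    # fp32 for stability, then cast back so downstream matmuls keep the
    # compute dtype.
    return torch.softmax(attn_scores, dim=-1, dtype=torch.float32).to(attn_scores.dtype)
```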
Also fix a stale docstring claiming compress-ratio attention was
not yet implemented; it has been wired in.
Signed-off-by: khazic <khazzz1c@gmail.com>
2 files changed
Lines changed: 5 additions & 4 deletions
[File 1 (name not captured; diff text lost in extraction). Hunk at lines 46-53: old lines 49-50 removed, new lines 49-51 added. Hunk at lines 473-480: old line 476 replaced by new line 477.]
[File 2 (name not captured; diff text lost in extraction). Hunk at lines 355-361: old line 358 replaced by new line 358.]