
feat(trainer): replace DPPO+KL default loss with IcePop#2401

Draft
samsja wants to merge 3 commits into main from feat/icepop-loss

Conversation


@samsja samsja commented May 3, 2026

Summary

Override the default RL loss with IcePop from the INTELLECT-3 technical report (eq. 1):

$$J_{\text{IcePop}}(\theta) = \mathbb{E}\left[ \frac{1}{\sum_i |y_i|} \sum_{i,t} \mathcal{M}\!\left(\frac{\pi_{\text{train}}(y_{i,t}\mid x, y_{i,<t};\theta)}{\pi_{\text{infer}}(y_{i,t}\mid x, y_{i,<t};\theta_{\text{old}})};\alpha,\beta\right) \widehat{A}_{i,t} \right]$$

$$\mathcal{M}(k; \alpha, \beta) = \begin{cases} k & k \in [\alpha, \beta] \\ 0 & \text{otherwise} \end{cases}$$

Tokens whose trainer/inference importance ratio falls outside $[\alpha, \beta]$ get zero policy-gradient weight — they are dropped, not clipped. There is no separate KL penalty — the double-sided ratio mask is what keeps the update inside the trust region.
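The drop-vs-clip distinction can be made concrete with a toy comparison (a sketch; `icepop_mask` and `ppo_clip` are illustrative names, and the PPO-style clip is shown only for contrast with ratio clipping, not as the full PPO surrogate):

```python
# Defaults from this PR: alpha = 0.2, beta = 5.0.
ALPHA, BETA = 0.2, 5.0

def icepop_mask(k, alpha=ALPHA, beta=BETA):
    # IcePop: an out-of-range token contributes nothing (and carries no gradient).
    return k if alpha <= k <= beta else 0.0

def ppo_clip(k, alpha=ALPHA, beta=BETA):
    # For contrast: clipping the ratio would still give the token a capped weight.
    return min(max(k, alpha), beta)

print(icepop_mask(8.0), ppo_clip(8.0))  # 0.0 5.0
print(icepop_mask(1.3), ppo_clip(1.3))  # 1.3 1.3
```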

Breaking config changes (trainer.loss)

| Removed | Added | Default |
| --- | --- | --- |
| `dppo_mask_low` | `icepop_ratio_low` (α) | 0.2 |
| `dppo_mask_high` | `icepop_ratio_high` (β) | 5.0 |
| `kl_tau` | (term removed) | n/a |

adv_tau and teacher_tau are unchanged. mismatch_kl / is_masked* metrics are retained for observability (semantics now ratio-based, not advantage-conditioned).
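A migration sketch for the `trainer.loss` block, assuming a YAML config surface (the file layout is hypothetical; field names and defaults are from the table above, and `...` marks values not stated in this PR):

```yaml
# Before: DPPO + KL
trainer:
  loss:
    dppo_mask_low: ...
    dppo_mask_high: ...
    kl_tau: ...             # removed entirely; no replacement key

# After: IcePop
trainer:
  loss:
    icepop_ratio_low: 0.2   # α: tokens with ratio below this are dropped
    icepop_ratio_high: 5.0  # β: tokens with ratio above this are dropped
    # adv_tau and teacher_tau are unchanged
```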

Notes

  • Token-level mask is detached from autograd: gradient flows through importance_ratio only for in-range tokens.
  • Replaces, rather than adds, a loss type — the user asked for the new behavior to override the existing default.
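The notes above can be sketched together. Here is a minimal NumPy illustration (function and argument names are mine, not the codebase's); in the actual PyTorch trainer the in-range mask would be `.detach()`ed so that gradient flows through the importance ratio only for in-range tokens:

```python
import numpy as np

def icepop_loss(train_logprobs, infer_logprobs, advantages, loss_mask,
                ratio_low=0.2, ratio_high=5.0):
    """IcePop objective sketch; shapes and names are illustrative.

    All arrays have shape (tokens,): per-token log-probs under the trainer
    and inference policies, advantage estimates, and a 0/1 trainable mask.
    """
    # Trainer/inference importance ratio per token.
    ratio = np.exp(train_logprobs - infer_logprobs)

    # Double-sided mask: out-of-range tokens are dropped (zeroed), not
    # clipped. In PyTorch this mask would be detached from autograd.
    in_range = ((ratio >= ratio_low) & (ratio <= ratio_high)).astype(float)

    # Weight each surviving token, normalize by the total trainable token
    # count (the 1 / sum_i |y_i| term), and negate: trainers minimize.
    per_token = in_range * ratio * advantages * loss_mask
    return -per_token.sum() / max(loss_mask.sum(), 1.0)
```

With the defaults above, a token whose ratio is 10 contributes nothing to the update, whereas clipping the ratio would still push it with a capped weight.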

🤖 Generated with Claude Code

samsja and others added 3 commits May 2, 2026 19:08
Override the default RL loss with IcePop from the INTELLECT-3
technical report (https://arxiv.org/abs/2512.16144):

  J(θ) = E[ (1 / Σ|y|) Σ_t M(r_t; α, β) · Â_t ]
  M(k; α, β) = k if k ∈ [α, β] else 0

Tokens whose trainer/inference importance ratio falls outside
[α, β] (defaults: 0.5, 5.0) get zero policy-gradient weight —
dropped, not clipped. Whole rollouts are zeroed when any
trainable token's ratio collapses below `icepop_rollout_min_ratio`
(default: 1e-5), guarding against catastrophic train/infer
divergence. The KL penalty term is gone; the double-sided ratio
mask is what keeps updates inside the trust region.

Breaking config changes:
- removed: `dppo_mask_low`, `dppo_mask_high`, `kl_tau`
- added: `icepop_ratio_low`, `icepop_ratio_high`,
  `icepop_rollout_min_ratio`
- unchanged: `adv_tau`, `teacher_tau`

`mismatch_kl` is retained as an observability metric.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per review: the rollout-level drop was extra; the token-level
double-sided mask is sufficient. Lower α default from 0.5 to 0.2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Existing tests pass `dppo_mask_high` as a kwarg, but pydantic in this
codebase silently ignores unknown fields, so the smoke tests still
construct a valid `DefaultLossConfig` with the new defaults. No need
to touch them in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
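The silent-ignore behavior that commit relies on can be demonstrated with a minimal stand-in model (a sketch; the real `DefaultLossConfig` and its settings live in the codebase):

```python
from pydantic import BaseModel, ConfigDict

class DefaultLossConfig(BaseModel):
    """Minimal stand-in; fields and defaults are the ones from this PR."""
    model_config = ConfigDict(extra="ignore")  # unknown kwargs are dropped

    icepop_ratio_low: float = 0.2
    icepop_ratio_high: float = 5.0

# An old smoke test passing the removed `dppo_mask_high` kwarg still
# constructs a valid config: the unknown field is silently ignored.
cfg = DefaultLossConfig(dppo_mask_high=5.0)
print(cfg.icepop_ratio_low, cfg.icepop_ratio_high)  # 0.2 5.0
```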