
feat(trainer): replace DPPO+KL default loss with IcePop#2401

Draft
samsja wants to merge 3 commits into main from feat/icepop-loss

Conversation


@samsja samsja commented May 3, 2026

Summary

Override the default RL loss with IcePop from the INTELLECT-3 technical report (eq. 1):

$$J_{\text{IcePop}}(\theta) = \mathbb{E}\left[ \frac{1}{\sum_i |y_i|} \sum_{i,t} \mathcal{M}\!\left(\frac{\pi_{\text{train}}(y_{i,t}\mid x, y_{i,<t};\theta)}{\pi_{\text{infer}}(y_{i,t}\mid x, y_{i,<t};\theta_{\text{old}})};\alpha,\beta\right) \widehat{A}_{i,t} \right]$$

$$\mathcal{M}(k; \alpha, \beta) = \begin{cases} k & k \in [\alpha, \beta] \\ 0 & \text{otherwise} \end{cases}$$

Tokens whose trainer/inference importance ratio falls outside $[\alpha, \beta]$ get zero policy-gradient weight — they are dropped, not clipped. There is no separate KL penalty — the double-sided ratio mask is what keeps the update inside the trust region.
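The drop-vs-clip distinction can be made concrete with a toy comparison (a sketch; `icepop_mask` and `ppo_clip` are illustrative names, and the PPO-style clip is shown only for contrast with ratio clipping, not as the full PPO surrogate):

```python
# Defaults from this PR: alpha = 0.2, beta = 5.0.
ALPHA, BETA = 0.2, 5.0

def icepop_mask(k, alpha=ALPHA, beta=BETA):
    # IcePop: an out-of-range token contributes nothing (and carries no gradient).
    return k if alpha <= k <= beta else 0.0

def ppo_clip(k, alpha=ALPHA, beta=BETA):
    # For contrast: clipping the ratio would still give the token a capped weight.
    return min(max(k, alpha), beta)

print(icepop_mask(8.0), ppo_clip(8.0))  # 0.0 5.0
print(icepop_mask(1.3), ppo_clip(1.3))  # 1.3 1.3
```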

Breaking config changes (trainer.loss)

| Removed | Added | Default |
| --- | --- | --- |
| `dppo_mask_low` | `icepop_ratio_low` (α) | 0.2 |
| `dppo_mask_high` | `icepop_ratio_high` (β) | 5.0 |
| `kl_tau` | (term removed) | n/a |

adv_tau and teacher_tau are unchanged. mismatch_kl / is_masked* metrics are retained for observability (semantics now ratio-based, not advantage-conditioned).
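A migration sketch for the `trainer.loss` block, assuming a YAML config surface (the file layout is hypothetical; field names and defaults are from the table above, and `...` marks values not stated in this PR):

```yaml
# Before: DPPO + KL
trainer:
  loss:
    dppo_mask_low: ...
    dppo_mask_high: ...
    kl_tau: ...             # removed entirely; no replacement key

# After: IcePop
trainer:
  loss:
    icepop_ratio_low: 0.2   # α: tokens with ratio below this are dropped
    icepop_ratio_high: 5.0  # β: tokens with ratio above this are dropped
    # adv_tau and teacher_tau are unchanged
```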

Notes

  • Token-level mask is detached from autograd: gradient flows through importance_ratio only for in-range tokens.
  • Replaces, rather than adds, a loss type — the user asked for the new behavior to override the existing default.
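The notes above can be sketched together. Here is a minimal NumPy illustration (function and argument names are mine, not the codebase's); in the actual PyTorch trainer the in-range mask would be `.detach()`ed so that gradient flows through the importance ratio only for in-range tokens:

```python
import numpy as np

def icepop_loss(train_logprobs, infer_logprobs, advantages, loss_mask,
                ratio_low=0.2, ratio_high=5.0):
    """IcePop objective sketch; shapes and names are illustrative.

    All arrays have shape (tokens,): per-token log-probs under the trainer
    and inference policies, advantage estimates, and a 0/1 trainable mask.
    """
    # Trainer/inference importance ratio per token.
    ratio = np.exp(train_logprobs - infer_logprobs)

    # Double-sided mask: out-of-range tokens are dropped (zeroed), not
    # clipped. In PyTorch this mask would be detached from autograd.
    in_range = ((ratio >= ratio_low) & (ratio <= ratio_high)).astype(float)

    # Weight each surviving token, normalize by the total trainable token
    # count (the 1 / sum_i |y_i| term), and negate: trainers minimize.
    per_token = in_range * ratio * advantages * loss_mask
    return -per_token.sum() / max(loss_mask.sum(), 1.0)
```

With the defaults above, a token whose ratio is 10 contributes nothing to the update, whereas clipping the ratio would still push it with a capped weight.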

🤖 Generated with Claude Code

samsja and others added 3 commits May 2, 2026 19:08
Override the default RL loss with IcePop from the INTELLECT-3
technical report (https://arxiv.org/abs/2512.16144):

  J(θ) = E[ (1 / Σ|y|) Σ_t M(r_t; α, β) · Â_t ]
  M(k; α, β) = k if k ∈ [α, β] else 0

Tokens whose trainer/inference importance ratio falls outside
[α, β] (defaults: 0.5, 5.0) get zero policy-gradient weight —
dropped, not clipped. Whole rollouts are zeroed when any
trainable token's ratio collapses below `icepop_rollout_min_ratio`
(default: 1e-5), guarding against catastrophic train/infer
divergence. The KL penalty term is gone; the double-sided ratio
mask is what keeps updates inside the trust region.

Breaking config changes:
- removed: `dppo_mask_low`, `dppo_mask_high`, `kl_tau`
- added: `icepop_ratio_low`, `icepop_ratio_high`,
  `icepop_rollout_min_ratio`
- unchanged: `adv_tau`, `teacher_tau`

`mismatch_kl` is retained as an observability metric.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per review: the rollout-level drop was extra; the token-level
double-sided mask is sufficient. Lower α default from 0.5 to 0.2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Existing tests pass `dppo_mask_high` as a kwarg, but pydantic in this
codebase silently ignores unknown fields, so the smoke tests still
construct a valid `DefaultLossConfig` with the new defaults. No need
to touch them in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
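The silent-ignore behavior that commit relies on can be demonstrated with a minimal stand-in model (a sketch; the real `DefaultLossConfig` and its settings live in the codebase):

```python
from pydantic import BaseModel, ConfigDict

class DefaultLossConfig(BaseModel):
    """Minimal stand-in; fields and defaults are the ones from this PR."""
    model_config = ConfigDict(extra="ignore")  # unknown kwargs are dropped

    icepop_ratio_low: float = 0.2
    icepop_ratio_high: float = 5.0

# An old smoke test passing the removed `dppo_mask_high` kwarg still
# constructs a valid config: the unknown field is silently ignored.
cfg = DefaultLossConfig(dppo_mask_high=5.0)
print(cfg.icepop_ratio_low, cfg.icepop_ratio_high)  # 0.2 5.0
```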