feat(trainer): replace DPPO+KL default loss with IcePop #2401
Draft
Override the default RL loss with IcePop from the INTELLECT-3 technical report (https://arxiv.org/abs/2512.16144):

$$J(\theta) = \mathbb{E}\!\left[\frac{1}{\sum|y|} \sum_t M(r_t; \alpha, \beta)\,\hat{A}_t\right], \qquad M(k; \alpha, \beta) = \begin{cases} k, & k \in [\alpha, \beta] \\ 0, & \text{otherwise.} \end{cases}$$

Tokens whose trainer/inference importance ratio falls outside [α, β] (defaults: 0.5, 5.0) get zero policy-gradient weight: they are dropped, not clipped. Whole rollouts are zeroed when any trainable token's ratio collapses below `icepop_rollout_min_ratio` (default: 1e-5), guarding against catastrophic train/infer divergence. The KL penalty term is gone; the double-sided ratio mask is what keeps updates inside the trust region.

Breaking config changes:
- removed: `dppo_mask_low`, `dppo_mask_high`, `kl_tau`
- added: `icepop_ratio_low`, `icepop_ratio_high`, `icepop_rollout_min_ratio`
- unchanged: `adv_tau`, `teacher_tau`

`mismatch_kl` is retained as an observability metric.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
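A minimal numpy sketch of the forward computation for a single rollout (hypothetical function name and tensor layout; the real trainer works on batched torch tensors, with inference log-probs treated as constants so gradients flow only through the trainer log-probs):

```python
import numpy as np

def icepop_loss(trainer_logprobs, inference_logprobs, advantages, mask,
                ratio_low=0.5, ratio_high=5.0, rollout_min_ratio=1e-5):
    """Forward pass of the IcePop objective for one rollout.

    `mask` marks trainable tokens. In training, only trainer_logprobs
    would carry gradients; inference_logprobs are constants."""
    # r_t = pi_trainer(y_t) / pi_inference(y_t)
    ratio = np.exp(np.asarray(trainer_logprobs) - np.asarray(inference_logprobs))
    advantages = np.asarray(advantages)
    mask = np.asarray(mask, dtype=float)
    # Rollout-level guard: zero the whole rollout if any trainable token's
    # ratio collapses below rollout_min_ratio.
    if (ratio[mask > 0] < rollout_min_ratio).any():
        return 0.0
    # Double-sided mask M(r_t; alpha, beta): keep r_t in range, drop otherwise.
    in_range = (ratio >= ratio_low) & (ratio <= ratio_high)
    per_token = ratio * advantages * mask * in_range
    # (1 / sum|y|) * sum_t M(r_t) * A_hat_t, negated for gradient descent
    return -per_token.sum() / max(mask.sum(), 1.0)
```

With the defaults above, a token whose ratio is 10 contributes nothing to the loss rather than being clipped to the boundary value.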
Per review: the rollout-level drop was redundant; the token-level double-sided mask is sufficient. Lowered the α default from 0.5 to 0.2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Existing tests pass `dppo_mask_high` as a kwarg, but pydantic in this codebase silently ignores unknown fields, so the smoke tests still construct a valid `DefaultLossConfig` with the new defaults. No need to touch them in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
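A sketch of why those tests still pass, assuming pydantic's default extra-field policy ("ignore"); the `DefaultLossConfig` here is a hypothetical stand-in carrying only the two new ratio fields, not the codebase's real class:

```python
from pydantic import BaseModel

# Hypothetical stand-in for the trainer's DefaultLossConfig, with the
# post-review defaults.
class DefaultLossConfig(BaseModel):
    icepop_ratio_low: float = 0.2
    icepop_ratio_high: float = 5.0

# Old smoke-test style call: the removed `dppo_mask_high` kwarg is an
# unknown field, which pydantic silently drops under its default policy.
cfg = DefaultLossConfig(dppo_mask_high=10.0)
assert cfg.icepop_ratio_low == 0.2 and cfg.icepop_ratio_high == 5.0
assert not hasattr(cfg, "dppo_mask_high")
```

If the codebase ever switched the model to `extra="forbid"`, the same call would raise a validation error and the tests would need updating.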
Summary
Override the default RL loss with IcePop from the INTELLECT-3 technical report (eq. 1):

$$J(\theta) = \mathbb{E}\!\left[\frac{1}{\sum|y|} \sum_t M(r_t; \alpha, \beta)\,\hat{A}_t\right], \qquad M(k; \alpha, \beta) = \begin{cases} k, & k \in [\alpha, \beta] \\ 0, & \text{otherwise.} \end{cases}$$
Tokens whose trainer/inference importance ratio falls outside $[\alpha, \beta]$ get zero policy-gradient weight; they are dropped, not clipped. There is no separate KL penalty: the double-sided ratio mask is what keeps the update inside the trust region.
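To make "dropped, not clipped" concrete, a toy comparison (pure Python, hypothetical ratio values) of the per-token weight under IcePop masking versus simplified PPO-style ratio clipping (ignoring PPO's min with the unclipped term):

```python
def icepop_weight(r, lo=0.2, hi=5.0):
    # Out-of-range tokens get zero weight: dropped.
    return r if lo <= r <= hi else 0.0

def ppo_clip_weight(r, lo=0.2, hi=5.0):
    # PPO-style clipping pins the ratio to the boundary instead.
    return min(max(r, lo), hi)

for r in (0.05, 1.0, 10.0):
    print(r, icepop_weight(r), ppo_clip_weight(r))
# 0.05 -> dropped (0.0) vs clipped (0.2)
# 1.0  -> identical (1.0)
# 10.0 -> dropped (0.0) vs clipped (5.0)
```

Dropping means a badly diverged token exerts no pull on the update at all, instead of a bounded but still-present one.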
Breaking config changes (`trainer.loss`):

| removed | added | default |
| --- | --- | --- |
| `dppo_mask_low` | `icepop_ratio_low` (α) | 0.2 |
| `dppo_mask_high` | `icepop_ratio_high` (β) | 5.0 |
| `kl_tau` | (no replacement) | |

`adv_tau` and `teacher_tau` are unchanged. The `mismatch_kl` / `is_masked*` metrics are retained for observability (semantics now ratio-based, not advantage-conditioned).

Notes
Gradients flow through `importance_ratio` only for in-range tokens.

🤖 Generated with Claude Code