feat: scope trust_remote_code to Kimi-K2 family with pinned revisions #7
Merged
Conversation
Centralise tokenizer loading through a new `renderers.base.load_tokenizer` helper. Default policy: `trust_remote_code=False`. Opt-in for the Moonshot Kimi-K2 family only, and even then with `revision` pinned to a reviewed SHA, so a future malicious push to the upstream repo cannot auto-propagate to anyone calling `create_renderer_pool`.

Why scoped now: an empirical audit of every model in MODEL_RENDERER_MAP shows that only `moonshotai/Kimi-K2-Instruct`, `Kimi-K2.5`, and `Kimi-K2.6` actually require `trust_remote_code` (their tokenizer config has an `auto_map.AutoTokenizer` entry pointing at `tokenization_kimi.TikTokenTokenizer`, a 353-line tiktoken wrapper shipped in-repo). Every other model (Qwen3/3.5/3.6/3-VL, GLM-5/4.5, MiniMax-M2, DeepSeek-V3, Nemotron-3, GPT-OSS, Qwen2.5) loads cleanly without remote code. The previous unconditional `trust_remote_code=True` in `create_renderer_pool` granted arbitrary-Python-on-`from_pretrained` for every supported model, when only 3 actually need it.

Pinned revisions (current as of 2026-05-07):

- moonshotai/Kimi-K2-Instruct: fd1984e2b7a3350dbf7305fe73a4ede25c14de50
- moonshotai/Kimi-K2.5: 4d01dfe0332d63057c186e0b262165819efb6611
- moonshotai/Kimi-K2.6: 2755962d07cb42aa2d988a35bcb65cd4a9c2de82

Bumping these requires deliberate review of the upstream diff.

Changes:

- renderers/base.py: add `TRUSTED_REVISIONS` allow-list and `load_tokenizer` helper; `create_renderer_pool`'s factory uses it.
- tests/conftest.py, plus every test that loaded tokenizers directly: route through `load_tokenizer`. No more ad-hoc `trust_remote_code=True`.
- tests/test_load_tokenizer.py (new): unit-tests the policy itself: allow-list shape (Kimi-only), revision is a 40-char SHA (no branch names), `AutoTokenizer.from_pretrained` call shape per model class, unknown paths fall through to no-trust, real Qwen + Kimi smoke loads.
- Bump 0.1.6 → 0.1.7.

902 passed, 48 skipped, 1 xfailed locally. No parity regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Centralise tokenizer loading through a new `renderers.base.load_tokenizer` helper. Default: `trust_remote_code=False`. Opt-in only for the Moonshot Kimi-K2 family, and even then with `revision` pinned to a reviewed sha so a future malicious push to the upstream repo cannot auto-propagate to anyone calling `create_renderer_pool`.
Why
Empirical audit of every model in `MODEL_RENDERER_MAP`:

| Requires `trust_remote_code` | Models |
| --- | --- |
| Yes (3 of 32) | moonshotai/Kimi-K2-Instruct, Kimi-K2.5, Kimi-K2.6 |
| No (29 of 32) | Qwen3/3.5/3.6/3-VL, GLM-5/4.5, MiniMax-M2, DeepSeek-V3, Nemotron-3, GPT-OSS, Qwen2.5 |

Only 3 of 32 entries actually need it. The previous unconditional `trust_remote_code=True` in `create_renderer_pool` granted arbitrary-Python-on-`from_pretrained` for every supported model.
The Kimi requirement is real: `tokenizer_config.json` has `auto_map.AutoTokenizer = ["tokenization_kimi.TikTokenTokenizer", null]`, which makes transformers download and `import` a 353-line tiktoken wrapper shipped in-repo. Pinning the `revision` keeps the trust narrow — even with `trust_remote_code=True`, transformers executes the tokenizer Python from that exact commit only.
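For context, the relevant `tokenizer_config.json` entry looks roughly like this (abridged sketch; only the `auto_map.AutoTokenizer` value is quoted from the upstream repo, the surrounding keys are illustrative):

```json
{
  "auto_map": {
    "AutoTokenizer": ["tokenization_kimi.TikTokenTokenizer", null]
  }
}
```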
Pinned revisions (current as of 2026-05-07)
```
moonshotai/Kimi-K2-Instruct: fd1984e2b7a3350dbf7305fe73a4ede25c14de50
moonshotai/Kimi-K2.5: 4d01dfe0332d63057c186e0b262165819efb6611
moonshotai/Kimi-K2.6: 2755962d07cb42aa2d988a35bcb65cd4a9c2de82
```
Bumping requires deliberate review of the upstream diff.
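For reference, a minimal sketch of the helper's shape (the real body in `renderers/base.py` may differ; only the names `TRUSTED_REVISIONS` and `load_tokenizer` and the pinned SHAs are taken from this PR):

```python
from transformers import AutoTokenizer

# Allow-list of repos permitted to run remote tokenizer code,
# each pinned to a reviewed commit SHA.
TRUSTED_REVISIONS = {
    "moonshotai/Kimi-K2-Instruct": "fd1984e2b7a3350dbf7305fe73a4ede25c14de50",
    "moonshotai/Kimi-K2.5": "4d01dfe0332d63057c186e0b262165819efb6611",
    "moonshotai/Kimi-K2.6": "2755962d07cb42aa2d988a35bcb65cd4a9c2de82",
}


def load_tokenizer(model_name_or_path: str, **kwargs):
    """Load a tokenizer; trust remote code only for allow-listed, pinned repos."""
    revision = TRUSTED_REVISIONS.get(model_name_or_path)
    if revision is not None:
        # Kimi-K2 family: remote code is required, but only the reviewed
        # commit's tokenizer Python is ever executed.
        return AutoTokenizer.from_pretrained(
            model_name_or_path,
            revision=revision,
            trust_remote_code=True,
            **kwargs,
        )
    # Everything else gets the safe default.
    return AutoTokenizer.from_pretrained(
        model_name_or_path,
        trust_remote_code=False,
        **kwargs,
    )
```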
Changes

- `renderers/base.py`: add `TRUSTED_REVISIONS` allow-list and `load_tokenizer` helper; `create_renderer_pool`'s factory uses it.
- `tests/conftest.py`, plus every test that loaded tokenizers directly: route through `load_tokenizer`. No more ad-hoc `trust_remote_code=True`.
- `tests/test_load_tokenizer.py` (new): unit-tests the policy itself.
- Bump 0.1.6 → 0.1.7.
Downstream
Test plan

902 passed, 48 skipped, 1 xfailed locally. No parity regressions. `tests/test_load_tokenizer.py` covers the policy itself: allow-list shape (Kimi-only), revision is a 40-char SHA (no branch names), `AutoTokenizer.from_pretrained` call shape per model class, unknown paths fall through to no-trust, plus real Qwen + Kimi smoke loads.
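A sketch of the kind of policy assertions the new test file makes (illustrative only; actual test names and fixtures may differ):

```python
import re

from renderers.base import TRUSTED_REVISIONS


def test_allow_list_is_kimi_only():
    # Only the Moonshot Kimi-K2 family may opt in to remote code.
    assert all(
        path.startswith("moonshotai/Kimi-K2") for path in TRUSTED_REVISIONS
    )


def test_revisions_are_pinned_shas():
    # Full 40-char commit SHAs only; branch names would defeat the pin.
    assert all(
        re.fullmatch(r"[0-9a-f]{40}", rev)
        for rev in TRUSTED_REVISIONS.values()
    )
```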
🤖 Generated with Claude Code