
feat: unify renderer + teacher endpoints onto vLLM 0.20 /inference/v1/generate #2408

Open
hallerite wants to merge 4 commits into main from feat/unify-inference-generate

Conversation


@hallerite hallerite commented May 3, 2026

Summary

vLLM 0.20 ships a generic tokens-in / tokens-out endpoint at /inference/v1/generate (vllm.entrypoints.serve.disagg.serving.ServingTokens) that supersedes the bespoke /v1/generate handler prime-rl maintained on top of vllm 0.19. Replace it.

Net effect: −1 endpoint, ~−275 LoC, no functional change for callers.

What's in the PR

Server side

  • Drop src/prime_rl/inference/vllm/serving_generate.py and the /v1/generate route in server.py — vLLM 0.20's build_app already attaches /inference/v1/generate via attach_disagg_router.
  • Subclass upstream's ServingTokens with PrimeRlServingTokens (sketched below) to preserve two prime-rl features the upstream protocol doesn't natively cover:
    1. data_parallel_rank routing — read from the X-data-parallel-rank header and forwarded to engine_client.generate. The DP-replicated inference servers prime-rl runs need this to target a specific replica.
    2. routed_experts per-token export — surfaced on each choice when the engine is launched with enable_return_routed_experts=True. This is what the trainer's router-replay path consumes.
  • custom_init_app_state swaps the upstream serving_tokens instance for our subclass after init_app_state.
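
A minimal sketch of the subclass's shape (hypothetical: the upstream import path, the serve_tokens signature, and the request/response attributes below are assumptions drawn from the bullets above, not the PR's actual code):

```python
# Hypothetical sketch only: the upstream import path, method name, and the
# request / response attributes below are assumptions taken from the bullets
# above, not a copy of the PR's code.
import base64
from typing import Any, Optional

import numpy as np

try:
    from vllm.entrypoints.serve.disagg.serving import ServingTokens  # vLLM 0.20
except ImportError:  # keep the sketch importable without vLLM installed
    class ServingTokens:  # type: ignore[no-redef]
        async def serve_tokens(self, request: Any, raw_request: Any) -> Any:
            raise NotImplementedError


def _encode_routed_experts(arr: np.ndarray) -> dict[str, Any]:
    # Wire format matching the excerpt quoted in the review thread further down.
    return {"data": base64.b85encode(arr.tobytes()).decode("ascii"),
            "shape": list(arr.shape)}


class PrimeRlServingTokens(ServingTokens):
    """Upstream ServingTokens plus the two prime-rl deltas described above."""

    async def serve_tokens(self, request: Any, raw_request: Any) -> Any:
        # (1) DP-rank routing: pin the request to a specific data-parallel
        # replica when the caller asks for one via the header.
        dp_rank: Optional[str] = raw_request.headers.get("X-data-parallel-rank")
        if dp_rank is not None:
            request.data_parallel_rank = int(dp_rank)  # handed to engine_client.generate

        response = await super().serve_tokens(request, raw_request)

        # (2) routed_experts export: when the engine runs with
        # enable_return_routed_experts=True, surface per-token expert ids on
        # each choice for the trainer's router-replay path.
        for choice in getattr(response, "choices", []) or []:
            experts = getattr(choice, "routed_experts", None)
            if experts is not None:
                choice.routed_experts = _encode_routed_experts(np.asarray(experts))
        return response
```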

Orchestrator side

  • compute_teacher_logprobs in orchestrator/utils.py points at /inference/v1/generate, builds the upstream payload (token_ids + nested sampling_params), and re-flattens prompt_logprobs from the upstream list[dict[token_id, Logprob]] shape back to the list[float] callers expect.
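
As a rough illustration of that conversion (payload field names and the None handling for the first prompt token are assumptions, not the exact compute_teacher_logprobs code):

```python
# Illustrative sketch of the payload build and prompt_logprobs re-flattening
# described above; field names are assumptions, not the PR's exact code.
from typing import Any


def build_generate_payload(token_ids: list[int]) -> dict[str, Any]:
    # Upstream schema: raw token ids plus a nested sampling_params object.
    return {
        "token_ids": token_ids,
        "sampling_params": {
            "max_tokens": 1,       # teacher pass only needs prompt logprobs
            "prompt_logprobs": 0,  # logprob of the actual token at each position
        },
    }


def flatten_prompt_logprobs(token_ids: list[int],
                            prompt_logprobs: list[dict[int, Any] | None]) -> list[float]:
    # Upstream returns list[dict[token_id, Logprob]]; callers expect list[float].
    flat: list[float] = []
    for tok, entry in zip(token_ids, prompt_logprobs):
        if entry is None:
            flat.append(0.0)  # first prompt token has no logprob (assumption)
            continue
        lp = entry[tok]
        flat.append(lp.logprob if hasattr(lp, "logprob") else float(lp))
    return flat
```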

Tests

  • Replace test_serving_generate.py (class deleted) with test_serving_tokens.py — exercises the prime-rl deltas (routed_experts encoding, response shape stability).
  • Update test_teacher_logprobs.py to expect the new endpoint URL, payload shape, and response unwrap.

Renderers / verifiers pins

  • Pin renderers to PyPI ==0.1.6 (the first release after the renderers/verifiers monorepo split — same code as 9acdc60, just shipped as a published wheel) and declare it as a direct prime-rl dependency. Previously transitively pulled via verifiers' in-tree workspace package.
  • Bump verifiers to 7bdc769 to pick up the post-split main. The renderers release carries the matching client-side switch to /inference/v1/generate.

Test plan

  • tests/unit/inference/test_serving_tokens.py (4 tests) and tests/unit/orchestrator/test_teacher_logprobs.py (1 test) pass locally against vllm 0.20.
  • Server module imports cleanly (prime_rl.inference.vllm.server, prime_rl.inference.vllm.serving_tokens).
  • Companion verifiers tests (packages/renderers/tests/test_client.py) green.
  • uv.lock regenerated against renderers==0.1.6 (PyPI) and verifiers 7bdc769.
  • E2E text-only renderer + TITO rollouts against a live vllm 0.20 server. configs/multi_reverse_text/rl.toml × 20 steps on 2× RTX PRO 6000. Renderer run: 2688 calls to /inference/v1/generate, all steps green, eval Avg@4=0.85. TITO run: 20/20 steps green, eval Avg@4=0.85 (reverse-text is single-turn so the TITO client falls back to MITO /v1/chat/completions per openai_chat_completions_token_client.py:114 — exercises the routing flag but not the /tokens wire path).

Out of scope (follow-ups)

  • VLM rollouts via the renderer. The new endpoint handles MM features end-to-end (it accepts pre-built MultiModalFeatures), but validate_renderer_vs_vlm in configs/orchestrator.py still blocks the combination. Lifting the ban needs the renderer client to build features client-side (HF processor → MultiModalKwargsItem → base64 msgpack). Cleaner as a separate PR after this lands.
  • mismatch_kl is unrelated. The drift identified in MM_KL_INVESTIGATION_SUMMARY.md lives in vLLM's bf16 forward kernel, not the wire protocol — this migration neither helps nor hurts it.

🤖 Generated with Claude Code


Note

Medium Risk
Swaps the token-level generation API from a custom /v1/generate implementation to vLLM 0.20’s /inference/v1/generate, which can break callers or response expectations if any edge cases differ. Adds a custom ServingTokens subclass that intercepts headers and post-processes outputs, so integration correctness depends on matching vLLM’s evolving protocol.

Overview
Moves token-level generation off the legacy /v1/generate endpoint onto vLLM 0.20’s /inference/v1/generate. The bespoke serving_generate.py handler and its route are removed, and server.py now swaps vLLM’s serving_tokens instance to a new PrimeRlServingTokens wrapper during app init.

PrimeRlServingTokens preserves prime-RL-specific behavior on the new endpoint: forwards X-data-parallel-rank into engine_client.generate, re-exports per-token routed_experts in responses, and applies server-side defaulting for sampling_params.max_tokens when omitted (avoiding vLLM’s 16-token default).

The orchestrator’s compute_teacher_logprobs is updated to call /inference/v1/generate with the new request schema (token_ids + nested sampling_params) and to re-flatten upstream prompt_logprobs back into the legacy list-of-floats shape expected by callers. Tests are updated accordingly (replace test_serving_generate.py with test_serving_tokens.py, adjust teacher logprobs expectations), and dependencies are updated by pinning renderers==0.1.6 (PyPI) and bumping verifiers to a new git rev with lockfile refresh.

Reviewed by Cursor Bugbot for commit 913cc4c. Bugbot is set up for automated code reviews on this repo. Configure here.

hallerite added a commit to PrimeIntellect-ai/verifiers that referenced this pull request May 4, 2026
…package

Now that renderers lives in its own repo
(https://github.com/PrimeIntellect-ai/renderers), pin the verifiers dep
directly at PrimeIntellect-ai/renderers#1's head (40bc2a6 — the lean
``generate()`` rewrite) and remove ``packages/renderers/`` from the
verifiers tree.

This also drops the ``uv pip install -e packages/renderers`` CI hack
introduced in c969123 — no longer needed once renderers resolves
through ``[tool.uv.sources]``.

Bump the version constraints to ``renderers>=0.1.6``. Once renderers
v0.1.6 publishes to PyPI, drop ``[tool.uv.sources]`` and let the
constraint resolve from the trusted publisher.

Companion to:
  - PrimeIntellect-ai/renderers#1 (lean ``generate()`` rewrite)
  - PrimeIntellect-ai/prime-rl#2408 (consumer migration)
@hallerite hallerite marked this pull request as ready for review May 5, 2026 10:46
Comment thread src/prime_rl/inference/vllm/serving_tokens.py
return {
    "data": base64.b85encode(arr.tobytes()).decode("ascii"),
    "shape": list(arr.shape),
}

Duplicated _encode_routed_experts logic across two files

Low Severity

The standalone _encode_routed_experts function in serving_tokens.py has identical logic to the instance method _RoutedExpertsCapture._encode_routed_experts in serving_chat_with_tokens.py. Both base85-encode a numpy array and return a {"data": ..., "shape": ...} dict. One shared utility function could serve both callers, reducing the risk of inconsistent fixes if the encoding format ever changes.
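
Something along these lines could serve as the shared helper (sketch only; the module placement and the decode dtype are assumptions, and the encode body mirrors the excerpt quoted above):

```python
# Sketch of a shared utility (hypothetical module, e.g. routed_experts_codec.py);
# the encode body mirrors the excerpt quoted above, the decode dtype is assumed.
import base64
from typing import Any

import numpy as np


def encode_routed_experts(arr: np.ndarray) -> dict[str, Any]:
    """Base85-encode a routed-experts array plus its shape for the wire."""
    return {
        "data": base64.b85encode(arr.tobytes()).decode("ascii"),
        "shape": list(arr.shape),
    }


def decode_routed_experts(payload: dict[str, Any], dtype=np.int32) -> np.ndarray:
    """Inverse of encode_routed_experts (dtype must match the producer)."""
    raw = base64.b85decode(payload["data"])
    return np.frombuffer(raw, dtype=dtype).reshape(payload["shape"])
```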


Reviewed by Cursor Bugbot for commit e16f639. Configure here.

hallerite added 2 commits May 7, 2026 14:15
…/generate

vLLM 0.20 ships a tokens-in / tokens-out endpoint at /inference/v1/generate
(disagg/serving.py) that supersedes the bespoke /v1/generate handler
prime-rl shipped on top of vllm 0.19. Replace it.

Server side:
- Drop src/prime_rl/inference/vllm/serving_generate.py and the /v1/generate
  route in server.py — vLLM 0.20's build_app already attaches
  /inference/v1/generate via attach_disagg_router.
- Subclass upstream's ServingTokens with PrimeRlServingTokens to preserve
  two prime-rl features the upstream protocol doesn't natively cover:
    1. data_parallel_rank routing — read from the X-data-parallel-rank
       header and forwarded to engine_client.generate.
    2. routed_experts per-token export — surfaced on each choice when
       the engine is launched with enable_return_routed_experts=True.
  custom_init_app_state swaps the upstream serving_tokens instance for our
  subclass.

Orchestrator side:
- compute_teacher_logprobs in orchestrator/utils.py points at
  /inference/v1/generate, builds the upstream payload (token_ids +
  nested sampling_params), and re-flattens prompt_logprobs from the
  upstream list[dict[token_id, Logprob]] shape back to the list[float]
  callers expect.

Tests:
- Replace test_serving_generate.py (class deleted) with
  test_serving_tokens.py — exercises the prime-rl deltas
  (routed_experts encoding, response shape stability).
- Update test_teacher_logprobs.py to expect the new endpoint URL,
  payload shape, and response unwrap.

Renderers pin:
- Bump renderers source to 9c0b738e on the verifiers repo so the
  client-side switch to /inference/v1/generate ships together.

Net: -1 endpoint, ~-275 LoC, no functional change for callers (renderer
client emits the same parsed response shape; teacher logprobs return
identical list[float]).
…to 7bdc769

Renderers moved out of the verifiers monorepo into their own repo
(verifiers#1282). Repoint the source from verifiers/packages/renderers
to PrimeIntellect-ai/renderers @ 9acdc60 and declare renderers as a
direct prime-rl dependency since it was previously transitively pulled
via verifiers' in-tree workspace package. Bump verifiers to 7bdc769 to
pick up the post-split main.

Pairs with the /inference/v1/generate switch — the renderer client at
9acdc60 emits the new endpoint shape.
@hallerite hallerite force-pushed the feat/unify-inference-generate branch from 794a588 to 0f16ddc Compare May 7, 2026 14:22
@hallerite hallerite changed the base branch from feat/vllm-0.20-cu13 to main May 7, 2026 14:22
hallerite and others added 2 commits May 7, 2026 14:30
Renderers 0.1.6 was published on PyPI today (commit 9acdc60 + version
bump). Switch from the git rev source to the canonical PyPI release —
keeps the same code (==0.1.6) but avoids depending on the renderers
git repo at install time.

Keeps `renderers = false` in `[tool.uv.exclude-newer-package]` since
0.1.6 is inside the 7-day cooldown window.
…generate

vLLM 0.20's ServingTokens hands the client-supplied SamplingParams to the
engine verbatim. SamplingParams.max_tokens defaults to 16 (a dataclass-level
default that predates the OpenAI-compat layer), so any caller that omits the
field gets a 16-token completion — long enough to start a sentence and stop
mid-word.

Other vLLM endpoints (/v1/chat/completions, /v1/completions, /v1/responses)
all mask this server-side via vllm.entrypoints.utils.get_max_tokens, which
falls back to max_model_len - prompt_len. The disagg endpoint skips that
path. Mirror it inside PrimeRlServingTokens so callers don't need a
client-side workaround.

Detection: re-read the cached request body to tell "client sent
max_tokens=16" from "client sent nothing → SamplingParams default 16".
Pessimistic on read failures (assume the client did set it).

Drop once vLLM patches upstream.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
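
A sketch of what that detection and defaulting could look like (request/body field names and the fallback formula follow the commit text above; none of this is the verbatim 913cc4c diff):

```python
# Sketch only: request/body field names and the fallback formula follow the
# commit message above, not the actual patch.
import json
from typing import Any


async def apply_max_tokens_default(request: Any, raw_request: Any,
                                   max_model_len: int) -> None:
    # SamplingParams.max_tokens defaults to 16, so a parsed request can't tell
    # "client asked for 16" from "client omitted the field". Re-read the cached
    # body to disambiguate; on any read failure assume the client did set it.
    try:
        body = json.loads(await raw_request.body())
        client_set_it = "max_tokens" in body.get("sampling_params", {})
    except Exception:
        client_set_it = True  # pessimistic: never override an explicit value

    if not client_set_it:
        # Mirror the other endpoints: fall back to max_model_len - prompt_len.
        prompt_len = len(request.token_ids)
        request.sampling_params.max_tokens = max_model_len - prompt_len
```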

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).



Reviewed by Cursor Bugbot for commit 913cc4c. Configure here.

    async for res in result_generator:
        final_res = res
except asyncio.CancelledError:
    return self.create_error_response("Client disconnected")

Missing engine abort on client disconnect

Medium Severity

When a client disconnects mid-generation, serve_tokens_full_generator catches asyncio.CancelledError and returns an error response, but never calls engine_client.abort(request_id). The deleted serving_generate.py explicitly called await self.engine_client.abort(request_id) before re-raising, ensuring the engine stopped processing the request. Without the abort, the inference engine may continue consuming GPU compute for requests whose clients are long gone, which compounds under high concurrency (2k+ simultaneous rollouts).
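
A self-contained sketch of the suggested fix (engine_client.abort(request_id) is the call the deleted handler made; the surrounding names are illustrative):

```python
# Sketch of the suggested fix; engine_client.abort(request_id) is the call the
# deleted serving_generate.py made, the surrounding names are illustrative.
import asyncio
from typing import Any, AsyncIterator, Callable


async def drain_with_abort(result_generator: AsyncIterator[Any], request_id: str,
                           engine_client: Any,
                           create_error_response: Callable[[str], Any]) -> Any:
    final_res = None
    try:
        async for res in result_generator:
            final_res = res
    except asyncio.CancelledError:
        # Tell the engine to stop generating for a client that is gone, so the
        # GPU isn't spent on abandoned requests under high concurrency.
        await engine_client.abort(request_id)
        return create_error_response("Client disconnected")
    return final_res
```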


Reviewed by Cursor Bugbot for commit 913cc4c. Configure here.

hallerite added a commit to PrimeIntellect-ai/verifiers that referenced this pull request May 8, 2026
…rompt_len"

This reverts commit 831f8bc.

The fix moved server-side: prime-rl's PrimeRlServingTokens now applies
get_max_tokens() defaulting in serve_tokens (PrimeIntellect-ai/prime-rl#2408,
commit 913cc4ca), matching every other vLLM endpoint. The client-side
workaround was always a band-aid and is no longer needed for prime-rl
deployments. Other vLLM 0.20 deployments hitting /inference/v1/generate
still need the upstream fix or to apply the prime-rl override locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
