feat: unify renderer + teacher endpoints onto vLLM 0.20 /inference/v1/generate #2408
Conversation
…package

Now that renderers lives in its own repo (https://github.com/PrimeIntellect-ai/renderers), pin verifiers' renderers dep directly at PrimeIntellect-ai/renderers#1's head (40bc2a6 — the lean `generate()` rewrite) and remove `packages/renderers/` from the verifiers tree. This also drops the `uv pip install -e packages/renderers` CI hack introduced in c969123, which is no longer needed once renderers resolves through `[tool.uv.sources]`. Bump the version constraint to `renderers>=0.1.6`. Once renderers v0.1.6 publishes to PyPI, drop `[tool.uv.sources]` and let the constraint resolve from the trusted publisher.

Companion to:
- PrimeIntellect-ai/renderers#1 (lean `generate()` rewrite)
- PrimeIntellect-ai/prime-rl#2408 (consumer migration)
```python
return {
    "data": base64.b85encode(arr.tobytes()).decode("ascii"),
    "shape": list(arr.shape),
}
```
Duplicated _encode_routed_experts logic across two files
Low Severity
The standalone _encode_routed_experts function in serving_tokens.py has identical logic to the instance method _RoutedExpertsCapture._encode_routed_experts in serving_chat_with_tokens.py. Both base85-encode a numpy array and return a {"data": ..., "shape": ...} dict. One shared utility function could serve both callers, reducing the risk of inconsistent fixes if the encoding format ever changes.
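A minimal sketch of the shared helper the reviewer suggests; the encoding body is copied from the snippet above, while the module placement and function name are illustrative:

```python
import base64

import numpy as np


def encode_routed_experts(arr: np.ndarray) -> dict:
    """Base85-encode a routed-experts array alongside its shape so the
    decoder can reconstruct it. A single copy of this could serve both
    serving_tokens.py and serving_chat_with_tokens.py."""
    return {
        "data": base64.b85encode(arr.tobytes()).decode("ascii"),
        "shape": list(arr.shape),
    }
```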
Reviewed by Cursor Bugbot for commit e16f639.
…/generate
vLLM 0.20 ships a tokens-in / tokens-out endpoint at /inference/v1/generate
(disagg/serving.py) that supersedes the bespoke /v1/generate handler
prime-rl shipped on top of vLLM 0.19. Replace it.
Server side:
- Drop src/prime_rl/inference/vllm/serving_generate.py and the /v1/generate
route in server.py — vLLM 0.20's build_app already attaches
/inference/v1/generate via attach_disagg_router.
- Subclass upstream's ServingTokens with PrimeRlServingTokens to preserve
two prime-rl features the upstream protocol doesn't natively cover:
1. data_parallel_rank routing — read from the X-data-parallel-rank
header and forwarded to engine_client.generate (parsing sketched below).
2. routed_experts per-token export — surfaced on each choice when
the engine is launched with enable_return_routed_experts=True.
custom_init_app_state swaps the upstream serving_tokens instance for our
subclass.
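As an aside on feature (1), the header-reading half could look like the sketch below. `Request` is FastAPI's; the helper name is hypothetical, and the real logic lives inside `PrimeRlServingTokens`:

```python
from typing import Optional

from fastapi import Request


def data_parallel_rank_from_headers(raw_request: Request) -> Optional[int]:
    """Pull the target replica index out of the X-data-parallel-rank header.

    Returns None when the client didn't pin a replica, letting the engine
    pick freely; a parsed int is what gets forwarded to
    engine_client.generate, per the description above.
    """
    value = raw_request.headers.get("X-data-parallel-rank")
    return int(value) if value is not None else None
```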
Orchestrator side:
- compute_teacher_logprobs in orchestrator/utils.py points at
/inference/v1/generate, builds the upstream payload (token_ids +
nested sampling_params), and re-flattens prompt_logprobs from the
upstream list[dict[token_id, Logprob]] shape back to the list[float]
callers expect.
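A sketch of that re-flattening, assuming the upstream JSON maps each prompt position to a `{token_id: {"logprob": ...}}` dict with `None` for the first position; the exact wire schema and the placeholder for the first token are assumptions:

```python
from typing import Optional


def flatten_prompt_logprobs(
    prompt_token_ids: list[int],
    prompt_logprobs: list[Optional[dict]],
) -> list[float]:
    """Collapse list[dict[token_id, Logprob]] to the flat list[float]
    orchestrator callers expect."""
    flat: list[float] = []
    for token_id, per_position in zip(prompt_token_ids, prompt_logprobs):
        if per_position is None:
            # vLLM emits None for the first prompt token (nothing to condition on).
            flat.append(0.0)  # placeholder convention; actual handling is an assumption
            continue
        # JSON object keys arrive as strings; in-process dataclasses keep ints.
        # vLLM's prompt_logprobs dicts include the actual prompt token itself.
        entry = per_position.get(token_id) or per_position[str(token_id)]
        flat.append(float(entry["logprob"]))
    return flat
```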
Tests:
- Replace test_serving_generate.py (class deleted) with
test_serving_tokens.py — exercises the prime-rl deltas
(routed_experts encoding, response shape stability).
- Update test_teacher_logprobs.py to expect the new endpoint URL,
payload shape, and response unwrap.
Renderers pin:
- Bump renderers source to 9c0b738e on the verifiers repo so the
client-side switch to /inference/v1/generate ships together.
Net: -1 endpoint, ~-275 LoC, no functional change for callers (renderer
client emits the same parsed response shape; teacher logprobs return
identical list[float]).
…to 7bdc769

Renderers moved out of the verifiers monorepo into their own repo (verifiers#1282). Repoint the source from verifiers/packages/renderers to PrimeIntellect-ai/renderers @ 9acdc60 and declare renderers as a direct prime-rl dependency, since it was previously pulled transitively via verifiers' in-tree workspace package. Bump verifiers to 7bdc769 to pick up the post-split main.

Pairs with the /inference/v1/generate switch — the renderer client at 9acdc60 emits the new endpoint shape.
Renderers 0.1.6 was published on PyPI today (commit 9acdc60 + version bump). Switch from the git rev source to the canonical PyPI release — keeps the same code (==0.1.6) but avoids depending on the renderers git repo at install time. Keeps `renderers = false` in `[tool.uv.exclude-newer-package]` since 0.1.6 is inside the 7-day cooldown window.
…generate

vLLM 0.20's ServingTokens hands the client-supplied SamplingParams to the engine verbatim. SamplingParams.max_tokens defaults to 16 (a dataclass-level default that predates the OpenAI-compat layer), so any caller that omits the field gets a 16-token completion — long enough to start a sentence and stop mid-word.

Other vLLM endpoints (/v1/chat/completions, /v1/completions, /v1/responses) all mask this server-side via vllm.entrypoints.utils.get_max_tokens, which falls back to max_model_len - prompt_len. The disagg endpoint skips that path. Mirror it inside PrimeRlServingTokens so callers don't need a client-side workaround.

Detection: re-read the cached request body to tell "client sent max_tokens=16" apart from "client sent nothing → SamplingParams default 16". Pessimistic on read failures (assume the client did set it). Drop once vLLM patches upstream.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
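A sketch of the two pieces this commit describes: the body re-read detection and the get_max_tokens-style fallback. Function names and the body shape are assumptions; the real implementation lives in PrimeRlServingTokens:

```python
import json
from typing import Optional


def client_sent_max_tokens(raw_body: bytes) -> bool:
    """Re-read the cached request body to distinguish an explicit
    max_tokens=16 from the SamplingParams dataclass default of 16.
    Pessimistic on parse failures: assume the client did set it."""
    try:
        payload = json.loads(raw_body)
    except (ValueError, UnicodeDecodeError):
        return True
    return "max_tokens" in payload.get("sampling_params", {})


def effective_max_tokens(
    explicit: Optional[int], max_model_len: int, prompt_len: int
) -> int:
    """Mirror the fallback the other endpoints get from get_max_tokens:
    when the client omitted the field, allow the remaining context budget."""
    if explicit is not None:
        return explicit  # honor what the client actually asked for
    return max(0, max_model_len - prompt_len)
```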
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 913cc4c.
```python
try:
    async for res in result_generator:
        final_res = res
except asyncio.CancelledError:
    return self.create_error_response("Client disconnected")
```
Missing engine abort on client disconnect
Medium Severity
When a client disconnects mid-generation, serve_tokens_full_generator catches asyncio.CancelledError and returns an error response, but never calls engine_client.abort(request_id). The deleted serving_generate.py explicitly called await self.engine_client.abort(request_id) before re-raising, ensuring the engine stopped processing the request. Without the abort, the inference engine may continue consuming GPU compute for requests whose clients are long gone, which compounds under high concurrency (2k+ simultaneous rollouts).
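A sketch of the fix, following the abort-before-bailing pattern the comment attributes to the deleted serving_generate.py; the surrounding function shape is hypothetical:

```python
import asyncio


async def drain_with_abort(result_generator, engine_client, request_id: str):
    """Consume the engine's stream; on client disconnect, tell the engine
    to stop before surfacing the error, so GPU time isn't spent on
    requests whose clients are long gone."""
    final_res = None
    try:
        async for res in result_generator:
            final_res = res
    except asyncio.CancelledError:
        await engine_client.abort(request_id)  # free the engine first
        raise  # caller converts this into the "Client disconnected" response
    return final_res
```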
Reviewed by Cursor Bugbot for commit 913cc4c.
…rompt_len" This reverts commit 831f8bc. The fix moved server-side: prime-rl's PrimeRlServingTokens now applies get_max_tokens() defaulting in serve_tokens (PrimeIntellect-ai/prime-rl#2408, commit 913cc4ca), matching every other vLLM endpoint. The client-side workaround was always a band-aid and is no longer needed for prime-rl deployments. Other vLLM 0.20 deployments hitting /inference/v1/generate still need the upstream fix or to apply the prime-rl override locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Summary
vLLM 0.20 ships a generic tokens-in / tokens-out endpoint at `/inference/v1/generate` (`vllm.entrypoints.serve.disagg.serving.ServingTokens`) that supersedes the bespoke `/v1/generate` handler prime-rl maintained on top of vLLM 0.19. Replace it.

Net effect: −1 endpoint, ~−275 LoC, no functional change for callers.
What's in the PR
Server side
- Drop `src/prime_rl/inference/vllm/serving_generate.py` and the `/v1/generate` route in `server.py` — vLLM 0.20's `build_app` already attaches `/inference/v1/generate` via `attach_disagg_router`.
- Subclass upstream's `ServingTokens` with `PrimeRlServingTokens` to preserve two prime-rl features the upstream protocol doesn't natively cover:
  1. `data_parallel_rank` routing — read from the `X-data-parallel-rank` header and forwarded to `engine_client.generate`. The DP-replicated inference servers prime-rl runs need this to target a specific replica.
  2. `routed_experts` per-token export — surfaced on each choice when the engine is launched with `enable_return_routed_experts=True`. This is what the trainer's router-replay path consumes.
- `custom_init_app_state` swaps the upstream `serving_tokens` instance for our subclass after `init_app_state` (see the sketch below).
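A minimal sketch of that swap, assuming vLLM parks the handler on `app.state.serving_tokens`; the attribute name and factory shape are assumptions about vLLM 0.20 internals:

```python
def swap_serving_tokens(app_state, build_subclass) -> None:
    """Replace the stock ServingTokens instance with the prime-rl subclass,
    reusing whatever the upstream instance already holds (engine client,
    model config, tokenizer)."""
    upstream = app_state.serving_tokens
    app_state.serving_tokens = build_subclass(upstream)
```

`custom_init_app_state` would call something like this right after vLLM's own `init_app_state`, passing a factory that builds `PrimeRlServingTokens` from the upstream instance.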
Orchestrator side

- `compute_teacher_logprobs` in `orchestrator/utils.py` points at `/inference/v1/generate`, builds the upstream payload (`token_ids` + nested `sampling_params`), and re-flattens `prompt_logprobs` from the upstream `list[dict[token_id, Logprob]]` shape back to the `list[float]` callers expect (example payload below).
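For concreteness, the request body would look roughly like this; the token values are illustrative, and the exact `sampling_params` fields the teacher pass sets are assumptions:

```python
# Hypothetical payload compute_teacher_logprobs POSTs to /inference/v1/generate.
payload = {
    "token_ids": [151644, 872, 198],  # pre-tokenized prompt (illustrative ids)
    "sampling_params": {              # nested, per the upstream schema
        "max_tokens": 1,              # assumption: the teacher pass only scores the prompt
        "prompt_logprobs": 0,         # assumption: logprob of the actual token only
        "temperature": 1.0,
    },
}
```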
Tests

- Replace `test_serving_generate.py` (class deleted) with `test_serving_tokens.py` — exercises the prime-rl deltas (`routed_experts` encoding, response shape stability).
- Update `test_teacher_logprobs.py` to expect the new endpoint URL, payload shape, and response unwrap.
Renderers / verifiers pins

- Pin `renderers` to PyPI `==0.1.6` (the first release after the renderers/verifiers monorepo split — same code as `9acdc60`, just shipped as a published wheel) and declare it as a direct prime-rl dependency; it was previously pulled transitively via verifiers' in-tree workspace package.
- Bump `verifiers` to `7bdc769` to pick up the post-split main. The renderers release carries the matching client-side switch to `/inference/v1/generate`.
Test plan

- `tests/unit/inference/test_serving_tokens.py` (4 tests) and `tests/unit/orchestrator/test_teacher_logprobs.py` (1 test) pass locally against vLLM 0.20.
- … (`prime_rl.inference.vllm.server`, `prime_rl.inference.vllm.serving_tokens`).
- … (`packages/renderers/tests/test_client.py`) green.
- `uv.lock` regenerated against `renderers==0.1.6` (PyPI) and verifiers `7bdc769`.
- `configs/multi_reverse_text/rl.toml` × 20 steps on 2× RTX PRO 6000. Renderer run: 2688 calls to `/inference/v1/generate`, all steps green, eval Avg@4=0.85. TITO run: 20/20 steps green, eval Avg@4=0.85 (`reverse-text` is single-turn, so the TITO client falls back to MITO `/v1/chat/completions` per `openai_chat_completions_token_client.py:114` — exercises the routing flag but not the `/tokens` wire path).
Out of scope (follow-ups)

- … (`MultiModalFeatures`), but `validate_renderer_vs_vlm` in `configs/orchestrator.py` still blocks the combination. Lifting the ban needs the renderer client to build features client-side (HF processor → `MultiModalKwargsItem` → base64 msgpack). Cleaner as a separate PR after this lands.
- `mismatch_kl` is unrelated. The drift identified in `MM_KL_INVESTIGATION_SUMMARY.md` lives in vLLM's bf16 forward kernel, not the wire protocol — this migration neither helps nor hurts it.

🤖 Generated with Claude Code
Note
Medium Risk
Swaps the token-level generation API from a custom `/v1/generate` implementation to vLLM 0.20's `/inference/v1/generate`, which can break callers or response expectations if any edge cases differ. Adds a custom `ServingTokens` subclass that intercepts headers and post-processes outputs, so integration correctness depends on matching vLLM's evolving protocol.

Overview
Moves token-level generation off the legacy `/v1/generate` endpoint onto vLLM 0.20's `/inference/v1/generate`. The bespoke `serving_generate.py` handler and its route are removed, and `server.py` now swaps vLLM's `serving_tokens` instance to a new `PrimeRlServingTokens` wrapper during app init. `PrimeRlServingTokens` preserves prime-rl-specific behavior on the new endpoint: forwards `X-data-parallel-rank` into `engine_client.generate`, re-exports per-token `routed_experts` in responses, and applies server-side defaulting for `sampling_params.max_tokens` when omitted (avoiding vLLM's 16-token default).

The orchestrator's `compute_teacher_logprobs` is updated to call `/inference/v1/generate` with the new request schema (`token_ids` + nested `sampling_params`) and to re-flatten upstream `prompt_logprobs` back into the legacy list-of-floats shape expected by callers. Tests are updated accordingly (replace `test_serving_generate.py` with `test_serving_tokens.py`, adjust teacher logprobs expectations), and dependencies are updated by pinning `renderers==0.1.6` (PyPI) and bumping `verifiers` to a new git rev with a lockfile refresh.

Reviewed by Cursor Bugbot for commit 913cc4c. Bugbot is set up for automated code reviews on this repo.