
feat: unify renderer + teacher endpoints onto vLLM 0.20 /inference/v1/generate #2408

Open
hallerite wants to merge 4 commits into main from feat/unify-inference-generate

Conversation


@hallerite hallerite commented May 3, 2026

Summary

vLLM 0.20 ships a generic tokens-in / tokens-out endpoint at /inference/v1/generate (vllm.entrypoints.serve.disagg.serving.ServingTokens) that supersedes the bespoke /v1/generate handler prime-rl maintained on top of vllm 0.19. Replace it.

Net effect: −1 endpoint, ~−275 LoC, no functional change for callers.

What's in the PR

Server side

  • Drop src/prime_rl/inference/vllm/serving_generate.py and the /v1/generate route in server.py — vLLM 0.20's build_app already attaches /inference/v1/generate via attach_disagg_router.
  • Subclass upstream's ServingTokens with PrimeRlServingTokens (sketched below) to preserve two prime-rl features the upstream protocol doesn't natively cover:
    1. data_parallel_rank routing — read from the X-data-parallel-rank header and forwarded to engine_client.generate. The DP-replicated inference servers prime-rl runs need this to target a specific replica.
    2. routed_experts per-token export — surfaced on each choice when the engine is launched with enable_return_routed_experts=True. This is what the trainer's router-replay path consumes.
  • custom_init_app_state swaps the upstream serving_tokens instance for our subclass after init_app_state.
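
A minimal sketch of the subclass's shape (hypothetical: the upstream import path, the serve_tokens signature, and the request/response attributes below are assumptions drawn from the bullets above, not the PR's actual code):

```python
# Hypothetical sketch only: the upstream import path, method name, and the
# request / response attributes below are assumptions taken from the bullets
# above, not a copy of the PR's code.
import base64
from typing import Any, Optional

import numpy as np

try:
    from vllm.entrypoints.serve.disagg.serving import ServingTokens  # vLLM 0.20
except ImportError:  # keep the sketch importable without vLLM installed
    class ServingTokens:  # type: ignore[no-redef]
        async def serve_tokens(self, request: Any, raw_request: Any) -> Any:
            raise NotImplementedError


def _encode_routed_experts(arr: np.ndarray) -> dict[str, Any]:
    # Wire format matching the excerpt quoted in the review thread further down.
    return {"data": base64.b85encode(arr.tobytes()).decode("ascii"),
            "shape": list(arr.shape)}


class PrimeRlServingTokens(ServingTokens):
    """Upstream ServingTokens plus the two prime-rl deltas described above."""

    async def serve_tokens(self, request: Any, raw_request: Any) -> Any:
        # (1) DP-rank routing: pin the request to a specific data-parallel
        # replica when the caller asks for one via the header.
        dp_rank: Optional[str] = raw_request.headers.get("X-data-parallel-rank")
        if dp_rank is not None:
            request.data_parallel_rank = int(dp_rank)  # handed to engine_client.generate

        response = await super().serve_tokens(request, raw_request)

        # (2) routed_experts export: when the engine runs with
        # enable_return_routed_experts=True, surface per-token expert ids on
        # each choice for the trainer's router-replay path.
        for choice in getattr(response, "choices", []) or []:
            experts = getattr(choice, "routed_experts", None)
            if experts is not None:
                choice.routed_experts = _encode_routed_experts(np.asarray(experts))
        return response
```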

Orchestrator side

  • compute_teacher_logprobs in orchestrator/utils.py points at /inference/v1/generate, builds the upstream payload (token_ids + nested sampling_params), and re-flattens prompt_logprobs from the upstream list[dict[token_id, Logprob]] shape back to the list[float] callers expect.
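
As a rough illustration of that conversion (payload field names and the None handling for the first prompt token are assumptions, not the exact compute_teacher_logprobs code):

```python
# Illustrative sketch of the payload build and prompt_logprobs re-flattening
# described above; field names are assumptions, not the PR's exact code.
from typing import Any


def build_generate_payload(token_ids: list[int]) -> dict[str, Any]:
    # Upstream schema: raw token ids plus a nested sampling_params object.
    return {
        "token_ids": token_ids,
        "sampling_params": {
            "max_tokens": 1,       # teacher pass only needs prompt logprobs
            "prompt_logprobs": 0,  # logprob of the actual token at each position
        },
    }


def flatten_prompt_logprobs(token_ids: list[int],
                            prompt_logprobs: list[dict[int, Any] | None]) -> list[float]:
    # Upstream returns list[dict[token_id, Logprob]]; callers expect list[float].
    flat: list[float] = []
    for tok, entry in zip(token_ids, prompt_logprobs):
        if entry is None:
            flat.append(0.0)  # first prompt token has no logprob (assumption)
            continue
        lp = entry[tok]
        flat.append(lp.logprob if hasattr(lp, "logprob") else float(lp))
    return flat
```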

Tests

  • Replace test_serving_generate.py (class deleted) with test_serving_tokens.py — exercises the prime-rl deltas (routed_experts encoding, response shape stability).
  • Update test_teacher_logprobs.py to expect the new endpoint URL, payload shape, and response unwrap.

Renderers / verifiers pins

  • Pin renderers to PyPI ==0.1.6 (the first release after the renderers/verifiers monorepo split — same code as 9acdc60, just shipped as a published wheel) and declare it as a direct prime-rl dependency. Previously transitively pulled via verifiers' in-tree workspace package.
  • Bump verifiers to 7bdc769 to pick up the post-split main. The renderers release carries the matching client-side switch to /inference/v1/generate.

Test plan

  • tests/unit/inference/test_serving_tokens.py (4 tests) and tests/unit/orchestrator/test_teacher_logprobs.py (1 test) pass locally against vllm 0.20.
  • Server module imports cleanly (prime_rl.inference.vllm.server, prime_rl.inference.vllm.serving_tokens).
  • Companion verifiers tests (packages/renderers/tests/test_client.py) green.
  • uv.lock regenerated against renderers==0.1.6 (PyPI) and verifiers 7bdc769.
  • E2E text-only renderer + TITO rollouts against a live vllm 0.20 server. configs/multi_reverse_text/rl.toml × 20 steps on 2× RTX PRO 6000. Renderer run: 2688 calls to /inference/v1/generate, all steps green, eval Avg@4=0.85. TITO run: 20/20 steps green, eval Avg@4=0.85 (reverse-text is single-turn so the TITO client falls back to MITO /v1/chat/completions per openai_chat_completions_token_client.py:114 — exercises the routing flag but not the /tokens wire path).

Out of scope (follow-ups)

  • VLM rollouts via the renderer. The new endpoint handles MM features end-to-end (it accepts pre-built MultiModalFeatures), but validate_renderer_vs_vlm in configs/orchestrator.py still blocks the combination. Lifting the ban needs the renderer client to build features client-side (HF processor → MultiModalKwargsItem → base64 msgpack). Cleaner as a separate PR after this lands.
  • mismatch_kl is unrelated. The drift identified in MM_KL_INVESTIGATION_SUMMARY.md lives in vLLM's bf16 forward kernel, not the wire protocol — this migration neither helps nor hurts it.

🤖 Generated with Claude Code


Note

Medium Risk
Swaps the token-level generation API from a custom /v1/generate implementation to vLLM 0.20’s /inference/v1/generate, which can break callers or response expectations if any edge cases differ. Adds a custom ServingTokens subclass that intercepts headers and post-processes outputs, so integration correctness depends on matching vLLM’s evolving protocol.

Overview
Moves token-level generation off the legacy /v1/generate endpoint onto vLLM 0.20’s /inference/v1/generate. The bespoke serving_generate.py handler and its route are removed, and server.py now swaps vLLM’s serving_tokens instance to a new PrimeRlServingTokens wrapper during app init.

PrimeRlServingTokens preserves prime-RL-specific behavior on the new endpoint: forwards X-data-parallel-rank into engine_client.generate, re-exports per-token routed_experts in responses, and applies server-side defaulting for sampling_params.max_tokens when omitted (avoiding vLLM’s 16-token default).

The orchestrator’s compute_teacher_logprobs is updated to call /inference/v1/generate with the new request schema (token_ids + nested sampling_params) and to re-flatten upstream prompt_logprobs back into the legacy list-of-floats shape expected by callers. Tests are updated accordingly (replace test_serving_generate.py with test_serving_tokens.py, adjust teacher logprobs expectations), and dependencies are updated by pinning renderers==0.1.6 (PyPI) and bumping verifiers to a new git rev with lockfile refresh.

Reviewed by Cursor Bugbot for commit 913cc4c. Bugbot is set up for automated code reviews on this repo. Configure here.

hallerite added a commit to PrimeIntellect-ai/verifiers that referenced this pull request May 4, 2026
…package

Now that renderers lives in its own repo
(https://github.com/PrimeIntellect-ai/renderers), pin the verifiers dep
directly at PrimeIntellect-ai/renderers#1's head (40bc2a6 — the lean
``generate()`` rewrite) and remove ``packages/renderers/`` from the
verifiers tree.

This also drops the ``uv pip install -e packages/renderers`` CI hack
introduced in c969123 — no longer needed once renderers resolves
through ``[tool.uv.sources]``.

Bump the version constraints to ``renderers>=0.1.6``. Once renderers
v0.1.6 publishes to PyPI, drop ``[tool.uv.sources]`` and let the
constraint resolve from the trusted publisher.

Companion to:
  - PrimeIntellect-ai/renderers#1 (lean ``generate()`` rewrite)
  - PrimeIntellect-ai/prime-rl#2408 (consumer migration)
@hallerite hallerite marked this pull request as ready for review May 5, 2026 10:46
Comment thread src/prime_rl/inference/vllm/serving_tokens.py
return {
    "data": base64.b85encode(arr.tobytes()).decode("ascii"),
    "shape": list(arr.shape),
}

Duplicated _encode_routed_experts logic across two files

Low Severity

The standalone _encode_routed_experts function in serving_tokens.py has identical logic to the instance method _RoutedExpertsCapture._encode_routed_experts in serving_chat_with_tokens.py. Both base85-encode a numpy array and return a {"data": ..., "shape": ...} dict. One shared utility function could serve both callers, reducing the risk of inconsistent fixes if the encoding format ever changes.
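
Something along these lines could serve as the shared helper (sketch only; the module placement and the decode dtype are assumptions, and the encode body mirrors the excerpt quoted above):

```python
# Sketch of a shared utility (hypothetical module, e.g. routed_experts_codec.py);
# the encode body mirrors the excerpt quoted above, the decode dtype is assumed.
import base64
from typing import Any

import numpy as np


def encode_routed_experts(arr: np.ndarray) -> dict[str, Any]:
    """Base85-encode a routed-experts array plus its shape for the wire."""
    return {
        "data": base64.b85encode(arr.tobytes()).decode("ascii"),
        "shape": list(arr.shape),
    }


def decode_routed_experts(payload: dict[str, Any], dtype=np.int32) -> np.ndarray:
    """Inverse of encode_routed_experts (dtype must match the producer)."""
    raw = base64.b85decode(payload["data"])
    return np.frombuffer(raw, dtype=dtype).reshape(payload["shape"])
```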


Reviewed by Cursor Bugbot for commit e16f639. Configure here.

hallerite added 2 commits May 7, 2026 14:15
…/generate

vLLM 0.20 ships a tokens-in / tokens-out endpoint at /inference/v1/generate
(disagg/serving.py) that supersedes the bespoke /v1/generate handler
prime-rl shipped on top of vllm 0.19. Replace it.

Server side:
- Drop src/prime_rl/inference/vllm/serving_generate.py and the /v1/generate
  route in server.py — vLLM 0.20's build_app already attaches
  /inference/v1/generate via attach_disagg_router.
- Subclass upstream's ServingTokens with PrimeRlServingTokens to preserve
  two prime-rl features the upstream protocol doesn't natively cover:
    1. data_parallel_rank routing — read from the X-data-parallel-rank
       header and forwarded to engine_client.generate.
    2. routed_experts per-token export — surfaced on each choice when
       the engine is launched with enable_return_routed_experts=True.
  custom_init_app_state swaps the upstream serving_tokens instance for our
  subclass.

Orchestrator side:
- compute_teacher_logprobs in orchestrator/utils.py points at
  /inference/v1/generate, builds the upstream payload (token_ids +
  nested sampling_params), and re-flattens prompt_logprobs from the
  upstream list[dict[token_id, Logprob]] shape back to the list[float]
  callers expect.

Tests:
- Replace test_serving_generate.py (class deleted) with
  test_serving_tokens.py — exercises the prime-rl deltas
  (routed_experts encoding, response shape stability).
- Update test_teacher_logprobs.py to expect the new endpoint URL,
  payload shape, and response unwrap.

Renderers pin:
- Bump renderers source to 9c0b738e on the verifiers repo so the
  client-side switch to /inference/v1/generate ships together.

Net: -1 endpoint, ~-275 LoC, no functional change for callers (renderer
client emits the same parsed response shape; teacher logprobs return
identical list[float]).
…to 7bdc769

Renderers moved out of the verifiers monorepo into their own repo
(verifiers#1282). Repoint the source from verifiers/packages/renderers
to PrimeIntellect-ai/renderers @ 9acdc60 and declare renderers as a
direct prime-rl dependency since it was previously transitively pulled
via verifiers' in-tree workspace package. Bump verifiers to 7bdc769 to
pick up the post-split main.

Pairs with the /inference/v1/generate switch — the renderer client at
9acdc60 emits the new endpoint shape.
@hallerite hallerite force-pushed the feat/unify-inference-generate branch from 794a588 to 0f16ddc Compare May 7, 2026 14:22
@hallerite hallerite changed the base branch from feat/vllm-0.20-cu13 to main May 7, 2026 14:22
hallerite and others added 2 commits May 7, 2026 14:30
Renderers 0.1.6 was published on PyPI today (commit 9acdc60 + version
bump). Switch from the git rev source to the canonical PyPI release —
keeps the same code (==0.1.6) but avoids depending on the renderers
git repo at install time.

Keeps `renderers = false` in `[tool.uv.exclude-newer-package]` since
0.1.6 is inside the 7-day cooldown window.
…generate

vLLM 0.20's ServingTokens hands the client-supplied SamplingParams to the
engine verbatim. SamplingParams.max_tokens defaults to 16 (a dataclass-level
default that predates the OpenAI-compat layer), so any caller that omits the
field gets a 16-token completion — long enough to start a sentence and stop
mid-word.

Other vLLM endpoints (/v1/chat/completions, /v1/completions, /v1/responses)
all mask this server-side via vllm.entrypoints.utils.get_max_tokens, which
falls back to max_model_len - prompt_len. The disagg endpoint skips that
path. Mirror it inside PrimeRlServingTokens so callers don't need a
client-side workaround.

Detection: re-read the cached request body to tell "client sent
max_tokens=16" from "client sent nothing → SamplingParams default 16".
Pessimistic on read failures (assume the client did set it).

Drop once vLLM patches upstream.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
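
A sketch of what that detection and defaulting could look like (request/body field names and the fallback formula follow the commit text above; none of this is the verbatim 913cc4c diff):

```python
# Sketch only: request/body field names and the fallback formula follow the
# commit message above, not the actual patch.
import json
from typing import Any


async def apply_max_tokens_default(request: Any, raw_request: Any,
                                   max_model_len: int) -> None:
    # SamplingParams.max_tokens defaults to 16, so a parsed request can't tell
    # "client asked for 16" from "client omitted the field". Re-read the cached
    # body to disambiguate; on any read failure assume the client did set it.
    try:
        body = json.loads(await raw_request.body())
        client_set_it = "max_tokens" in body.get("sampling_params", {})
    except Exception:
        client_set_it = True  # pessimistic: never override an explicit value

    if not client_set_it:
        # Mirror the other endpoints: fall back to max_model_len - prompt_len.
        prompt_len = len(request.token_ids)
        request.sampling_params.max_tokens = max_model_len - prompt_len
```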

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).



Reviewed by Cursor Bugbot for commit 913cc4c. Configure here.

    async for res in result_generator:
        final_res = res
except asyncio.CancelledError:
    return self.create_error_response("Client disconnected")

Missing engine abort on client disconnect

Medium Severity

When a client disconnects mid-generation, serve_tokens_full_generator catches asyncio.CancelledError and returns an error response, but never calls engine_client.abort(request_id). The deleted serving_generate.py explicitly called await self.engine_client.abort(request_id) before re-raising, ensuring the engine stopped processing the request. Without the abort, the inference engine may continue consuming GPU compute for requests whose clients are long gone, which compounds under high concurrency (2k+ simultaneous rollouts).
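
A self-contained sketch of the suggested fix (engine_client.abort(request_id) is the call the deleted handler made; the surrounding names are illustrative):

```python
# Sketch of the suggested fix; engine_client.abort(request_id) is the call the
# deleted serving_generate.py made, the surrounding names are illustrative.
import asyncio
from typing import Any, AsyncIterator, Callable


async def drain_with_abort(result_generator: AsyncIterator[Any], request_id: str,
                           engine_client: Any,
                           create_error_response: Callable[[str], Any]) -> Any:
    final_res = None
    try:
        async for res in result_generator:
            final_res = res
    except asyncio.CancelledError:
        # Tell the engine to stop generating for a client that is gone, so the
        # GPU isn't spent on abandoned requests under high concurrency.
        await engine_client.abort(request_id)
        return create_error_response("Client disconnected")
    return final_res
```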


Reviewed by Cursor Bugbot for commit 913cc4c. Configure here.

hallerite added a commit to PrimeIntellect-ai/verifiers that referenced this pull request May 8, 2026
…rompt_len"

This reverts commit 831f8bc.

The fix moved server-side: prime-rl's PrimeRlServingTokens now applies
get_max_tokens() defaulting in serve_tokens (PrimeIntellect-ai/prime-rl#2408,
commit 913cc4ca), matching every other vLLM endpoint. The client-side
workaround was always a band-aid and is no longer needed for prime-rl
deployments. Other vLLM 0.20 deployments hitting /inference/v1/generate
still need the upstream fix or to apply the prime-rl override locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
