
feat(renderers): switch client to vLLM 0.20 /inference/v1/generate #1282

Merged
hallerite merged 9 commits into main from feat/renderer-inference-v1-generate on May 5, 2026

feat(renderers): switch client to vLLM 0.20 /inference/v1/generate#1282
hallerite merged 9 commits intomainfrom
feat/renderer-inference-v1-generate

Conversation


hallerite (Member) commented May 3, 2026

Summary

vLLM 0.20 ships a unified tokens-in / tokens-out endpoint at /inference/v1/generate that supersedes the bespoke /v1/generate handler prime-rl shipped on top of vLLM 0.19. Migrate verifiers' RendererClient onto the new endpoint and pin the renderers package to its lean rewrite.

Companion PRs:

  • PrimeIntellect-ai/renderers#1 (lean generate() rewrite)
  • PrimeIntellect-ai/prime-rl#2408 (consumer migration)

What changed

Renderers pin (pyproject.toml + uv.lock)

  • [tool.uv.sources] pins renderers to PrimeIntellect-ai/renderers@40bc2a6 (the head of the companion PR — the lean generate() rewrite); a pyproject.toml sketch follows this list.
  • packages/renderers/ deleted from the verifiers tree; the package is no longer vendored.
  • Version constraint bumped to renderers>=0.1.6 in [project] deps and the renderers extra. Once renderers-v0.1.6 publishes to PyPI, drop [tool.uv.sources] and let it resolve from PyPI directly.
  • Drops the uv pip install -e packages/renderers CI hack — no longer needed once renderers resolves through [tool.uv.sources].
  • Deletes .github/workflows/publish-renderers.yml — the publish flow now lives in the renderers repo (ci: add PyPI publish workflow renderers#2).
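For reference, the pin described above takes roughly this shape in pyproject.toml (a minimal sketch; the rev is abbreviated here, and uv.lock records the full SHA):

```toml
# pyproject.toml (sketch)
[tool.uv.sources]
# Temporary git pin; drop this table once renderers v0.1.6 is on PyPI and let
# the renderers>=0.1.6 constraint resolve from the index instead.
renderers = { git = "https://github.com/PrimeIntellect-ai/renderers", rev = "40bc2a6" }
```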

RendererClient adapter (verifiers/clients/renderer_client.py)

  • get_native_response builds a sampling_params dict from the caller's flat sampling_args / extra_body and calls the new generate(...) with named args. This is the right place for the OpenAI-shaped → lean adaptation; the renderers package itself no longer carries OpenAI-SDK conventions.
  • from_native_response reads request_id instead of id; Usage is reconstructed from token-list lengths (the new endpoint doesn't return a usage block). Both adaptation steps are sketched after this list.
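A minimal sketch of the two adaptation steps, with a stand-in Usage dataclass and hypothetical helper names (the real implementation lives in verifiers/clients/renderer_client.py and uses the client's own types):

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Usage:
    # Stand-in for the client's usage type (assumption); the endpoint returns
    # no usage block, so the counts come from the token lists.
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int


def build_sampling_params(
    sampling_args: dict[str, Any], extra_body: dict[str, Any] | None = None
) -> dict[str, Any]:
    """Fold the caller's flat OpenAI-style kwargs into one sampling_params dict,
    which generate(...) forwards to vLLM verbatim."""
    params = dict(sampling_args)
    if extra_body:
        params.update(extra_body)
    return params


def usage_from_token_lists(prompt_ids: list[int], completion_ids: list[int]) -> Usage:
    """Reconstruct usage from token-list lengths."""
    return Usage(
        prompt_tokens=len(prompt_ids),
        completion_tokens=len(completion_ids),
        total_tokens=len(prompt_ids) + len(completion_ids),
    )
```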

Test plan

  • tests/test_renderer_client.py + tests/test_renderer_e2e.py updated for the new request_id / sampling_params shapes — 42/42 pass against the external pin.
  • e2e renderer rollout against a live vLLM 0.20 server (prime-rl#2408 + this PR's client): 20-step multi_reverse_text RL run, 2688 calls to /inference/v1/generate, eval Avg@4 = 0.83 — identical numbers to the fat-API version.
  • ruff format --check / ruff check clean.
  • ty check verifiers passes (0 errors).
  • Rebased onto current main (picks up Make renderers optional and add PyPI publish workflow #1279).

Notes

  • VLMs are still blocked by prime-rl's validate_renderer_vs_vlm config validator. The new endpoint already supports MM features end-to-end; lifting the ban needs the renderer client to build features client-side (HF processor → MultiModalKwargsItem → base64 msgpack). That's a separate PR — easier to review on its own once this lands.

Note

Medium Risk: changes the inference request/response contract (generate call shape, request_id, and usage reconstruction) and removes the vendored packages/renderers implementation in favor of a pinned external dependency.

Overview
Switches RendererClient over to renderers.client.generate, adapting OpenAI-style sampling args into a sampling_params dict, passing through cache_salt/priority/headers, and updating response handling to use request_id and reconstruct usage from token lengths.

Removes the in-repo packages/renderers implementation (and its publish-renderers GitHub workflow) and pins renderers via pyproject.toml (renderers>=0.1.6 plus a git tool.uv.sources override) so the renderer code is consumed as an external package.

Reviewed by Cursor Bugbot for commit b494fb7.

hallerite and others added 3 commits May 4, 2026 21:45
Replace the OpenAI-chat-completions-shaped ``completions_request`` with
a lean ``generate()`` built around what /inference/v1/generate actually
exposes:

- Structured ``sampling_params: dict`` arg, forwarded to vLLM verbatim.
  No more ``extra_body`` fallback, no ``_SAMPLING_KEYS`` allowlist, no
  ``max_completion_tokens`` ↔ ``max_tokens`` aliasing — those are
  OpenAI-SDK habits that don't apply here.
- Top-level ``cache_salt`` / ``priority`` / ``extra_headers`` as named
  args (matching the wire shape, no rummaging through extra_body).
- Result dict drops the ChatCompletion-shaped fillers (``id``,
  ``created``, ``model``, ``usage``); keeps ``request_id`` (the actual
  field /inference/v1/generate returns) and the renderer-specific
  fields (content, reasoning_content, tool_calls, finish_reason,
  prompt/completion_ids, completion_logprobs, routed_experts).
- ``stop_token_ids`` (from the renderer) and ``logprobs=1`` are forced
  by us; everything else flows through.

Kept: the ``finish_reason: stop → tool_calls`` promotion when the
renderer extracts tool calls client-side (downstream agent loops
genuinely depend on it), the AsyncOpenAI transport (auth + retries),
and the overlong-prompt 4xx diagnostic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
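For orientation, a result dict of the shape the commit message above describes might look like the following (field names are taken from the message; the values are illustrative, and the exact key spellings for the prompt/completion id lists are assumptions):

```python
# Illustrative only: field names from the commit message above, values made up.
example_result = {
    "request_id": "req-abc123",
    "content": "txet esrever",
    "reasoning_content": None,
    "tool_calls": [],
    # Promoted from "stop" to "tool_calls" when the renderer extracts tool
    # calls client-side.
    "finish_reason": "stop",
    "prompt_ids": [1, 2, 3],           # key spelling assumed
    "completion_ids": [4, 5, 6],       # key spelling assumed
    "completion_logprobs": [-0.1, -0.2, -0.3],
    "routed_experts": None,
}
```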
The renderers package's ``completions_request`` was renamed to
``generate`` and grew a structured ``sampling_params`` arg. Update
``RendererClient`` and the e2e test scaffold to match.

- ``get_native_response``: build a ``sampling_params`` dict from the
  caller's flat sampling_args / extra_body, then call ``generate(...)``
  with named ``cache_salt`` / ``priority`` / ``extra_headers`` args.
  This is where the OpenAI-SDK kwarg conventions belong (the verifiers
  shim adapts the OpenAI-shaped surface to the lean generate() API);
  the renderer client itself no longer carries them.
- ``from_native_response``: read ``request_id`` (the field
  /inference/v1/generate actually returns) instead of ``id``;
  reconstruct ``Usage`` from token-list lengths since the endpoint
  doesn't return a usage block.
- ``ScriptedVLLM``: speak the new wire shape — POST to
  /inference/v1/generate, body uses ``token_ids`` and nested
  ``sampling_params``, response returns ``request_id`` and
  ``logprobs.content[*]``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
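A rough sketch of the wire shape the scaffold now speaks, following the commit message above; only token_ids, the nested sampling_params, request_id, and logprobs.content come from that description, and the remaining keys and values are illustrative assumptions:

```python
# POST /inference/v1/generate: illustrative request body
request_body = {
    "token_ids": [1, 2, 3, 4],          # tokens-in: prompt as token ids
    "sampling_params": {                 # nested dict, forwarded to vLLM verbatim
        "max_tokens": 256,
        "temperature": 0.7,
        "stop_token_ids": [128009],
        "logprobs": 1,
    },
}

# Illustrative response body
response_body = {
    "request_id": "req-abc123",
    "token_ids": [5, 6, 7],             # tokens-out: completion ids (key assumed)
    "logprobs": {"content": [{"token": 5, "logprob": -0.1}]},  # inner shape assumed
    "finish_reason": "stop",
}
```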
The /inference/v1/generate switch is a wire-protocol break against
v0.1.5 (which targets the legacy /generate endpoint). Tag this as a
new release so the PyPI publish workflow picks it up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hallerite hallerite force-pushed the feat/renderer-inference-v1-generate branch from d1f7821 to 0b797d2 on May 4, 2026 21:46
CI's ``uv sync`` resolves ``renderers>=0.1.5`` from PyPI, but this PR
bumps to v0.1.6 with the new ``generate`` API. Pre-merge there's no
PyPI release for v0.1.6 yet, so the import fails:

    ImportError: cannot import name 'generate' from 'renderers.client'

Add ``uv pip install -e packages/renderers`` after ``uv sync`` in both
test.yml and style.yml so CI uses the in-repo source. ``--no-sync`` on
the actual test/ty step prevents uv from rolling renderers back to the
PyPI version. Drop these steps after a renderers-v0.1.6 tag publishes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…package

Now that renderers lives in its own repo
(https://github.com/PrimeIntellect-ai/renderers), pin the verifiers dep
directly at PrimeIntellect-ai/renderers#1's head (40bc2a6 — the lean
``generate()`` rewrite) and remove ``packages/renderers/`` from the
verifiers tree.

This also drops the ``uv pip install -e packages/renderers`` CI hack
introduced in c969123 — no longer needed once renderers resolves
through ``[tool.uv.sources]``.

Bump the version constraints to ``renderers>=0.1.6``. Once renderers
v0.1.6 publishes to PyPI, drop ``[tool.uv.sources]`` and let the
constraint resolve from the trusted publisher.

Companion to:
  - PrimeIntellect-ai/renderers#1 (lean ``generate()`` rewrite)
  - PrimeIntellect-ai/prime-rl#2408 (consumer migration)
@hallerite hallerite marked this pull request as ready for review May 5, 2026 11:14
The renderers package now lives in PrimeIntellect-ai/renderers and ships
its own publish workflow (PrimeIntellect-ai/renderers#2). This stub no
longer has a target — packages/renderers/ was removed in 1d34beb.
hallerite added a commit to PrimeIntellect-ai/renderers that referenced this pull request May 5, 2026
Replace the OpenAI-chat-completions-shaped ``completions_request`` with
a lean ``generate()`` built around what /inference/v1/generate actually
exposes:

- Structured ``sampling_params: dict`` arg, forwarded to vLLM verbatim.
  No more ``extra_body`` fallback, no ``_SAMPLING_KEYS`` allowlist, no
  ``max_completion_tokens`` ↔ ``max_tokens`` aliasing — those are
  OpenAI-SDK habits that don't apply here.
- Top-level ``cache_salt`` / ``priority`` / ``extra_headers`` as named
  args (matching the wire shape, no rummaging through extra_body).
- Result dict drops the ChatCompletion-shaped fillers (``id``,
  ``created``, ``model``, ``usage``); keeps ``request_id`` (the actual
  field /inference/v1/generate returns) and the renderer-specific
  fields (content, reasoning_content, tool_calls, finish_reason,
  prompt/completion_ids, completion_logprobs, routed_experts).
- ``stop_token_ids`` (from the renderer) and ``logprobs=1`` are forced
  by us; everything else flows through.

Kept: the ``finish_reason: stop → tool_calls`` promotion when the
renderer extracts tool calls client-side (downstream agent loops
genuinely depend on it), the AsyncOpenAI transport (auth + retries),
and the overlong-prompt 4xx diagnostic.

Bump version 0.1.5 → 0.1.6 — the wire format change is a break against
v0.1.5 (which targets the legacy /generate route). Tag renderers-v0.1.6
to publish.

Lifted from packages/renderers/ in PrimeIntellect-ai/verifiers#1282 now
that this package lives in its own repo.
renderers#1 squash-merged to PrimeIntellect-ai/renderers main as
9acdc60. Repoint [tool.uv.sources] from the now-deleted PR branch SHA
(40bc2a6) to the squash-merge commit so the pin tracks main rather
than a side-history commit.
eligotts and others added 2 commits May 5, 2026 12:48
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ce-v1-generate

# Conflicts:
#	packages/renderers/pyproject.toml
#	packages/renderers/uv.lock
#	uv.lock
@hallerite hallerite merged commit 7bdc769 into main May 5, 2026
8 checks passed
@hallerite hallerite deleted the feat/renderer-inference-v1-generate branch May 5, 2026 22:17