feat(renderers): switch client to vLLM 0.20 /inference/v1/generate #1282
Merged
Conversation
Force-pushed from 4e80478 to d1f7821.
Replace the OpenAI-chat-completions-shaped ``completions_request`` with a lean ``generate()`` built around what /inference/v1/generate actually exposes:

- Structured ``sampling_params: dict`` arg, forwarded to vLLM verbatim. No more ``extra_body`` fallback, no ``_SAMPLING_KEYS`` allowlist, no ``max_completion_tokens`` ↔ ``max_tokens`` aliasing — those are OpenAI-SDK habits that don't apply here.
- Top-level ``cache_salt`` / ``priority`` / ``extra_headers`` as named args (matching the wire shape, no rummaging through ``extra_body``).
- Result dict drops the ChatCompletion-shaped fillers (``id``, ``created``, ``model``, ``usage``); keeps ``request_id`` (the actual field /inference/v1/generate returns) and the renderer-specific fields (content, reasoning_content, tool_calls, finish_reason, prompt/completion_ids, completion_logprobs, routed_experts).
- ``stop_token_ids`` (from the renderer) and ``logprobs=1`` are forced by us; everything else flows through.

Kept: the ``finish_reason: stop → tool_calls`` promotion when the renderer extracts tool calls client-side (downstream agent loops genuinely depend on it), the AsyncOpenAI transport (auth + retries), and the overlong-prompt 4xx diagnostic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
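For orientation, here is a minimal, hypothetical sketch of what a call against this lean surface could look like. Only ``sampling_params``, ``cache_salt``, ``priority``, ``extra_headers``, and the listed result keys come from the commit message above; the prompt-side arguments and everything else are assumptions, not the published API.

```python
import asyncio

# Import path matches the ImportError quoted later in this thread; the call
# signature below is an assumption reconstructed from the commit message.
from renderers.client import generate


async def main() -> None:
    result = await generate(
        # Prompt-side arguments are a guess; the commit only names the kwargs below.
        messages=[{"role": "user", "content": "hello"}],
        model="my-model",
        sampling_params={  # forwarded to vLLM verbatim: no allowlist, no key aliasing
            "temperature": 0.7,
            "max_tokens": 512,
        },
        cache_salt="run-42",  # top-level named args, matching the wire shape
        priority=0,
        extra_headers={"x-trace-id": "abc123"},
    )
    # Lean result dict: request_id plus renderer-specific fields, no
    # ChatCompletion-shaped fillers (id / created / model / usage).
    # finish_reason may read "tool_calls" even when vLLM said "stop", because
    # the client-side tool-call promotion is kept.
    print(result["request_id"], result["finish_reason"])
    print(result["content"], len(result["completion_ids"]))


asyncio.run(main())
```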
The renderers package's ``completions_request`` was renamed to ``generate`` and grew a structured ``sampling_params`` arg. Update ``RendererClient`` and the e2e test scaffold to match.

- ``get_native_response``: build a ``sampling_params`` dict from the caller's flat sampling_args / extra_body, then call ``generate(...)`` with named ``cache_salt`` / ``priority`` / ``extra_headers`` args. This is where the OpenAI-SDK kwarg conventions belong (the verifiers shim adapts the OpenAI-shaped surface to the lean ``generate()`` API); the renderer client itself no longer carries them.
- ``from_native_response``: read ``request_id`` (the field /inference/v1/generate actually returns) instead of ``id``; reconstruct ``Usage`` from token-list lengths since the endpoint doesn't return a usage block.
- ``ScriptedVLLM``: speak the new wire shape — POST to /inference/v1/generate, body uses ``token_ids`` and nested ``sampling_params``, response returns ``request_id`` and ``logprobs.content[*]``. A sketch of this wire shape follows below.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
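To make the ScriptedVLLM expectations concrete, a hedged sketch of the request and response bodies implied by the bullets above. Only the field names called out in the commit message (``token_ids``, ``sampling_params``, ``request_id``, ``logprobs.content[*]``) are taken from it; all values, any field not named there, and the placement of the forced ``stop_token_ids`` / ``logprobs`` inside ``sampling_params`` are illustrative assumptions.

```python
# Approximate wire shape for POST /inference/v1/generate as the scripted
# server would see it. Field names follow the commit message above; the
# values and any unnamed fields are illustrative only.
request_body = {
    "token_ids": [151644, 872, 198, 9906, 151645],  # tokens-in: the rendered prompt
    "sampling_params": {                            # nested, passed to vLLM verbatim
        "temperature": 0.7,
        "max_tokens": 512,
        "stop_token_ids": [151645],                 # forced by the client (from the renderer)
        "logprobs": 1,                              # forced by the client
    },
}

response_body = {
    "request_id": "gen-abc123",                     # replaces the ChatCompletion-style "id"
    "token_ids": [9906, 0, 151645],                 # tokens-out (field name is a guess)
    "logprobs": {
        "content": [                                # one entry per completion token
            {"token_id": 9906, "logprob": -0.12},
            {"token_id": 0, "logprob": -0.45},
            {"token_id": 151645, "logprob": -0.03},
        ],
    },
    "finish_reason": "stop",
}
```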
The /inference/v1/generate switch is a wire-protocol break against v0.1.5 (which targets the legacy /generate endpoint). Tag this as a new release so the PyPI publish workflow picks it up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from d1f7821 to 0b797d2.
CI's ``uv sync`` resolves ``renderers>=0.1.5`` from PyPI, but this PR
bumps to v0.1.6 with the new ``generate`` API. Pre-merge there's no
PyPI release for v0.1.6 yet, so the import fails:
ImportError: cannot import name 'generate' from 'renderers.client'
Add ``uv pip install -e packages/renderers`` after ``uv sync`` in both
test.yml and style.yml so CI uses the in-repo source. ``--no-sync`` on
the actual test/ty step prevents uv from rolling renderers back to the
PyPI version. Drop these steps after a renderers-v0.1.6 tag publishes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…package

Now that renderers lives in its own repo (https://github.com/PrimeIntellect-ai/renderers), pin the verifiers dep directly at PrimeIntellect-ai/renderers#1's head (40bc2a6 — the lean ``generate()`` rewrite) and remove ``packages/renderers/`` from the verifiers tree. This also drops the ``uv pip install -e packages/renderers`` CI hack introduced in c969123 — no longer needed once renderers resolves through ``[tool.uv.sources]``.

Bump the version constraints to ``renderers>=0.1.6``. Once renderers v0.1.6 publishes to PyPI, drop ``[tool.uv.sources]`` and let the constraint resolve from the trusted publisher.

Companion to:
- PrimeIntellect-ai/renderers#1 (lean ``generate()`` rewrite)
- PrimeIntellect-ai/prime-rl#2408 (consumer migration)
The renderers package now lives in PrimeIntellect-ai/renderers and ships its own publish workflow (PrimeIntellect-ai/renderers#2). This stub no longer has a target — packages/renderers/ was removed in 1d34beb.
hallerite added a commit to PrimeIntellect-ai/renderers that referenced this pull request on May 5, 2026:
Replace the OpenAI-chat-completions-shaped ``completions_request`` with a lean ``generate()`` built around what /inference/v1/generate actually exposes:

- Structured ``sampling_params: dict`` arg, forwarded to vLLM verbatim. No more ``extra_body`` fallback, no ``_SAMPLING_KEYS`` allowlist, no ``max_completion_tokens`` ↔ ``max_tokens`` aliasing — those are OpenAI-SDK habits that don't apply here.
- Top-level ``cache_salt`` / ``priority`` / ``extra_headers`` as named args (matching the wire shape, no rummaging through ``extra_body``).
- Result dict drops the ChatCompletion-shaped fillers (``id``, ``created``, ``model``, ``usage``); keeps ``request_id`` (the actual field /inference/v1/generate returns) and the renderer-specific fields (content, reasoning_content, tool_calls, finish_reason, prompt/completion_ids, completion_logprobs, routed_experts).
- ``stop_token_ids`` (from the renderer) and ``logprobs=1`` are forced by us; everything else flows through.

Kept: the ``finish_reason: stop → tool_calls`` promotion when the renderer extracts tool calls client-side (downstream agent loops genuinely depend on it), the AsyncOpenAI transport (auth + retries), and the overlong-prompt 4xx diagnostic.

Bump version 0.1.5 → 0.1.6 — the wire format change is a break against v0.1.5 (which targets the legacy /generate route). Tag renderers-v0.1.6 to publish.

Lifted from PrimeIntellect-ai/verifiers#1282 packages/renderers/ now that this package lives in its own repo.
renderers#1 squash-merged to PrimeIntellect-ai/renderers main as 9acdc60. Repoint [tool.uv.sources] from the now-deleted PR branch SHA (40bc2a6) to the squash-merge commit so the pin tracks main rather than a side-history commit.
eligotts reviewed on May 5, 2026.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ce-v1-generate

# Conflicts:
#	packages/renderers/pyproject.toml
#	packages/renderers/uv.lock
#	uv.lock
Summary
vLLM 0.20 ships a unified tokens-in / tokens-out endpoint at ``/inference/v1/generate`` that supersedes the bespoke ``/v1/generate`` handler prime-rl shipped on top of vLLM 0.19. Migrate verifiers' ``RendererClient`` onto the new endpoint and pin the renderers package to its lean rewrite.

Companion PRs:
- ``generate()`` rewrite (formerly this PR's ``packages/renderers/``; now lives in its own repo)

What changed
Renderers pin (``pyproject.toml`` + ``uv.lock``)
- ``[tool.uv.sources]`` pins ``renderers`` to ``PrimeIntellect-ai/renderers@40bc2a6`` (the head of the companion PR — the lean ``generate()`` rewrite).
- ``packages/renderers/`` deleted from the verifiers tree; the package is no longer vendored.
- ``renderers>=0.1.6`` in ``[project]`` deps and the ``renderers`` extra. Once ``renderers-v0.1.6`` publishes to PyPI, drop ``[tool.uv.sources]`` and let it resolve from PyPI directly.
- Dropped the ``uv pip install -e packages/renderers`` CI hack — no longer needed once renderers resolves through ``[tool.uv.sources]``.
- Removed ``.github/workflows/publish-renderers.yml`` — the publish flow now lives in the renderers repo (ci: add PyPI publish workflow renderers#2).

``RendererClient`` adapter (``verifiers/clients/renderer_client.py``)
- ``get_native_response`` builds a ``sampling_params`` dict from the caller's flat sampling_args / ``extra_body`` and calls the new ``generate(...)`` with named args. This is the right place for the OpenAI-shaped → lean adaptation; the renderers package itself no longer carries OpenAI-SDK conventions. See the sketch after this list.
- ``from_native_response`` reads ``request_id`` instead of ``id``; ``Usage`` is reconstructed from token-list lengths (the new endpoint doesn't return a usage block).
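A minimal sketch of the two adapter pieces above, under stated assumptions: the helper names are hypothetical (the real ``RendererClient`` methods are shaped differently), and mapping ``Usage`` onto OpenAI's ``CompletionUsage`` model is a guess. Only the flat-args → ``sampling_params`` folding and the token-length usage reconstruction come from this PR.

```python
from openai.types.completion_usage import CompletionUsage

# Hypothetical helpers; names and exact behavior are not from the PR.


def build_sampling_params(sampling_args: dict, extra_body: dict | None) -> dict:
    """Fold the caller's flat OpenAI-style kwargs into one sampling_params dict."""
    params = dict(extra_body or {})
    params.update(sampling_args)
    # Translate the one OpenAI-SDK alias the flat surface still carries.
    if "max_completion_tokens" in params:
        params["max_tokens"] = params.pop("max_completion_tokens")
    return params


def reconstruct_usage(prompt_ids: list[int], completion_ids: list[int]) -> CompletionUsage:
    """The endpoint returns no usage block, so rebuild one from token-list lengths."""
    return CompletionUsage(
        prompt_tokens=len(prompt_ids),
        completion_tokens=len(completion_ids),
        total_tokens=len(prompt_ids) + len(completion_ids),
    )
```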
Test plan
- ``tests/test_renderer_client.py`` + ``tests/test_renderer_e2e.py`` updated for the new ``request_id`` / ``sampling_params`` shapes — 42/42 pass against the external pin.
- ``multi_reverse_text`` RL run, 2688 calls to ``/inference/v1/generate``, eval Avg@4 = 0.83 — identical numbers to the fat-API version.
- ``ruff format --check`` / ``ruff check`` clean.
- ``ty check verifiers`` passes (0 errors).
- Merged ``main`` (picks up "Make renderers optional and add PyPI publish workflow" #1279 — renderers as optional / PyPI publish workflow).

Notes
Multimodal stays gated by the ``validate_renderer_vs_vlm`` config validator. The new endpoint already supports MM features end-to-end; lifting the ban needs the renderer client to build features client-side (HF processor → ``MultiModalKwargsItem`` → base64 msgpack). That's a separate PR — easier to review on its own once this lands.
Note: Medium Risk

Moderate risk because it changes the inference request/response contract (``generate`` call shape, ``request_id``, and usage reconstruction) and removes the vendored ``packages/renderers`` implementation in favor of a pinned external dependency.

Overview

Switches ``RendererClient`` over to ``renderers.client.generate``, adapting OpenAI-style sampling args into a ``sampling_params`` dict, passing through ``cache_salt`` / ``priority`` / headers, and updating response handling to use ``request_id`` and reconstruct ``usage`` from token lengths.

Removes the in-repo ``packages/renderers`` implementation (and its ``publish-renderers`` GitHub workflow) and pins ``renderers`` via ``pyproject.toml`` (``renderers>=0.1.6`` plus a git ``tool.uv.sources`` override) so the renderer code is consumed as an external package.

Reviewed by Cursor Bugbot for commit b494fb7.