Reject full-context chat prompts before max_tokens underflows #2402
Open
Summary
- Guard standard `/v1/chat/completions` prompts that already fill the model context, so vLLM never computes `max_tokens=0`.
- Preserve `/chat/completions/tokens` behavior by validating after the stitched `request.tokens` are installed, so TITO checks the actual request tokens.

Why This Fixes The Bug
Before this change, the standard `/v1/chat/completions` serving path rendered the prompt and then called vLLM's `get_max_tokens(...)`. When `prompt_len == max_model_len`, that helper can return `0`; `request.to_sampling_params(max_tokens=0)` then surfaces as `BadRequestError: max_tokens must be at least 1, got 0`. That error looks like a generic model error to the rollout env, so group-scored rollouts can reschedule an otherwise nearly complete group.
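To make the underflow concrete, here is a toy stand-in for the budget computation (a simplified illustration, not vLLM's actual `get_max_tokens` implementation):

```python
# Toy illustration: the generation budget is whatever context remains after the
# prompt, so a prompt that exactly fills the context yields a budget of 0.
def remaining_generation_budget(max_model_len: int, prompt_len: int,
                                requested_max_tokens: int | None = None) -> int:
    room = max_model_len - prompt_len  # 0 when prompt_len == max_model_len
    return room if requested_max_tokens is None else min(requested_max_tokens, room)

assert remaining_generation_budget(max_model_len=4096, prompt_len=4096) == 0
# vLLM then rejects the sampling params with "max_tokens must be at least 1, got 0".
```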
After this change, `OpenAIServingChatWithTokens.render_chat_request()` validates rendered standard chat prompts before vLLM reaches `get_max_tokens`. If the prompt has no generation room, it raises `VLLMValidationError(parameter="input_tokens")`, which vLLM serializes as a 400 `BadRequestError` with a context-length message: `This model's maximum context length is ... However, your request has ... input tokens ... (parameter=input_tokens, value=...)`. The verifiers OpenAI client already classifies messages containing `maximum context length` / `context length` as `OverlongPromptError`, and the env handles that as `prompt_too_long` instead of a generic model failure.
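A minimal sketch of that guard (the helper name comes from this PR's overview; the signature and the error class below are simplified assumptions, not the actual vLLM/prime-rl code, which raises `VLLMValidationError(parameter="input_tokens")`):

```python
# Simplified sketch of the context-length guard. In the real serving code the
# raised error is vLLM's validation error, which the server serializes as a
# 400 BadRequestError with a context-length message.
class PromptHasNoGenerationRoom(ValueError):
    pass

def _validate_prompt_has_generation_room(prompt_token_ids: list[int], max_model_len: int) -> None:
    prompt_len = len(prompt_token_ids)
    if prompt_len >= max_model_len:
        raise PromptHasNoGenerationRoom(
            f"This model's maximum context length is {max_model_len} tokens. "
            f"However, your request has {prompt_len} input tokens. "
            f"(parameter=input_tokens, value={prompt_len})"
        )
```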
Impact

Long multi-turn rollouts that reach `max_model_len` now stop through the existing overlong-prompt path instead of sending `max_tokens=0` to vLLM. In the observed SLURM run this was not the dominant throughput bottleneck, but each occurrence can still be expensive because group scoring discards partial group progress and reschedules.

Verification
- `uv run pytest tests/unit/inference/test_serving_chat_with_tokens.py tests/unit/inference/test_serving_generate.py`
- `uv run ruff format --check src/prime_rl/inference/vllm/serving_chat_with_tokens.py tests/unit/inference/test_serving_chat_with_tokens.py`
- `uv run ruff check src/prime_rl/inference/vllm/serving_chat_with_tokens.py tests/unit/inference/test_serving_chat_with_tokens.py`

Note
Medium Risk
Changes request validation behavior in the OpenAI chat serving path; could alter which requests are rejected and the error shape returned, but is localized and covered by unit tests.
Overview
Adds an explicit context-length guard for standard `/v1/chat/completions` requests by validating rendered `engine_prompts` in `render_chat_request` and returning a `VLLMValidationError(parameter="input_tokens")` when the prompt already fills the model context (avoiding `max_tokens=0` underflow errors).

Refactors the same check into `_validate_prompt_has_generation_room()` and reuses it in the token-in (`/chat/completions/tokens`) path, while deferring validation there until after `request.tokens` are stitched in so the check reflects the actual prompt tokens. Includes new unit tests covering both the guard behavior and the token-endpoint deferral.
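For illustration, a rough sketch of the deferred check on the token-in path, reusing the guard sketched above (the function name and the concatenation-style stitching here are assumptions about the shape of the code, not the actual implementation):

```python
# Rough sketch of the ordering change on /chat/completions/tokens: stitch the
# caller-supplied request.tokens into the rendered prompt first, then validate,
# so the guard sees the actual request tokens rather than the pre-stitch prompt.
def render_tokens_request(rendered_prompt_ids: list[int],
                          request_tokens: list[int],
                          max_model_len: int) -> list[int]:
    engine_prompt_ids = rendered_prompt_ids + list(request_tokens)  # stitch tokens in (simplified)
    _validate_prompt_has_generation_room(engine_prompt_ids, max_model_len)  # deferred guard
    return engine_prompt_ids
```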