Reject full-context chat prompts before max_tokens underflows #2402

Open

rasdani wants to merge 1 commit into main from fix/chat-max-tokens-zero

Conversation

@rasdani (Contributor) commented May 3, 2026

Summary

  • reject rendered standard chat prompts that already fill the model context before vLLM derives max_tokens=0
  • preserve /chat/completions/tokens behavior by validating only after the stitched request.tokens are installed, so the token-in/token-out (TITO) path checks the actual request tokens
  • add focused coverage for the context-limit guard and token-endpoint deferral

Why This Fixes The Bug

Before this change, the standard /v1/chat/completions serving path rendered the prompt and then called vLLM's get_max_tokens(...). When prompt_len == max_model_len, that helper can return 0; request.to_sampling_params(max_tokens=0) then surfaces as:

BadRequestError: max_tokens must be at least 1, got 0
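
For illustration, the underflow reduces to simple arithmetic (a minimal sketch; it assumes the helper effectively caps the generation budget at the remaining context, matching the behavior described above rather than quoting vLLM's implementation):

```python
# Sketch of the before-state, not vLLM's actual get_max_tokens implementation.
max_model_len = 4096
prompt_len = 4096                          # rendered prompt already fills the context

max_tokens = max_model_len - prompt_len    # remaining generation room -> 0

# request.to_sampling_params(max_tokens=0) then fails downstream with
# "max_tokens must be at least 1, got 0".
```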

The rollout env sees that error as a generic model failure, so group-scored rollouts can end up rescheduling an otherwise nearly complete group.

After this change, OpenAIServingChatWithTokens.render_chat_request() validates rendered standard chat prompts before vLLM reaches get_max_tokens. If the prompt has no generation room, it raises VLLMValidationError(parameter="input_tokens"), which vLLM serializes as a 400 BadRequestError with a context-length message:

This model's maximum context length is ... However, your request has ... input tokens ... (parameter=input_tokens, value=...)
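
A minimal sketch of that guard (the helper and error names follow this description; the stand-in error class and exact signature are assumptions, not the repo code):

```python
# Sketch only; the real guard lives in OpenAIServingChatWithTokens and raises
# vLLM's validation error type rather than this stand-in.
class VLLMValidationError(ValueError):        # stand-in for the actual error class
    def __init__(self, message: str, parameter: str, value: int) -> None:
        super().__init__(message)
        self.parameter = parameter
        self.value = value


def _validate_prompt_has_generation_room(prompt_len: int, max_model_len: int) -> None:
    """Reject prompts that leave no room for generation (illustrative)."""
    if prompt_len >= max_model_len:
        raise VLLMValidationError(
            f"This model's maximum context length is {max_model_len} tokens. "
            f"However, your request has {prompt_len} input tokens.",
            parameter="input_tokens",
            value=prompt_len,
        )
```

In render_chat_request() the check runs on the rendered engine prompts before vLLM ever derives max_tokens.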

The verifiers OpenAI client already classifies error messages containing "maximum context length" / "context length" as OverlongPromptError, and the env handles that as prompt_too_long instead of a generic model failure.
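
A rough sketch of that client-side check (the substring match shown is an assumption about the verifiers client's behavior, not its source):

```python
def is_overlong_prompt_error(message: str) -> bool:
    """Sketch: classify a 400 message as an overlong-prompt error by substring match."""
    return "maximum context length" in message or "context length" in message
```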

Impact

Long multi-turn rollouts that reach max_model_len now stop through the existing overlong-prompt path instead of sending max_tokens=0 to vLLM. In the observed SLURM run this was not the dominant throughput bottleneck, but each occurrence can still be expensive because group scoring discards partial group progress and reschedules.

Verification

  • uv run pytest tests/unit/inference/test_serving_chat_with_tokens.py tests/unit/inference/test_serving_generate.py
  • uv run ruff format --check src/prime_rl/inference/vllm/serving_chat_with_tokens.py tests/unit/inference/test_serving_chat_with_tokens.py
  • uv run ruff check src/prime_rl/inference/vllm/serving_chat_with_tokens.py tests/unit/inference/test_serving_chat_with_tokens.py

Note

Medium Risk
Changes request validation behavior in the OpenAI chat serving path; could alter which requests are rejected and the error shape returned, but is localized and covered by unit tests.

Overview
Adds an explicit context-length guard for standard /v1/chat/completions requests by validating rendered engine_prompts in render_chat_request and returning a VLLMValidationError(parameter="input_tokens") when the prompt already fills the model context (avoiding max_tokens=0 underflow errors).

Refactors the same check into _validate_prompt_has_generation_room() and reuses it in the token-in (/chat/completions/tokens) path, while deferring validation there until after request.tokens are stitched in, so the check reflects the actual prompt tokens. Includes new unit tests covering both the guard behavior and the token-endpoint deferral.
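
A sketch of that ordering in the token-in path, reusing the guard sketched above (function and field names are illustrative, not the module's actual signatures):

```python
# Sketch of the /chat/completions/tokens ordering described above; illustrative only.
def render_tokens_request(
    engine_prompt: dict, request_tokens: list[int], max_model_len: int
) -> dict:
    # 1. The context-length guard is NOT run on the rendered chat prompt here.
    # 2. Stitch the caller-provided request.tokens into the engine prompt first.
    engine_prompt["prompt_token_ids"] = request_tokens
    # 3. Only then validate, so TITO checks the actual request tokens.
    _validate_prompt_has_generation_room(len(request_tokens), max_model_len)
    return engine_prompt
```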

Reviewed by Cursor Bugbot for commit 82d8849.

@rasdani rasdani requested review from mikasenghaas and samsja May 3, 2026 16:20
@rasdani rasdani marked this pull request as ready for review May 3, 2026 16:20