feat: add stream idle timeout with fallover to next model#2169

Open
albe2669 wants to merge 13 commits into envoyproxy:main from albe2669:main

@albe2669 albe2669 commented May 28, 2026

Description

Adds a new optional streamIdleTimeout field to AIGatewayRouteRule so callers can bound how long a streaming response may go without receiving any upstream bytes.

Why

Today the only deadline on an LLM streaming call is Timeouts.Request, which bounds the whole response. If a model is slow to emit its first token, the client waits up to the overall budget before failing. We want to fail fast on idle/slow upstreams and (with retry configured) fall over to the next backend in the rule before any response headers reach the client.

How

The AI Gateway extension server sets route.retry_policy.per_try_idle_timeout on the generated xDS routes. In the PostTranslateModify hook it walks the final RouteConfigurations, identifies each AIGatewayRoute route by its rule index, looks up the AIGatewayRoute, and for any rule with streamIdleTimeout set it sets per_try_idle_timeout to that value. The timeout is merged into the route's existing retry policy, so retry config produced by a BackendTrafficPolicy (retryOn, numRetries) is preserved.
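
The merge step described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `Route` and `RetryPolicy` are simplified stand-ins for the generated xDS types, and `applyStreamIdleTimeout` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"time"
)

// RetryPolicy is a simplified stand-in for a route's retry policy.
// Only the fields mentioned in the description are modelled.
type RetryPolicy struct {
	RetryOn           string
	NumRetries        uint32
	PerTryIdleTimeout time.Duration // zero means "not set"
}

// Route is a simplified stand-in for a generated xDS route.
type Route struct {
	Name        string
	RetryPolicy *RetryPolicy
}

// applyStreamIdleTimeout (hypothetical name) merges the timeout into the
// route's retry policy. A nil timeout leaves the route untouched, and any
// retry settings that are already present are preserved.
func applyStreamIdleTimeout(r *Route, timeout *time.Duration) {
	if timeout == nil {
		return
	}
	if r.RetryPolicy == nil {
		r.RetryPolicy = &RetryPolicy{}
	}
	r.RetryPolicy.PerTryIdleTimeout = *timeout
}

func main() {
	// Retry settings as a BackendTrafficPolicy might have produced them.
	r := &Route{Name: "example", RetryPolicy: &RetryPolicy{RetryOn: "reset", NumRetries: 2}}
	d := 5 * time.Second
	applyStreamIdleTimeout(r, &d)
	fmt.Println(r.RetryPolicy.RetryOn, r.RetryPolicy.NumRetries, r.RetryPolicy.PerTryIdleTimeout)
	// prints: reset 2 5s
}
```

Writing a single field instead of replacing the whole policy is what keeps retryOn and numRetries intact.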

Behavior

  • A rule with streamIdleTimeout set has per_try_idle_timeout applied to its routes.
  • A rule without it is left untouched.
  • If the timer fires before the first response byte arrives, Envoy resets the upstream stream before any headers leave the gateway. A BackendTrafficPolicy with retryOn covering reset will transparently fall over to the next backend (model) in the rule.
  • If the timer fires mid-stream after bytes have already arrived, the stream is cut and the client receives a 504.
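
The two timer cases above can be modelled with an idle timer that restarts on every upstream chunk. The sketch below is an illustration only, not Envoy's implementation: `classify`, the `Outcome` names, and the choice to treat a gap exactly equal to the timeout as still alive are assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

// Outcome describes what an idle timer would do to one upstream attempt.
type Outcome int

const (
	Completed            Outcome = iota // no gap exceeded the timeout
	ResetBeforeFirstByte                // timer fired before any byte arrived
	CutMidStream                        // timer fired after bytes had arrived
)

func (o Outcome) String() string {
	return [...]string{"Completed", "ResetBeforeFirstByte", "CutMidStream"}[o]
}

// classify replays chunk arrival offsets (ascending, measured from the start
// of the attempt) against an idle timer that restarts on every chunk.
// end is the offset at which the upstream would close the stream.
func classify(idle, end time.Duration, arrivals ...time.Duration) Outcome {
	fired := func(sawBytes bool) Outcome {
		if sawBytes {
			return CutMidStream
		}
		return ResetBeforeFirstByte
	}
	last := time.Duration(0)
	for i, at := range arrivals {
		if at-last > idle {
			return fired(i > 0)
		}
		last = at
	}
	if end-last > idle {
		return fired(len(arrivals) > 0)
	}
	return Completed
}

func main() {
	s := time.Second
	fmt.Println(classify(5*s, 90*s))               // backend never answers
	fmt.Println(classify(5*s, 21*s, 1*s, 2*s, 20*s)) // stalls between chunks
	fmt.Println(classify(5*s, 3*s, 1*s, 2*s))        // healthy stream
}
```

Only the first outcome happens before response headers leave the gateway, which is why it is the only one a retry can turn into a fallover to the next backend.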

Testing method

  • Unit tests for the route mutation (timeout applied, existing retry policy preserved, no-timeout/non-forwarding/missing-route cases) plus the existing suite.
  • Created an example with two tiny applications, one of which never responds and has first priority in the backend list. With streamIdleTimeout set to 5 seconds, requests should fall over to the other application. This ran successfully. I can add it as an example if that makes sense.
  • Verified end-to-end on a local cluster: Envoy's route config dump shows per_try_idle_timeout=5s on the rule (with retry_on/num_retries intact), and a streaming request fails over to the healthy backend after ~5s.

Notes

  • Generative AI was used to assist in writing this change.
  • The route-name prefix encodes Envoy Gateway's internal xDS naming convention. If that schema changes, the lookup will no longer match the rule. Unlike the JSONPatch approach, however, this is a normal code path that can log or observe the miss rather than fail silently.
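
To make the route-name note concrete, here is a sketch of a lookup that reports a miss instead of failing silently. The name layout `httproute/<namespace>/<name>/rule/<index>/...` is an assumption about Envoy Gateway's internal convention, and `parseRouteName` and `routeRef` are hypothetical names, not the ones used in this PR.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// routeRef identifies the rule that a generated route came from.
type routeRef struct {
	Namespace, Name string
	RuleIndex       int
}

// parseRouteName extracts namespace, name and rule index from a route name of
// the assumed form "httproute/<namespace>/<name>/rule/<index>/...".
// ok is false for any name that does not match, so the caller can log the miss.
func parseRouteName(name string) (ref routeRef, ok bool) {
	parts := strings.Split(name, "/")
	if len(parts) < 5 || parts[0] != "httproute" || parts[3] != "rule" {
		return routeRef{}, false
	}
	if parts[1] == "" || parts[2] == "" {
		return routeRef{}, false
	}
	idx, err := strconv.Atoi(parts[4])
	if err != nil || idx < 0 {
		return routeRef{}, false
	}
	return routeRef{Namespace: parts[1], Name: parts[2], RuleIndex: idx}, true
}

func main() {
	names := []string{
		"httproute/default/my-route/rule/1/match/0/*",
		"some/other/name",
	}
	for _, n := range names {
		if ref, ok := parseRouteName(n); ok {
			fmt.Printf("%s/%s rule %d\n", ref.Namespace, ref.Name, ref.RuleIndex)
		} else {
			fmt.Printf("skipping route %q: name does not match the expected layout\n", n)
		}
	}
}
```

Checking the segment count and the literal segments before indexing also avoids an out-of-range panic if the naming scheme ever changes.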

albe2669 added 2 commits May 28, 2026 14:51
Signed-off-by: albe2669 <albert@risenielsen.dk>
Signed-off-by: albe2669 <albert@risenielsen.dk>
@albe2669 albe2669 marked this pull request as ready for review May 29, 2026 06:55
@albe2669 albe2669 requested a review from a team as a code owner May 29, 2026 06:55
@dosubot dosubot Bot added the size:L This PR changes 100-499 lines, ignoring generated files. label May 29, 2026

nacx commented Jun 2, 2026

Member

Can you provide a full example config that leverages this use case?

Also, using the EnvoyPatchPolicy is quite error-prone and could lead to issues when upgrading, etc. AIGW uses the extension server instead to patch configurations before sending them via xDS to the data plane. Could you update the proposal to use the extension server, as it would be more reliable?


codecov-commenter commented Jun 2, 2026


Codecov Report

❌ Patch coverage is 96.61017% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.73%. Comparing base (67e4926) to head (6b361bd).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/extensionserver/post_translate_modify.go | 95.55% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2169      +/-   ##
==========================================
+ Coverage   84.71%   84.73%   +0.02%     
==========================================
  Files         144      144              
  Lines       21204    21263      +59     
==========================================
+ Hits        17962    18017      +55     
- Misses       2161     2163       +2     
- Partials     1081     1083       +2     

Signed-off-by: albe2669 <albert@risenielsen.dk>

albe2669 commented Jun 3, 2026

Author

> Can you provide a full example config that leverages this use case?
>
> Also, using the EnvoyPatchPolicy is quite error-prone and could lead to issues when upgrading, etc. AIGW uses the extension server instead to patch configurations before sending them via xDS to the data plane. Could you update the proposal to use the extension server, as it would be more reliable?

Yeah of course. That makes much more sense, much cleaner solution. Should probably have looked a little more into how the system works. I pushed an update and updated the PR description.

In terms of the example config, I have been running this to test it: https://github.com/albe2669/ai-gateway/tree/example/examples/first_token_timeout. I can push that to this branch too if you think it's valuable to have.

For context, we want to use virtual models so that if one model fails, the request falls over to the next one. Because inference takes so long, we set a 90-120 second timeout on responses, which means that if a model never responds, a request takes at least 90 seconds before falling over to the next model.
The logical next move is to enable streaming in our client and retry if the time to first token (TTFT) exceeds a certain value. But if we do that, the retry happens on the client, so we don't fail over within the virtual model list. We just keep trying the same model, which may be unavailable or extremely slow.

This change solves that issue pretty cleanly: we fail over quickly if the model hasn't started responding after X seconds.

@nacx nacx left a comment

Member

Thanks!

}
return fmt.Errorf("failed to get AIGatewayRoute %s/%s: %w", parts[1], parts[2], err)
}
if ruleIndex >= len(aigwRoute.Spec.Rules) {

Member

When can this happen? Worth adding a comment.

Author

Added in: b4bbeaf

Most of the other functions in the same file simply say the rule index is out of range.


nacx commented Jun 3, 2026

Member

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a new FirstTokenTimeout field to AIGatewayRouteRule (in both v1alpha1 and v1beta1 APIs) to configure the maximum wait time for the first response byte before Envoy resets the upstream stream. The extension server is updated to apply this timeout as per_try_idle_timeout on generated routes, accompanied by new helper methods, deepcopy updates, CRD manifests, and unit tests. The review feedback suggests optimizing the implementation by caching retrieved AIGatewayRoute objects during the translation pass to avoid redundant Kubernetes API calls, adding defensive guards for route name parsing, and updating the unit tests accordingly.
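
The caching suggestion in this review can be sketched as a small per-pass memo. This is an illustration under stated assumptions, not the code that was pushed: `routeKey`, `routeCache`, and `newRouteCache` are hypothetical names, the type parameter stands in for the fetched AIGatewayRoute object, and the `fetch` callback stands in for the Kubernetes API call.

```go
package main

import "fmt"

// routeKey identifies an object by namespace and name.
type routeKey struct{ Namespace, Name string }

// routeCache memoizes lookups for the duration of one translation pass.
// It is not safe for concurrent use; create one per pass.
type routeCache[T any] struct {
	fetch func(routeKey) (T, error)
	items map[routeKey]T
}

func newRouteCache[T any](fetch func(routeKey) (T, error)) *routeCache[T] {
	return &routeCache[T]{fetch: fetch, items: make(map[routeKey]T)}
}

// get returns the cached value, or fetches and stores it on a miss.
// Errors are not cached, so a later call retries the fetch.
func (c *routeCache[T]) get(k routeKey) (T, error) {
	if v, ok := c.items[k]; ok {
		return v, nil
	}
	v, err := c.fetch(k)
	if err != nil {
		var zero T
		return zero, err
	}
	c.items[k] = v
	return v, nil
}

func main() {
	calls := 0
	cache := newRouteCache(func(k routeKey) (string, error) {
		calls++ // stands in for one Kubernetes API call
		return k.Namespace + "/" + k.Name, nil
	})
	// Three generated routes that all belong to the same AIGatewayRoute.
	for i := 0; i < 3; i++ {
		v, _ := cache.get(routeKey{Namespace: "default", Name: "my-route"})
		fmt.Println(v)
	}
	fmt.Println("fetches:", calls) // fetches: 1
}
```

Scoping the cache to a single pass keeps the data fresh across passes while still collapsing repeated lookups for routes that share an AIGatewayRoute.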

Comment thread internal/extensionserver/post_translate_modify.go Outdated
Comment thread internal/extensionserver/post_translate_modify.go Outdated
Comment thread internal/extensionserver/extensionserver_test.go
albe2669 added 4 commits June 4, 2026 10:52
Signed-off-by: albe2669 <albert@risenielsen.dk>
Signed-off-by: albe2669 <albert@risenielsen.dk>
Signed-off-by: albe2669 <albert@risenielsen.dk>
Signed-off-by: albe2669 <albert@risenielsen.dk>
@albe2669 albe2669 requested a review from nacx June 4, 2026 09:01

albe2669 commented Jun 4, 2026

Author

What is your policy on AI reviews? Shall I resolve them, does it do that itself, or do you check and do it?

Signed-off-by: albe2669 <albert@risenielsen.dk>
@albe2669 albe2669 changed the title feat: add timeout to first token with rollover to new model feat: add stream idle timeout with fallover to next model Jun 10, 2026

albe2669 commented Jun 10, 2026

Author

FYI: I have renamed the parameter to the more appropriate streamIdleTimeout, as it will also fail the request if the stream is idle for too long between tokens.

Also, sorry for the force push. I forgot the sign-off.

@nacx nacx left a comment

Member

Thanks!

Overall LGTM. One last thing left:

> Can you provide a full example config that leverages this use case?

Can you add some example for this use case?


nacx commented Jun 10, 2026

Member

> What is your policy on AI reviews? Shall I resolve them, does it do that itself, or do you check and do it?

AI reviews are just reviews, but they will not get auto-resolved. Thanks for addressing the comments!

albe2669 added 3 commits June 19, 2026 09:34
Signed-off-by: albe2669 <albert@risenielsen.dk>
Signed-off-by: albe2669 <albert@risenielsen.dk>
@albe2669

Author

> Thanks!
>
> Overall LGTM. One last thing left:
>
> > Can you provide a full example config that leverages this use case?
>
> Can you add some example for this use case?

Sorry, I've been away.

Examples added!


nacx commented Jun 22, 2026

Member

This is great. Thanks for the example!
I have a final ask. We want the examples to keep working, so we usually cover them with e2e tests. This feature also adds a config field to the API, which should be e2e-tested as well. Could you add an end-to-end test for this example here? https://github.com/envoyproxy/ai-gateway/tree/main/tests/e2e
You'll see that many e2e tests there use the example files to exercise the functionality, which keeps the examples up to date.
