feat: add stream idle timeout with fallover to next model #2169
albe2669 wants to merge 13 commits into
Signed-off-by: albe2669 <albert@risenielsen.dk>
Can you provide a full example config that leverages this use case? Also, using the |
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2169      +/-   ##
==========================================
+ Coverage   84.71%   84.73%   +0.02%
==========================================
  Files         144      144
  Lines       21204    21263      +59
==========================================
+ Hits        17962    18017      +55
- Misses       2161     2163       +2
- Partials     1081     1083       +2
Yeah, of course. That makes much more sense and is a much cleaner solution. I should probably have looked a little more into how the system works. I pushed an update and updated the PR description.

As for the example config, I have been running this to test it: https://github.com/albe2669/ai-gateway/tree/example/examples/first_token_timeout. I can push that to this branch too if you think it's valuable to have.

For context, we want to use virtual models so that if one model fails, the request fails over to the next one. Because inference takes so long, a 90-120 second timeout is set on the responses. If the model never responds, a request therefore takes 90 seconds before failing over to the next model. This change solves that issue pretty cleanly: we fail over fast if the model has not started responding after X seconds.
        }
        return fmt.Errorf("failed to get AIGatewayRoute %s/%s: %w", parts[1], parts[2], err)
    }
    if ruleIndex >= len(aigwRoute.Spec.Rules) {
When can this happen? Worth adding a comment.
Added in: b4bbeaf
Most of the other functions in the same file simply say the rule index is out of range.
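To make the guard under discussion concrete, here is a minimal sketch of parsing a generated route name and checking the rule index against the number of rules. The `httproute/<namespace>/<name>/rule/<index>/...` layout and the helper names `parseRouteName` and `checkRuleIndex` are assumptions for illustration, not the code in this PR. One plausible way the check could fire is when the route name and the fetched AIGatewayRoute were produced from different versions of the resource.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseRouteName extracts namespace, name, and rule index from a generated
// route name. The "httproute/<ns>/<name>/rule/<index>/..." layout is an
// assumption made for this sketch.
func parseRouteName(routeName string) (namespace, name string, ruleIndex int, err error) {
	parts := strings.Split(routeName, "/")
	if len(parts) < 5 || parts[0] != "httproute" || parts[3] != "rule" {
		return "", "", 0, fmt.Errorf("unexpected route name %q", routeName)
	}
	ruleIndex, err = strconv.Atoi(parts[4])
	if err != nil || ruleIndex < 0 {
		return "", "", 0, fmt.Errorf("invalid rule index in route name %q", routeName)
	}
	return parts[1], parts[2], ruleIndex, nil
}

// checkRuleIndex reports an error when the index taken from the route name
// does not address an existing rule.
func checkRuleIndex(ruleIndex, numRules int) error {
	if ruleIndex >= numRules {
		return fmt.Errorf("rule index %d out of range (route has %d rules)", ruleIndex, numRules)
	}
	return nil
}

func main() {
	ns, name, idx, err := parseRouteName("httproute/default/my-route/rule/2/match/0/*")
	fmt.Println(ns, name, idx, err)
	// The route has only two rules here, so index 2 is rejected.
	fmt.Println(checkRuleIndex(idx, 2))
}
```

Returning an error instead of indexing directly keeps a stale or malformed route name from panicking the translation pass.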
|
/gemini review |
Code Review
This pull request introduces a new FirstTokenTimeout field to AIGatewayRouteRule (in both v1alpha1 and v1beta1 APIs) to configure the maximum wait time for the first response byte before Envoy resets the upstream stream. The extension server is updated to apply this timeout as per_try_idle_timeout on generated routes, accompanied by new helper methods, deepcopy updates, CRD manifests, and unit tests. The review feedback suggests optimizing the implementation by caching retrieved AIGatewayRoute objects during the translation pass to avoid redundant Kubernetes API calls, adding defensive guards for route name parsing, and updating the unit tests accordingly.
What is your policy on AI reviews? Should I resolve them myself, do they get resolved automatically, or do you check and resolve them?
FYI: I have renamed the parameter to the more appropriate Also, sorry for the force push. I forgot the sign-off.
nacx
left a comment
Thanks!
Overall LGTM. One last thing left:
Can you provide a full example config that leverages this use case?
Can you add an example for this use case?
AI reviews are just reviews, but they will not get auto-resolved. Thanks for addressing the comments! |
Sorry, I've been away. Examples added! |
|
This is great. Thanks for the example! |
Description
Adds a new optional streamIdleTimeout field to AIGatewayRouteRule so callers can bound the time without receiving any upstream bytes on a streaming response.

Why
Today the only deadline on an LLM streaming call is Timeouts.Request, which bounds the whole response. If a model is slow to emit its first token, the client waits up to the overall budget before failing. We want to fail fast on idle/slow upstreams and (with retry configured) fall over to the next backend in the rule before any response headers reach the client.

How
The AI Gateway extension server sets route.retry_policy.per_try_idle_timeout on the generated xDS routes. In the PostTranslateModify hook it walks the final RouteConfigurations, identifies each AIGatewayRoute route by its rule index, looks up the AIGatewayRoute, and for any rule with streamIdleTimeout set it sets per_try_idle_timeout to that value. The timeout is merged into the route's existing retry policy, so retry config produced by a BackendTrafficPolicy (retryOn, numRetries) is preserved.

Behavior
streamIdleTimeout set has per_try_idle_timeout applied to its routes.

Testing method
streamIdleTimeout to 5 seconds, it should then fall over to the other application. This ran successfully. I can add it as an example if that makes sense.
per_try_idle_timeout=5s on the rule (with retry_on/num_retries intact), and a streaming request fails over to the healthy backend after ~5s.

Notes