Ordering of filters in the request flow #10
david-martin
started this conversation in
General
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
I'm wondering if there are use cases where the ordering of filters will need to be different than the below flow diagram of an inference/chat completion request?
There are definitely cases where some steps will be skipped/off.
Explanation of each filter step below the diagram.
I've deliberately left out implementation detail like if wasm-shim is involved, what protocol is used and what service(s) are called out, so as to focus on the flow.
flowchart TD A subgraph Request A2 B B2 C D D2 F G G1 end G2 subgraph Response I J K end L A["Incoming chat/completion request"] --> A2["Infrastructure rate limiting"] A2 --> B["Infrastructure token rate limiting"] B --> B2["AuthN/Z"] B2 --> C["Auth based rate limiting"] C --> D["Auth based token rate limiting"] D --> D2["Model selection"] D2 --> F["Prompt guard"] F --> G["Semantic cache"] G --> G1["Scheduling (Endpoint Picker)"] G1 --> G2["Model server"] G2 --> I["Increment token usage counter"] I --> J["Response risk check"] J --> K["Populate semantic cache"] K --> L["Flush response"]Infrastructure rate limiting
Request rate limiting that doesn't take any auth context into account.
Typically used to protect infrastructure.
Infrastructure token rate limiting
Token usage based rate limiting that doesn't take any auth context into account.
Typically used to protect infrastructure.
Token usage based limiting is reactive, in that token counts are updated at the end of a req/res flow, impacting the next request.
AuthN/Z
Authorisation and Authentication
Auth based rate limiting
Request rate limiting that takes the auth context into account.
Auth based token rate limiting
Token usage based rate limiting that takes auth context into account.
Token usage based limiting is reactive, in that token counts are updated at the end of a req/res flow, impacting the next request.
Model selection
Select the best matching model/group of models based on semantic similarity or predefined rules.
Prompt guard
Parse the prompt from the request body and send it to an risk checking/guard/guardian LLM to see if its classified as 'risky'
Scheduling (Endpoint Picker)
Lookup the 'best' model to send the request to based on the desired model & various metrics from the underlying models (like queue size, kvcache size, lora adapters). There may be session stickiness as well that short circuits this step.
This is what the GIE project is primarily about at this time.
Semantic cache
Check the prompt to see if its semantically the same request as something that has already been served, and has a response available in the cache.
Uses an embedding LLM to lookup vectors for a prompt, and a vector db to compare vectors and cache responses.
Model server
The actual model server that will do inference to generate a response.
Could be some model locally available, like in a kserve instance, or could be remote provided by some model provider e.g. openAI, AWS Bedrock.
Increment token usage counter
Parses the response body (json) to find the token usage stats, and increments some counter.
The counter context/key can have many inputs like headers, model or auth context.
Response risk check
Parse the response body and send it to an risk checking/guard/guardian LLM to see if its classified as 'risky'
Populate semantic cache
Populate the semantic cache with the response body, using the prompt embeddings vector as the key.
Other details like the model or auth context can be taken into account as well.
Beta Was this translation helpful? Give feedback.
All reactions