Ordering of filters in the request flow #10

david-martin · 2025-05-01T12:37:41Z

david-martin
May 1, 2025
Maintainer

I'm wondering if there are use cases where the ordering of filters will need to be different than the below flow diagram of an inference/chat completion request?
There are definitely cases where some steps will be skipped/off.

Explanation of each filter step below the diagram.
I've deliberately left out implementation detail like if wasm-shim is involved, what protocol is used and what service(s) are called out, so as to focus on the flow.

flowchart TD
    A
    subgraph Request
        A2
        B
        B2
        C
        D
        D2
        F
        G
        G1
    end
    G2
    subgraph Response
        I
        J
        K
    end
    L

A["Incoming chat/completion request"] --> A2["Infrastructure rate limiting"]
A2 --> B["Infrastructure token rate limiting"]
B --> B2["AuthN/Z"]
B2 --> C["Auth based rate limiting"]
C --> D["Auth based token rate limiting"]
D --> D2["Model selection"]
D2 --> F["Prompt guard"]
F --> G["Semantic cache"]
G --> G1["Scheduling (Endpoint Picker)"]
G1 --> G2["Model server"]
G2 --> I["Increment token usage counter"]
I --> J["Response risk check"]
J --> K["Populate semantic cache"]
K --> L["Flush response"]

Infrastructure rate limiting

Request rate limiting that doesn't take any auth context into account.
Typically used to protect infrastructure.

Infrastructure token rate limiting

Token usage based rate limiting that doesn't take any auth context into account.
Typically used to protect infrastructure.
Token usage based limiting is reactive, in that token counts are updated at the end of a req/res flow, impacting the next request.

AuthN/Z

Authorisation and Authentication

Auth based rate limiting

Request rate limiting that takes the auth context into account.

Auth based token rate limiting

Token usage based rate limiting that takes auth context into account.
Token usage based limiting is reactive, in that token counts are updated at the end of a req/res flow, impacting the next request.

Model selection

Select the best matching model/group of models based on semantic similarity or predefined rules.

Prompt guard

Parse the prompt from the request body and send it to an risk checking/guard/guardian LLM to see if its classified as 'risky'

Scheduling (Endpoint Picker)

Lookup the 'best' model to send the request to based on the desired model & various metrics from the underlying models (like queue size, kvcache size, lora adapters). There may be session stickiness as well that short circuits this step.
This is what the GIE project is primarily about at this time.

Semantic cache

Check the prompt to see if its semantically the same request as something that has already been served, and has a response available in the cache.
Uses an embedding LLM to lookup vectors for a prompt, and a vector db to compare vectors and cache responses.

Model server

The actual model server that will do inference to generate a response.
Could be some model locally available, like in a kserve instance, or could be remote provided by some model provider e.g. openAI, AWS Bedrock.

Increment token usage counter

Parses the response body (json) to find the token usage stats, and increments some counter.
The counter context/key can have many inputs like headers, model or auth context.

Response risk check

Parse the response body and send it to an risk checking/guard/guardian LLM to see if its classified as 'risky'

Populate semantic cache

Populate the semantic cache with the response body, using the prompt embeddings vector as the key.
Other details like the model or auth context can be taken into account as well.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Kuadrant

Ordering of filters in the request flow #10

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

Kuadrant

Ordering of filters in the request flow #10

Uh oh!

Uh oh!

david-martin May 1, 2025 Maintainer

Infrastructure rate limiting

Infrastructure token rate limiting

AuthN/Z

Auth based rate limiting

Auth based token rate limiting

Model selection

Prompt guard

Scheduling (Endpoint Picker)

Semantic cache

Model server

Increment token usage counter

Response risk check

Populate semantic cache

Replies: 0 comments

david-martin
May 1, 2025
Maintainer