Skip to content

Feature/prefix cache with redis reqcnt#2089

Open
penfree wants to merge 16 commits intovllm-project:mainfrom
penfree:feature/prefix-cache-with-redis-reqcnt
Open

Feature/prefix cache with redis reqcnt#2089
penfree wants to merge 16 commits intovllm-project:mainfrom
penfree:feature/prefix-cache-with-redis-reqcnt

Conversation

@penfree
Copy link
Copy Markdown
Contributor

@penfree penfree commented Apr 8, 2026

Summary

This PR includes critical bugfixes for KV sync prefix cache routing and adds
Redis-based request counting support for multi-instance gateway deployments.

Changes

✨ Feature: Redis-Based Request Counter for Multi-Instance Gateway Deployments

Motivation:
When multiple AIBrix gateway instances are deployed, each instance only tracks local
request counts. This causes incorrect load imbalance detection since no single
instance has the complete view of pod request distribution.

Implementation:

  • Added RequestCounter interface with two implementations:
    • localRequestCounter: Uses local metrics (existing behavior)
    • redisRequestCounter: Uses Redis hash to track distributed request counts
  • Modified prefix cache router to accept request counter via dependency injection
  • Refactored load balancing functions to accept map[string]int instead of
    cache.Cache
  • Environment variable AIBRIX_PREFIX_CACHE_USE_REDIS_REQCNT=true enables
    Redis-based counting

Key Features:

  • Redis key design: aibrix:prefix-cache-reqcnt:{model_name} (hashmap: pod_key →
    count)
  • Supports Redis Cluster via hashtag {model_name} for same-slot operations
  • Automatic cleanup when request count reaches zero
  • Timeout mechanism to stop tracking inactive models (5 minutes)
  • Returns all ready pods with count=0 for pods not in Redis

Files Changed:

  • pkg/plugins/gateway/algorithms/request_counter.go (new file): RequestCounter
    interface and implementations
  • pkg/plugins/gateway/algorithms/prefix_cache.go: Integration with request counter
  • pkg/plugins/gateway/algorithms/power_of_two.go: Updated podServerKey.String()
    to include namespace

Testing

  • ✅ All existing tests pass
  • ✅ Updated test fixtures to use requestCounter
  • ✅ Fixed test expectations for namespace-aware pod keys

Breaking Changes

None. Changes are backward compatible:

  • Default behavior remains unchanged (uses local request counter)
  • Redis-based counting is opt-in via environment variable

Related Issues

Closes: [Issue number if applicable]

Deployment Notes

To enable Redis-based request counting:

  1. Ensure Redis client is configured
  2. Set environment variable: AIBRIX_PREFIX_CACHE_USE_REDIS_REQCNT=true
  3. Multiple gateway instances will now share request count information

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a RequestCounter interface and a Redis-based implementation to track pod request counts for load balancing in the prefix cache router. It also refactors the routing logic to use a shared sync indexer and improves pod identification by including namespaces in keys. The review feedback identifies several critical bugs: incorrect key handling in both the KV sync and standard routing paths which would break pod selection, an interface violation in the local request counter regarding namespace support, and potential Redis memory leaks due to missing TTLs. Additionally, logic was flagged that could lead to counter drift if request completion is tracked without a corresponding successful initiation.

@penfree penfree marked this pull request as ready for review April 8, 2026 09:36
@penfree
Copy link
Copy Markdown
Contributor Author

penfree commented Apr 8, 2026

@Jeffwan hi, please review this pr

@varungup90
Copy link
Copy Markdown
Collaborator

@penfree Thank you for the PR, is it possible to split the PR into two. Remove the redis counter into different PR, also I am started a PR to sync in-memory state across gateway instances (#1989) -- in that PR, it syncs in-memory prefix cache stats, but can be extended to share request counter as well.

@varungup90
Copy link
Copy Markdown
Collaborator

varungup90 commented Apr 8, 2026

Summary
This PR fixes a significant bug in kvSyncPrefixCacheRouter (it was querying an empty indexer instead of the shared one from the cache store) and adds a RequestCounter interface with Redis-backed distributed request counting across multiple gateway instances.

Critical Issues

  1. KV sync key format mismatch — pkg/plugins/gateway/algorithms/prefix_cache.go ~line 715

GetRequestCounts() now returns keys in namespace/name format, but the loop that builds podRequestCountWithKeys looks up podRequestCount[pod.Name] (bare name). This silently fails — every lookup misses, podRequestCountWithKeys is always empty, and the router falls back to least-request selection with no valid input.

  1. filterPodsByKey is dead code — pkg/plugins/gateway/algorithms/prefix_cache.go ~lines 635–643

The function is defined but never called. This suggests the refactoring was left incomplete. Go won't fail to compile on an unused function, but the routing code that should call it likely has the bug described in #1.

  1. KV sync nil-indexer fallback doesn't update metrics — pkg/plugins/gateway/algorithms/prefix_cache.go ~lines 254–285

When c.GetSyncPrefixIndexer() returns nil, the code logs a warning and skips creating kvSyncRouter, but prefixCacheIndexerStatus metrics are never set. The fallback to local-indexer behavior goes unrecorded.

High Priority
4. Breaking Redis key format change for Power-of-Two router — pkg/plugins/gateway/algorithms/utils.go, power_of_two.go

podServerKey was moved to utils.go and its format changed from name_port → namespace/name/port. Existing in-flight Redis keys under the old format will be silently ignored on deploy, causing a transient load spike. This needs deployment notes.

  1. lastModelRouteTime map grows unbounded — pkg/plugins/gateway/algorithms/request_counter.go ~line 1291

Updated on every GetRequestCountsWithPort call but never pruned. Long-running instances routing many different models will leak memory.

Medium Priority
6. Lua script re-evaluated on every call — request_counter.go ~line 1521

The Lua decrement script is a string literal inside DoneRequestCount(). In high-traffic environments, define it as a package-level redis.NewScript(...) and use EvalSha for Redis script caching.

  1. Expire resets TTL on every increment — request_counter.go ~line 1467

A single ongoing request continuously refreshes the TTL. Consider ExpireNX (set only if no TTL exists) or ExpireGT to avoid preventing cleanup of active keys.

  1. SyncIndexerProvider returns a concrete type — pkg/cache/cache_api.go ~line 71

GetSyncPrefixIndexer() *syncindexer.SyncPrefixHashTable tightly couples the interface to the implementation. An interface type would enable test injection and better separation of concerns.

  1. redismock in production require block — go.mod

github.com/go-redis/redismock/v9 is a test-only library but placed in the main require block, adding it as a production binary dependency.

@penfree
Copy link
Copy Markdown
Contributor Author

penfree commented Apr 9, 2026

@varungup90 Thanks for your patience。

Summary This PR fixes a significant bug in kvSyncPrefixCacheRouter (it was querying an empty indexer instead of the shared one from the cache store) and adds a RequestCounter interface with Redis-backed distributed request counting across multiple gateway instances.

Critical Issues

  1. KV sync key format mismatch — pkg/plugins/gateway/algorithms/prefix_cache.go ~line 715

GetRequestCounts() now returns keys in namespace/name format, but the loop that builds podRequestCountWithKeys looks up podRequestCount[pod.Name] (bare name). This silently fails — every lookup misses, podRequestCountWithKeys is always empty, and the router falls back to least-request selection with no valid input.

i haven't found any code that used bare pod name to look up podRequestCount

(.venv) ➜  pkg git:(feature/prefix-cache-with-redis-reqcnt) ack podRequestCount | grep Name           
plugins/gateway/algorithms/pd_disaggregation_benchmark_test.go:122:                             podRequestCounts[pod.Name] = requestCount
plugins/gateway/algorithms/request_counter.go:137:                      podRequestCount[keyName] = count
plugins/gateway/algorithms/prefix_cache.go:719:         // For KV sync router, we need podRequestCount to use pod keys (namespace/podName)
plugins/gateway/algorithms/pd_disaggregation.go:425:            podRequestCounts[pod.Name] = requestCount
plugins/gateway/algorithms/pd_disaggregation.go:483:            score := podRequestCounts[pod.Name] / math.Max(drainRate.GetSimpleValue(), defaultDrainRateEpsilon)
plugins/gateway/algorithms/pd_disaggregation.go:618:            reqCnt := float64(podRequestCount[pod.Name])
plugins/gateway/algorithms/pd_disaggregation.go:650:            normalizedRunningReqs := podRequestCounts[pod.Name] / maxRequestCount
plugins/gateway/algorithms/pd_disaggregation.go:667:                    "running_reqs", podRequestCounts[pod.Name], "max_running_reqs", maxRequestCount,
plugins/gateway/algorithms/pd_disaggregation.go:1168:   countInterface, _ := t.podRequestCounts.LoadOrStore(podName, &atomic.Int32{})
plugins/gateway/algorithms/pd_disaggregation.go:1191:   countInterface, exists := t.podRequestCounts.Load(podName)
plugins/gateway/algorithms/pd_disaggregation.go:1215:           countInterface, exists := t.podRequestCounts.Load(pod.Name)
plugins/gateway/algorithms/pd_disaggregation.go:1242:   countInterface, _ := t.podRequestCounts.LoadOrStore(podName, &atomic.Int32{})
plugins/gateway/algorithms/pd_disaggregation.go:1257:   countInterface, exists := t.podRequestCounts.Load(podName)
plugins/gateway/algorithms/pd_disaggregation.go:1281:   countInterface, exists := t.podRequestCounts.Load(podName)
plugins/gateway/algorithms/least_request.go:186:                podRequestCount[pod.Name] = int(runningReq.GetSimpleValue())
plugins/gateway/algorithms/least_request.go:217:                        podRequestCount[keyName] = count
  1. filterPodsByKey is dead code — pkg/plugins/gateway/algorithms/prefix_cache.go ~lines 635–643

The function is defined but never called. This suggests the refactoring was left incomplete. Go won't fail to compile on an unused function, but the routing code that should call it likely has the bug described in #1.

removed

  1. KV sync nil-indexer fallback doesn't update metrics — pkg/plugins/gateway/algorithms/prefix_cache.go ~lines 254–285

When c.GetSyncPrefixIndexer() returns nil, the code logs a warning and skips creating kvSyncRouter, but prefixCacheIndexerStatus metrics are never set. The fallback to local-indexer behavior goes unrecorded.

Fixed

High Priority 4. Breaking Redis key format change for Power-of-Two router — pkg/plugins/gateway/algorithms/utils.go, power_of_two.go

podServerKey was moved to utils.go and its format changed from name_port → namespace/name/port. Existing in-flight Redis keys under the old format will be silently ignored on deploy, causing a transient load spike. This needs deployment notes.

Since redis key in power of two will rotate every hour, i think this is not a problem

  1. lastModelRouteTime map grows unbounded — pkg/plugins/gateway/algorithms/request_counter.go ~line 1291

Updated on every GetRequestCountsWithPort call but never pruned. Long-running instances routing many different models will leak memory.

the max size of this map is the amount of model, i don't think there can have so many models that cause a memory issue, this every entry has only a string and a int64. however, pruning it needs many code

Medium Priority 6. Lua script re-evaluated on every call — request_counter.go ~line 1521

The Lua decrement script is a string literal inside DoneRequestCount(). In high-traffic environments, define it as a package-level redis.NewScript(...) and use EvalSha for Redis script caching.

Fixed

  1. Expire resets TTL on every increment — request_counter.go ~line 1467

A single ongoing request continuously refreshes the TTL. Consider ExpireNX (set only if no TTL exists) or ExpireGT to avoid preventing cleanup of active keys.

i've added key rotation, and set ttl in lua only when the key does not exist

  1. SyncIndexerProvider returns a concrete type — pkg/cache/cache_api.go ~line 71

GetSyncPrefixIndexer() *syncindexer.SyncPrefixHashTable tightly couples the interface to the implementation. An interface type would enable test injection and better separation of concerns.

**like PrefixHashTable, a GetSharedSyncPrefixHashTable is added to get singleton of SyncPrefixHashTable, GetSyncPrefixIndexer is removed from cache_api interface.

  1. redismock in production require block — go.mod

github.com/go-redis/redismock/v9 is a test-only library but placed in the main require block, adding it as a production binary dependency.

seems this is not an issue, the redismock is only used in test, so it's not built into the production binary

@Jeffwan Jeffwan requested a review from varungup90 April 10, 2026 04:26
tokenizerPool TokenizerPoolInterface // Add TokenizerPool reference
syncIndexer *syncindexer.SyncPrefixHashTable // Use concrete type
metricsEnabled bool
requestCounter RequestCounter // Request counter for load balancing
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Local counter is not needed. This local counter is only incrementing requests for prefix-cache, compare to RealtimeNumRequestsRunning keep track of requests across all routing policies.

RealtimeNumRequestsRunning: is calculated via gateway and it is not fetched from the pod.

metricName = metrics.RealtimeNumRequestsRunning
if val, err := cache.GetMetricValueByPod(pod.Name, pod.Namespace, metricName); err == nil && val != nil {
				count = int(val.GetSimpleValue())
			}

Copy link
Copy Markdown
Contributor Author

@penfree penfree Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@varungup90 i don't think so, localCounter implements the same interface as redisCounter, so kvSyncPrefixCacheRouter can use either of them as user's need.
When only one gateway instance is deployed, they don't need redisCounter with extra latency cost. localCounter can be a dropin replacement in such cases, without using many local util functions.

kvSyncRouter *kvSyncPrefixCacheRouter

// Request counter for load balancing
requestCounter RequestCounter
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets simplify the PR, split the request counter into a different PR. This one to be more focussed on fixing bug with kv event sync.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@varungup90 already split the bugfix to #2096 , i'll rebase this pr to main branch after it's merged

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@varungup90 rebased, PTAL

pengfei913 and others added 16 commits April 13, 2026 16:50
Signed-off-by: penfree <pengfei.qiu@gmail.com>
Change-Id: I77d872965fcaebf0c455d93646a4b29e34f1d8d0
Signed-off-by: penfree <pengfei.qiu@gmail.com>
Change-Id: Ifa2c860c0f41021dcb02c84aacd8829393e1b390
Signed-off-by: penfree <pengfei.qiu@gmail.com>
Change-Id: I0e83f041ba96a471c1fb6931f89c717863ae6b67
Signed-off-by: penfree <pengfei.qiu@gmail.com>
Change-Id: I42451439b67ebb96d01262a0f2dfe5a20240aad4
Signed-off-by: penfree <pengfei.qiu@gmail.com>
Change-Id: I1e38be1fe495ab422fb7dd923dcbdf97b208e58f
Signed-off-by: penfree <pengfei.qiu@gmail.com>
Change-Id: I25205a00bef2aa181d0816c7d9e3e2eb11233e79
Signed-off-by: penfree <pengfei.qiu@gmail.com>
Change-Id: I459b8d61300755f2982155a4ec7763e47fb16f0e
Signed-off-by: penfree <penfree.qiu@gmail.com>
Change-Id: I061386e56cb0c468fb1d687295f8502974423f21
Signed-off-by: penfree <penfree.qiu@gmail.com>
Change-Id: I08a605b32965077bb8d16407e6d908ba430b6dcd
Signed-off-by: penfree <penfree.qiu@gmail.com>
Change-Id: I49584b893d6e9b7db8b46a9264a92d9f53116a5a
Signed-off-by: penfree <penfree.qiu@gmail.com>
Change-Id: I82543a8c72c5d2bcf05579f7da6d03968c118314
Signed-off-by: penfree <penfree.qiu@gmail.com>
Change-Id: I0a4c9128d40ccbe96537bc35576f8bc0b1903f2a
Signed-off-by: penfree <penfree.qiu@gmail.com>
Change-Id: Ie2274aae8b9af5cf70b9fe7d35a347cb1f113b44
…cPrefixHashTable

Signed-off-by: penfree <penfree.qiu@gmail.com>
Change-Id: I832cad5cd99b64f2a509eb51f9e87c61dfc6bf5f
…y happens to crash

Signed-off-by: penfree <penfree.qiu@gmail.com>
Change-Id: I80ff6cce0e18ab7b8303f559c9b6de17947b5242
Signed-off-by: penfree <penfree.qiu@gmail.com>
Change-Id: I89dd144e36682b838499f85efd5b07c518c4da0d
@penfree penfree force-pushed the feature/prefix-cache-with-redis-reqcnt branch from 0a6b874 to 9b1f5ea Compare April 13, 2026 09:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants