
kv-cache: bounded budget with residual-stream recomputation (KV Direct) #21097

Closed
malibujack wants to merge 5 commits into ggml-org:master from AktiveMatrix:kv-direct

Conversation

@malibujack

@malibujack malibujack commented Mar 28, 2026

Adds an optional bounded KV cache that doesn't permanently lose evicted tokens. Instead, it saves layer-0 embeddings to a host-memory ring buffer and recomputes K/V on demand when evicted positions are needed again.

The key insight is that the residual stream at layer 0 encodes everything needed to reconstruct K/V, so saving ~5 KB of residual per token lets you recover hundreds of KB of cached state. This is independently validated by arXiv:2603.19664 ("The Residual Stream Is All You Need"), which reports a 100% token match at every budget level against permanent-loss baselines such as H2O, StreamingLLM, and SnapKV.

What it does:

  • --kv-budget-tokens N or --kv-budget-mb N caps the cache at a fixed size
  • LRU eviction kicks in when over budget
  • Evicted positions get recomputed from saved residuals (batched, capped at 64 per cycle)
  • Budget=0 (default) disables everything — zero overhead, identical to current behavior

API additions:

  • llama_kv_direct_evict(ctx) — enforce budget, evict LRU positions
  • llama_kv_direct_recompute_misses(ctx) — restore evicted positions from residual pool

I've been running this in production for a few weeks on Qwen3 models with a 3090 Ti. Memory stays flat during long multi-turn conversations while output quality matches unbounded cache.

Disclosure: AI tools (Claude) assisted with code development and the initial PR description.

Replace the Phase 1 per-layer residual capture scaffolding with a single
build_kv_direct_capture_embd() call in build_inp_embd(). This captures
the input embedding tensor (layer-0 residual) into the graph result for
post-compute readback by the residual pool.

Changes:
- llama-graph.h: kv_direct_residuals vector -> kv_direct_capture pointer,
  add t_kv_direct_capture to llm_graph_result, rename method
- llama-graph.cpp: simplified constructor init, new capture method that
  allocates tensor + copies via compute graph, hook in build_inp_embd()
- Model files (qwen3, qwen3moe, llama, deepseek, gemma3): remove
  per-layer build_kv_direct_capture() calls (now unused)

After graph_compute succeeds, read the captured layer-0 embedding
tensor back from the GPU and store each token's embedding into the
residual ring buffer pool. Touch LRU timestamps for KV cells that
correspond to the batch positions, and increment the LRU step counter.

Add recompute_evicted() to llama_kv_cache that rebuilds evicted KV
entries using saved residual embeddings via llama_decode's embd path.
Add the llama_kv_direct_recompute_misses() public API for Go bindings.
@malibujack malibujack requested review from a team, CISC and ggerganov as code owners March 28, 2026 04:27
@github-actions github-actions bot added the testing Everything test related label Mar 28, 2026
@ggml-gh-bot

ggml-gh-bot bot commented Mar 28, 2026

Hi @malibujack, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@malibujack
Author

Hopefully that's okay :)

@ggerganov
Member

Nope, too much slop

@ggerganov ggerganov closed this Mar 28, 2026
@malibujack
Author

Fair enough, your loss. Take care.

@edt-xx

edt-xx commented Apr 1, 2026

@ggerganov This patch lets me run qwen3.5_Q4_K_M (50 t/s) or Q8_0 (28 t/s) with consistent throughput. The rate drops about 10% at 100,000 tokens, and the memory used by context caching is bounded (about 5 GB for -c 262144). This pull may have been generated with AI assistance, but the results, here at least, do not act like AI slop. I strongly suggest you take a look at how this works and the benefits it has.

Hardware: 7900 XT with 20 GB VRAM and a 7700 with 96 GB RAM, using b8609 Vulkan (ROCm gets a segmentation fault independent of this pull).

@ggerganov
Member

I agree, my response wasn't very appropriate - apologies.

I took a second look at the implementation and the paper and I don't understand how this works. We still have to materialize at some point the entire K and V for the full sequence. So even if we store just the residuals, at compute time we again need the full memory for the materialized keys and values. So I am failing to see how this approach works.

The observation that we can store the residuals and from them recompute the KV cache is trivial. It's almost exactly the same as saying "we can store the tokens and from them recompute the KV cache". The reason to have a KV cache in the first place is exactly to avoid this recomputation.

@malibujack
Copy link
Copy Markdown
Author

This works by being selective about what is evicted, and narrow in what is restored if needed.

