
kv-cache: bounded budget with residual-stream recomputation (KV Direct) #21097

Closed
malibujack wants to merge 5 commits into ggml-org:master from AktiveMatrix:kv-direct

Conversation

@malibujack

@malibujack malibujack commented Mar 28, 2026

Adds an optional bounded KV cache that doesn't permanently lose evicted tokens. Instead, it saves layer-0 embeddings to a host-memory ring buffer and recomputes K/V on demand when evicted positions are needed again.

The key insight is that the residual stream at layer 0 encodes everything needed to reconstruct K/V, so saving ~5 KB of residual per token lets you recover hundreds of KB of cached state. This is independently validated by arXiv:2603.19664 ("The Residual Stream Is All You Need"), which reports a 100% token match at every budget level against permanent-loss baselines such as H2O, StreamingLLM, and SnapKV.

What it does:

  • --kv-budget-tokens N or --kv-budget-mb N caps the cache at a fixed size
  • LRU eviction kicks in when over budget
  • Evicted positions get recomputed from saved residuals (batched, capped at 64 per cycle)
  • Budget=0 (default) disables everything — zero overhead, identical to current behavior

API additions:

  • llama_kv_direct_evict(ctx) — enforce budget, evict LRU positions
  • llama_kv_direct_recompute_misses(ctx) — restore evicted positions from residual pool

I've been running this in production for a few weeks on Qwen3 models with a 3090 Ti. Memory stays flat during long multi-turn conversations while output quality matches unbounded cache.

Disclosure: AI tools (Claude) assisted with code development and the initial PR description.

Replace the Phase 1 per-layer residual capture scaffolding with a single
build_kv_direct_capture_embd() call in build_inp_embd(). This captures
the input embedding tensor (layer-0 residual) into the graph result for
post-compute readback by the residual pool.

Changes:
- llama-graph.h: kv_direct_residuals vector -> kv_direct_capture pointer,
  add t_kv_direct_capture to llm_graph_result, rename method
- llama-graph.cpp: simplified constructor init, new capture method that
  allocates tensor + copies via compute graph, hook in build_inp_embd()
- Model files (qwen3, qwen3moe, llama, deepseek, gemma3): remove
  per-layer build_kv_direct_capture() calls (now unused)

After graph_compute succeeds, read the captured layer-0 embedding
tensor back from the GPU and store each token's embedding into the
residual ring buffer pool. Touch LRU timestamps for KV cells that
correspond to the batch positions, and increment the LRU step counter.

Add recompute_evicted() to llama_kv_cache that rebuilds evicted KV
entries using saved residual embeddings via llama_decode's embd path.
Add the llama_kv_direct_recompute_misses() public API for Go bindings.
@malibujack malibujack requested review from a team, CISC and ggerganov as code owners March 28, 2026 04:27
@github-actions github-actions bot added the testing Everything test related label Mar 28, 2026
@ggml-gh-bot

ggml-gh-bot bot commented Mar 28, 2026

Hi @malibujack, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@malibujack
Author

Hopefully that's okay :)

@ggerganov
Member

Nope, too much slop

@ggerganov ggerganov closed this Mar 28, 2026
@malibujack
Author

Fair enough, your loss. Take care.

@edt-xx

edt-xx commented Apr 1, 2026

@ggerganov This patch lets me run qwen3.5_Q4_K_M (50 t/s) or Q8_0 (28 t/s) with consistent throughput. The rate drops about 10% at 100,000 tokens, and the memory used by context caching is bounded (about 5 GB for -c 262144). This pull may have been generated with AI assistance, but the results, here at least, do not act like AI slop. I strongly suggest you take a look at how this works and the benefits it has.

Hardware: 7900 XT with 20 GB VRAM and a 7700 with 96 GB RAM, using b8609 Vulkan (ROCm gets a segmentation fault independent of this pull).

@ggerganov
Member

I agree, my response wasn't very appropriate - apologies.

I took a second look at the implementation and the paper and I don't understand how this works. We still have to materialize at some point the entire K and V for the full sequence. So even if we store just the residuals, at compute time we again need the full memory for the materialized keys and values. So I am failing to see how this approach works.

The observation that we can store the residuals and from them recompute the KV cache is trivial. It's almost exactly the same as saying "we can store the tokens and from them recompute the KV cache". The reason to have a KV cache in the first place is exactly to avoid this recomputation.

@malibujack
Copy link
Copy Markdown
Author

This works by being selective about what is evicted, and narrow in what is restored if needed.

