kv-cache: bounded budget with residual-stream recomputation (KV Direct)#21097
malibujack wants to merge 5 commits into ggml-org:master
Conversation
Replace the Phase 1 per-layer residual capture scaffolding with a single build_kv_direct_capture_embd() call in build_inp_embd(). This captures the input embedding tensor (layer-0 residual) into the graph result for post-compute readback by the residual pool.

Changes:
- llama-graph.h: kv_direct_residuals vector -> kv_direct_capture pointer, add t_kv_direct_capture to llm_graph_result, rename method
- llama-graph.cpp: simplified constructor init; new capture method that allocates a tensor and copies via the compute graph; hook in build_inp_embd()
- Model files (qwen3, qwen3moe, llama, deepseek, gemma3): remove per-layer build_kv_direct_capture() calls (now unused)
After graph_compute succeeds, read the captured layer-0 embedding tensor back from the GPU and store each token's embedding into the residual ring buffer pool. Touch LRU timestamps for KV cells that correspond to the batch positions, and increment the LRU step counter.
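A minimal, self-contained sketch of what such a host-side residual pool could look like. All names here (`residual_pool`, `put`, `get`) are illustrative, not the PR's actual types; the real pool lives alongside the KV cache and is filled from the GPU readback described above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical host-memory ring buffer holding one layer-0 residual
// (n_embd floats) per token position, with LRU timestamps.
struct residual_pool {
    size_t   n_embd;    // residual width per token
    size_t   capacity;  // max tokens retained
    uint64_t step = 0;  // LRU clock, bumped once per decode step

    std::vector<float>    data;          // capacity * n_embd floats
    std::vector<int32_t>  pos_of_slot;   // slot -> token position (-1 = free)
    std::unordered_map<int32_t, size_t> slot_of_pos; // token position -> slot
    std::vector<uint64_t> last_used;     // slot -> LRU timestamp
    size_t next = 0;                     // ring cursor

    residual_pool(size_t n_embd, size_t capacity)
        : n_embd(n_embd), capacity(capacity),
          data(capacity * n_embd), pos_of_slot(capacity, -1),
          last_used(capacity, 0) {}

    // Store one token's residual, overwriting the oldest slot when full.
    void put(int32_t pos, const float * embd) {
        const size_t slot = next;
        next = (next + 1) % capacity;
        if (pos_of_slot[slot] >= 0) {
            slot_of_pos.erase(pos_of_slot[slot]); // evict previous occupant
        }
        std::copy(embd, embd + n_embd, data.begin() + slot * n_embd);
        pos_of_slot[slot] = pos;
        slot_of_pos[pos]  = slot;
        last_used[slot]   = step;
    }

    // Fetch a residual for recompute; touches the LRU timestamp.
    const float * get(int32_t pos) {
        auto it = slot_of_pos.find(pos);
        if (it == slot_of_pos.end()) return nullptr;
        last_used[it->second] = step;
        return data.data() + it->second * n_embd;
    }
};
```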
Add recompute_evicted() to llama_kv_cache that rebuilds evicted KV entries using saved residual embeddings via llama_decode's embd path. Add llama_kv_direct_recompute_misses() public API for Go bindings.
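The recompute path can be illustrated with a stripped-down sketch: given a saved residual, reapply the K/V projections to rebuild the evicted entry. This is purely illustrative; real models also apply per-layer norms and rotary embeddings, and the actual patch routes the work through llama_decode's embd path rather than multiplying matrices directly. The point is only that the residual is sufficient input to redo the projection.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative: K and V for one evicted token, rebuilt from its
// saved layer-0 residual by re-running the projection step.
struct kv_entry { std::vector<float> k, v; };

// y = W x, with W stored row-major as n_out x n_in.
static std::vector<float> matvec(const std::vector<float> & w,
                                 const std::vector<float> & x,
                                 size_t n_out, size_t n_in) {
    std::vector<float> y(n_out, 0.0f);
    for (size_t i = 0; i < n_out; ++i)
        for (size_t j = 0; j < n_in; ++j)
            y[i] += w[i*n_in + j] * x[j];
    return y;
}

// Rebuild one token's K/V from its residual via the K and V projections.
kv_entry recompute_kv(const std::vector<float> & residual,
                      const std::vector<float> & w_k,
                      const std::vector<float> & w_v,
                      size_t n_kv, size_t n_embd) {
    return { matvec(w_k, residual, n_kv, n_embd),
             matvec(w_v, residual, n_kv, n_embd) };
}
```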
Hi @malibujack, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Hopefully that's okay :)
Nope, too much slop
Fair enough, your loss. Take care.
@ggerganov This patch lets me run qwen3.5_Q4_K_M (50 t/s) or Q8_0 (28 t/s) with consistent t/s. The rate drops about 10% at 100000 tokens, and the memory used by context caching is bounded (about 5 GB for -c 262144). This pull may have been generated with an AI assist, but the results, here at least, do not act like AI slop. I strongly suggest you take a look at how this works and the benefits it has. Hardware: 7900XT 20 GB VRAM and 7700 96 GB RAM, using b8609 Vulkan (ROCm gets a segmentation fault independent of this pull).
I agree, my response wasn't very appropriate - apologies. I took a second look at the implementation and the paper and I don't understand how this works. We still have to materialize at some point the entire K and V for the full sequence. So even if we store just the residuals, at compute time we again need the full memory for the materialized keys and values. So I am failing to see how this approach works. The observation that we can store the residuals and from them recompute the KV cache is trivial. It's almost exactly the same as saying "we can store the tokens and from them recompute the KV cache". The reason to have a KV cache in the first place is exactly to avoid this recomputation.
This works by being selective about what is evicted, and narrow in what is restored if needed. |
Adds optional bounded KV cache that doesn't permanently lose evicted tokens. Instead, it saves layer-0 embeddings to a host-memory ring buffer and recomputes K/V on demand when evicted positions are needed again.
The key insight is that the residual stream at layer 0 encodes everything needed to reconstruct K/V — so saving ~5 KB of residual per token lets you recover hundreds of KB of cached state. This is independently validated by arXiv:2603.19664 ("The Residual Stream Is All You Need"), which shows 100% token match at every budget level vs permanent-loss baselines like H2O, StreamingLLM, and SnapKV.
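The "~5 KB of residual vs hundreds of KB of cached state" claim can be checked with back-of-the-envelope arithmetic. The shapes below are illustrative (roughly a large GQA model), not taken from any specific model file; exact numbers depend on layer count, KV head configuration, and cache quantization.

```cpp
#include <cassert>

// Illustrative model shapes; actual values vary per architecture.
constexpr int bytes_per_elem = 2;    // fp16 storage
constexpr int n_embd         = 2560; // residual (hidden) width
constexpr int n_layer        = 64;   // transformer layers
constexpr int n_embd_kv      = 1024; // n_head_kv * head_dim, per layer

// One residual vector per token vs K and V at every layer:
constexpr int residual_per_token = n_embd * bytes_per_elem;            // 5120 bytes  (~5 KB)
constexpr int kv_per_token = 2 * n_layer * n_embd_kv * bytes_per_elem; // 262144 bytes (256 KB)
constexpr int savings_ratio = kv_per_token / residual_per_token;       // ~51x smaller to store
```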
What it does:
`--kv-budget-tokens N` or `--kv-budget-mb N` caps the cache at a fixed size.

API additions:
- `llama_kv_direct_evict(ctx)` — enforce budget, evict LRU positions
- `llama_kv_direct_recompute_misses(ctx)` — restore evicted positions from residual pool

I've been running this in production for a few weeks on Qwen3 models with a 3090 Ti. Memory stays flat during long multi-turn conversations while output quality matches unbounded cache.
Disclosure: AI tools (Claude) assisted with code development and the initial PR description.