Skip to content

Misc. bug: regression: using cache-reuse slows down subsequent prompts (chat sessions) #17065

Description

@daitj

Name and Version

$./llama-cli --version
version: b6927 (6b9a524)
built with cc (Debian 15.2.0-7) 15.2.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-server --port 8383 --host 0.0.0.0 -m qwen3_30B-A3B_Q6_K.gguf --cache-reuse 256 --no-mmap --ctx-size 131072 -fa 1 -ctk f16 -ctv f16 -ts 7/15/16 --jinja --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
load_backend: loaded CUDA backend from libggml-cuda.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
  Device 0: AMD Instinct MI60 / MI50, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Instinct MI60 / MI50, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
load_backend: loaded ROCm backend from libggml-hip.so
load_backend: loaded RPC backend from libggml-rpc.so
load_backend: loaded CPU backend from libggml-cpu-haswell.so
main: setting n_parallel = 4 and kv_unified = true
build: 1 (6b9a524) with cc (Debian 15.2.0-7) 15.2.0 for x86_64-linux-gnu

Problem description & steps to reproduce

I know it is normal that as context grows in a single chat session, generation t/s becomes slower and slower.

After building the newer version, I noticed a weird change:
New chat sessions now start slow and get slower as the conversation continues.

What used to happen (before):
New chats always started fast (max speed), and only slowed down later as the conversation got longer.

The problem:
This new slowdown at the very beginning of a fresh chat didn’t happen in the old version.

I started tinkering around and found out that if set cache-reuse to 0 then it fixes the issue.

Broken builds are slow with --cache-reuse 256 but works find with --cache-reuse 0.

I bisected it and the result was commit cd5e3b5 (b6927), b6923 a version before b6927 seems to work.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions