Conversation

ServeurpersoCom (Collaborator) commented Dec 21, 2025

Make sure to read the contributing guidelines before submitting a PR

Track the total number of batches (n_tokens / n_batch) and increment a counter after each llama_decode() call. Progress chunks only appear when there are 2+ batches (which automatically happens with large prompts), and users can reduce -b/-ub for finer granularity if needed.
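
For orientation, a minimal sketch of that batch loop, assuming the public llama.cpp C API (llama_batch_get_one(), llama_decode()); the surrounding function and the send_progress() callback are hypothetical placeholders, error handling is omitted, and the real server streams estimated token counts rather than the exact ones computed here:

#include "llama.h"

#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Sketch only: decode the prompt batch by batch and report progress after each
// llama_decode() call. In the PR the emission goes through the existing
// prompt_progress mechanism; send_progress() here is just a stand-in callback.
static void decode_prompt_with_progress(
        llama_context * ctx,
        std::vector<llama_token> & prompt,
        int32_t n_batch,
        const std::function<void(int32_t processed, int32_t total)> & send_progress) {
    const int32_t n_tokens  = (int32_t) prompt.size();
    const int32_t n_batches = (n_tokens + n_batch - 1) / n_batch; // ceil(n_tokens / n_batch)

    int32_t n_processed = 0;
    int32_t n_decoded   = 0; // batches decoded so far

    for (int32_t i = 0; i < n_tokens; i += n_batch) {
        const int32_t n_eval = std::min(n_batch, n_tokens - i);

        // the whole chunk is decoded in one call; there is no token-by-token
        // granularity inside a batch
        llama_batch batch = llama_batch_get_one(prompt.data() + i, n_eval);
        if (llama_decode(ctx, batch) != 0) {
            return; // error handling omitted in this sketch
        }

        n_processed += n_eval;
        n_decoded   += 1;

        // only emit when the prompt spans 2+ batches; the final batch is skipped,
        // consistent with the 3 progress chunks for 4 batches in the test further down
        if (n_batches >= 2 && n_decoded < n_batches) {
            send_progress(n_processed, n_tokens);
        }
    }
}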

Setup (a 100% CPU model was added on the testing server for easier testing):

; llama-server --port 8082 --models-max 1 --models-preset backend.ini --webui-config-file frontend.json

[*]
fit = off                 ; Disable automatic memory fitting
ngl = 999                 ; Full GPU offload
ctk = q8_0                ; KV cache key quantization
ctv = q8_0                ; KV cache value quantization
fa = on                   ; Enable flash attention
mlock = on                ; Lock model in RAM
np = 4                    ; Parallel request batching
kvu = on                  ; Unified KV cache buffer
sleep-idle-seconds = 3600 ; Unload weights when the child process has been idle for 1 hour
b = 128                   ; Logical maximum batch size (default: 2048)
ub = 128                  ; Physical maximum batch size (default: 512)

; Testing prompt progress on CPU
[CPU-MoE-Qwen3-30B-A3B-Instruct-2507]
m = mradermacher/Qwen3-30B-A3B-Instruct-2507-i1-GGUF/Qwen3-30B-A3B-Instruct-2507.i1-Q6_K.gguf
ngl = 0                   ; No GPU offload
device = none             ; Disable GPU device
temp = 0.7
top-p = 0.8
top-k = 20
min-p = 0
c = 32768

...Other GPU or hybrid models...

[Dense-Devstral-Small-2-24B-Instruct-2512]
m = unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD-Q6_K_XL.gguf
; chat-template-file = unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512.jinja
c = 131072                ; Context size in tokens for this model
load-on-startup = 1       ; Load immediately on server startup
...

Backend testing command

# OpenAI test (big prompt to force slow preprocessing)
curl -N https://www.serveurperso.com/ia/webui/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "CPU-MoE-Qwen3-30B-A3B-Instruct-2507",
    "messages": [{"role": "user", "content": "'"$(python3 -c "print('Test '*500)")"'"}],
    "stream": true,
    "max_tokens": 10,
    "cache_prompt": false
  }'

# Anthropic test ("cache_prompt": false doesn't work!): no need anyway, the progress chunk is proprietary to the WebUI

Close #17079 (Feature Request: webui: add parsing progress)

ngxson (Collaborator) commented Dec 21, 2025

IMO hooking into eval_cb can be quite risky and messy if the backend scheduler works in an asynchronous way.

Also, technically speaking, the backend never processes tokens one by one. The whole batch of tokens is represented as a 2D matrix and processed all at once.

To get more frequent updates, simply lower the number of tokens per batch (controlled via the -b and -ub args).

ServeurpersoCom (Collaborator, Author) commented Dec 21, 2025

Also, technically speaking, the backend never processes tokens one by one. The whole batch of tokens is represented as a 2D matrix and processed all at once.

Right. Seen from this perspective, if I have to fake time-based interpolation on the backend, it's not even worth trying to make it smooth; it's better to just track progress per batch! I'll start over:

Track the total number of batches (n_tokens / n_batch) and increment a counter after each llama_decode() call. Progress chunks only appear when there are 2+ batches (which automatically happens with large prompts), and users can reduce -b/-ub for finer granularity if needed. Much cleaner approach, no core callbacks required.

ExtReMLapin (Contributor) commented:

Why not just use the streamed prompt_progress object?

ServeurpersoCom (Collaborator, Author) commented:

Why not just use the streamed prompt_progress object?

Yes, I already had this working with high-frequency emission (100 ms intervals). Now I'm reimplementing it at batch frequency as suggested by ngxson: a cleaner approach.
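
For reference, the chunk streamed at each batch boundary carries the same prompt_progress fields that show up in the test output further down (total, cache, processed, time_ms). A minimal sketch of assembling that object with nlohmann::json (the JSON library the server already uses); make_prompt_progress() is a hypothetical helper, not the PR's actual code:

#include <nlohmann/json.hpp>

#include <cstdint>

using json = nlohmann::ordered_json;

// Sketch only: field names taken from the prompt_progress object in the
// streamed chunks shown further down in this thread.
static json make_prompt_progress(int32_t n_total, int32_t n_cache, int32_t n_processed, int64_t t_ms) {
    return json {
        {"total",     n_total},     // prompt tokens in the request
        {"cache",     n_cache},     // tokens reused from the prompt cache
        {"processed", n_processed}, // tokens decoded so far (estimated per batch)
        {"time_ms",   t_ms},        // elapsed prompt-processing time
    };
}

// e.g. make_prompt_progress(509, 0, 127, 1901).dump() ->
// {"total":509,"cache":0,"processed":127,"time_ms":1901}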

ServeurpersoCom (Collaborator, Author) commented:

I track the total number of batches and increment a counter after each llama_decode() call, then stream the existing prompt_progress object at batch boundaries with estimated token counts. It only activates when there are 2+ batches, so large prompts automatically get progress updates.
Tested with b = 128 on a 509-token prompt (4 batches): got 3 progress chunks showing 127, 254 and 381 tokens processed:

(root|~/llama.cpp.pascal) curl -N https://www.serveurperso.com/ia/webui/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "CPU-MoE-Qwen3-30B-A3B-Instruct-2507",
    "messages": [{"role": "user", "content": "'"$(python3 -c "print('Test '*500)")"'"}],
    "stream": true,
    "max_tokens": 10,
    "cache_prompt": false
  }'

<- synthetic chunks:

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1766412609,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk","prompt_progress":{"total":509,"cache":0,"processed":127,"time_ms":1901}}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1766412611,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk","prompt_progress":{"total":509,"cache":0,"processed":254,"time_ms":3937}}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1766412613,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk","prompt_progress":{"total":509,"cache":0,"processed":381,"time_ms":6079}}

<- normal chunks:

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"It"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" looks"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" like"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" you"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"'ve"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" past"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"ed"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" a"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" long"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" sequence"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":"length","index":0,"delta":{}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":509,"prompt_ms":8337.839,"prompt_per_token_ms":16.380823182711197,"prompt_per_second":61.04699311176433,"predicted_n":10,"predicted_ms":390.215,"predicted_per_token_ms":39.021499999999996,"predicted_per_second":25.626897992132545}}

data: [DONE]
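
Side note: a consumer only needs the prompt_progress object from each synthetic chunk above to drive a progress bar. A minimal, hypothetical C++ sketch, assuming the SSE framing and the "data: " prefix are already stripped, and ignoring the cache field since it is 0 in this test:

#include <nlohmann/json.hpp>

#include <cstdio>

int main() {
    // payload trimmed to the relevant field from one of the chunks above
    const char * payload = R"({"prompt_progress":{"total":509,"cache":0,"processed":254,"time_ms":3937}})";

    const auto chunk = nlohmann::json::parse(payload);
    if (chunk.contains("prompt_progress")) {
        const auto & p   = chunk.at("prompt_progress");
        const double pct = 100.0 * p.at("processed").get<double>() / p.at("total").get<double>();
        std::printf("prompt processing: %.0f%% (%d/%d tokens)\n",
                    pct, p.at("processed").get<int>(), p.at("total").get<int>());
    }
    return 0;
}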
