UPSTREAM PR #18265: server: add real-time prompt preprocessing progress via synthetic SSE chunks by loci-dev · Pull Request #654 · auroralabs-loci/llama.cpp

loci-dev · 2025-12-21T20:36:40Z


Adds --prompt-progress-ms flag to stream synthetic progress chunks during prompt preprocessing for smooth UI progress bars.

A working GGML scheduler callback (cb_eval) already fires during graph execution, giving me precise timing for throttling SSE emission at a configurable interval (in ms). Now I need accurate token-level progress tracking for the processed field.
Two approaches are possible:

Approach 1: Time-based estimation (self-adaptive)
Track last_progress_ms and last_processed to calculate a dynamic rate. Each chunk updates the rate based on total elapsed time divided by total processed tokens so far. This self-corrects over time and works without any core changes.
Problem: This relies on the assumption that processing speed is constant. If the GPU changes speed mid-prompt (thermal throttling, other processes, batch size variations), the estimation lags behind reality. Not acceptable for production quality.
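A minimal sketch of Approach 1, written in Python for brevity rather than as server code. The class name `ProgressEstimator` and its methods are hypothetical; only `last_progress_ms` and `last_processed` come from the description above. The rate here is stored as tokens per millisecond (the inverse of time per token), which is an assumption about how the extrapolation would be done.

```python
import time


class ProgressEstimator:
    """Estimate processed prompt tokens between real measurements.

    Hypothetical helper, not part of llama.cpp. It keeps the average rate
    observed so far and extrapolates from the last known token count.
    """

    def __init__(self, total_tokens, now_ms=None):
        self.total_tokens = total_tokens
        self.start_ms = self._now(now_ms)
        self.last_progress_ms = self.start_ms
        self.last_processed = 0
        self.rate = 0.0  # tokens per millisecond; 0 until the first update

    @staticmethod
    def _now(now_ms):
        # Allow an injected timestamp so the logic can be tested without sleeping.
        return time.monotonic() * 1000.0 if now_ms is None else now_ms

    def update(self, processed, now_ms=None):
        """Record a real measurement, e.g. after a batch completes."""
        now = self._now(now_ms)
        elapsed = now - self.start_ms
        if elapsed > 0:
            # Average over the whole run: total tokens / total elapsed time.
            self.rate = processed / elapsed
        self.last_progress_ms = now
        self.last_processed = processed

    def estimate(self, now_ms=None):
        """Extrapolate the processed count at now_ms, clamped to the total."""
        now = self._now(now_ms)
        guess = self.last_processed + self.rate * (now - self.last_progress_ms)
        return min(self.total_tokens, int(guess))
```

Because the rate is an average over the whole run, a slowdown after the last `update()` makes `estimate()` overshoot until the next real measurement arrives, which is the lag described in the problem statement.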

Approach 2: Token-level callback in llama.cpp core
Add a proper callback mechanism in llama_context::decode() that reports actual token progress during the decoding loop. This would give real-time accuracy regardless of speed variations.
Currently exploring this approach: implementation in progress.
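The idea behind Approach 2 can be sketched as follows, again in Python and purely as an illustration. `decode_with_progress`, `on_progress`, and `process_batch` are hypothetical names; the loop is a stand-in for a batched decode loop and does not model `llama_context::decode()` or any existing llama.cpp API. The callback receives the real processed count and is throttled to the configured interval.

```python
import time


def decode_with_progress(n_tokens, batch_size, on_progress, interval_ms=100,
                         clock=time.monotonic, process_batch=None):
    """Process a prompt in batches and report actual token counts.

    on_progress(processed, total) is called at most once per interval_ms,
    plus once when the last batch finishes so the final count is reported.
    """
    last_emit = clock()
    processed = 0
    while processed < n_tokens:
        step = min(batch_size, n_tokens - processed)
        if process_batch is not None:
            process_batch(processed, step)  # the real work would happen here
        processed += step
        now = clock()
        # Emit when the throttle interval has elapsed, or on completion.
        if processed == n_tokens or (now - last_emit) * 1000.0 >= interval_ms:
            on_progress(processed, n_tokens)
            last_emit = now
```

Since the reported value is a count rather than an extrapolation, a change in processing speed only affects how often the callback fires, not the accuracy of what it reports.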

Setup (a CPU-only model added to a test server for easier testing):

; llama-server --port 8082 --models-max 1 --models-preset backend.ini --webui-config-file frontend.json

[*]
fit = off                 ; Disable automatic memory fitting
ngl = 999                 ; Full GPU offload
ctk = q8_0                ; KV cache key quantization
ctv = q8_0                ; KV cache value quantization
fa = on                   ; Enable flash attention
mlock = on                ; Lock model in RAM
np = 4                    ; Parallel request batching
kvu = on                  ; Unified KV cache buffer
sleep-idle-seconds = 3600 ; Unload weights on child process
prompt-progress-ms = 100  ; Emit progress chunks during prompt preprocessing

; Testing prompt-progress-ms
[CPU-MoE-Qwen3-30B-A3B-Instruct-2507]
m = mradermacher/Qwen3-30B-A3B-Instruct-2507-i1-GGUF/Qwen3-30B-A3B-Instruct-2507.i1-Q6_K.gguf
ngl = 0                   ; No GPU offload
device = none             ; Disable GPU device
temp = 0.7
top-p = 0.8
top-k = 20
min-p = 0
c = 32768

...Other GPU or hybrid models...

[Dense-Devstral-Small-2-24B-Instruct-2512]
m = unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD-Q6_K_XL.gguf
; chat-template-file = unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512.jinja
c = 131072                ; Context size in tokens for this model
load-on-startup = 1       ; Load immediately on server startup
...

Backend testing command

# OpenAI test (big prompt to force slow preprocessing)
curl -N https://www.serveurperso.com/ia/webui/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "CPU-MoE-Qwen3-30B-A3B-Instruct-2507",
    "messages": [{"role": "user", "content": "'"$(python3 -c "print('Test '*500)")"'"}],
    "stream": true,
    "max_tokens": 10,
    "cache_prompt": false
  }'

# Anthropic test ("cache_prompt": false doesn't work!) OK, not needed: this is a proprietary chunk for the WebUI
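
To inspect the stream programmatically instead of reading raw curl output, a small parser along these lines could be used. This is a sketch: `iter_sse_json` is a hypothetical helper, and because the layout of the synthetic progress chunks is not specified here, it only yields the decoded JSON of each `data:` line and assumes no field names. It relies on the OpenAI-style `[DONE]` sentinel to end the stream.

```python
import json


def iter_sse_json(lines):
    """Yield the JSON payload of each `data:` line in an SSE stream.

    Accepts str or bytes lines. Blank separators and comment lines
    (starting with ':') are skipped; iteration stops at `[DONE]`.
    """
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.rstrip("\r\n")
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)
```

Any iterable of lines works as input, for example the response object returned by `urllib.request.urlopen()`, which iterates over lines as bytes.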

Close #17079

server: add --prompt-progress-ms for throttled SSE emission (base)

19e74d5

loci-dev had a problem deploying to PROD__AL_DEMO with GitHub Actions on December 21, 2025 20:36 (Failure)

loci-dev force-pushed the main branch 5 times, most recently from 26a6f0f to cf53bc9, on December 22, 2025 14:09

DajanaV closed this Dec 22, 2025

DajanaV deleted the upstream-PR18265-branch_ServeurpersoCom-pascal/prompt-processing-progress branch December 22, 2025 14:36

loci-dev wants to merge 1 commit into main from upstream-PR18265-branch_ServeurpersoCom-pascal/prompt-processing-progress
