UPSTREAM PR #18265: server: add real-time prompt preprocessing progress via synthetic SSE chunks #654

Closed
loci-dev wants to merge 1 commit into main from upstream-PR18265-branch_ServeurpersoCom-pascal/prompt-processing-progress

@loci-dev

Mirrored from ggml-org/llama.cpp#18265

Adds a --prompt-progress-ms flag that streams synthetic progress chunks during prompt preprocessing, so a UI can render a smooth progress bar.

I have a working GGML scheduler callback (cb_eval) that fires during graph execution, which gives me precise timing for throttling SSE chunks to a configurable interval of X ms. What I still need is accurate token-level progress tracking for the processed field.
Two approaches are possible:

Approach 1: Time-based estimation (self-adaptive)
Track last_progress_ms and last_processed to calculate a dynamic rate. Each chunk updates the rate based on total elapsed time divided by total processed tokens so far. This self-corrects over time and works without any core changes.
Problem: this assumes that processing speed is constant. If the GPU changes speed mid-prompt (thermal throttling, other processes, batch size variations), the estimate lags behind reality.
Not acceptable for production quality.
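
To make the trade-off concrete, here is a minimal sketch of the time-based estimator described in Approach 1. It is an illustration only, not the code from this PR: the class and method names are hypothetical, and only last_progress_ms and last_processed come from the description above. The rate is computed as total processed tokens over total elapsed time, then extrapolated from the last real measurement.

```python
class TimeBasedProgress:
    """Hypothetical helper: estimate processed tokens between real measurements."""

    def __init__(self, total_tokens: int, start_ms: float) -> None:
        self.total_tokens = total_tokens
        self.start_ms = start_ms
        self.last_progress_ms = start_ms
        self.last_processed = 0

    def observe(self, now_ms: float, processed: int) -> None:
        """Record a real measurement, for example when a batch finishes."""
        self.last_progress_ms = now_ms
        self.last_processed = processed

    def rate(self) -> float:
        """Tokens per millisecond, averaged over everything processed so far."""
        elapsed = self.last_progress_ms - self.start_ms
        return self.last_processed / elapsed if elapsed > 0 else 0.0

    def estimate(self, now_ms: float) -> int:
        """Extrapolate from the last measurement, never past the total."""
        guess = self.last_processed + self.rate() * (now_ms - self.last_progress_ms)
        return min(self.total_tokens, int(guess))


est = TimeBasedProgress(total_tokens=2000, start_ms=0.0)
est.observe(now_ms=1024.0, processed=512)  # first batch finished: 0.5 tokens/ms
print(est.estimate(1536.0))                # 512 ms later -> 768
```

The extrapolation is only as good as the averaged rate: if the second batch runs at half the speed of the first, estimate() keeps advancing at the old rate until the next observe() call corrects it, which is the lag described above.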

Approach 2: Token-level callback in llama.cpp core
Add a proper callback mechanism in llama_context::decode() that reports actual token progress during the decoding loop. This would give real-time accuracy regardless of speed variations.
Currently exploring this approach: implementation in progress.
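
The shape of Approach 2 can be sketched as follows. This is a language-neutral illustration written in Python, not the llama.cpp API: decode_prompt, make_throttled, the (processed, total) callback signature, and n_ubatch as a parameter name are all assumptions made for the example. The point it shows is that the decode loop reports the real token count after each micro-batch, while a separate throttle decides how often a chunk is actually emitted.

```python
import time
from typing import Callable, Optional, Sequence

# Hypothetical callback signature: (processed_tokens, total_tokens).
ProgressFn = Callable[[int, int], None]


def make_throttled(emit: ProgressFn, interval_ms: float,
                   clock: Callable[[], float] = time.monotonic) -> ProgressFn:
    """Wrap emit so it fires at most once per interval_ms, plus once on completion."""
    last_emit_ms: Optional[float] = None

    def on_progress(processed: int, total: int) -> None:
        nonlocal last_emit_ms
        now_ms = clock() * 1000.0
        due = last_emit_ms is None or now_ms - last_emit_ms >= interval_ms
        if due or processed >= total:
            last_emit_ms = now_ms
            emit(processed, total)

    return on_progress


def decode_prompt(tokens: Sequence[int], n_ubatch: int, on_progress: ProgressFn) -> None:
    """Stand-in for a decode loop: report the real count after each micro-batch."""
    total = len(tokens)
    for start in range(0, total, n_ubatch):
        chunk = tokens[start:start + n_ubatch]
        # ... the actual compute for `chunk` would run here ...
        on_progress(start + len(chunk), total)


events = []
decode_prompt(list(range(10)), 4, make_throttled(lambda p, t: events.append((p, t)), 0.0))
print(events)  # interval 0 disables throttling: [(4, 10), (8, 10), (10, 10)]
```

Because the reported count comes from the loop itself rather than from a clock, a slowdown mid-prompt only delays the next report; it never makes the reported value wrong.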

Setup (a CPU-only model was added on a test server for easier testing):

; llama-server --port 8082 --models-max 1 --models-preset backend.ini --webui-config-file frontend.json

[*]
fit = off                 ; Disable automatic memory fitting
ngl = 999                 ; Full GPU offload
ctk = q8_0                ; KV cache key quantization
ctv = q8_0                ; KV cache value quantization
fa = on                   ; Enable flash attention
mlock = on                ; Lock model in RAM
np = 4                    ; Parallel request batching
kvu = on                  ; Unified KV cache buffer
sleep-idle-seconds = 3600 ; Unload weights on child process
prompt-progress-ms = 100  ; Emit progress chunks during prompt preprocessing

; Testing prompt-progress-ms
[CPU-MoE-Qwen3-30B-A3B-Instruct-2507]
m = mradermacher/Qwen3-30B-A3B-Instruct-2507-i1-GGUF/Qwen3-30B-A3B-Instruct-2507.i1-Q6_K.gguf
ngl = 0                   ; No GPU offload
device = none             ; Disable GPU device
temp = 0.7
top-p = 0.8
top-k = 20
min-p = 0
c = 32768

...Other GPU or hybrid models...

[Dense-Devstral-Small-2-24B-Instruct-2512]
m = unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD-Q6_K_XL.gguf
; chat-template-file = unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512.jinja
c = 131072                ; Context size in tokens for this model
load-on-startup = 1       ; Load immediately on server startup
...

Backend testing command

# OpenAI test (big prompt to force slow preprocessing)
curl -N https://www.serveurperso.com/ia/webui/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "CPU-MoE-Qwen3-30B-A3B-Instruct-2507",
    "messages": [{"role": "user", "content": "'"$(python3 -c "print('Test '*500)")"'"}],
    "stream": true,
    "max_tokens": 10,
    "cache_prompt": false
  }'
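
On the client side, the stream returned by the command above could be split into progress updates and normal text deltas roughly like this. It is a sketch under assumptions: the description does not show the chunk layout, so the prompt_progress object and its total key are guesses (only the processed field is mentioned above), and the helper names are hypothetical. The data: framing, the choices/delta/content path, and the [DONE] sentinel follow the usual OpenAI-style streaming format.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_sse_json(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each `data:` line until the [DONE] sentinel."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)


def split_progress(chunks: Iterable[dict]) -> Iterator[Tuple[str, object]]:
    """Tag each chunk as ("progress", fraction) or ("delta", text)."""
    for chunk in chunks:
        progress = chunk.get("prompt_progress")  # ASSUMED field name
        if progress is not None:
            total = progress.get("total") or 0   # ASSUMED key
            yield ("progress", progress["processed"] / total if total else 0.0)
            continue
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield ("delta", text)


# Hand-written sample lines, not captured server output.
sample = [
    'data: {"choices": [{"delta": {}}], "prompt_progress": {"processed": 128, "total": 512}}',
    '',
    'data: {"choices": [{"delta": {"content": "Hi"}}]}',
    '',
    'data: [DONE]',
]
print(list(split_progress(iter_sse_json(sample))))  # [('progress', 0.25), ('delta', 'Hi')]
```

A client that does not know about the progress chunks would simply see deltas without content and ignore them, which is what makes a synthetic chunk a low-risk extension.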

# Anthropic test: "cache_prompt": false does not work there. That is fine and not needed, since the progress chunk is a proprietary extension for the WebUI.

Close #17079

@loci-dev loci-dev force-pushed the main branch 5 times, most recently from 26a6f0f to cf53bc9 Compare December 22, 2025 14:09
@DajanaV DajanaV closed this Dec 22, 2025
@DajanaV DajanaV deleted the upstream-PR18265-branch_ServeurpersoCom-pascal/prompt-processing-progress branch December 22, 2025 14:36

3 participants