Conversation

ServeurpersoCom (Collaborator) commented Dec 21, 2025

Make sure to read the contributing guidelines before submitting a PR

Track the total number of batches (n_tokens / n_batch) and increment a counter after each llama_decode() call. Progress chunks only appear when there are 2+ batches (which automatically happens with large prompts), and users can reduce -b/-ub for finer granularity if needed.
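
For orientation, a minimal sketch of that batch loop, assuming the public llama.cpp C API (llama_batch_get_one(), llama_decode()); the surrounding function and the send_progress() callback are hypothetical placeholders, error handling is omitted, and the real server streams estimated token counts rather than the exact ones computed here:

#include "llama.h"

#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Sketch only: decode the prompt batch by batch and report progress after each
// llama_decode() call. In the PR the emission goes through the existing
// prompt_progress mechanism; send_progress() here is just a stand-in callback.
static void decode_prompt_with_progress(
        llama_context * ctx,
        std::vector<llama_token> & prompt,
        int32_t n_batch,
        const std::function<void(int32_t processed, int32_t total)> & send_progress) {
    const int32_t n_tokens  = (int32_t) prompt.size();
    const int32_t n_batches = (n_tokens + n_batch - 1) / n_batch; // ceil(n_tokens / n_batch)

    int32_t n_processed = 0;
    int32_t n_decoded   = 0; // batches decoded so far

    for (int32_t i = 0; i < n_tokens; i += n_batch) {
        const int32_t n_eval = std::min(n_batch, n_tokens - i);

        // the whole chunk is decoded in one call; there is no token-by-token
        // granularity inside a batch
        llama_batch batch = llama_batch_get_one(prompt.data() + i, n_eval);
        if (llama_decode(ctx, batch) != 0) {
            return; // error handling omitted in this sketch
        }

        n_processed += n_eval;
        n_decoded   += 1;

        // only emit when the prompt spans 2+ batches; the final batch is skipped,
        // consistent with the 3 progress chunks for 4 batches in the test further down
        if (n_batches >= 2 && n_decoded < n_batches) {
            send_progress(n_processed, n_tokens);
        }
    }
}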

Setup (a 100% CPU model was added on the testing server for easier testing):

; llama-server --port 8082 --models-max 1 --models-preset backend.ini --webui-config-file frontend.json

[*]
fit = off                 ; Disable automatic memory fitting
ngl = 999                 ; Full GPU offload
ctk = q8_0                ; KV cache key quantization
ctv = q8_0                ; KV cache value quantization
fa = on                   ; Enable flash attention
mlock = on                ; Lock model in RAM
np = 4                    ; Parallel request batching
kvu = on                  ; Unified KV cache buffer
sleep-idle-seconds = 3600 ; Unload weights when the child process has been idle for 1 hour
b = 128                   ; Logical maximum batch size (default: 2048)
ub = 128                  ; Physical maximum batch size (default: 512)

; Testing prompt progress on CPU
[CPU-MoE-Qwen3-30B-A3B-Instruct-2507]
m = mradermacher/Qwen3-30B-A3B-Instruct-2507-i1-GGUF/Qwen3-30B-A3B-Instruct-2507.i1-Q6_K.gguf
ngl = 0                   ; No GPU offload
device = none             ; Disable GPU device
temp = 0.7
top-p = 0.8
top-k = 20
min-p = 0
c = 32768

...Other GPU or hybrid models...

[Dense-Devstral-Small-2-24B-Instruct-2512]
m = unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD-Q6_K_XL.gguf
; chat-template-file = unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512.jinja
c = 131072                ; Context size in tokens for this model
load-on-startup = 1       ; Load immediately on server startup
...

Backend testing command

# OpenAI test (big prompt to force slow preprocessing)
curl -N https://www.serveurperso.com/ia/webui/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "CPU-MoE-Qwen3-30B-A3B-Instruct-2507",
    "messages": [{"role": "user", "content": "'"$(python3 -c "print('Test '*500)")"'"}],
    "stream": true,
    "max_tokens": 10,
    "cache_prompt": false
  }'

# Anthropic test ("cache_prompt": false doesn't work!): no need anyway, the progress chunk is proprietary to the WebUI

Close #17079 (Feature Request: webui: add parsing progress)

ngxson (Collaborator) commented Dec 21, 2025

IMO hooking into eval_cb can be quite risky and messy if the backend scheduler works in an asynchronous way.

Also, technically speaking, the backend never processes tokens one by one. The whole batch of tokens is represented as a 2D matrix and processed all at once.

To get more frequent updates, simply lower the number of tokens per batch (controlled via the -b and -ub args).

ServeurpersoCom (Collaborator, Author) commented Dec 21, 2025

Also, technically speaking, the backend never processes tokens one by one. The whole batch of tokens is represented as a 2D matrix and processed all at once.

Right. Seen from this perspective, if I have to fake time-based interpolation on the backend, it's not even worth trying to make it smooth; it's better to just track progress per batch! I'll start over:

Track the total number of batches (n_tokens / n_batch) and increment a counter after each llama_decode() call. Progress chunks only appear when there are 2+ batches (which automatically happens with large prompts), and users can reduce -b/-ub for finer granularity if needed. Much cleaner approach, no core callbacks required.

ExtReMLapin (Contributor) commented:

Why not just use the streamed prompt_progress object?

ServeurpersoCom (Collaborator, Author) commented:

Why not just use the streamed prompt_progress object?

Yes, I already had this working with high-frequency emission (100 ms intervals). Now I'm reimplementing it at batch frequency as suggested by ngxson: a cleaner approach.
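
For reference, the chunk streamed at each batch boundary carries the same prompt_progress fields that show up in the test output further down (total, cache, processed, time_ms). A minimal sketch of assembling that object with nlohmann::json (the JSON library the server already uses); make_prompt_progress() is a hypothetical helper, not the PR's actual code:

#include <nlohmann/json.hpp>

#include <cstdint>

using json = nlohmann::ordered_json;

// Sketch only: field names taken from the prompt_progress object in the
// streamed chunks shown further down in this thread.
static json make_prompt_progress(int32_t n_total, int32_t n_cache, int32_t n_processed, int64_t t_ms) {
    return json {
        {"total",     n_total},     // prompt tokens in the request
        {"cache",     n_cache},     // tokens reused from the prompt cache
        {"processed", n_processed}, // tokens decoded so far (estimated per batch)
        {"time_ms",   t_ms},        // elapsed prompt-processing time
    };
}

// e.g. make_prompt_progress(509, 0, 127, 1901).dump() ->
// {"total":509,"cache":0,"processed":127,"time_ms":1901}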

ServeurpersoCom (Collaborator, Author) commented:

I track the total number of batches and increment a counter after each llama_decode() call, then stream the existing prompt_progress object at batch boundaries with estimated token counts. It only activates when there are 2+ batches, so large prompts automatically get progress updates.
Tested with b = 128 on a 509-token prompt (4 batches): got 3 progress chunks showing 127, 254 and 381 tokens processed:

(root|~/llama.cpp.pascal) curl -N https://www.serveurperso.com/ia/webui/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "CPU-MoE-Qwen3-30B-A3B-Instruct-2507",
    "messages": [{"role": "user", "content": "'"$(python3 -c "print('Test '*500)")"'"}],
    "stream": true,
    "max_tokens": 10,
    "cache_prompt": false
  }'

<- synthetic chunks:

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1766412609,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk","prompt_progress":{"total":509,"cache":0,"processed":127,"time_ms":1901}}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1766412611,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk","prompt_progress":{"total":509,"cache":0,"processed":254,"time_ms":3937}}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1766412613,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk","prompt_progress":{"total":509,"cache":0,"processed":381,"time_ms":6079}}

<- normal chunks:

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"It"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" looks"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" like"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" you"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"'ve"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" past"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"ed"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" a"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" long"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" sequence"}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":"length","index":0,"delta":{}}],"created":1766412616,"id":"chatcmpl-kVthPtCaMejJrLD7eqHIBxnxGxGiOouC","model":"CPU-MoE-Qwen3-30B-A3B-Instruct-2507","system_fingerprint":"b7519-dd1a8d4ad","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":509,"prompt_ms":8337.839,"prompt_per_token_ms":16.380823182711197,"prompt_per_second":61.04699311176433,"predicted_n":10,"predicted_ms":390.215,"predicted_per_token_ms":39.021499999999996,"predicted_per_second":25.626897992132545}}

data: [DONE]
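
Side note: a consumer only needs the prompt_progress object from each synthetic chunk above to drive a progress bar. A minimal, hypothetical C++ sketch, assuming the SSE framing and the "data: " prefix are already stripped, and ignoring the cache field since it is 0 in this test:

#include <nlohmann/json.hpp>

#include <cstdio>

int main() {
    // payload trimmed to the relevant field from one of the chunks above
    const char * payload = R"({"prompt_progress":{"total":509,"cache":0,"processed":254,"time_ms":3937}})";

    const auto chunk = nlohmann::json::parse(payload);
    if (chunk.contains("prompt_progress")) {
        const auto & p   = chunk.at("prompt_progress");
        const double pct = 100.0 * p.at("processed").get<double>() / p.at("total").get<double>();
        std::printf("prompt processing: %.0f%% (%d/%d tokens)\n",
                    pct, p.at("processed").get<int>(), p.at("total").get<int>());
    }
    return 0;
}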
