server: add real-time prompt preprocessing progress via synthetic SSE chunks #18265
base: master
Conversation
IMO hooking into core callbacks is not needed here. Also, technically speaking, the backend never processes tokens one by one: the whole batch of tokens is represented as a 2D matrix and processed all at once. To get more frequent updates, simply lower the number of tokens per batch (controlled via -b/-ub).
Right. Seen from this perspective, if I'm faking time-based interpolation on the backend, it's not even worth trying to make it smooth; it's better to just have progress tracking per batch. I'll start again: track the total number of batches (n_tokens / n_batch) and increment after each llama_decode() call. Progress chunks will only appear when there are 2+ batches (which automatically happens with large prompts), and users can reduce -b/-ub for finer granularity if needed. Much cleaner approach, no core callbacks required.
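A minimal C++ sketch of this batch-boundary idea, assuming a placeholder decode_batch() standing in for llama_decode(); the function and variable names here are illustrative, not the actual server code in this PR:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// placeholder for a llama_decode() call on one batch of prompt tokens
static void decode_batch(const std::vector<int32_t> & tokens, size_t begin, size_t count) {
    (void) tokens; (void) begin; (void) count;
}

static void process_prompt(const std::vector<int32_t> & prompt, size_t n_batch) {
    const size_t n_tokens  = prompt.size();
    const size_t n_batches = (n_tokens + n_batch - 1) / n_batch; // ceil(n_tokens / n_batch)

    for (size_t i = 0; i < n_batches; ++i) {
        const size_t begin = i * n_batch;
        const size_t count = std::min(n_batch, n_tokens - begin);

        decode_batch(prompt, begin, count); // one llama_decode() call per batch

        // only emit progress when the prompt spans 2+ batches,
        // matching the behaviour described above
        if (n_batches > 1) {
            const size_t processed = begin + count; // estimated tokens processed so far
            std::printf("prompt progress: %zu/%zu tokens (batch %zu/%zu)\n",
                        processed, n_tokens, i + 1, n_batches);
        }
    }
}
```

Lowering n_batch (via -b/-ub) increases n_batches for the same prompt, which is what gives the finer-grained updates mentioned above.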
Why not just use streamed …?
Yes, I already had this working with high-frequency emission (100 ms intervals). I'm now reimplementing it at batch frequency as suggested by ngxson, which is a cleaner approach.
I track the total number of batches and increment a counter after each llama_decode() call. The existing prompt_progress object is streamed at batch boundaries with estimated token counts. It only activates when there are 2+ batches, so large prompts automatically get progress updates.
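For illustration, a synthetic SSE chunk at a batch boundary could be formatted roughly like the hypothetical helper below; the JSON field names (processed, total) inside prompt_progress are assumptions for this sketch, not necessarily the exact shape of the server's object:

```cpp
#include <cstdio>
#include <string>

// builds one SSE data line carrying prompt-processing progress (illustrative only)
static std::string make_progress_sse_chunk(size_t n_processed, size_t n_total) {
    char buf[256];
    std::snprintf(buf, sizeof(buf),
        "data: {\"prompt_progress\":{\"processed\":%zu,\"total\":%zu}}\n\n",
        n_processed, n_total);
    return std::string(buf);
}
```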
Make sure to read the contributing guidelines before submitting a PR
Track total batches (n_tokens / n_batch) and increment after each llama_decode() call. Progress chunks will only appear when there are 2+ batches (automatically happens with large prompts), and users can reduce -b/-ub for finer granularity if needed.
Setup (a 100% CPU model added on a testing server for easier testing):
Backend testing command
Closes #17079