Re-architect FCFS Priorities and Bypass Completion Bucket Discrimination #2034

Open

glaziermag wants to merge 6 commits into EricLBuehler:master from
Conversation
glaziermag added a commit to glaziermag/mistral.rs that referenced this pull request on Mar 28, 2026
… abandoning length-based Prompt bucketing during Decoding
glaziermag added a commit to glaziermag/mistral.rs that referenced this pull request on Mar 28, 2026
…nd Infinite Thread Spin
emanuele-divizio-quixant added a commit to quixantplc/mistral.rs that referenced this pull request on Mar 29, 2026
emanuele-divizio-quixant added a commit to quixantplc/mistral.rs that referenced this pull request on Mar 30, 2026
…y (from upstream PRs EricLBuehler#2031/EricLBuehler#2034)" This reverts commit 53da949.
emanuele-divizio-quixant added a commit to quixantplc/mistral.rs that referenced this pull request on Mar 30, 2026
…ty (from upstream PRs EricLBuehler#2031/EricLBuehler#2034)" This reverts commit 4713992.
emanueleDiVizio added a commit to emanueleDiVizio/mistral.rs that referenced this pull request on Apr 2, 2026
…duler Reapply upstream fixes from PRs EricLBuehler#2031/EricLBuehler#2034: fix quadratic scheduling complexity when sequences are waiting, and add FCFS priority ordering to prevent starvation.
The PR introduced a 16384-token / 10-sequence hard limit on the prefill
batch with the comment 'Halt batch mapping if context size approaches
CuBLAS engine crash arrays.' This cap is:
- Not configurable — silently throttles 32k+ context workloads
- Hardcoded to a CUDA-specific limit that doesn't apply on Metal/CPU
- Redundant — max_num_seqs and KV block allocator already enforce
capacity and will break out of the loop correctly
Remove the cap and its tracking variables. Capacity control remains
through the existing max_num_seqs check and the KV allocator's
allocate_slots failure path.
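The capacity-control path the commit message describes can be sketched roughly as follows. This is an illustrative model, not the actual mistral.rs code: `BlockAllocator`, its `allocate_slots` signature, and the `schedule` helper are simplified stand-ins for the real scheduler and KV block allocator.

```rust
// Illustrative sketch of allocator-driven backpressure; the names mirror
// the commit message but this is not the actual mistral.rs API.
struct BlockAllocator {
    free_blocks: usize,
}

impl BlockAllocator {
    // Returns Some(handle) when `needed` KV blocks fit, None otherwise.
    fn allocate_slots(&mut self, needed: usize) -> Option<usize> {
        if needed <= self.free_blocks {
            self.free_blocks -= needed;
            Some(self.free_blocks) // placeholder handle
        } else {
            None // backpressure: the scheduler stops admitting sequences
        }
    }
}

// Admit waiting prompts until max_num_seqs or KV capacity stops us; no
// fixed token cap is needed on top of these two checks.
fn schedule(waiting: &[usize], max_num_seqs: usize, alloc: &mut BlockAllocator) -> usize {
    let mut scheduled = 0;
    for &blocks_needed in waiting {
        if scheduled >= max_num_seqs {
            break; // existing sequence-count limit
        }
        match alloc.allocate_slots(blocks_needed) {
            Some(_) => scheduled += 1,
            None => break, // out of KV blocks: the correct stop condition
        }
    }
    scheduled
}

fn main() {
    let mut alloc = BlockAllocator { free_blocks: 10 };
    // Three prompts needing 4 blocks each: only two fit in 10 free blocks.
    let n = schedule(&[4, 4, 4], 16, &mut alloc);
    println!("scheduled {n} sequences");
}
```

In this model a long prompt is admitted whenever enough blocks are actually free, so no fixed token threshold ever blocks long-context workloads.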
Contributor (Author)

Update (2026-04-15): Removed an undocumented hardcoded batch cap that was introduced in this PR. The original commit added:

```rust
if (batched_prompt_tokens + num_new_tokens > 16384 || batched_sequences >= 10)
    && batched_sequences > 0
{
    break;
}
```

This cap was neither mentioned in the PR description nor the issue, and has significant problems: it is not configurable (silently throttling 32k+ context workloads), it hardcodes a CUDA-specific limit that does not apply on Metal/CPU, and it is redundant with the existing `max_num_seqs` check and KV block allocator.
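The effect of the removed guard can be shown directly. The sketch below wraps the quoted condition in a hypothetical helper, `old_cap_would_break`; the constants and condition come from the quoted commit, while the function and its surrounding loop are illustrative.

```rust
// Sketch of the removed guard, using the exact constants and condition
// quoted above; the surrounding scheduling loop is elided.
fn old_cap_would_break(
    batched_prompt_tokens: usize,
    num_new_tokens: usize,
    batched_sequences: usize,
) -> bool {
    (batched_prompt_tokens + num_new_tokens > 16384 || batched_sequences >= 10)
        && batched_sequences > 0
}

fn main() {
    // A 20k-token prompt is already batched; the next prompt trips the
    // token cap regardless of how much VRAM is actually free.
    assert!(old_cap_would_break(20_000, 128, 1));
    // With nothing batched yet (batched_sequences == 0) the guard never
    // fires, so a single oversized prompt is still admitted.
    assert!(!old_cap_would_break(0, 32_768, 0));
    println!("cap check demonstrated");
}
```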
Code Metrics Report

```
 Language                Files    Lines     Code  Comments  Blanks
──────────────────────────────────────────────────────────────────
 C Header                    5      305      210        52      43
 CSS                         2     1181     1036        34     111
 CUDA                       59    17706    13869      1637    2200
 Dockerfile                  1       39       22         8       9
 HTML                        2      235      197        14      24
 JavaScript                 16     3580     2702       486     392
 Jinja2                      7      694      656         5      33
 JSON                       21      409      406         0       3
 Makefile                    1        6        5         0       1
 Metal Shading Language     31    11647     9007      1064    1576
 PowerShell                  1      300      227        30      43
 Python                    125     8316     6808       412    1096
 Shell                       2      485      329        95      61
 Plain Text                  3     3723        0      2413    1310
 TOML                       27     1290     1124        35     131
 YAML                        3       25       23         2       0
──────────────────────────────────────────────────────────────────
 Jupyter Notebooks           3      122       83        23      16
 |- Markdown                 1       60       30        22       8
 |- Python                   1      122      113         1       8
 (Total)                             304      226        46      32
──────────────────────────────────────────────────────────────────
 Markdown                  105    11197        0      8067    3130
 |- BASH                    72      934      691       149      94
 |- Dockerfile               1        1        1         0       0
 |- JSON                    20      719      719         0       0
 |- PowerShell               3        3        3         0       0
 |- Python                  23     1038      862        60     116
 |- Rust                    51     2048     1718        54     276
 |- TOML                     6      207      164         0      43
 |- YAML                     2        9        8         1       0
 (Total)                          16156     4166      8331    3659
──────────────────────────────────────────────────────────────────
 Rust                      547   236072   207590      6565   21917
 |- Markdown               361     8962      452      7385    1125
 (Total)                         245034   208042     13950   23042
──────────────────────────────────────────────────────────────────
 Total                     961   311435   249055     28614   33766
```
…tch cap removal The variable was only consumed by the removed 16384-token cap check. The prefix cache lookup and KV allocation below use num_tokens, not num_new_tokens. Remove both the binding and the incorrect suppression comment to eliminate the clippy dead-variable warning.
Contributor (Author)

Second update (2026-04-15): Removed the dead-variable warning left behind by the batch cap removal.
Closes #2033.

Two independent fixes

1. Fix FCFS priority ordering in paged-attention scheduler

The `sort_running_by_priority_fcfs()` function sorted the running deque by timestamp ascending and then reversed it, so the loop's `pop_back()` selected the oldest sequence (lowest timestamp) as the preemption candidate, the opposite of FCFS. Old sequences were being consistently preempted in favor of newer ones.

Fix: Remove the `.reverse()` call so `pop_back()` correctly reaches the newest (lowest-priority under FCFS) sequence for preemption.

2. Remove undocumented hardcoded batch cap

A non-configurable cap of 16,384 tokens and 10 sequences per scheduling pass existed in the `schedule()` prompt loop with no documentation, no config knob, and no relation to the VRAM capacity guard that already exists via the KV block allocator. On long-context workloads (prompts over 16k tokens), this cap silently stopped scheduling new sequences even when VRAM was available, causing the waiting queue to grow without bound.

Fix: Remove the cap. The KV allocator's `allocate_slots()` returning `None` is the correct and existing mechanism for backpressure; adding a redundant token-count check on top of it is incorrect and harmful for long-context use cases.

Files changed
mistralrs-core/src/paged_attention/scheduler.rs
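The ordering fix in item 1 can be illustrated with a minimal sketch. The `(timestamp, id)` pairs stand in for running sequences and `pick_preemption_victim` is a hypothetical helper, not the actual scheduler code; only the sort-then-`pop_back()` pattern mirrors the PR.

```rust
use std::collections::VecDeque;

// Minimal illustration of the FCFS fix; illustrative names only.
fn pick_preemption_victim(seqs: &[(u64, &'static str)]) -> &'static str {
    let mut running: VecDeque<(u64, &'static str)> = seqs.iter().copied().collect();
    // Sort ascending by arrival timestamp (oldest at the front). The
    // buggy version additionally reversed the deque here, so pop_back()
    // returned the oldest sequence instead of the newest.
    running.make_contiguous().sort_by_key(|&(ts, _)| ts);
    // With ascending order intact, pop_back() yields the newest arrival,
    // the correct lowest-priority sequence under FCFS.
    running.pop_back().map(|(_, id)| id).unwrap()
}

fn main() {
    let victim = pick_preemption_victim(&[(30, "newest"), (10, "oldest"), (20, "middle")]);
    println!("preempt: {victim}");
}
```

Under FCFS, priority follows arrival order, so the last-arrived sequence is the right victim; reversing the sorted deque inverted that choice.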