
Re-architect FCFS Priorities and Bypass Completion Bucket Discrimination#2034

Open
glaziermag wants to merge 6 commits into EricLBuehler:master from glaziermag:fix-2033

Conversation

@glaziermag
Contributor

@glaziermag glaziermag commented Mar 28, 2026

Closes #2033.

Two independent fixes

1. Fix FCFS priority ordering in paged-attention scheduler

The sort_running_by_priority_fcfs() function sorted the running deque by timestamp ascending and then called .reverse(), so the preemption loop's pop_back() selected the oldest sequence (lowest timestamp) as the preemption candidate — the opposite of FCFS. Old sequences were consistently preempted in favor of new ones, producing first-come-last-served behavior.

Fix: Remove the .reverse() call so pop_back() correctly reaches the newest (lowest-priority under FCFS) sequence for preemption.

2. Remove undocumented hardcoded batch cap

A non-configurable cap of 16,384 tokens and 10 sequences per scheduling pass existed in the schedule() prompt loop with no documentation, no config knob, and no relation to the VRAM capacity guard that already exists via the KV block allocator. On long-context workloads (> 16k token prompts), this cap silently stopped scheduling new sequences even when VRAM was available, causing the waiting queue to grow without bound.

Fix: Remove the cap. The KV allocator's allocate_slots() returning None is the correct and existing mechanism for backpressure — adding a redundant token count check on top of it is incorrect and harmful for long-context use cases.

Files changed

  • mistralrs-core/src/paged_attention/scheduler.rs

@glaziermag glaziermag marked this pull request as draft March 28, 2026 02:46
@glaziermag glaziermag changed the title Fix: Re-architect FCFS Priorities and Bypass Completion Bucket Discrimination (#2033) Re-architect FCFS Priorities and Bypass Completion Bucket Discrimination (#2033) Mar 28, 2026
@glaziermag glaziermag changed the title Re-architect FCFS Priorities and Bypass Completion Bucket Discrimination (#2033) Re-architect FCFS Priorities and Bypass Completion Bucket Discrimination Mar 28, 2026
@glaziermag glaziermag closed this Mar 28, 2026
glaziermag added a commit to glaziermag/mistral.rs that referenced this pull request Mar 28, 2026
… abandoning length-based Prompt bucketing during Decoding
glaziermag added a commit to glaziermag/mistral.rs that referenced this pull request Mar 28, 2026
@glaziermag glaziermag reopened this Mar 28, 2026
@glaziermag glaziermag marked this pull request as ready for review March 28, 2026 22:42
emanuele-divizio-quixant added a commit to quixantplc/mistral.rs that referenced this pull request Mar 30, 2026
emanueleDiVizio added a commit to emanueleDiVizio/mistral.rs that referenced this pull request Apr 2, 2026
…duler

Reapply upstream fixes from PRs EricLBuehler#2031/EricLBuehler#2034: fix quadratic scheduling
complexity when sequences are waiting, and add FCFS priority ordering
to prevent starvation.
The PR introduced a 16384-token / 10-sequence hard limit on the prefill
batch with the comment 'Halt batch mapping if context size approaches
CuBLAS engine crash arrays.' This cap is:

  - Not configurable — silently throttles 32k+ context workloads
  - Hardcoded to a CUDA-specific limit that doesn't apply on Metal/CPU
  - Redundant — max_num_seqs and KV block allocator already enforce
    capacity and will break out of the loop correctly

Remove the cap and its tracking variables. Capacity control remains
through the existing max_num_seqs check and the KV allocator's
allocate_slots failure path.
@glaziermag
Contributor Author

Update (2026-04-15): Removed an undocumented hardcoded batch cap that was introduced in this PR.

The original commit added:

if (batched_prompt_tokens + num_new_tokens > 16384 || batched_sequences >= 10) && batched_sequences > 0 {
    break;
}

This cap was neither mentioned in the PR description nor the issue, and has significant problems:

  • Not configurable — silently throttles models with >16384-token context windows (e.g. Llama 32k, Qwen 128k) with no warning
  • Hardware-specific — the 16384 limit is CUDA/cuBLAS-specific and incorrect for Metal or CPU backends
  • Redundant — max_num_seqs and the KV allocator's allocate_slots failure path already enforce capacity correctly

The FCFS .reverse() removal and the preemption loop improvements remain untouched. Only the batch cap and its tracking variables were removed.

@github-actions

github-actions bot commented Apr 15, 2026

Code Metrics Report
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Language              Files        Lines         Code     Comments       Blanks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 C Header                  5          305          210           52           43
 CSS                       2         1181         1036           34          111
 CUDA                     59        17706        13869         1637         2200
 Dockerfile                1           39           22            8            9
 HTML                      2          235          197           14           24
 JavaScript               16         3580         2702          486          392
 Jinja2                    7          694          656            5           33
 JSON                     21          409          406            0            3
 Makefile                  1            6            5            0            1
 Metal Shading Lan|       31        11647         9007         1064         1576
 PowerShell                1          300          227           30           43
 Python                  125         8316         6808          412         1096
 Shell                     2          485          329           95           61
 Plain Text                3         3723            0         2413         1310
 TOML                     27         1290         1124           35          131
 YAML                      3           25           23            2            0
─────────────────────────────────────────────────────────────────────────────────
 Jupyter Notebooks         3          122           83           23           16
 |- Markdown               1           60           30           22            8
 |- Python                 1          122          113            1            8
 (Total)                              304          226           46           32
─────────────────────────────────────────────────────────────────────────────────
 Markdown                105        11197            0         8067         3130
 |- BASH                  72          934          691          149           94
 |- Dockerfile             1            1            1            0            0
 |- JSON                  20          719          719            0            0
 |- PowerShell             3            3            3            0            0
 |- Python                23         1038          862           60          116
 |- Rust                  51         2048         1718           54          276
 |- TOML                   6          207          164            0           43
 |- YAML                   2            9            8            1            0
 (Total)                            16156         4166         8331         3659
─────────────────────────────────────────────────────────────────────────────────
 Rust                    547       236072       207590         6565        21917
 |- Markdown             361         8962          452         7385         1125
 (Total)                           245034       208042        13950        23042
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Total                   961       311435       249055        28614        33766
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

…tch cap removal

The variable was only consumed by the removed 16384-token cap check.
The prefix cache lookup and KV allocation below use num_tokens, not
num_new_tokens. Remove both the binding and the incorrect suppression
comment to eliminate the clippy dead-variable warning.
@glaziermag
Contributor Author

Second update (2026-04-15): Removed dead variable warning.

The num_new_tokens binding (and its let _ = num_new_tokens suppressor) was left over from the batch cap removal and was never actually used by the prefix cache lookup — that path uses num_tokens. Both lines have been removed. This also corrects the misleading // used by prefix cache lookup below comment.



Development

Successfully merging this pull request may close these issues.

PagedAttentionScheduler priority sorting causes First-Come-Last-Served behavior
