
Re-architect FCFS Priorities and Bypass Completion Bucket Discrimination#2034

Open
glaziermag wants to merge 6 commits into EricLBuehler:master from glaziermag:fix-2033

Conversation

@glaziermag
Contributor

@glaziermag glaziermag commented Mar 28, 2026

Closes #2033.

Two independent fixes

1. Fix FCFS priority ordering in paged-attention scheduler

The sort_running_by_priority_fcfs() function sorted the running deque by timestamp ascending and then called .reverse(), so the preemption loop's pop_back() selected the oldest sequence (lowest timestamp) as the preemption candidate — the opposite of FCFS. Old sequences were consistently preempted in favor of new ones, producing first-come-last-served behavior.

Fix: Remove the .reverse() call so pop_back() correctly reaches the newest (lowest-priority under FCFS) sequence for preemption.

2. Remove undocumented hardcoded batch cap

A non-configurable cap of 16,384 tokens and 10 sequences per scheduling pass existed in the schedule() prompt loop with no documentation, no config knob, and no relation to the VRAM capacity guard that already exists via the KV block allocator. On long-context workloads (> 16k token prompts), this cap silently stopped scheduling new sequences even when VRAM was available, causing the waiting queue to grow without bound.

Fix: Remove the cap. The KV allocator's allocate_slots() returning None is the correct and existing mechanism for backpressure — adding a redundant token count check on top of it is incorrect and harmful for long-context use cases.

Files changed

  • mistralrs-core/src/paged_attention/scheduler.rs

@glaziermag glaziermag marked this pull request as draft March 28, 2026 02:46
@glaziermag glaziermag changed the title Fix: Re-architect FCFS Priorities and Bypass Completion Bucket Discrimination (#2033) Re-architect FCFS Priorities and Bypass Completion Bucket Discrimination (#2033) Mar 28, 2026
@glaziermag glaziermag changed the title Re-architect FCFS Priorities and Bypass Completion Bucket Discrimination (#2033) Re-architect FCFS Priorities and Bypass Completion Bucket Discrimination Mar 28, 2026
@glaziermag glaziermag closed this Mar 28, 2026
glaziermag added a commit to glaziermag/mistral.rs that referenced this pull request Mar 28, 2026
… abandoning length-based Prompt bucketing during Decoding
glaziermag added a commit to glaziermag/mistral.rs that referenced this pull request Mar 28, 2026
@glaziermag glaziermag reopened this Mar 28, 2026
@glaziermag glaziermag marked this pull request as ready for review March 28, 2026 22:42
emanuele-divizio-quixant added a commit to quixantplc/mistral.rs that referenced this pull request Mar 30, 2026
emanueleDiVizio added a commit to emanueleDiVizio/mistral.rs that referenced this pull request Apr 2, 2026
…duler

Reapply upstream fixes from PRs EricLBuehler#2031/EricLBuehler#2034: fix quadratic scheduling
complexity when sequences are waiting, and add FCFS priority ordering
to prevent starvation.
The PR introduced a 16384-token / 10-sequence hard limit on the prefill
batch with the comment 'Halt batch mapping if context size approaches
CuBLAS engine crash arrays.' This cap is:

  - Not configurable — silently throttles 32k+ context workloads
  - Hardcoded to a CUDA-specific limit that doesn't apply on Metal/CPU
  - Redundant — max_num_seqs and KV block allocator already enforce
    capacity and will break out of the loop correctly

Remove the cap and its tracking variables. Capacity control remains
through the existing max_num_seqs check and the KV allocator's
allocate_slots failure path.
@glaziermag
Contributor Author

Update (2026-04-15): Removed an undocumented hardcoded batch cap that was introduced in this PR.

The original commit added:

if (batched_prompt_tokens + num_new_tokens > 16384 || batched_sequences >= 10) && batched_sequences > 0 {
    break;
}

This cap was neither mentioned in the PR description nor the issue, and has significant problems:

  • Not configurable — silently throttles models with >16384-token context windows (e.g. Llama 32k, Qwen 128k) with no warning
  • Hardware-specific — the 16384 limit is CUDA/cuBLAS-specific and incorrect for Metal or CPU backends
  • Redundant — max_num_seqs and the KV allocator's allocate_slots failure path already enforce capacity correctly

The FCFS .reverse() removal and the preemption loop improvements remain untouched. Only the batch cap and its tracking variables were removed.

@github-actions

github-actions bot commented Apr 15, 2026

Code Metrics Report
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Language              Files        Lines         Code     Comments       Blanks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 C Header                  5          305          210           52           43
 CSS                       2         1181         1036           34          111
 CUDA                     59        17706        13869         1637         2200
 Dockerfile                1           39           22            8            9
 HTML                      2          235          197           14           24
 JavaScript               16         3580         2702          486          392
 Jinja2                    7          694          656            5           33
 JSON                     21          409          406            0            3
 Makefile                  1            6            5            0            1
 Metal Shading Lan|       31        11647         9007         1064         1576
 PowerShell                1          300          227           30           43
 Python                  125         8316         6808          412         1096
 Shell                     2          485          329           95           61
 Plain Text                3         3723            0         2413         1310
 TOML                     27         1290         1124           35          131
 YAML                      3           25           23            2            0
─────────────────────────────────────────────────────────────────────────────────
 Jupyter Notebooks         3          122           83           23           16
 |- Markdown               1           60           30           22            8
 |- Python                 1          122          113            1            8
 (Total)                              304          226           46           32
─────────────────────────────────────────────────────────────────────────────────
 Markdown                105        11197            0         8067         3130
 |- BASH                  72          934          691          149           94
 |- Dockerfile             1            1            1            0            0
 |- JSON                  20          719          719            0            0
 |- PowerShell             3            3            3            0            0
 |- Python                23         1038          862           60          116
 |- Rust                  51         2048         1718           54          276
 |- TOML                   6          207          164            0           43
 |- YAML                   2            9            8            1            0
 (Total)                            16156         4166         8331         3659
─────────────────────────────────────────────────────────────────────────────────
 Rust                    547       236072       207590         6565        21917
 |- Markdown             361         8962          452         7385         1125
 (Total)                           245034       208042        13950        23042
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Total                   961       311435       249055        28614        33766
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

…tch cap removal

The variable was only consumed by the removed 16384-token cap check.
The prefix cache lookup and KV allocation below use num_tokens, not
num_new_tokens. Remove both the binding and the incorrect suppression
comment to eliminate the clippy dead-variable warning.
@glaziermag
Contributor Author

Second update (2026-04-15): Removed dead variable warning.

The num_new_tokens binding (and its let _ = num_new_tokens suppressor) was left over from the batch cap removal and was never actually used by the prefix cache lookup — that path uses num_tokens. Both lines have been removed. This also corrects the misleading // used by prefix cache lookup below comment.



Development

Successfully merging this pull request may close these issues.

PagedAttentionScheduler priority sorting causes First-Come-Last-Served behavior
