
paged_attention: O(N²) memory allocator thrashing and continuous batching failure #2024

@glaziermag

Description


Describe the bug

The PagedAttentionScheduler introduces an $O(N^2)$ allocation pattern and disrupts continuous batching due to the bucket_and_preempt_sequences logic in mistralrs-core/src/paged_attention/scheduler.rs.

When the scheduler encounters sequences of different lengths, it buckets them, selects only one bucket (typically resulting in batch size 1), and preempts all other sequences. This preemption alters their SequenceState, calls kv_mgr.free(seq_id) on their cached blocks, and returns them to the waiting queue.
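A minimal sketch of the behavior described above, using hypothetical names (`Seq`, `bucket_and_preempt`); the actual logic lives in `bucket_and_preempt_sequences` in `mistralrs-core/src/paged_attention/scheduler.rs` and differs in detail:

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
struct Seq {
    id: usize,
    len: usize,
}

/// Group running sequences by length, keep only one bucket, and report
/// every other sequence as preempted (cache freed, re-queued as waiting).
fn bucket_and_preempt(running: Vec<Seq>) -> (Vec<Seq>, Vec<Seq>) {
    let mut buckets: BTreeMap<usize, Vec<Seq>> = BTreeMap::new();
    for seq in running {
        buckets.entry(seq.len).or_default().push(seq);
    }
    // Select a single bucket (here: the shortest length). Everything else
    // is preempted, which in the real scheduler frees its KV blocks.
    let mut iter = buckets.into_iter();
    let scheduled = iter.next().map(|(_, v)| v).unwrap_or_default();
    let preempted = iter.flat_map(|(_, v)| v).collect();
    (scheduled, preempted)
}

fn main() {
    let running = vec![Seq { id: 0, len: 120 }, Seq { id: 1, len: 50 }];
    let (scheduled, preempted) = bucket_and_preempt(running);
    println!("scheduled: {scheduled:?}"); // only the len-50 sequence
    println!("preempted: {preempted:?}"); // the len-120 sequence loses its cache
}
```

Note that any divergence in length puts two sequences in different buckets, so a batch of mixed-length sequences collapses to batch size 1.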

Because bucket_and_preempt_sequences is invoked during both the Prompt and Completion scheduling phases, sequence length divergence disrupts PagedAttention batching and can place longer sequences into a continuous re-prefill loop.

The Execution Path & Re-Prefill Cycle

  1. Sequence A is currently generating (e.g., total context length 120, state = RunningCompletion).
  2. Sequence B is submitted as a new prompt (e.g., length 50, state = RunningPrompt).
  3. On the next tick, both are evaluated for scheduling. They are placed in different length buckets.
  4. The scheduler selects Sequence B (shortest length) and calls _preempt on Sequence A. Sequence A's KV Cache blocks are freed, and it is pushed to the waiting queue.
  5. On the subsequent tick, the scheduler retrieves Sequence A from waiting, allocates new blocks, and sets its state to SequenceState::RunningPrompt (Line 287).
  6. The Result: The Engine processes Sequence A as a prefill prompt, recomputing the entire 120-token prior context.
  7. Once recomputed, Sequence A returns to the completion phase. If its length again mismatches Sequence B's length, it is preempted, its KV cache is freed, and it is re-prefilled again.

Under concurrent requests of varying lengths, active longer generations can be repeatedly preempted and forced to perform prompt re-prefills without generating new completion tokens.
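The cost of the cycle above can be modeled with toy arithmetic (this is not project code): each cycle the long sequence is preempted, its KV cache is freed, and it must re-prefill its entire prior context to gain at best one completion token.

```rust
/// Count prefill tokens recomputed when a sequence with `context_len`
/// tokens of prior context is preempted and re-prefilled every cycle.
fn recomputed_prefill_tokens(mut context_len: usize, cycles: usize) -> usize {
    let mut recomputed = 0;
    for _ in 0..cycles {
        recomputed += context_len; // full re-prefill of the prior context
        context_len += 1;          // at best one new completion token per cycle
    }
    recomputed
}

fn main() {
    // Sequence A from the walkthrough: 120 tokens of prior context.
    // Generating just 10 tokens costs 120 + 121 + ... + 129 = 1245
    // recomputed prefill tokens: work grows quadratically in the number
    // of generated tokens instead of linearly.
    println!("{}", recomputed_prefill_tokens(120, 10)); // → 1245
}
```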

Reproduction

  1. Start mistralrs-server with Paged Attention enabled.
  2. Submit a longer generation request (e.g., prompt length 100).
  3. Concurrently submit multiple requests with varying smaller prompt lengths (e.g., lengths 50, 51, 52).
  4. Observe GPU usage climb and throughput drop as the scheduler thrashes the KV cache allocator with frequent block allocations/frees and runs at batch size 1, forcing the longer sequence to recompute its prompt repeatedly.
