
paged_attention: O(N²) memory allocator thrashing and continuous batching failure #2024

@glaziermag

Description


Describe the bug

The PagedAttentionScheduler introduces an $O(N^2)$ allocation pattern and disrupts continuous batching due to the bucket_and_preempt_sequences logic in mistralrs-core/src/paged_attention/scheduler.rs.

When the scheduler encounters sequences of different lengths, it buckets them, selects only one bucket (typically resulting in batch size 1), and preempts all other sequences. This preemption alters their SequenceState, calls kv_mgr.free(seq_id) on their cached blocks, and returns them to the waiting queue.
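A minimal sketch of the behavior described above, using hypothetical names (`Seq`, `bucket_and_preempt`); the actual logic lives in `bucket_and_preempt_sequences` in `mistralrs-core/src/paged_attention/scheduler.rs` and differs in detail:

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
struct Seq {
    id: usize,
    len: usize,
}

/// Group running sequences by length, keep only one bucket, and report
/// every other sequence as preempted (cache freed, re-queued as waiting).
fn bucket_and_preempt(running: Vec<Seq>) -> (Vec<Seq>, Vec<Seq>) {
    let mut buckets: BTreeMap<usize, Vec<Seq>> = BTreeMap::new();
    for seq in running {
        buckets.entry(seq.len).or_default().push(seq);
    }
    // Select a single bucket (here: the shortest length). Everything else
    // is preempted, which in the real scheduler frees its KV blocks.
    let mut iter = buckets.into_iter();
    let scheduled = iter.next().map(|(_, v)| v).unwrap_or_default();
    let preempted = iter.flat_map(|(_, v)| v).collect();
    (scheduled, preempted)
}

fn main() {
    let running = vec![Seq { id: 0, len: 120 }, Seq { id: 1, len: 50 }];
    let (scheduled, preempted) = bucket_and_preempt(running);
    println!("scheduled: {scheduled:?}"); // only the len-50 sequence
    println!("preempted: {preempted:?}"); // the len-120 sequence loses its cache
}
```

Note that any divergence in length puts two sequences in different buckets, so a batch of mixed-length sequences collapses to batch size 1.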

Because bucket_and_preempt_sequences is invoked during both the Prompt and Completion scheduling phases, sequence length divergence disrupts PagedAttention batching and can place longer sequences into a continuous re-prefill loop.

The Execution Path & Re-Prefill Cycle

  1. Sequence A is currently generating (e.g., total context length 120, state = RunningCompletion).
  2. Sequence B is submitted as a new prompt (e.g., length 50, state = RunningPrompt).
  3. On the next tick, both are evaluated for scheduling. They are placed in different length buckets.
  4. The scheduler selects Sequence B (shortest length) and calls _preempt on Sequence A. Sequence A's KV Cache blocks are freed, and it is pushed to the waiting queue.
  5. On the subsequent tick, the scheduler retrieves Sequence A from waiting, allocates new blocks, and sets its state to SequenceState::RunningPrompt (Line 287).
  6. The Result: The Engine processes Sequence A as a prefill prompt, recomputing the entire 120-token prior context.
  7. Once recomputed, Sequence A returns to the completion phase. If its length again mismatches Sequence B's length, it is preempted, its KV cache is freed, and it is re-prefilled again.

Under concurrent requests of varying lengths, active longer generations can be repeatedly preempted and forced to perform prompt re-prefills without generating new completion tokens.
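The cost of the cycle above can be modeled with toy arithmetic (this is not project code): each cycle the long sequence is preempted, its KV cache is freed, and it must re-prefill its entire prior context to gain at best one completion token.

```rust
/// Count prefill tokens recomputed when a sequence with `context_len`
/// tokens of prior context is preempted and re-prefilled every cycle.
fn recomputed_prefill_tokens(mut context_len: usize, cycles: usize) -> usize {
    let mut recomputed = 0;
    for _ in 0..cycles {
        recomputed += context_len; // full re-prefill of the prior context
        context_len += 1;          // at best one new completion token per cycle
    }
    recomputed
}

fn main() {
    // Sequence A from the walkthrough: 120 tokens of prior context.
    // Generating just 10 tokens costs 120 + 121 + ... + 129 = 1245
    // recomputed prefill tokens: work grows quadratically in the number
    // of generated tokens instead of linearly.
    println!("{}", recomputed_prefill_tokens(120, 10)); // → 1245
}
```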

Reproduction

  1. Start mistralrs-server with Paged Attention enabled.
  2. Submit a longer generation request (e.g., prompt length 100).
  3. Concurrently submit multiple requests with varying smaller prompt lengths (e.g., lengths 50, 51, 52).
  4. Observe GPU usage climb and throughput drop as the scheduler thrashes the KV cache allocator with frequent block allocations/frees and runs at batch size 1, forcing the longer sequence to recompute its prompt repeatedly.
