
server: router mode subprocess still occupies GPU after sleep-idle-seconds #19379

@leonardcser

Description


Feature Description

In router mode, when --sleep-idle-seconds triggers, the child subprocess unloads the model from VRAM but the process remains alive and attached to the GPU, consuming ~600MiB per idle subprocess:

```
# Active
467282 dev 0 Compute  0% 10386MiB 11% 824MiB llama-server ...

# After sleep-idle triggers — process still on GPU
467282 dev 0 Compute N/A   614MiB  1% 369MiB llama-server ...
```

Motivation

Idle subprocesses should not remain as GPU processes when they are not needed. With multiple models in router mode, the residual ~600MiB per dormant process wastes significant VRAM.

Relation to #18189

Follow-up to #18189. PR #18228 implemented --sleep-idle-seconds, but it only unloads the model inside the still-running process; it does not terminate the subprocess. The original request was closed as stale before this was addressed.

Possible Implementation

A new option (e.g. --stop-idle-seconds) that triggers full subprocess termination in router mode via the existing unload() path. The building blocks are already there:

  • server_queue already tracks idle time
  • server_models::unload() already handles graceful shutdown → force-kill
  • These two just need to be wired together, with the router re-spawning the process on the next request, just as --models-max LRU eviction already does
