Feature Description
In router mode, when `--sleep-idle-seconds` triggers, the child subprocess unloads the model from VRAM, but the process remains alive and attached to the GPU, consuming ~600MiB per idle subprocess:

```
# Active
467282 dev 0 Compute 0% 10386MiB 11% 824MiB llama-server ...
# After sleep-idle triggers — process still on GPU
467282 dev 0 Compute N/A 614MiB 1% 369MiB llama-server ...
```
Motivation
Idle subprocesses should not remain attached to the GPU when they are not needed. With multiple models in router mode, the residual ~600MiB per dormant process adds up to a significant amount of wasted VRAM.
Relation to #18189
Follow-up to #18189. PR #18228 implemented `--sleep-idle-seconds`, but it only unloads the model within the still-living process — it does not terminate the subprocess. The original request was closed as stale without this being addressed.
Possible Implementation
A new option (e.g. `--stop-idle-seconds`) that triggers full subprocess termination in router mode via the existing `unload()` path. The building blocks are already there:

- `server_queue` already tracks idle time
- `server_models::unload()` already handles graceful shutdown → force-kill
- These two just need to be wired together, with the router re-spawning the process on the next request (same as `--models-max` LRU eviction already does)
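For illustration, here is a minimal sketch of the proposed router-side behavior in Python. The class name, method names, and the use of a watchdog thread are all hypothetical; the real implementation would live in llama.cpp's C++ router and reuse `server_queue` / `server_models::unload()` rather than anything shown here.

```python
import subprocess
import threading
import time

class ModelProcess:
    """Hypothetical wrapper: spawns a worker process lazily, terminates it
    after stop_idle_seconds of inactivity, and re-spawns it on the next
    request (same pattern as --models-max LRU eviction)."""

    def __init__(self, cmd, stop_idle_seconds):
        self.cmd = cmd
        self.stop_idle_seconds = stop_idle_seconds
        self.proc = None
        self.last_used = time.monotonic()
        self.lock = threading.Lock()
        # Watchdog periodically checks idle time, mirroring the idle
        # tracking that server_queue already does in-process.
        threading.Thread(target=self._watchdog, daemon=True).start()

    def _watchdog(self):
        while True:
            time.sleep(0.1)
            with self.lock:
                idle = time.monotonic() - self.last_used
                if self.proc is not None and idle >= self.stop_idle_seconds:
                    # Graceful shutdown first, then force-kill, analogous
                    # to the server_models::unload() path.
                    self.proc.terminate()
                    try:
                        self.proc.wait(timeout=5)
                    except subprocess.TimeoutExpired:
                        self.proc.kill()
                    # VRAM is fully released once the process is gone.
                    self.proc = None

    def handle_request(self):
        with self.lock:
            self.last_used = time.monotonic()
            if self.proc is None or self.proc.poll() is not None:
                # Re-spawn on demand, like the router does after eviction.
                self.proc = subprocess.Popen(self.cmd)
            return self.proc.pid
```

Usage: the router would call `handle_request()` on every incoming request for that model; the first request after an idle shutdown transparently pays the spawn (and model load) cost again, which is exactly the trade-off `--models-max` eviction already makes.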