
Power save mode for server --unload-timeout 120 #4598

@zwilch


Feature Description

Would it be possible to add an --unload-timeout flag to "server" mode, so that llama.cpp unloads the model and frees the GPU VRAM after a period of inactivity, to save power? On the next request it would load the model again automatically. If no new API call arrives within the timeout period, it would unload the model again.

Motivation

My GPU draws a lot of power when a model is loaded into VRAM and the server is only waiting for new API requests.
For example, an NVIDIA P40 24GB draws 9 W when nothing is loaded into VRAM.
As soon as even a few bytes of VRAM are in use, power consumption increases to 50 W, even though the GPU is not computing anything.

An RTX 3060 12GB goes from 6 W at standby to 14 W with a model loaded but unused while "server" waits for API requests.

Possible Implementation

Example:
server --unload-timeout 120
After 120 s with no API requests, the model would be unloaded to save energy.
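
To make the requested behavior concrete, here is a minimal sketch of the idle-timeout logic in Python. It is not llama.cpp code: `IdleModelManager`, `load_fn`, `unload_fn`, and `unload_if_idle` are hypothetical names made up for illustration, and the loader and unloader are placeholders for whatever actually allocates and frees VRAM.

```python
import threading
import time


class IdleModelManager:
    """Hypothetical helper: load on demand, unload after an idle timeout."""

    def __init__(self, load_fn, unload_fn, timeout_s, clock=time.monotonic):
        self._load_fn = load_fn        # placeholder: loads the model, returns a handle
        self._unload_fn = unload_fn    # placeholder: frees the model (and its VRAM)
        self._timeout_s = timeout_s    # plays the role of the proposed --unload-timeout
        self._clock = clock            # injectable so the logic can be tested
        self._lock = threading.Lock()
        self._model = None
        self._last_used = clock()

    @property
    def loaded(self):
        return self._model is not None

    def get(self):
        """Call at the start of each API request; reloads the model if needed."""
        with self._lock:
            if self._model is None:
                self._model = self._load_fn()
            self._last_used = self._clock()
            return self._model

    def unload_if_idle(self):
        """Call periodically; unloads once the idle time reaches the timeout."""
        with self._lock:
            idle = self._clock() - self._last_used
            if self._model is not None and idle >= self._timeout_s:
                self._unload_fn(self._model)
                self._model = None
                return True
            return False
```

In a server, a background timer would call `unload_if_idle()` every few seconds, and each request handler would call `get()` first. This sketch only tracks the start of a request, so a real implementation would also need to avoid unloading while a long generation is still running, and the first request after an unload would pay the full model load time.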
