Feature Description
Would it be possible to add an --unload-timeout flag for "server" mode? After the timeout elapses with no API calls, llama.cpp would unload the model and free the GPU VRAM to save power. On the next request it would automatically load the model again, wait out the timeout period once more, and unload again if no new API call arrives.
Motivation
My GPU draws a lot of power whenever a model is loaded in VRAM, even while it is just waiting for a new API task.
For example, an NVIDIA P40 24GB draws 9 W with nothing loaded in VRAM.
As soon as even a few bytes of VRAM are in use, power consumption rises to 50 W, although the GPU is not computing anything.
An RTX 3060 12GB idles at 6 W, but draws 14 W with an unused model loaded while the server waits for API access.
Possible Implementation
Example:
server --unload-timeout 120
After 120 s with no API tasks, the model is unloaded to save energy.