
server: add auto-sleep after N seconds of idle #18228

Merged
ServeurpersoCom merged 14 commits into ggml-org:master from ngxson:xsn/server_sleep on Dec 21, 2025

Conversation

@ngxson
Collaborator

@ngxson ngxson commented Dec 20, 2025

Sleeping on Idle

The server supports an automatic sleep mode that activates after a specified period of inactivity (no incoming tasks). This feature, introduced in PR #18228, can be enabled using the --sleep-idle-seconds command-line argument. It works seamlessly in both single-model and multi-model configurations.

When the server enters sleep mode, the model and its associated memory (including the KV cache) are unloaded from RAM to conserve resources. Any new incoming task will automatically trigger the model to reload.

Note that the following endpoints are not counted as incoming tasks; they do not trigger a model reload and do not reset the idle timer:

  • GET /health
  • GET /props

Implementation

The implementation of this feature consists of 3 main parts:

  • server_queue sleeping state
  • server_context sleeping state
  • server_res_generator hook

The main loop inside server_queue acts as a watchdog timer (so we avoid spawning a dedicated thread just for the watchdog). When the idle-timeout condition is met, it signals server_context to unload the model.

server_res_generator hooks into every incoming request and asks server_queue to resume if the server is in the sleeping state. Note that some requests, like /health, bypass this check (they only access read-only data of server_context).

When asked to resume, server_queue signals server_context to reload the model, then unblocks server_res_generator so it can proceed with the rest of the request.

IMPORTANT: for thread-safety reasons, any request to the server will be blocked/delayed by server_res_generator until the server exits the sleeping state.

@ngxson ngxson marked this pull request as ready for review December 20, 2025 15:04
@ngxson ngxson requested a review from ggerganov as a code owner December 20, 2025 15:04
@ServeurpersoCom
Collaborator

Another cool feature! Rebased it on my testing-branch+master to test it out!

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Dec 20, 2025

Minimal test as a global:

; llama-server --port 8082 --models-max 1 --models-preset backend.ini --webui-config-file frontend.json
[*]
fit = off  ; Disable automatic memory fitting
ngl = 999  ; Full GPU offload
ctk = q8_0 ; KV cache key quantization
ctv = q8_0 ; KV cache value quantization
fa = on    ; Enable flash attention
mlock = on ; Lock model in RAM
np = 4     ; Parallel request batching
kvu = on   ; Unified KV cache buffer
sleep-idle-seconds = 60 ; Testing

[Dense-Devstral-Small-2-24B-Instruct-2512]
m = unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD-Q6_K_XL.gguf
c = 131072
load-on-startup = 1

[my-other-models]
...

Seems to have no effect: the model stays loaded continuously. Perhaps 'load-on-startup = 1' combined with 'sleep-idle-seconds' doesn't work yet? Same for any models I load manually.

This feature will be useful for my real use case: unloading large MoE models that spill over from VRAM into system RAM

[MoE-Uncensored-GLM-4.5-Air-Derestricted-106B]
m = bartowski/ArliAI_GLM-4.5-Air-Derestricted-GGUF/ArliAI_GLM-4.5-Air-Derestricted-Q4_K_M-00001-of-00002.gguf
n-cpu-moe = 30
c = 32768
sleep-idle-seconds = 60

I'll try this as well.

@ngxson
Collaborator Author

ngxson commented Dec 20, 2025

Seems to have no effect: the model stays loaded continuously. Perhaps 'load-on-startup = 1' combined with 'sleep-idle-seconds' doesn't work yet? Same for any models I load manually.

this feature does not unload the model instance; it is independent of router mode

instead, monitor your log and you will see log lines like this:

que    start_loop: entering sleeping state
srv  handle_sleep: server is entering sleeping state

we don't unload the whole process because of #18189 (comment)

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Dec 20, 2025

Seems to have no effect: the model stays loaded continuously. Perhaps 'load-on-startup = 1' combined with 'sleep-idle-seconds' doesn't work yet? Same for any models I load manually.

this feature does not unload the model instance; it is independent of router mode

instead, monitor your log and you will see log lines like this:

que    start_loop: entering sleeping state
srv  handle_sleep: server is entering sleeping state

we don't unload the whole process because of #18189 (comment)

Got it! I was expecting the child process to be killed, but it's an internal model unload within the process itself.
I'll monitor RSS and look for the "entering/exiting sleeping state" log lines.
Testing now with standalone mode first to validate the feature, then I'll integrate it with router mode. Thanks for the clarification!

The internal sleep approach (keeping process alive) is much cleaner than kill/respawn.
Looking forward to the future "sleep levels" feature to fine-tune which components get unloaded!

@ServeurpersoCom
Collaborator

Logs are OK, but for now VRAM and RSS remain completely unchanged after sleep (even without --mlock)

./build/bin/llama-server \
  -m /var/www/ia/models/bartowski/ArliAI_GLM-4.5-Air-Derestricted-GGUF/ArliAI_GLM-4.5-Air-Derestricted-Q4_K_M-00001-of-00002.gguf \
  -c 32768 \
  -ctk q8_0 \
  -ctv q8_0 \
  -fa on \
  -fit off \
  -ngl 999 \
  -np 4 \
  -kvu \
  --mlock \
  -ncmoe 30 \
  --sleep-idle-seconds 60 \
  --port 8082
watch -n1 'ps aux | grep "[l]lama-server" | awk "{printf \"PID: %-6s RSS: %6.0f MB\n\", \$2, \$6/1024}"'
PID: 75069  RSS:  43852 MB
srv  handle_sleep: server is entering sleeping state
PID: 75069  RSS:  43852 MB
./build/bin/llama-server \
  -m /var/www/ia/models/unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD-Q6_K_XL.gguf \
  -c 131072 \
  -ctk q8_0 \
  -ctv q8_0 \
  -fa on \
  -fit off \
  -ngl 999 \
  -np 4 \
  -kvu \
  --mlock \
  --sleep-idle-seconds 60 \
  --port 8082
(root|~) nvidia-smi
Sat Dec 20 18:55:35 2025
|   0  NVIDIA GeForce RTX 5090        Off |   00000000:01:00.0 Off |                  N/A |
|  0%   39C    P8             11W /  575W |   31151MiB /  32607MiB |      0%      Default |
|    0   N/A  N/A           83513      C   ./build/bin/llama-server              30996MiB |

main: starting the main loop...
srv  update_slots: all slots are idle
que    start_loop: entering sleeping state
srv  handle_sleep: server is entering sleeping state

(root|~) nvidia-smi
Sat Dec 20 18:56:16 2025
|   0  NVIDIA GeForce RTX 5090        Off |   00000000:01:00.0 Off |                  N/A |
|  0%   40C    P8             12W /  575W |   31151MiB /  32607MiB |      0%      Default |
|    0   N/A  N/A           83513      C   ./build/bin/llama-server              30996MiB |

Same without --mlock

@ngxson
Collaborator Author

ngxson commented Dec 20, 2025

The last commit works on my MacBook; hope it will work on other backends too (the peaks are when the model is loaded)

(screenshot: memory usage graph)

@ServeurpersoCom
Collaborator

Yes, I'll check this one

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Dec 20, 2025

It works!

Dense model on full GPU:
(screenshot)

MoE model on GPU/CPU:

while true; do 
  printf "%s - " "$(date +%H:%M:%S)"; 
  ps aux | grep "[l]lama-server" | awk 'BEGIN{total=0} {printf "PID %s: %10d B  ", $2, $6*1024; total+=$6*1024} END{printf "Total: %d B  ", total}'; 
  nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | awk '{printf "VRAM: %d MB\n", $1}'; 
  sleep 5; 
done
(a better version of the script, reporting bytes)
19:23:48 - RSS:  43836 MB  VRAM: 31811 MB
19:23:53 - RSS:  43836 MB  VRAM: 31811 MB
19:23:58 - RSS:  43836 MB  VRAM: 31811 MB
19:24:03 - RSS:  43836 MB  VRAM: 31811 MB
19:24:08 - RSS:  43836 MB  VRAM: 31811 MB
19:24:13 - RSS:  43836 MB  VRAM: 31811 MB
19:24:18 - RSS:    442 MB  VRAM: 733 MB <- sleep
19:24:23 - RSS:    442 MB  VRAM: 733 MB
19:24:28 - RSS:    442 MB  VRAM: 733 MB
19:24:33 - RSS:    442 MB  VRAM: 733 MB
19:24:38 - RSS:    442 MB  VRAM: 733 MB
19:24:43 - RSS:    442 MB  VRAM: 733 MB
19:24:48 - RSS:  44036 MB  VRAM: 31867 MB <- wake up
19:24:53 - RSS:  44036 MB  VRAM: 31867 MB
19:24:58 - RSS:  44036 MB  VRAM: 31867 MB
19:25:03 - RSS:  44036 MB  VRAM: 31867 MB
19:25:08 - RSS:  44036 MB  VRAM: 31867 MB
19:25:13 - RSS:  44036 MB  VRAM: 31867 MB

It seems to be working for RAM/VRAM on PC/Nvidia

(screenshot)

@ServeurpersoCom
Collaborator

I tested router mode, and it looks ready for merge!

@ngxson
Collaborator Author

ngxson commented Dec 20, 2025

hmm I think I need more testing, as I've seen some use-after-free issues (maybe some pointers are not up-to-date after wake up)

@ServeurpersoCom
Collaborator

OK, I'll put a global 10s sleep on all my models and use my server normally with RAM/VRAM monitoring

@ngxson
Collaborator Author

ngxson commented Dec 20, 2025

turns out the chat template system was using a freed pointer; should be fixed in the last commit. tested with a sleep timeout of 10s and it seems to work fine with the web UI

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Dec 20, 2025

(screenshot)

Router mode, all models on a 10-second timeout, with the "fix use-after-free" commit included.
Too aggressive; I'll retry with 30 seconds.

@ngxson
Collaborator Author

ngxson commented Dec 20, 2025

do you have the log? (if you build with -DLLAMA_SANITIZE_ADDRESS=ON -DCMAKE_BUILD_TYPE=RelWithDebInfo, it will show a stack trace)

@ServeurpersoCom
Collaborator

I'll retry with the 10s timeout; I think it's easy to reproduce with -DLLAMA_SANITIZE_ADDRESS=ON -DCMAKE_BUILD_TYPE=RelWithDebInfo.

@ngxson
Collaborator Author

ngxson commented Dec 20, 2025

I think I caught it; still a problem with the chat template API.

I can try to hotfix it, but it seems to me that the whole chat template API should be re-designed to prevent things like this from happening; should be related to #18215

My stack trace is here: https://gist.github.com/ngxson/e6ae6714fc2780df5c9a47edf03048fb

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Dec 20, 2025

I'm on another problem: a lock with no stack trace. I've seen this before; sometimes a model can't unload... but it's not this PR. It's as if the router can't stop a child process.

@ngxson
Collaborator Author

ngxson commented Dec 20, 2025

hmm ok, so the problem was not entirely due to the chat template, but because some handlers (like the /chat/completions handler) don't acquire a server_res_generator right away when called. The server_res_generator contains a condition_variable that prevents any access to server_context during model re-loading.

testing on my side now and it works so far. I have a slight doubt, as this PR adds some complexity in terms of code maintenance, but I think it's acceptable for now. There is just one rule to follow when writing a new endpoint:

    // IMPORTANT: all lambda functions must start with std::make_unique<server_res_generator>
    // this is to ensure that the server_res_generator can handle sleeping case correctly

@ngxson
Collaborator Author

ngxson commented Dec 20, 2025

I'm on another problem: a lock with no stack trace. I've seen this before; sometimes a model can't unload... but it's not this PR. It's as if the router can't stop a child process.

I think we can have a stopping timeout, then force kill, similar to the 10s timeout in Docker. Feel free to create an issue for this

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Dec 20, 2025

I'm on another problem: a lock with no stack trace. I've seen this before; sometimes a model can't unload... but it's not this PR. It's as if the router can't stop a child process.

I think we can have a stopping timeout, then force kill, similar to the 10s timeout in Docker. Feel free to create an issue for this

Absolutely agree, it's a classic feature to have. I'll retry my router setup when you're ready

@ngxson
Collaborator Author

ngxson commented Dec 20, 2025

Should be ready for testing now. I already did some testing on my side via web UI and so far no crashes.

Though one thing that's quite inconvenient is that /v1/models now triggers a wake-up, but that will be fixed in another PR

@ServeurpersoCom
Collaborator

Should be ready for testing now. I already did some testing on my side via web UI and so far no crashes.

Though one thing that's quite inconvenient is that /v1/models now triggers a wake-up, but that will be fixed in another PR

Thank you! I created issue #18237 so as not to forget; I will re-test the router + global sleep mode on my end

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Dec 21, 2025

Turns out -DBUILD_SHARED_LIBS=OFF -DGGML_LTO=ON in my build config was causing a strange bug: after wakeup from sleep, I observed a multi-minute freeze at a specific point between prompt completion and the start of token generation. It happens consistently after each sleep/wakeup cycle, and only for some models!

@ServeurpersoCom
Collaborator

Router setup, global 30s timeout

02:08:29 - PID 1709:  193646592 B  PID 1749:  403648512 B  Total: 597295104 B  VRAM: 733 MB <- Dense Devstral sleeping
02:08:34 - PID 1709:  193646592 B  PID 1749:  403648512 B  Total: 597295104 B  VRAM: 733 MB
02:08:39 - PID 1709:  194494464 B  PID 1749:  403648512 B  Total: 598142976 B  VRAM: 733 MB
02:08:44 - PID 1709:  196591616 B  PID 1749: 1436270592 B  Total: 1632862208 B  VRAM: 31267 MB <- Wake
02:08:49 - PID 1709:  195637248 B  PID 1749: 1436270592 B  Total: 1631907840 B  VRAM: 31267 MB
02:08:54 - PID 1709:  195637248 B  PID 1749: 1436270592 B  Total: 1631907840 B  VRAM: 31267 MB
02:08:59 - PID 1709:  195637248 B  PID 1749: 1436270592 B  Total: 1631907840 B  VRAM: 31267 MB
02:09:04 - PID 1709:  195637248 B  PID 1749: 1436270592 B  Total: 1631907840 B  VRAM: 31267 MB
02:09:09 - PID 1709:  195637248 B  PID 1749: 1436270592 B  Total: 1631907840 B  VRAM: 31267 MB
02:09:14 - PID 1709:  195637248 B  PID 1749:  442404864 B  Total: 638042112 B  VRAM: 735 MB <- Sleep
02:09:19 - PID 1709:  195637248 B  PID 1749:  442404864 B  Total: 638042112 B  VRAM: 735 MB
02:09:24 - PID 1709:  195637248 B  PID 1749: 1155596288 B  Total: 1351233536 B  VRAM: 31175 MB <- Wake
02:09:29 - PID 1709:  196476928 B  PID 1749: 1436684288 B  Total: 1633161216 B  VRAM: 31267 MB
02:09:34 - PID 1709:  196476928 B  PID 1749: 1436684288 B  Total: 1633161216 B  VRAM: 31267 MB
02:09:39 - PID 1709:  196476928 B  PID 1749: 1436684288 B  Total: 1633161216 B  VRAM: 31267 MB
02:09:44 - PID 1709:  196476928 B  PID 1749: 1436684288 B  Total: 1633161216 B  VRAM: 31267 MB
02:09:49 - PID 1709:  196476928 B  PID 1749: 1436684288 B  Total: 1633161216 B  VRAM: 31267 MB
02:09:54 - PID 1709:  196476928 B  PID 1749: 1436684288 B  Total: 1633161216 B  VRAM: 31267 MB
02:09:59 - PID 1709:  196476928 B  PID 1749:  442728448 B  Total: 639205376 B  VRAM: 735 MB <- Sleep
02:10:04 - PID 1709:  196476928 B  PID 1749:  442728448 B  Total: 639205376 B  VRAM: 735 MB
02:10:09 - PID 1709:  197283840 B  PID 1749:  442728448 B  Total: 640012288 B  VRAM: 735 MB
02:10:14 - PID 1709:  197324800 B  Total: 197324800 B  VRAM: 174 MB <- Unload child
02:10:19 - PID 1709:  198131712 B  Total: 198131712 B  VRAM: 174 MB
02:10:24 - PID 1709:  199913472 B  Total: 199913472 B  VRAM: 174 MB
02:10:29 - PID 1709:  199913472 B  Total: 199913472 B  VRAM: 174 MB
02:10:34 - PID 1709:  200785920 B  PID 8639: 2147483647 B  Total: 2147483647 B  VRAM: 677 MB <- Load GLM Air (the 2147483647 values are awk 32-bit integer overflow on the byte counts)
02:10:39 - PID 1709:  200785920 B  PID 8639: 2147483647 B  Total: 2147483647 B  VRAM: 677 MB
02:10:44 - PID 1709:  201269248 B  PID 8639: 2147483647 B  Total: 2147483647 B  VRAM: 31475 MB <- Completed
02:10:50 - PID 1709:  201207808 B  PID 8639: 2147483647 B  Total: 2147483647 B  VRAM: 31475 MB
02:10:55 - PID 1709:  201207808 B  PID 8639: 2147483647 B  Total: 2147483647 B  VRAM: 31475 MB
02:11:00 - PID 1709:  201207808 B  PID 8639: 2147483647 B  Total: 2147483647 B  VRAM: 31575 MB
02:11:05 - PID 1709:  201207808 B  PID 8639: 2147483647 B  Total: 2147483647 B  VRAM: 31575 MB
02:11:10 - PID 1709:  201207808 B  PID 8639: 2147483647 B  Total: 2147483647 B  VRAM: 31575 MB
02:11:15 - PID 1709:  202047488 B  PID 8639: 2147483647 B  Total: 2147483647 B  VRAM: 31575 MB
02:11:20 - PID 1709:  202047488 B  PID 8639: 2147483647 B  Total: 2147483647 B  VRAM: 31575 MB
02:11:25 - PID 1709:  202047488 B  PID 8639: 2147483647 B  Total: 2147483647 B  VRAM: 31575 MB
02:11:30 - PID 1709:  202047488 B  PID 8639: 2147483647 B  Total: 2147483647 B  VRAM: 31575 MB
02:11:35 - PID 1709:  202047488 B  PID 8639: 2147483647 B  Total: 2147483647 B  VRAM: 31575 MB
02:11:40 - PID 1709:  202047488 B  PID 8639: 2147483647 B  Total: 2147483647 B  VRAM: 31575 MB
02:11:45 - PID 1709:  202047488 B  PID 8639:  656777216 B  Total: 858824704 B  VRAM: 735 MB <- Sleep GLM Air
02:11:50 - PID 1709:  202047488 B  PID 8639:  656777216 B  Total: 858824704 B  VRAM: 735 MB
02:11:55 - PID 1709:  202047488 B  PID 8639:  656777216 B  Total: 858824704 B  VRAM: 735 MB
02:12:00 - PID 1709:  202047488 B  PID 8639:  656777216 B  Total: 858824704 B  VRAM: 735 MB
02:12:05 - PID 1709:  202047488 B  PID 8639:  656777216 B  Total: 858824704 B  VRAM: 735 MB
02:12:10 - PID 1709:  202047488 B  PID 8639: 2147483647 B  Total: 2147483647 B  VRAM: 27407 MB <- Wake
02:12:15 - PID 1709:  202047488 B  PID 8639: 2147483647 B  Total: 2147483647 B  VRAM: 31577 MB
02:12:20 - PID 1709:  202047488 B  PID 8639: 2147483647 B  Total: 2147483647 B  VRAM: 31577 MB
02:12:25 - PID 1709:  202047488 B  PID 8639: 2147483647 B  Total: 2147483647 B  VRAM: 31577 MB
02:12:30 - PID 1709:  202891264 B  PID 8639: 2147483647 B  Total: 2147483647 B  VRAM: 31577 MB
02:12:35 - PID 1709:  202891264 B  PID 8639: 2147483647 B  Total: 2147483647 B  VRAM: 31577 MB
02:12:40 - PID 1709:  202891264 B  PID 8639: 2147483647 B  Total: 2147483647 B  VRAM: 31577 MB
02:12:45 - PID 1709:  202891264 B  PID 8639: 2147483647 B  Total: 2147483647 B  VRAM: 31577 MB
02:12:50 - PID 1709:  202891264 B  PID 8639: 2147483647 B  Total: 2147483647 B  VRAM: 31577 MB
02:12:55 - PID 1709:  202891264 B  PID 8639: 2147483647 B  Total: 2147483647 B  VRAM: 31577 MB
02:13:00 - PID 1709:  202891264 B  PID 8639:  667553792 B  Total: 870445056 B  VRAM: 737 MB
02:13:05 - PID 1709:  202891264 B  PID 8639:  667553792 B  Total: 870445056 B  VRAM: 737 MB
02:13:10 - PID 1709:  202891264 B  PID 8639:  667553792 B  Total: 870445056 B  VRAM: 737 MB
02:13:15 - PID 1709:  203698176 B  PID 8639:  667553792 B  Total: 871251968 B  VRAM: 737 MB
02:13:20 - PID 1709:  203718656 B  Total: 203718656 B  VRAM: 174 MB
^C

Looks sane on a complete setup -> merge

@ServeurpersoCom ServeurpersoCom merged commit ddcb75d into ggml-org:master Dec 21, 2025
77 checks passed
Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
* implement sleeping at queue level

* implement server-context suspend

* add test

* add docs

* optimization: add fast path

* make sure to free llama_init

* nits

* fix use-after-free

* allow /models to be accessed during sleeping, fix use-after-free

* don't allow accessing /models during sleep, it is not thread-safe

* fix data race on accessing props and model_meta

* small clean up

* trailing whitespace

* rm outdated comments

Labels

examples, python, server
