server: add auto-sleep after N seconds of idle #18228
ServeurpersoCom merged 14 commits into ggml-org:master
Conversation
|
Another cool feature! Rebased it on my testing-branch+master to test it out! |
|
Minimal test as a global option: it seems to have no effect, the model stays loaded continuously. Perhaps 'load-on-startup = 1' combined with 'sleep-idle-seconds' doesn't work yet? Same for any models I load manually. This feature will be useful for my real use case: unloading large MoE models that spill over from VRAM into system RAM. I'll try this as well. |
This feature does not unload the model instance; it is independent of router mode. Instead, monitor your log and you will see the corresponding log lines. We don't unload the whole process because of #18189 (comment) |
Got it! I was expecting the child process to be killed, but it's an internal model unload within the process itself. The internal sleep approach (keeping process alive) is much cleaner than kill/respawn. |
|
The logs are OK, but for now VRAM and RSS remain completely unchanged after sleep. Same without --mlock.
|
Yes, I'll check this one
|
I tested router mode, and it looks ready to merge!
|
hmm I think I need more testing as I've seen some use-after-free issues (maybe some pointers are not up to date after wake-up)
|
OK, I'll put a global 10s sleep on all my models and use my server normally with RAM/VRAM monitoring |
|
turns out the chat template system was using a freed pointer, should be fixed in the last commit. tested with sleep timeout of 10s and seems to work fine with web UI |
|
do you have the log? (if you build with |
|
I'll retry with a 10s timeout.
|
I think I caught it: there is still a problem with the chat template API. I can try to hotfix it, but it seems to me that the whole chat template API should be redesigned to prevent things like this from happening; this should be related to #18215. My stack trace is here: https://gist.github.com/ngxson/e6ae6714fc2780df5c9a47edf03048fb
|
I'm on another problem: a lock without a stack trace. I've already seen this; sometimes a model can't unload... but it's not this PR. It's like the router can't stop a child
|
hmm ok, so the problem was not entirely due to the chat template, but because some handlers (like the /chat/completions handler) don't acquire a [...] testing on my side now and it works so far. I just have a bit of doubt, as this PR adds some complexity in terms of code maintenance, but I think it's acceptable for now. There is just one rule added when writing a new endpoint:
// IMPORTANT: all lambda functions must start with std::make_unique<server_res_generator>
// this is to ensure that the server_res_generator can handle sleeping case correctly |
I think we can have a stopping timeout, then force kill, similar to the 10s timeout on docker. feel free to create an issue for this |
Absolutely agree, it's a classic feature to have. I'll retry my router setup when you're ready |
|
Should be ready for testing now. I already did some testing on my side via the web UI and so far no crashes. One thing that is quite inconvenient, though, is that /v1/models now triggers a wake-up, but that will be fixed in another PR
Thank you! I created the issue #18237 so as not to forget, I will re-test the router + global sleep mode on my end |
|
Turns out -DBUILD_SHARED_LIBS=OFF -DGGML_LTO=ON in my build config was causing a strange bug (after wake-up from sleep, I observed a multi-minute freeze at a specific point between prompt completion and the start of token generation. It happens consistently after each sleep/wake-up cycle -> and only for some models!!!)
|
Router setup, global 30s timeout. Looks sane on a complete setup -> merge
* implement sleeping at queue level
* implement server-context suspend
* add test
* add docs
* optimization: add fast path
* make sure to free llama_init
* nits
* fix use-after-free
* allow /models to be accessed during sleeping, fix use-after-free
* don't allow accessing /models during sleep, it is not thread-safe
* fix data race on accessing props and model_meta
* small clean up
* trailing whitespace
* rm outdated comments




Sleeping on Idle
The server supports an automatic sleep mode that activates after a specified period of inactivity (no incoming tasks). This feature, introduced in PR #18228, can be enabled using the --sleep-idle-seconds command-line argument. It works seamlessly in both single-model and multi-model configurations.
When the server enters sleep mode, the model and its associated memory (including the KV cache) are unloaded from RAM to conserve resources. Any new incoming task will automatically trigger the model to reload.
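As a rough illustration of the timing rule described above, the sketch below models only the idle decision: every task resets a timestamp, and the server becomes due for sleep once the configured number of seconds has passed without a task. The idle_tracker type and its members are hypothetical names chosen for this example, not the server's actual code, and treating a negative value as "never sleep" is an assumption.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not part of llama.cpp): remembers when the last task
// arrived and reports whether the idle period configured through
// --sleep-idle-seconds has elapsed.
struct idle_tracker {
    using clock = std::chrono::steady_clock;

    int               sleep_idle_seconds; // assumption: a negative value disables sleeping
    clock::time_point last_task;          // time of the most recent task

    idle_tracker(int seconds, clock::time_point now)
        : sleep_idle_seconds(seconds), last_task(now) {}

    // Called whenever a task arrives; restarts the idle period.
    void on_task(clock::time_point now) { last_task = now; }

    // True once at least sleep_idle_seconds have passed since the last task.
    bool should_sleep(clock::time_point now) const {
        if (sleep_idle_seconds < 0) {
            return false;
        }
        return now - last_task >= std::chrono::seconds(sleep_idle_seconds);
    }
};
```

With a 10-second setting, should_sleep() stays false until ten seconds after the last on_task() call; passing the current time in explicitly keeps the logic easy to test without real waiting.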
Note that the following endpoints are exempt from being considered as incoming tasks. They do not trigger model reloading and do not reset the idle timer:
* GET /health
* GET /props

Implementation
The implementation of this feature consists of 3 main parts:
* server_queue sleeping state
* server_context sleeping state
* server_res_generator hook
The main loop inside server_queue acts as a watchdog timer (so we can avoid spawning a dedicated thread just for the watchdog). Once the idle timeout has elapsed, it signals server_context to unload the model.
server_res_generator hooks into every incoming request and asks server_queue to resume if it is in the sleeping state. Note that some requests, like /health, bypass this check (they can only access read-only data of server_context).
When asked to resume, server_queue signals server_context to reload the models, then unblocks server_res_generator so it can proceed with the rest of the request.
IMPORTANT: for thread-safety reasons, any request to the server will be blocked/delayed by server_res_generator until the server exits the sleeping state.
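The interaction of the three parts can be sketched as follows. This is a simplified single-file model, not the actual llama.cpp code: toy_context, toy_queue, toy_res_generator and the two handler factories are hypothetical names, the timeout is given in milliseconds to keep the example quick, and the wake-up is performed directly by the requesting thread under the queue's mutex instead of being signalled to the main loop.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical stand-in for server_context: only records whether the model is loaded.
struct toy_context {
    bool model_loaded = true;
    void unload() { model_loaded = false; }
    void reload() { model_loaded = true; }
};

// Hypothetical stand-in for server_queue: its loop doubles as the watchdog timer.
class toy_queue {
public:
    toy_queue(toy_context & ctx, std::chrono::milliseconds idle_timeout)
        : ctx_(ctx), idle_timeout_(idle_timeout) {}

    // One iteration of the main loop. Waits for a task; if none arrives before
    // the idle timeout, enters the sleeping state and unloads the model.
    // Returns true if a task was consumed.
    bool wait_one() {
        std::unique_lock<std::mutex> lock(mtx_);
        const bool got_task = cv_.wait_for(lock, idle_timeout_, [this] { return pending_ > 0; });
        if (got_task) {
            --pending_;
            return true;
        }
        if (!sleeping_) {
            sleeping_ = true;
            ctx_.unload();
        }
        return false;
    }

    // Queue one task and notify the main loop.
    void post() {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            ++pending_;
        }
        cv_.notify_one();
    }

    // Reload the model if sleeping. Concurrent callers block on the mutex
    // until the reload has finished.
    void wake_if_sleeping() {
        std::lock_guard<std::mutex> lock(mtx_);
        if (sleeping_) {
            ctx_.reload();
            sleeping_ = false;
        }
    }

    bool is_sleeping() {
        std::lock_guard<std::mutex> lock(mtx_);
        return sleeping_;
    }

private:
    toy_context &             ctx_;
    std::chrono::milliseconds idle_timeout_;
    std::mutex                mtx_;
    std::condition_variable   cv_;
    int                       pending_  = 0;
    bool                      sleeping_ = false;
};

// Hypothetical stand-in for server_res_generator: constructing it is the hook
// that wakes the server, unless the endpoint is allowed to bypass the check.
struct toy_res_generator {
    explicit toy_res_generator(toy_queue & queue, bool bypass_sleep = false) {
        if (!bypass_sleep) {
            queue.wake_if_sleeping();
        }
    }
};

using handler_t = std::function<std::string()>;

// Regular endpoint: the lambda starts by creating the generator, so the model
// is loaded again before the rest of the handler runs.
handler_t make_completion_handler(toy_queue & queue, toy_context & ctx) {
    return [&queue, &ctx]() -> std::string {
        auto res = std::make_unique<toy_res_generator>(queue);
        queue.post();
        return ctx.model_loaded ? "ok" : "model not loaded";
    };
}

// Exempt endpoint (in the spirit of /health): bypasses the wake-up check.
handler_t make_health_handler(toy_queue & queue) {
    return [&queue]() -> std::string {
        auto res = std::make_unique<toy_res_generator>(queue, /*bypass_sleep=*/true);
        return "ok";
    };
}
```

Calling wait_one() with no pending task puts the toy server to sleep after the timeout; invoking the health handler leaves it asleep, while invoking the completion handler reloads the model first. Creating the generator as the first statement of every handler mirrors the rule quoted in the conversation, and keeping the wake-up inside its constructor means a new endpoint cannot forget the check as long as it follows that rule.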