Address possible memory leak during model sleep/unload #2030
Open
glaziermag wants to merge 1 commit into EricLBuehler:master
Update (2026-04-15): The original branch ( ... Additionally, ...
Old (aggressive):
```rust
sys::cuMemPoolTrimTo(pool, 0); // releases ALL cached capacity; also evicts other models' idle blocks
```
New (targeted):
```rust
// Query currently-used bytes first
let attr_ok = sys::cuMemPoolGetAttribute(pool, CU_MEMPOOL_ATTR_USED_MEM_CURRENT, &mut used_bytes ...);
if attr_ok == CUDA_SUCCESS {
    sys::cuMemPoolTrimTo(pool, used_bytes as usize); // releases idle reserve only
}
// If attribute query fails, skip trim; OS reclaims via pool refcount eviction
```
Trimming to 0 in a multi-model server evicts blocks that other still-active models may be about to reuse, forcing expensive re-allocations from the OS and potential OOM. Trimming to ...
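The query-then-trim control flow in the targeted version can be exercised without a GPU. The sketch below is only an illustration under assumptions: `MemPool`, `MockPool`, and `trim_to_watermark` are hypothetical names that do not exist in mistral.rs or in the CUDA bindings, and the mock only records what was requested rather than modelling what the driver actually releases.

```rust
/// Hypothetical abstraction over a device memory pool (not a mistral.rs or CUDA type).
trait MemPool {
    /// Bytes currently in use, or Err(code) if the attribute query is not supported.
    fn used_bytes(&self) -> Result<u64, i32>;
    /// Ask the pool to release cached memory, keeping at least `min_bytes_to_keep` reserved.
    fn trim_to(&mut self, min_bytes_to_keep: usize) -> Result<(), i32>;
}

/// Query the in-use watermark and trim to it.
/// Returns Some(watermark) if a trim was issued, None if it was skipped or failed.
fn trim_to_watermark<P: MemPool>(pool: &mut P) -> Option<usize> {
    // If the query fails (for example on an old driver), skip the trim entirely.
    let used = pool.used_bytes().ok()?;
    // Mirrors the `used_bytes as usize` cast in the snippet above.
    let keep = used as usize;
    pool.trim_to(keep).ok()?;
    Some(keep)
}

/// Test double that records every trim request it receives.
struct MockPool {
    used: Result<u64, i32>,
    trim_calls: Vec<usize>,
}

impl MemPool for MockPool {
    fn used_bytes(&self) -> Result<u64, i32> {
        self.used
    }

    fn trim_to(&mut self, min_bytes_to_keep: usize) -> Result<(), i32> {
        self.trim_calls.push(min_bytes_to_keep);
        Ok(())
    }
}
```

With a `MockPool` whose query succeeds, `trim_to_watermark` issues exactly one trim at the reported watermark; with a failing query it returns `None` and never calls `trim_to`.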
Tentatively addresses #545.
Problem
After calling `/v1/sleep` (model unload), VRAM was not fully returned to the OS. The likely cause: tensors held in `RebootState` were dropped on the HTTP handler thread, which does not have the CUDA context bound to it. `cuMemFreeAsync` requires the context to be current on the calling thread; without this binding, deallocations silently fail at the driver level and the memory pool retains the allocations.
Fix
Three steps, in order:
1. Join the engine worker thread before dropping `RebootState`, ensuring the async engine has fully exited and released its own tensor references before the HTTP thread attempts the drop.
2. Bind the CUDA context to the HTTP thread (`dev.cuda_stream().context().bind_to_thread()`) before `drop(reboot_state)`. This ensures `cuMemFreeAsync` has a valid context to execute against.
3. After the drop, call `device.synchronize()` to flush any in-flight async frees, then trim the CUDA default memory pool to its currently-used watermark to release the idle reserve back to the OS.
Pool trim approach
`cuMemPoolTrimTo(pool, 0)` (trim everything) is avoided because in a multi-model server the default memory pool is shared. Trimming to zero would evict blocks held idle by other still-active models, forcing expensive OS reallocations. Instead, the trim queries `CU_MEMPOOL_ATTR_USED_MEM_CURRENT` to get the active-use watermark and trims only to that value. If the query fails (old driver), the trim is skipped entirely.
Clean branch
The original branch (`fix-memory-leak-unload-v4`) was based on the fork's master rather than `origin/master`, so its diff includes ~200 unrelated upstream files. A clean branch containing only this commit has been pushed as `fix-memory-leak-unload-clean` on the same fork.
Files changed
- mistralrs-core/src/lib.rs
- mistralrs-server-core/src/handlers.rs
- mistralrs-server-core/src/mistralrs_server_router_builder.rs
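The reason step 1 of the fix joins the worker before the drop can be illustrated with plain `std` types. This is a minimal sketch under assumptions, not the PR's code: `FakeBuffer`, `FakeRebootState`, `spawn_worker`, and `unload` are hypothetical stand-ins, and the CUDA-specific steps (context binding, synchronize, trim) appear only as comments. It shows that once the worker has been joined, the caller holds the last reference, so the final drop (and therefore the actual free) runs on the calling thread.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;

/// Hypothetical stand-in for a device buffer; sets a flag when it is really freed.
struct FakeBuffer {
    freed: Arc<AtomicBool>,
}

impl Drop for FakeBuffer {
    fn drop(&mut self) {
        self.freed.store(true, Ordering::SeqCst);
    }
}

/// Hypothetical stand-in for RebootState: shares buffer ownership with the worker.
struct FakeRebootState {
    weights: Arc<FakeBuffer>,
}

/// Spawn a worker that holds its own reference until it is told to stop.
fn spawn_worker(weights: Arc<FakeBuffer>) -> (thread::JoinHandle<()>, mpsc::Sender<()>) {
    let (stop_tx, stop_rx) = mpsc::channel::<()>();
    let handle = thread::spawn(move || {
        let _held = weights; // the worker's own reference to the buffer
        let _ = stop_rx.recv(); // block until asked to exit
    });
    (handle, stop_tx)
}

/// Unload in the order described above. Returns true if the buffer was freed.
fn unload(
    state: FakeRebootState,
    worker: thread::JoinHandle<()>,
    stop: mpsc::Sender<()>,
    freed: &AtomicBool,
) -> bool {
    stop.send(()).expect("worker should still be listening");
    // Step 1: wait until the worker has exited and released its reference.
    worker.join().expect("worker panicked");
    // Step 2 (real code only): bind the CUDA context to this thread here.
    // The state now holds the last reference, so this drop performs the free
    // on the current thread.
    drop(state);
    // Step 3 (real code only): synchronize the device, then trim the pool.
    freed.load(Ordering::SeqCst)
}
```

If the drop ran before the join, the worker's reference would keep the buffer alive and the flag would still be false when `unload` returned.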