
Address possible memory leak during model sleep/unload#2030

Open
glaziermag wants to merge 1 commit into EricLBuehler:master from glaziermag:fix-memory-leak-unload-v4

Conversation

Contributor

@glaziermag glaziermag commented Mar 25, 2026

Tentatively addresses #545.

Problem

After calling /v1/sleep (model unload), VRAM was not fully returned to the OS. The likely cause: tensors held in RebootState were dropped on the HTTP handler thread, which did not have the CUDA context bound to it. cuMemFreeAsync requires a current context on the calling thread; without one, deallocations fail silently at the driver level and the memory pool retains the allocations.

Fix

Three steps, in order:

1. Join the engine worker thread before dropping RebootState, ensuring the async engine has fully exited and released its own tensor references before the HTTP thread attempts the drop.

2. Bind the CUDA context to the HTTP thread (dev.cuda_stream().context().bind_to_thread()) before drop(reboot_state). This ensures cuMemFreeAsync has a valid context to execute against.

3. After the drop, call device.synchronize() to flush any in-flight async frees, then trim the CUDA default memory pool to its currently-used watermark to release the idle reserve back to the OS.

Pool trim approach

cuMemPoolTrimTo(pool, 0) (trim everything) is avoided because, in a multi-model server, the default memory pool is shared: trimming to zero would evict blocks held idle by other still-active models, forcing expensive OS reallocations. Instead, the trim queries CU_MEMPOOL_ATTR_USED_MEM_CURRENT to get the active-use watermark and trims only to that value. If the query fails (e.g. on an old driver), the trim is skipped entirely.

Clean branch

The original branch (fix-memory-leak-unload-v4) was based on the fork's master rather than origin/master, so its diff includes ~200 unrelated upstream files. A clean branch containing only this commit has been pushed as fix-memory-leak-unload-clean on the same fork.

Files changed

  • mistralrs-core/src/lib.rs
  • mistralrs-server-core/src/handlers.rs
  • mistralrs-server-core/src/mistralrs_server_router_builder.rs

@glaziermag glaziermag marked this pull request as ready for review March 25, 2026 23:00
@glaziermag glaziermag changed the title Draft: Address possible memory leak during model sleep/unload Address possible memory leak during model sleep/unload Mar 26, 2026
@glaziermag
Contributor Author

Update (2026-04-15): The original branch (fix-memory-leak-unload-v4) is rebased on the fork's master, not upstream, so the diff shows ~200 unrelated files. A clean isolated branch (fix-memory-leak-unload-clean) has been pushed that cherry-picks only the single commit onto current origin/master.

Additionally, cuMemPoolTrimTo(pool, 0) has been replaced with a safer alternative:

Old (aggressive):

sys::cuMemPoolTrimTo(pool, 0);  // releases ALL cached capacity — also evicts other models' idle blocks

New (targeted):

// Query the currently-used bytes first (exact sys-binding paths may vary)
let mut used_bytes: u64 = 0;
let attr_ok = sys::cuMemPoolGetAttribute(
    pool,
    sys::CUmemPool_attribute::CU_MEMPOOL_ATTR_USED_MEM_CURRENT,
    &mut used_bytes as *mut u64 as *mut std::ffi::c_void,
);
if attr_ok == sys::CUresult::CUDA_SUCCESS {
    sys::cuMemPoolTrimTo(pool, used_bytes as usize);  // releases idle reserve only
}
// If the attribute query fails, skip the trim — the OS reclaims via pool refcount eviction

Trimming to 0 in a multi-model server evicts blocks that other still-active models may be about to reuse, forcing expensive re-allocations from the OS and potential OOM. Trimming to USED_MEM_CURRENT releases only the idle over-reserve from the unloaded model without touching live allocations.

