feat: support for memory-mapping model weights #1414
wbruna wants to merge 7 commits into leejet:master
Conversation
force-pushed from 97190f6 to 776fea2
Instead of disabling mmap, we make the mapping writable.
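A minimal sketch of what a writable private mapping looks like on POSIX systems (illustrative only, not the PR's actual code): with MAP_PRIVATE plus PROT_WRITE, in-place LoRA patches trigger copy-on-write on just the pages they touch, while unmodified weights keep sharing the kernel's page cache.

```cpp
// Illustrative sketch; function name is an assumption, not the PR's code.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

static void* map_model_writable(const char* path, size_t* out_size) {
    int fd = open(path, O_RDONLY);  // the file itself stays read-only on disk
    if (fd < 0) {
        return nullptr;
    }
    struct stat st {};
    if (fstat(fd, &st) != 0 || st.st_size == 0) {  // reject zero-sized files
        close(fd);
        return nullptr;
    }
    *out_size  = static_cast<size_t>(st.st_size);
    // MAP_PRIVATE: writes are copy-on-write, never written back to the file.
    void* addr = mmap(nullptr, *out_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping keeps its own reference to the file
    return addr == MAP_FAILED ? nullptr : addr;
}
```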
Without an explicit posix_fadvise(POSIX_FADV_DONTNEED), the Linux kernel keeps a model file's pages cached as buff/cache long after we're done with it, so loading the LLM (13.7 GB) followed by the DiT (17 GB) piles up to 30+ GB of cached pages on a 32 GB box and triggers the OOM-killer.

- Keep the file descriptor alive in MmapWrapperImpl so we can posix_fadvise(POSIX_FADV_DONTNEED) on it before munmap; madvise alone only releases the mapped address range, it does not evict the file's pages from the page cache (see the sketch below).
- Add POSIX_FADV_SEQUENTIAL on open: nudges the kernel toward a smaller working set during the read.
- Make the "using mmap" log line INFO instead of DEBUG so the user can confirm at a glance.
- Bound the lazy-load worker count to 2: the per-thread staging buffers grow to the largest tensor seen, so n_threads=8 doubles the RAM peak for no measurable gain in read throughput.

Result on a 32 GB box: peak RSS ~6 GB, peak buff/cache ~12 GB during the LLM lazy load, comfortably within budget.
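A condensed sketch of the lifetime described in the first bullet, assuming a Linux/POSIX target; the struct and member names are illustrative stand-ins for the PR's MmapWrapperImpl, not its actual code:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

struct MmapSketch {
    int    fd   = -1;
    void*  addr = MAP_FAILED;
    size_t size = 0;

    bool open_and_map(const char* path) {
        fd = open(path, O_RDONLY);
        if (fd < 0) {
            return false;
        }
        struct stat st {};
        if (fstat(fd, &st) != 0 || st.st_size == 0) {  // zero-sized files are rejected
            return false;
        }
        size = static_cast<size_t>(st.st_size);
        // Hint a sequential scan so the kernel keeps a smaller read-ahead window.
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        addr = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
        // The fd is intentionally kept open: it is needed for the DONTNEED hint below.
        return addr != MAP_FAILED;
    }

    ~MmapSketch() {
        if (addr != MAP_FAILED) {
            // munmap alone leaves the file's pages in buff/cache; asking the kernel
            // to drop them via the still-open fd is what actually frees the memory.
            posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
            munmap(addr, size);
        }
        if (fd >= 0) {
            close(fd);
        }
    }
};
```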
- drop superfluous validity tests from the mmap handler destructor, since by design they are always valid on the manager object
- check against zero-sized files
- control read-ahead and discard hints through an environment variable (see the sketch below): on my own system, with a warm cache, all these flags actually hurt performance for common sd-cli runs (~10-20% worse loading times), so they should probably be enabled on a case-by-case basis
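A minimal sketch of the opt-in gating described in the last bullet; the environment variable name used here is purely illustrative, the actual name is defined in the PR diff:

```cpp
// Illustrative only: SD_MMAP_FADVISE is a hypothetical stand-in for the
// variable the PR actually reads.
#include <cstdlib>
#include <cstring>
#include <fcntl.h>

static bool mmap_hints_enabled() {
    const char* v = std::getenv("SD_MMAP_FADVISE");
    return v != nullptr && std::strcmp(v, "0") != 0;  // opt-in: unset means disabled
}

static void maybe_hint_sequential(int fd) {
    if (mmap_hints_enabled()) {
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    }
}
```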
@pwilkin, I've cherry-picked b8d1c99 here to make it easier to test mmap behavior. I'm not sure why, but the performance flags made loading times consistently worse for me, so I've made them opt-in through an env var. For consistency, and because consecutive sd-cli runs would also benefit from a cached model, I've made the cache eviction opt-in too; but I don't feel strongly about it.
Hi @wbruna, thanks for this PR; I've been running a merged build (master + this branch) for image generation/edit workloads. Hit a consistent failure with Qwen-Image GGUF models: sd-server enters its listen state, but several components fail to allocate their params buffers during model load.

Root cause

When all tensors in the context are already allocated (which is exactly what happens with memory-mapped weights), ggml_backend_alloc_ctx_tensors returns NULL:

```c
// ggml/src/ggml-alloc.c L1210-1215
if (n_buffers == 0) {
#ifndef NDEBUG
GGML_LOG_DEBUG("%s: all tensors in the context are already allocated\n", __func__);
#endif
GGML_ASSERT(!buffers);
return NULL;
}
```

But alloc_params_buffer() treats that NULL return as a hard allocation failure. This is consistent with the failing components in the log above.

Proposed fix

Add a check before the failure path: if all tensors in the context already have data (or are views), the NULL return simply means no separate buffer was needed, so report success instead of failing.

```cpp
bool alloc_params_buffer() {
size_t num_tensors = ggml_tensor_num(params_ctx);
params_buffer = ggml_backend_alloc_ctx_tensors(params_ctx, params_backend);
// mmap-aware path: ggml returns NULL when all tensors are already allocated
// (typical for memory-mapped weights). See ggml-alloc.c n_buffers==0 branch.
if (params_buffer == nullptr && num_tensors > 0) {
bool all_have_data = true;
for (ggml_tensor * t = ggml_get_first_tensor(params_ctx); t != nullptr; t = ggml_get_next_tensor(params_ctx, t)) {
if (t->data == nullptr && t->view_src == nullptr) {
all_have_data = false;
break;
}
}
if (all_have_data) {
LOG_DEBUG("%s all params already mmap-allocated (no separate buffer needed)", get_desc().c_str());
rebuild_params_tensor_set();
return true;
}
}
if (params_buffer == nullptr) {
LOG_ERROR("%s alloc params backend buffer failed, num_tensors = %i",
get_desc().c_str(), num_tensors);
return false;
}
rebuild_params_tensor_set();
ggml_backend_buffer_set_usage(params_buffer, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);
// ... rest unchanged
}
```
Verification
Happy to open a separate PR if you'd prefer, or you can incorporate it directly. The underlying ggml-alloc behavior is backend-agnostic, so I expect this generalizes to CUDA/Metal as well; confirmation from users on those backends would be welcome.
As a follow-up to #1059, this adds support for pointing tensor storage buffers directly into memory-mapped model files.
Apart from the expected limitations (e.g. weight types need to match), for now a lot of stars need to be properly aligned:

- only enabled for 100% CPU backends, to avoid the complexity of tracking backend information per tensor; so e.g. `--clip-on-cpu` won't benefit from it. On the other hand, it does work with `--offload-to-cpu`
- only enabled if LoRA apply mode is `at_runtime` (even if no LoRAs are loaded). I've reused the I/O mmap support, which is read-only, so it needs to avoid trying to modify the mapped weights in place.

Edit: added device compatibility detection in the same way as llama.cpp, and per-tensor tracking; so all compatible devices should be supported, including with `--clip-on-cpu` and `--vae-on-cpu`.

Edit 2: for LoRA apply mode `immediately`, make the mapping writable. With certain LoRAs, the weight patching may cancel most of the mmap savings, but it will still work for some of the unchanged tensors (note: working fine on Linux, but I couldn't test it on Windows).

The existing mmap support on the I/O path isn't affected.
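As a rough illustration of the core idea (a sketch under assumptions, not the PR's implementation): on a CPU backend, tensor metadata can be created in a no_alloc ggml context and the data pointer aimed at an offset inside the mapped file, so weights are paged in on demand instead of being copied into a separately allocated buffer. The helper names and the fixed F32/1-D shape are illustrative; the real loader derives type, shape, and offset from the GGUF metadata.

```cpp
// Illustrative sketch only; helper names and fixed type/shape are assumptions.
#include "ggml.h"
#include <cstddef>
#include <cstdint>

// Create a context that allocates tensor headers but not tensor data,
// since the data will live inside the memory-mapped model file.
static ggml_context* make_noalloc_ctx() {
    ggml_init_params params{};
    params.mem_size   = 16u * 1024 * 1024;  // room for metadata only
    params.mem_buffer = nullptr;
    params.no_alloc   = true;
    return ggml_init(params);
}

// Point a tensor's storage at file_offset inside the mapping: no copy is made,
// and the kernel faults the pages in as the weights are first read.
static ggml_tensor* point_tensor_into_mapping(ggml_context* ctx, uint8_t* map_base,
                                              size_t file_offset, int64_t n_elements) {
    ggml_tensor* t = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_elements);
    t->data        = map_base + file_offset;  // must stay read-only unless the mapping is writable
    return t;
}
```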