
UPSTREAM PR #16653: llama: automatically set parameters not set by the user in such a way that maximizes GPU utilization#146

Closed
DajanaV wants to merge 5 commits into main from upstream-PR16653-branch_JohannesGaessler-llama-memory-fit-9

Conversation


DajanaV (Collaborator) commented Nov 9, 2025

Mirrored from ggml-org/llama.cpp#16653

This PR adds automation for setting parameters in such a way that memory utilization is maximized when the full model cannot fit. The short version is that the code first tries reducing the context size and then starts moving weights from device memory to system memory. For MoE models, dense weights are prioritized for allocation in device memory since system memory is usually slower than device memory. Example log snippet:

llama_params_fit: projected memory use with initial parameters [MiB]:
llama_params_fit:   - ROCm0 (AMD Radeon Graphics)     :  16304 total,  39959 used,  24341 deficit
llama_params_fit:   - ROCm1 (AMD Radeon RX 6800)      :  16368 total,  42480 used,  26296 deficit
llama_params_fit:   - ROCm2 (AMD Instinct MI60 / MI50):  32752 total,  76200 used,  43626 deficit
llama_params_fit: projected to use 158641 MiB of device memory vs. a total of 65424 MiB
llama_params_fit: cannot fulfill margin of 1024 MiB on all devices, need to use 97337 MiB less in total
llama_params_fit: context size reduced from 65536 to 4096 -> need 13440 MiB less memory
llama_params_fit: with only dense weights in device memory there is a total surplus of 53432 MiB
llama_params_fit: set to use 35 dense-only and 22 full GPU layers in total, projected memory use:
llama_params_fit:   - ROCm0 (AMD Radeon Graphics)     :  0 dense-only layers,  5 full layers,  14373 MiB used,   1244 MiB free
llama_params_fit:   - ROCm1 (AMD Radeon RX 6800)      : 13 dense-only layers,  5 full layers,  14354 MiB used,   1829 MiB free
llama_params_fit:   - ROCm2 (AMD Instinct MI60 / MI50): 22 dense-only layers, 12 full layers,  30917 MiB used,   1656 MiB free

User Interface

  • The llama C API has a new function llama_params_fit that adjusts the provided llama_model_params and llama_context_params such that, when they are used to create a corresponding llama_model and llama_context, the program will not run out of memory.
  • llama_model_params has a new flag no_alloc, false by default; if set to true, the resulting llama_model and llama_context contain only metadata.
  • New CLI argument --fit [on|off] to control whether parameters should be fit to free device memory; enabled by default. The overall intent is to start with optimistic defaults that would require a large amount of resources and to then cut down on use if insufficient resources are available.
  • New CLI argument --fit-ctx to control the minimum context size that can be set by the code in order to reduce memory use, defaults to 4096.
  • New CLI argument --fit-margin to set the margin in free MiB per device that should be left over after allocation, defaults to 1024 MiB.
  • The default context size is now 0, meaning the model's maximum context size is used by default.
  • If the context size is set manually it is not changed.
  • If the number of GPU layers, a tensor split, or tensor buft overrides are set, then the way tensors are allocated is not changed.
  • The log output of the dummy models and contexts is not shown unless the --verbose flag is set.

Implementation Details

  • No actual device memory is allocated when determining memory limits. Instead, the new no_alloc flag is used to create dummy models and contexts from which the optimal parameters can be determined. This makes use of the recently added memory_breakdown methods, which have been extended to handle dummy allocations.
  • The overhead in the simplest case, where the initial parameters do not need to be changed, is ~0.1 s for the creation of a single dummy model and context (determined by how the runtime changes with --fit on vs. --fit off). At most 6 dummy models and contexts will be created by the function, in the case of loading a MoE model where only the dense layers fit into memory. I think most of the overhead comes from loading the vocabulary. Initially I intended to skip loading the vocabulary entirely, but that seems to cause issues when subsequently trying to construct the compute graph. I'm not sure how to proceed with this: on the one hand it would be nice to reduce the overhead if possible, but on the other hand one could possibly unify the vocab_only and no_alloc flags for a simpler interface.
  • When creating dummy objects the log is temporarily filtered to avoid spamming the console. By default only error messages are shown, because some models produce a large number of warnings that render the prints about memory use effectively unreadable (warnings and below are moved to the debug log). I am concerned that if the program were to crash during the creation of the dummy objects, the default console output would be less useful for diagnosing the issue. My impression, though, is that nowadays crashes in llama.cpp itself are relatively rare, so maybe that's acceptable. In any case, we should adjust the issue template to instruct users to always provide a --verbose log.
  • Due to the temporary change in logging the function llama_params_fit is not thread safe. I don't have a good understanding of the current state of thread safety for the llama C API so I would appreciate guidance regarding how much of an issue this is.
  • For the tensor split and the tensor buft overrides, one needs to allocate memory and pass a pointer when creating a model via the llama C API. In llama_params_fit I've handled this by requiring the user to pass such pointers to the function; otherwise those properties cannot be modified. I think this is preferable to allocating memory in the function itself. I've considered modifying the data pointed to by e.g. model_params::tensor_split directly, but given the risk of a segfault I think it's preferable to be explicit and have the user provide buffers.
  • llama_context now tracks how much memory should be allocated at most for the compute graph over its lifetime (I'm using this to determine projected memory use). On destruction of llama_context the size of the actually allocated buffers is compared to the expectation and a warning is issued if it was exceeded.

Backend Changes

  • The ggml API has been extended with a new function ggml_log_get to retrieve the current state of the logger.
  • The ggml backend API has been extended with new functions which return the amount of memory that would be allocated without actually doing any allocations.
  • The ggml backend scheduler no longer tries to allocate tensors for which a buffer but no data has already been set. This enables creating a dummy buffer and setting it for the weight and KV cache tensors to prevent them from being allocated for the compute graph (or being considered for allocation when trying to determine how much memory would need to be allocated for the compute graph).

DajanaV force-pushed the main branch 24 times, most recently from 4d32c36 to ade99be on November 12, 2025.
DajanaV force-pushed the main branch 10 times, most recently from 24733fb to 4b4bb7c on November 13, 2025.
DajanaV closed this Nov 13, 2025.
