Mirrored from ggml-org/llama.cpp#16653
This PR adds automation for setting parameters in such a way that memory utilization is maximized when the full model cannot fit. The short version is that the code first tries reducing the context size and then starts moving weights from device memory to system memory. For MoE models, dense weights are prioritized for allocation in device memory, since system memory is usually slower than device memory.
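The prioritization described above (shrink the context first, then fill device memory with dense weights before expert weights) can be pictured as a greedy budget pass. The sketch below is illustrative only: the types, the halving strategy, and all numbers are made up for the example and do not come from the actual implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy stand-in for a model weight; "dense" marks non-expert weights,
// which should be kept in device memory with higher priority.
struct weight_t {
    int64_t size;
    bool    dense;
    bool    on_device = false;
};

// Greedy fitting sketch: reduce the context size down to ctx_min first,
// then assign weights to device memory, dense weights before expert weights.
// Returns the chosen context size; weights are flagged on_device in place.
int64_t fit_params(std::vector<weight_t> & weights, int64_t free_device,
                   int64_t ctx_max, int64_t ctx_min, int64_t bytes_per_ctx) {
    int64_t total = 0;
    for (const weight_t & w : weights) {
        total += w.size;
    }

    // keep the full context if possible, otherwise reduce it (here: by halving)
    int64_t n_ctx = ctx_max;
    while (n_ctx > ctx_min && total + n_ctx*bytes_per_ctx > free_device) {
        n_ctx /= 2;
    }
    if (n_ctx < ctx_min) {
        n_ctx = ctx_min;
    }

    int64_t budget = free_device - n_ctx*bytes_per_ctx;

    // two passes: dense weights first, then expert weights, while they fit
    for (int pass = 0; pass < 2; pass++) {
        const bool want_dense = pass == 0;
        for (weight_t & w : weights) {
            if (w.dense == want_dense && !w.on_device && w.size <= budget) {
                w.on_device = true;
                budget     -= w.size;
            }
        }
    }
    return n_ctx;
}
```

A real implementation would additionally respect per-device margins (`--fit-margin`) and distribute the budget across multiple devices.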
User Interface
- New function `llama_params_fit` that adjusts the provided `llama_model_params` and `llama_context_params` in such a way that, upon use to create a corresponding `llama_model` and `llama_context`, the program will not run out of memory.
- `llama_model_params` has a new flag `no_alloc` that is false by default but results in a `llama_model` and `llama_context` with only metadata if set to true.
- New CLI argument `--fit [on|off]` to control whether parameters should be fit to free device memory, enabled by default. The overall intent is to have optimistic defaults that would require a large amount of resources and to then cut down on the use if insufficient resources are available.
- New CLI argument `--fit-ctx` to control the minimum context size that can be set by the code in order to reduce memory use, defaults to 4096.
- New CLI argument `--fit-margin` to set the margin in free MiB per device that should be left over after allocation, defaults to 1024 MiB.
- Additional information is logged if the `--verbose` flag is set.

Implementation Details
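One point discussed in this section is that the user must pass pointers to the function for the corresponding properties to be adjustable. The stub below is not the real `llama_params_fit` signature (which this description does not spell out); it only illustrates why a buffer-backed property like `tensor_split` can only be modified when the caller supplies that buffer. All types and values here are toy placeholders.

```cpp
#include <cassert>
#include <cstddef>

// Toy stand-ins for the real llama.cpp structs; only the fields relevant
// to the pointer-passing convention are sketched.
struct toy_model_params {
    bool    no_alloc     = false;
    float * tensor_split = nullptr; // caller-owned buffer, may be null
};
struct toy_context_params {
    int n_ctx = 0;
};

// Toy stand-in for llama_params_fit: results are written through the
// caller-provided pointers instead of allocating memory in the function
// itself, so tensor_split is only adjusted if the caller supplied a buffer.
void toy_params_fit(toy_model_params * mparams, toy_context_params * cparams,
                    size_t n_devices) {
    cparams->n_ctx = 4096; // e.g. reduced towards the --fit-ctx minimum
    if (mparams->tensor_split != nullptr) {
        for (size_t i = 0; i < n_devices; i++) {
            mparams->tensor_split[i] = 1.0f / n_devices; // illustrative split
        }
    }
}
```

If no buffer is provided, the function can still adjust the plain-value fields but must leave `tensor_split` untouched, which matches the trade-off described below.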
- The `no_alloc` flag is used to create dummy models and contexts from which the optimal parameters can be determined. This makes use of the recently added `memory_breakdown` methods, which have been extended to handle dummy allocations.
- Regarding overhead (`--fit on` vs. `--fit off`): at most 6 dummy models and contexts will be created by the function when loading a MoE model where only the dense layers fit into memory. Most of the overhead comes, I think, from loading the vocabulary. Initially I intended to skip loading the vocabulary entirely, but that seems to cause issues when then trying to construct the compute graph. I'm not sure how to proceed with this: on the one hand it would be nice to reduce the overhead if possible, but on the other hand one could possibly unify the `vocab_only` and `no_alloc` flags for a simpler interface.
- More detail is printed in the `--verbose` log.
- `llama_params_fit` is not thread safe. I don't have a good understanding of the current state of thread safety for the llama C API, so I would appreciate guidance regarding how much of an issue this is.
- A consequence of the interface of `llama_params_fit` is that the user needs to pass pointers (e.g. for `tensor_split`) to the function, or else those properties cannot be modified. I think this is preferable over allocating memory in the function itself. I've considered modifying the data pointed at by e.g. `model_params::tensor_split` directly, but given the risk of a segfault I think it's preferable to be explicit, with the user having to provide buffers.
- `llama_context` now tracks how much memory should be allocated at most for the compute graph over its lifetime (I'm using this to determine projected memory use). On destruction of the `llama_context`, the size of the actually allocated buffers is compared to the expectation and a warning is issued if it was exceeded.

Backend Changes
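A minimal illustration of the buffer/data distinction used for dummy allocations in this section; `toy_tensor` is a deliberately simplified stand-in, not the actual ggml tensor struct, and `graph_bytes_needed` is a made-up helper for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Simplified stand-in for a ggml tensor: under the change described here,
// a tensor counts as allocated once `buffer` is set, even while `data`
// is still null (a dummy buffer has no backing data).
struct toy_tensor {
    void * data   = nullptr;
    void * buffer = nullptr;
};

bool is_allocated(const toy_tensor & t) {
    return t.buffer != nullptr; // data may legitimately still be null
}

// When estimating compute-graph memory, tensors that already carry a
// (possibly dummy) buffer -- weights, KV cache -- are skipped; only the
// remaining tensors would need newly allocated memory.
size_t graph_bytes_needed(const std::vector<std::pair<const toy_tensor *, size_t>> & nodes) {
    size_t need = 0;
    for (const auto & node : nodes) {
        if (!is_allocated(*node.first)) {
            need += node.second;
        }
    }
    return need;
}
```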
- Added `ggml_log_get` to retrieve the current state of the logger.
- A tensor is now treated as allocated if `buffer` but not `data` has already been set. This enables creating a dummy buffer and then setting that dummy buffer for the weight and KV cache tensors to prevent them from being allocated for the compute graph (or being considered for allocation when trying to determine how much memory would need to be allocated for the compute graph).
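Finally, the compute-graph memory tracking mentioned under Implementation Details can be sketched as follows, again with a toy type rather than the real `llama_context`:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Sketch of the lifetime check: the context records the projected upper
// bound for compute-buffer memory and the largest allocation actually
// observed; the destructor warns if the projection was exceeded.
struct toy_context {
    size_t expected_max = 0; // projected maximum compute-buffer size
    size_t actual_max   = 0; // largest compute buffer actually allocated

    void on_compute_alloc(size_t size) {
        if (size > actual_max) {
            actual_max = size;
        }
    }

    bool exceeded() const {
        return actual_max > expected_max;
    }

    ~toy_context() {
        if (exceeded()) {
            fprintf(stderr,
                    "warning: compute buffers used %zu bytes, expected at most %zu\n",
                    actual_max, expected_max);
        }
    }
};
```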