Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX 2000 Ada Generation Laptop GPU, compute capability 8.9, VMM: yes
version: 7539 (83b3b1c)
built with GNU 15.2.0 for Linux x86_64
It looks like when the context size is set to zero (e.g. `-c 0`), it now defaults to 4096 instead of the model's maximum context size.
Operating systems
No response
Which llama.cpp modules do you know to be affected?
No response
Command line
llama-server --port 10004 --host 127.0.0.1 --api-key xxxxx --jinja -fa on -hf mradermacher/Ling-lite-GGUF --cache-type-k q8_0 --cache-type-v q8_0 -c 0

Problem description & steps to reproduce
llama_context: n_ctx = 4096
llama_context: n_ctx_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = true
llama_context: freq_base = 600000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (16384) -- the full capacity of the model will not be utilized
The context size ends up as 4096 instead of the model's trained 16384.
First Bad Commit
No response