Misc. bug: Maximum Context Size #18376

@AlphaMo99

Description

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX 2000 Ada Generation Laptop GPU, compute capability 8.9, VMM: yes
version: 7539 (83b3b1c)
built with GNU 15.2.0 for Linux x86_64

It looks like when the context size is set to zero (e.g. -c 0), it now defaults to 4096 instead of the model's maximum context size.

Operating systems

No response

Which llama.cpp modules do you know to be affected?

No response

Command line

llama-server --port 10004 --host 127.0.0.1  --api-key xxxxx --jinja -fa on -hf mradermacher/Ling-lite-GGUF --cache-type-k q8_0 --cache-type-v q8_0 -c 0
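
As a workaround until this is resolved, the training context can be passed explicitly instead of relying on -c 0. This is a sketch based on the command above; 16384 is the n_ctx_train value reported in the log below, and the api-key value is redacted as in the original:

```shell
# Same invocation, but with the model's training context given explicitly
# instead of -c 0 (which currently resolves to 4096).
llama-server --port 10004 --host 127.0.0.1 --api-key xxxxx --jinja -fa on \
  -hf mradermacher/Ling-lite-GGUF \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -c 16384
```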

Problem description & steps to reproduce

llama_context: n_ctx = 4096
llama_context: n_ctx_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = true
llama_context: freq_base = 600000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (16384) -- the full capacity of the model will not be utilized

The server is using a context size of 4096 instead of the model's training context of 16384.

First Bad Commit

No response

Relevant log output
