Description
Feature request description
Mistral released Devstral 2 last month. I'd like to run it with ramalama but I can't get it working with ramalama version 0.16.0.
Attempting to use a quantised version in GGUF format gives:
$ ramalama serve hf://unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF
...
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'mistral3'
llama_model_load_from_file_impl: failed to load model
Suggest potential solution
It looks to me like we would need to update llama.cpp to at least release b7371, which appears to include the PR adding support for the 'mistral3' architecture.
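As a quick way to check which llama.cpp build a given ramalama image ships, something like this should work (a sketch, assuming `llama-server` is on the image's PATH; the tag matches the image ramalama selected in the debug log below):

```shell
# Print the llama.cpp build baked into the ramalama CUDA image.
# The reported build number would need to be >= 7371 for 'mistral3' support.
podman run --rm quay.io/ramalama/cuda:0.16 llama-server --version
```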
Have you considered any alternatives?
I initially tried running Mistral's unquantised (non-GGUF) release from Hugging Face, but this fails:
$ ramalama serve hf://mistralai/Devstral-Small-2-24B-Instruct-2512
...
main: loading model
srv load_model: loading model '/mnt/models'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA RTX PRO 4500 Blackwell) (0000:e1:00.0) - 29042 MiB free
gguf_init_from_file_impl: failed to read magic
llama_model_load: error loading model: llama_model_loader: failed to load model from /mnt/models
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/mnt/models', try reducing --n-gpu-layers if you're running out of VRAM
srv load_model: failed to load model, '/mnt/models'
srv operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
I presume this is expected, since llama.cpp requires the GGUF format, but it wasn't apparent from the ramalama documentation: the section on transports gives the impression that any HF URI would work, and I was hoping ramalama might be able to use vLLM in this case.
Indeed, Mistral recommends using vLLM. I tried selecting this runtime explicitly, but I get a different error:
$ ramalama --runtime=vllm --debug serve hf://mistralai/Devstral-Small-2-24B-Instruct-2512
2026-01-06 14:12:35 - DEBUG - Checking if 8080 is available
2026-01-06 14:12:35 - DEBUG - run_cmd: nvidia-smi
2026-01-06 14:12:35 - DEBUG - Working directory: None
2026-01-06 14:12:35 - DEBUG - Ignore stderr: False
2026-01-06 14:12:35 - DEBUG - Ignore all: False
2026-01-06 14:12:35 - DEBUG - env: None
2026-01-06 14:12:35 - DEBUG - Command finished with return code: 0
2026-01-06 14:12:35 - DEBUG - run_cmd: podman inspect quay.io/ramalama/cuda:0.16
2026-01-06 14:12:35 - DEBUG - Working directory: None
2026-01-06 14:12:35 - DEBUG - Ignore stderr: False
2026-01-06 14:12:35 - DEBUG - Ignore all: True
2026-01-06 14:12:35 - DEBUG - env: None
2026-01-06 14:12:35 - DEBUG - run_cmd: nvidia-smi
2026-01-06 14:12:35 - DEBUG - Working directory: None
2026-01-06 14:12:35 - DEBUG - Ignore stderr: False
2026-01-06 14:12:35 - DEBUG - Ignore all: False
2026-01-06 14:12:35 - DEBUG - env: None
2026-01-06 14:12:35 - DEBUG - Command finished with return code: 0
2026-01-06 14:12:35 - DEBUG - run_cmd: podman inspect quay.io/ramalama/cuda:0.16
2026-01-06 14:12:35 - DEBUG - Working directory: None
2026-01-06 14:12:35 - DEBUG - Ignore stderr: False
2026-01-06 14:12:35 - DEBUG - Ignore all: True
2026-01-06 14:12:35 - DEBUG - env: None
2026-01-06 14:12:35 - DEBUG - exec_cmd: podman run --rm --label ai.ramalama.model=hf://mistralai/Devstral-Small-2-24B-Instruct-2512 --label ai.ramalama.engine=podman --label ai.ramalama.runtime=vllm --label ai.ramalama.port=8080 --label ai.ramalama.command=serve --runtime /usr/bin/nvidia-container-runtime --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --pull newer --device /dev/dri --device /dev/kfd --device nvidia.com/gpu=all -e CUDA_VISIBLE_DEVICES=0 -p 8080:8080 --label ai.ramalama --name ramalama-IKWCeY3WQm --env=HOME=/tmp --init --mount=type=bind,src=/var/home/robin/.local/share/ramalama/store/huggingface/mistralai/Devstral-Small-2-24B-Instruct-2512/blobs/sha256-88664799f03f24dde112fe0005bb4529abf2198d,destination=/mnt/models/config.json,ro --mount=type=bind,src=/var/home/robin/.local/share/ramalama/store/huggingface/mistralai/Devstral-Small-2-24B-Instruct-2512/blobs/sha256-ce24962642faa5680cd421e65d94a4d67c905433,destination=/mnt/models/VIBE_SYSTEM_PROMPT.txt,ro --mount=type=bind,src=/var/home/robin/.local/share/ramalama/store/huggingface/mistralai/Devstral-Small-2-24B-Instruct-2512/blobs/sha256-1392a509a427ef57d8ba43608925e55b424cf2aa,destination=/mnt/models/CHAT_SYSTEM_PROMPT.txt,ro --mount=type=bind,src=/var/home/robin/.local/share/ramalama/store/huggingface/mistralai/Devstral-Small-2-24B-Instruct-2512/blobs/sha256-01c8776b5b3496af72e92a53a3bf92e113f66f2c,destination=/mnt/models/chat_template.jinja,ro --mount=type=bind,src=/var/home/robin/.local/share/ramalama/store/huggingface/mistralai/Devstral-Small-2-24B-Instruct-2512/blobs/sha256-54437e69136d9f46c140dd9cec6162e1bb87bc44,destination=/mnt/models/consolidated.safetensors.index.json,ro --mount=type=bind,src=/var/home/robin/.local/share/ramalama/store/huggingface/mistralai/Devstral-Small-2-24B-Instruct-2512/blobs/sha256-319bf12a84cdcdc5445cc039d4f3d0ef20ab4f9a,destination=/mnt/models/generation_config.json,ro 
--mount=type=bind,src=/var/home/robin/.local/share/ramalama/store/huggingface/mistralai/Devstral-Small-2-24B-Instruct-2512/blobs/sha256-8efdf4d1c2425a2a7956bf43ae343f44a825a90a87e341ff02f708da2923a0b1,destination=/mnt/models/model-00006-of-00006.safetensors,ro --mount=type=bind,src=/var/home/robin/.local/share/ramalama/store/huggingface/mistralai/Devstral-Small-2-24B-Instruct-2512/blobs/sha256-8180612f9e5a296d012b5e11bec7d5cca4606ce0,destination=/mnt/models/model.safetensors.index.json,ro --mount=type=bind,src=/var/home/robin/.local/share/ramalama/store/huggingface/mistralai/Devstral-Small-2-24B-Instruct-2512/blobs/sha256-b96ca6fc9cf937078113af615ddc15c89ff0f4d3,destination=/mnt/models/params.json,ro --mount=type=bind,src=/var/home/robin/.local/share/ramalama/store/huggingface/mistralai/Devstral-Small-2-24B-Instruct-2512/blobs/sha256-a37d728b12fd27ac60a437894bd51de83449bf30,destination=/mnt/models/processor_config.json,ro --mount=type=bind,src=/var/home/robin/.local/share/ramalama/store/huggingface/mistralai/Devstral-Small-2-24B-Instruct-2512/blobs/sha256-380035da60c7cc474cb7358888a1c50c70679bb3fb7f70870c2400f93ac51d70,destination=/mnt/models/model-00001-of-00006.safetensors,ro --mount=type=bind,src=/var/home/robin/.local/share/ramalama/store/huggingface/mistralai/Devstral-Small-2-24B-Instruct-2512/blobs/sha256-286acad9b0e27fce778ac429763536accf618ccb6ed72963b6f94685e531c5c7,destination=/mnt/models/tokenizer.json,ro --mount=type=bind,src=/var/home/robin/.local/share/ramalama/store/huggingface/mistralai/Devstral-Small-2-24B-Instruct-2512/blobs/sha256-a7843c180f2b39d43303e7eba55d2e34fd600a8f,destination=/mnt/models/tokenizer_config.json,ro --mount=type=bind,src=/var/home/robin/.local/share/ramalama/store/huggingface/mistralai/Devstral-Small-2-24B-Instruct-2512/blobs/sha256-e29d19ea32eb7e26e6c0572d57cb7f9eca0f4420e0e0fe6ae1cf3be94da1c0d6,destination=/mnt/models/tekken.json,ro 
--mount=type=bind,src=/var/home/robin/.local/share/ramalama/store/huggingface/mistralai/Devstral-Small-2-24B-Instruct-2512/blobs/sha256-2caed6d3fb5af9c97b8c70e1424a9e517454e01451332834fba4fdb4e7a18280,destination=/mnt/models/model-00002-of-00006.safetensors,ro --mount=type=bind,src=/var/home/robin/.local/share/ramalama/store/huggingface/mistralai/Devstral-Small-2-24B-Instruct-2512/blobs/sha256-63c422f7a5c1460967068c0ceff65eb31f136f64872e281841313e8c669e7c50,destination=/mnt/models/model-00004-of-00006.safetensors,ro --mount=type=bind,src=/var/home/robin/.local/share/ramalama/store/huggingface/mistralai/Devstral-Small-2-24B-Instruct-2512/blobs/sha256-ec99fd6a7faf35b43e38e60f531e9ee5d67c4292773d71246038b9eb508e373a,destination=/mnt/models/model-00005-of-00006.safetensors,ro --mount=type=bind,src=/var/home/robin/.local/share/ramalama/store/huggingface/mistralai/Devstral-Small-2-24B-Instruct-2512/blobs/sha256-dfa96c3ccb824ac308eeeaa86fd1ce01aca4e3311e1aaa27a498ec3b7302e165,destination=/mnt/models/consolidated-00001-of-00002.safetensors,ro --mount=type=bind,src=/var/home/robin/.local/share/ramalama/store/huggingface/mistralai/Devstral-Small-2-24B-Instruct-2512/blobs/sha256-e2bab376f49baa1da58c0a737f688cbfe185dc6a994fa2870d62b7c8b36e3360,destination=/mnt/models/model-00003-of-00006.safetensors,ro --mount=type=bind,src=/var/home/robin/.local/share/ramalama/store/huggingface/mistralai/Devstral-Small-2-24B-Instruct-2512/blobs/sha256-b783163b5ee6fb9595fde29d6072e81be8fcc24ea576d09ecc3dc7611ababb97,destination=/mnt/models/consolidated-00002-of-00002.safetensors,ro quay.io/ramalama/cuda:latest "/opt/venv/bin/python3 -m vllm.entrypoints.openai.api_server" --model /mnt/models --max_model_len 2048 --port 8080
ERROR (catatonit:51): failed to exec pid1: No such file or directory
I get this same error for the GGUF URI too.
Presumably this is the same issue as #1948.
I noticed a comment on #1204 suggesting we might need to specify a different image. I found this cuda-vllm/Containerfile, which looked promising, but it doesn't appear to have been published to quay.io yet.
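If such an image does get published, I would expect to be able to point ramalama at it with the global `--image` option, roughly like this (the image name and tag here are my guess, not a published artifact):

```shell
# Hypothetical: select a vLLM-capable CUDA image explicitly,
# assuming one is eventually published under quay.io/ramalama.
ramalama --runtime=vllm --image quay.io/ramalama/cuda-vllm:latest \
    serve hf://mistralai/Devstral-Small-2-24B-Instruct-2512
```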
I also tried running on the host:
$ ramalama --nocontainer serve hf://unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF
Since I installed ramalama with Homebrew (as per the Aurora instructions), this picks up the Homebrew formula for llama.cpp, which is a recent enough build (7640) to load and serve the model successfully. Sadly, that build only supports BLAS/CPU inference (a very slow 2 t/s), which defeats the point of using ramalama to provide CUDA support in the first place (I'm guessing this option is mainly intended for testing).
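A possible workaround might be a host build of llama.cpp with CUDA enabled, following the upstream build docs (sketch below; requires the NVIDIA CUDA toolkit). If that binary ends up first on PATH, `ramalama --nocontainer` should in principle pick it up and give GPU offload, though I haven't verified this:

```shell
# Sketch: build llama.cpp with CUDA support on the host,
# per the upstream build instructions (GGML_CUDA flag).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```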
Additional context
Linux aurora 6.17.8-300.fc43.x86_64