Releases · ggml-org/llama.cpp

14 May 23:24

3e037f3

b9158 Latest

HIP: RDNA3 mma FA, faster AMD transpose, tune AMD (#22880)

Adds RDNA3 support to the CUDA mma FA kernel. To make the RDNA3 tensor cores work with FP16 accumulation for VKQ, the tiles need to be 32 logical units long in the direction of the attention head; for head sizes 80 and 112, which are not evenly divisible by 32, the regular length of 16 with FP32 accumulation is used instead. The longer tiles also enable more efficient transposition for a warp size of 32, which is why they are also used for RDNA4. However, this scrambles the data layout of the accumulators along the attention head dimension. To prevent accidental misuse I added another entry to ggml_cuda_mma::data_layout.
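The tile choice described above can be sketched as a small selection rule. This is only an illustration of the divisibility condition stated in the note; the function name and return format are hypothetical, and the real kernel presumably considers more than the head size.

```python
def pick_tile_config(head_size: int) -> tuple[int, str]:
    """Return (tile length along the attention head, VKQ accumulator type).

    Hypothetical helper: mirrors only the rule stated in the release note.
    """
    if head_size % 32 == 0:
        # Long tiles: required for FP16 accumulation on RDNA3 tensor cores.
        return 32, "fp16"
    # Head sizes such as 80 and 112 fall back to the regular tile length.
    return 16, "fp32"


for head_size in (64, 80, 112, 128):
    print(head_size, pick_tile_config(head_size))
```

Head sizes 80 and 112 leave a remainder of 16 when divided by 32, which is why they take the fallback branch.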

I also tuned the kernel parameters for RDNA3, RDNA4, and CDNA1 in general, during which I discovered that the kernel can be made to work for head sizes up to 256 for CDNA. For RDNA3/4 I was not able to get better performance than the tile kernel for head sizes > 128.

macOS/iOS:

Linux:

Android:

Android arm64 (CPU)

Windows:

openEuler:

Assets 30

cudart-llama-bin-win-cuda-12.4-x64.zip
sha256:8c79a9b226de4b3cacfd1f83d24f962d0773be79f1e7b75c6af4ded7e32ae1d6
373 MB 2026-05-14T23:24:16Z

cudart-llama-bin-win-cuda-13.1-x64.zip
sha256:f96935e7e385e3b2d0189239077c10fe8fd7e95690fea4afec455b1b6c7e3f18
384 MB 2026-05-14T23:24:30Z

llama-b9158-bin-310p-openEuler-aarch64.tar.gz
sha256:8594c7fee3ca4b9a4dc3e0623196cec2981bdb3c12fea778c55c09dc6af8cc86
10.4 MB 2026-05-14T23:24:42Z

llama-b9158-bin-310p-openEuler-x86.tar.gz
sha256:e08517a60aede10e73132746ceb1bac196972b4343c2db914e8157877a80c39f
11 MB 2026-05-14T23:24:43Z

llama-b9158-bin-910b-openEuler-aarch64-aclgraph.tar.gz
sha256:7759e45c771e61743b797ba4b34612c3d73eeb651b06dbbf740511938f4962d9
10.4 MB 2026-05-14T23:24:44Z

llama-b9158-bin-910b-openEuler-x86-aclgraph.tar.gz
sha256:4350c4c83e01327f48fd6a640a6ffaae5ada13567d6e143a0edfbc32d9090c5c
11 MB 2026-05-14T23:24:45Z

llama-b9158-bin-android-arm64.tar.gz
sha256:dff928f3a98d344ecb640ab446799fd8916d446893bbd5e8845766934410c8e9
62.2 MB 2026-05-14T23:24:46Z

llama-b9158-bin-macos-arm64-kleidiai.tar.gz
sha256:bb6250176087e44452da3bbca4295a34f8faa52d95ba91304d64f1c3bee29c4f
8.1 MB 2026-05-14T23:24:48Z

llama-b9158-bin-macos-arm64.tar.gz
sha256:5310a5231f7253f24440f8bbd81711807dc19ad3ab837c97a444d67cfafd80d6
8.08 MB 2026-05-14T23:24:49Z

llama-b9158-bin-macos-x64.tar.gz
sha256:aafb7c7440edc3600e6daf445e5a98979802898288ce0586c2f79f4774c0f9e9
8.13 MB 2026-05-14T23:24:50Z

Source code (zip)
2026-05-14T20:58:58Z

Source code (tar.gz)
2026-05-14T20:58:58Z

14 May 22:42

github-actions

b9156

834a243

b9156

ggml-webgpu: Enable NVIDIA self-hosted CI (#22976)

Enable NVIDIA CI for WebGPU
Address precision issues
fix placement
Relax more set_rows and div
Try relaxing all f16
formatting and naming
Add comment explaining max_nmse_err logic

Added comment referencing pull request for clarification.

14 May 17:22

github-actions

b9151

67b2b7f

b9151

logs : reduce (#23021)

logs : reduce
args : fix envs
server : fix build
common : print verbosity level at start
server : clean-up logs
server : print prompt processing timings + sampling params
minor : whitespaces

14 May 15:46

github-actions

b9150

81b0d88

b9150

ggml-cpu: Add IME2 Instruction Support for the SpacemiT Backend (#22863)

14 May 15:11

github-actions

b9148

42532af

b9148

unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regr… (#22110)

unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regression tests

Add unicode_regex_split_custom_qwen35() to src/unicode.cpp, a non-backtracking handler for Qwen3.5's [\p{L}\p{M}]+ regex (letters + combining marks).
Register the handler in the custom tokenizer dispatch table to prevent stack overflows on long inputs (fixes #21919).
Add models/ggml-vocab-qwen35.gguf (test vocab), models/ggml-vocab-qwen35.gguf.inp (test cases), and models/ggml-vocab-qwen35.gguf.out (expected output) for regression testing.
Update tests/CMakeLists.txt to include the new test entry.

This mirrors the Qwen2 fix (commit 0d049d6), but adapts for Qwen3.5's regex. Ensures robust Unicode tokenization and prevents std::regex stack overflows.
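As a rough illustration of why a hand-written scanner avoids the stack overflow, here is a minimal sketch of a non-backtracking pass over the letters-plus-marks class only. It is not the code in src/unicode.cpp: the function names are hypothetical, and it covers just the [\p{L}\p{M}]+ part of the pattern, using Python's unicodedata categories as a stand-in for the project's own Unicode tables.

```python
import unicodedata


def is_letter_or_mark(ch: str) -> bool:
    # General categories starting with "L" are letters, "M" are combining marks.
    return unicodedata.category(ch)[0] in ("L", "M")


def letter_mark_runs(text: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of maximal runs of letters and marks.

    Single forward pass with an explicit index: no recursion and no
    backtracking, so stack use does not grow with input length.
    """
    runs = []
    i, n = 0, len(text)
    while i < n:
        if is_letter_or_mark(text[i]):
            start = i
            while i < n and is_letter_or_mark(text[i]):
                i += 1
            runs.append((start, i))
        else:
            i += 1
    return runs


print(letter_mark_runs("cafe\u0301 42 na\u00efve"))
```

Because the loop only ever advances the index, a run of a million letters costs a million iterations rather than a million nested calls.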

Closes #21919.

fix: enhance regex handling for Qwen3.5 tokenizer to include accent marks
cont : remove trailing whitespace

Co-authored-by: Kabir kabir@example.com
Co-authored-by: Alde Rojas hello@alde.dev

14 May 08:13

github-actions

b9145

9ed6e19

b9145

SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations (#21597)

SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations

Replace sycl::malloc_device with zeMemAllocDevice for GPU memory allocation
in the SYCL backend. sycl::malloc_device triggers the xe kernel driver's
DMA-buf/TTM path which mirrors every VRAM allocation 1:1 in system RAM.
zeMemAllocDevice uses the SVM/P2P path with no host staging.

On a dual Intel Arc Pro B70 system (64GB VRAM, 64GB RAM), a 15.6 GiB model
consumed 60 GiB of system RAM via sycl::malloc_device, causing OOM crashes.
With zeMemAllocDevice, the same workload uses ~6.7 GiB of system RAM with
no performance regression.

All Level Zero calls include automatic fallback to the original SYCL
allocation path if Level Zero interop is unavailable.

SYCL: address review feedback - remove try/catch, check device types, deduplicate

Remove try/catch from malloc/free/memcpy helpers, check backend and device type upfront instead (ggml_sycl_is_level_zero, ggml_sycl_is_dgpu)
Move shared helpers (is_level_zero, is_dgpu, free_device) to common.cpp and declare in common.hpp to eliminate code duplication
Use SYCL_CHECK(CHECK_TRY_ERROR()) for fallback sycl::free calls
Guard dev2dev_memcpy L0 path to dGPU-to-dGPU only, preserving the host-staged path for iGPU-to-dGPU transfers
Add Windows Level Zero SDK path detection (LEVEL_ZERO_V1_SDK_PATH) in CMakeLists.txt (co-authored with @arthw)

SYCL: add build/runtime flags for Level Zero, address review feedback

Implements the architecture suggested by @arthw: compile-time and runtime
flags to cleanly separate Level Zero and SYCL memory API paths.

Add GGML_SYCL_SUPPORT_LEVEL_ZERO cmake option (default ON). All Level Zero code is wrapped in #ifdef so the build works on systems without the Level Zero SDK installed (e.g. CPU-only CI servers). Both the loader library and headers are checked before enabling.
Add GGML_SYCL_ENABLE_LEVEL_ZERO runtime env var (default 1). Controls whether Level Zero or SYCL memory APIs are used. Only one API style is used per session, no mixing. If Level Zero is enabled but the devices don't support the Level Zero backend, it auto-disables with a warning.
Remove Level Zero code from dpct_malloc. It was unused (dpct::device_memory is not called anywhere in the backend) and used try/catch for flow control.
Update SYCL.md with documentation for both new parameters.
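The runtime behaviour of GGML_SYCL_ENABLE_LEVEL_ZERO described above could be modelled roughly as follows. This is a sketch under assumptions, not the backend's code: the function name, the backend strings, and the rule that only the value "0" disables the flag are all hypothetical; only the variable name, its default of 1, and the auto-disable-with-warning behaviour come from the notes.

```python
import os

ENV_VAR = "GGML_SYCL_ENABLE_LEVEL_ZERO"  # name taken from the release notes


def resolve_level_zero(env, gpu_backends):
    """Decide once per session whether Level Zero memory APIs are used."""
    # Assumption: any value other than "0" leaves the flag on (default 1).
    requested = env.get(ENV_VAR, "1") != "0"
    if not requested:
        return False
    # Auto-disable if any GPU does not run on the Level Zero backend.
    unsupported = [b for b in gpu_backends if b != "level_zero"]
    if unsupported:
        print(f"warning: Level Zero disabled, unsupported backends: {unsupported}")
        return False
    return True


if __name__ == "__main__":
    print(resolve_level_zero(os.environ, ["level_zero", "level_zero"]))
```

Resolving the choice once and reusing the result matches the "one API style per session, no mixing" rule stated above.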

Tested on Intel Arc Pro B70 (32GB), single-GPU and dual-GPU, with both
GGML_SYCL_SUPPORT_LEVEL_ZERO=ON and OFF builds. AI-assisted development
(Claude). Code reviewed and tested on my hardware.

SYCL: unify Level Zero malloc/free call sites, address review feedback

Move ggml_sycl_malloc_device to common.cpp alongside ggml_sycl_free_device.
Both functions are now unconditionally available — Level Zero code is
#ifdef'd inside the functions, not at call sites. All call sites use
uniform SYCL_CHECK(CHECK_TRY_ERROR()) wrapping with no #ifdef blocks.

Addresses arthw's review: wrap all malloc/free in SYCL_CHECK for stack
traces on failure, eliminate duplicated #ifdef/else patterns at 6 call
sites (-29 lines net).

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com

SYCL: add Level Zero SDK to CI, fix device check and missed alloc paths

Add Level Zero SDK installation to Ubuntu and Windows SYCL CI jobs
so the Level Zero code path is compiled and tested in CI.

Fix two bugs found during extended dual-GPU testing (no
ONEAPI_DEVICE_SELECTOR set):

The Level Zero backend check was iterating all SYCL devices including CPU. The OpenCL CPU device caused Level Zero to be disabled for the GPUs, defeating the fix on multi-GPU systems. Added is_gpu() filter so only GPU devices are checked.
sycl_ext_malloc_device/sycl_ext_free (tensor reorder temp buffers) were still calling sycl::malloc/sycl::free directly, bypassing the Level Zero path. Routed through ggml_sycl_malloc_device/free_device for consistency with the other device memory call sites.
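The first of these two bugs is easy to reproduce in miniature. The sketch below uses a hypothetical Device record and backend strings (not the SYCL API) to contrast a check over every device with a check restricted to GPUs.

```python
from dataclasses import dataclass


@dataclass
class Device:
    # Hypothetical stand-in for a SYCL device; not the real API.
    name: str
    is_gpu: bool
    backend: str


def level_zero_ok_all_devices(devices):
    # Buggy variant: the OpenCL CPU device is checked too and vetoes Level Zero.
    return all(d.backend == "level_zero" for d in devices)


def level_zero_ok_gpus_only(devices):
    # Fixed variant: only GPU devices take part in the check.
    return all(d.backend == "level_zero" for d in devices if d.is_gpu)


devices = [
    Device("gpu0", True, "level_zero"),
    Device("gpu1", True, "level_zero"),
    Device("cpu", False, "opencl"),
]
print(level_zero_ok_all_devices(devices), level_zero_ok_gpus_only(devices))
```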

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com

SYCL: address arthw review feedback on Level Zero memory API structure

Move ggml_sycl_malloc_device to static function in ggml-sycl.cpp; only ggml_sycl_free_device (used by common.cpp) stays in common.cpp
Switch both helpers to use g_ggml_sycl_enable_level_zero global instead of per-call queue backend checks
Remove #ifdef wrapper from global definition; always declare at 0, add #else branch in init block so it stays 0 when L0 not compiled in
Update init loop comment to explain GPU-only device check
CMakeLists: message(STATUS) before the if block; align option wording

AI-assisted implementation. Reviewed and tested on dual Intel Arc Pro
B70 (32 GB each): test-backend-ops OK on both GPUs, single/dual-GPU
Q4_K_M and Q8_0 bench correct, zeMemAllocDevice GTT delta confirmed
<5 MiB per 4 GiB allocation (vs ~4 GiB shadow with sycl::malloc_device).

Co-Authored-By: Claude Sonnet 4.6 noreply@anthropic.com

SYCL: remove unused cstdio/cstdlib includes from common.cpp

Leftover from the deleted ggml_sycl_queue_supports_level_zero helper.

Co-authored-by: Claude Sonnet 4.6 noreply@anthropic.com

Apply suggestions from code review

Co-authored-by: Neo Zhang zhang.jianyu@outlook.com

SYCL: preserve Level Zero allocation path during early malloc
ci: fix Level Zero package conflict in Intel Docker build
ci: find Level Zero loader in oneAPI package step
ci: allow Windows SYCL package without Level Zero DLL

Co-authored-by: Claude Opus 4.6 (1M context) noreply@anthropic.com
Co-authored-by: Neo Zhang zhang.jianyu@outlook.com

Contributors

arthw

14 May 02:44

github-actions

b9144

4c1c3ac

b9144

ggml-webgpu: only use subgroup-matrix path when head dims are divisible by sg_mat_k / sg_mat_n (#23020)
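A guard of this kind might look like the sketch below. The release title does not say which head dimension is checked against which tile size, so the pairing here (Q/K head dimension against sg_mat_k, V head dimension against sg_mat_n) is an assumption, as are the function and parameter names.

```python
def can_use_subgroup_matrix(head_dim_qk: int, head_dim_v: int,
                            sg_mat_k: int, sg_mat_n: int) -> bool:
    """Hypothetical guard: take the subgroup-matrix path only for exact tilings.

    Assumption: the Q/K head dimension must tile by sg_mat_k and the V head
    dimension by sg_mat_n; the actual pairing in ggml-webgpu may differ.
    """
    return head_dim_qk % sg_mat_k == 0 and head_dim_v % sg_mat_n == 0


print(can_use_subgroup_matrix(128, 128, 16, 16))
print(can_use_subgroup_matrix(80, 80, 32, 16))
```

When the guard fails, the caller would fall back to a path that does not rely on subgroup matrices.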

14 May 01:45

github-actions

b9143

7f3f843

b9143

Fix for issue #22974. Cast intermediate results to float before adding and casting the result to the destination type. Avoids half+half operator ambiguity. (#22994)

14 May 01:34

github-actions

b9142

ec562eb

b9142

opencl: add q5_0 and q5_1 MoE for Adreno (#22985)

opencl: add q5_0 moe support
opencl: add q5_1 moe support
opencl: avoid potential leak
opencl: suppress unused var warning when building for non-Adreno

Co-authored-by: Li He lih@qti.qualcomm.com

14 May 01:12

github-actions

b9141

95d469a

b9141

server, webui: accept continue_final_message flag for vLLM API compat (#23012)

server, webui: accept continue_final_message flag for vLLM API compat

Add the continue_final_message body flag from the vLLM and transformers
API. When set together with add_generation_prompt false, it triggers the
existing prefill_assistant code path, regardless of the server-side
opt.prefill_assistant option. Mutual exclusion with add_generation_prompt
true is enforced, matching vLLM behavior.
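The flag interaction can be summarised with a small decision function. This is a simplified sketch, not the server's code: the names are hypothetical, both flags are passed explicitly so no default is implied, and the final fallback stands in for the existing prefill_assistant heuristic, which this example does not model.

```python
class BadRequest(Exception):
    """Hypothetical error type standing in for an HTTP 400 response."""
    status = 400


def should_prefill(continue_final_message: bool,
                   add_generation_prompt: bool,
                   server_prefill_assistant: bool) -> bool:
    # The two flags are mutually exclusive, matching the vLLM contract.
    if continue_final_message and add_generation_prompt:
        raise BadRequest(
            "continue_final_message and add_generation_prompt cannot both be true")
    # With the flag set, prefill is used regardless of the server option.
    if continue_final_message:
        return True
    # Simplification: defer to the server option in place of the real heuristic.
    return server_prefill_assistant


print(should_prefill(True, False, server_prefill_assistant=False))
```

A client that wants the model to continue the last assistant message would therefore send continue_final_message true together with add_generation_prompt false.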

WebUI sends continue_final_message and add_generation_prompt false on
the Continue button, with the matching opt-in option on the chat service.

Pure API alignment, no change to the prefill logic itself. Paves the way
for the upcoming per-template prefill plumbing in common/chat.

test: add coverage for continue_final_message vLLM compat flag

Two cases on top of the existing assistant prefill coverage. First,
continue_final_message true with add_generation_prompt false produces
the same rendered prompt as the prefill_assistant heuristic, proving
the new flag is a correct alias of the existing path. Second, setting both
flags to true is rejected with HTTP 400, matching the
vLLM/transformers mutual exclusion contract.

chore: update webui build output
