ggml-cpu: Add IME2 Instruction Support for the SpacemiT Backend #22863

Merged: taronaeo merged 16 commits into ggml-org:master from spacemit-com:add-spacemit-backend-ime2 on May 14, 2026
Conversation

@alex-spacemit (Collaborator) commented May 9, 2026

Overview

This change adds support for IME2 instructions to the ggml SpacemiT backend.
Building on the existing backend capabilities, it extends the IME2 path to cover multiple quantization formats, broadening the range of quantized models supported on the target platform.
Beyond enabling the core operators and execution path, it also improves the related runtime support so that the new path can be integrated and used reliably.
This work aims to improve inference availability and performance coverage on the SpacemiT platform, laying the groundwork for future quantized-model support and further platform optimizations.

Additional information

The changes are concentrated in the SpacemiT CPU backend implementation, the build configuration, and the accompanying documentation.
This submission includes not only IME2 enablement but also several supporting adjustments required by the quantized inference pipeline.
The backend is intended to provide efficient large-model inference on the SpacemiT K1 and K3.
The changes in ggml-cpu.c enable the use of TCM (Tightly-Coupled Memory) during inference by adding lifecycle management before and after thread execution.
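For illustration, a minimal, self-contained sketch of that before/after lifecycle (stand-in names and a stand-in state struct; not the exact ggml-cpu.c code — the real hooks are the `ggml_backend_cpu_riscv64_spacemit_set/clear_numa_thread_affinity*` functions discussed later in this PR):

```cpp
// Sketch only: the hook names and compute_state struct are stand-ins.
// Only the shape of the lifecycle around graph execution is the point.
#include <cstdio>

struct compute_state { int ith; };  // stand-in for ggml_compute_state

static void spacemit_thread_setup(int ith) {
    // Real code pins the worker to a preferred core and acquires its
    // TCM scratchpad block (see the affinity discussion below).
    std::printf("thread %d: bind core, acquire TCM block\n", ith);
}

static void spacemit_thread_teardown(int ith) {
    // Real code releases the scratchpad access right so other
    // processes can reuse the block.
    std::printf("thread %d: release TCM block\n", ith);
}

static int compute_thread(compute_state * state) {
    spacemit_thread_setup(state->ith);     // before graph execution
    // ... ggml graph computation would run here ...
    spacemit_thread_teardown(state->ith);  // after graph execution
    return 0;
}

int main() {
    compute_state s{ 0 };
    return compute_thread(&s);
}
```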

Performance

- Spacemit(R) X60

```
model name      : Spacemit(R) X60
isa             : rv64imafdcv_zicbom_zicboz_zicntr_zicond_zicsr_zifencei_zihintpause_zihpm_zfh_zfhmin_zca_zcd_zba_zbb_zbc_zbs_zkt_zve32f_zve32x_zve64d_zve64f_zve64x_zvfh_zvfhmin_zvkt_sscofpmf_sstc_svinval_svnapot_svpbmt
mmu             : sv39
uarch           : spacemit,x60
mvendorid       : 0x710
marchid         : 0x8000000058000001
```

| model | size | params | backend | threads | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 2B Q4_1 | 1.19 GiB | 1.88 B | CPU | 4 | 128 | 1 | 0 | pp128 | 10.32 ± 0.02 |
| qwen35 2B Q4_1 | 1.19 GiB | 1.88 B | CPU | 4 | 128 | 1 | 0 | tg128 | 3.07 ± 0.01 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 4 | 128 | 1 | 0 | pp128 | 49.15 ± 0.25 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 4 | 128 | 1 | 0 | tg128 | 11.73 ± 0.02 |

- Spacemit(R) A100

```
model name      : Spacemit(R) A100
isa             : rv64imafdcvh_zicbom_zicbop_zicboz_zicntr_zicond_zicsr_zifencei_zihintntl_zihintpause_zihpm_zimop_zaamo_zalrsc_zawrs_zfa_zfh_zfhmin_zca_zcb_zcd_zcmop_zba_zbb_zbc_zbs_zkt_zvbb_zvbc_zve32f_zve32x_zve64d_zve64f_zve64x_zvfh_zvfhmin_zvkb_zvkg_zvkned_zvknha_zvknhb_zvksed_zvksh_zvkt_smaia_smstateen_ssaia_sscofpmf_sstc_svinval_svnapot_svpbmt_sdtrig
mmu             : sv39
mvendorid       : 0x710
marchid         : 0x8000000041000002
mimpid          : 0x10000000d5686200
hart isa        : rv64imafdcv_zicbom_zicbop_zicboz_zicntr_zicond_zicsr_zifencei_zihintntl_zihintpause_zihpm_zimop_zaamo_zalrsc_zawrs_zfa_zfh_zfhmin_zca_zcb_zcd_zcmop_zba_zbb_zbc_zbs_zkt_zvbb_zvbc_zve32f_zve32x_zve64d_zve64f_zve64x_zvfh_zvfhmin_zvkb_zvkg_zvkned_zvknha_zvknhb_zvksed_zvksh_zvkt_smaia_smstateen_ssaia_sscofpmf_sstc_svinval_svnapot_svpbmt_sdtrig
```

| model | size | params | backend | threads | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 8 | 128 | 1 | 0 | pp128 | 565.83 ± 0.31 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 8 | 128 | 1 | 0 | tg128 | 55.77 ± 0.02 |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | CPU | 8 | 128 | 1 | 0 | pp128 | 79.74 ± 0.04 |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | CPU | 8 | 128 | 1 | 0 | tg128 | 11.29 ± 0.00 |
| qwen3moe 30B.A3B Q4_0 | 16.18 GiB | 30.53 B | CPU | 8 | 128 | 1 | 0 | pp128 | 57.88 ± 0.31 |
| qwen3moe 30B.A3B Q4_0 | 16.18 GiB | 30.53 B | CPU | 8 | 128 | 1 | 0 | tg128 | 12.79 ± 0.00 |
| qwen3moe 30B.A3B Q4_0 | 16.18 GiB | 30.53 B | CPU | 8 | 128 | 1 | 0 | pp2048 | 39.04 ± 0.02 |
| qwen3moe 30B.A3B Q4_0 | 16.18 GiB | 30.53 B | CPU | 8 | 128 | 1 | 0 | tg2048 | 10.83 ± 0.00 |
| qwen35 2B Q4_1 | 1.19 GiB | 1.88 B | CPU | 8 | 128 | 1 | 0 | pp128 | 115.23 ± 0.04 |
| qwen35 2B Q4_1 | 1.19 GiB | 1.88 B | CPU | 8 | 128 | 1 | 0 | tg128 | 16.49 ± 0.01 |
| gemma4 E4B Q4_K - Medium | 4.76 GiB | 7.52 B | CPU | 8 | 128 | 1 | 0 | pp128 | 21.13 ± 0.01 |
| gemma4 E4B Q4_K - Medium | 4.76 GiB | 7.52 B | CPU | 8 | 128 | 1 | 0 | tg128 | 5.66 ± 0.00 |

alex-spacemit and others added 9 commits May 8, 2026 14:33
- Introduced `spine_barrier` for thread synchronization with atomic operations.
- Implemented `spine_mem_pool` for efficient memory management, supporting various backends including POSIX, transparent hugepages, hugetlb, and shared memory.
- Added TCM (Tightly Coupled Memory) interface for memory allocation and management.
- Created header files for `spine_mem_pool` and `spine_tcm` to define public APIs.
- Enhanced memory allocation safety with alignment checks and error handling.

Change-Id: I2028b1622f3ec831e909e6f41e0dd51da86bad7d
…nel_i8i4_hp_m1

Change-Id: I8716215db1c55fb532cd6bda0b94f3eef7834a4e
Change-Id: I6b3ccee0c264405bc977c5941492a7e73ffb25e5
Change-Id: Ic390913708cc87394cce9e8b27e666d93461140d
Change-Id: I75c87cec8d2fae859451a58c4bc0f70b3b38e918
- Updated logging messages in `ime_env.cpp`, `rvv_kernels.cpp`, and `spine_mem_pool.cpp` to include the "CPU_RISCV64_SPACEMIT" prefix for better identification of issues related to the RISC-V64 spacemit backend.
- Improved error handling in memory allocation functions to provide clearer context in log messages, aiding in debugging and maintenance.

Change-Id: I217cb4d2e6a197681008f4c0f6e8e24088a70bcf
…figuration for SMT support

Change-Id: Ie729ad8d853ae78f51b56d82edcc6cbb33cd2cb8
Change-Id: I15bf64cd6d7013d9e28118727281daf234b69381
Change-Id: I99a4974c16275a4e4c0c7acb56fccc7381b1a123
Copilot AI review requested due to automatic review settings May 9, 2026 06:47
@alex-spacemit alex-spacemit requested a review from ggerganov as a code owner May 9, 2026 06:47

ggml-gh-bot Bot commented May 9, 2026

Hi @alex-spacemit, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds IME2 instruction-path enablement and supporting runtime infrastructure for the SpacemiT RISC-V CPU backend, including quantized format coverage and TCM/shared-memory lifecycle integration.

Changes:

  • Adds SpacemiT runtime components (TCM loader, shared mem pool, barriers, env discovery) and RVV/IME repack support.
  • Extends IME kernel interface to include IME2 kernels and additional quantized formats.
  • Updates ggml CPU threading to apply SpacemiT-specific NUMA/thread affinity setup/teardown and augments build + docs.

Reviewed changes

Copilot reviewed 16 out of 20 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| ggml/src/ggml-cpu/spacemit/spine_tcm.h | Adds a header-only (or direct-link) TCM runtime loader API for optional libspine_tcm integration. |
| ggml/src/ggml-cpu/spacemit/spine_mem_pool.h | Declares SpacemiT memory pool APIs, including TCM and shared-memory allocation helpers. |
| ggml/src/ggml-cpu/spacemit/spine_mem_pool.cpp | Implements chunked memory pools (POSIX/THP/hugetlb/shared-mem) and TCM wrappers. |
| ggml/src/ggml-cpu/spacemit/spine_barrier.h | Introduces a cache-aligned atomic barrier primitive for SpacemiT runtime coordination. |
| ggml/src/ggml-cpu/spacemit/rvv_kernels.h | Declares RVV kernel entry points used by SpacemiT backend paths. |
| ggml/src/ggml-cpu/spacemit/repack.h | Declares the templated repack API used for SpacemiT quantized layout conversions. |
| ggml/src/ggml-cpu/spacemit/repack.cpp | Adds/optimizes multiple quantized repack paths leveraging RVV for IME/IME2 data layouts. |
| ggml/src/ggml-cpu/spacemit/ime_kernels.h | Extends the kernel API surface to cover IME2 kernels and additional quantization formats. |
| ggml/src/ggml-cpu/spacemit/ime_env.h | Adds SpacemiT runtime environment discovery/configuration declarations (cores, IME1/2, TCM, barriers). |
| ggml/src/ggml-cpu/spacemit/ime_env.cpp | Implements environment probing via /proc/cpuinfo and env vars; configures cores/TCM/shared barrier. |
| ggml/src/ggml-cpu/spacemit/ime.h | Exposes a new C API for SpacemiT thread affinity + shared-memory allocation hooks. |
| ggml/src/ggml-cpu/ggml-cpu.c | Integrates SpacemiT-specific thread affinity setup and teardown into ggml compute threads. |
| ggml/src/ggml-cpu/cmake/FindSMTIME.cmake | Adds compiler-feature probing for IME/IME2-related asm intrinsics and sets build defines. |
| ggml/src/ggml-cpu/CMakeLists.txt | Wires SpacemiT sources and IME2 detection into the CPU backend build. |
| docs/build-riscv64-spacemit.md | Updates build instructions and documents quantization support/perf for X60/A100. |
| cmake/riscv64-spacemit-linux-gnu-gcc.cmake | Updates RISC-V march flags and disables GCC auto-vectorization for predictable RVV behavior. |


Comment on lines +179 to +185:

```cpp
    if (spine_perfer_core_arch_str != nullptr && spine_perfer_core_arch_str != "") {
        perfer_core_arch_id = spine_core_arch_id{ hex_string_to_u16(spine_perfer_core_arch_str) };
    }

    char * spine_perfer_core_id_str = getenv("SPACEMIT_PERFER_CORE_ID");
    std::vector<int> perfer_core_id_vec;
    if (spine_perfer_core_id_str != nullptr && spine_perfer_core_id_str != "") {
```
Comment on lines +233 to +249:

```cpp
    for (int core_id : perfer_core_id_vec) {
        if (core_id < 0 || core_id >= num_cores) {
            GGML_ABORT("invalid core id in SPACEMIT_PERFER_CORE_ID: %d, should be between 0 and %d\n", core_id,
                       num_cores - 1);
        }
        auto core_info = core_info_list[core_id];
        auto core_arch_id = core_info.arch_id;
        if (core_arch_id == perfer_core_arch_id) {
            cpu_mask |= (1ULL << core_id);
            perfer_core_ids.push_back(core_id);
        } else {
            GGML_ABORT(
                "core id %d in SPACEMIT_PERFER_CORE_ID has arch id %x which does not match "
                "SPACEMIT_PERFER_CORE_ARCH %x\n",
                core_id, (uint16_t) core_arch_id, (uint16_t) perfer_core_arch_id);
        }
    }
```
Comment on lines +299 to +300:

```cpp
        GGML_LOG_WARN("CPU_RISCV64_SPACEMIT: failed to allocate init_barrier from shared mem, falling back to heap\n",
                      __func__);
```
Comment on lines +69 to +82:

```cpp
    if (has_processor && has_marchid) {
        for (auto & cpu_info : cpu_info_list) {
            if (cpu_info[0] != spine_invalid_core_id &&
                spine_march_mapping_.find(cpu_info[1]) != spine_march_mapping_.end()) {
                auto core_info = spine_core_info();
                core_info.core_id = cpu_info[0];
                core_info.arch_id = spine_core_arch_id(spine_march_mapping_[cpu_info[1]]);

                result.push_back(core_info);
            }
        }
    }

    return has_processor && has_marchid;
```
Comment on lines +205 to +217:

```cpp
static inline int spine_tcm_open_handle(const char * so_path) {
    spine_tcm_handle_t * resolved = spine_tcm_default_handle();
    const char * library = (so_path != NULL && so_path[0] != '\0') ? so_path : "libspine_tcm.so";

    if (resolved->module_handle != NULL || resolved->use_global_scope) {
        return 0;
    }

    if (spine_tcm_try_bind_global(resolved) == 0) {
        return 0;
    }

    spine_tcm_handle_reset(resolved);
```
Comment on lines +335 to +339:

```cpp
    auto & chunk = chunks_[best.chunk_index];
    const free_block block = chunk.free_blocks[best.block_index];
    const size_t padding = best.aligned_offset - block.offset;
    const size_t alloc_end = best.aligned_offset + size;
    const size_t block_end = block.offset + block.size;
```
Comment on lines +28 to +34:

```cpp
inline void spine_barrier_init(spine_barrier_t * b, int num_barriers, uint64_t thread_count) {
    for (int i = 0; i < num_barriers; i++) {
        b[i].total_ = thread_count;
        b[i].pending_.store(thread_count);
        b[i].rounds_.store(0);
    }
}
```
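For context, the init above arms `total_`, `pending_`, and `rounds_`; a matching wait in this style might look like the following sketch (an assumption about the pattern, not the PR's actual code — the field names follow the excerpt, the memory orderings are assumed):

```cpp
// Hedged sketch of a round-based (sense-reversing) barrier wait that pairs
// with the spine_barrier_init shown above.
#include <atomic>
#include <cstdint>

struct alignas(64) spine_barrier_t {          // cache-aligned, per the PR description
    uint64_t              total_   = 0;
    std::atomic<uint64_t> pending_ { 0 };
    std::atomic<uint64_t> rounds_  { 0 };
};

inline void spine_barrier_wait(spine_barrier_t * b) {
    const uint64_t round = b->rounds_.load(std::memory_order_acquire);
    if (b->pending_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        // Last thread to arrive: re-arm the barrier and publish the new round.
        b->pending_.store(b->total_, std::memory_order_relaxed);
        b->rounds_.fetch_add(1, std::memory_order_release);
    } else {
        // Everyone else spins until the round counter advances.
        while (b->rounds_.load(std::memory_order_acquire) == round) { }
    }
}
```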
Comment on a new-file hunk (@@ -0,0 +1,32 @@):

```cmake
include(CheckCSourceRuns)
```

Comment on an excerpt (line reference not preserved):

```cpp
        perfer_core_id_vec_str += std::to_string(core_id) + ",";
    }
    perfer_core_id_vec_str.pop_back();
    GGML_LOG_DEBUG("SPACEMIT_PERFER_CORE_ID is set, perferred core ids: %s\n", perfer_core_id_vec_str.c_str());
```

Comment on lines +34 to +38:

```cpp
    std::vector<int> perfer_core_ids;
    int aicpu_id_offset{ 0 };
    int num_cores{ 0 };
    int num_perfer_cores{ 0 };
    spine_core_arch_id perfer_core_arch_id{ spine_core_arch_id::core_arch_none };
```
…enable new CPU options

Change-Id: I3f182a82cd9f7592bfa253bf8fe13f94c5c9d1da
@alex-spacemit alex-spacemit requested a review from a team as a code owner May 9, 2026 06:58
@alex-spacemit (Collaborator, Author) commented:

@ggerganov Hi, it has been quite a while since the ggml SpacemiT backend was last updated. With the mass-production rollout of the K3, which supports the RVA23 standard, we are also shipping K3 IME2 support for ggml. We have done some refactoring to make future multi-architecture compatibility easier and to provide higher-performance kernels. Going forward, we will continue to support more quantization formats, potentially even including MXFP4.

@taronaeo (Member) commented May 9, 2026

14,000 LoC is very hard to review. Are you able to break the PR up into multiple, smaller chunks instead?

@alex-spacemit (Collaborator, Author) replied:

> 14,000 LoC is very hard to review. Are you able to break the PR up into multiple, smaller chunks instead?

If I removed ime2_kernels.cpp and rvv_kernel.cpp, so that this submission contained only the refactoring and the IME2 interface scaffolding, the diff would shrink by about 9K lines; the remaining 5K lines are all necessary. Would that be acceptable?

Also, although this submission is still quite large, we have already been running this version internally for five months, so its correctness and stability are well exercised. As for code style, most of the style-related changes are in ime.cpp, so that file may be worth focusing on. If necessary, I can also provide a detailed explanation of this refactoring and the IME2 support.

@github-actions github-actions Bot added the `documentation`, `build`, `devops`, and `ggml` labels May 9, 2026
Change-Id: I03dab99acb8f4611f0b07550731702d139c4fb37
…affinity functions

Change-Id: I014aec747c57ceca8b51dd1246b22072ad3c4526
@taronaeo (Member) commented May 9, 2026

Oh, I didn't know that this includes a refactor, hence the heavy changes. I would prefer the refactor to be in a PR by itself rather than combined with this new feature PR. For example,

PR 1: SpacemiT refactor
PR 2: Add IME2 instruction support
PR 3: Update docs

But if it's not feasible, then please let me know and I'll try to go through this entire PR.

@ggerganov (Member) commented:

@taronaeo I think we can go through just the ime.h, ime.cpp and CMake changes. Maybe also the repack. The rest is too specific - I don't think I can provide meaningful feedback, even if it was in smaller chunks.

Comment on lines +3079 to +3081:

```c
#ifdef GGML_USE_CPU_RISCV64_SPACEMIT
    ggml_backend_cpu_riscv64_spacemit_clear_numa_thread_affinity_threaded(state->ith);
#endif
```
Member: What is the reason to have to clear the affinity after each call?

Collaborator Author:

The naming of ggml_backend_cpu_riscv64_spacemit_clear_numa_thread_affinity_threaded follows the same convention as ggml_backend_cpu_riscv64_spacemit_set_numa_thread_affinity.
set_numa_thread_affinity sets the thread affinity and binds the specified scratchpad memory; it also acquires the access right to the scratchpad memory block via spine_mem_pool_tcm_mem_wait. In contrast, clear_numa_thread_affinity_threaded releases the access right to the designated scratchpad memory block through spine_mem_pool_tcm_mem_release.
The TCM memory manager is designed as a Linux kernel resource. Each process or thread can exclusively obtain the scratchpad memory block with a specific ID (the ID is bound to the CPU ID). This lets llama-server load multiple model processes while each holds complete scratchpad memory resources. At runtime, the acquire and release interfaces ensure the lifecycle integrity of each scratchpad memory block. Once a single ggml_graph inference finishes, the access right to the scratchpad memory block is released, allowing ggml_graph inference from other processes to preempt the freed resources.
Currently, the resource waiting timeout of spine_mem_pool_tcm_mem_wait is configured to around 1 second. If no other processes are running and no waiting is needed, the overall latency of this interface is approximately 500–1000 ns.
Additionally, the Spacemit(R) A100 is a RISC-V core with a 1024-bit VLEN, so thread switching incurs very high overhead for preserving register context. We are considering a better scheduling scheme for AI computing threads targeting the Linux + AI CPU stack in the future. The current strategy of reserving dedicated thread resources for specific computing tasks (e.g., ggml_graph) is conservative but reliable and safe.
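To make the lifecycle concrete, a stubbed sketch of the acquire/release pattern described above (the two `spine_mem_pool_tcm_mem_*` names come from this PR, but their signatures and the stub bodies here are assumptions for illustration):

```cpp
// Illustrative stubs only: the real implementations talk to the TCM kernel
// resource manager; signatures are assumed for the sketch.
#include <cstdio>

static bool spine_mem_pool_tcm_mem_wait(int block_id, int timeout_ms) {
    (void) block_id; (void) timeout_ms;
    return true;  // stub: pretend the access right was acquired
}

static void spine_mem_pool_tcm_mem_release(int block_id) {
    (void) block_id;  // stub: release the access right
}

static void run_one_graph_inference(int cpu_id) {
    const int block_id = cpu_id;  // block ID is bound to the CPU ID, per the comment above
    // Timeout is ~1 s per the description; uncontended latency is ~500-1000 ns.
    if (!spine_mem_pool_tcm_mem_wait(block_id, 1000)) {
        std::fprintf(stderr, "TCM block %d busy\n", block_id);
        return;
    }
    // ... one ggml_graph inference runs while the scratchpad is held ...
    spine_mem_pool_tcm_mem_release(block_id);  // allow other processes to preempt the block
}

int main() {
    run_one_graph_inference(0);
    return 0;
}
```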

@taronaeo (Member) left a comment:

Couldn't review everything but in general, remove the struct keywords. Ref:

- In C++ code omit optional `struct` and `enum` keyword whenever they are not necessary
```cpp
// OK
llama_context * ctx;
const llama_rope_type rope_type;
// not OK
struct llama_context * ctx;
const enum llama_rope_type rope_type;
```

I think some if/else statements can be condensed into a simple switch statement too.
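For illustration, the kind of condensation being suggested (the enum and kernel names below are invented; the real dispatch in ime.cpp is over ggml tensor types):

```cpp
// Hypothetical example of collapsing an if/else chain into a switch.
enum class qtype { q4_0, q4_1, q4_k, other };

// Before: if (t == qtype::q4_0) { ... } else if (t == qtype::q4_1) { ... } ...
// After:
const char * kernel_name(qtype t) {
    switch (t) {
        case qtype::q4_0: return "kernel_q4_0";
        case qtype::q4_1: return "kernel_q4_1";
        case qtype::q4_k: return "kernel_q4_k";
        default:          return nullptr;  // unsupported type
    }
}
```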

Ten review comment threads on ggml/src/ggml-cpu/spacemit/ime.cpp (outdated).
…_OP_MUL_MAT and GGML_OP_MUL_MAT_ID by using switch-case statements.

- Improved error handling with GGML_ABORT for unsupported tensor types.
- Enhanced clarity in kernel implementation checks for various tensor types.
- Consolidated repeated logic in tensor operations (ADD, SUB, MUL, DIV) to streamline the codebase.
- Updated memory allocation and buffer management to eliminate unnecessary context structures.

Change-Id: I3d67e3be418d41baa3f9dadb27bdb84552143b49
…tensor instead of struct ggml_tensor.

- Updated the div_round_up function in rvv_kernels.h to use auto for parameter types.

Change-Id: Ib978a16dd9732524b3384a0ff8f60406c5a56143
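For reference, a `div_round_up` with `auto` parameters (a C++20 abbreviated function template) might look like this sketch; the actual helper in rvv_kernels.h may differ:

```cpp
// Sketch only: rounding integer division up, with auto parameters (C++20).
constexpr auto div_round_up(auto numerator, auto divisor) {
    return (numerator + divisor - 1) / divisor;
}
static_assert(div_round_up(10, 4) == 3);
```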
…_cpu_riscv64_spacemit_nbytes

Change-Id: Id4c65215e653faeb856254eead6e9246adc53215
@alex-spacemit (Collaborator, Author) commented:

Hi, I’ve addressed all the review comments. Could you please check if there’s anything else I need to adjust? @taronaeo

Five review comment threads on ggml/src/ggml-cpu/spacemit/ime.cpp (two outdated).
@taronaeo (Member) commented:

Btw, would be nice to have some llama-bench benchmarks in the PR description as well :)

@alex-spacemit (Collaborator, Author) replied:

> Btw, would be nice to have some llama-bench benchmarks in the PR description as well :)

The PR includes some llama-bench data in docs/build-riscv64-spacemit.md, and I have also added the same data to the PR description.

Change-Id: I0da8ba51faf2bfbb126a968f642541eeca04fb1d
@alex-spacemit (Collaborator, Author) commented:

@taronaeo @ggerganov I’ve addressed the current comments. May I ask if you have any further review feedback?

@taronaeo (Member) left a comment:

Thanks for pinging :)

@ggerganov ggerganov added the `merge ready` label May 13, 2026
@taronaeo (Member) commented:

@ggml-org/maintainers Another approval please.

@taronaeo taronaeo merged commit 81b0d88 into ggml-org:master May 14, 2026
44 of 46 checks passed
xxmustafacooTR pushed a commit to xxPlayground/llama-cpp-turboquant that referenced this pull request May 14, 2026
ryant00000 added a commit to ryant00000/llama.cpp that referenced this pull request May 14, 2026