
CANN: Add support for Qwen35 ops #21204

Draft
hipudding wants to merge 10 commits into ggml-org:master from hipudding:qwen35_op

Conversation


@hipudding hipudding commented Mar 31, 2026

Overview

This PR adds support for several missing operators in the CANN (Ascend NPU) backend for Qwen3.5.

New operators

  • GGML_OP_FILL — fills a tensor with a constant scalar value via aclnnInplaceFillScalar
  • GGML_OP_DIAG — creates a diagonal matrix from a vector using a strided view with InplaceCopy
  • GGML_OP_SOLVE_TRI — triangular linear system solve (AX=B) via aclnnTriangularSolve (lower-triangular, non-unit)
  • GGML_UNARY_OP_SOFTPLUS — implemented via aclnnSoftplus (beta=1.0, threshold=20.0)
  • GGML_OP_CUMSUM — cumulative sum via aclnnCumsum
  • GGML_OP_TRI — all 4 tri types (LOWER, LOWER_DIAG, UPPER, UPPER_DIAG) using Tril/MaskedFillScalar to work around CANN sparse-zero bugs
  • GGML_OP_SET — inplace tensor set via aclnnInplaceCopy, modeled after the existing ACC implementation; enables the scheduler to assign SET ops to CANN and avoids cross-device write issues in delta-net hybrid models
  • GGML_OP_GATED_DELTA_NET — recurrent state-space operator using a batched aclnnBatchMatMul approach that groups all H attention heads into a single rank-3 matmul per recurrence step, reducing kernel launches from O(n_seqs × H × n_tokens) to O(n_seqs × n_tokens)
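The FILL-then-strided-copy trick used for GGML_OP_DIAG can be sketched on the host side. This is only an illustration of the indexing, with plain loops standing in for the device-side aclnnInplaceFillScalar and strided-view aclnnInplaceCopy calls:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side sketch of the DIAG approach: zero-fill the n x n destination,
// then copy the n source elements onto the main diagonal by treating dst as
// a flat buffer and stepping with stride n + 1 (i.e. a strided view of the
// diagonal). The function name is illustrative, not the actual backend code.
std::vector<float> diag_from_vector(const std::vector<float> & src) {
    const size_t n = src.size();
    std::vector<float> dst(n * n, 0.0f);   // FILL with the zero scalar
    for (size_t i = 0; i < n; ++i) {
        dst[i * (n + 1)] = src[i];         // strided "view" onto the diagonal
    }
    return dst;
}
```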

Other changes

  • memset_tensor interface — implement ggml_backend_cann_buffer_memset_tensor and wire it into ggml_backend_cann_buffer_interface to ensure correct zero-initialization of cache buffers
  • Graph cache fix — always compare op_params for all ops in the graph cache key, not just a whitelist; previously, ops with differing params could incorrectly match a cached graph
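The graph-cache fix can be illustrated with a simplified key comparison. The names below are hypothetical, not the actual CANN backend code; the point is that op_params must be compared for every op, since two ops that differ only in their parameters (e.g. two FILL ops with different scalar values) must not hit the same cached graph:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// ggml stores op parameters in a small fixed-size array; four entries are
// enough for this sketch.
using op_params_t = std::array<int32_t, 4>;

// Simplified cache-key comparison for one op. After the fix, the raw
// parameter bytes are always compared, not only for a whitelist of op types.
bool ops_match(int op_a, const op_params_t & pa, int op_b, const op_params_t & pb) {
    if (op_a != op_b) {
        return false;
    }
    return std::memcmp(pa.data(), pb.data(), sizeof(op_params_t)) == 0;
}
```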

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure:
    YES — AI tools (Claude) were used in an assistive capacity only. Specifically, AI was used to analyze problems, suggest implementation approaches, and provide explanations of relevant CANN/ACL APIs. All code was written, reviewed line-by-line, and validated by the human contributor, who takes full responsibility for the correctness and design of the changes.

Add ggml_backend_cann_buffer_memset_tensor and wire it into
`ggml_backend_cann_buffer_interface`.

This ensures backend tensor memset operations are supported
and avoids incorrect behavior when tensors need explicit
zero-initialization (e.g. cache buffers).

Add SET operator support using aclnnInplaceCopy, modeled after the
existing ACC implementation. This enables the scheduler to assign
SET ops to CANN when the output tensor resides on device memory,
avoiding cross-device write issues with delta-net hybrid models.

All 12 test-backend-ops SET tests pass (f32/i32, inplace/non-inplace, dim 1/2/3).

Implement GATED_DELTA_NET for the CANN (Ascend NPU) backend using a
batched approach that groups all attention heads into a single 3-D
BatchMatMul per recurrence step, reducing kernel launches from
O(n_seqs × H × n_tokens) to O(n_seqs × n_tokens).

Key design decisions:
- Use aclnnBatchMatMul (rank-3 only) with shape [H, S_v, S_v] to batch
  all H heads together for M×k, outer-product, and M×q steps
- Pre-allocate temporary buffers (g_exp, mk, delta, outer) reused
  across all time steps to avoid per-step allocations
- Support both scalar gate (g shape [1,H]) and KDA per-dim gate
  (g shape [S_v,H]) via appropriate broadcast shapes
- Fall back to naive per-head scalar loop for permuted/GQA/non-F32
  inputs that don't meet batched path requirements
- Relax CANN precision tolerance to 1e-6 in tests to account for
  different FP32 accumulation order in BatchMatMul vs scalar loops
- Remove dead code: _math and _naive variants are no longer needed
- Rename _batched to the public entry point ggml_cann_gated_delta_net
- In supports_op, return false for non-contiguous / GQA / non-F32 cases
  so the framework falls back to CPU instead of running the slow naive path
- The single remaining implementation uses aclnnBatchMatMul over all H
  heads per timestep, reducing kernel launches to O(n_seqs * n_tokens)
- Implement GGML_OP_CUMSUM using aclnnCumsum
- Implement GGML_OP_TRI with all 4 tri types (LOWER, LOWER_DIAG, UPPER, UPPER_DIAG)
  using Tril/MaskedFillScalar approach to work around CANN sparse-zero bugs
- Fix graph cache to always compare op_params for all ops, not just a whitelist

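The Tril/MaskedFillScalar workaround for the tri types can be sketched as a keep-mask over the lower triangle, with the mask's complement filled by a scalar. This host-side version is illustrative only and covers the two lower variants; the upper variants are the mirrored condition:

```cpp
#include <cassert>
#include <vector>

// Lower-triangular masking sketch: with_diag selects the LOWER_DIAG variant
// (keep where i >= j) vs the strict LOWER variant (keep where i > j).
// Positions outside the kept triangle are overwritten with `fill`, mirroring
// a Tril mask followed by MaskedFillScalar on its complement.
std::vector<float> tri_lower(const std::vector<float> & src, int n,
                             bool with_diag, float fill) {
    std::vector<float> dst(src);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            const bool keep = with_diag ? (i >= j) : (i > j);
            if (!keep) {
                dst[i * n + j] = fill;   // MaskedFillScalar on the complement
            }
        }
    }
    return dst;
}
```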
Implement FILL using aclnnInplaceFillScalar to fill a tensor with
a constant scalar value from op_params.

Create diagonal matrix from vector by filling dst with zeros then
copying src onto the diagonal via a strided view with InplaceCopy.

Implement triangular linear system solve (AX=B) using
aclnnTriangularSolve for the lower-triangular, non-unit case.

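For reference, the lower-triangular, non-unit solve LX = B reduces to forward substitution. This host-side sketch (single right-hand-side column) shows the semantics that aclnnTriangularSolve provides on device:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Forward substitution for L x = b, where L is an n x n row-major
// lower-triangular matrix with a nonzero diagonal ("non-unit" means the
// diagonal entries are used, not assumed to be 1).
std::vector<double> solve_lower(const std::vector<double> & L,
                                const std::vector<double> & b) {
    const size_t n = b.size();
    std::vector<double> x(n);
    for (size_t i = 0; i < n; ++i) {
        double s = b[i];
        for (size_t j = 0; j < i; ++j) {
            s -= L[i * n + j] * x[j];   // subtract already-solved components
        }
        x[i] = s / L[i * n + i];        // non-unit: divide by the diagonal entry
    }
    return x;
}
```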
Implement GGML_UNARY_OP_SOFTPLUS using aclnnSoftplus with beta=1.0
and threshold=20.0. This enables hybrid models like Qwen3.5 to run
entirely on the CANN backend without graph splitting, which fixes
graph cache instability caused by the backend scheduler fragmenting
the computation graph when SOFTPLUS falls back to CPU.
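The beta/threshold convention mentioned above (shared by aclnnSoftplus and torch.nn.Softplus) can be stated directly: softplus(x) = (1/beta)·log(1 + exp(beta·x)), with the input passed through unchanged once beta·x exceeds the threshold, where the function is numerically indistinguishable from x. A minimal sketch:

```cpp
#include <cassert>
#include <cmath>

// Softplus with the beta/threshold convention: above the threshold the
// function is effectively linear, so returning x directly avoids computing
// exp() on large inputs (which would overflow in float).
float softplus(float x, float beta = 1.0f, float threshold = 20.0f) {
    if (beta * x > threshold) {
        return x;                                   // linear pass-through region
    }
    return std::log1p(std::exp(beta * x)) / beta;   // (1/beta) * log(1 + e^(beta*x))
}
```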
github-actions bot added labels testing, ggml, and Ascend NPU on Mar 31, 2026
hipudding changed the title from "Qwen35 op" to "CANN: Add suport for Qwen35 ops" on Mar 31, 2026
: type(type), head_count(head_count), head_size(head_size), n_seq_tokens(n_seq_tokens), n_seqs(n_seqs),
v_repeat(v_repeat), permuted(permuted), kda(kda) {}

double max_nmse_err() override {
revert temporary debug changes.
