
CANN: Add support for Qwen35 ops #21204

Draft
hipudding wants to merge 10 commits into ggml-org:master from hipudding:qwen35_op

Conversation


@hipudding hipudding commented Mar 31, 2026

Overview

This PR adds support for several missing operators in the CANN (Ascend NPU) backend for Qwen3.5.

New operators

  • GGML_OP_FILL — fills a tensor with a constant scalar value via aclnnInplaceFillScalar
  • GGML_OP_DIAG — creates a diagonal matrix from a vector using a strided view with InplaceCopy
  • GGML_OP_SOLVE_TRI — triangular linear system solve (AX=B) via aclnnTriangularSolve (lower-triangular, non-unit)
  • GGML_UNARY_OP_SOFTPLUS — implemented via aclnnSoftplus (beta=1.0, threshold=20.0)
  • GGML_OP_CUMSUM — cumulative sum via aclnnCumsum
  • GGML_OP_TRI — all 4 tri types (LOWER, LOWER_DIAG, UPPER, UPPER_DIAG) using Tril/MaskedFillScalar to work around CANN sparse-zero bugs
  • GGML_OP_SET — inplace tensor set via aclnnInplaceCopy, modeled after the existing ACC implementation; enables the scheduler to assign SET ops to CANN and avoids cross-device write issues in delta-net hybrid models
  • GGML_OP_GATED_DELTA_NET — recurrent state-space operator using a batched aclnnBatchMatMul approach that groups all H attention heads into a single rank-3 matmul per recurrence step, reducing kernel launches from O(n_seqs × H × n_tokens) to O(n_seqs × n_tokens)
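The FILL-then-strided-copy trick used for GGML_OP_DIAG can be sketched on the host side. This is only an illustration of the indexing, with plain loops standing in for the device-side aclnnInplaceFillScalar and strided-view aclnnInplaceCopy calls:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side sketch of the DIAG approach: zero-fill the n x n destination,
// then copy the n source elements onto the main diagonal by treating dst as
// a flat buffer and stepping with stride n + 1 (i.e. a strided view of the
// diagonal). The function name is illustrative, not the actual backend code.
std::vector<float> diag_from_vector(const std::vector<float> & src) {
    const size_t n = src.size();
    std::vector<float> dst(n * n, 0.0f);   // FILL with the zero scalar
    for (size_t i = 0; i < n; ++i) {
        dst[i * (n + 1)] = src[i];         // strided "view" onto the diagonal
    }
    return dst;
}
```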

Other changes

  • memset_tensor interface — implement ggml_backend_cann_buffer_memset_tensor and wire it into ggml_backend_cann_buffer_interface to ensure correct zero-initialization of cache buffers
  • Graph cache fix — always compare op_params for all ops in the graph cache key, not just a whitelist; previously, ops with differing params could incorrectly match a cached graph
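The graph-cache fix can be illustrated with a simplified key comparison. The names below are hypothetical, not the actual CANN backend code; the point is that op_params must be compared for every op, since two ops that differ only in their parameters (e.g. two FILL ops with different scalar values) must not hit the same cached graph:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// ggml stores op parameters in a small fixed-size array; four entries are
// enough for this sketch.
using op_params_t = std::array<int32_t, 4>;

// Simplified cache-key comparison for one op. After the fix, the raw
// parameter bytes are always compared, not only for a whitelist of op types.
bool ops_match(int op_a, const op_params_t & pa, int op_b, const op_params_t & pb) {
    if (op_a != op_b) {
        return false;
    }
    return std::memcmp(pa.data(), pb.data(), sizeof(op_params_t)) == 0;
}
```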

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure:
    YES — AI tools (Claude) were used in an assistive capacity only. Specifically, AI was used to analyze problems, suggest implementation approaches, and provide explanations of relevant CANN/ACL APIs. All code was written, reviewed line-by-line, and validated by the human contributor, who takes full responsibility for the correctness and design of the changes.

Add ggml_backend_cann_buffer_memset_tensor and wire it into
`ggml_backend_cann_buffer_interface`.

This ensures backend tensor memset operations are supported
and avoids incorrect behavior when tensors need explicit
zero-initialization (e.g. cache buffers).

Add SET operator support using aclnnInplaceCopy, modeled after the
existing ACC implementation. This enables the scheduler to assign
SET ops to CANN when the output tensor resides on device memory,
avoiding cross-device write issues with delta-net hybrid models.

All 12 test-backend-ops SET tests pass (f32/i32, inplace/non-inplace, dim 1/2/3).

Implement GATED_DELTA_NET for the CANN (Ascend NPU) backend using a
batched approach that groups all attention heads into a single 3-D
BatchMatMul per recurrence step, reducing kernel launches from
O(n_seqs × H × n_tokens) to O(n_seqs × n_tokens).

Key design decisions:
- Use aclnnBatchMatMul (rank-3 only) with shape [H, S_v, S_v] to batch
  all H heads together for M×k, outer-product, and M×q steps
- Pre-allocate temporary buffers (g_exp, mk, delta, outer) reused
  across all time steps to avoid per-step allocations
- Support both scalar gate (g shape [1,H]) and KDA per-dim gate
  (g shape [S_v,H]) via appropriate broadcast shapes
- Fall back to naive per-head scalar loop for permuted/GQA/non-F32
  inputs that don't meet batched path requirements
- Relax CANN precision tolerance to 1e-6 in tests to account for
  different FP32 accumulation order in BatchMatMul vs scalar loops
- Remove dead code: _math and _naive variants are no longer needed
- Rename _batched to the public entry point ggml_cann_gated_delta_net
- In supports_op, return false for non-contiguous / GQA / non-F32 cases
  so the framework falls back to CPU instead of running the slow naive path
- The single remaining implementation uses aclnnBatchMatMul over all H
  heads per timestep, reducing kernel launches to O(n_seqs * n_tokens)
- Implement GGML_OP_CUMSUM using aclnnCumsum
- Implement GGML_OP_TRI with all 4 tri types (LOWER, LOWER_DIAG, UPPER, UPPER_DIAG)
  using Tril/MaskedFillScalar approach to work around CANN sparse-zero bugs
- Fix graph cache to always compare op_params for all ops, not just a whitelist

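The Tril/MaskedFillScalar workaround for the tri types can be sketched as a keep-mask over the lower triangle, with the mask's complement filled by a scalar. This host-side version is illustrative only and covers the two lower variants; the upper variants are the mirrored condition:

```cpp
#include <cassert>
#include <vector>

// Lower-triangular masking sketch: with_diag selects the LOWER_DIAG variant
// (keep where i >= j) vs the strict LOWER variant (keep where i > j).
// Positions outside the kept triangle are overwritten with `fill`, mirroring
// a Tril mask followed by MaskedFillScalar on its complement.
std::vector<float> tri_lower(const std::vector<float> & src, int n,
                             bool with_diag, float fill) {
    std::vector<float> dst(src);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            const bool keep = with_diag ? (i >= j) : (i > j);
            if (!keep) {
                dst[i * n + j] = fill;   // MaskedFillScalar on the complement
            }
        }
    }
    return dst;
}
```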
Implement FILL using aclnnInplaceFillScalar to fill a tensor with
a constant scalar value from op_params.

Create diagonal matrix from vector by filling dst with zeros then
copying src onto the diagonal via a strided view with InplaceCopy.

Implement triangular linear system solve (AX=B) using
aclnnTriangularSolve for the lower-triangular, non-unit case.

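For reference, the lower-triangular, non-unit solve LX = B reduces to forward substitution. This host-side sketch (single right-hand-side column) shows the semantics that aclnnTriangularSolve provides on device:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Forward substitution for L x = b, where L is an n x n row-major
// lower-triangular matrix with a nonzero diagonal ("non-unit" means the
// diagonal entries are used, not assumed to be 1).
std::vector<double> solve_lower(const std::vector<double> & L,
                                const std::vector<double> & b) {
    const size_t n = b.size();
    std::vector<double> x(n);
    for (size_t i = 0; i < n; ++i) {
        double s = b[i];
        for (size_t j = 0; j < i; ++j) {
            s -= L[i * n + j] * x[j];   // subtract already-solved components
        }
        x[i] = s / L[i * n + i];        // non-unit: divide by the diagonal entry
    }
    return x;
}
```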
Implement GGML_UNARY_OP_SOFTPLUS using aclnnSoftplus with beta=1.0
and threshold=20.0. This enables hybrid models like Qwen3.5 to run
entirely on the CANN backend without graph splitting, which fixes
graph cache instability caused by the backend scheduler fragmenting
the computation graph when SOFTPLUS falls back to CPU.
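The beta/threshold convention mentioned above (shared by aclnnSoftplus and torch.nn.Softplus) can be stated directly: softplus(x) = (1/beta)·log(1 + exp(beta·x)), with the input passed through unchanged once beta·x exceeds the threshold, where the function is numerically indistinguishable from x. A minimal sketch:

```cpp
#include <cassert>
#include <cmath>

// Softplus with the beta/threshold convention: above the threshold the
// function is effectively linear, so returning x directly avoids computing
// exp() on large inputs (which would overflow in float).
float softplus(float x, float beta = 1.0f, float threshold = 20.0f) {
    if (beta * x > threshold) {
        return x;                                   // linear pass-through region
    }
    return std::log1p(std::exp(beta * x)) / beta;   // (1/beta) * log(1 + e^(beta*x))
}
```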
github-actions bot added labels testing, ggml, and Ascend NPU on Mar 31, 2026
hipudding changed the title from "Qwen35 op" to "CANN: Add suport for Qwen35 ops" on Mar 31, 2026
: type(type), head_count(head_count), head_size(head_size), n_seq_tokens(n_seq_tokens), n_seqs(n_seqs),
v_repeat(v_repeat), permuted(permuted), kda(kda) {}

double max_nmse_err() override {
revert temporary debug changes.
