Skip to content

Add Triton fallback for fused_rope_rms (QKNorm+RoPE)#2227

Open
sunway513 wants to merge 5 commits intoROCm:mainfrom
sunway513:feat/fused-rope-rms-triton-fallback
Open

Add Triton fallback for fused_rope_rms (QKNorm+RoPE)#2227
sunway513 wants to merge 5 commits intoROCm:mainfrom
sunway513:feat/fused-rope-rms-triton-fallback

Conversation

@sunway513
Copy link
Collaborator

Summary

  • Implements fused_rope_rms() as a pure-Triton fallback for the HIP fused QKNorm+RoPE kernel
  • Uncomments the call site in RotaryEmbeddingFusedQKNorm (previously raised NotImplementedError)
  • Exports fused_rope_rms from aiter/__init__.py

Motivation

In CK-free builds (e.g., clean Docker images for faster CI/deployment), the HIP fused kernel fused_qk_norm_rope_cache_quant_shuffle is unavailable. This Triton implementation enables the fused_rope_rms path to work without any HIP/CK dependencies, using existing Triton kernels:

  • rmsnorm_forward_inference for per-head RMSNorm
  • rope_cached_thd_positions_2c_fwd_inplace for RoPE

Changes

  • aiter/rotary_embedding.py: Add fused_rope_rms() function (67 lines), uncomment call site
  • aiter/__init__.py: Export fused_rope_rms

Test plan

  • Verify numerical correctness vs HIP fused kernel on MI300X/MI355X
  • Verify CK-free inference path (companion to ATOM CK-free Docker work)

Implement fused_rope_rms as a standalone function using existing Triton
RMSNorm and RoPE kernels. This unblocks Qwen3-MoE on CK-free builds
where the HIP fused kernel is unavailable.

The fallback applies per-head RMSNorm via reshape [T,H*D]->[T*H,D]
then in-place RoPE, all operating through QKV tensor views.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a Triton-based fallback for the fused QKNorm+RoPE path so CK-free/HIP-kernel-missing builds can still run the fused_rope_rms flow.

Changes:

  • Enables the previously disabled fused_rope_rms call path in RotaryEmbeddingFusedQKNorm.forward().
  • Introduces a new Triton fallback implementation combining per-head RMSNorm + in-place RoPE.
  • Exports fused_rope_rms at the package level and adds GitHub workflows for fork syncing and auto-rebasing PRs.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
aiter/rotary_embedding.py Implements Triton fallback fused_rope_rms() and wires it into the forward path.
aiter/init.py Re-exports fused_rope_rms from the top-level package.
.github/workflows/sync-fork.yml Adds a scheduled/manual workflow to sync a fork’s default branch.
.github/workflows/auto-rebase-prs.yml Adds a workflow to rebase open PR branches after sync workflows complete.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

k_size = num_heads_k * head_size
v_size = num_heads_v * head_size

qkv_2d = qkv.view(num_tokens, q_size + k_size + v_size)
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tensor.view() will throw at runtime if qkv is non-contiguous (which can happen after slicing/transpose or some fused ops). Since this is a fallback path meant to be robust, prefer reshape(...) here (or make qkv contiguous before viewing) to avoid hard failures on valid inputs.

Copilot uses AI. Check for mistakes.
return q, k, v


def fused_rope_rms(
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is exported publicly (via aiter/__init__.py) but its API contract is currently ambiguous: it mutates qkv in-place and returns None, and callers may expect outputs similar to other embedding helpers. Please document the in-place behavior and expected tensor shapes/dtypes in the docstring (and/or consider returning (q, k) or (q, k, v) views explicitly) so external consumers don’t misuse it.

Copilot uses AI. Check for mistakes.
Comment on lines +1334 to +1336
"""Fused QK-RMSNorm + RoPE on packed QKV tensor (in-place).
Triton fallback for the HIP fused kernel.
"""
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is exported publicly (via aiter/__init__.py) but its API contract is currently ambiguous: it mutates qkv in-place and returns None, and callers may expect outputs similar to other embedding helpers. Please document the in-place behavior and expected tensor shapes/dtypes in the docstring (and/or consider returning (q, k) or (q, k, v) views explicitly) so external consumers don’t misuse it.

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +3
name: Sync Fork

on:
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This workflow relies on GITHUB_TOKEN to sync and potentially update the default branch, but it does not declare permissions. On many repos/orgs the default token permissions are read-only, causing gh repo sync to fail. Add an explicit permissions: block (at least contents: write) to ensure the sync can push updates.

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +3
name: Auto-Rebase PRs

on:
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This workflow force-pushes branches and also comments/labels PRs via gh pr comment / gh pr edit, but it does not declare required permissions. Without explicit permissions, pushes and PR edits commonly fail with the default GITHUB_TOKEN. Add workflow/job permissions such as contents: write (push), pull-requests: write (comment), and issues: write (labels).

Copilot uses AI. Check for mistakes.
Copy link
Member

@gyohuangxin gyohuangxin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, are sync-fork.yml and auto-rebase-prs.yml meant for your personal fork? I don't think they should be merged into the upstream repo. Could you drop these two files from the PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants