optimize the fa fwd and bwd kernels#2265

Open
scxiao wants to merge 2 commits into main from scxiao/opt_fa

Conversation


scxiao (Contributor) commented Mar 12, 2026

Motivation

This PR optimizes the Triton flash attention kernels used by the public flash-attn repo. The following changes are made:

  • Add one or two tuning configs for the fwd and bwd kernels, boosting performance by up to 2x for some input shapes
  • Change two kernel arguments to tl.constexpr for the bwd kernel, which further improves performance by 15%
  • Apply XCD remapping to the num_heads dimension for a better L2 cache hit rate
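On chiplet-based AMD GPUs (e.g. MI300X), each XCD has its own L2 cache and the hardware dispatches workgroups round-robin across XCDs, so consecutive program IDs land on different chiplets. The remapping in the last bullet reorders program IDs so that a contiguous range of logical blocks (here, blocks of the same attention head) runs on the same XCD and can share its L2. A minimal host-side sketch of the idea, with hypothetical names (the actual kernel applies this to tl.program_id inside Triton):

```python
def remap_xcd(pid: int, num_pids: int, num_xcds: int = 8) -> int:
    """Remap a hardware program ID to a logical one for L2 locality.

    The scheduler runs hardware pid p on XCD (p % num_xcds) as that
    XCD's (p // num_xcds)-th block. Remapping assigns each XCD a
    contiguous range of logical pids, so blocks that reuse the same
    data (e.g. one head's K/V tiles) stay within one L2.
    Assumes num_pids is divisible by num_xcds for simplicity.
    """
    xcd = pid % num_xcds          # which chiplet this hardware pid runs on
    local = pid // num_xcds       # its position within that chiplet
    pids_per_xcd = num_pids // num_xcds
    return xcd * pids_per_xcd + local

# Example: 8 blocks on 2 XCDs. Logical pids 0-3 all end up on XCD 0,
# logical pids 4-7 on XCD 1, instead of alternating between chiplets.
remapped = [remap_xcd(p, num_pids=8, num_xcds=2) for p in range(8)]
print(remapped)  # [0, 4, 1, 5, 2, 6, 3, 7] -- a permutation of 0..7
```

The remap is a pure permutation of program IDs, so kernel correctness is unaffected; only the placement of blocks relative to the per-XCD L2 caches changes.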

Technical Details

Test Plan

There are existing tests covering these kernels.

Test Result

Submission Checklist

@scxiao scxiao requested a review from a team March 12, 2026 21:36
@github-actions

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

  Label         Tests
  ci:sglang     SGLang integration tests
  ci:atom       ATOM benchmark (DeepSeek-R1 + GPT-OSS)
  ci:multi-gpu  Multi-GPU op tests (8 GPUs)
  ci:vllm       vLLM benchmark
  ci:all        All of the above

Add labels via the sidebar or gh pr edit 2265 --add-label <label>

@scxiao scxiao requested a review from micmelesse March 12, 2026 22:10