[WIP] Enable fp8 attention for triton unified attention #2235
Draft
Conversation
Contributor
🏷️ CI Guide
Runs automatically on every PR:
Extended tests (opt-in via labels):
Contributor
Hello! The current implementation actually covers this case: when q and k/v are all in fp8, the dot product will also be computed in fp8, so no, descaling doesn't apply to that case.
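For intuition, here is a minimal numeric sketch (per-tensor scales `s_q` and `s_k` are assumed, and all names are illustrative, not this PR's kernel). An fp8 × fp8 dot accumulates in higher precision, and both scales are folded in once on the accumulator, so no separate descaling of q or k is needed:

```python
import torch

fp8 = torch.float8_e4m3fn
q = torch.randn(128, 64)
k = torch.randn(256, 64)
# hypothetical per-tensor scales mapping each tensor into fp8 range
s_q = q.abs().max() / torch.finfo(fp8).max
s_k = k.abs().max() / torch.finfo(fp8).max
q8 = (q / s_q).to(fp8)
k8 = (k / s_k).to(fp8)
# emulate an fp8 MMA with fp32 accumulation; apply both scales once
qk = (q8.float() @ k8.float().t()) * (s_q * s_k)
print((qk - q @ k.t()).abs().max())  # small quantization error only
```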
Author
Hi @cagrikymk, in our workload test for gpt-oss-120b we see kernel-level performance improve by 10-15% and end-to-end performance by 2-8%.
Motivation
Triton unified attention already supports fp8 compute, especially when using an fp8 KV cache.
However, because of Q's precision, attention is often computed in fp16 even when the KV cache is in fp8. The fp16 route is computationally inefficient: it upcasts both K and V and then computes attention in fp16 instead of fp8.
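To illustrate the difference, here is a minimal sketch of the two routes (shapes and names are made up, and the matmuls are emulated in fp32 so the snippet runs anywhere; the real kernel does this tile by tile):

```python
import torch

fp8 = torch.float8_e4m3fn
q = torch.randn(128, 64, dtype=torch.float16)
k_cache = torch.randn(256, 64).to(fp8)  # fp8 KV cache
v_cache = torch.randn(256, 64).to(fp8)

# fp16 route: K and V are each upcast before their matmul (two extra
# casts), and both matmuls run in fp16.
k16 = k_cache.to(torch.float16)  # upcast 1
v16 = v_cache.to(torch.float16)  # upcast 2
p = torch.softmax(q.float() @ k16.float().t(), dim=-1)
out_fp16_route = p @ v16.float()

# fp8 route: Q is down-cast once, K and V stay fp8, and on hardware with
# native fp8 MMA both matmuls can consume fp8 inputs directly.
q8 = q.to(fp8)
p8 = torch.softmax(q8.float() @ k_cache.float().t(), dim=-1)
out_fp8_route = p8 @ v_cache.float()
```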
Technical Details
In this PR, we plan to enable fp8 unified attention when both of the conditions below are met.
When both conditions are met, we down-cast Q during the load step, which gives the best performance, as shown in the sketch below.
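A minimal Triton sketch of the load-time down-cast (the kernel, argument names, and per-tensor scaling scheme are hypothetical; the PR's actual unified-attention kernel is more involved, and the fp8 `tl.dot` requires hardware with native fp8 MMA support):

```python
import triton
import triton.language as tl

@triton.jit
def qk_fp8_sketch(q_ptr, k_ptr, out_ptr, q_scale, k_scale,
                  M: tl.constexpr, N: tl.constexpr, D: tl.constexpr):
    offs_m = tl.arange(0, M)
    offs_n = tl.arange(0, N)
    offs_d = tl.arange(0, D)
    # Q arrives in fp16: scale it and down-cast once, at load time.
    q = tl.load(q_ptr + offs_m[:, None] * D + offs_d[None, :])
    q = (q / q_scale).to(tl.float8e4nv)
    # K is already stored as fp8 in the KV cache, so it is consumed as-is.
    k = tl.load(k_ptr + offs_n[:, None] * D + offs_d[None, :])
    # fp8 x fp8 dot with an fp32 accumulator; fold both scales back in once.
    qk = tl.dot(q, tl.trans(k)) * q_scale * k_scale
    tl.store(out_ptr + offs_m[:, None] * N + offs_n[None, :], qk)
```

Casting Q at the load step (rather than in a separate pass) avoids an extra trip through memory, which is where the kernel-level gain comes from.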
Test Plan
Test Result
Submission Checklist