
[Draft] [Preview] Support gfx1201#1681

Draft
tjtanaa wants to merge 19 commits into ROCm:main from EmbeddedLLM:support_gfx1201

Conversation


@tjtanaa tjtanaa commented Dec 18, 2025

Motivation

There has been strong interest in the vLLM community in using AITER Triton kernels on Radeon, and testing has shown significant performance benefits on Radeon GPUs as well. vllm-project/vllm#28649 (comment)

We are working with @hongxiayang on enabling and upstreaming all the tuning configs and tuning scripts so that Radeon is properly supported in AITER.

The work will then be broken down into multiple PRs to upstream to AITER.

Technical Details

Phase 1 (Done)

Tasks

Identify which Triton ops fail on gfx1201.
Run the unit tests using the community patch vllm-project/vllm#28649.

Results are based on commit f4e4188.

All of the important Triton kernels can run on RDNA 4:

  1. test_gemm_a8w8.log — all passed.
  2. test_gemm_a8w8_per_token_scale.log — all passed.
  3. test_gemm_a8w8_block_scale.log — all passed.
  4. test_batched_gemm_a8w8.log — all passed, aside from some OOM cases.
  5. test_batched_gemm_bf16.log — all passed, aside from some OOM cases.
  6. test_moe.log — all passed with only 4/850 failures.
  7. test_unified_attention.log — all passing aside from 41/823 failures (a hardware config issue, fixed in EmbeddedLLM@1574097 on our branch).
  8. test_rmsnorm.log — all passed, with only a very small mismatch and some OOM cases.
  9. test_mha.log — forward pass all working, with some OOM cases.

Phase 2 (Enable GPU Arch on gfx1201)

Tasks:

  1. Add gfx1201 to the supported GPU arch list.
  2. Run all unit tests and make sure the kernels that are important for actual deployments are passing. Fix any issues related to failures.
  3. Add tuning scripts for Radeon GPUs with a proper search space.
  4. Evaluate the performance gain in vLLM.
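Task 3 above calls for tuning scripts with a proper search space for Radeon. As a minimal sketch of what such a search space might look like, the snippet below enumerates candidate kernel configs as plain dicts. All parameter names and value ranges here are illustrative assumptions, not the actual AITER tuning scripts:

```python
import itertools

# Hypothetical search space for tuning a GEMM kernel on RDNA 4 (gfx1201).
# The tile sizes and warp counts are placeholder values for illustration only.
BLOCK_M = [16, 32, 64, 128]
BLOCK_N = [16, 32, 64, 128]
BLOCK_K = [32, 64, 128]
NUM_WARPS = [2, 4, 8]

def build_search_space():
    """Enumerate every candidate kernel config as a plain dict."""
    return [
        {"BLOCK_M": bm, "BLOCK_N": bn, "BLOCK_K": bk, "num_warps": w}
        for bm, bn, bk, w in itertools.product(BLOCK_M, BLOCK_N, BLOCK_K, NUM_WARPS)
    ]

configs = build_search_space()
print(len(configs))  # 4 * 4 * 3 * 3 = 144 candidates
```

A real tuning script would benchmark each candidate against representative problem shapes and persist the best config per shape, which is roughly what the tuning configs mentioned in the Motivation section capture.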

The checklist for this phase is the list of ops to enable.

Current progress:

  • gemm_a16w16 ✅
  • gemm_a8w8_block_scale ✅
  • gmm
  • gemm_a8w8 ✅
  • gemm_a8w8_per_token_scale ✅
  • batched_gemm_a8w8
  • batched_gemm_bf16
  • unified_attention ✅
  • moe
  • rmsnorm
  • mha (forward)
  • gemm_a16w16_gated

Test Plan

Ensure the unit tests can be run.
Ensure the kernels are tuned.

Test Result

Phase 1 (DONE): Test Results of unit tests using the community patch

  1. Run all unit tests and make sure the kernels that are important for actual deployments are passing. Fix any issues related to the failure.

All of the important Triton kernels can run on RDNA 4:

  1. test_gemm_a8w8.log — all passed.
  2. test_gemm_a8w8_per_token_scale.log — all passed.
  3. test_gemm_a8w8_block_scale.log — all passed.
  4. test_batched_gemm_a8w8.log — all passed, aside from some OOM cases.
  5. test_batched_gemm_bf16.log — all passed, aside from some OOM cases.
  6. test_moe.log — all passed with only 4/850 failures.
  7. test_unified_attention.log — all passing aside from 41/823 failures (a hardware config issue, fixed in EmbeddedLLM@1574097 on our branch).
  8. test_rmsnorm.log — all passed, with only a very small mismatch and some OOM cases.
  9. test_mha.log — forward pass all working, with some OOM cases.

Submission Checklist


tjtanaa commented Dec 18, 2025

CC @hongxiayang @mgehre-amd


tjtanaa commented Dec 18, 2025

@valarLip could we get some preliminary thoughts on this? Would there be any concerns with us upstreaming the configs and tuning scripts, and are fixes like this unified attention fix (EmbeddedLLM@1574097) acceptable?
We will do our best to keep the changes as small as possible, since the main AITER repo is designed for Instinct GPUs.

Signed-off-by: Amir Balwel <amoooori04@gmail.com>
Co-authored-by: Jeff Aw <jeffaw99@hotmail.com>

@androiddrew

@tjtanaa if I built vllm/docker/Dockerfile.rocm_base with your branch, what VLLM_ env vars would I need to set to test these kernels?


tjtanaa commented Jan 5, 2026

@androiddrew Currently there are two things that have to happen on the vLLM side to enable the use of these Triton kernels:

  1. All aiter functions in vLLM upstream have safeguard conditions, so non-gfx9 archs cannot trigger AITER kernels.
  2. vLLM has only integrated the HIP, ASM, and CK kernels for most of the ops. The Triton AITER ops have to be integrated separately.
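The gfx9-only safeguard described in point 1 amounts to a check on the GPU arch name. As a minimal sketch (on ROCm builds of PyTorch, `torch.cuda.get_device_properties(0).gcnArchName` reports strings like `"gfx942:sramecc+:xnack-"`), the function names and the `allow_gfx12` opt-in below are hypothetical illustrations, not the exact vLLM code:

```python
def base_arch(gcn_arch_name: str) -> str:
    # Strip feature flags: "gfx1201:sramecc+:xnack-" -> "gfx1201"
    return gcn_arch_name.split(":")[0]

def aiter_enabled(gcn_arch_name: str, allow_gfx12: bool = False) -> bool:
    """Sketch of an arch safeguard; allow_gfx12 is a hypothetical opt-in."""
    arch = base_arch(gcn_arch_name)
    if arch.startswith("gfx9"):
        # Mirrors current upstream behaviour: Instinct (gfx9) archs only.
        return True
    # Hypothetical escape hatch for RDNA 4 once gfx1201 kernels land.
    return allow_gfx12 and arch == "gfx1201"

print(aiter_enabled("gfx942:sramecc+:xnack-"))    # Instinct: enabled
print(aiter_enabled("gfx1201"))                   # Radeon: blocked today
print(aiter_enabled("gfx1201", allow_gfx12=True)) # enabled with the opt-in
```

This is why, as noted above, building the branch alone is not enough: the guards on the vLLM side would also need to admit gfx1201 before the Triton AITER ops can fire.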

