@qianfengz qianfengz commented Dec 24, 2025

About qr_ks_vs_whole_k_prefetch pipeline

The qr_ks_vs_whole_k_prefetch pipeline is mainly intended for situations where the total number of work-groups is not enough to occupy all CUs. When the work-group count is low, using an MTile size (kM0) of 64 rather than 128 improves CU occupancy. With kM0=64, fewer registers are consumed to hold P and O, so enough vgprs are left to prefetch the whole k_tile of the next main-loop iteration, which improves performance compared to the usual kM0=128 approach.
Besides the whole-k-tile prefetch path for kM0=64, the pipeline also has a kM0=128 path, in which half of the n0_loops slices of the k tile are prefetched for the next iteration. The kM0=128 path can be used as a replacement for the qr_ks_vs_async pipeline.

What this PR does

  1. Update the pipeline policy to ensure the best mfma instructions are used on MI350
  2. Add the qr_ks_vs_whole_k_prefetch_trload pipeline instance so that V can be loaded with transposed loads on MI350, avoiding the need for many shuffle instructions
  3. Use an n0_loop to implement Gemm0 instead of the commonly used k0_loop. The n0_loop requires fewer move_tile_window() calls and removes the need to clear_tile(s_acc) in the main loop.
  4. Complete support for naive tile loading of hdim96 and hdim160, i.e. loading hdim96/hdim160 tiles without padding them to hdim128/hdim256
  5. Other fine-grained improvements (e.g. using an explicit partition_index to guarantee warp_id is allocated in a vgpr for store_tile/load_tile to/from an LDS tile_window)

Performance results

  1. For attention shapes that lead to kM0=64, qr_ks_vs_whole_k_prefetch_trload shows much better performance than qr_ks_vs_async_trload on the same case (execution time 41.02 ms with whole_k_prefetch_trload vs. 58.50 ms with async_load)
  2. For attention shapes that lead to kM0=128, qr_ks_vs_whole_k_prefetch_trload shows slightly better performance than qr_ks_vs_async on MI350 (execution time 104.50 ms with whole_k_prefetch_trload vs. 106.50 ms with qr_ks_vs_async), and the two show on-par performance on MI300

Test/Verify

  1. Use the ROCm xformers branch test_whole_k_prefetch_n0loop to test/verify the qr_ks_vs_whole_k_prefetch pipeline, since this pipeline cannot yet be exercised through the ck_tile fmha example
  2. Use the following command-line for building/testing xformers
#> git clone -b test_whole_k_prefetch_n0loop https://github.com/ROCm/xformers
#> git submodule update --init --recursive
#> pip install --no-build-isolation -e ./
#> pytest tests/test_mem_eff_attention.py::test_forward
  3. Any script that runs on xformers can be used to evaluate the qr_ks_vs_whole_k_prefetch pipeline. Use the following two environment variables to switch between pipelines:

#> export FMHA_DISABLE_SPECIAL_TREATMENT=1   # disable the FAV3 and qr_ks_vs_async_trload pipelines
#> export FMHA_DISABLE_ASYNC_PIPELINE=1      # disable the qr_ks_vs_async pipeline

Discussion

… next iteration in the non-whole-k-prefetch path
