Merged
Conversation
…e DSL (Dao-AILab#1858)

* update num_threads based on num wgs
* fix bug when not intra_wg_overlap and not mma_pv_is_rs
Fix CUDA barrier init crash when num_consumers < NumThreadsPerWarpGroup

Previously, integer division caused num_consumer_warpgroups_per_cluster to be 0 when params.num_consumers (e.g., 32) was less than NumThreadsPerWarpGroup (128), leading to a compiler failure during barrier initialization. Changed to round-up division to ensure a minimum value of 1.
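The round-up division in question is the usual integer ceiling trick. A minimal sketch of the idea in Python (the constant mirrors the CUTLASS-side NumThreadsPerWarpGroup; the surrounding names are illustrative, not the repo's code):

```python
NUM_THREADS_PER_WARPGROUP = 128  # NumThreadsPerWarpGroup on Hopper

def ceil_div(a: int, b: int) -> int:
    """Round-up integer division: ceil(a / b) without floating point."""
    return (a + b - 1) // b

# Truncating division: 32 // 128 == 0, so the barrier would be
# initialized with an arrival count of 0, which fails to compile.
assert 32 // NUM_THREADS_PER_WARPGROUP == 0

# Round-up division guarantees a minimum of one warpgroup.
assert ceil_div(32, NUM_THREADS_PER_WARPGROUP) == 1
assert ceil_div(256, NUM_THREADS_PER_WARPGROUP) == 2
```

The same expression works unchanged in CUDA C++ for unsigned operands.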
* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration
* drop 12.4
* fix correct name
* cibuildwheel.yml
* squashed
* fixes
* Fix narrow
* Add TORCH_STABLE_ONLY flag
* new_empty + zero_ --> new_zeros
* revert flash_api.cpp and add flash_api_stable.cpp
* update setup.py
* Only pass TORCH_STABLE_ONLY for stable build
* Address Jane's comments
* > to >=
* fix typo
* Update setup.py
…#1893)

* add spt scheduler for causal bwd determinism
* add new torch check for det hdim 256 to stable api
…recated cutlass.utils.ampere_helpers.SMEM_CAPACITY (Dao-AILab#1906)
Co-authored-by: Jack Zhang <jackzhang@Jacks-MacBook-Pro-4.local>
* Fix C++11 narrowing warnings (treated as errors in strict builds) when initializing at::cuda::CUDAGuard with a non-constant char cast to c10::DeviceIndex (signed char)
* Update flash_api_stable.cpp
* upstream cutlass v4.2.1 csrc
* Refactors to enable FlexAttention
* Thread through the buffers to the score_mod
* add test
* add fastdivmod
* comments
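The `fastdivmod` commit refers to the standard trick of replacing per-element integer division (e.g., when delinearizing a flat block index into batch/head coordinates) with a precomputed multiply-and-shift, as in CUTLASS's FastDivmod. A minimal Python sketch of the technique (class name and bounds are illustrative, not the repo's implementation):

```python
class FastDivmod:
    """Divide/modulo by a fixed divisor via one multiply and a shift.

    Precomputes ceil(2**64 / d); exact for 0 <= n < 2**32 and 1 <= d <= 2**32.
    On a GPU this replaces an expensive integer division in the hot path.
    """

    def __init__(self, divisor: int):
        assert 1 <= divisor <= 2**32
        self.divisor = divisor
        self.magic = (2**64 + divisor - 1) // divisor  # ceil(2**64 / d)

    def divmod(self, n: int):
        assert 0 <= n < 2**32
        q = (n * self.magic) >> 64  # multiply-and-shift instead of division
        return q, n - q * self.divisor

# Exhaustive spot-check against Python's built-in divmod.
fd = FastDivmod(7)
assert all(fd.divmod(n) == divmod(n, 7) for n in range(10_000))
```

The divisor (e.g., the number of heads) is fixed per kernel launch, so the precompute cost is paid once on the host.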
* Fused Bwd (Dao-AILab#137)
* Fused with Good perf and stride fixed: Fix fused bugs, isolate failing case, fix bug, bring back test cases, rm split impl in fused, use exp2 is global variable now, try oom fix, save, make fused the default, limit to reproduce failure, return default to split, fix head size bug, use exp2 back to true
* new grid
* BLK_SLICE_FACTOR = 1
* add tflops
* new commit
* test in parrallel
* strides added by jusson
* disable alibi
* fix bugs again
* default to fused
* add bwd options for varlen
* backend filter
* default to jingning and batch 4
* best fwd config
* fix TRITON_PRINT_AUTOTUNING flag bug
* tune
* Tuning fwd prefill
* add if else
* use flag
* Minor mask fix
* FLIP GRID
* use best config for default
* print when autotuning
* test bfloat16
* fix k and v stride bugs
* skip bfloat16
* test kvpacked
* disable internal tests
* pick default config based on arch
* Add alibi in the new bwd kernel (Dao-AILab#139)
* enable alibi for jinging kernel, match
* save bad configs
* fix alibi and causal bug
* disable autotune by default
* auto tune when benching is good
* set best config
* remove env var
* Update amd_tests.yml
* upgrad to triton==3.3.0
* increase shm
* use 64 x 64 for now
* save
* handle 1d alibi
* Add fp8 to fused kernel (Dao-AILab#140)
* fp8 stuff: find test case, compute delta fp8, basic fp8 config passing, non causal path works
* isolate bad case
* fix fp8 bug
* didnot fix fp8 bug
* back to failing test
* fp8 tests passing
* skip
* skip ref tests

Co-authored-by: Aliasger Zaidy <aliasger.zaidy@amd.com>

* head, seq, batch (Dao-AILab#141)
* Fix keys (Dao-AILab#144)
* save
* rm keys
* fix keys
* use GHA_RENDER_DEVICES
* normal docker
* Pad LSE (Dao-AILab#148)
* add round multiple
* fix fwd
* backward fix
* use rounded lse flag
* passing ROUNDED_LSE
* default is new rounded mode
* rename to fused_atmoics and fused_no_atomics
* add test for torch_compile
* add varlen torch compile test
* add old one kernel for ref
* fix varlen mismatch bug
* fix shape issue in varlen but mismatch
* sync torch compile kernel launch
* simple varlen test
* add debug code
* rm old
* ignore old impls
* DEBUG flag works in interface only
* ref uses the righ shape for lse
* rm oldest bwd kernel
* fix typo
* fix varlen bug
* fix bug. Get info from q for now
* simple shape and stride checkout
* add more tests
* test kvcache
* kvcache safe
* match case
* fix segfault due to bad return_softmax
* run bench
* run seperate for the main functions
* just output benchmark
* default csv format and time stamp files
* non verbsoe bench
* Sliding Window Forward (Dao-AILab#151)
* Compress SWA work: test case set up, debug inputs, add fwd ref, one mask ref fwd, first pass save, ref doesnot work for bigger seqlens, save new version, some causal cases failing, found bad cases, working new attn, new atten works, new attn_fwd works, reorg n_extra_tokens, use seqlen_delta_qk, ref fwd works, add sliding window to bwd ref, test kvcache, decode ref work with everything except sliding window, add debug code for 12 failing sliding window cases for decode, attention_decode_forward_ref_impl mostly works except for alibi, fix alibi in attention_decode_forward_ref_impl, ref works with normal, varlen & kvcache, move stuff around, figure out masking old attn inner, two inner functions, remove load_fn, do Lk - Lq like ref, unify IS_CAUSAL code in epilogue, clean up, add args, rm inference stuff, simplify compute_masking, simpler compute mask, stub out returning front masking variables, remove pointer pass, compute ptrs inloop, compute block min and max window, stub inside inner mask loop, trying to use attn_fwd_mask causes issues, fix compiler bug when front masking, gen specifc types, add sliding window and debug statements, use identity for v, add more taste cases, add comments, save, use k_max_token for clarity, disable debug configs, basic NON-CAUSAL SLIDING WINDOW, non causal sliding window works on the all the shapes, non sliding window working in fwd, clean up fused bwd, seperate old fwd_prefill, move configs to utils.py
* fix bwd ref bug
* skip local cases so that fa output
* no sliding window causal green
* add backward test skip for sliding window
* clean reduce in fwd_kvcache. no is_CASUAL branching
* add kvcache masking
* kvcache working
* fix some bugs in test.py
* clean up
* Fix Device Segfault (Dao-AILab#152)
* Compress segfault work: fix backward segfault, rework offset, ignore .profile, ignore .analysis, save
* assert the kernel launch device and tensor devices are the same
* fix failing asserts
* add asserts to fwd
* Fix SDMASK bug
* Log triton, torch and fa version
* Fix fp8 import issues
* fix docs (Dao-AILab#154)
* Sliding Window block classification logic (Dao-AILab#155)
* add aiter code
* remove aiter stuff
* sliding window non causal masking works
* causal and sliding window block masking
* extract common
* clean up typo
* helper for swa
* ignore .amd
* fix last block bug
* Enable FA V3 (Dao-AILab#157)
* Compress PA work: narrow pa test, ref works on most cases, inplace ref with new_kv, inplace paged attention, add pa ref, save pa, basic paged works, save, fix swa + causal in pa. Also new_kv only on pa path passing, build fa v3, import interface from fa v3, copy fa tests, use v3 api, clean up, rename to match old test, support different head sizes, remove fp8, basisc passing v3 cases, test_flash_attn_varlen_output v3 working, isolate bad case for kvcache, case passing, save, use decode is seqused/cacheseql is given, use decode if not varlen, basci kvcache v3 working, kvcache enable more cases, detect kvcache case if seqused_q is non and sequese_k is not None, skip failing test, find fp8 failing case, mha fp8 works, fix fp8 MQA/GQA bug, clean up, more clean up, clean up more, don't need fp8 dead code, remove train code with fp8 stuff, fp8 working in kvcache, paged + fp8 seems to be working, new_kv allowed
* clean up
* skip hopper race test
* clean up more
* fix paged + alibi
* similar inner paged api
* unify _attn_fwd_inner
* AITER integration (Dao-AILab#159)
* clean up v2 interface
* assert fp8 scale shapes
* rotary working
* move rotary to impl layers
* remove einops
* enable rotarry in v3
* create interface
* fix descale assert
* unify bwd
* lint from aiter
* clean fp8 api
* add api change
* assert shapes for v2
* remove ref and bench.py
* remove metadata class and clean up
* bwd_prefill
* one bwd.py
* rename
* lint
* add bwd_change (Dao-AILab#156)
* Tune FP8 Perf (Dao-AILab#160)
* check cu count for gfx942
* create get_cu_count
* update repo root
* update forward tune
* clean up load
* use float8_e4m3fnuz
* save
* show bwd mode
* recommend fp8
* use torch.float32 for fp8 kernel
* add both best fp16 and fp8 config
* tune fp8 backward
* descale factors should be b, hk
* fp8 bwd working on all primus configs
* tune bwd configs
* fa v3 tests passing
* better warning
* clean up bwd launcher
* v3 passing
* tune more
* improve perf
* clean up
* lint
* clean
* start tuning gfx950
* tune non causal path
* fix bug
* save
* Skip configs where BLOCK_M2 % BLOCK_N2 != 0
* skip more
* stop tuning
* fix varlen bug
* fix dropout & causal/swa segfault
* update the to machine new changes
* save
* fix more bugs
* remove random seed
* clean up
* update readme
* print tensor stats for debug
* disable sliding window tests
* add rdna configs
* fix k partial bug
* fix block_size_n bug
* fix type check bug

Co-authored-by: Aliasger Zaidy <aliasger.zaidy@amd.com>
Co-authored-by: Tianxing Wu <tianxing.wu@amd.com>
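The sliding-window commits above are easiest to follow against a reference mask. This is a hedged Python sketch of causal-plus-window masking with the bottom-right-aligned convention suggested by the commit messages (the function name, the `-1 == unbounded` convention, and the exact alignment are assumptions, not taken from the repo):

```python
def sliding_window_mask(seqlen_q, seqlen_k, window_left, window_right, causal):
    """Boolean mask: mask[i][j] is True where query i may attend to key j.

    Queries are right-aligned against the keys (delta = seqlen_k - seqlen_q),
    matching a bottom-right-aligned causal convention. A window of -1 means
    unbounded on that side.
    """
    delta = seqlen_k - seqlen_q
    mask = [[False] * seqlen_k for _ in range(seqlen_q)]
    for i in range(seqlen_q):
        diag = i + delta  # key position on the "diagonal" for query i
        lo = 0 if window_left < 0 else max(0, diag - window_left)
        hi = seqlen_k - 1 if window_right < 0 else min(seqlen_k - 1, diag + window_right)
        if causal:
            hi = min(hi, diag)  # causal never looks past the diagonal
        for j in range(lo, hi + 1):
            mask[i][j] = True
    return mask

# Causal with window_left=2: each query sees itself and the 2 previous keys.
m = sliding_window_mask(4, 4, window_left=2, window_right=0, causal=True)
assert m[3] == [False, True, True, True]
assert m[0] == [True, False, False, False]
```

The kernel-side "block classification" mentioned above amounts to deciding, per tile, whether this mask is all-True (skip masking), all-False (skip the tile), or mixed (apply the mask).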
A collaboration between Flash-Attention, PyTorch and xFormers is trying to provide pre-built wheels for FA3 across as many platforms/environments as possible (e.g., ARM, Windows, CUDA 13, ...). To simplify the installation workflow, it would help to tag these packages as stable, but the current main version is tagged as beta. FA3 hasn't received substantial updates in a while (the latest was a bugfix almost two months ago), and most new development is happening in FA4. Thus, in this PR, I propose we just claim that the current main version _is_ stable. I have heard concerns that the feature set of FA3 doesn't currently match FA2 (e.g., dropout is missing). I think this concern is partly addressed by the fact that the new wheels will have a different name than the FA2 ones (`flash_attn_3` and `flash_attn` respectively), hence the former does _not_ claim to be a replacement for the latter, and the two can coexist (and they provide different modules).
…ab#2239)

* Add FLASH_ATTENTION_TRITON_AMD_CONFIG_JSON env var support. Allows users to override the Triton config when not autotuning.
* Add FLASH_ATTENTION_TRITON_AMD_CONFIG_JSON to readme
* Rename to FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON
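Using the override could look like the following sketch. The key names inside the JSON (`BLOCK_M`, `num_warps`, etc.) are illustrative guesses at typical Triton launch parameters; the repo's README documents the actual schema:

```python
import json
import os

# Hypothetical config keys -- consult the project's README for the real schema.
config = {"BLOCK_M": 128, "BLOCK_N": 64, "num_warps": 4, "num_stages": 1}

# Set before the Triton forward kernel is launched, so the interface reads
# the override instead of its default (non-autotuned) config.
os.environ["FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON"] = json.dumps(config)

# The kernel launcher would then parse it back along these lines:
loaded = json.loads(os.environ["FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON"])
assert loaded["BLOCK_M"] == 128
```
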
Co-authored-by: Cursor <cursoragent@cursor.com>
The CuTe DSL now supports i64 strides by default.
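Why i64 strides matter: with 32-bit strides, the largest element offset reachable through a tensor's strides must fit in int32, and large batch × seqlen attention tensors exceed that. An illustrative check (the helper name is hypothetical):

```python
INT32_MAX = 2**31 - 1

def needs_i64_strides(shape, strides):
    """True if any element offset reachable via these strides overflows int32.

    With i32 stride arithmetic, the largest linear offset
    sum((dim - 1) * stride) must stay at or below 2**31 - 1.
    """
    max_offset = sum((d - 1) * s for d, s in zip(shape, strides))
    return max_offset > INT32_MAX or any(s > INT32_MAX for s in strides)

# A contiguous (batch, heads, seqlen, head_dim) tensor with 2**32 elements
# overflows i32 offsets, so i64 strides are required:
shape = (8, 32, 128 * 1024, 128)
strides = (32 * 128 * 1024 * 128, 128 * 1024 * 128, 128, 1)  # contiguous
assert needs_i64_strides(shape, strides)
assert not needs_i64_strides((2, 8, 1024, 64), (8 * 1024 * 64, 1024 * 64, 64, 1))
```
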
jmkuebler pushed a commit to jmkuebler/flash-attention that referenced this pull request on Feb 11, 2026
* Enable Fwd and Backward
* Enable fwd and varlen_fwd on AMD (vllm-project#63)
* flash_attn_func works. Compress, this is a combination of 12 commits: add scripts, save, add our kernel, import our kernel, round trip, use bshd layout, figure out segfault fix, show backward failure with prints, save backward work, run forward only, test smallest config on everything, add test fix, remove pre commit, install triton, skip dropout, pin d 32, factor d, just run power of 2, remove timeout, run serially, clean up, clean up 2
* Varlen works. This is a combination of 6 commits: save, some tests passing, enable more, enable everything, move around, alibi works
* keep interface and kernel seperate
* clean up, enable flash_attn_with_kvcache (vllm-project#68)
* Compress kvcache work. This is a combination of 11 commits: kvcache work (a combination of 4 commits: kvcache is not supported, save, save decode, save clean up), merge, save cases, save, save, save, save, key mask on triton side, fix q size issue, test combos, save
* fix causal. use cache_seqlens
* clean and test what works
* some configs work on new_kv but fails on 1,8
* cache overwrite correct
* new_kv works more or less
* test local
* work on paged kv attention
* prefill paged attention
* fix has_batch_idx and skip local and rotatary emb
* save
* save
* save
* save
* handle new_kv when paged kv cache
* all except has_batch_idx works
* major options are green
* test all
* add tests
* save
* clean up
* minor clean up
* simplest config
* save debug true
* save
* refactor slightly
* save work
* need key masking
* force hip
* use is_hip
* save
* fix cache_seq_len issue
* work on new_kv
* pass new_kv data
* save
* benchmark fwd only
* disable debug
* pandas pdf
* save
* set methods
* record number of heads
* use configs
* flexiable dim, n-heads, headofdim
* better benchmarking
* basic inplace update working
* works upto 64
* new_kv supported!
* test case for has_batch_idx
* has_batch_idx works!
* save
* save
* save
* save ref
* fix mqa and gqa by duplicating
* GQA and MQA working by kernel modifications
* fix new_kv with gqa
* cache index
* deal with nans on fwd_splitk
* save
* causal working on basic case
* causal works!
* alibi works!
* clean up
* clean prefill changes
* remove bwd stuff
* limit decode test to test_op_fwd
* add ref
* use bfloat. Fixes after rebase, rebase fixes, deal with kvcache failure, new run for branch, cancel-in-progress, fix varlen_fwd bug
* enable packed layouts and all configs (vllm-project#72)
* Clean up for Upstream (vllm-project#81)
* Clean. This is a combination of 4 commits: clean 1, clean 2, clean more, match main, typo fix
* use is_hip()
* clean up more
* skip odd d only
* fix bug
* skip randomly
* use Flag
* update readme
* remove quantization
* remove bwd
* minor
* print
* remove verbose print
* qunatize zero's out the d stride
* Enable Vanilla Bwd and Refactor (vllm-project#86)
* Vanilla BWD. This is a combination of 79 commits: save test_flash_attn_output, use impl functions, pass layout, add ref, move arround impls, fix stride issue, save oai kernel, add baseline impl, save bwd kernel working, remove old impl, remove block_ptrs from bwd, pass padded dmodel and apply masking (the old test cases work but cases with small d don't work), save, save more prints, rename M to L, save, add notes, add old_bwd back, fa failure fails in kernels too, isolate new bwd and keep old bwd in place, clean up, softmax_lse doesnot match refernce, LOG flag, softmax_lse with LN2, move qk_scale to loop, pass ln2 to fwd, just print kernel input, test softmax output from forward, test exp_scores_triton, save all the ref, create ref USE_EXP2 path, return scores, mask scores when returning them (basic impl test passes), scores and output match, show max_diff, return score needs to be adjusted as we find new maxes, all good outputs, old style RCP2 example, prep bwd_impl test, save, try openai, save, fix softmax_lse bug, test_op_bwd_impl starting to work!, new kernel (exp2 works but exp is faliing), fix bwd exp2, add m and n masks (small cases still don't work), match old and new kernel prints, compare old and new, print inputs, save, old kernel match on dv, dq works, compare to pytorch including softmax in forward, fix bwd impl bug, small sizes in bwd impl work, old bwd test pass (moving on to kernel tests), dq, dk and dv are filled in place if given (need to match cast to match fa), fix non bug, fix dv mismatch (use_exp2 was set to true in fwd), fix case up 128, refactor and clean up a bit, more issue is that dq and dk are not zeros, dq must be zeroed out, ignore segfaults, fa ref and my ref match!, all tests run, use tolerance 1e-3, we need to figure out preprocessing, save, clean up, save, test delta diff, move old impl out, new preprocess function, preprocessing_use_o flag working, _bwd_preprocess_use_p basic cases pass, all green, fwd exp2 usage is done right before exp
* refactor
* refactor 2
* refactor 3
* fix bug
* try ci
* add flag
* rename to utils
* skip test_op_fwd_decode_int4_kv
* reduce head size
* try again
* go back to old head sizes
* Use Strides. This is a combination of 11 commits: use strides in bwd, add layout test in forward, fix shape layout function, smaller tests, save, fix varlen error, no headsize passed to bwd, deal with varlen layout, save, save, save, save
* use gen scripts
* varlen fwd passing
* core fwd ref impl
* fix minor bugs
* wrap varlen-launcher attention_forward_pytorch_ref_impl
* varlen backward ref added
* add offsets for varlen
* fix delta bug
* varlen bwd working
* save
* runs on Mi200
* just test basics
* save
* fix bug
* fix varlen in64 bug
* add ref
* test_impl working with causal
* fix qkvpacked issue
* qkvpacked run tests
* remove test_backward
* save
* just test output
* dump into tensors
* softmaxlse layout for varlen
* small cases working
* bwd thd green, although maybe some oom
* forward out and lse are good. Something wrong with backward ref
* make varlen ref work
* save work, ref is working mostly
* 91 failed, 6542 passed, 6336 skipped, 1 warning
* ref is all green
* debug flag in utils
* found bad softmax_lse in varlen fwd
* fix bug in softmax lse. strides in varlen werenot right
* add causal tests and 32*32 bwd doesnot have segfault
* save
* fix oom by reducing block size for small heads
* bwd ref with causal working
* test impl
* causal test passes
* causal working
* fix tests
* nicer bench
* fix qvpacked error
* fix varlen qvpacked bug
* fix minor bug
* bench prefill and prefill_old using the same script
* autotune configs for fwd
* autotune flag
* clean up decode impl
* clean up
* clean up more
* bench everything by default and return time
* clean up readmes. REBASE: fix interface changes in rebase, rename test to test_flash_attn_triton_amd, REBASE: fix unpad diffs, minor clean up in setup, FLASH_ATTENTION_TRITON_AMD flags, bench fwd and bwd, fix sequence_parallel
* Enable sequence_parallel in bwd (vllm-project#89)
* sequence_parallel working on bwd_impl test
* fix qkv error
* save
* save
* save
* bwd 3 times faster
* clean up
* fix varlen bug
* use copy back dict
* fix qkvpacked bug
* reduce bench sizes
* print copy back
* Autotune off by default (vllm-project#90)
* Autotune off by default
* rework tests
* Update Triton Version (vllm-project#91)
* ignore ck code
* update triton
* update Triton commit readme (vllm-project#92)
* Fix README (vllm-project#96)
* Update README.md
* fix readme
* Enable MQA/GQA in backward (vllm-project#100)
* simple failing test
* ref is working
* fix bug
* save
* find failing case
* fowrad varlen mqa/gqa works
* add mqa configs to bwd test
* varlen bwd ref fixed
* save failing case
* GQA flag
* ones passes
* go back to values
* save
* bhsd working with mqa
* remove repo
* test layouts
* clean up
* test back to normal
* clean up more
* use zeros_like
* zero out
* Added Support for Rotary Positional Embeddings (vllm-project#99)
* feat: added rotary support in kvcache
* confirmed non-fused rotary passes all tests
* add RDNA CI (vllm-project#105)
* Add RDNA CI. This is a combination of 4 commits: try navi, try matrix, small change, try minimal change
* limit navi tests
* stop casting to fp32 which leads to oom on navi
* enable all causal
* revert all causal
* skip compiler bug on navi
* Dropout (vllm-project#101)
* Alex's work. This is a combination of 11 commits: save, fix: dropout=0.0 woorks, feat: dropout restrictions removed (failing tests), test: reduced tests to simple cases, test: failure is due to query + key padding mask NOT varlen itself, feat: varlen dropout fwd passes, fix: varlen bwd dropout works!, test: discovered bwd error for non-dropout cases for large seqlen, save, save, use triton commit 3ca2f498e98ed7249b82722587c511a5610e00c4 -- now batched layout passes
* Almost Everything works. This is a combination of 16 commits (work so far is a combination of 63 commits): pick test case, save philox offsets into metadata, pass offset to ref, common dropout mask, simple droput out mask, start dropout ref, work on returning SD_Mask, next with negative numbers, refernce is working, dropout bwd ref, faling case, transfer rng_state properly, save changes, one dropout mask function, save, save, minizmize diff, save, use torch.where in backward, save, save, save, dk works!, passes, reference is working, TODO: attn_ref is broken, varlen ref working, attn failing case with ones, attn_ref matches (fails with randn), we are seeing failure with large sizes from dv, save, skip attn matrices, compare the masks and find failing case, rm cdiv_fn, put dropout and alibi in common, save, compare masks, save, save, pytorch ref is using tiles, save, save, tl_rand_ref, cache ref dropout mask, new generate_dropout_mask_ref using tiling, issolate failing varlen case, simple dropout loop on k, print rng_outputs, save, fwd kernel works, save, dv passed close to dk, simple ref, save, seperate droped and scaled in ref and triton kernel, ref changes working, delta with dp, find failing dv failures, find failing case due to delta, save, delta from dp working, bwd impl green, enable test fwd, save, save, delete kernels, save, probably mask application mismatch, dump forward dropout, pass dropout mask tensor to bwd_core, different dropout fraction in fwd and bwd, mismatch found on columns greater than 64, fix dropout bug (philox was not offset), run full suite, stop debug and approximate delta, fix drop_mask non issue, skip attn check, clean up common, bad varlen config, fix varlen bug, save
* fix datatype mismatch
* clean up
* use pytorch dropout
* It works on MI300.
* remove _bwd_preprocess_use_p
* fix torch interface bug

Co-authored-by: Alex Kranias <alex.kranias@amd.com>

* fp8 forward (vllm-project#116)
* disable navi
* start test
* test fp16 against fp8
* save scaling code so far
* global scaling
* add per_head_scaling
* dump qk
* save dumping q, k and qk to fp32 tensor
* fix pointer bug
* save reproducer
* dump p and acc
* fp8 working with my debug input
* save
* change api for dequant
* pass descale_p
* clean up
* most working
* save
* save
* varlen half way
* some varlen examples work
* improve varlen debug input
* varlen mostly working
* push working cases
* fix ref bug
* fix backward bug
* fix varlen backward bug
* use descale to set fp8
* check arch fp8 support
* cache arch
* try again
* skip bad config on MI200
* skip decode nan config on MI200
* fix mistake
* skip more
* run full suit
* Update amd_tests.yml
* address comments
* navi ci is broken
* raise error tolerance to 2.5e-1
* target MI300 directly
* show gfx
* try again
* don't fail matrix if one path fails
* try upstream triton
* just get MI300 working
* Fix install bug. This is a combination of 5 commits: try this, use --no-build-isolation, put route at .python, run full suite, remove triton
* run ref on cpu
* move ref test to navi machines
* pin triton
* add bench deps
* Update readme
* Minor fixes (vllm-project#107)
* Clean up. This is a combination of 4 commits: update base image, disable navi for now, all causal seems to work on MI300, skip MI200 causal bugs
* remove MI200 skips
* just run on prs or manually
* add navi back
* try again
* update readme
* mark flakey test
* ref bug
* Performant backward Triton implementation with separated dkdv and dq kernels (Dao-AILab#122)
* added the split file
* overhauled split file, need to add new kernels
* copied triton fa over for reference
* added comments
* preprocess and dkdv done
* fixed dkdv, added dq
* fixed assumption on q, kv length different, run but incorrect
* added standalone test for split bwd kernel
* minor change on the ptr arith
* separated the dkdv and dq kernels
* GQA works now, onto seqlen q != k
* dk,dq working, dv still failing
* fixed the masking and num_step calc, now q==k works
* added debug print with interpreter, might not work entirely w/o next commit
* fixed all issues with q != k
* fixed varlen issue
* fixup on debug print
* fixed dropout, esp w/ varlen
* added USE_EXP2 toggle
* added noncausal kernel
* updated internal test for noncausal and use_exp2
* formatting
* fixed dropout from seed bug
* added envvar USE_SPLIT to toggle btw bwd kernels
* fixed the qkv pack issue and removed hack
* added the split kernel into interface_fa.py
* change USE_SPLIT to USE_SINGLE_BWD_KERNEL to make split default
* removed redundant file
* fixed missing import in test
* fixed import in interface_fa.py
* revert changes in flash_attn_interface.py
* updated strides to adapt to various tensor init shape
* fixed issue that dqkv not zero'd
* disabled the AMD local test
* Quick Fixes (Dao-AILab#124)
* fix fp8 bug
* fix type bug
* forgot nones
* docker file
* reenable gfx1100 ci (vllm-project#121)
* reenable
* randomly sample
* clean up ci
* add pytest-randomly
* try again
* update triton commit (Dao-AILab#128)
* update triton commit
* disable navi
* update base docker image (Dao-AILab#129)
* Rebase to v2.7.4.post1, CI on push to main_perf, fix bugs and update ci
* Clean up README (Dao-AILab#131)
* use triton==3.2.0 (Dao-AILab#132)
* Update README.md (Dao-AILab#134)
* Update README.md
* update second readme
* fp8 backward (vllm-project#119)
* fp8 BWD after figuring out varlen problem. This is a combination of 21 commits (Enable BWD fp8 with split kernel, Enable BWD fp8 with per block scale factors for p and ds, which is itself a combination of 9 and 12 commits): add backward test case, save, clean up, disable ci, lse is good, dv matches, reduce diff, use do fp8 for dv, kinda working, group size is a constexpr, clean up a bit, everything except mqa/gqa works, skip mqa cases, 20 cases have nan on dropout, save what you have, disable tests failing, enable tests, per block descale_p and descale_ds, use max(abs(()), clean up tests a bit more, fix bug, disable ci for now, pass variables, add flags, add alternate path (still need to load descale factors), dv working, dk works, save, add type info for backward, fix DEBUG flag bug, fix bug with backward (normal forward works with dropout; segfault with causal; varlen has some issues, might be related to strides), pass descale strides, test causal, fix causal compiler assert (min head should be 32), remove descale_p, save, explict name as causal, isolate bad case, just run fp8 tests, bench with autotune, min changes, cast_fp8 helper, cast_varlen_to_fp8, save minor, highlight failing configs, increase test cases, mark failing, recategorize misc tests, group failing gqa configs, add more tests, add vis code, min ci changes, dump folder, single image per tensors, add tensor comparison, gen varlen tensor, vis varlen tensors, varlen diff, nice varlen vis, vis function, show seqlen in varlen, add vis_tensors function, simplify, add color bars, rm vis from test, set canvas size, descale values are optional, add ck tests, add flag to build ck, rm ck test, assert requires grad, ensure q, k, and v require gradients, split vis, rm interp, 8k and 300 dpi, slice per page, disable ci for now, add more vis code, tensor per image is better, for vis_close don't vis if no error, also vis all failing varlen tests, varlen failures due to different seqlens, rm vis code
* rm require grad
* decast fp8 for ref input, use fp16 as input, fix minor things, match readme
* disable causal
* fix bug
* pass strides
* DEBUG modes work only with interp
* zero out varlen bwd grads
* zero out everything
* varlen dropout and causal works
* add descale factors to other apis
* save
* unify tests
* add packing flag
* fix copy grad bug
* add types, flags for zeroing tensors and accumlating fp32. This is a combination of 5 commits: extend ci time, clean more, minimize difference, add types, ZERO_TENSORS and ACCUMLATE_FP32 flags
* just pass the output tensors
* accumlate forwad in fp32
* fp8 in and fp8 out
* return descale factors works for out
* start fp8 return for bwd
* return dq, dv, dk descale factors
* save what you have
* custom fp8 api function
* add varlen function
* test backward with varlen
* test fp8
* kv cache fix
* clean up interface
* add packed api
* fix qkv bug
* disable bench
* run big tests at the end
* run in parrallel
* Update utils.py
* Update amd_tests.yml
* add train script
* use local configs for testing
* Casting Kernel (Dao-AILab#130)
* test and bench work compressed: enable more tests, match test, add tests, add more tests, add nightly and do triton 3.2.0, add deps for benching, min diff with og test, reset changes, rm readme changes, reduce splitkv cases, enable deterministic, kvpacked, swap_sq_sk & disable local, bfloat, increase timeout 720, disable kvpacked, skip flaky test, be verbose, skip config with 1 n_groups, use grad strides, rename maxseqlen, and nonvarlen input helper, bench mark api directly, min diffs
* mv test_op_prefill_bwd_split_impl
* save test
* test ir for sanity
* test qkv ir
* use input helper
* kvpacked benching added
* output do from the lower level functions
* clean up packing input changing
* clean up bwd
* add qkv packed
* add causal and dropout as a config
* test all normal configs
* add types
* gen configs
* improve configs
* fix varlen bug
* bench fp8 functions
* combine benches
* add varlen casting triton kernel
* save varlen dataset
* debug new cast
* 2d casting kernel start & fix layout stride issue
* basic cases passing in 2d kernel
* all basic cases working
* everything working
* show correct mode for kvcache
* train non varlen
* update nightly tests
* just latest torch
* help text
* skip new tests for now
* add fns
* match tests to main_perf
* swap_sq_sk = False
* limit to 8 workers
* combine when bench fns are more than 1
* start on expanding casting kernel
* bshd path for casting kernel
* fix casting bshd bug
* casting kernel working
* Update interface_fa.py
* clean up
* run all bench marks
* Update amd_tests.yml
* remove -n 2 from fp8 tests
* fix oom configs
* remove all -n
* Bench (Dao-AILab#135)
* FP8 Bench work: pass fp8 dtype, gen fp8 values, pass descale factors with inputs, start work on fp8 output kernel, output descale_o
* fp8 seems slower
* clean up newer benching code. fp8 is slower
* output markdown and multiple types
* bench all supported_dtypes for function by default
* add dockerignore
* need the .git for submodule update
* ignore training data
* get ready for ck
* forward ck bench working
* triton versus ck works
* tuned triton perf comp
* collect env flags
* bench varlen and kvcache
* function configs
* show relative percentage diff
* postive means triton faster, negative means ck is faster
* save
* add new decode impl with switch flag
* batch 1 and nheads 1 seems to work
* autotune by default
* simple stride calc in old impl
* fixed bug due to strides are bhsd
* rename the dim_k
* clean up
* old path works
* rm block ptrs for q
* rm block_ptrs for k
* rm block_ptrs for v
* rm block_ptrs from o
* disable debug on bench
* clean up
* clean up names
* compute offs_k properly
* pass padded head to reduce kernel
* fix o_mask bug
* rm old impl
* lambda grid
* save final
* ignore git stuff
* add inference params to prefill
* cache seqlens working
* most cases work except newkv
* fix minor bugs when runing fwd and bwd
* check for backend
* don't ignore .git
* add modes
* bench bwd
* add llama configs
* test fwd impl
* run bwd_impl
* move fp8 code
* use Decode kernel for kvcache
* fix fp8 import bug
* fix bug
* add arch in report
* clean up test suite
* fix fp8 typos
* run ci
* add fused kernel
* add one kernel
* update ci and readme
* report ratios and remove split impl test, expand bwd impl test
* use split kernel
* get one kernel working
* use flag to switch bwd mode
* clean up test_ir
* one kernel has its own copy of the bwd kernels
* autotune stub
* pass og metaparams by default
* add autotune configs
* add tuning configs
* update fused kernel code
* use jingning
* no auto tune for bwd
* simpler varlen branching
* fix constexpr bug
* fix varlen fp8
* qkv fp8 working
* fp8 qkv varlen green
* fix bench functions
* pick bench functions
* bench defaults set
* fix bug
* add bench deps
* bench env variations
* per backend env configs
* fix bug
* add improved fused kernel
* fix bug
* final clean up
* Enable Alibi (Dao-AILab#138)
* test alibi
* isolate failure
* simpler test
* clean up alibi
* pass alibi to kernels
* add stub code for actual alibi computation
* add debug input
* clean up ref. Use it to dev alibi first
* add alibi in fwd ref
* save
* use compute_alibi_tensor_ref
* normal fa works with alibi ref
* alibi works on varlen ref
* compare with ref
* clean up ref prints
* fix alibi none issue and use delta do o for ref
* don't use alibi helper
* alibi is green
* run ci
* fix test.py bug and update readme
* min diff

Co-authored-by: Alex Kranias <alex.kranias@amd.com>
Co-authored-by: Jingning Tang <jingning.tang@amd.com>
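The fp8 commits above revolve around per-tensor scale/descale bookkeeping (the descale_q/descale_k/descale_v/descale_p factors threaded through the kernels). A minimal sketch of the idea, assuming a float8_e4m3-style range; the names are illustrative, not the repo's API:

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def compute_descale(x_abs_max: float):
    """Per-tensor fp8 scaling: scale maps x into fp8 range, descale undoes it.

    The kernel multiplies by `scale` before storing to fp8 and multiplies
    matmul results by the corresponding `descale` to recover real magnitudes.
    """
    scale = FP8_E4M3_MAX / x_abs_max if x_abs_max > 0 else 1.0
    descale = 1.0 / scale
    return scale, descale

# Values quantized as x * scale are dequantized with descale:
scale, descale = compute_descale(10.0)
x = 7.5
x_fp8 = x * scale            # what would be stored in fp8 (rounding omitted)
assert abs(x_fp8 * descale - x) < 1e-6
assert abs(10.0 * scale - FP8_E4M3_MAX) < 1e-6
```

The "descale factors should be b, hk" commit above then shapes these scalars into a (batch, kv_heads) tensor so each head carries its own factor.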
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Force-pushed from 67281f0 to fc9e426
Sync with upstream to grab recent commits. Of particular interest is Dao-AILab#2235, which is needed for MLA prefill using FA4