Merged
drisspg added a commit that referenced this pull request (Jan 29, 2026)
stack-info: PR: #2216, branch: drisspg/stack/7
Force-pushed from 7bb4e46 to 7056dc1
drisspg added a commit that referenced this pull request (Jan 29, 2026)
stack-info: PR: #2216, branch: drisspg/stack/7
This was referenced Jan 31, 2026
Force-pushed from 1cc40ae to 90afa4e
drisspg added a commit that referenced this pull request (Feb 4, 2026)
stack-info: PR: #2216, branch: drisspg/stack/7
Force-pushed from 90afa4e to 2c1ec17
Force-pushed from 2c1ec17 to 2968ef1
drisspg added a commit to drisspg/flash-attention that referenced this pull request (Feb 5, 2026)
stack-info: PR: Dao-AILab#2216, branch: drisspg/stack/7
drisspg added a commit to drisspg/flash-attention that referenced this pull request (Feb 5, 2026)
stack-info: PR: Dao-AILab#2216, branch: drisspg/stack/7
Force-pushed from 2968ef1 to 5765608
Member: @drisspg let's bump to cutlass dsl 4.4.0.dev1 and then we can merge
drisspg added a commit to drisspg/flash-attention that referenced this pull request (Feb 8, 2026)
stack-info: PR: Dao-AILab#2216, branch: drisspg/stack/7
Co-authored-by: Cursor <cursoragent@cursor.com>
Force-pushed from 5765608 to d074f85
Author (Collaborator): @tridao this one should be good
Author (Collaborator):
stack-info: PR: #2216, branch: drisspg/stack/7
Co-authored-by: Cursor <cursoragent@cursor.com>
Force-pushed from d074f85 to 0492857
drisspg commented (Feb 8, 2026)
tridao approved these changes (Feb 8, 2026)
LucasWilkinson pushed a commit to vllm-project/flash-attention that referenced this pull request (Feb 11, 2026). The squashed commit message:
* [FIX] Allow m_block_size == 192 and mma_pv_is_rs == False in Sm90 CuTe DSL (Dao-AILab#1858) * update num_threads based on num wgs * fix bug when not intra_wg_overlap and not mma_pv_is_rs * make FA3 compatible with CUDA 13 Builds (Dao-AILab#1860) Fix CUDA barrier init crash when num_consumers < NumThreadsPerWarpGroup Previously, integer division caused num_consumer_warpgroups_per_cluster to be 0 when params.num_consumers (e.g., 32) was less than NumThreadsPerWarpGroup (128), leading to a compiler failure during barrier initialization. Changed to round-up division to ensure a minimum value of 1. * [BUILD] SBSA wheels + CUDA 13 Support (Dao-AILab#1865) * [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration * [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration * [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration * [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration * [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration * drop 12.4 * drop 12.4 * fix correct name * fix correct name * fix correct name * fix correct name * cibuildwheel.yml * benchmark: qualify all attention backends by methods list (Dao-AILab#1881) * ABI stable fa3 (Dao-AILab#1791) * squashed * fixes * fixes * Fix narrow * Add TORCH_STABLE_ONLY flag * new_empty + zero_ --> new_zeros * revert flash_api.cpp and add flash_api_stable.cpp * update setup.py * Only pass TORCH_STABLE_ONLY for stable build * Address Jane's comments * > to >= * [NVIDIA] Enable Blackwell Family Specific (Dao-AILab#1882) * fix typo * Update setup.py * Update setup.py * Update setup.py * Update setup.py * fix typo in flops calculation for local attention (Dao-AILab#1883) * flash-attn-cute bwd sm90 (Dao-AILab#1868) * [Cute] Make testing utils standlone for cute (Dao-AILab#1892) * Bump pin for CuTeDSL (Dao-AILab#1891) * Improve causal backward determinism perf with SPT schedule (Dao-AILab#1893) * add spt scheduler for causal bwd determinism * add new 
torch check for det hdim 256 to stable api * Upgrade to cutlass v4.2.1 (Dao-AILab#1905) * switch to use cutlass.utils.get_smem_capacity_in_bytes instead of deprecated cutlass.utils.ampere_helpers.SMEM_CAPACITY (Dao-AILab#1906) * Add Missing None Gradient in FA3 QKVPacked (Dao-AILab#1908) Co-authored-by: Jack Zhang <jackzhang@Jacks-MacBook-Pro-4.local> * C++11 fix warnings (Dao-AILab#1904) * errors are with C++11 narrowing warnings (treated as errors in strict builds) when initializing at::cuda::CUDAGuard with a non-constant char cast to c10::DeviceIndex (signed char). * errors are with C++11 narrowing warnings (treated as errors in strict builds) when initializing at::cuda::CUDAGuard with a non-constant char cast to c10::DeviceIndex (signed char). * errors are with C++11 narrowing warnings (treated as errors in strict builds) when initializing at::cuda::CUDAGuard with a non-constant char cast to c10::DeviceIndex (signed char). * errors are with C++11 narrowing warnings (treated as errors in strict builds) when initializing at::cuda::CUDAGuard with a non-constant char cast to c10::DeviceIndex (signed char). 
* Update flash_api_stable.cpp * upstream cutlass v4.2.1 csrc * [Cute] Write ex2 emulation in a more readable form * [Cute] Simplify utils.py a bit * [Cute] Remove arith & vector import in utils.py * [CuteDSL] Fix test (Dao-AILab#1925) * Refactors to enable FlexAttention (Dao-AILab#1840) * Refactors to enable FlexAttention * Thread throught the buffers to the score_mod * add-test * add fastdivmod * comments * comments * [Cute] Fix softmax for cutlass-dsl==4.2.1 * [Cute] Fix softmax for fwd_sm100 * [Cute,Bwd] Simplify bwd_preprocessing kernel * [Cute,Fwd,Sm90] Simplify by passing around functions * [Cute,Fwd,Sm90] Simplify score mode by passing around partial fn * [Cute] Optionally dump cubin and sass * [Cute,Fwd,Sm90] Rename m_block_size->tile_m, n_block_size->tile_n * [Cute,Bwd,Sm90] Format file w ruff * [Cute,Bwd,Sm90] Fix bwd dK & dV, more async * [Cute,Bwd,Sm90] Use cp.async.bulk instead of TMA for LSE & dPsum * [Cute,Bwd,Sm90] Use 1 barrier for loading both K & V * [Cute,Bwd,Sm90] Don't clear dK & dV, use zero_init mma flag instead * [Cute,Bwd,Sm90] Use TMA to store dK & dV * [Cute,Bwd,Sm90] Load K together w Q & LSE in the first iteration * [Cute,Sm90] Move gemm helper functions to hopper_helpers.py * Swap masking to not use R2P * Pre-indent to make commit diffs readable * Adding varlen support + tests * Remove self refs in softmax for loop (Dao-AILab#1924) Co-authored-by: Tri Dao <tridao@users.noreply.github.com> * [Cute,Bwd,Sm90] Make postprocessing kernel work * [Cute] Run ruff format on bwd files * [CI] Add pre-commit GH action * [Cute,Bwd,Sm90] Try dO_stage=1, PdS_stage=1 * [Cute,Bwd,Sm90] Make causal work * [Cute,Bwd,Sm90] Implement dQ_swapAB * [Cute,Bwd,Sm90] Implement SdP_swapAB * [AMD] Torch Compile Issues (Dao-AILab#1756) * fix rounding and dropout metdata bug * fix lse shape and bug in interface * return softmax is true * [Cute,Bwd,Sm90] Implement mma_dkv_is_rs * [Cute,Bwd,Sm90] Use block size 80x128 * [CUTE] Enable Pack GQA for score mods 
(Dao-AILab#1937) * Add precommit list and then uncomment in chunks (Dao-AILab#1941) * create list to work through * include ampere * [ROCm] prepare CK sources for pytorch hipify v2 APIs (Dao-AILab#1944) See pytorch/pytorch#151845. pytorch has removed caffe2, but hipify still contained work-arounds for caffe2 vs torch compatibility. As a result of hipify v2 changes, some torch APIs are changing. * [Cute] Add flake8 config file * [Cute,Fwd,Sm90] Load Q & K using the same mbarrier * [Cute,Bwd,Sm90] Use the same producer states if Q_stage == dO_stage * [Cute,Bwd,Sm90] Split sdQaccum layout into 2 warp groups * [Cute,Bwd,Sm90] Implement masking * [Cute,Fwd,Sm100] Parse swizzle from pointer, don't need to pass in * [Cute,Fwd,Sm100] Clean up * [Cute,Fwd,Sm100] Clean up mask * [Cute] Reformat blackwell_helpers.py, block_info.py * [Cute] Format mma_sm100_desc.py, seqlen_info.py * sm100 bwd add kernel and update postprocess mask and barriers (Dao-AILab#1945) * [Cute,Bwd,Sm100] Format flash_bwd_sm100.py and flash_bwd_postprocess * [Cute,Bwd,Sm100] Rename var {m,n}_block_size->tile_{m,n} * [Cute,Bwd,Sm100] Clean up a bit * add barrier module (Dao-AILab#1946) * [Cute,Bwd,Sm100] Have a separate function to set up the mma * [Cute,Bwd,Sm100] Load LSE with cpasync_bulk * [Cute,Bwd,Sm100] Load dPsum with cpasync_bulk * [Cute,Bwd,Sm100] Use copy_utils functions to load Q & dO * [Cute,Bwd,Sm100] Load K & Q, V & dO in the first iteration * [Cute,Bwd,Sm100] Simplify mma by using functools.partial * [Cute,Bwd,Sm100] Don't need q_dk_consumer_state * [Cute,Bwd,Sm100] Simplify dQacc_reduce, don't need mbarrier * [Cute,Bwd,Sm100] Iterate from m_block_min -> m_block_max * [Cute,Bwd,Sm100] Try direct atomicadd rmem -> gmem * [Cute,Bwd,Sm100] Combine pipeline_dK and pipeline_dV into one * [Cute,Bwd,Sm100] All compute warps wait for lse_barrier * [Cute,Bwd,Sm100] sdQaccum doesn't need swizzle * [Cute,Bwd,Sm100] Try gemm_ptx * [Cute,Bwd,Sm100] Clean up compute fn * [Cute,Bwd,Sm100] Combine 
pipeline_S and pipeline_P into 1 * [Cute,Bwd,Sm100] Don't shuffle LSE & dPsum, reduce state variables * [Cute,Bwd,Sm100] Hardcode dS_stage = 1 * [Cute,Bwd,Sm100] Add option for delay tma store * Fix hopper cuda 13 build (Dao-AILab#1949) * [CuteDSL] Fix hash function for cute.jit decorator (Dao-AILab#1953) * Block Sparsity and Flex Attention mask mod support (Dao-AILab#1942) * clean up and rebase for PR * add mask mod tests * add benchmarking files * refactor for better style * remove extraneous csrc * type hint buffers * refactor: order of non/overlap and modify blocksparse producer to agree with dense * change variable name back to buffers * remove unnecessary variable in first_half_block * restore erroneous packgqa deletion * add blocksparsity and mask_mod asserts to interface.py * fix rebase issues * Restore submodule and reset pointer to upstream/main * rename cutlass.const_expr to const_expr * support fully masked m blocks (i.e. skipped tiles) * remove outdated commented code * cutlass v4.3.0 (Dao-AILab#1952) * [Cute,Bwd,Sm100] Use CopyBulkG2SOp copy op instead of calling ptx * [Cute,Bwd,Sm100] More cleanup * [CuTe DSL] Update "buffers" name to "aux_tensors"; fix flex bugs (Dao-AILab#1961) * clean up and rebase for PR * add mask mod tests * add benchmarking files * refactor for better style * remove extraneous csrc * type hint buffers * refactor: order of non/overlap and modify blocksparse producer to agree with dense * change variable name back to buffers * remove unnecessary variable in first_half_block * restore erroneous packgqa deletion * add blocksparsity and mask_mod asserts to interface.py * fix rebase issues * Restore submodule and reset pointer to upstream/main * rename cutlass.const_expr to const_expr * support fully masked m blocks (i.e. 
skipped tiles) * remove outdated commented code * rename buffers -> aux_tensors, fix score_mod test in sm90 fwd * fix mask mod interface issues and tests * remove newline at end of file * format with ruff * format mask & sm100 with ruff * format more files with ruff * format barrier.py with ruff * Fix FA3 segfault with custom CUDA streams in ABI stable build (Dao-AILab#1957) The ABI stable implementation incorrectly used getCurrentStream().id() which returns a StreamId (int64_t) instead of the actual cudaStream_t pointer. Casting an integer ID to a stream pointer caused segmentation faults when using custom CUDA streams. Fixed by using the proper AOTI C API function aoti_torch_get_current_cuda_stream() which returns the actual CUDA stream pointer. * [Cute,Fwd,Sm100] Fix interface w score mod to get it to run * [Cute,Sm100] In gemm ptx, add to base smem_address instead * [Cute,Bwd,Sm100] Make postprocessing work, add interface * [Cute,Bwd,Sm100] Simplify layouts in compute_loop * [Cute,Bwd,Sm100] Causal mask * [Cute,Bwd,Sm100] Enable bwd tests * [Cute,Bwd] Enable bwd benchmarks * [Cute] Add store_shared_remote_fp32x4 util function * [Cute,Bwd,Sm100] Tune registers * [Cute,Sm100] acc_tmem_addr is Int32 instead of constexpr * [Cute,Bwd,Sm100] Reduce sync * [Cute] Change utils.view_transpose back * [Cute,Bwd,Sm100] Remove delay_tma_store option * [Cute,Bwd,Sm100] Implement cluster Co-authored-by: Ted Zadouri <tz6037@princeton.edu> * [Cute] Copy benchmark util functions to cute directory Easier to benchmark without having to install FA2 * [Cute,Bwd,Sm100] Use pipeline class for LSE and dPsum * [Cute,Bwd,Sm100] Remove stage from sK, sV, tP, sdS * [Cute,Bwd,Sm100] Fix wrong LSE and dPsum indexing in load * [Cute] Blocks tweaks (Dao-AILab#1964) * [Cute,Bwd,Sm100] Use TS MMA for dK * [Cute,Blocksparse] Group block sparse input torch tensors * [Cute,Bwd,Sm100] Separate mma_S and mma_dP * [Cute,Bwd,Sm100] Try LPTBwdScheduler * [Cute,Bwd,Sm100] Try separating warps loading Q 
and dO * BlockSparse Tweaks (Dao-AILab#1970) * Tweaks * better errors * Switch to new API * [Cute] Fix main (Dao-AILab#1982) * [Cute,Fwd,Sm100] Implement SplitKV (Dao-AILab#1940) * Implement split KV * Remove modal bench harness * Fixes * [Cute] Extract block-sparse utilities from SM80/90 (Dao-AILab#1984) - Create block_sparse_utils.py with SM80/90 block-sparse logic - Refactor flash_fwd.py to use extracted utilities - Clean up whitespace in block_sparsity.py This extracts the block-sparse consumer loop and related utilities from flash_fwd.py into a reusable module for SM80/90 architectures. * Enable python-3.10+ (Dao-AILab#1998) * [Cute, Bwd, Sm100] Add GQA support (Dao-AILab#2004) * add gqa for sm100 bwd * remove mha guard for test * change to cluster size 1 * [Cute,Fwd,Sm100] fix major regression with split kv (Dao-AILab#2006) * [CuTe DSL] Block sparsity computation kernel (Dao-AILab#1983) * begin block sparsity computation kernel * block sparsity computation kernel and benchmark working * loop range_constexpr * add fast kernel * merge fast and regular kernel * use TensorSSA approach to mask mod * update with OOB check * tests and benchmarks for block sparsity working * remove extraneous files * Revert mask.py to previous state - removing unintended changes from block sparsity work * remove flex attn test stub * add sleeps to benchmark * correct block sparsity benchmark to use torch.compile * Restore missing mask definitions and fix benchmark window_size handling * move benchmarks into new directory * compute_block_sparsity docstring * streamline compute block sparsity benchmark script * [NVIDIA] bump github actions (Dao-AILab#1996) * Update GitHub Actions to use checkout@v5 and setup-python@v6; enhance compute capability support * revert changes * revert * Update publish.yml * Update publish.yml * Update publish.yml * Update publish.yml * cuda-toolkit@v0.2.29 * [Cute,Fwd,Sm100] Support paged attention (Dao-AILab#1999) * modal bench and correctness * implement 
for one thread per row * coalesced(?) gmem loads * use cp async * use 64 threads to load * fill in smem for V * pass tests * fixes * removed extra files * handle V loading for n_block < 0 * Add torch.compile support to flash attention 3 * Don't return mutated variables in mha_bwd * Change fake_check flag to be opt-in; Remove build.sh and remove if-else around `torch.library.custom_op` usage * Remove print statements and update exception message * Fix flash_attn_backward_fake * Add `safe_aot_autograd_check` * Update namespace to flash_attn_3 * Add `flash_attn_forward.register_autograd` * Fix bug in `flash_attn_backward_fake` * Add support and tests for torch.export and aoti_compile_and_package * format code * update flash_api_stable.cpp * Fix flash_api_stable.cpp build * Only run schema_check if dtype is not float8_e4m3fn * Correctly compute kBlockM for sm88/86/80 * Fix bug in boxed_mha_bwd * don't run autograd_check when num_splits > 0 * [Cute] Add block-sparsity support to SM100 (Dao-AILab#1985) - Implement block-sparse attention in flash_fwd_sm100.py - Update interface.py to handle SM100 block size calculations (2x multiplier for m_block_size since 1 CTA handles 2*tile_m rows) - Add mask_mod parameter support in mask.py for block-sparse masking - Add SM100 test fixtures and tile size handling in test_mask_mod.py This enables block-sparsity on SM 10.0 architecture, including mask_mod support and proper block size accounting. 
* [Cute,Sm100,Fwd] use correction warps for epi when not using TMA (Dao-AILab#2014) * use correction warps for epi when varlen (non tma O) * properly enable fallback epilogue for varlen q * fix rebase errors * update tests * Raise TypeError if out is specified when compiling _flash_attn_forward * add fastdivmod for oob reads in mask_mods (Dao-AILab#2020) * add fastdivmod for oob reads in mask_mods * Updates for h100 * don't pass mask_fn to softmax_step generically (Dao-AILab#2026) * swap order of decorators (Dao-AILab#2029) * [Cute,Bwd,Sm100] enable deterministic mode for sm100 bwd and fix race conditions (Dao-AILab#2033) * enable deterministic mode for sm100 bwd and fix race conditions * turn off lpt scheduler for causal * use more regs for reduce when deterministic * make a src for tiled mma dK toggleable parameter, remove smem async fence for lse release * use 100k iterations for default * [NFC] Trivial fix to silence linter (Dao-AILab#1928) Not much to see here, but this causes linter noise * Add LICENSE and AUTHORS to flash_attn/cute (Dao-AILab#2032) * [Cute] Add authors * [Cute,Fwd] enable mask mod without blocksparsity (Dao-AILab#2031) * Bump pin (Dao-AILab#2025) * Bump pin * Swtich to new fastdivmod * cleanup varlen on blackwell * Allow for only cute install * ruff all the smaller files (Dao-AILab#2040) * [Flash] Fix head dim 64 bwd (Dao-AILab#2035) * Add headdim64 tests (Dao-AILab#2041) * [Cute,Bwd,Sm100] Add local for sm100 bwd (Dao-AILab#2046) * add local for sm100 bwd * add deterministic * update tests * ruff files * remove old code * move comment * override window_size = None for causal * revert to fwd test defaults * Add hash attr to shortcut expensive check (Dao-AILab#2048) * [AMD ROCm] Update to latest composable_kernel to improve performance (Dao-AILab#2052) * Update CK and c++ version * update CK * update ck * Update comment to reflect qscale_type in fmha_fwd_traits --------- Co-authored-by: Jeff Huang <chiachi.huang@amd.com> * fixing cute bwd 
func def (Dao-AILab#2056) * Fix use-after-free in FA3 deterministic mode. The pytorch caching allocator actually saves us here, but if you turn it off, then compute-sanitizer will detect this. (Dao-AILab#2063) * [CUTE] Allow grads to be preallocated (Dao-AILab#2065) * [Cute,Fwd] Extend score_mod to variable sequence length (Dao-AILab#2043) * rebase to main * varlen support for score mod * interface change for varlen score mod * implement varlen support for score mod * varlen score mod working; updated tests * modify varlen score mod to use fastdiv_mods updated per sequence * updated test suite * current working state of varlen score mod * refactor varlen score mod tests * fix to transpose * refactor varlen score mod tests; fix bug; clean up varlen score mod application in kernel * refactor test_score_mod.py to use external score mod definition file * update flash_fwd.py for varlen score mod * sm90 varlen score mod working; test revisions * enable packgqa for varlen score mod; set up fastdiv_mod recomputation * update flash_fwd_sm100.py for recomputing fastdiv_mods & format varlen score mod test * Overwrite pack_gqa.py, tile_scheduler.py, and test_flash_attn.py with origin/main versions * rebase to main * fix test rebase artifacts * fix floor_if_packed redundancy * correct sm90 divmods mismatch * revert test_flash_attn to main * add varlen score mod benchmark script * packgqa for varlen (independent of score mod) * rm benchmark from PR * move score mod arg wrapping to utils.py * format with ruff * major refactor: change score_mod signature to accept seqlen_info and update all tests accordingly * reinstate varlen packgqa exclusion checks * move fastdiv_mods recomputation out of apply_score_mod in prep for varlen mask_mod support * remove duplicate fastdiv_mod recomputation * [Fix] fastdiv_mods for paged attn and seqused_* * clean up PR; fix paged_kv varlen for sm90 * update to varlen score mod test script (paged kv) * remove premature seqlen arguments from sm90 
apply_mask_mod * [CUTE] Seeing if tvvm reduces cpu overhead (Dao-AILab#2042) * [FIRST] Fix softcap scoremod kwargs typo. (Dao-AILab#2072) * basics working (Dao-AILab#2070) * Blocksparse impl (Dao-AILab#2085) * Fix IMA in fwd on m boundary (Dao-AILab#2091) * Fix IMA in fwd on m boundary * Fix compeltely OOB loads * Update to dsl 3.4.3 (Dao-AILab#2092) * README for AMD ROCm (Dao-AILab#2068) * readme update for rocm Signed-off-by: seungrok.jung <seungrok.jung@amd.com> * readme update for rocm Signed-off-by: seungrok.jung <seungrok.jung@amd.com> --------- Signed-off-by: seungrok.jung <seungrok.jung@amd.com> * fix shuffle sync for pack gqa epilogue (Dao-AILab#2097) * improve paged cpasync * Enable Thor (Dao-AILab#2108) * [Cute] Add quack as dependency * [Cute,Fwd,Sm90] Change PipelineTMAAsync sublass to signal per warp Previous we signal per warp group, but that makes the code more complicated for a tiny bit of perf gain. * Add pack-gqa support for blcoksparse impl w/ braodcasted H dim (Dao-AILab#2098) * [Cute,Fwd] improved block sparsity (Dao-AILab#2100) * improved block sparsity computation * refactor blocksparsity computation for tvm-ffi * refactor mask mod definitions and tests * refactor of block sparsity and mask mod application; eventually allow varlen * remove fastdivmods from compute block sparsity * remove unnecessary imports * revert to 1-phase block sparsity computation * update bwd kernels to use new AttentionMaskCls api * fix linter error * [Cute] Fix minor lint issue in shuffle_sync * Misc tests that should be xfailed for now (Dao-AILab#2127) * Update cutlass to fix undefined symbol: cuDriverGetVersion. 
(Dao-AILab#2142) * [Cute,Fwd,Sm100] Support `q_stage=1` for inference (Dao-AILab#1993) * use q_stage=1 for split kv * determine q_stage via seqlen_q for sm100 * repurpose softmax1 warps for cp.async load * address comments * [Cute] Fix two tests that were failing (Dao-AILab#2149) * [Cute] Add missing COMPUTE_CAPABILITY definition in test_score_mod.py The paged KV cache tests (test_score_mod_with_paged_kvcache and test_score_mod_with_paged_kvcache_aux_tensors) check COMPUTE_CAPABILITY to skip tests on SM90 since paged KV cache is only supported on SM100. However, the variable was never defined, causing a NameError. This adds the same definition used in test_mask_mod.py: COMPUTE_CAPABILITY = torch.cuda.get_device_capability()[0] * [Cute] Fix missing seqlen_info parameter in mask_mod call The mask_mod call in apply_mask_sm100_transposed was missing the seqlen_info parameter. All mask functions expect the signature: (batch, head, m_idx, n_idx, seqlen_info, aux_tensors) The other two mask_mod calls in the same file correctly pass all 6 arguments, but this one only passed 5, causing: TypeError: cute_ima_mask() missing 1 required positional argument: 'aux_tensors' This fixes test_mask_mod.py::test_mask_mod_ima_partial_block. 
* cleanup * [Cute, Bwd, Sm100] Add varlen for sm100 bwd (Dao-AILab#2150) * varlen bwd with rounded padded offsets * fix mha * change offset mode to round down multiple * enable varlen bwd tests * enable deterministic mode * fix deadlock and switch mha to no postprocess * reenable tests * fix lint error * use head swizzle/spt for deterministic, update tests * change padding offset based on arch * rebase and update interface, tests * add arch dispatch for padded offset q to postprocess * address comments * remove tile sizes from seqlen info class vars * block-sparse backward SM90 (Dao-AILab#2136) * score-mod backward SM90 (Dao-AILab#2137) * [Cute] Clarify and fix subtle cachekey bug (Dao-AILab#2143) * [CUTE][SM100] Fix backward gqa on sm100 post mask-mod semantic change (Dao-AILab#2146) * [CUTE][SM90]Enable pack-gqa with broadcasted maskmods (Dao-AILab#2145) * [CUTE][SM90] GQA backward non deterministic (Dao-AILab#2158) * [Cute,Bwd,Sm100] fix seqused in varlen bwd (Dao-AILab#2167) * fix seqused in varlen bwd * enable store zero for zero len seqused q * [CUTE] Bump cutedsl to 4.3.5 (Dao-AILab#2170) * [Cute,Flex] Add option to create and cache __cute_hash__ (Dao-AILab#2171) * add __cute_hash__ when it doesn't exist to prevent unnecessary future hashing * remove unnecessary reformatting * reinstate changes * [Cute][Flex] Remove no longer needed contig (Dao-AILab#2172) * [Cute] update row_max before safe overwrite for online_softmax (Dao-AILab#2174) * update row_max before safe overwrite * move up row_max_prev * [Cute][Flex] add back in contig (Dao-AILab#2177) * [Cute][Flex]Add pack-gqa divmod (Dao-AILab#2180) * baseline local flops * [Cute,Fwd,Sm100] distributed offset calculation for paged KV (Dao-AILab#2104) * fully shard paged KV address calculation across threads * use t0 indices for static bound checking * increase tiled copy to full KV row * shrink predicate tensor * clarify paged KV divisibility constraints * increase load register allocation * Add R2P dual bound 
masking for local attention Add mask_r2p_dual_bound function using XOR of two bitmasks to efficiently mask elements outside [col_limit_left, col_limit_right) range for SM100 local attention. * remove benchmark result, undo changes to benchmark * Add R2P dual bound masking for local attention Add mask_r2p_dual_bound function using XOR of two bitmasks to efficiently mask elements outside [col_limit_left, col_limit_right) range for SM100 local attention. * switch from xor to mask_right & ~ mask_left * flip in_bound to out_bound * remove zero logic for right_s and left_s * remove 24 clamp * doc * lint * added back clamp to avoid "OverflowError: Python int too large to convert to C long" * add comment * [Cute][Flex] Fix expanded tensor bug (Dao-AILab#2189) * [Cute, SM90] fix fwd varlen Cute implementation bug for H100 (Dao-AILab#2194) * fix * same fix for bwd and SM80 * reduce chance of build oom (Dao-AILab#2079) * [Cute][Flex] Allow q_offset 1 and add block-sizes to disambiguate edge cases (Dao-AILab#2187) * ci: Use 1 ninja job for cu13 (Dao-AILab#2195) Signed-off-by: oliver könig <okoenig@nvidia.com> * Update README to include 'psutil' package as build requirement (Dao-AILab#2210) Added 'psutil' as a build requirement in the README. 
* [Flex][SM100] Replay expand fix on sm100 (Dao-AILab#2209) stack-info: PR: Dao-AILab#2209, branch: drisspg/stack/6 * [DSL] Optionally patch cute-dsl to use system's ptxas * [AMD] Triton Backend for ROCm #3 (Dao-AILab#2178) * Fused Bwd (Dao-AILab#137) * Fused with Good perf and stride fixed Fix fused bugs isolate failing case fix bug bring back test cases rm split impl in fused use exp2 is global variable now try oom fix save make fused the default limit to reproduce failure return default to split fix head size bug use exp2 back to true * new grid * BLK_SLICE_FACTOR = 1 * add tflops * new commit * test in parrallel * strides added by jusson * disable alibi * fix bugs again * default to fused * add bwd options for varlen * backend filter * default to jingning and batch 4 * best fwd config * fix TRITON_PRINT_AUTOTUNING flag bug * tune * Tuning fwd prefill * add if else * use flag * Minor mask fix * FLIP GRID * use best config for default * print when autotuning * test bfloat16 * fix k and v stride bugs * skip bfloat16 * test kvpacked * disable internal tests * pick default config based on arch * Add alibi in the new bwd kernel (Dao-AILab#139) * enable alibi for jinging kernel enable alibi for jinging kernel match * save bad configs * fix alibi and causal bug * disable autotune by default * auto tune when benching is good * set best config * remove env var * Update amd_tests.yml * upgrad to triton==3.3.0 * increase shm * use 64 x 64 for now * save * handle 1d alibi * Add fp8 to fused kernel (Dao-AILab#140) * fp8 stuff find test case compute delta fp8 basic fp8 config passing non causal path works * isolate bad case * fix fp8 bug * didnot fix fp8 bug * back to failing test * fp8 tests passing * skip * skip ref tests --------- Co-authored-by: Aliasger Zaidy <aliasger.zaidy@amd.com> * head, seq, batch (Dao-AILab#141) * Fix keys (Dao-AILab#144) * save * rm keys * fix keys * use GHA_RENDER_DEVICES * normal docker * Pad LSE (Dao-AILab#148) * add round multiple * fix fwd * 
backward fix * use rounded lse flag * passing ROUNDED_LSE * default is new rounded mode * rename to fused_atmoics and fused_no_atomics * add test for torch_compile * add varlen torch compile test * add old one kernel for ref * fix varlen mismatch bug * fix shape issue in varlen but mismatch * sync torch compile kernel launch * simple varlen test * add debug code * rm old * ignore old impls * DEBUG flag works in interface only * ref uses the righ shape for lse * rm oldest bwd kernel * fix typo * fix varlen bug * fix bug. Get info from q for now * simple shape and stride checkout * add more tests * test kvcache * kvcache safe * match case * fix segfault due to bad return_softmax * run bench * run seperate for the main functions * just output benchmark * default csv format and time stamp files * non verbsoe bench * Sliding Window Forward (Dao-AILab#151) * Compress SWA work test case set up debug inputs add fwd ref one mask ref fwd first pass save ref doesnot work for bigger seqlens save new version some causal cases failing found bad cases working new attn new atten works new attn_fwd works reorg n_extra_tokens use seqlen_delta_qk ref fwd works add sliding window to bwd ref test kvcache decode ref work with everything except sliding window add debug code for 12 failing sliding window cases for decode attention_decode_forward_ref_impl mostly works except for alibi fix alibi in attention_decode_forward_ref_impl ref works with normal, varlen & kvcache move stuff around figure out masking old attn inner two inner functions remove load_fn do Lk - Lq like ref unify IS_CAUSAL code in epilogue clean up add args rm inference stuff simplify compute_masking simpler compute mask stub out returning front masking variables remove pointer pass compute ptrs inloop compute block min and max window stub inside inner mask loop trying to use attn_fwd_mask causes issues fix compiler bug when front masking gen specifc types add sliding window and debug statements use identity for v add 
* more test cases
* add comments
* save
* use k_max_token for clarity
* disable debug configs
* basic NON-CAUSAL SLIDING WINDOW
* non causal sliding window works on all the shapes
* non sliding window working in fwd
* clean up fused bwd
* separate old fwd_prefill
* move configs to utils.py
* fix bwd ref bug
* skip local cases so that fa output
* no sliding window causal green
* add backward test skip for sliding window
* clean reduce in fwd_kvcache. no is_causal branching
* add kvcache masking
* kvcache working
* fix some bugs in test.py
* clean up
* Fix Device Segfault (Dao-AILab#152)
* Compress segfault work: fix backward segfault, rework offset, ignore .profile, ignore .analysis, save
* assert the kernel launch device and tensor devices are the same
* fix failing asserts
* add asserts to fwd
* Fix SDMASK bug
* Log triton, torch and fa version
* Fix fp8 import issues
* fix docs (Dao-AILab#154)
* Sliding Window block classification logic (Dao-AILab#155)
* add aiter code
* remove aiter stuff
* sliding window non causal masking works
* causal and sliding window block masking
* extract common
* clean up typo
* helper for swa
* ignore .amd
* fix last block bug
* Enable FA V3 (Dao-AILab#157)
* Compress PA work: narrow pa test, ref works on most cases, inplace ref with new_kv, inplace paged attention, add pa ref, save pa, basic paged works, save, fix swa + causal in pa. Also new_kv only on pa path, passing, build fa v3, import interface from fa v3, copy fa tests, use v3 api, clean up, rename to match old test, support different head sizes, remove fp8, basic passing v3 cases, test_flash_attn_varlen_output v3 working, isolate bad case for kvcache, case passing, save, use decode if seqused / cache_seqlens is given, use decode if not varlen, basic kvcache v3 working, kvcache enable more cases, detect kvcache case if seqused_q is None and seqused_k is not None, skip failing test, find fp8 failing case, mha fp8 works, fix fp8 MQA/GQA bug, clean up, more clean up, clean up more, don't need fp8 dead code, remove train code with fp8 stuff, fp8 working in kvcache, paged + fp8 seems to be working, new_kv allowed
* clean up
* skip hopper race test
* clean up more
* fix paged + alibi
* similar inner paged api
* unify _attn_fwd_inner
* AITER integration (Dao-AILab#159)
* clean up v2 interface
* assert fp8 scale shapes
* rotary working
* move rotary to impl layers
* remove einops
* enable rotary in v3
* create interface
* fix descale assert
* unify bwd
* lint from aiter
* clean fp8 api
* add api change
* assert shapes for v2
* remove ref and bench.py
* remove metadata class and clean up
* bwd_prefill
* one bwd.py
* rename
* lint
* add bwd_change (Dao-AILab#156)
* Tune FP8 Perf (Dao-AILab#160)
* check cu count for gfx942
* create get_cu_count
* update repo root
* update forward tune
* clean up load
* use float8_e4m3fnuz
* save
* show bwd mode
* recommend fp8
* use torch.float32 for fp8 kernel
* add both best fp16 and fp8 config
* tune fp8 backward
* descale factors should be b, hk
* fp8 bwd working on all primus configs
* tune bwd configs
* fa v3 tests passing
* better warning
* clean up bwd launcher
* v3 passing
* tune more
* improve perf
* clean up
* lint
* clean
* start tuning gfx950
* tune non causal path
* fix bug
* save
* Skip configs where BLOCK_M2 % BLOCK_N2 != 0
* skip more
* stop tuning
* fix varlen bug
* fix dropout & causal/swa segfault
* update to the new machine changes
* save
* fix more bugs
* remove random seed
* clean up
* update readme
* print tensor stats for debug
* disable sliding window tests
* add rdna configs
* fix k partial bug
* fix block_size_n bug
* fix type check bug
---------
Co-authored-by: Aliasger Zaidy <aliasger.zaidy@amd.com>
Co-authored-by: Tianxing Wu <tianxing.wu@amd.com>
* fix compute_block_sparsity usage in benchmark_mask_mod (Dao-AILab#2221)
* Fix shared-memory race (Dao-AILab#2229)
* Use TORCH_TARGET_VERSION over TORCH_STABLE_ONLY (Dao-AILab#2155)
* short readme for flex flash (Dao-AILab#2231)
* [FA3] Mark current main version as v3.0.0 stable (Dao-AILab#2223)
  A collaboration between Flash-Attention, PyTorch and xFormers is trying to provide pre-built wheels for FA3 across as many platforms/environments as possible (e.g., ARM, Windows, CUDA 13, ...). To simplify the installation workflow, it would help to tag these packages as stable, but the current main version is tagged as beta. FA3 hasn't received substantial updates in a while (the latest was a bugfix almost two months ago), and most new development is happening in FA4. Thus, in this PR, I propose we just claim that the current main version _is_ stable. I have heard concerns that the feature set of FA3 doesn't currently match FA2 (e.g., dropout is missing). I think this concern is partly addressed by the fact that the new wheels will have a different name than the FA2 ones (`flash_attn_3` and `flash_attn` respectively), hence the former does _not_ claim to be a replacement for the latter, and the two can coexist (and they provide different modules).
* hdim 192 smem fix (Dao-AILab#2235)
* Add `FLASH_ATTENTION_TRITON_AMD_CONFIG_JSON` env var support (Dao-AILab#2239)
  Allows users to override the triton config when not autotuning.
* Add FLASH_ATTENTION_TRITON_AMD_CONFIG_JSON to readme
* Rename to FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON
* [CUTE] Bump to Cutedsl (Dao-AILab#2216)
  Co-authored-by: Cursor <cursoragent@cursor.com>
* pytest-dist round robin to gpus (Dao-AILab#2241)
* [DSL] Replace old fence with cute.arch.fence_view_async_shared()
* [DSL] Replace utils.{fma,mul,add}_packed_f32x2 with cute.arch version
* [DSL] Remove coord_offset_i64, domain_offset_i64, elem_pointer_i64 (cute-dsl now supports i64 strides by default)
* [Sm90] Use functions from quack.sm90_utils
* [DSL] Use cute.arch.warp_reduction_{max,sum}
* [Layout] Use reshape_acc_to_mn and reshape_acc_to_frgA from quack
* [Layout] Use quack.layout_utils.mma_partition_C_vec
* [DSL] Use cute.math.{exp2,log2,log}
* [Layout] Use layout_utils.transpose_view and select from quack
* [Bwd,Sm90] Use quack.copy_utils
* [Bwd,Sm100] Shorten PipelineTmaUmma create
* [Bwd,Sm90] Have score_mod and score_mod_bwd as partial functions
* [DSL] warpgroup_reg_alloc -> setmaxregister_increase
* Fix Hopper tests (Dao-AILab#2242)
---------
Signed-off-by: seungrok.jung <seungrok.jung@amd.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Reuben Stern <107093092+reubenconducts@users.noreply.github.com>
Co-authored-by: Johnny <johnnync13@gmail.com>
Co-authored-by: Johnny <johnnynuca14@gmail.com>
Co-authored-by: Rajesh Shashi Kumar <35628747+rajesh-s@users.noreply.github.com>
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Henry Tsang <henrylhtsang@meta.com>
Co-authored-by: Ted Zadouri <tedzadouri@gmail.com>
Co-authored-by: Driss Guessous <32754868+drisspg@users.noreply.github.com>
Co-authored-by: jayhshah <jayhshah@gmail.com>
Co-authored-by: brandonsun <brandons@nvidia.com>
Co-authored-by: JackCharlesZhang <113156832+JackCharlesZhang@users.noreply.github.com>
Co-authored-by: Jack Zhang <jackzhang@Jacks-MacBook-Pro-4.local>
Co-authored-by: Tri Dao <tridpq@gmail.com>
Co-authored-by: imbr92 <40306754+imbr92@users.noreply.github.com>
Co-authored-by: Kevin Tong <kevin@augmentcode.com>
Co-authored-by: Tri Dao <tridao@users.noreply.github.com>
Co-authored-by: Michael Melesse <micmelesse@gmail.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Kevin Wang <kevmo314@gmail.com>
Co-authored-by: Ted Zadouri <tz6037@princeton.edu>
Co-authored-by: timmy-feng <70349932+timmy-feng@users.noreply.github.com>
Co-authored-by: Guilherme Leobas <guilhermeleobas@gmail.com>
Co-authored-by: Anakin(Yancheng) Zheng <103552181+anakinxc@users.noreply.github.com>
Co-authored-by: Jean-Luc Duprat <jld@acm.org>
Co-authored-by: Markus Hoehnerbach <mhoehnerbach@meta.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: Jeff Huang <chiachi.huang@amd.com>
Co-authored-by: liangel-02 <liangel@meta.com>
Co-authored-by: skarupke <malteskarupke@fastmail.fm>
Co-authored-by: Leo Dong <leodong0315@gmail.com>
Co-authored-by: seungrokj <144636725+seungrokj@users.noreply.github.com>
Co-authored-by: Qi Yuhang <45795032+HydraQYH@users.noreply.github.com>
Co-authored-by: Kareem <81531392+KareemMusleh@users.noreply.github.com>
Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Wang Lecheng <wanglecheng@stu.pku.edu.cn>
Co-authored-by: Aliasger Zaidy <aliasger.zaidy@amd.com>
Co-authored-by: Tianxing Wu <tianxing.wu@amd.com>
Co-authored-by: zhuochen <zhuochen@outlook.com>
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Co-authored-by: Luca Wehrstedt <luca.wehrstedt@gmail.com>
Co-authored-by: Alex Butler <alexheretic@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
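The `FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON` override added in Dao-AILab#2239 can be exercised as below. This is a minimal sketch of setting and reading back a JSON config via the environment; the specific keys shown (`BLOCK_M`, `BLOCK_N`, `num_warps`, `num_stages`) are assumed for illustration and may not match the backend's actual schema; consult the AMD Triton backend README for the supported fields.

```python
import json
import os

# Hypothetical triton config override; the key names are an assumption,
# not taken from the flash-attention source.
config = {"BLOCK_M": 128, "BLOCK_N": 64, "num_warps": 4, "num_stages": 2}

# Export the config as JSON; the backend reads this variable instead of
# autotuning when it is set.
os.environ["FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON"] = json.dumps(config)

# Simulate what the backend would do at kernel-launch time: parse it back.
loaded = json.loads(os.environ["FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON"])
print(loaded)
```

In practice the variable would be set in the shell before launching the workload rather than from Python.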
Stacked PRs:
[CUTE]Bump to Cutedsl
NOT FOR LAND YET -> should wait until the 4.4.0 release
Going to work on dynamic CLC scheduler;
requires: Dao-AILab/quack#70