
[Cute] Bump pin for CuTeDSL #1891

Merged
tridao merged 1 commit into Dao-AILab:main from drisspg:update-to-4.2.0
Sep 17, 2025

Conversation

@drisspg (Collaborator) commented Sep 16, 2025

Summary

Without these changes I was getting

  DSLRuntimeError: 💥💥💥 Error during runtime code generation for function `__call__` 💥💥💥
    Caused exception: cannot access local variable 'acc_S_mn' where it is not associated with a value

and the dynamic protocol warnings for softmax: https://docs.nvidia.com/cutlass/media/docs/pythonDSL/cute_dsl_general/dsl_jit_arg_generation.html#direct-protocol-implementation-in-custom-types
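For context, that error class is the ordinary Python unbound-local pattern surfacing through the DSL's tracing: a name assigned on only one branch of a conditional is read on a path where it was never bound. A minimal, hypothetical sketch in plain Python (not the actual kernel code):

```python
def softmax_scores(qk, apply_mask):
    # Name bound only on one path: reading it on the other path raises
    # "cannot access local variable 'acc_S_mn' where it is not associated
    # with a value" (the Python 3.11+ UnboundLocalError message).
    if apply_mask:
        acc_S_mn = qk * 0.5
    return acc_S_mn  # fails when apply_mask is False


def softmax_scores_fixed(qk, apply_mask):
    # Fix: bind the name on every path before it is read.
    acc_S_mn = qk
    if apply_mask:
        acc_S_mn = qk * 0.5
    return acc_S_mn
```

Under tracing, such a branch may be taken differently than at plain-Python time, which is why the error only appears during runtime code generation.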

[Screenshot of the error, 2025-09-16 8:03 PM]

@drisspg drisspg force-pushed the update-to-4.2.0 branch 3 times, most recently from 7448198 to 9671234 on September 16, 2025 at 18:50
@drisspg drisspg changed the title Bump pin for CuTeDSL [Not ready] Bump pin for CuTeDSL Sep 16, 2025
@drisspg drisspg changed the title [Not ready] Bump pin for CuTeDSL [Not ready] [CUTE] Bump pin for CuTeDSL Sep 16, 2025
@drisspg drisspg changed the title [Not ready] [CUTE] Bump pin for CuTeDSL [Not ready] [Cute] Bump pin for CuTeDSL Sep 16, 2025
@drisspg drisspg force-pushed the update-to-4.2.0 branch 3 times, most recently from 4b7e459 to d1547dc on September 16, 2025 at 20:32
@drisspg drisspg changed the title [Not ready] [Cute] Bump pin for CuTeDSL [Cute] Bump pin for CuTeDSL Sep 16, 2025
  )
  c = 0
  col_limit_transformed = 0
  ncol: cute.Constexpr = 0
@drisspg (Collaborator, Author):

this is the only one that feels kinda sketch..

Contributor:

Hi @drisspg

This should already be fixed with the 4.2.1 release

@tridao tridao merged commit 589cc20 into Dao-AILab:main Sep 17, 2025
LucasWilkinson added a commit to vllm-project/flash-attention that referenced this pull request Jan 29, 2026
* Remove old xentropy kernel

This hasn't been used since 2023-09

* Remove old fused softmax kernel from apex/Megatron

* Remove old attn decode kernel from FasterTransformer

* Remove old rotary kernel

* [Cute] Implement page table with TMA for fwd_sm100

* [Cute] Remove trailing bracket (Dao-AILab#1809)

This fixes Commit 81cdf4c

* [Cute] Make sure R2P happen

* feat: add support for pytorch2.8 (Dao-AILab#1801)

* [Cute] Implement PackGQA with TMA for fwd_sm100

Credit: Jay Shah's idea

* Bump to v2.8.3

* [BugFix] Fix flash_attn_with_kvcache with scalar cache_seqlen (Dao-AILab#1795)

When the parameter `cache_seqlen` is a scalar, it should be expanded to a
vector of shape (batch_size).  In the original code, whenever `block_table`
is used, the shape of `k_cache` is (num_blocks, page_size, ...), so
`cache_seqlen` was expanded to shape (num_blocks) instead of
(batch_size), which is wrong.  This fix uses the shape of `q`, which
always has leading dimension `batch_size`.
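The fix described above can be sketched in plain Python (hypothetical helper name; lists stand in for tensors, and shapes are passed explicitly):

```python
def expand_cache_seqlen(cache_seqlen, q_shape, k_cache_shape):
    # cache_seqlen may be a scalar; it must become a per-batch vector.
    # The buggy version used k_cache_shape[0], which is num_blocks (not
    # batch_size) whenever block_table / paged KV is in use.
    batch_size = q_shape[0]  # q's leading dimension is always batch_size
    if isinstance(cache_seqlen, int):
        cache_seqlen = [cache_seqlen] * batch_size
    return cache_seqlen
```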

* [Cute] Port fwd_combine kernel from C++ to cute-dsl

* [Cute] Simplify tile scheduler storing params

* [Cute] Implement sink for fwd_sm90

* [Cute] Implement PackGQA with TMA for fwd_sm90

* [Cute] Use R2P for masking in fwd_sm90

Actually doesn't seem to make it faster

* Add sorting and head swizzle to varlen scheduler (Dao-AILab#1823)

* use LPT order in varlen kernel

* add prefill decode benchmark script

* add sort in prepare

* add full implementation:

* add varlen kvhead swizzle

* add settings for swizzle ablation

* add correction term for sort when causal

* remove ablation options from frontend and clean up comments

* add comments in prepare kernel

* remove debug code and scripts

* put back defaults in tests

* remove excess Nones returned in python interface for varlen

* revert opinionated change to setup.py on cuda version 12.9

* force inline sort op and make east const

* more templating in varlen scheduler to cure some register spilling

* fix exploding build by splitting compilation and add qol macros for hdimdiff

* fix metadata mismatch with seqlenk in test script

* extend prepare kernel to >992 batches and always call it for varlen

* do inter-batch sort per every 992 batches

* better names in combine and fix prepare condition in api

* Fixes incorrect variable reference in comment (Dao-AILab#1775)

Corrects comment documentation to reference total_q instead of total_k for the output tensor dimensions, ensuring consistency with the actual parameter being described.

* Update the initialization of dk/dv_semaphore (Dao-AILab#1839)

When testing the deterministic option for the GQA case, we found it fell into a deadlock. Initializing dk and dv_semaphore to zeros fixes this issue.

* Update tile_scheduler.hpp (Dao-AILab#1841)

* ci: Move build job to workflow template (Dao-AILab#1835)

* ci: Move build job to workflow template

Signed-off-by: oliver könig <okoenig@nvidia.com>

* check out right tag

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* revert

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>

* ci: Build via workflow template (Dao-AILab#1844)

* ci: Move build job to workflow template

Signed-off-by: oliver könig <okoenig@nvidia.com>

* check out right tag

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* revert

Signed-off-by: oliver könig <okoenig@nvidia.com>

* ci: Allow build/deploy of arbitrary configurations (Dao-AILab#1827)

* ci: Allow build/deploy of arbitrary configurations

Signed-off-by: oliver könig <okoenig@nvidia.com>

* add

Signed-off-by: oliver könig <okoenig@nvidia.com>

* cleanup

Signed-off-by: oliver könig <okoenig@nvidia.com>

* cxx11_abi

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* final

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>

* upload

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>

* ci: Switch to workflow_dispatch (Dao-AILab#1847)

* [`FA3`] Allow returning LSE via kwarg (Dao-AILab#1851)

* lse output

* style

* style

* revert test changes, introduce optional kwarg to output lse

* [BugFix] fix flash_fwd.FlashAttentionForwardSm80  bugs (Dao-AILab#1856)

* [BugFix] fix softcap condition

softcap should only be referenced when it is not None; currently the logic is reversed and will result in an error

* [BugFix] fix sm80 cuteDSL error

1. The current condition on softcap is wrong and will result in a RuntimeError. Change the code to align with sm_100.
2. Make window_size_left and window_size_right optional to align with sm_100 and all other interfaces.

* Fix typo of range_constexpr

* Fix seqlen
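The corrected softcap guard can be sketched as follows (a minimal sketch assuming the standard `softcap * tanh(score / softcap)` form; variable names are illustrative, not the kernel's):

```python
import math


def apply_softcap(scores, softcap=None):
    # Correct guard: only touch softcap when it is not None.
    # (The buggy version had the condition reversed, dereferencing
    # softcap exactly when it was None.)
    if softcap is not None:
        scores = [softcap * math.tanh(s / softcap) for s in scores]
    return scores
```

The tanh form bounds every score to (-softcap, softcap) while leaving small scores nearly unchanged.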

* [FIX] Allow m_block_size == 192 and mma_pv_is_rs == False in Sm90 CuTe DSL (Dao-AILab#1858)

* update num_threads based on num wgs

* fix bug when not intra_wg_overlap and not mma_pv_is_rs

* make FA3 compatible with CUDA 13 Builds (Dao-AILab#1860)

Fix CUDA barrier init crash when num_consumers < NumThreadsPerWarpGroup

Previously, integer division caused num_consumer_warpgroups_per_cluster to be 0
when params.num_consumers (e.g., 32) was less than NumThreadsPerWarpGroup (128),
leading to a compiler failure during barrier initialization. Changed to round-up
division to ensure a minimum value of 1.
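The fix described above is the standard ceil-division idiom; a minimal sketch (the constant comes from the commit message, the function name is hypothetical):

```python
NUM_THREADS_PER_WARPGROUP = 128


def num_consumer_warpgroups(num_consumers):
    # Buggy: plain integer division yields 0 for num_consumers < 128,
    # which made barrier initialization fail at compile time.
    # Fixed: round-up division guarantees a minimum of 1.
    return (num_consumers + NUM_THREADS_PER_WARPGROUP - 1) // NUM_THREADS_PER_WARPGROUP
```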

* [BUILD] SBSA wheels + CUDA 13 Support (Dao-AILab#1865)

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* drop 12.4

* drop 12.4

* fix correct name

* fix correct name

* fix correct name

* fix correct name

* cibuildwheel.yml

* benchmark: qualify all attention backends by methods list (Dao-AILab#1881)

* ABI stable fa3 (Dao-AILab#1791)

* squashed

* fixes

* fixes

* Fix narrow

* Add TORCH_STABLE_ONLY flag

* new_empty + zero_ --> new_zeros

* revert flash_api.cpp and add flash_api_stable.cpp

* update setup.py

* Only pass TORCH_STABLE_ONLY for stable build

* Address Jane's comments

* > to >=

* [NVIDIA] Enable Blackwell Family Specific (Dao-AILab#1882)

* fix typo

* Update setup.py

* Update setup.py

* Update setup.py

* Update setup.py

* fix typo in flops calculation for local attention (Dao-AILab#1883)

* flash-attn-cute bwd sm90 (Dao-AILab#1868)

* [Cute] Make testing utils standalone for cute (Dao-AILab#1892)

* Bump pin for CuTeDSL (Dao-AILab#1891)

* Improve causal backward determinism perf with SPT schedule (Dao-AILab#1893)

* add spt scheduler for causal bwd determinism

* add new torch check for det hdim 256 to stable api

* Upgrade to cutlass v4.2.1 (Dao-AILab#1905)

* switch to use cutlass.utils.get_smem_capacity_in_bytes instead of deprecated cutlass.utils.ampere_helpers.SMEM_CAPACITY (Dao-AILab#1906)

* Add Missing None Gradient in FA3 QKVPacked (Dao-AILab#1908)

Co-authored-by: Jack Zhang <jackzhang@Jacks-MacBook-Pro-4.local>

* C++11 fix warnings (Dao-AILab#1904)

* Fix C++11 narrowing warnings (treated as errors in strict builds) when initializing at::cuda::CUDAGuard with a non-constant char cast to c10::DeviceIndex (signed char).

* Fix C++11 narrowing warnings (treated as errors in strict builds) when initializing at::cuda::CUDAGuard with a non-constant char cast to c10::DeviceIndex (signed char).

* Fix C++11 narrowing warnings (treated as errors in strict builds) when initializing at::cuda::CUDAGuard with a non-constant char cast to c10::DeviceIndex (signed char).

* Fix C++11 narrowing warnings (treated as errors in strict builds) when initializing at::cuda::CUDAGuard with a non-constant char cast to c10::DeviceIndex (signed char).

* Update flash_api_stable.cpp

* upstream cutlass v4.2.1 csrc

* [Cute] Write ex2 emulation in a more readable form

* [Cute] Simplify utils.py a bit

* [Cute] Remove arith & vector import in utils.py

* [CuteDSL] Fix test (Dao-AILab#1925)

* Refactors to enable FlexAttention (Dao-AILab#1840)

* Refactors to enable FlexAttention

* Thread throught the buffers to the score_mod

* add-test

* add fastdivmod

* comments

* comments

* [Cute] Fix softmax for cutlass-dsl==4.2.1

* [Cute] Fix softmax for fwd_sm100

* [Cute,Bwd] Simplify bwd_preprocessing kernel

* [Cute,Fwd,Sm90] Simplify by passing around functions

* [Cute,Fwd,Sm90] Simplify score mode by passing around partial fn

* [Cute] Optionally dump cubin and sass

* [Cute,Fwd,Sm90] Rename m_block_size->tile_m, n_block_size->tile_n

* [Cute,Bwd,Sm90] Format file w ruff

* [Cute,Bwd,Sm90] Fix bwd dK & dV, more async

* [Cute,Bwd,Sm90] Use cp.async.bulk instead of TMA for LSE & dPsum

* [Cute,Bwd,Sm90] Use 1 barrier for loading both K & V

* [Cute,Bwd,Sm90] Don't clear dK & dV, use zero_init mma flag instead

* [Cute,Bwd,Sm90] Use TMA to store dK & dV

* [Cute,Bwd,Sm90] Load K together w Q & LSE in the first iteration

* [Cute,Sm90] Move gemm helper functions to hopper_helpers.py

* Swap masking to not use R2P

* Pre-indent to make commit diffs readable

* Adding varlen support + tests

* Remove self refs in softmax for loop (Dao-AILab#1924)

Co-authored-by: Tri Dao <tridao@users.noreply.github.com>

* [Cute,Bwd,Sm90] Make postprocessing kernel work

* [Cute] Run ruff format on bwd files

* [CI] Add pre-commit GH action

* [Cute,Bwd,Sm90] Try dO_stage=1, PdS_stage=1

* [Cute,Bwd,Sm90] Make causal work

* [Cute,Bwd,Sm90] Implement dQ_swapAB

* [Cute,Bwd,Sm90] Implement SdP_swapAB

* [AMD] Torch Compile Issues (Dao-AILab#1756)

* fix rounding and dropout metadata bug

* fix lse shape and bug in interface

* return softmax is true

* [Cute,Bwd,Sm90] Implement mma_dkv_is_rs

* [Cute,Bwd,Sm90] Use block size 80x128

* [CUTE] Enable Pack GQA for score mods (Dao-AILab#1937)

* Add precommit list and then uncomment in chunks (Dao-AILab#1941)

* create list to work through

* include ampere

* [ROCm] prepare CK sources for pytorch hipify v2 APIs (Dao-AILab#1944)

See pytorch/pytorch#151845.
pytorch has removed caffe2, but hipify still contained
work-arounds for caffe2 vs torch compatibility.
As a result of hipify v2 changes, some torch APIs are changing.

* [Cute] Add flake8 config file

* [Cute,Fwd,Sm90] Load Q & K using the same mbarrier

* [Cute,Bwd,Sm90] Use the same producer states if Q_stage == dO_stage

* [Cute,Bwd,Sm90] Split sdQaccum layout into 2 warp groups

* [Cute,Bwd,Sm90] Implement masking

* [Cute,Fwd,Sm100] Parse swizzle from pointer, don't need to pass in

* [Cute,Fwd,Sm100] Clean up

* [Cute,Fwd,Sm100] Clean up mask

* [Cute] Reformat blackwell_helpers.py, block_info.py

* [Cute] Format mma_sm100_desc.py, seqlen_info.py

* sm100 bwd add kernel and update postprocess mask and barriers (Dao-AILab#1945)

* [Cute,Bwd,Sm100] Format flash_bwd_sm100.py and flash_bwd_postprocess

* [Cute,Bwd,Sm100] Rename var {m,n}_block_size->tile_{m,n}

* [Cute,Bwd,Sm100] Clean up a bit

* add barrier module (Dao-AILab#1946)

* [Cute,Bwd,Sm100] Have a separate function to set up the mma

* [Cute,Bwd,Sm100] Load LSE with cpasync_bulk

* [Cute,Bwd,Sm100] Load dPsum with cpasync_bulk

* [Cute,Bwd,Sm100] Use copy_utils functions to load Q & dO

* [Cute,Bwd,Sm100] Load K & Q, V & dO in the first iteration

* [Cute,Bwd,Sm100] Simplify mma by using functools.partial

* [Cute,Bwd,Sm100] Don't need q_dk_consumer_state

* [Cute,Bwd,Sm100] Simplify dQacc_reduce, don't need mbarrier

* [Cute,Bwd,Sm100] Iterate from m_block_min -> m_block_max

* [Cute,Bwd,Sm100] Try direct atomicadd rmem -> gmem

* [Cute,Bwd,Sm100] Combine pipeline_dK and pipeline_dV into one

* [Cute,Bwd,Sm100] All compute warps wait for lse_barrier

* [Cute,Bwd,Sm100] sdQaccum doesn't need swizzle

* [Cute,Bwd,Sm100] Try gemm_ptx

* [Cute,Bwd,Sm100] Clean up compute fn

* [Cute,Bwd,Sm100] Combine pipeline_S and pipeline_P into 1

* [Cute,Bwd,Sm100] Don't shuffle LSE & dPsum, reduce state variables

* [Cute,Bwd,Sm100] Hardcode dS_stage = 1

* [Cute,Bwd,Sm100] Add option for delay tma store

* Fix hopper cuda 13 build (Dao-AILab#1949)

* [CuteDSL] Fix hash function for cute.jit decorator (Dao-AILab#1953)

* Block Sparsity and Flex Attention mask mod support (Dao-AILab#1942)

* clean up and rebase for PR

* add mask mod tests

* add benchmarking files

* refactor for better style

* remove extraneous csrc

* type hint buffers

* refactor: order of non/overlap and modify blocksparse producer to agree with dense

* change variable name back to buffers

* remove unnecessary variable in first_half_block

* restore erroneous packgqa deletion

* add blocksparsity and mask_mod asserts to interface.py

* fix rebase issues

* Restore submodule and reset pointer to upstream/main

* rename cutlass.const_expr to const_expr

* support fully masked m blocks (i.e. skipped tiles)

* remove outdated commented code

* cutlass v4.3.0 (Dao-AILab#1952)

* [Cute,Bwd,Sm100] Use CopyBulkG2SOp copy op instead of calling ptx

* [Cute,Bwd,Sm100] More cleanup

* [CuTe DSL] Update "buffers" name to "aux_tensors"; fix flex bugs (Dao-AILab#1961)

* clean up and rebase for PR

* add mask mod tests

* add benchmarking files

* refactor for better style

* remove extraneous csrc

* type hint buffers

* refactor: order of non/overlap and modify blocksparse producer to agree with dense

* change variable name back to buffers

* remove unnecessary variable in first_half_block

* restore erroneous packgqa deletion

* add blocksparsity and mask_mod asserts to interface.py

* fix rebase issues

* Restore submodule and reset pointer to upstream/main

* rename cutlass.const_expr to const_expr

* support fully masked m blocks (i.e. skipped tiles)

* remove outdated commented code

* rename buffers -> aux_tensors, fix score_mod test in sm90 fwd

* fix mask mod interface issues and tests

* remove newline at end of file

* format with ruff

* format mask & sm100 with ruff

* format more files with ruff

* format barrier.py with ruff

* Fix FA3 segfault with custom CUDA streams in ABI stable build (Dao-AILab#1957)

The ABI stable implementation incorrectly used getCurrentStream().id()
which returns a StreamId (int64_t) instead of the actual cudaStream_t
pointer. Casting an integer ID to a stream pointer caused segmentation
faults when using custom CUDA streams.

Fixed by using the proper AOTI C API function aoti_torch_get_current_cuda_stream()
which returns the actual CUDA stream pointer.

* [Cute,Fwd,Sm100] Fix interface w score mod to get it to run

* [Cute,Sm100] In gemm ptx, add to base smem_address instead

* [Cute,Bwd,Sm100] Make postprocessing work, add interface

* [Cute,Bwd,Sm100] Simplify layouts in compute_loop

* [Cute,Bwd,Sm100] Causal mask

* [Cute,Bwd,Sm100] Enable bwd tests

* [Cute,Bwd] Enable bwd benchmarks

* [Cute] Add store_shared_remote_fp32x4 util function

* [Cute,Bwd,Sm100] Tune registers

* [Cute,Sm100] acc_tmem_addr is Int32 instead of constexpr

* [Cute,Bwd,Sm100] Reduce sync

* [Cute] Change utils.view_transpose back

* [Cute,Bwd,Sm100] Remove delay_tma_store option

* [Cute,Bwd,Sm100] Implement cluster

Co-authored-by: Ted Zadouri <tz6037@princeton.edu>

* [Cute] Copy benchmark util functions to cute directory

Easier to benchmark without having to install FA2

* [Cute,Bwd,Sm100] Use pipeline class for LSE and dPsum

* [Cute,Bwd,Sm100] Remove stage from sK, sV, tP, sdS

* [Cute,Bwd,Sm100] Fix wrong LSE and dPsum indexing in load

* [Cute] Blocks tweaks (Dao-AILab#1964)

* [Cute,Bwd,Sm100] Use TS MMA for dK

* [Cute,Blocksparse] Group block sparse input torch tensors

* [Cute,Bwd,Sm100] Separate mma_S and mma_dP

* [Cute,Bwd,Sm100] Try LPTBwdScheduler

* [Cute,Bwd,Sm100] Try separating warps loading Q and dO

* BlockSparse Tweaks (Dao-AILab#1970)

* Tweaks

* better errors

* Switch to new API

* [Cute] Fix main (Dao-AILab#1982)

* [Cute,Fwd,Sm100] Implement SplitKV (Dao-AILab#1940)

* Implement split KV

* Remove modal bench harness

* Fixes

* [Cute] Extract block-sparse utilities from SM80/90 (Dao-AILab#1984)

- Create block_sparse_utils.py with SM80/90 block-sparse logic
- Refactor flash_fwd.py to use extracted utilities
- Clean up whitespace in block_sparsity.py

This extracts the block-sparse consumer loop and related utilities
from flash_fwd.py into a reusable module for SM80/90 architectures.

* Enable python-3.10+ (Dao-AILab#1998)

* [Cute, Bwd, Sm100] Add GQA support (Dao-AILab#2004)

* add gqa for sm100 bwd

* remove mha guard for test

* change to cluster size 1

* [Cute,Fwd,Sm100] fix major regression with split kv (Dao-AILab#2006)

* [CuTe DSL] Block sparsity computation kernel (Dao-AILab#1983)

* begin block sparsity computation kernel

* block sparsity computation kernel and benchmark working

* loop range_constexpr

* add fast kernel

* merge fast and regular kernel

* use TensorSSA approach to mask mod

* update with OOB check

* tests and benchmarks for block sparsity working

* remove extraneous files

* Revert mask.py to previous state - removing unintended changes from block sparsity work

* remove flex attn test stub

* add sleeps to benchmark

* correct block sparsity benchmark to use torch.compile

* Restore missing mask definitions and fix benchmark window_size handling

* move benchmarks into new directory

* compute_block_sparsity docstring

* streamline compute block sparsity benchmark script

* [NVIDIA] bump github actions (Dao-AILab#1996)

* Update GitHub Actions to use checkout@v5 and setup-python@v6; enhance compute capability support

* revert changes

* revert

* Update publish.yml

* Update publish.yml

* Update publish.yml

* Update publish.yml

* cuda-toolkit@v0.2.29

* [Cute,Fwd,Sm100] Support paged attention (Dao-AILab#1999)

* modal bench and correctness

* implement for one thread per row

* coalesced(?) gmem loads

* use cp async

* use 64 threads to load

* fill in smem for V

* pass tests

* fixes

* removed extra files

* handle V loading for n_block < 0

* Add torch.compile support to flash attention 3

* Don't return mutated variables in mha_bwd

* Change fake_check flag to be opt-in; Remove build.sh and remove if-else around `torch.library.custom_op` usage

* Remove print statements and update exception message

* Fix flash_attn_backward_fake

* Add `safe_aot_autograd_check`

* Update namespace to flash_attn_3

* Add `flash_attn_forward.register_autograd`

* Fix bug in `flash_attn_backward_fake`

* Add support and tests for torch.export and aoti_compile_and_package

* format code

* update flash_api_stable.cpp

* Fix flash_api_stable.cpp build

* Only run schema_check if dtype is not float8_e4m3fn

* Correctly compute kBlockM for sm88/86/80

* Fix bug in boxed_mha_bwd

* don't run autograd_check when num_splits > 0

* [Cute] Add block-sparsity support to SM100 (Dao-AILab#1985)

- Implement block-sparse attention in flash_fwd_sm100.py
- Update interface.py to handle SM100 block size calculations
  (2x multiplier for m_block_size since 1 CTA handles 2*tile_m rows)
- Add mask_mod parameter support in mask.py for block-sparse masking
- Add SM100 test fixtures and tile size handling in test_mask_mod.py

This enables block-sparsity on SM 10.0 architecture, including
mask_mod support and proper block size accounting.

* [Cute,Sm100,Fwd] use correction warps for epi when not using TMA (Dao-AILab#2014)

* use correction warps for epi when varlen (non tma O)

* properly enable fallback epilogue for varlen q

* fix rebase errors

* update tests

* Raise TypeError if out is specified when compiling _flash_attn_forward

* add fastdivmod for oob reads in mask_mods (Dao-AILab#2020)

* add fastdivmod for oob reads in mask_mods

* Updates for h100
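For context, fastdivmod replaces a runtime division by a fixed divisor with a multiply and shift, which is why the kernels precompute it for index math. A hypothetical scalar sketch (not the kernel's actual implementation):

```python
class FastDivmod:
    """Divmod by a fixed divisor via a precomputed multiply + shift."""

    def __init__(self, divisor):
        assert divisor > 0
        self.divisor = divisor
        # magic = ceil(2^32 / divisor); the multiply-shift below then
        # overestimates the quotient by at most 1 for 32-bit inputs.
        self.magic = ((1 << 32) + divisor - 1) // divisor

    def divmod(self, n):
        q = (n * self.magic) >> 32   # approximate quotient, no division
        r = n - q * self.divisor
        if r < 0:                    # correct the possible +1 overestimate
            q -= 1
            r += self.divisor
        return q, r
```

On GPUs this turns per-element integer division (slow) into one multiply, one shift, and a conditional fixup.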

* don't pass mask_fn to softmax_step generically (Dao-AILab#2026)

* swap order of decorators (Dao-AILab#2029)

* [Cute,Bwd,Sm100] enable deterministic mode for sm100 bwd and fix race conditions (Dao-AILab#2033)

* enable deterministic mode for sm100 bwd and fix race conditions

* turn off lpt scheduler for causal

* use more regs for reduce when deterministic

* make a src for tiled mma dK toggleable parameter, remove smem async fence for lse release

* use 100k iterations for default

* [NFC] Trivial fix to silence linter (Dao-AILab#1928)

Not much to see here, but this causes linter noise

* Add LICENSE and AUTHORS to flash_attn/cute (Dao-AILab#2032)

* [Cute] Add authors

* [Cute,Fwd] enable mask mod without blocksparsity (Dao-AILab#2031)

* Bump pin (Dao-AILab#2025)

* Bump pin

* Switch to new fastdivmod

* cleanup varlen on blackwell

* Allow for only cute install

* ruff all the smaller files (Dao-AILab#2040)

* [Flash] Fix head dim 64 bwd (Dao-AILab#2035)

* Add headdim64 tests (Dao-AILab#2041)

* [Cute,Bwd,Sm100] Add local for sm100 bwd (Dao-AILab#2046)

* add local for sm100 bwd

* add deterministic

* update tests

* ruff files

* remove old code

* move comment

* override window_size = None for causal

* revert to fwd test defaults

* Add hash attr to shortcut expensive check (Dao-AILab#2048)

* [AMD ROCm] Update to latest composable_kernel to improve performance (Dao-AILab#2052)

* Update CK and c++ version

* update CK

* update ck

* Update comment to reflect qscale_type in fmha_fwd_traits

---------

Co-authored-by: Jeff Huang <chiachi.huang@amd.com>

* fixing cute bwd func def (Dao-AILab#2056)

* Fix use-after-free in FA3 deterministic mode. The pytorch caching allocator actually saves us here, but if you turn it off, then compute-sanitizer will detect this. (Dao-AILab#2063)

* [CUTE] Allow grads to be preallocated (Dao-AILab#2065)

* [Cute,Fwd] Extend score_mod to variable sequence length (Dao-AILab#2043)

* rebase to main

* varlen support for score mod

* interface change for varlen score mod

* implement varlen support for score mod

* varlen score mod working; updated tests

* modify varlen score mod to use fastdiv_mods updated per sequence

* updated test suite

* current working state of varlen score mod

* refactor varlen score mod tests

* fix to transpose

* refactor varlen score mod tests; fix bug; clean up varlen score mod application in kernel

* refactor test_score_mod.py to use external score mod definition file

* update flash_fwd.py for varlen score mod

* sm90 varlen score mod working; test revisions

* enable packgqa for varlen score mod; set up fastdiv_mod recomputation

* update flash_fwd_sm100.py for recomputing fastdiv_mods & format varlen score mod test

* Overwrite pack_gqa.py, tile_scheduler.py, and test_flash_attn.py with origin/main versions

* rebase to main

* fix test rebase artifacts

* fix floor_if_packed redundancy

* correct sm90 divmods mismatch

* revert test_flash_attn to main

* add varlen score mod benchmark script

* packgqa for varlen (independent of score mod)

* rm benchmark from PR

* move score mod arg wrapping to utils.py

* format with ruff

* major refactor: change score_mod signature to accept seqlen_info and update all tests accordingly

* reinstate varlen packgqa exclusion checks

* move fastdiv_mods recomputation out of apply_score_mod in prep for varlen mask_mod support

* remove duplicate fastdiv_mod recomputation

* [Fix] fastdiv_mods for paged attn and seqused_*

* clean up PR; fix paged_kv varlen for sm90

* update to varlen score mod test script (paged kv)

* remove premature seqlen arguments from sm90 apply_mask_mod

* [CUTE] Seeing if tvm reduces cpu overhead (Dao-AILab#2042)

* [FIRST] Fix softcap scoremod kwargs typo. (Dao-AILab#2072)

* basics working (Dao-AILab#2070)

* Blocksparse impl (Dao-AILab#2085)

* Fix IMA in fwd on m boundary (Dao-AILab#2091)

* Fix IMA in fwd on m boundary

* Fix completely OOB loads

* Update to dsl 3.4.3 (Dao-AILab#2092)

* README for AMD ROCm (Dao-AILab#2068)

* readme update for rocm

Signed-off-by: seungrok.jung <seungrok.jung@amd.com>

* readme update for rocm

Signed-off-by: seungrok.jung <seungrok.jung@amd.com>

---------

Signed-off-by: seungrok.jung <seungrok.jung@amd.com>

* fix shuffle sync for pack gqa epilogue (Dao-AILab#2097)

* improve paged cpasync

* Enable Thor (Dao-AILab#2108)

* [Cute] Add quack as dependency

* [Cute,Fwd,Sm90] Change PipelineTMAAsync subclass to signal per warp

Previously we signaled per warp group, but that made the code more complicated
for a tiny bit of perf gain.

* Add pack-gqa support for blocksparse impl w/ broadcasted H dim (Dao-AILab#2098)

* [Cute,Fwd] improved block sparsity (Dao-AILab#2100)

* improved block sparsity computation

* refactor blocksparsity computation for tvm-ffi

* refactor mask mod definitions and tests

* refactor of block sparsity and mask mod application; eventually allow varlen

* remove fastdivmods from compute block sparsity

* remove unnecessary imports

* revert to 1-phase block sparsity computation

* update bwd kernels to use new AttentionMaskCls api

* fix linter error

* [Cute] Fix minor lint issue in shuffle_sync

* Misc tests that should be xfailed for now (Dao-AILab#2127)

* Update cutlass to fix undefined symbol: cuDriverGetVersion. (Dao-AILab#2142)

* [Cute,Fwd,Sm100] Support `q_stage=1` for inference (Dao-AILab#1993)

* use q_stage=1 for split kv

* determine q_stage via seqlen_q for sm100

* repurpose softmax1 warps for cp.async load

* address comments

* [Cute] Fix two tests that were failing  (Dao-AILab#2149)

* [Cute] Add missing COMPUTE_CAPABILITY definition in test_score_mod.py

The paged KV cache tests (test_score_mod_with_paged_kvcache and
test_score_mod_with_paged_kvcache_aux_tensors) check COMPUTE_CAPABILITY
to skip tests on SM90 since paged KV cache is only supported on SM100.
However, the variable was never defined, causing a NameError.

This adds the same definition used in test_mask_mod.py:
COMPUTE_CAPABILITY = torch.cuda.get_device_capability()[0]

* [Cute] Fix missing seqlen_info parameter in mask_mod call

The mask_mod call in apply_mask_sm100_transposed was missing the
seqlen_info parameter. All mask functions expect the signature:
(batch, head, m_idx, n_idx, seqlen_info, aux_tensors)

The other two mask_mod calls in the same file correctly pass all 6
arguments, but this one only passed 5, causing:
TypeError: cute_ima_mask() missing 1 required positional argument: 'aux_tensors'

This fixes test_mask_mod.py::test_mask_mod_ima_partial_block.

* cleanup

* [Cute, Bwd, Sm100] Add varlen for sm100 bwd (Dao-AILab#2150)

* varlen bwd with rounded padded offsets

* fix mha

* change offset mode to round down multiple

* enable varlen bwd tests

* enable deterministic mode

* fix deadlock and switch mha to no postprocess

* reenable tests

* fix lint error

* use head swizzle/spt for deterministic, update tests

* change padding offset based on arch

* rebase and update interface, tests

* add arch dispatch for padded offset q to postprocess

* address comments

* remove tile sizes from seqlen info class vars

* block-sparse backward SM90 (Dao-AILab#2136)

* score-mod backward SM90 (Dao-AILab#2137)

* [Cute] Clarify and fix subtle cachekey bug (Dao-AILab#2143)

* [CUTE][SM100] Fix backward gqa on sm100 post mask-mod semantic change (Dao-AILab#2146)

* [CUTE][SM90]Enable pack-gqa with broadcasted maskmods (Dao-AILab#2145)

* [CUTE][SM90] GQA backward non deterministic (Dao-AILab#2158)

* [Cute,Bwd,Sm100] fix seqused in varlen bwd (Dao-AILab#2167)

* fix seqused in varlen bwd

* enable store zero for zero len seqused q

* [CUTE] Bump cutedsl to 4.3.5 (Dao-AILab#2170)

* [Cute,Flex] Add option to create and cache __cute_hash__ (Dao-AILab#2171)

* add __cute_hash__ when it doesn't exist to prevent unnecessary future hashing

* remove unnecessary reformatting

* reinstate changes
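The caching idea above can be sketched as follows (hypothetical helper; the real hash covers the CuTe cache key, stood in here by a caller-supplied function):

```python
def get_cute_hash(obj, compute_hash):
    # Compute the expensive hash once, then stash it on the object so
    # every later cache-key construction is a cheap attribute lookup.
    h = getattr(obj, "__cute_hash__", None)
    if h is None:
        h = compute_hash(obj)
        try:
            obj.__cute_hash__ = h
        except AttributeError:
            pass  # objects with __slots__ can't be annotated; recompute next time
    return h
```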

* [Cute][Flex] Remove no longer needed contig (Dao-AILab#2172)

* [Cute] update row_max before safe overwrite for online_softmax (Dao-AILab#2174)

* update row_max before safe overwrite

* move up row_max_prev

* [Cute][Flex] add back in contig (Dao-AILab#2177)

* [Cute][Flex]Add pack-gqa divmod (Dao-AILab#2180)

* baseline local flops

* [Cute,Fwd,Sm100] distributed offset calculation for paged KV (Dao-AILab#2104)

* fully shard paged KV address calculation across threads

* use t0 indices for static bound checking

* increase tiled copy to full KV row

* shrink predicate tensor

* clarify paged KV divisibility constraints

* increase load register allocation

* Add R2P dual bound masking for local attention

Add mask_r2p_dual_bound function using XOR of two bitmasks
to efficiently mask elements outside [col_limit_left, col_limit_right)
range for SM100 local attention.

* remove benchmark result, undo changes to benchmark

* Add R2P dual bound masking for local attention

Add mask_r2p_dual_bound function using XOR of two bitmasks
to efficiently mask elements outside [col_limit_left, col_limit_right)
range for SM100 local attention.

* switch from xor to mask_right & ~ mask_left

* flip in_bound to out_bound

* remove zero logic for right_s and left_s

* remove 24 clamp

* doc

* lint

* added back clamp to avoid "OverflowError: Python int too large to convert to C long"

* add comment
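The dual-bound masking described above can be sketched in scalar Python (illustrative only; the kernel builds these bitmasks per thread with R2P, and applies the complement to mask out-of-window elements):

```python
def dual_bound_mask(ncols, col_limit_left, col_limit_right):
    # One bit per column; a bit is set iff its column lies inside
    # [col_limit_left, col_limit_right). Built as mask_right & ~mask_left,
    # matching the "mask_right & ~mask_left" form the commit switched to
    # (from the earlier XOR of two bitmasks).
    mask_right = (1 << max(0, min(col_limit_right, ncols))) - 1  # bits below right limit
    mask_left = (1 << max(0, min(col_limit_left, ncols))) - 1    # bits below left limit
    return mask_right & ~mask_left
```

Clamping both limits into [0, ncols] mirrors the clamp the commit had to restore to avoid overlarge shift amounts.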

* [Cute][Flex] Fix expanded tensor bug (Dao-AILab#2189)

* [Cute, SM90] fix fwd varlen Cute implementation bug for H100 (Dao-AILab#2194)

* fix

* same fix for bwd and SM80

* reduce chance of build oom (Dao-AILab#2079)

* [Cute][Flex] Allow q_offset 1 and add block-sizes to disambiguate edge cases (Dao-AILab#2187)

* Remove hopper/flash_api_torch_lib.cpp from CMakeLists.txt

Upstream flash_api.cpp already has torch bindings, so this file is no longer needed.

* Fix compatibility between upstream flash_api.cpp and downstream flash.h

- Use prepare_seqlen_q_ptr instead of num_m_blocks_ptr (downstream API)
- Restore static_switch.h from downstream (has QV_SWITCH macro)

* Restore entire hopper/ folder from downstream

Using downstream's hopper code (with n_offset, CP, varlen combine) for full
compatibility. Upstream changes are kept in non-hopper files.

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: seungrok.jung <seungrok.jung@amd.com>
Co-authored-by: Tri Dao <tridpq@gmail.com>
Co-authored-by: Jean-Luc Duprat <jld@acm.org>
Co-authored-by: NanoCode012 <nano@axolotl.ai>
Co-authored-by: Chao Shi <stepinto@live.com>
Co-authored-by: jayhshah <jayhshah@gmail.com>
Co-authored-by: Jingze Shi <losercheems@gmail.com>
Co-authored-by: y-sq <58683402+y-sq@users.noreply.github.com>
Co-authored-by: Ravi Ghadia <40660742+ghadiaravi13@users.noreply.github.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
Co-authored-by: Mingyang <mhao1999@outlook.com>
Co-authored-by: Reuben Stern <107093092+reubenconducts@users.noreply.github.com>
Co-authored-by: Johnny <johnnync13@gmail.com>
Co-authored-by: Johnny <johnnynuca14@gmail.com>
Co-authored-by: Rajesh Shashi Kumar <35628747+rajesh-s@users.noreply.github.com>
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Henry Tsang <henrylhtsang@meta.com>
Co-authored-by: Ted Zadouri <tedzadouri@gmail.com>
Co-authored-by: Driss Guessous <32754868+drisspg@users.noreply.github.com>
Co-authored-by: brandonsun <brandons@nvidia.com>
Co-authored-by: JackCharlesZhang <113156832+JackCharlesZhang@users.noreply.github.com>
Co-authored-by: Jack Zhang <jackzhang@Jacks-MacBook-Pro-4.local>
Co-authored-by: imbr92 <40306754+imbr92@users.noreply.github.com>
Co-authored-by: Kevin Tong <kevin@augmentcode.com>
Co-authored-by: Tri Dao <tridao@users.noreply.github.com>
Co-authored-by: Michael Melesse <micmelesse@gmail.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Kevin Wang <kevmo314@gmail.com>
Co-authored-by: Ted Zadouri <tz6037@princeton.edu>
Co-authored-by: timmy-feng <70349932+timmy-feng@users.noreply.github.com>
Co-authored-by: Guilherme Leobas <guilhermeleobas@gmail.com>
Co-authored-by: Anakin(Yancheng) Zheng <103552181+anakinxc@users.noreply.github.com>
Co-authored-by: Markus Hoehnerbach <mhoehnerbach@meta.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: Jeff Huang <chiachi.huang@amd.com>
Co-authored-by: liangel-02 <liangel@meta.com>
Co-authored-by: skarupke <malteskarupke@fastmail.fm>
Co-authored-by: Leo Dong <leodong0315@gmail.com>
Co-authored-by: seungrokj <144636725+seungrokj@users.noreply.github.com>
Co-authored-by: Qi Yuhang <45795032+HydraQYH@users.noreply.github.com>
Co-authored-by: Kareem <81531392+KareemMusleh@users.noreply.github.com>
Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
elewarr pushed a commit to elewarr/flash-attention that referenced this pull request Feb 4, 2026
LucasWilkinson pushed a commit to vllm-project/flash-attention that referenced this pull request Feb 11, 2026
* [FIX] Allow m_block_size == 192 and mma_pv_is_rs == False in Sm90 CuTe DSL (Dao-AILab#1858)

* update num_threads based on num wgs

* fix bug when not intra_wg_overlap and not mma_pv_is_rs

* make FA3 compatible with CUDA 13 Builds (Dao-AILab#1860)

Fix CUDA barrier init crash when num_consumers < NumThreadsPerWarpGroup

Previously, integer division caused num_consumer_warpgroups_per_cluster to be 0
when params.num_consumers (e.g., 32) was less than NumThreadsPerWarpGroup (128),
leading to a compiler failure during barrier initialization. Changed to round-up
division to ensure a minimum value of 1.
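
The fix above can be sketched as plain ceil-division (illustrative Python; the constant value follows the commit message, the function name is hypothetical):

```python
NUM_THREADS_PER_WARPGROUP = 128  # per the commit message

def num_consumer_warpgroups(num_consumers: int) -> int:
    # Plain integer division gave 32 // 128 == 0, breaking barrier init.
    # Round-up division guarantees a minimum of 1.
    return (num_consumers + NUM_THREADS_PER_WARPGROUP - 1) // NUM_THREADS_PER_WARPGROUP
```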

* [BUILD] SBSA wheels + CUDA 13 Support (Dao-AILab#1865)

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* [BUILD] Update CUDA toolkit and PyTorch versions in CI configuration

* drop 12.4

* drop 12.4

* fix correct name

* fix correct name

* fix correct name

* fix correct name

* cibuildwheel.yml

* benchmark: qualify all attention backends by methods list (Dao-AILab#1881)

* ABI stable fa3 (Dao-AILab#1791)

* squashed

* fixes

* fixes

* Fix narrow

* Add TORCH_STABLE_ONLY flag

* new_empty + zero_ --> new_zeros

* revert flash_api.cpp and add flash_api_stable.cpp

* update setup.py

* Only pass TORCH_STABLE_ONLY for stable build

* Address Jane's comments

* > to >=

* [NVIDIA] Enable Blackwell Family Specific (Dao-AILab#1882)

* fix typo

* Update setup.py

* Update setup.py

* Update setup.py

* Update setup.py

* fix typo in flops calculation for local attention (Dao-AILab#1883)

* flash-attn-cute bwd sm90 (Dao-AILab#1868)

* [Cute] Make testing utils standalone for cute (Dao-AILab#1892)

* Bump pin for CuTeDSL (Dao-AILab#1891)

* Improve causal backward determinism perf with SPT schedule (Dao-AILab#1893)

* add spt scheduler for causal bwd determinism

* add new torch check for det hdim 256 to stable api

* Upgrade to cutlass v4.2.1 (Dao-AILab#1905)

* switch to use cutlass.utils.get_smem_capacity_in_bytes instead of deprecated cutlass.utils.ampere_helpers.SMEM_CAPACITY (Dao-AILab#1906)

* Add Missing None Gradient in FA3 QKVPacked (Dao-AILab#1908)

Co-authored-by: Jack Zhang <jackzhang@Jacks-MacBook-Pro-4.local>

* C++11 fix warnings (Dao-AILab#1904)

* errors are with C++11 narrowing warnings (treated as errors in strict builds) when initializing at::cuda::CUDAGuard with a non-constant char cast to c10::DeviceIndex (signed char).

* errors are with C++11 narrowing warnings (treated as errors in strict builds) when initializing at::cuda::CUDAGuard with a non-constant char cast to c10::DeviceIndex (signed char).

* errors are with C++11 narrowing warnings (treated as errors in strict builds) when initializing at::cuda::CUDAGuard with a non-constant char cast to c10::DeviceIndex (signed char).

* errors are with C++11 narrowing warnings (treated as errors in strict builds) when initializing at::cuda::CUDAGuard with a non-constant char cast to c10::DeviceIndex (signed char).

* Update flash_api_stable.cpp

* upstream cutlass v4.2.1 csrc

* [Cute] Write ex2 emulation in a more readable form

* [Cute] Simplify utils.py a bit

* [Cute] Remove arith & vector import in utils.py

* [CuteDSL] Fix test (Dao-AILab#1925)

* Refactors to enable FlexAttention (Dao-AILab#1840)

* Refactors to enable FlexAttention

* Thread throught the buffers to the score_mod

* add-test

* add fastdivmod

* comments

* comments

* [Cute] Fix softmax for cutlass-dsl==4.2.1

* [Cute] Fix softmax for fwd_sm100

* [Cute,Bwd] Simplify bwd_preprocessing kernel

* [Cute,Fwd,Sm90] Simplify by passing around functions

* [Cute,Fwd,Sm90] Simplify score mode by passing around partial fn

* [Cute] Optionally dump cubin and sass

* [Cute,Fwd,Sm90] Rename m_block_size->tile_m, n_block_size->tile_n

* [Cute,Bwd,Sm90] Format file w ruff

* [Cute,Bwd,Sm90] Fix bwd dK & dV, more async

* [Cute,Bwd,Sm90] Use cp.async.bulk instead of TMA for LSE & dPsum

* [Cute,Bwd,Sm90] Use 1 barrier for loading both K & V

* [Cute,Bwd,Sm90] Don't clear dK & dV, use zero_init mma flag instead

* [Cute,Bwd,Sm90] Use TMA to store dK & dV

* [Cute,Bwd,Sm90] Load K together w Q & LSE in the first iteration

* [Cute,Sm90] Move gemm helper functions to hopper_helpers.py

* Swap masking to not use R2P

* Pre-indent to make commit diffs readable

* Adding varlen support + tests

* Remove self refs in softmax for loop (Dao-AILab#1924)

Co-authored-by: Tri Dao <tridao@users.noreply.github.com>

* [Cute,Bwd,Sm90] Make postprocessing kernel work

* [Cute] Run ruff format on bwd files

* [CI] Add pre-commit GH action

* [Cute,Bwd,Sm90] Try dO_stage=1, PdS_stage=1

* [Cute,Bwd,Sm90] Make causal work

* [Cute,Bwd,Sm90] Implement dQ_swapAB

* [Cute,Bwd,Sm90] Implement SdP_swapAB

* [AMD] Torch Compile Issues (Dao-AILab#1756)

* fix rounding and dropout metadata bug

* fix lse shape and bug in interface

* return softmax is true

* [Cute,Bwd,Sm90] Implement mma_dkv_is_rs

* [Cute,Bwd,Sm90] Use block size 80x128

* [CUTE] Enable Pack GQA for score mods (Dao-AILab#1937)

* Add precommit list and then uncomment in chunks (Dao-AILab#1941)

* create list to work through

* include ampere

* [ROCm] prepare CK sources for pytorch hipify v2 APIs (Dao-AILab#1944)

See pytorch/pytorch#151845.
pytorch has removed caffe2, but hipify still contained
work-arounds for caffe2 vs torch compatibility.
As a result of hipify v2 changes, some torch APIs are changing.

* [Cute] Add flake8 config file

* [Cute,Fwd,Sm90] Load Q & K using the same mbarrier

* [Cute,Bwd,Sm90] Use the same producer states if Q_stage == dO_stage

* [Cute,Bwd,Sm90] Split sdQaccum layout into 2 warp groups

* [Cute,Bwd,Sm90] Implement masking

* [Cute,Fwd,Sm100] Parse swizzle from pointer, don't need to pass in

* [Cute,Fwd,Sm100] Clean up

* [Cute,Fwd,Sm100] Clean up mask

* [Cute] Reformat blackwell_helpers.py, block_info.py

* [Cute] Format mma_sm100_desc.py, seqlen_info.py

* sm100 bwd add kernel and update postprocess mask and barriers (Dao-AILab#1945)

* [Cute,Bwd,Sm100] Format flash_bwd_sm100.py and flash_bwd_postprocess

* [Cute,Bwd,Sm100] Rename var {m,n}_block_size->tile_{m,n}

* [Cute,Bwd,Sm100] Clean up a bit

* add barrier module (Dao-AILab#1946)

* [Cute,Bwd,Sm100] Have a separate function to set up the mma

* [Cute,Bwd,Sm100] Load LSE with cpasync_bulk

* [Cute,Bwd,Sm100] Load dPsum with cpasync_bulk

* [Cute,Bwd,Sm100] Use copy_utils functions to load Q & dO

* [Cute,Bwd,Sm100] Load K & Q, V & dO in the first iteration

* [Cute,Bwd,Sm100] Simplify mma by using functools.partial

* [Cute,Bwd,Sm100] Don't need q_dk_consumer_state

* [Cute,Bwd,Sm100] Simplify dQacc_reduce, don't need mbarrier

* [Cute,Bwd,Sm100] Iterate from m_block_min -> m_block_max

* [Cute,Bwd,Sm100] Try direct atomicadd rmem -> gmem

* [Cute,Bwd,Sm100] Combine pipeline_dK and pipeline_dV into one

* [Cute,Bwd,Sm100] All compute warps wait for lse_barrier

* [Cute,Bwd,Sm100] sdQaccum doesn't need swizzle

* [Cute,Bwd,Sm100] Try gemm_ptx

* [Cute,Bwd,Sm100] Clean up compute fn

* [Cute,Bwd,Sm100] Combine pipeline_S and pipeline_P into 1

* [Cute,Bwd,Sm100] Don't shuffle LSE & dPsum, reduce state variables

* [Cute,Bwd,Sm100] Hardcode dS_stage = 1

* [Cute,Bwd,Sm100] Add option for delay tma store

* Fix hopper cuda 13 build (Dao-AILab#1949)

* [CuteDSL] Fix hash function for cute.jit decorator (Dao-AILab#1953)

* Block Sparsity and Flex Attention mask mod support (Dao-AILab#1942)

* clean up and rebase for PR

* add mask mod tests

* add benchmarking files

* refactor for better style

* remove extraneous csrc

* type hint buffers

* refactor: order of non/overlap and modify blocksparse producer to agree with dense

* change variable name back to buffers

* remove unnecessary variable in first_half_block

* restore erroneous packgqa deletion

* add blocksparsity and mask_mod asserts to interface.py

* fix rebase issues

* Restore submodule and reset pointer to upstream/main

* rename cutlass.const_expr to const_expr

* support fully masked m blocks (i.e. skipped tiles)

* remove outdated commented code

* cutlass v4.3.0 (Dao-AILab#1952)

* [Cute,Bwd,Sm100] Use CopyBulkG2SOp copy op instead of calling ptx

* [Cute,Bwd,Sm100] More cleanup

* [CuTe DSL] Update "buffers" name to "aux_tensors"; fix flex bugs (Dao-AILab#1961)

* clean up and rebase for PR

* add mask mod tests

* add benchmarking files

* refactor for better style

* remove extraneous csrc

* type hint buffers

* refactor: order of non/overlap and modify blocksparse producer to agree with dense

* change variable name back to buffers

* remove unnecessary variable in first_half_block

* restore erroneous packgqa deletion

* add blocksparsity and mask_mod asserts to interface.py

* fix rebase issues

* Restore submodule and reset pointer to upstream/main

* rename cutlass.const_expr to const_expr

* support fully masked m blocks (i.e. skipped tiles)

* remove outdated commented code

* rename buffers -> aux_tensors, fix score_mod test in sm90 fwd

* fix mask mod interface issues and tests

* remove newline at end of file

* format with ruff

* format mask & sm100 with ruff

* format more files with ruff

* format barrier.py with ruff

* Fix FA3 segfault with custom CUDA streams in ABI stable build (Dao-AILab#1957)

The ABI stable implementation incorrectly used getCurrentStream().id()
which returns a StreamId (int64_t) instead of the actual cudaStream_t
pointer. Casting an integer ID to a stream pointer caused segmentation
faults when using custom CUDA streams.

Fixed by using the proper AOTI C API function aoti_torch_get_current_cuda_stream()
which returns the actual CUDA stream pointer.

* [Cute,Fwd,Sm100] Fix interface w score mod to get it to run

* [Cute,Sm100] In gemm ptx, add to base smem_address instead

* [Cute,Bwd,Sm100] Make postprocessing work, add interface

* [Cute,Bwd,Sm100] Simplify layouts in compute_loop

* [Cute,Bwd,Sm100] Causal mask

* [Cute,Bwd,Sm100] Enable bwd tests

* [Cute,Bwd] Enable bwd benchmarks

* [Cute] Add store_shared_remote_fp32x4 util function

* [Cute,Bwd,Sm100] Tune registers

* [Cute,Sm100] acc_tmem_addr is Int32 instead of constexpr

* [Cute,Bwd,Sm100] Reduce sync

* [Cute] Change utils.view_transpose back

* [Cute,Bwd,Sm100] Remove delay_tma_store option

* [Cute,Bwd,Sm100] Implement cluster

Co-authored-by: Ted Zadouri <tz6037@princeton.edu>

* [Cute] Copy benchmark util functions to cute directory

Easier to benchmark without having to install FA2

* [Cute,Bwd,Sm100] Use pipeline class for LSE and dPsum

* [Cute,Bwd,Sm100] Remove stage from sK, sV, tP, sdS

* [Cute,Bwd,Sm100] Fix wrong LSE and dPsum indexing in load

* [Cute] Blocks tweaks (Dao-AILab#1964)

* [Cute,Bwd,Sm100] Use TS MMA for dK

* [Cute,Blocksparse] Group block sparse input torch tensors

* [Cute,Bwd,Sm100] Separate mma_S and mma_dP

* [Cute,Bwd,Sm100] Try LPTBwdScheduler

* [Cute,Bwd,Sm100] Try separating warps loading Q and dO

* BlockSparse Tweaks (Dao-AILab#1970)

* Tweaks

* better errors

* Switch to new API

* [Cute] Fix main (Dao-AILab#1982)

* [Cute,Fwd,Sm100] Implement SplitKV (Dao-AILab#1940)

* Implement split KV

* Remove modal bench harness

* Fixes

* [Cute] Extract block-sparse utilities from SM80/90 (Dao-AILab#1984)

- Create block_sparse_utils.py with SM80/90 block-sparse logic
- Refactor flash_fwd.py to use extracted utilities
- Clean up whitespace in block_sparsity.py

This extracts the block-sparse consumer loop and related utilities
from flash_fwd.py into a reusable module for SM80/90 architectures.

* Enable python-3.10+ (Dao-AILab#1998)

* [Cute, Bwd, Sm100] Add GQA support (Dao-AILab#2004)

* add gqa for sm100 bwd

* remove mha guard for test

* change to cluster size 1

* [Cute,Fwd,Sm100] fix major regression with split kv (Dao-AILab#2006)

* [CuTe DSL] Block sparsity computation kernel (Dao-AILab#1983)

* begin block sparsity computation kernel

* block sparsity computation kernel and benchmark working

* loop range_constexpr

* add fast kernel

* merge fast and regular kernel

* use TensorSSA approach to mask mod

* update with OOB check

* tests and benchmarks for block sparsity working

* remove extraneous files

* Revert mask.py to previous state - removing unintended changes from block sparsity work

* remove flex attn test stub

* add sleeps to benchmark

* correct block sparsity benchmark to use torch.compile

* Restore missing mask definitions and fix benchmark window_size handling

* move benchmarks into new directory

* compute_block_sparsity docstring

* streamline compute block sparsity benchmark script

* [NVIDIA] bump github actions (Dao-AILab#1996)

* Update GitHub Actions to use checkout@v5 and setup-python@v6; enhance compute capability support

* revert changes

* revert

* Update publish.yml

* Update publish.yml

* Update publish.yml

* Update publish.yml

* cuda-toolkit@v0.2.29

* [Cute,Fwd,Sm100] Support paged attention (Dao-AILab#1999)

* modal bench and correctness

* implement for one thread per row

* coalesced(?) gmem loads

* use cp async

* use 64 threads to load

* fill in smem for V

* pass tests

* fixes

* removed extra files

* handle V loading for n_block < 0

* Add torch.compile support to flash attention 3

* Don't return mutated variables in mha_bwd

* Change fake_check flag to be opt-in; Remove build.sh and remove if-else around `torch.library.custom_op` usage

* Remove print statements and update exception message

* Fix flash_attn_backward_fake

* Add `safe_aot_autograd_check`

* Update namespace to flash_attn_3

* Add `flash_attn_forward.register_autograd`

* Fix bug in `flash_attn_backward_fake`

* Add support and tests for torch.export and aoti_compile_and_package

* format code

* update flash_api_stable.cpp

* Fix flash_api_stable.cpp build

* Only run schema_check if dtype is not float8_e4m3fn

* Correctly compute kBlockM for sm88/86/80

* Fix bug in boxed_mha_bwd

* don't run autograd_check when num_splits > 0

* [Cute] Add block-sparsity support to SM100 (Dao-AILab#1985)

- Implement block-sparse attention in flash_fwd_sm100.py
- Update interface.py to handle SM100 block size calculations
  (2x multiplier for m_block_size since 1 CTA handles 2*tile_m rows)
- Add mask_mod parameter support in mask.py for block-sparse masking
- Add SM100 test fixtures and tile size handling in test_mask_mod.py

This enables block-sparsity on SM 10.0 architecture, including
mask_mod support and proper block size accounting.

* [Cute,Sm100,Fwd] use correction warps for epi when not using TMA (Dao-AILab#2014)

* use correction warps for epi when varlen (non tma O)

* properly enable fallback epilogue for varlen q

* fix rebase errors

* update tests

* Raise TypeError if out is specified when compiling _flash_attn_forward

* add fastdivmod for oob reads in mask_mods (Dao-AILab#2020)

* add fastdivmod for oob reads in mask_mods

* Updates for h100
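
FastDivmod trades runtime integer division, which is expensive on GPUs, for a precomputed multiply-high and shift. A minimal sketch under the assumption of 32-bit operands (illustrative only, not the actual kernel utility):

```python
class FastDivmod:
    """Exact n // d and n % d via multiply-high + shift, for 0 <= n < 2**32."""

    def __init__(self, d: int):
        assert 0 < d < 2**32
        self.d = d
        # Round-up reciprocal; exact for all 32-bit n because the
        # rounding error n*r/(d*2**64) stays below 1/d.
        self.magic = (2**64 // d) + 1

    def divmod(self, n: int):
        q = (n * self.magic) >> 64
        return q, n - q * self.d
```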

* don't pass mask_fn to softmax_step generically (Dao-AILab#2026)

* swap order of decorators (Dao-AILab#2029)

* [Cute,Bwd,Sm100] enable deterministic mode for sm100 bwd and fix race conditions (Dao-AILab#2033)

* enable deterministic mode for sm100 bwd and fix race conditions

* turn off lpt scheduler for causal

* use more regs for reduce when deterministic

* make a src for tiled mma dK toggleable parameter, remove smem async fence for lse release

* use 100k iterations for default

* [NFC] Trivial fix to silence linter (Dao-AILab#1928)

Not much to see here, but this causes linter noise

* Add LICENSE and AUTHORS to flash_attn/cute (Dao-AILab#2032)

* [Cute] Add authors

* [Cute,Fwd] enable mask mod without blocksparsity (Dao-AILab#2031)

* Bump pin (Dao-AILab#2025)

* Bump pin

* Switch to new fastdivmod

* cleanup varlen on blackwell

* Allow for only cute install

* ruff all the smaller files (Dao-AILab#2040)

* [Flash] Fix head dim 64 bwd (Dao-AILab#2035)

* Add headdim64 tests (Dao-AILab#2041)

* [Cute,Bwd,Sm100] Add local for sm100 bwd (Dao-AILab#2046)

* add local for sm100 bwd

* add deterministic

* update tests

* ruff files

* remove old code

* move comment

* override window_size = None for causal

* revert to fwd test defaults

* Add hash attr to shortcut expensive check (Dao-AILab#2048)

* [AMD ROCm] Update to latest composable_kernel to improve performance (Dao-AILab#2052)

* Update CK and c++ version

* update CK

* update ck

* Update comment to reflect qscale_type in fmha_fwd_traits

---------

Co-authored-by: Jeff Huang <chiachi.huang@amd.com>

* fixing cute bwd func def (Dao-AILab#2056)

* Fix use-after-free in FA3 deterministic mode. The pytorch caching allocator actually saves us here, but if you turn it off, then compute-sanitizer will detect this. (Dao-AILab#2063)

* [CUTE] Allow grads to be preallocated (Dao-AILab#2065)

* [Cute,Fwd] Extend score_mod to variable sequence length (Dao-AILab#2043)

* rebase to main

* varlen support for score mod

* interface change for varlen score mod

* implement varlen support for score mod

* varlen score mod working; updated tests

* modify varlen score mod to use fastdiv_mods updated per sequence

* updated test suite

* current working state of varlen score mod

* refactor varlen score mod tests

* fix to transpose

* refactor varlen score mod tests; fix bug; clean up varlen score mod application in kernel

* refactor test_score_mod.py to use external score mod definition file

* update flash_fwd.py for varlen score mod

* sm90 varlen score mod working; test revisions

* enable packgqa for varlen score mod; set up fastdiv_mod recomputation

* update flash_fwd_sm100.py for recomputing fastdiv_mods & format varlen score mod test

* Overwrite pack_gqa.py, tile_scheduler.py, and test_flash_attn.py with origin/main versions

* rebase to main

* fix test rebase artifacts

* fix floor_if_packed redundancy

* correct sm90 divmods mismatch

* revert test_flash_attn to main

* add varlen score mod benchmark script

* packgqa for varlen (independent of score mod)

* rm benchmark from PR

* move score mod arg wrapping to utils.py

* format with ruff

* major refactor: change score_mod signature to accept seqlen_info and update all tests accordingly

* reinstate varlen packgqa exclusion checks

* move fastdiv_mods recomputation out of apply_score_mod in prep for varlen mask_mod support

* remove duplicate fastdiv_mod recomputation

* [Fix] fastdiv_mods for paged attn and seqused_*

* clean up PR; fix paged_kv varlen for sm90

* update to varlen score mod test script (paged kv)

* remove premature seqlen arguments from sm90 apply_mask_mod

* [CUTE] Seeing if tvvm reduces cpu overhead (Dao-AILab#2042)

* [FIRST] Fix softcap scoremod kwargs typo. (Dao-AILab#2072)

* basics working (Dao-AILab#2070)

* Blocksparse impl (Dao-AILab#2085)

* Fix IMA in fwd on m boundary (Dao-AILab#2091)

* Fix IMA in fwd on m boundary

* Fix completely OOB loads

* Update to dsl 3.4.3 (Dao-AILab#2092)

* README for AMD ROCm (Dao-AILab#2068)

* readme update for rocm

Signed-off-by: seungrok.jung <seungrok.jung@amd.com>

* readme update for rocm

Signed-off-by: seungrok.jung <seungrok.jung@amd.com>

---------

Signed-off-by: seungrok.jung <seungrok.jung@amd.com>

* fix shuffle sync for pack gqa epilogue (Dao-AILab#2097)

* improve paged cpasync

* Enable Thor (Dao-AILab#2108)

* [Cute] Add quack as dependency

* [Cute,Fwd,Sm90] Change PipelineTMAAsync subclass to signal per warp

Previously we signaled per warp group, but that made the code more complicated
for only a tiny perf gain.

* Add pack-gqa support for blocksparse impl w/ broadcasted H dim (Dao-AILab#2098)

* [Cute,Fwd] improved block sparsity (Dao-AILab#2100)

* improved block sparsity computation

* refactor blocksparsity computation for tvm-ffi

* refactor mask mod definitions and tests

* refactor of block sparsity and mask mod application; eventually allow varlen

* remove fastdivmods from compute block sparsity

* remove unnecessary imports

* revert to 1-phase block sparsity computation

* update bwd kernels to use new AttentionMaskCls api

* fix linter error

* [Cute] Fix minor lint issue in shuffle_sync

* Misc tests that should be xfailed for now (Dao-AILab#2127)

* Update cutlass to fix undefined symbol: cuDriverGetVersion. (Dao-AILab#2142)

* [Cute,Fwd,Sm100] Support `q_stage=1` for inference (Dao-AILab#1993)

* use q_stage=1 for split kv

* determine q_stage via seqlen_q for sm100

* repurpose softmax1 warps for cp.async load

* address comments

* [Cute] Fix two tests that were failing  (Dao-AILab#2149)

* [Cute] Add missing COMPUTE_CAPABILITY definition in test_score_mod.py

The paged KV cache tests (test_score_mod_with_paged_kvcache and
test_score_mod_with_paged_kvcache_aux_tensors) check COMPUTE_CAPABILITY
to skip tests on SM90 since paged KV cache is only supported on SM100.
However, the variable was never defined, causing a NameError.

This adds the same definition used in test_mask_mod.py:
COMPUTE_CAPABILITY = torch.cuda.get_device_capability()[0]

* [Cute] Fix missing seqlen_info parameter in mask_mod call

The mask_mod call in apply_mask_sm100_transposed was missing the
seqlen_info parameter. All mask functions expect the signature:
(batch, head, m_idx, n_idx, seqlen_info, aux_tensors)

The other two mask_mod calls in the same file correctly pass all 6
arguments, but this one only passed 5, causing:
TypeError: cute_ima_mask() missing 1 required positional argument: 'aux_tensors'

This fixes test_mask_mod.py::test_mask_mod_ima_partial_block.

* cleanup

* [Cute, Bwd, Sm100] Add varlen for sm100 bwd (Dao-AILab#2150)

* varlen bwd with rounded padded offsets

* fix mha

* change offset mode to round down multiple

* enable varlen bwd tests

* enable deterministic mode

* fix deadlock and switch mha to no postprocess

* reenable tests

* fix lint error

* use head swizzle/spt for deterministic, update tests

* change padding offset based on arch

* rebase and update interface, tests

* add arch dispatch for padded offset q to postprocess

* address comments

* remove tile sizes from seqlen info class vars

* block-sparse backward SM90 (Dao-AILab#2136)

* score-mod backward SM90 (Dao-AILab#2137)

* [Cute] Clarify and fix subtle cachekey bug (Dao-AILab#2143)

* [CUTE][SM100] Fix backward gqa on sm100 post mask-mod semantic change (Dao-AILab#2146)

* ci: Use 1 ninja job for cu13 (Dao-AILab#2195)

Signed-off-by: oliver könig <okoenig@nvidia.com>

* Update README to include 'psutil' package as build requirement (Dao-AILab#2210)

Added 'psutil' as a build requirement in the README.

* [Flex][SM100] Replay expand fix on sm100 (Dao-AILab#2209)

stack-info: PR: Dao-AILab#2209, branch: drisspg/stack/6

* [DSL] Optionally patch cute-dsl to use system's ptxas

* [AMD] Triton Backend for ROCm #3 (Dao-AILab#2178)

* Fused Bwd (Dao-AILab#137)

* Fused with Good perf and stride fixed

Fix fused bugs

isolate failing case

fix bug

bring back test cases

rm split impl in fused

use exp2 is global variable now

try oom fix

save

make fused the default

limit to reproduce failure

return default to split

fix head size bug

use exp2 back to true

* new grid

* BLK_SLICE_FACTOR = 1

* add tflops

* new commit

* test in parallel

* strides added by jusson

* disable alibi

* fix bugs again

* default to fused

* add bwd options for varlen

* backend filter

* default to jingning and batch 4

* best fwd config

* fix TRITON_PRINT_AUTOTUNING flag bug

* tune

* Tuning fwd prefill

* add if else

* use flag

* Minor mask fix

* FLIP GRID

* use best config for default

* print when autotuning

* test bfloat16

* fix k and v stride bugs

* skip bfloat16

* test kvpacked

* disable internal tests

* pick default config based on arch

* Add alibi in the new bwd kernel (Dao-AILab#139)

* enable alibi for jinging kernel

enable alibi for jinging kernel

match

* save bad configs

* fix alibi and causal bug

* disable autotune by default

* auto tune when benching is good

* set best config

* remove env var

* Update amd_tests.yml

* upgrade to triton==3.3.0

* increase shm

* use 64 x 64 for now

* save

* handle 1d alibi

* Add fp8 to fused kernel (Dao-AILab#140)

* fp8 stuff

find test case

compute delta fp8

basic fp8 config passing

non causal path works

* isolate bad case

* fix fp8 bug

* didnot fix fp8 bug

* back to failing test

* fp8 tests passing

* skip

* skip ref tests

---------

Co-authored-by: Aliasger Zaidy <aliasger.zaidy@amd.com>

* head, seq, batch (Dao-AILab#141)

* Fix keys (Dao-AILab#144)

* save

* rm keys

* fix keys

* use GHA_RENDER_DEVICES

* normal docker

* Pad LSE (Dao-AILab#148)

* add round multiple

* fix fwd

* backward fix

* use rounded lse flag

* passing ROUNDED_LSE

* default is new rounded mode
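
The rounded-LSE change above pads lengths up to a block multiple, presumably so per-sequence LSE rows land on aligned offsets. A minimal sketch of the rounding helper (the name follows the commit message "add round multiple"; the implementation is an assumption):

```python
def round_multiple(x: int, m: int) -> int:
    """Round x up to the nearest multiple of m (m > 0)."""
    return (x + m - 1) // m * m
```

For example, a sequence length of 100 with a 128-row block would be padded to 128.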

* rename to fused_atomics and fused_no_atomics

* add test for torch_compile

* add varlen torch compile test

* add old one kernel for ref

* fix varlen mismatch bug

* fix shape issue in varlen but mismatch

* sync torch compile kernel launch

* simple varlen test

* add debug code

* rm old

* ignore old impls

* DEBUG flag works in interface only

* ref uses the right shape for lse

* rm oldest bwd kernel

* fix typo

* fix varlen bug

* fix bug. Get info from q for now

* simple shape and stride checkout

* add more tests

* test kvcache

* kvcache safe

* match case

* fix segfault due to bad return_softmax

* run bench

* run separately for the main functions

* just output benchmark

* default csv format and time stamp files

* non-verbose bench

* Sliding Window Forward (Dao-AILab#151)

* Compress SWA work

test case

set up debug inputs

add fwd ref

one mask ref

fwd first pass

save

ref does not work for bigger seqlens

save new version

some causal cases failing

found bad cases

working new attn

new atten works

new attn_fwd works

reorg n_extra_tokens

use seqlen_delta_qk

ref fwd works

add sliding window to bwd ref

test kvcache

decode ref work with everything except sliding window

add debug code for 12 failing sliding window cases for decode

attention_decode_forward_ref_impl mostly works except for alibi

fix alibi in attention_decode_forward_ref_impl

ref works with normal, varlen & kvcache

move stuff around

figure out masking

old attn inner

two inner functions

remove load_fn

do Lk - Lq like ref

unify IS_CAUSAL code in epilogue

clean up

add args

rm inference stuff

simplify compute_masking

simpler compute mask

stub out returning front masking variables

remove pointer pass

compute ptrs in loop

compute block min and max

window stub inside inner mask loop

trying to use attn_fwd_mask causes issues

fix compiler bug when front masking

gen specific types

add sliding window and debug statements

use identity for v

add more test cases

add comments

save

use k_max_token for clarity

disable debug configs

basic NON-CAUSAL SLIDING WINDOW

non causal sliding window works on all the shapes

non sliding window working in fwd

clean up fused bwd

separate old fwd_prefill

move configs to utils.py

* fix bwd ref bug

* skip local cases so that fa output

* no sliding window causal green

* add backward test skip for sliding window

* clean reduce in fwd_kvcache. no IS_CAUSAL branching

* add kvcache masking

* kvcache working

* fix some bugs in test.py

* clean up

* Fix Device Segfault (Dao-AILab#152)

* Compress segfault work

fix backward segfault

rework offset

ignore .profile

ignore .analysis

save

* assert the kernel launch device and tensor devices are the same

* fix failing asserts

* add asserts to fwd

* Fix SDMASK bug

* Log triton, torch and fa version

* Fix fp8 import issues

* fix docs (Dao-AILab#154)

* Sliding Window block classification logic (Dao-AILab#155)

* add aiter code

* remove aiter stuff

* sliding window non causal masking works

* causal and sliding window block masking

* extract common

* clean up typo

* helper for swa

* ignore .amd

* fix last block bug

* Enable FA V3 (Dao-AILab#157)

* Compress PA work

narrow pa test

ref works on most cases

inplace ref with new_kv

inplace paged attention

add pa ref

save pa

basic paged works

save

fix swa + causal in pa. Also new_kv only on pa path

passing

build fa v3

import interface from fa v3

copy fa tests

use v3 api

clean up

rename to match old test

support different head sizes

remove fp8

basic passing v3 cases

test_flash_attn_varlen_output v3 working

isolate bad case for kvcache

case passing

save

use decode if seqused / cache_seqlens is given

use decode if not varlen

basic kvcache v3 working

kvcache enable more cases

detect kvcache case if seqused_q is None and seqused_k is not None

skip failing test

find fp8 failing case

mha fp8 works

fix fp8 MQA/GQA bug

clean up

more clean up

clean up more

don't need fp8 dead code

remove train code with fp8 stuff

fp8 working in kvcache

paged + fp8 seems to be working

new_kv allowed

* clean up

* skip hopper race test

* clean up more

* fix paged + alibi

* similar inner paged api

* unify _attn_fwd_inner

* AITER integration (Dao-AILab#159)

* clean up v2 interface

* assert fp8 scale shapes

* rotary working

* move rotary to impl layers

* remove einops

* enable rotary in v3

* create interface

* fix descale assert

* unify bwd

* lint from aiter

* clean fp8 api

* add api change

* assert shapes for v2

* remove ref and bench.py

* remove metadata class and clean up

* bwd_prefill

* one bwd.py

* rename

* lint

* add bwd_change (Dao-AILab#156)

* Tune FP8 Perf (Dao-AILab#160)

* check cu count for gfx942

* create get_cu_count

* update repo root

* update forward tune

* clean up load

* use float8_e4m3fnuz

* save

* show bwd mode

* recommend fp8

* use torch.float32 for fp8 kernel

* add both best fp16 and fp8 config

* tune fp8 backward

* descale factors should be b, hk

* fp8 bwd working on all primus configs

* tune bwd configs

* fa v3 tests passing

* better warning

* clean up bwd launcher

* v3 passing

* tune more

* improve perf

* clean up

* lint

* clean

* start tuning gfx950

* tune non causal path

* fix bug

* save

* Skip configs where BLOCK_M2 % BLOCK_N2 != 0

* skip more

* stop tuning

* fix varlen bug

* fix dropout & causal/swa segfault

* update to the new machine changes

* save

* fix more bugs

* remove random seed

* clean up

* update readme

* print tensor stats for debug

* disable sliding window tests

* add rdna configs

* fix k partial bug

* fix block_size_n bug

* fix type check bug

---------

Co-authored-by: Aliasger Zaidy <aliasger.zaidy@amd.com>
Co-authored-by: Tianxing Wu <tianxing.wu@amd.com>

* fix compute_block_sparsity usage in benchmark_mask_mod (Dao-AILab#2221)

* Fix shared-memory race (Dao-AILab#2229)

* Use TORCH_TARGET_VERSION over TORCH_STABLE_ONLY (Dao-AILab#2155)

* short readme for flex flash (Dao-AILab#2231)

* [FA3] Mark current main version as v3.0.0 stable (Dao-AILab#2223)

A collaboration between Flash-Attention, PyTorch and xFormers is trying to provide pre-built wheels for FA3 across as many platforms/environments as possible (e.g., ARM, Windows, CUDA 13, ...). To simplify the installation workflow, it would help to tag these packages as stable, but the current main version is tagged as beta.

FA3 hasn't received substantial updates in a while (the latest was a bugfix almost two months ago), and most new development is happening in FA4. Thus, in this PR, I propose we just claim that the current main version _is_ stable.

I have heard concerns that the feature set of FA3 doesn't currently match FA2 (e.g., dropout is missing). I think this concern is partly addressed by the fact that the new wheels will have a different name than the FA2 ones (`flash_attn_3` and `flash_attn` respectively), hence the former does _not_ claim to be a replacement for the latter, and the two can coexist (and they provide different modules).

* hdim 192 smem fix (Dao-AILab#2235)

* Add `FLASH_ATTENTION_TRITON_AMD_CONFIG_JSON` env var support (Dao-AILab#2239)

* Add FLASH_ATTENTION_TRITON_AMD_CONFIG_JSON env var support

Allows users to override the Triton config when not autotuning.

* Add FLASH_ATTENTION_TRITON_AMD_CONFIG_JSON to readme

* Rename to FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON
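The override pattern above can be sketched as follows. This is a hypothetical illustration of how a JSON config override via an env var might be consumed: the variable name `FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON` comes from the PR, but the JSON keys shown (`BLOCK_M`, `BLOCK_N`, `num_warps`) and the `load_fwd_config` helper are illustrative assumptions, not the repo's actual schema or code.

```python
import json
import os

# Hypothetical default kernel config; real configs live in the repo's utils.
DEFAULT_CONFIG = {"BLOCK_M": 128, "BLOCK_N": 64, "num_warps": 4}


def load_fwd_config(env=os.environ):
    """Return the default config, with fields overridden by the env var if set."""
    raw = env.get("FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON")
    cfg = dict(DEFAULT_CONFIG)
    if raw is not None:
        # json.loads raises ValueError on malformed input, surfacing bad overrides early.
        cfg.update(json.loads(raw))
    return cfg
```

A user would then export, e.g., `FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON='{"BLOCK_M": 256}'` before running, leaving the remaining fields at their defaults.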

* [CUTE]Bump to Cutedsl (Dao-AILab#2216)

Co-authored-by: Cursor <cursoragent@cursor.com>

* pytest-dist round robin to gpus (Dao-AILab#2241)
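The round-robin idea behind that change can be sketched as below: map each pytest-xdist worker id to a GPU index and pin the worker via `CUDA_VISIBLE_DEVICES`. The helper names and the conftest wiring are illustrative assumptions, not the repo's actual code.

```python
import os


def gpu_for_worker(worker_id: str, num_gpus: int) -> int:
    """Map an xdist worker id ("gw0", "gw1", ...) to a GPU index round-robin.

    The main process (when xdist is not active) reports "master" and gets GPU 0.
    """
    if worker_id == "master":
        return 0
    return int(worker_id.removeprefix("gw")) % num_gpus


def pin_gpu(worker_id: str, num_gpus: int) -> None:
    # In a conftest.py fixture one might pin the worker before CUDA init.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_for_worker(worker_id, num_gpus))
```

With 4 GPUs, workers gw0..gw3 land on GPUs 0..3 and gw4 wraps back to GPU 0, spreading the test load evenly.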

* [DSL] Replace old fence with cute.arch.fence_view_async_shared()

* [DSL]Replace utils.{fma,mul,add}_packed_f32x2 with cute.arch version

* [DSL] Remove coord_offset_i64, domain_offset_i64, elem_pointer_i64

Cute-dsl now supports i64 strides by default

* [Sm90] Use functions from quack.sm90_utils

* [DSL] Use cute.arch.warp_reduction_{max,sum}

* [Layout] Use reshape_acc_to_mn and reshape_acc_to_frgA from quack

* [Layout] Use quack.layout_utils.mma_partition_C_vec

* [DSL] Use cute.math.{exp2,log2,log}

* [Layout] Use layout_utils.transpose_view and select from quack

* [Bwd,Sm90] Use quack.copy_utils

* [Bwd,Sm100] Shorten PipelineTmaUmma create

* [Bwd,Sm90] Have score_mod and score_mod_bwd as partial functions

* [DSL] warpgroup_reg_alloc -> setmaxregister_increase

* Fix Hopper tests (Dao-AILab#2242)

---------

Signed-off-by: seungrok.jung <seungrok.jung@amd.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Reuben Stern <107093092+reubenconducts@users.noreply.github.com>
Co-authored-by: Johnny <johnnync13@gmail.com>
Co-authored-by: Johnny <johnnynuca14@gmail.com>
Co-authored-by: Rajesh Shashi Kumar <35628747+rajesh-s@users.noreply.github.com>
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Henry Tsang <henrylhtsang@meta.com>
Co-authored-by: Ted Zadouri <tedzadouri@gmail.com>
Co-authored-by: Driss Guessous <32754868+drisspg@users.noreply.github.com>
Co-authored-by: jayhshah <jayhshah@gmail.com>
Co-authored-by: brandonsun <brandons@nvidia.com>
Co-authored-by: JackCharlesZhang <113156832+JackCharlesZhang@users.noreply.github.com>
Co-authored-by: Jack Zhang <jackzhang@Jacks-MacBook-Pro-4.local>
Co-authored-by: Tri Dao <tridpq@gmail.com>
Co-authored-by: imbr92 <40306754+imbr92@users.noreply.github.com>
Co-authored-by: Kevin Tong <kevin@augmentcode.com>
Co-authored-by: Tri Dao <tridao@users.noreply.github.com>
Co-authored-by: Michael Melesse <micmelesse@gmail.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Kevin Wang <kevmo314@gmail.com>
Co-authored-by: Ted Zadouri <tz6037@princeton.edu>
Co-authored-by: timmy-feng <70349932+timmy-feng@users.noreply.github.com>
Co-authored-by: Guilherme Leobas <guilhermeleobas@gmail.com>
Co-authored-by: Anakin(Yancheng) Zheng <103552181+anakinxc@users.noreply.github.com>
Co-authored-by: Jean-Luc Duprat <jld@acm.org>
Co-authored-by: Markus Hoehnerbach <mhoehnerbach@meta.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: Jeff Huang <chiachi.huang@amd.com>
Co-authored-by: liangel-02 <liangel@meta.com>
Co-authored-by: skarupke <malteskarupke@fastmail.fm>
Co-authored-by: Leo Dong <leodong0315@gmail.com>
Co-authored-by: seungrokj <144636725+seungrokj@users.noreply.github.com>
Co-authored-by: Qi Yuhang <45795032+HydraQYH@users.noreply.github.com>
Co-authored-by: Kareem <81531392+KareemMusleh@users.noreply.github.com>
Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Wang Lecheng <wanglecheng@stu.pku.edu.cn>
Co-authored-by: Aliasger Zaidy <aliasger.zaidy@amd.com>
Co-authored-by: Tianxing Wu <tianxing.wu@amd.com>
Co-authored-by: zhuochen <zhuochen@outlook.com>
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Co-authored-by: Luca Wehrstedt <luca.wehrstedt@gmail.com>
Co-authored-by: Alex Butler <alexheretic@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>