Merged
498 commits
6387433
[FIX] Allow m_block_size == 192 and mma_pv_is_rs == False in Sm90 CuT…
reubenconducts Sep 2, 2025
afc97c6
make FA3 compatible with CUDA 13 Builds (#1860)
johnnynunez Sep 4, 2025
dfb6649
[BUILD] SBSA wheels + CUDA 13 Support (#1865)
johnnynunez Sep 5, 2025
e8c7344
benchmark: qualify all attention backends by methods list (#1881)
rajesh-s Sep 12, 2025
b3846b0
ABI stable fa3 (#1791)
mikaylagawarecki Sep 12, 2025
7bdb426
[NVIDIA] Enable Blackwell Family Specific (#1882)
johnnynunez Sep 12, 2025
e980f0f
fix typo in flops calculation for local attention (#1883)
henrylhtsang Sep 13, 2025
2cc6fd6
flash-attn-cute bwd sm90 (#1868)
tzadouri Sep 13, 2025
8ecf128
[Cute] Make testing utils standlone for cute (#1892)
drisspg Sep 17, 2025
589cc20
Bump pin for CuTeDSL (#1891)
drisspg Sep 17, 2025
5c1627a
Improve causal backward determinism perf with SPT schedule (#1893)
jayhshah Sep 17, 2025
1ceaa98
Upgrade to cutlass v4.2.1 (#1905)
johnnynunez Sep 23, 2025
3b24b08
switch to use cutlass.utils.get_smem_capacity_in_bytes instead of dep…
brandon-yujie-sun Sep 24, 2025
0165c96
Add Missing None Gradient in FA3 QKVPacked (#1908)
JackCharlesZhang Sep 24, 2025
add1756
C++11 fix warnings (#1904)
johnnynunez Sep 25, 2025
cc0a79b
[Cute] Write ex2 emulation in a more readable form
tridao Sep 27, 2025
5059fd5
[Cute] Simplify utils.py a bit
tridao Sep 27, 2025
c485eea
[Cute] Remove arith & vector import in utils.py
tridao Oct 1, 2025
cbd2490
[CuteDSL] Fix test (#1925)
drisspg Oct 7, 2025
5183de4
Refactors to enable FlexAttention (#1840)
drisspg Oct 8, 2025
a38d69d
[Cute] Fix softmax for cutlass-dsl==4.2.1
tridao Oct 11, 2025
437b35a
[Cute] Fix softmax for fwd_sm100
tridao Oct 12, 2025
ea03e06
[Cute,Bwd] Simplify bwd_preprocessing kernel
tridao Oct 12, 2025
fbdba01
[Cute,Fwd,Sm90] Simplify by passing around functions
tridao Oct 12, 2025
b528f4b
[Cute,Fwd,Sm90] Simplify score mode by passing around partial fn
tridao Oct 12, 2025
13f2077
[Cute] Optionally dump cubin and sass
tridao Oct 12, 2025
c172985
[Cute,Fwd,Sm90] Rename m_block_size->tile_m, n_block_size->tile_n
tridao Oct 12, 2025
9eee089
[Cute,Bwd,Sm90] Format file w ruff
tridao Oct 12, 2025
42e4e3e
[Cute,Bwd,Sm90] Fix bwd dK & dV, more async
tridao Oct 13, 2025
093b935
[Cute,Bwd,Sm90] Use cp.async.bulk instead of TMA for LSE & dPsum
tridao Oct 13, 2025
9be4a62
[Cute,Bwd,Sm90] Use 1 barrier for loading both K & V
tridao Oct 13, 2025
5576480
[Cute,Bwd,Sm90] Don't clear dK & dV, use zero_init mma flag instead
tridao Oct 13, 2025
5a5a65b
[Cute,Bwd,Sm90] Use TMA to store dK & dV
tridao Oct 13, 2025
66fd2a4
[Cute,Bwd,Sm90] Load K together w Q & LSE in the first iteration
tridao Oct 13, 2025
35384ec
[Cute,Sm90] Move gemm helper functions to hopper_helpers.py
tridao Oct 13, 2025
7c0e373
Swap masking to not use R2P
imbr92 Oct 13, 2025
60eb1ea
Pre-indent to make commit diffs readable
imbr92 Oct 13, 2025
25f5d09
Adding varlen support + tests
imbr92 Oct 13, 2025
b4e5896
Remove self refs in softmax for loop (#1924)
kevin-tong-augment Oct 13, 2025
13afe0d
[Cute,Bwd,Sm90] Make postprocessing kernel work
tridao Oct 13, 2025
d2c8a6c
[Cute] Run ruff format on bwd files
tridao Oct 13, 2025
ee3a533
[CI] Add pre-commit GH action
tridao Oct 13, 2025
93e433b
[Cute,Bwd,Sm90] Try dO_stage=1, PdS_stage=1
tridao Oct 14, 2025
57d0ce9
[Cute,Bwd,Sm90] Make causal work
tridao Oct 14, 2025
89b94f8
[Cute,Bwd,Sm90] Implement dQ_swapAB
tridao Oct 14, 2025
54d8aa6
[Cute,Bwd,Sm90] Implement SdP_swapAB
tridao Oct 14, 2025
72b793a
[AMD] Torch Compile Issues (#1756)
micmelesse Oct 14, 2025
5685ace
[Cute,Bwd,Sm90] Implement mma_dkv_is_rs
tridao Oct 14, 2025
a76e692
[Cute,Bwd,Sm90] Use block size 80x128
tridao Oct 14, 2025
6bc3d1f
[CUTE] Enable Pack GQA for score mods (#1937)
drisspg Oct 15, 2025
04adaf0
Add precommit list and then uncomment in chunks (#1941)
drisspg Oct 15, 2025
48ecd14
[ROCm] prepare CK sources for pytorch hipify v2 APIs (#1944)
jeffdaily Oct 18, 2025
cc843a2
[Cute] Add flake8 config file
tridao Oct 18, 2025
c712d43
[Cute,Fwd,Sm90] Load Q & K using the same mbarrier
tridao Oct 18, 2025
752c263
[Cute,Bwd,Sm90] Use the same producer states if Q_stage == dO_stage
tridao Oct 18, 2025
71ec343
[Cute,Bwd,Sm90] Split sdQaccum layout into 2 warp groups
tridao Oct 18, 2025
7a3a8fe
[Cute,Bwd,Sm90] Implement masking
tridao Oct 19, 2025
75fcbf2
[Cute,Fwd,Sm100] Parse swizzle from pointer, don't need to pass in
tridao Oct 19, 2025
b5e9a71
[Cute,Fwd,Sm100] Clean up
tridao Oct 19, 2025
b4fac7d
[Cute,Fwd,Sm100] Clean up mask
tridao Oct 19, 2025
9c14873
[Cute] Reformat blackwell_helpers.py, block_info.py
tridao Oct 19, 2025
aae355e
[Cute] Format mma_sm100_desc.py, seqlen_info.py
tridao Oct 19, 2025
83eb8d6
sm100 bwd add kernel and update postprocess mask and barriers (#1945)
tzadouri Oct 19, 2025
5fa6e8d
[Cute,Bwd,Sm100] Format flash_bwd_sm100.py and flash_bwd_postprocess
tridao Oct 19, 2025
498bfe6
[Cute,Bwd,Sm100] Rename var {m,n}_block_size->tile_{m,n}
tridao Oct 19, 2025
94f50b0
[Cute,Bwd,Sm100] Clean up a bit
tridao Oct 19, 2025
e925d10
add barrier module (#1946)
tzadouri Oct 19, 2025
d0d8adb
[Cute,Bwd,Sm100] Have a separate function to set up the mma
tridao Oct 19, 2025
796564d
[Cute,Bwd,Sm100] Load LSE with cpasync_bulk
tridao Oct 19, 2025
d0399b6
[Cute,Bwd,Sm100] Load dPsum with cpasync_bulk
tridao Oct 19, 2025
372f3e2
[Cute,Bwd,Sm100] Use copy_utils functions to load Q & dO
tridao Oct 19, 2025
c0c8c2d
[Cute,Bwd,Sm100] Load K & Q, V & dO in the first iteration
tridao Oct 19, 2025
7b17cd8
[Cute,Bwd,Sm100] Simplify mma by using functools.partial
tridao Oct 19, 2025
5c685ea
[Cute,Bwd,Sm100] Don't need q_dk_consumer_state
tridao Oct 19, 2025
8790c6e
[Cute,Bwd,Sm100] Simplify dQacc_reduce, don't need mbarrier
tridao Oct 20, 2025
7254904
[Cute,Bwd,Sm100] Iterate from m_block_min -> m_block_max
tridao Oct 20, 2025
2187695
[Cute,Bwd,Sm100] Try direct atomicadd rmem -> gmem
tridao Oct 20, 2025
12e1c04
[Cute,Bwd,Sm100] Combine pipeline_dK and pipeline_dV into one
tridao Oct 20, 2025
d101fa7
[Cute,Bwd,Sm100] All compute warps wait for lse_barrier
tridao Oct 20, 2025
82c9cbb
[Cute,Bwd,Sm100] sdQaccum doesn't need swizzle
tridao Oct 20, 2025
91f14ca
[Cute,Bwd,Sm100] Try gemm_ptx
tridao Oct 20, 2025
53c884b
[Cute,Bwd,Sm100] Clean up compute fn
tridao Oct 21, 2025
0f56550
[Cute,Bwd,Sm100] Combine pipeline_S and pipeline_P into 1
tridao Oct 21, 2025
22f7daa
[Cute,Bwd,Sm100] Don't shuffle LSE & dPsum, reduce state variables
tridao Oct 21, 2025
3cac07a
[Cute,Bwd,Sm100] Hardcode dS_stage = 1
tridao Oct 21, 2025
f29df7a
[Cute,Bwd,Sm100] Add option for delay tma store
tridao Oct 21, 2025
933b2c3
Fix hopper cuda 13 build (#1949)
kevmo314 Oct 21, 2025
a098f98
[CuteDSL] Fix hash function for cute.jit decorator (#1953)
drisspg Oct 21, 2025
143b0ba
Block Sparsity and Flex Attention mask mod support (#1942)
reubenconducts Oct 21, 2025
16c7f0f
cutlass v4.3.0 (#1952)
johnnynunez Oct 21, 2025
9dbed03
[Cute,Bwd,Sm100] Use CopyBulkG2SOp copy op instead of calling ptx
tridao Oct 21, 2025
1b8e1e6
[Cute,Bwd,Sm100] More cleanup
tridao Oct 22, 2025
e4d25a4
[CuTe DSL] Update "buffers" name to "aux_tensors"; fix flex bugs (#1961)
reubenconducts Oct 24, 2025
3effce8
Fix FA3 segfault with custom CUDA streams in ABI stable build (#1957)
kevmo314 Oct 24, 2025
9450df6
[Cute,Fwd,Sm100] Fix interface w score mod to get it to run
tridao Oct 24, 2025
7ef1a6f
[Cute,Sm100] In gemm ptx, add to base smem_address instead
tridao Oct 24, 2025
b3f437f
[Cute,Bwd,Sm100] Make postprocessing work, add interface
tridao Oct 25, 2025
6eb7c80
[Cute,Bwd,Sm100] Simplify layouts in compute_loop
tridao Oct 25, 2025
93a0afe
[Cute,Bwd,Sm100] Causal mask
tridao Oct 25, 2025
662cf9c
[Cute,Bwd,Sm100] Enable bwd tests
tridao Oct 25, 2025
79b9030
[Cute,Bwd] Enable bwd benchmarks
tridao Oct 25, 2025
510fe92
[Cute] Add store_shared_remote_fp32x4 util function
tridao Oct 26, 2025
b634499
[Cute,Bwd,Sm100] Tune registers
tridao Oct 26, 2025
e873ad0
[Cute,Sm100] acc_tmem_addr is Int32 instead of constexpr
tridao Oct 26, 2025
2c7177d
[Cute,Bwd,Sm100] Reduce sync
tridao Oct 26, 2025
6c56a0c
[Cute] Change utils.view_transpose back
tridao Oct 26, 2025
285bf12
[Cute,Bwd,Sm100] Remove delay_tma_store option
tridao Oct 26, 2025
c59ecd8
[Cute,Bwd,Sm100] Implement cluster
tridao Oct 26, 2025
25e6d94
[Cute] Copy benchmark util functions to cute directory
tridao Oct 27, 2025
53d3a99
[Cute,Bwd,Sm100] Use pipeline class for LSE and dPsum
tridao Oct 28, 2025
a5d545d
[Cute,Bwd,Sm100] Remove stage from sK, sV, tP, sdS
tridao Oct 28, 2025
b3f1b6a
[Cute,Bwd,Sm100] Fix wrong LSE and dPsum indexing in load
tridao Oct 28, 2025
67e8865
[Cute] Blocks tweaks (#1964)
drisspg Oct 28, 2025
7f7a497
[Cute,Bwd,Sm100] Use TS MMA for dK
tridao Oct 28, 2025
b613d9e
[Cute,Blocksparse] Group block sparse input torch tensors
tridao Oct 28, 2025
11336b7
[Cute,Bwd,Sm100] Separate mma_S and mma_dP
tridao Oct 29, 2025
419bdb7
[Cute,Bwd,Sm100] Try LPTBwdScheduler
tridao Oct 29, 2025
de1584b
[Cute,Bwd,Sm100] Try separating warps loading Q and dO
tridao Oct 29, 2025
0256114
BlockSparse Tweaks (#1970)
drisspg Oct 31, 2025
6c9eef9
[Cute] Fix main (#1982)
drisspg Nov 3, 2025
e724e25
[Cute,Fwd,Sm100] Implement SplitKV (#1940)
timmy-feng Nov 5, 2025
ad70a00
[Cute] Extract block-sparse utilities from SM80/90 (#1984)
drisspg Nov 5, 2025
c8abdd4
Enable python-3.10+ (#1998)
drisspg Nov 9, 2025
2ef346b
[Cute, Bwd, Sm100] Add GQA support (#2004)
jayhshah Nov 12, 2025
1338006
[Cute,Fwd,Sm100] fix major regression with split kv (#2006)
jayhshah Nov 12, 2025
16d78bb
[CuTe DSL] Block sparsity computation kernel (#1983)
reubenconducts Nov 12, 2025
fbf24f6
[NVIDIA] bump github actions (#1996)
johnnynunez Nov 13, 2025
5d2cd3b
[Cute,Fwd,Sm100] Support paged attention (#1999)
timmy-feng Nov 14, 2025
c7697bb
Add torch.compile support to flash attention 3
guilhermeleobas Jul 16, 2025
e1944ba
Don't return mutated variables in mha_bwd
guilhermeleobas Jul 24, 2025
a760ca3
Change fake_check flag to be opt-in; Remove build.sh and remove if-el…
guilhermeleobas Jul 25, 2025
24cc2b2
Remove print statements and update exception message
guilhermeleobas Jul 30, 2025
5e114d5
Fix flash_attn_backward_fake
guilhermeleobas Aug 6, 2025
734bc43
Add `safe_aot_autograd_check`
guilhermeleobas Aug 7, 2025
fde4bc0
Update namespace to flash_attn_3
guilhermeleobas Aug 19, 2025
ab79ae2
Add `flash_attn_forward.register_autograd`
guilhermeleobas Aug 22, 2025
6250fbe
Fix bug in `flash_attn_backward_fake`
guilhermeleobas Aug 22, 2025
1e3539e
Add support and tests for torch.export and aoti_compile_and_package
guilhermeleobas Sep 2, 2025
f174bd6
format code
guilhermeleobas Sep 3, 2025
6fe1c8c
update flash_api_stable.cpp
guilhermeleobas Sep 19, 2025
b555ac7
Fix flash_api_stable.cpp build
guilhermeleobas Oct 13, 2025
0aa4fa1
Only run schema_check if dtype is not float8_e4m3fn
guilhermeleobas Oct 13, 2025
47d7137
Correctly compute kBlockM for sm88/86/80
guilhermeleobas Oct 13, 2025
49fb775
Fix bug in boxed_mha_bwd
guilhermeleobas Oct 13, 2025
65dd580
don't run autograd_check when num_splits > 0
guilhermeleobas Nov 12, 2025
b4555bf
[Cute] Add block-sparsity support to SM100 (#1985)
drisspg Nov 18, 2025
43375aa
[Cute,Sm100,Fwd] use correction warps for epi when not using TMA (#2014)
jayhshah Nov 19, 2025
3fcde4b
Raise TypeError if out is specified when compiling _flash_attn_forward
guilhermeleobas Nov 21, 2025
052015a
add fastdivmod for oob reads in mask_mods (#2020)
drisspg Nov 21, 2025
d063b33
don't pass mask_fn to softmax_step generically (#2026)
jayhshah Nov 22, 2025
a986d01
swap order of decorators (#2029)
anakinxc Nov 24, 2025
20cda05
[Cute,Bwd,Sm100] enable deterministic mode for sm100 bwd and fix race…
jayhshah Nov 25, 2025
9194297
[NFC] Trivial fix to silence linter (#1928)
jduprat Nov 25, 2025
5cc6fa4
Add LICENSE and AUTHORS to flash_attn/cute (#2032)
jduprat Nov 25, 2025
63b66f2
[Cute] Add authors
tridao Nov 25, 2025
92ca9da
[Cute,Fwd] enable mask mod without blocksparsity (#2031)
reubenconducts Nov 25, 2025
672381f
Bump pin (#2025)
drisspg Nov 25, 2025
91ba87d
ruff all the smaller files (#2040)
drisspg Dec 2, 2025
de6a6ad
[Flash] Fix head dim 64 bwd (#2035)
drisspg Dec 2, 2025
26ba559
Add headdim64 tests (#2041)
drisspg Dec 2, 2025
59df2f9
Merge pull request #1769 from guilhermeleobas/guilhermeleobas/fa3-com…
v0i0 Dec 4, 2025
56fdf3e
[Cute,Bwd,Sm100] Add local for sm100 bwd (#2046)
jayhshah Dec 6, 2025
0d1ad61
Add hash attr to shortcut expensive check (#2048)
drisspg Dec 7, 2025
6328432
[AMD ROCm] Update to latest composable_kernel to improve performance …
rocking5566 Dec 7, 2025
c783ab2
fixing cute bwd func def (#2056)
liangel-02 Dec 9, 2025
bc0e4ac
Fix use-after-free in FA3 deterministic mode. The pytorch caching all…
skarupke Dec 12, 2025
e240e0f
[CUTE] Allow grads to be preallocated (#2065)
drisspg Dec 15, 2025
fd8d5eb
[Cute,Fwd] Extend score_mod to variable sequence length (#2043)
reubenconducts Dec 15, 2025
179f793
[CUTE] Seeing if tvvm reduces cpu overhead (#2042)
drisspg Dec 15, 2025
0a5339f
[FIRST] Fix softcap scoremod kwargs typo. (#2072)
LeoZDong Dec 16, 2025
ac9b5f1
basics working (#2070)
drisspg Dec 16, 2025
eacbc56
Blocksparse impl (#2085)
drisspg Dec 18, 2025
bba578d
Fix IMA in fwd on m boundary (#2091)
drisspg Dec 20, 2025
ceb4110
Update to dsl 3.4.3 (#2092)
drisspg Dec 22, 2025
5663adf
README for AMD ROCm (#2068)
seungrokj Dec 23, 2025
58fe37f
fix shuffle sync for pack gqa epilogue (#2097)
jayhshah Dec 24, 2025
11b32fd
improve paged cpasync
v0i0 Dec 24, 2025
d234051
Enable Thor (#2108)
johnnynunez Dec 29, 2025
4fd123e
[Cute] Add quack as dependency
tridao Dec 31, 2025
f3423a8
[Cute,Fwd,Sm90] Change PipelineTMAAsync sublass to signal per warp
tridao Jan 1, 2026
9b6dbac
Add pack-gqa support for blcoksparse impl w/ braodcasted H dim (#2098)
drisspg Jan 4, 2026
f98d345
[Cute,Fwd] improved block sparsity (#2100)
reubenconducts Jan 5, 2026
bb2efb3
[Cute] Fix minor lint issue in shuffle_sync
tridao Jan 5, 2026
f472175
Misc tests that should be xfailed for now (#2127)
drisspg Jan 5, 2026
3e87e42
Update cutlass to fix undefined symbol: cuDriverGetVersion. (#2142)
HydraQYH Jan 7, 2026
3c8ca4e
[Cute,Fwd,Sm100] Support `q_stage=1` for inference (#1993)
timmy-feng Jan 8, 2026
6dd7e74
[Cute] Fix two tests that were failing (#2149)
henrylhtsang Jan 8, 2026
c15ffe3
cleanup
v0i0 Jan 8, 2026
ed6a82f
[Cute, Bwd, Sm100] Add varlen for sm100 bwd (#2150)
jayhshah Jan 9, 2026
27a3b54
block-sparse backward SM90 (#2136)
drisspg Jan 10, 2026
844b10f
score-mod backward SM90 (#2137)
drisspg Jan 10, 2026
e317aa4
[Cute] Clarify and fix subtle cachekey bug (#2143)
drisspg Jan 10, 2026
26d4ee9
[CUTE][SM100] Fix backward gqa on sm100 post mask-mod semantic change…
drisspg Jan 10, 2026
8eff546
[CUTE][SM90]Enable pack-gqa with broadcasted maskmods (#2145)
drisspg Jan 10, 2026
5d4c953
[CUTE][SM90] GQA backward non deterministic (#2158)
drisspg Jan 10, 2026
ea8f735
[Cute,Bwd,Sm100] fix seqused in varlen bwd (#2167)
jayhshah Jan 10, 2026
ef7343b
[CUTE] Bump cutedsl to 4.3.5 (#2170)
drisspg Jan 12, 2026
dbf08eb
Merge pull request #2156 from v0i0/v0i0/improve-paged-ldgsts
v0i0 Jan 12, 2026
4cb272e
[Cute,Flex] Add option to create and cache __cute_hash__ (#2171)
reubenconducts Jan 12, 2026
4894657
[Cute][Flex] Remove no longer needed contig (#2172)
drisspg Jan 12, 2026
13696f2
[Cute] update row_max before safe overwrite for online_softmax (#2174)
jayhshah Jan 13, 2026
506441a
[Cute][Flex] add back in contig (#2177)
drisspg Jan 15, 2026
68649fb
[Cute][Flex]Add pack-gqa divmod (#2180)
drisspg Jan 15, 2026
88067b0
baseline local flops
henrylhtsang Jan 15, 2026
fffabc3
[Cute,Fwd,Sm100] distributed offset calculation for paged KV (#2104)
timmy-feng Jan 15, 2026
a512bd8
Add R2P dual bound masking for local attention
henrylhtsang Jan 15, 2026
2020964
remove benchmark result, undo changes to benchmark
henrylhtsang Jan 15, 2026
7108d1c
Add R2P dual bound masking for local attention
henrylhtsang Jan 15, 2026
e4ec1ad
switch from xor to mask_right & ~ mask_left
henrylhtsang Jan 16, 2026
ac88858
flip in_bound to out_bound
henrylhtsang Jan 16, 2026
e34d840
remove zero logic for right_s and left_s
henrylhtsang Jan 16, 2026
08e6518
remove 24 clamp
henrylhtsang Jan 16, 2026
94f0348
doc
henrylhtsang Jan 16, 2026
e94012a
lint
henrylhtsang Jan 16, 2026
2e6ae05
added back clamp to avoid "OverflowError: Python int too large to con…
henrylhtsang Jan 16, 2026
137ad8e
add comment
henrylhtsang Jan 16, 2026
2d6b146
Merge pull request #2185 from henrylhtsang/test_local_r2p
v0i0 Jan 17, 2026
a0f9f41
[Cute][Flex] Fix expanded tensor bug (#2189)
drisspg Jan 17, 2026
04e6ee1
[Cute, SM90] fix fwd varlen Cute implementation bug for H100 (#2194)
KareemMusleh Jan 20, 2026
f15ccf5
reduce chance of build oom (#2079)
Qubitium Jan 21, 2026
2580b5a
[Cute][Flex] Allow q_offset 1 and add block-sizes to disambiguate edg…
drisspg Jan 22, 2026
57cef6c
ci: Use 1 ninja job for cu13 (#2195)
ko3n1g Jan 24, 2026
438325c
Update README to include 'psutil' package as build requirement (#2210)
wanglc02 Jan 25, 2026
4f89246
[Flex][SM100] Replay expand fix on sm100 (#2209)
drisspg Jan 26, 2026
99589e5
[DSL] Optionally patch cute-dsl to use system's ptxas
tridao Jan 27, 2026
701ebe0
[AMD] Triton Backend for ROCm #3 (#2178)
micmelesse Jan 28, 2026
514e63c
fix compute_block_sparsity usage in benchmark_mask_mod (#2221)
zhuochenKIDD Feb 2, 2026
188643b
Fix shared-memory race (#2229)
drisspg Feb 4, 2026
ef9e6a6
Use TORCH_TARGET_VERSION over TORCH_STABLE_ONLY (#2155)
janeyx99 Feb 4, 2026
24445c0
short readme for flex flash (#2231)
v0i0 Feb 5, 2026
e2743ab
[FA3] Mark current main version as v3.0.0 stable (#2223)
lw Feb 5, 2026
f1284cf
hdim 192 smem fix (#2235)
jayhshah Feb 5, 2026
912c6c4
Add `FLASH_ATTENTION_TRITON_AMD_CONFIG_JSON` env var support (#2239)
alexheretic Feb 7, 2026
abaa878
[CUTE]Bump to Cutedsl (#2216)
drisspg Feb 8, 2026
48af662
pytest-dist round robin to gpus (#2241)
drisspg Feb 8, 2026
a804a5a
[DSL] Replace old fence with cute.arch.fence_view_async_shared()
tridao Feb 8, 2026
5a66f2c
[DSL]Replace utils.{fma,mul,add}_packed_f32x2 with cute.arch version
tridao Feb 8, 2026
d39b629
[DSL] Remove coord_offset_i64, domain_offset_i64, elem_pointer_i64
tridao Feb 8, 2026
81f2c2d
[Sm90] Use functions from quack.sm90_utils
tridao Feb 8, 2026
7edcf59
[DSL] Use cute.arch.warp_reduction_{max,sum}
tridao Feb 8, 2026
b735ef2
[Layout] Use reshape_acc_to_mn and reshape_acc_to_frgA from quack
tridao Feb 8, 2026
8dd8019
[Layout] Use quack.layout_utils.mma_partition_C_vec
tridao Feb 8, 2026
90f10fa
[DSL] Use cute.math.{exp2,log2,log}
tridao Feb 8, 2026
b9148ce
[Layout] Use layout_utils.transpose_view and select from quack
tridao Feb 8, 2026
c912a37
[Bwd,Sm90] Use quack.copy_utils
tridao Feb 8, 2026
deb1830
[Bwd,Sm100] Shorten PipelineTmaUmma create
tridao Feb 8, 2026
17d2943
[Bwd,Sm90] Have score_mod and score_mod_bwd as partial functions
tridao Feb 8, 2026
2a8d39c
[DSL] warpgroup_reg_alloc -> setmaxregister_increase
tridao Feb 8, 2026
72c7ba4
Fix Hopper tests (#2242)
drisspg Feb 8, 2026
fc9e426
Merge remote-tracking branch 'upstream/main' into merge_upstream
MatthewBonanni Feb 11, 2026
2 changes: 1 addition & 1 deletion .github/workflows/_build.yml
@@ -165,7 +165,7 @@ jobs:
# Limit MAX_JOBS otherwise the github runner goes OOM
# nvcc 11.8 can compile with 2 jobs, but nvcc 12.3 goes OOM

export MAX_JOBS=$([ "$MATRIX_CUDA_VERSION" == "129" ] && echo 1 || echo 2)
export MAX_JOBS=$([ "$MATRIX_CUDA_VERSION" == "129" ] || [ "$MATRIX_CUDA_VERSION" == "130" ] && echo 1 || echo 2)
export NVCC_THREADS=2
export FLASH_ATTENTION_FORCE_BUILD="TRUE"
export FLASH_ATTENTION_FORCE_CXX11_ABI=${{ inputs.cxx11_abi }}
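The changed line above extends the existing `MAX_JOBS` conditional so CUDA 13.0 also gets a single build job. The selection logic can be exercised standalone (a sketch assuming bash, since the workflow's `[ … == … ]` test is a bashism):

```shell
# Same MAX_JOBS selection as the workflow, looped over a few CUDA versions.
for MATRIX_CUDA_VERSION in 124 129 130; do
  MAX_JOBS=$([ "$MATRIX_CUDA_VERSION" == "129" ] || [ "$MATRIX_CUDA_VERSION" == "130" ] && echo 1 || echo 2)
  echo "cuda=$MATRIX_CUDA_VERSION jobs=$MAX_JOBS"
done
# cuda=124 jobs=2
# cuda=129 jobs=1
# cuda=130 jobs=1
```

Note the operator precedence: `||` and `&&` associate left-to-right, so either bracket test succeeding yields `echo 1`, and only both failing falls through to `echo 2`.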
71 changes: 25 additions & 46 deletions README.md
@@ -67,6 +67,7 @@ flash_attn_interface.flash_attn_func()
- CUDA toolkit or ROCm toolkit
- PyTorch 2.2 and above.
- `packaging` Python package (`pip install packaging`)
- `psutil` Python package (`pip install psutil`)
- `ninja` Python package (`pip install ninja`) *
- Linux. Might work for Windows starting v2.3.2 (we've seen a few positive [reports](https://github.com/Dao-AILab/flash-attention/issues/595)) but Windows compilation still requires more testing. If you have ideas on how to set up prebuilt CUDA wheels for Windows, please reach out via Github issue.

@@ -128,74 +129,52 @@ FlashAttention-2 ROCm CK backend currently supports:
3. Both forward's and backward's head dimensions up to 256.

#### Triton Backend
The Triton implementation of the [Flash Attention v2](https://tridao.me/publications/flash2/flash2.pdf) is currently a work in progress.
The Triton implementation of [Flash Attention](https://tridao.me/publications/flash2/flash2.pdf) supports AMD's CDNA (MI200, MI300) and RDNA GPUs using fp16, bf16, and fp32 datatypes. It provides forward and backward passes with causal masking, variable sequence lengths, arbitrary Q/KV sequence lengths and head sizes, MQA/GQA, dropout, rotary embeddings, ALiBi, paged attention, and FP8 (via the Flash Attention v3 interface). Sliding window attention is currently a work in progress.

It supports AMD's CDNA (MI200, MI300) and RDNA GPU's using fp16, bf16 and fp32 datatypes.

These features are supported in Fwd and Bwd
1) Fwd and Bwd with causal masking
2) Variable sequence lengths
3) Arbitrary Q and KV sequence lengths
4) Arbitrary head sizes
5) Multi and grouped query attention
6) Dropout
7) Rotary embeddings
8) ALiBi

We are working on the following things
1) Paged Attention
2) Sliding Window
3) FP8
4) Performance Improvements

##### Getting Started
To get started with the triton backend for AMD, follow the steps below.

First install the torch for ROCm from https://pytorch.org/get-started/locally/ if it is not installed. The torch and triton will be installed.

Then install Flash Attention with the flag `FLASH_ATTENTION_TRITON_AMD_ENABLE` set to `"TRUE"`.

```
To install, first get PyTorch for ROCm from https://pytorch.org/get-started/locally/, then install Triton and Flash Attention:
```sh
pip install triton==3.5.1
cd flash-attention
FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" python setup.py install
```

To test that things are working, you can run our tests. These tests take hours so you don't need to run the full thing.
```
To run the tests (note: full suite takes hours):
```sh
FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" pytest tests/test_flash_attn_triton_amd.py
```

You can use autotune for better performance by using this flag `FLASH_ATTENTION_TRITON_AMD_AUTOTUNE="TRUE"`
```
FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" FLASH_ATTENTION_TRITON_AMD_AUTOTUNE="TRUE" python $PATH_TO_CODE
```
For better performance, enable autotune with `FLASH_ATTENTION_TRITON_AMD_AUTOTUNE="TRUE"`.

###### Docker
You can also use the Dockerfile below which does the above steps on top of the latest rocm/pytorch image.
Alternatively, if _not_ autotuning, `FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON` may be used to set a single triton config overriding the hardcoded defaults for `attn_fwd`. E.g.
```sh
FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON='{"BLOCK_M":128,"BLOCK_N":64,"waves_per_eu":1,"PRE_LOAD_V":false,"num_stages":1,"num_warps":8}'
```
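The override is a plain JSON object, so it round-trips through the environment as a string. A hypothetical illustration of how a consumer might read it back (the actual parsing code is not part of this diff):

```python
import json
import os

# Set the override exactly as in the README example above.
os.environ["FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON"] = (
    '{"BLOCK_M":128,"BLOCK_N":64,"waves_per_eu":1,'
    '"PRE_LOAD_V":false,"num_stages":1,"num_warps":8}'
)

# Parse it back into a dict; JSON's false becomes Python's False.
cfg = json.loads(os.environ["FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON"])
print(cfg["BLOCK_M"], cfg["PRE_LOAD_V"])  # 128 False
```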

For a quick start with Docker:
```dockerfile
FROM rocm/pytorch:latest

WORKDIR /workspace

# install flash attention
ENV FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"
# install triton
RUN pip install triton==3.5.1

RUN git clone https://github.com/ROCm/flash-attention.git &&\
# build flash attention with triton backend
RUN git clone https://github.com/Dao-AILab/flash-attention &&\
cd flash-attention &&\
python setup.py install
FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" python setup.py install

# set working dir
WORKDIR /workspace/flash-attention
```

To build the docker file
```
docker build -t fa_triton .
# set env variable to use triton backend
ENV FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"
```

To run the docker image
```
docker run -it --network=host --user root --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ipc=host --shm-size 16G --device=/dev/kfd --device=/dev/dri fa_triton
Build and run:
```sh
docker build -t flash-attn-triton .
docker run -it --network=host --user root --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ipc=host --shm-size 16G --device=/dev/kfd --device=/dev/dri flash-attn-triton
```

## How to use FlashAttention
2 changes: 1 addition & 1 deletion csrc/cutlass
Submodule cutlass updated 1240 files
26 changes: 26 additions & 0 deletions flash_attn/cute/README.md
@@ -0,0 +1,26 @@
# Flash Attention CUTE

## Development Installation

1. Clone the repository (if you haven't already):
```bash
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/cute
```

2. Install in editable mode with dev dependencies:
```bash
pip install -e "./cute[dev]"
```

## Running Tests

```bash
pytest tests/cute/
```

## Linting

```bash
ruff check flash_attn/cute/
```
41 changes: 13 additions & 28 deletions flash_attn/cute/block_sparse_utils.py
@@ -14,7 +14,6 @@

# Import data structures from block_sparsity
from flash_attn.cute.block_sparsity import BlockSparseTensors
from flash_attn.cute import utils
from flash_attn.cute import copy_utils
from flash_attn.cute.named_barrier import NamedBarrierBwd

@@ -698,14 +697,14 @@ def handle_block_sparse_empty_tile_correction_sm100(
row_max_value = sink_val * (LOG2_E / softmax_scale_log2)
row_sum_value = Float32(1.0)
else:
row_sum_value = row_sum_value + utils.exp2f(
sink_val * LOG2_E - row_max_value * softmax_scale_log2
row_sum_value = row_sum_value + cute.math.exp2(
sink_val * LOG2_E - row_max_value * softmax_scale_log2, fastmath=True
)
if tidx < m_block_size:
scale_row_idx = tidx + stage * m_block_size
sScale[scale_row_idx] = row_sum_value
if const_expr(mLSE is not None or learnable_sink is not None):
sScale[scale_row_idx + m_block_size * 2] = row_max_value
sScale[scale_row_idx + q_stage * m_block_size] = row_max_value
acc_flag = row_sum_value == Float32(0.0) or row_sum_value != row_sum_value
stats[stage] = (row_sum_value, row_max_value, acc_flag)

@@ -1123,8 +1122,7 @@ def _load_q_do_block_sm90(
else:
pipeline_Q.producer_acquire(producer_state_Q)
load_Q(m_block, producer_state=producer_state_Q)
with cute.arch.elect_one():
load_LSE(m_block, producer_state=producer_state_Q)
load_LSE(m_block, producer_state=producer_state_Q)

producer_state_dO_cur = (
producer_state_dO if const_expr(not Q_stage_eq_dO_stage) else producer_state_Q
@@ -1135,8 +1133,7 @@
else:
pipeline_dO.producer_acquire(producer_state_dO_cur)
load_dO(m_block, producer_state=producer_state_dO_cur)
with cute.arch.elect_one():
load_dPsum(m_block, producer_state=producer_state_dO_cur)
load_dPsum(m_block, producer_state=producer_state_dO_cur)

producer_state_Q.advance()
producer_state_dO.advance()
@@ -1253,10 +1250,10 @@ def consume_block_sparse_mma_bwd_sm90(
is_causal: cutlass.Constexpr,
is_local: cutlass.Constexpr,
thr_mma_SdP,
softmax_scale,
seqlen,
subtile_factor: cutlass.Constexpr,
m_block_max: int,
score_mod_fn=None,
score_mod_bwd_fn=None,
subtile_factor: cutlass.Constexpr = 1,
m_block_max: int = 0,
aux_tensors=None,
fastdiv_mods=(None, None),
):
@@ -1318,15 +1315,9 @@
consumer_state_Q,
consumer_state_dO,
mask_fn=mask_fn_partial,
score_mod_fn=score_mod_fn,
score_mod_bwd_fn=score_mod_bwd_fn,
dKV_accumulate=dKV_accumulate,
thr_mma_SdP=thr_mma_SdP,
batch_idx=batch_idx,
head_idx=head_idx,
n_block=n_block,
softmax_scale=softmax_scale,
seqlen=seqlen,
aux_tensors=aux_tensors,
fastdiv_mods=fastdiv_mods,
)
dKV_accumulate = True

@@ -1342,15 +1333,9 @@
consumer_state_Q,
consumer_state_dO,
mask_fn=mask_fn_full,
score_mod_fn=score_mod_fn,
score_mod_bwd_fn=score_mod_bwd_fn,
dKV_accumulate=dKV_accumulate,
thr_mma_SdP=thr_mma_SdP,
batch_idx=batch_idx,
head_idx=head_idx,
n_block=n_block,
softmax_scale=softmax_scale,
seqlen=seqlen,
aux_tensors=aux_tensors,
fastdiv_mods=fastdiv_mods,
)
dKV_accumulate = True

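The signature change in `consume_block_sparse_mma_bwd_sm90` replaces explicit per-call context arguments (`batch_idx`, `head_idx`, `softmax_scale`, `seqlen`) with pre-bound `score_mod_fn` / `score_mod_bwd_fn` callables. A minimal sketch of that pattern using `functools.partial` (the score function here is purely illustrative, not the kernel's actual math):

```python
from functools import partial

# Hypothetical score mod: the caller binds its context once, so the inner
# loop receives a single-argument callable instead of five parameters.
def score_mod(score, batch_idx, head_idx, softmax_scale):
    return score * softmax_scale + head_idx  # illustrative transform only

score_mod_fn = partial(score_mod, batch_idx=0, head_idx=2, softmax_scale=0.5)
print(score_mod_fn(10.0))  # 7.0
```

Threading one partial through the pipeline keeps the consumer-loop signature stable when new score-mod parameters are added.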
151 changes: 151 additions & 0 deletions flash_attn/cute/cute_dsl_ptxas.py
@@ -0,0 +1,151 @@
"""
System ptxas replacement for CUTLASS DSL.
Environment variables:
CUTE_DSL_PTXAS_PATH - Path to ptxas (e.g., /usr/local/cuda/bin/ptxas)
CUTE_DSL_PTXAS_VERBOSE - Set to 1 for verbose output
"""

import os
import sys
import re
import ctypes
import subprocess
from pathlib import Path

import cutlass


CUTE_DSL_PTXAS_PATH = os.environ.get("CUTE_DSL_PTXAS_PATH", None)
VERBOSE = os.environ.get("CUTE_DSL_PTXAS_VERBOSE", "0") == "1"

_original_load_cuda_library = None
_user_wanted_ptx = False # True if user originally set CUTE_DSL_KEEP_PTX=1


def _log(msg):
if VERBOSE:
print(f"[ptxas] {msg}", file=sys.stderr)


def _get_ptx(compiled_func) -> tuple[str, Path] | None:
"""Find and read PTX file, stripping null bytes."""
func_name = getattr(compiled_func, "function_name", None)
if not func_name:
return None

dump_dir = os.environ.get("CUTE_DSL_DUMP_DIR", Path.cwd())
for ptx_path in Path(dump_dir).glob(f"*{func_name}*.ptx"):
content = ptx_path.read_text().rstrip("\x00")
if ".entry " in content and content.rstrip().endswith("}"):
_log(f"Found PTX: {ptx_path}")
return content, ptx_path
return None


def _compile_ptx(ptx_path: Path, ptx_content: str) -> bytes:
"""Compile PTX to cubin using system ptxas."""
# Extract arch from PTX
match = re.search(r"\.target\s+(sm_\d+[a-z]?)", ptx_content)
arch = match.group(1) if match else "sm_90a"

# Write stripped content back if needed
if ptx_path.read_text() != ptx_content:
ptx_path.write_text(ptx_content)

# Compile
cubin_tmp = ptx_path.with_suffix(".cubin.tmp")
try:
assert CUTE_DSL_PTXAS_PATH is not None
result = subprocess.run(
[CUTE_DSL_PTXAS_PATH, f"-arch={arch}", "-O3", "-o", str(cubin_tmp), str(ptx_path)],
capture_output=True,
text=True,
)
if result.returncode != 0:
raise RuntimeError(f"ptxas failed: {result.stderr}")

cubin_data = cubin_tmp.read_bytes()
_log(f"Compiled {ptx_path.name} -> {len(cubin_data)} bytes ({arch})")

# Save cubin if CUTE_DSL_KEEP_CUBIN is set
if os.environ.get("CUTE_DSL_KEEP_CUBIN", "0") == "1":
cubin_out = ptx_path.with_suffix(".cubin")
cubin_out.write_bytes(cubin_data)
_log(f"Saved: {cubin_out}")

return cubin_data
finally:
cubin_tmp.unlink(missing_ok=True)


def _patched_load_cuda_library(self):
"""Replacement for _load_cuda_library that uses system ptxas."""

result = _get_ptx(self)
if not result:
_log("PTX not found, falling back to embedded ptxas")
return _original_load_cuda_library(self)

ptx_content, ptx_path = result

try:
cubin = _compile_ptx(ptx_path, ptx_content)
except Exception as e:
_log(f"Compilation failed ({e}), falling back to embedded ptxas")
return _original_load_cuda_library(self)

# Load cubin
import cuda.bindings.runtime as cuda_runtime

err, library = cuda_runtime.cudaLibraryLoadData(cubin, None, None, 0, None, None, 0)
if err != cuda_runtime.cudaError_t.cudaSuccess:
_log(f"cudaLibraryLoadData failed ({err}), falling back to embedded ptxas")
return _original_load_cuda_library(self)

# Register kernels on all devices
_, cuda_load_to_device = self._get_cuda_init_and_load()
lib_ptr = ctypes.c_void_p(int(library))
dev_id = ctypes.c_int32(0)
err_val = ctypes.c_int32(0)
args = (ctypes.c_void_p * 3)(
ctypes.cast(ctypes.pointer(lib_ptr), ctypes.c_void_p),
ctypes.cast(ctypes.pointer(dev_id), ctypes.c_void_p),
ctypes.cast(ctypes.pointer(err_val), ctypes.c_void_p),
)

for dev in range(self.num_devices):
dev_id.value = dev
cuda_load_to_device(args)
if err_val.value != 0:
_log("cuda_load_to_device failed, falling back to embedded ptxas")
return _original_load_cuda_library(self)

_log(f"Loaded kernel from {ptx_path.name}")

# Delete PTX if user didn't originally want it kept
if not _user_wanted_ptx:
ptx_path.unlink(missing_ok=True)

return [cuda_runtime.cudaLibrary_t(lib_ptr.value)]


def patch():
"""Install system ptxas hook. Call before importing cutlass."""
global _original_load_cuda_library, _user_wanted_ptx

assert CUTE_DSL_PTXAS_PATH is not None
if not os.path.isfile(CUTE_DSL_PTXAS_PATH) or not os.access(CUTE_DSL_PTXAS_PATH, os.X_OK):
raise RuntimeError(f"ptxas not found: {CUTE_DSL_PTXAS_PATH}")

# Track if user originally wanted PTX kept
_user_wanted_ptx = os.environ.get("CUTE_DSL_KEEP_PTX", "0") == "1"
# os.environ['CUTE_DSL_KEEP_PTX'] = '1'
assert os.environ.get("CUTE_DSL_KEEP_PTX", "0") == "1", (
"Require CUTE_DSL_KEEP_PTX=1 to use system's ptxas"
)

cls = cutlass.cutlass_dsl.cuda_jit_executor.CudaDialectJitCompiledFunction
_original_load_cuda_library = cls._load_cuda_library
cls._load_cuda_library = _patched_load_cuda_library
_log("Patch applied")
return
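The arch-detection step in `_compile_ptx` above is a small self-contained piece: it pulls the `sm_XX` target out of the PTX text and falls back to a default when no `.target` directive is present. A standalone sketch using the same regex (the helper name `extract_arch` is introduced here for illustration only):

```python
import re

# Same pattern as _compile_ptx: capture e.g. sm_90a, sm_100a from a
# ".target" directive; fall back to a default if none is found.
def extract_arch(ptx: str, default: str = "sm_90a") -> str:
    m = re.search(r"\.target\s+(sm_\d+[a-z]?)", ptx)
    return m.group(1) if m else default

print(extract_arch(".version 8.3\n.target sm_100a\n.address_size 64"))  # sm_100a
print(extract_arch("no target directive here"))  # sm_90a
```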