## Problem

AITER's `unified_attention` currently only supports causal attention:

```python
# File: aiter/ops/triton/unified_attention.py:126
# Source: https://github.com/ROCm/aiter/blob/main/aiter/ops/triton/unified_attention.py#L126
assert causal, "Only causal attention is supported"
```

This prevents encoder-only models (BERT, RoBERTa, sentence transformers, embedding models) from using AITER's optimized attention on ROCm.
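For context, here is a minimal plain-PyTorch sketch (not AITER code) contrasting the two masking semantics; encoder-only models need the bidirectional path:

```python
# Minimal sketch (plain PyTorch, not AITER) contrasting causal vs.
# bidirectional attention; encoder-only models need the latter.
import torch

def attention(q, k, v, causal: bool):
    # q, k, v: [seq_len, head_dim]
    scores = q @ k.T / k.shape[-1] ** 0.5
    if causal:
        # Decoder-style: query i may only attend to keys j <= i.
        seq_len = q.shape[0]
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    # Bidirectional (encoder-only): every token attends to every token.
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(4, 8)
out_causal = attention(q, k, v, causal=True)
out_bidirectional = attention(q, k, v, causal=False)
```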
## Use Case

vLLM pooling models fail on ROCm because:

- The `ROCM_AITER_FA` backend raises `NotImplementedError` for the `ENCODER_ONLY` attention type
- vLLM falls back to FlexAttention, which has numerical precision issues on ROCm
- Result: 33 pooling tests failing on AMD CI
- vLLM Issue: vllm-project/vllm#29466
- vLLM PR workaround: vllm-project/vllm#31084
## Impact

This limitation affects:

- All encoder-only models on ROCm: BERT, RoBERTa, sentence-transformers, and other embedding models
- Frameworks: vLLM, Transformers, and other frameworks that use AITER
- Competitiveness: prevents ROCm from competing with CUDA for encoder-only workloads
- Performance: forces generic implementations instead of AITER-optimized kernels
## Request

Add bidirectional (non-causal) attention support to `unified_attention`:

- Remove restriction: remove the `assert causal` check, or make it conditional
- Add parameter: add an `is_causal` parameter to the kernel calls
- Modify masks: update the attention mask logic to support bidirectional attention (see the sketch after this list)
- Testing: test with encoder-only models (BERT-base, RoBERTa, sentence-transformers)
## Example Models Affected

- Embeddings: `sentence-transformers/all-MiniLM-L12-v2`, `intfloat/e5-small`
- Cross-encoders: `cross-encoder/ms-marco-MiniLM-L-6-v2`
- Classification: `nie3e/sentiment-polish-gpt2-small`
- Token classification: `boltuix/NeuroBERT-NER`

All of these use `ENCODER_ONLY` attention (bidirectional, non-causal).
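For reference, a minimal repro sketch using vLLM's pooling API with one of the embedding models above (this assumes vLLM's `LLM.embed` interface; run on a ROCm machine to exercise the affected path):

```python
# Minimal repro sketch using vLLM's pooling API on ROCm.
# The model is one of the embedding models listed above; embed()
# exercises the ENCODER_ONLY attention path that AITER rejects.
from vllm import LLM

llm = LLM(model="sentence-transformers/all-MiniLM-L12-v2", task="embed")
outputs = llm.embed(["AITER bidirectional attention test"])
print(len(outputs[0].outputs.embedding))  # embedding dimension
```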
## Current Workaround

vLLM currently uses a generic FlashAttention path for encoder-only models on ROCm, bypassing AITER. This works but does not benefit from AITER's optimizations.
## References

- `unified_attention.py`: https://github.com/ROCm/aiter/blob/main/aiter/ops/triton/unified_attention.py#L126
## Environment
- ROCm Version: Latest (tested on vLLM AMD CI)
- AITER Version: v0.1.7
- Hardware: AMD MI300X, MI250X (AMD CI)
- Framework: vLLM v0.7+
Thank you for considering this feature request! Adding encoder-only support would greatly benefit the ROCm ecosystem and enable AITER optimizations for a whole class of models currently forced to use generic implementations.