Flash-Sparse-Attention is a high-performance trainable sparse attention implementation that combines Flash Attention's memory efficiency with sparse computation for handling extremely long sequences in Transformer models.
Note: Support for arbitrary mask and bias shapes is available in this branch. The current main branch no longer maintains that feature set.
- Forward and backward passes for dense attention, sparse attention, and gated attention
- Regular batched inputs and varlen inputs
- Causal attention and local window attention
- Arbitrary combinations of Q and KV sequence lengths, with head dimensions up to 256
- Grouped Query Attention and Multi Query Attention
- Sparse softmax threshold control
- Gated attention with gate inputs and configurable gating sparsity
- Flex Local Window Attention with per-head arbitrary window sizes and local ranges
- Split-KV for workload balancing in forward and decode workloads
- Split-QO for workload balancing in backward workloads
- Fused Quant for low-precision computation on hardware without native FP8 support
- Top-k gather KV-cache decode
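To make the Split-KV item above more concrete, here is a minimal pure-Python sketch of the general idea: the keys and values for a single query are split into chunks, each chunk is attended to independently, and the partial results are merged with a log-sum-exp rescale. This illustrates the standard technique only and is not the library's Triton kernel; the helper names (`partial_attention`, `merge_partials`, `split_kv_attention`) are hypothetical.

```python
import math


def partial_attention(scores, values):
    """Attend over one KV chunk.

    Returns the unnormalized weighted sum, the chunk's max score,
    and the sum of exponentials relative to that max.
    """
    chunk_max = max(scores)
    weights = [math.exp(s - chunk_max) for s in scores]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, chunk_max, sum(weights)


def merge_partials(partials):
    """Combine per-chunk results by rescaling each to a shared global max."""
    global_max = max(chunk_max for _, chunk_max, _ in partials)
    dim = len(partials[0][0])
    merged = [0.0] * dim
    total = 0.0
    for out, chunk_max, denom in partials:
        scale = math.exp(chunk_max - global_max)
        total += denom * scale
        for d in range(dim):
            merged[d] += out[d] * scale
    return [x / total for x in merged]


def split_kv_attention(scores, values, num_splits):
    """Hypothetical helper: split the KV axis into chunks, then merge."""
    chunk = math.ceil(len(scores) / num_splits)
    partials = [
        partial_attention(scores[i:i + chunk], values[i:i + chunk])
        for i in range(0, len(scores), chunk)
    ]
    return merge_partials(partials)


def full_attention(scores, values):
    """Single-pass softmax attention, used as the reference."""
    out, _, denom = partial_attention(scores, values)
    return [x / denom for x in out]


# One query against six keys; scores stand in for scaled q·k products.
scores = [0.5, 2.0, -1.0, 3.0, 0.0, 1.5]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, -1.0], [0.5, 0.5], [-1.0, 3.0]]
reference = full_attention(scores, values)
split = split_kv_attention(scores, values, num_splits=3)
```

Because each chunk only needs its own max and sum of exponentials, the chunks can be processed in parallel and merged afterwards, which is what makes the split useful when a single query attends to a very long KV sequence.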
- Linux: Ubuntu 22.04 or later
- Device: GPU, XPU, NPU, or PPU
- Python: 3.9 or later
- PyTorch: 2.5.1 or later
- Triton: 3.6.0 or later
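The version requirements above can be checked before installing. The sketch below uses only the standard library. It assumes the installed distributions are named `torch` and `triton`, which may differ on some platforms, and `parse_version` and `check_environment` are hypothetical helpers, not part of this package.

```python
import sys
from importlib import metadata

# Minimum versions from the requirements list; distribution names are assumed.
MINIMUMS = {"torch": (2, 5, 1), "triton": (3, 6, 0)}


def parse_version(text):
    """Turn a string such as '2.5.1+cu124' into (2, 5, 1).

    Local suffixes after '+' and non-numeric tails are ignored.
    """
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_environment(minimums=MINIMUMS):
    """Return a dict mapping each requirement to True if it is satisfied."""
    report = {"python": sys.version_info[:2] >= (3, 9)}
    for name, minimum in minimums.items():
        try:
            installed = parse_version(metadata.version(name))
        except metadata.PackageNotFoundError:
            report[name] = False
            continue
        report[name] = installed >= minimum
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```

This only inspects package metadata. It does not verify that a supported device or driver is present.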
Install from PyPI:
pip install flash-sparse-attn
To install from source:
git clone https://github.com/flash-algo/flash-sparse-attn.git
cd flash-sparse-attn
pip install .
You can also load the kernels directly from HuggingFace Kernel without installing the package:
from kernels import get_kernel
fsa = get_kernel("JingzeShi/flash-sparse-attn", version=1, trust_remote_code=True)
out = fsa.flash_dense_attn_func(q, k, v, is_causal=True)
out = fsa.flash_sparse_attn_func(q, k, v, is_causal=True, softmax_threshold=0.01)
out = fsa.flash_gated_attn_func(q, k, v, alpha, delta, is_causal=True)
Requires pip install kernels.
Below are examples for the three common attention variants:
import torch
from flash_sparse_attn.ops.triton.interface import (
    flash_dense_attn_func,
    flash_sparse_attn_func,
    flash_gated_attn_func,
)
dtype = torch.bfloat16
device = torch.device("cuda")
# num_kv_heads < num_heads, so this setup exercises Grouped Query Attention.
batch_size, seqlen_q, seqlen_k, num_heads, num_kv_heads, head_dim = 2, 1024, 1024, 8, 2, 64
# Inputs use the (batch, seqlen, heads, head_dim) layout.
query = torch.randn(batch_size, seqlen_q, num_heads, head_dim, dtype=dtype, device=device)
key = torch.randn(batch_size, seqlen_k, num_kv_heads, head_dim, dtype=dtype, device=device)
value = torch.randn(batch_size, seqlen_k, num_kv_heads, head_dim, dtype=dtype, device=device)
Use this when you do not need explicit sparsification but still want an efficient attention kernel.
output_dense = flash_dense_attn_func(
    query=query,
    key=key,
    value=value,
    is_causal=True,
)
print(output_dense.shape)
Use this when you want to skip low-contribution attention weights through softmax_threshold and reduce effective compute on long sequences.
output_sparse = flash_sparse_attn_func(
    query=query,
    key=key,
    value=value,
    is_causal=True,
    softmax_threshold=1.0,
)
print(output_sparse.shape)
Use this when you need explicit gating signals for sparse attention. alpha controls query-side gating and delta controls key-side gating.
alpha = torch.randn(batch_size, num_heads, seqlen_q, device=device, dtype=dtype)
delta = torch.randn(batch_size, num_kv_heads, seqlen_k, device=device, dtype=dtype)
output_gated = flash_gated_attn_func(
    query=query,
    key=key,
    value=value,
    alpha=alpha,
    delta=delta,
    is_causal=True,
    softmax_threshold=1.0,
    gate_threshold=1.0,
)
print(output_gated.shape)
The following benchmarks cover forward, backward, and decode workloads. They include dense, sparse, and gated implementations, with FlashAttention used as the baseline.
Figures (not reproduced here): four sets of Forward Performance, Backward Performance, and Decode Performance plots.
Benchmark scripts are located under tests, covering forward, backward, and decoding performance.
python tests/benchmark_forward.py
python tests/benchmark_backward.py
python tests/benchmark_decode.py
If you use FSA in your research, please cite:
@misc{shi2025trainabledynamicmasksparse,
title={Trainable Dynamic Mask Sparse Attention},
author={Jingze Shi and Yifan Wu and Bingheng Wu and Yiran Peng and Liangdong Wang and Guang Liu and Yuyu Luo},
year={2025},
eprint={2508.02124},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2508.02124},
}
This project builds upon and integrates several excellent works:
- OpenSeek - Kernel development support
- Flash-Attention - Memory-efficient attention computation
- NVIDIA CUTLASS - High-performance matrix operations library
We thank the open-source community for its contributions to efficient Transformer implementations. 🤗












