Your access request will be approved quickly if you: (1) complete all form fields in full detail, and (2) clearly demonstrate your project's significance, including the product it is used in or targets and its economic benefit. (Commercial use cases receive the highest priority.)

Approval times are prioritized by project impact. Submissions for high-value commercial applications are typically reviewed within 48 hours. Additionally, we will consider sharing the SageAttention3 code with significant projects at a later date.



license: apache-2.0 (Commercial applications are also allowed!)

SageAttention

This repository provides the official implementation of SageAttention, SageAttention2, and SageAttention2++, which achieve substantial speedups on most GPUs, without losing accuracy, across all models in a plug-and-play way.

SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
Paper: https://arxiv.org/abs/2410.02367
Jintao Zhang, Jia Wei, Haofeng Huang, Pengle Zhang, Jun Zhu, Jianfei Chen

SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization
Paper: https://arxiv.org/abs/2411.10958
Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, Jianfei Chen

SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training
Paper: https://arxiv.org/abs/2505.11594
Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jun Zhu, Jianfei Chen

Installation

Base environment

  • python>=3.9 , torch>=2.3.0 , triton>=3.0.0
  • CUDA:
    • >=12.8 for Blackwell and SageAttention2++
    • >=12.4 for fp8 support on Ada
    • >=12.3 for fp8 support on Hopper
    • >=12.0 for Ampere
  • flash-attn (needed only for benchmarking)
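
A quick way to verify that your environment satisfies these requirements is a check like the following (an illustrative sketch, not part of the repository; the compute-capability thresholds are the usual ones for Ada/Hopper FP8 support):

import torch

assert torch.cuda.is_available(), "SageAttention kernels require a CUDA GPU"
major, minor = torch.cuda.get_device_capability()
print(f"torch {torch.__version__}, CUDA {torch.version.cuda}, compute capability sm_{major}{minor}")

# FP8 PV kernels need sm_89 (Ada) or sm_90 (Hopper) and a sufficiently new CUDA toolkit.
if (major, minor) >= (8, 9):
    print("FP8 kernels (SageAttention2/2++) should be available on this GPU")
else:
    print("Expect the INT8 QK^T + FP16 PV kernels to be selected instead")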

Install Package

To use SageAttention 2.2.0 (which includes SageAttention2++), please compile from source:

git clone https://github.com/thu-ml/SageAttention.git
cd SageAttention
python setup.py install  # or pip install -e .
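
After installation, a short smoke test (illustrative; requires a CUDA GPU) confirms that the extension built correctly:

import torch
from sageattention import sageattn

# Small random FP16 tensors in the default "HND" layout: (batch, heads, seq_len, head_dim)
q, k, v = (torch.randn(1, 4, 256, 64, dtype=torch.float16, device="cuda") for _ in range(3))
out = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
print("SageAttention kernel ran, output shape:", tuple(out.shape))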

To benchmark the speed against FlashAttention3, please compile FlashAttention3 from source:

git clone https://github.com/Dao-AILab/flash-attention.git --recursive
cd flash-attention
git checkout b7d29fb3b79f0b78b1c369a52aaa6628dabfb0d7 # 2.7.2 release
cd hopper
python setup.py install

How to Use

Note that the default API already uses SageAttention2++, corresponding to the kernel _qattn_sm89.qk_int8_sv_f8_accum_f16_fuse_v_scale_attn_inst_buf.

from sageattention import sageattn
attn_output = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
  • q, k, v are FP16/BF16 tensors with shape (batch_size, head_num, seq_len, head_dim) under the default tensor_layout="HND". For the shape (batch_size, seq_len, head_num, head_dim), set tensor_layout="NHD".
  • is_causal determines the use of a causal mask.
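
A fuller usage sketch (assuming a CUDA GPU; the comparison against torch's reference attention is only illustrative, and a small quantization error is expected):

import torch
import torch.nn.functional as F
from sageattention import sageattn

batch, heads, seq_len, head_dim = 2, 16, 1024, 64
q = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Default layout "HND": (batch_size, head_num, seq_len, head_dim)
out_sage = sageattn(q, k, v, tensor_layout="HND", is_causal=True)
out_ref = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print("max abs difference:", (out_sage - out_ref).abs().max().item())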

Available APIs:

  • sageattn: Automatically selects the optimal kernel based on the GPU to achieve a good performance-accuracy trade-off.
  • sageattn_qk_int8_pv_fp16_triton: INT8 quantization for $QK^\top$ and FP16 for $PV$ using Triton backend.
  • sageattn_qk_int8_pv_fp16_cuda: INT8 quantization for $QK^\top$ and FP16 for $PV$ using CUDA backend.
  • sageattn_qk_int8_pv_fp8_cuda: INT8 quantization for $QK^\top$ and FP8 for $PV$ using CUDA backend (this is the kernel path used by the default SageAttention2++ API).
  • sageattn_qk_int8_pv_fp8_cuda_sm90: INT8 quantization for $QK^\top$ and FP8 for $PV$ using CUDA backend, specifically optimized for Hopper GPUs.
  • sageattn_varlen: INT8 quantization for $QK^\top$ and FP16 for $PV$ using Triton backend. Supports varying sequence lengths within the same batch.
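
For example, a specific backend can be called directly instead of relying on automatic dispatch (a sketch; it assumes the backend-specific functions are importable from the top-level package and accept the same core arguments as sageattn, and it omits any extra backend-specific options):

import torch
from sageattention import sageattn_qk_int8_pv_fp8_cuda

q, k, v = (torch.randn(2, 8, 2048, 128, dtype=torch.bfloat16, device="cuda") for _ in range(3))
# INT8 QK^T + FP8 PV kernel (requires an FP8-capable GPU, e.g. Ada or Hopper)
out = sageattn_qk_int8_pv_fp8_cuda(q, k, v, tensor_layout="HND", is_causal=False)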

For optimal speed and accuracy on custom devices and models, we strongly recommend referring to this file for detailed guidance.

Note: Different sequence lengths between q and k, v, as well as grouped-query attention, are supported.
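
A sketch of both features together (it assumes the usual GQA convention that the number of query heads is a multiple of the number of key/value heads):

import torch
from sageattention import sageattn

# 32 query heads attending over 8 key/value heads, with different sequence lengths
q = torch.randn(1, 32, 4096, 128, dtype=torch.float16, device="cuda")
k = torch.randn(1, 8, 1024, 128, dtype=torch.float16, device="cuda")
v = torch.randn(1, 8, 1024, 128, dtype=torch.float16, device="cuda")
out = sageattn(q, k, v, tensor_layout="HND", is_causal=False)  # shape (1, 32, 4096, 128)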

Plug-and-play Example

Note: Not all models work with F.scaled_dot_product_attention = sageattn. Strictly speaking, you should replace the original attention by modifying the attention class of the target model. For image and video models, we suggest replacing only the attention in the DiT (see example/mochi.py for details).
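
Both strategies in sketch form (MyDiTAttention is a hypothetical stand-in for the target model's own attention class, not an API of this repository):

import torch
import torch.nn.functional as F
from sageattention import sageattn

# 1) Global monkey-patch: simple, but not every model routes its attention through
#    F.scaled_dot_product_attention, and keyword conventions (e.g. attn_mask) may differ.
F.scaled_dot_product_attention = sageattn

# 2) Targeted replacement inside the model's attention class (preferred): call sageattn
#    exactly where the original attention computation happened.
class MyDiTAttention(torch.nn.Module):
    def forward(self, q, k, v):
        # q, k, v: FP16/BF16, shape (batch, heads, seq_len, head_dim)
        return sageattn(q, k, v, tensor_layout="HND", is_causal=False)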

Kernel Benchmarking

We provide a benchmarking script to compare the speed of different kernels including SageAttention, FlashAttention2 and FlashAttention3. Please refer to the benchmark/ directory for more details.
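
For a quick, informal comparison outside the official script, a timing sketch along these lines can be used (illustrative only; it relies on Triton's do_bench helper and uses torch's scaled_dot_product_attention as the baseline rather than FlashAttention built from source):

import torch
import torch.nn.functional as F
from triton.testing import do_bench
from sageattention import sageattn

q, k, v = (torch.randn(2, 32, 8192, 128, dtype=torch.float16, device="cuda") for _ in range(3))
ms_sage = do_bench(lambda: sageattn(q, k, v, tensor_layout="HND", is_causal=False))
ms_sdpa = do_bench(lambda: F.scaled_dot_product_attention(q, k, v, is_causal=False))
print(f"sageattn: {ms_sage:.3f} ms, torch SDPA: {ms_sdpa:.3f} ms")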

Performance

Speed of Kernels

"8+8" denotes a kernel with INT8 quantization for $QK^\top$ and FP8 quantization for $PV$; "8+16" uses INT8 for $QK^\top$ and FP16 with an FP16 accumulator for $PV$.

[Figures: kernel speed (TOPS) comparisons across GPUs]

Note: The TOPS results refer only to the attention kernel itself, excluding the quantization and smoothing steps.
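
For reference, a TOPS figure for an attention kernel is usually derived from the standard FLOP count of two matmuls ($QK^\top$ and $PV$), each costing $2 \cdot B \cdot H \cdot N^2 \cdot D$ operations; the exact accounting used for the plots may differ in details such as causal masking:

def attention_tops(batch, heads, seq_len, head_dim, time_ms, causal=False):
    # Two matmuls (QK^T and PV), each 2 * B * H * N^2 * D operations;
    # a causal mask roughly halves the useful work.
    ops = 4 * batch * heads * seq_len * seq_len * head_dim
    if causal:
        ops /= 2
    return ops / (time_ms * 1e-3) / 1e12  # tera-operations per second

print(attention_tops(2, 32, 8192, 128, time_ms=1.0))  # ~2199 TOPS if this config ran in 1 ms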

End-to-end Performance

End-to-End Accuracy:

[Figures: end-to-end accuracy comparisons across models]

End-to-End Speedup:

[Figure: end-to-end speedup across models]

Citation

If you use this code or find our work valuable, please cite:

@inproceedings{zhang2025sageattention,
  title={SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration}, 
  author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Zhu, Jun and Chen, Jianfei},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2025}
}
@inproceedings{zhang2024sageattention2,
  title={SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization},
  author={Zhang, Jintao and Huang, Haofeng and Zhang, Pengle and Wei, Jia and Zhu, Jun and Chen, Jianfei},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2025}
}
@article{zhang2025sageattention3,
  title={SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training},
  author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Xu, Xiaoming and Huang, Haofeng and Wang, Haoxu and Jiang, Kai and Zhu, Jun and Chen, Jianfei},
  journal={arXiv preprint arXiv:2505.11594},
  year={2025}
}
@article{zhang2025sageattention2++,
  title={SageAttention2++: A More Efficient Implementation of SageAttention2},
  author={Zhang, Jintao and Xu, Xiaoming and Wei, Jia and Huang, Haofeng and Zhang, Pengle and Xiang, Chendong and Zhu, Jun and Chen, Jianfei},
  journal={arXiv preprint arXiv:2505.21136},
  year={2025}
}