Next Generation of FlashAttention

Jul 11, 2024

By Vijay Thakkar and Fred Oh

NVIDIA is excited to collaborate with Colfax, Together.ai, Meta, and Princeton University on their recent work, which exploits the Hopper GPU architecture and its Tensor Cores to accelerate key fused attention kernels using CUTLASS 3.

FlashAttention-3 incorporates key techniques that make it 1.5–2.0x faster than FlashAttention-2 with FP16, reaching up to 740 TFLOPS. With FP8, FlashAttention-3 reaches up to 1.2 PFLOPS, with 2.6x smaller errors than baseline FP8 attention.
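To put the FP16 figure in context, here is a rough back-of-the-envelope sketch, not a benchmark. It assumes a dense FP16 Tensor Core peak of about 989 TFLOPS for an H100 SXM GPU, a figure that is not stated in this post, and it uses the common convention of counting two matrix multiplications per attention head in the forward pass. The function names and the example workload shape are hypothetical and chosen only for illustration.

```python
# Assumption: approximate dense FP16 Tensor Core peak of an H100 SXM GPU, in TFLOPS.
# This value is not from the post; substitute the peak for your own hardware.
ASSUMED_H100_FP16_PEAK_TFLOPS = 989.0


def attention_forward_flops(batch, heads, seqlen, head_dim):
    """FLOPs for a non-causal attention forward pass.

    Counts the two matmuls per head (Q @ K^T and P @ V), each costing
    2 * seqlen * seqlen * head_dim floating-point operations.
    """
    return 4 * batch * heads * seqlen * seqlen * head_dim


def utilization(achieved_tflops, peak_tflops=ASSUMED_H100_FP16_PEAK_TFLOPS):
    """Fraction of the assumed peak throughput that a kernel reaches."""
    return achieved_tflops / peak_tflops


if __name__ == "__main__":
    # 740 TFLOPS is the FP16 figure quoted in the post.
    print(f"{utilization(740.0):.0%} of assumed peak")  # prints: 75% of assumed peak

    # Hypothetical workload shape, used only to illustrate the arithmetic.
    flops = attention_forward_flops(batch=1, heads=16, seqlen=8192, head_dim=128)
    ideal_ms = flops / (740.0 * 1e12) * 1e3
    print(f"{flops:.3e} FLOPs, about {ideal_ms:.2f} ms at 740 TFLOPS")
```

Under these assumptions, 740 TFLOPS corresponds to roughly three quarters of the assumed FP16 peak. Real kernels also spend time on softmax, memory movement, and synchronization, so the FLOP count describes only the matrix-multiply portion of the work.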

CUTLASS is an open-source CUDA library intended to enable deep learning and HPC practitioners to achieve speed-of-light performance on NVIDIA Tensor Core GPUs, for custom algorithms as well as research and production workloads.

For more information about the collaboration, see the FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision post and research paper.

Related resources

GTC session: FlashAttention: Fast and Memory-Efficient Exact Attention With IO-Awareness

About the Authors

About Vijay Thakkar
Vijay Thakkar is a senior compute architect at NVIDIA and the primary author of CUTLASS 3. In addition to his work on CUTLASS, he is involved in the development of Tensor Core architecture, PTX exposure, and programming model across the GPU architecture, compiler, and CUDA engineering teams.

About Fred Oh
Fred is a senior product marketing manager for CUDA, CUDA on WSL, and CUDA Python. Fred has a B.S. in Computer Science and Math from UC Davis. He began his career as a UNIX software engineer porting kernel services and device drivers to x86 architectures. He loves Star Wars, Star Trek and the NBA Warriors.