FlashAttention-4 and Inference Kernels for Blackwell

Section 58.4

"FlashAttention is the kernel that taught a generation of researchers to read CUDA. FlashAttention-4 is the kernel that taught them to read Blackwell."

TensorTensor, Kernel-Whisperer AI Agent
Note: Learning Objectives
Big Picture

The kernel layer used to be invisible to architecture researchers. By 2026 it sets the speed of architectural progress: a new attention variant cannot ship at frontier scale until someone writes a fast kernel for it. FA-4 is the canonical example of this co-evolution, and its Blackwell-specific design choices reveal why every NVIDIA generation rewrites the attention kernel from scratch.

Prerequisites

This section assumes the attention mechanics from Section 4.2, the FlashAttention-1 and -2 background from Section 9.5, and basic CUDA familiarity.

Every generation of NVIDIA hardware (Ampere, Hopper, Blackwell) breaks the previous generation's attention kernels, because the streaming multiprocessor (SM) architecture changes and the tensor-core instructions get new shapes. The FlashAttention line of work (Dao et al., 2022 onwards) is the canonical adaptation: each generation produces a new IO-aware exact-attention kernel that fuses the softmax into the tensor-core pipeline and stays close to roofline performance. FlashAttention-4 (Dao et al., March 2026) is the Blackwell-generation entry.

What makes FA-4 different from its predecessors is that Blackwell's SMs are asymmetric: tensor cores and special function units run at different effective throughputs, and the optimal kernel must pipeline the two paths separately. The Modal blog's reverse-engineering writeup and the Together AI deep-dive remain the clearest explainers; the paper is dense.

58.4.1 Why each GPU generation rewrites the kernel

Fun Fact

Tri Dao published FlashAttention in 2022 as a graduate student, and within a year every major training stack quietly switched to it. The trick is that the math of attention is unchanged, the order of memory accesses is different, and the GPU stops choking on its own HBM bandwidth. NVIDIA shipped Blackwell B200 in 2024 with hardware features that essentially required a FlashAttention rewrite, which is how we got FA-4 in 2026. The recurring joke in the kernel-writing community is that FlashAttention is now a yearly subscription, paid in PhD theses.

FlashAttention-1 (2022) targeted A100. FA-2 (2023) targeted H100 Hopper. FA-3 (2024) targeted H100 with FP8 and warp-specialization. FA-4 (2026) targets Blackwell B200/B300 and the new asymmetric tensor cores that mix MMA throughputs across SM partitions. The headline number is roughly 2x faster than FA-3 on the same workload, with new support for variable-length attention masks (essential for variable-length packed-sequence training) and group-query attention pipelining.

Vertical timeline showing FlashAttention's four versions: FA-1 (2022
Figure 58.4.1: Four FlashAttention versions in four years, one per NVIDIA hardware generation. The kernel rewrites itself because the SM architecture and tensor-core instruction set rewrite themselves. A100 Ampere SM, IO-aware tiling and online softmax in FP16), FA-2 (2023, H100 Hopper, reordered work for better SM occupancy), FA-3 (2024, H100 with warp-specialization and FP8 for 1.5-2x speedup), and FA-4 (2026, Blackwell B200/B300 with asymmetric SM pipelining for 2x over FA-3). Each version is paired with its underlying architectural change in NVIDIA hardware.

58.4.2 Algorithm and kernel pipelining co-design

The FA-4 paper's central technical move is to design the algorithm and the kernel together. Previous FlashAttention papers held the algorithm fixed (tile, softmax-online, IO-aware) and varied the kernel for new hardware. FA-4 changes both: the algorithm now schedules the softmax and the GEMM on separate SM partitions, with explicit "asymmetric pipelining" of the two paths. This is the kind of optimization that requires hand-written CUDA / PTX; NVIDIA's Blackwell architecture documentation is the reference for the underlying instructions.

Three nested boxes representing the Blackwell memory hierarchy: outer HBM3e hold
Figure 58.4.2: FlashAttention-4 tiling on a Blackwell SM. The IO-aware idea from FA-1 (keep tiles in fast SRAM) is preserved; FA-4's new contribution is to schedule the matmul (tensor partition) and the softmax (special-function partition) asymmetrically so they overlap instead of stalling each other. That overlap is the ~2x speedup over FA-3. s full Q, K, V matrices and the naive O(seqlen^2) attention matrix; inner SRAM/L1 per SM holds resident Q tile plus streamed K and V tiles plus an online softmax accumulator; innermost Registers hold the running m, l, O statistics. The SRAM box also splits into Tensor partition (tcgen05 MMA) and SFU partition (exp, division, softmax) with an arc labeled async pipelining. A bandwidth bar at the bottom shows HBM ~8 TB/s, SRAM ~20 TB/s, Registers ~80 TB/s.

58.4.3 The wider inference-kernel ecosystem

FA-4 is the highest-profile but not the only kernel that matters in 2026. The broader stack:

58.4.4 Comparing the inference-kernel paths

Table 58.4.1a: 63.4.1 Inference-kernel options on modern hardware.
Kernel / frameworkHardwareSpeedup over baselineBest for
FlashAttention-4Blackwell B200/B300~2x over FA-3Frontier training and prefill
FlashAttention-3Hopper H100~1.5-2x over FA-2 with FP8H100 deployments
vLLM PagedAttentionAny~5-10x throughputProduction serving
TensorRT-LLMNVIDIA only~1.5-3x over vLLMNVIDIA-locked production
bitnet.cppx86 / ARM CPUs2.4-6.2x x86, 1.4-5x ARM CPUCPU-only LLM serving
Key Insight
Mental Model: memory hierarchy is the bottleneck, not compute

The FA line of work has one recurring theme: the algorithm performance is set by where you can keep the tile, not by how fast the multiplier runs. Blackwell HBM3e is ~8 TB/s; SRAM/L1 is ~20 TB/s; registers are ~80 TB/s. Tiling so the working set lives in SRAM, with online softmax keeping running statistics in registers, is the whole game. FA-4 added one more step: schedule the tensor-core partition and the special-function partition asymmetrically so neither stalls. The mental model: attention is a streaming problem with a memory budget, not a FLOP-counting problem with an arithmetic budget. Once you internalize that, every kernel innovation in the FA-1 to FA-4 line looks like the same idea applied to a new memory wall. The January 2026 LLM Inference Hardware survey formalises the trend.

Real-World Scenario: DeepSeek V4 Compressed Sparse Attention

When DeepSeek V4 shipped (Q1 2026) with Compressed Sparse Attention (CSA), the open-weight community immediately discovered that vLLM, TensorRT-LLM, and SGLang did not have CSA kernels. For roughly 4-6 weeks "the weights are public" coexisted with "the inference is 3-5x slower than dense baselines" because nobody had written a fused CSA kernel. The vLLM team shipped a first CSA implementation in early Q2 2026; TensorRT-LLM followed in late Q2. This pattern is the new normal: every novel attention variant ships with a custom kernel, and your effective throughput depends on whether your serving stack has caught up. Always check kernel support before committing to deploying a novel architecture.

Warning: FA-4 is specific to Blackwell SMs

FA-4's "asymmetric pipelining" optimization assumes Blackwell's tensor-core and special-function-unit ratios; it does not produce speedups on Hopper (H100/H200) and may regress on Ampere (A100). The compile-time guard in the official kernel rejects pre-Blackwell architectures. If you maintain a mixed-fleet cluster (some H100, some B200), you need both FA-3 and FA-4 in your build, with runtime dispatch by GPU arch. Triton-based forks (for AMD MI355X, Intel Gaudi) implement the FA algorithm but not the FA-4-specific kernel; they typically land at 60-70% of CUDA FA-4 throughput on equivalent hardware. The non-portability of frontier kernels is one of the costs of NVIDIA's tooling lead.

Warning: kernel availability gates frontier-model adoption

When DeepSeek V4 shipped with Compressed Sparse Attention, vLLM and TensorRT-LLM did not support CSA out of the box. The result was a multi-week lag between "weights are public" and "production inference is fast". This is the new normal: every novel attention variant ships with a custom kernel, and if your serving stack does not have it, your effective throughput collapses. Check kernel support before you commit to deploying a new architecture.

DeepSeek V4 CSA: weights-public vs kernel-ready lag in 2026
Figure 58.4.3: The DeepSeek V4 CSA timeline that the section's case study describes. The open-weight release in Q1 2026 led to a 4-6 week window where production serving stacks ran CSA through generic fallback kernels at 3-5x the dense-attention latency. vLLM shipped a first CSA implementation in early Q2 (closing most of the gap, ~70% of FA-4 throughput); TensorRT-LLM followed in late Q2 with NVIDIA-optimized kernels at FA-4-class throughput. The pattern is the new normal: novel attention variants (CSA, sliding-window, multi-head latent attention) ship with custom kernels, and effective deployment latency depends on whether the serving stack has caught up.

58.4.5 What changes for non-NVIDIA silicon

FA-4 is CUDA / PTX-only. The non-NVIDIA equivalents are written in Triton (for AMD MI355X), in custom SDKs (Cerebras WSE, Groq LPU), or in OpenCL / Vulkan (mobile). The bandwidth and instruction asymmetries differ across vendors; the FlashAttention algorithm is portable but the kernel is not. This is one of the unresolved questions of 2027: will Triton converge across vendors as a unified kernel language, or will each silicon ship its own framework? Section 58.5 closes the chapter with the systems-level co-design question this implies.

Key Takeaways
Self-Check
Q1. Why does FA-4 not produce a speedup on H100?
Show Answer
The asymmetric-pipelining schedule assumes Blackwell's tensor-core / special-function-unit ratio. Hopper's SM partition layout differs, so the schedule does not align and the compile-time guard refuses to build.
Q2. A team wants to deploy DeepSeek V4 in February 2026. What should they check before promising a latency SLA?
Show Answer
CSA kernel support in their serving stack. Until vLLM/TensorRT-LLM/SGLang ships a fused CSA kernel, fallback paths can be 3-5x slower than dense baselines on equivalent hardware.

What's Next?

In the next section, Section 58.5: Training-Inference Co-Design, we build on the material covered here.

Further Reading
Dao et al., "FlashAttention-4" (March 2026): the Blackwell-generation kernel.
Dao et al., "FlashAttention-3" (2024): the FP8 + warp-specialization H100 kernel.
Modal Labs, "Reverse-engineering FlashAttention-4": clearest explainer.
vLLM project, "vLLM PagedAttention".