Choosing the right GPU is one of the most consequential decisions in any LLM project. The table below summarizes the key specifications of the most widely used accelerators for LLM workloads as of early 2026. Specifications are for the data center (SXM or OAM) variants unless otherwise noted.
GPU Comparison
| GPU | Vendor | Memory | Memory BW | BF16 TFLOPS | FP8 TFLOPS | Interconnect | TDP |
|---|---|---|---|---|---|---|---|
| A100 SXM | NVIDIA | 80 GB HBM2e | 2.0 TB/s | 312 | N/A | NVLink 3 (600 GB/s) | 400W |
| H100 SXM | NVIDIA | 80 GB HBM3 | 3.35 TB/s | 990 | 1,979 | NVLink 4 (900 GB/s) | 700W |
| H200 SXM | NVIDIA | 141 GB HBM3e | 4.8 TB/s | 990 | 1,979 | NVLink 4 (900 GB/s) | 700W |
| B100 | NVIDIA | 192 GB HBM3e | 8.0 TB/s | 1,750 | 3,500 | NVLink 5 (1,800 GB/s) | 700W |
| B200 | NVIDIA | 192 GB HBM3e | 8.0 TB/s | 2,250 | 4,500 | NVLink 5 (1,800 GB/s) | 1,000W |
| MI300X | AMD | 192 GB HBM3 | 5.3 TB/s | 1,307 | 2,615 | Infinity Fabric (896 GB/s) | 750W |
| L40S | NVIDIA | 48 GB GDDR6 | 864 GB/s | 181 | 366 | PCIe 4.0 (64 GB/s) | 350W |
Reading the Table
- Memory = the total VRAM available for model weights, activations, and KV cache (HBM on the data center parts, GDDR6 on the L40S).
- Memory BW = bandwidth between the memory and the compute cores; this is often the bottleneck during inference.
- BF16/FP8 TFLOPS = theoretical peak dense tensor throughput in tera floating-point operations per second at each precision (vendor sheets sometimes quote 2x higher "with sparsity" figures).
- TDP = thermal design power, relevant for cooling and electricity costs.
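One way to combine the compute and bandwidth columns is a back-of-envelope roofline check: divide peak TFLOPS by memory bandwidth to get the arithmetic intensity (operations per byte read) a workload must exceed before the GPU becomes compute-bound rather than bandwidth-bound. A minimal sketch, using the H100 numbers from the table (the helper name is illustrative, not from any library):

```python
def ops_per_byte(tflops: float, bw_tbs: float) -> float:
    """Compute-to-bandwidth ratio: peak TFLOPS / bandwidth in TB/s.

    Workloads with lower arithmetic intensity than this are
    memory-bandwidth-bound; higher, compute-bound.
    """
    return tflops / bw_tbs

# H100 SXM at BF16: 990 TFLOPS, 3.35 TB/s -> ~296 ops per byte
print(f"{ops_per_byte(990, 3.35):.0f}")
```

Autoregressive decoding at batch size 1 performs roughly 2 operations per parameter while reading 2 bytes per parameter (BF16), i.e. about 1 op/byte, which is why decode sits deep in the bandwidth-bound regime on every GPU in the table.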
Key Takeaways from the Spec Sheet
- Memory capacity determines which models you can run. A 70B parameter model in BF16 requires approximately 140 GB of VRAM for the weights alone, before KV cache and activations, meaning a single A100 (80 GB) cannot hold it, but a single H200 (141 GB) or MI300X (192 GB) can.
- Memory bandwidth determines inference speed. At small batch sizes, the tokens-per-second rate during autoregressive decoding is almost entirely memory-bandwidth-bound, because each generated token requires reading the full model weights from memory.
- Compute FLOPS determine training speed. Training throughput scales with raw TFLOPS because matrix multiplications dominate the computation. The jump from H100 to B200 is roughly 2.3x in BF16.
- Interconnect matters for multi-GPU setups. Tensor parallelism across GPUs requires fast inter-GPU communication. NVLink/NVSwitch is far superior to PCIe for multi-GPU training and inference.
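The first two takeaways can be turned into a rough sizing sketch. This is a deliberate simplification: it counts weights only (no KV cache, activations, or framework overhead), and the decode estimate assumes a perfectly memory-bandwidth-bound single-stream loop, so it is a ceiling rather than a prediction. The helper names are illustrative:

```python
def weights_gb(params_b: float, bytes_per_param: int = 2) -> float:
    """Weight footprint in GB for a model with params_b billion
    parameters (BF16 = 2 bytes per parameter)."""
    return params_b * bytes_per_param  # 1e9 params * N bytes = N GB

def decode_tokens_per_s(mem_bw_tbs: float, params_b: float,
                        bytes_per_param: int = 2) -> float:
    """Upper bound on single-stream decode speed: each token reads
    every weight once, so rate <= bandwidth / model size."""
    return (mem_bw_tbs * 1000) / weights_gb(params_b, bytes_per_param)

print(weights_gb(70))                # 70B in BF16 -> 140.0 GB
print(decode_tokens_per_s(4.8, 70))  # H200 ceiling: ~34 tokens/s
```

The same arithmetic explains why quantization helps twice: FP8 weights halve both the memory footprint (70 GB for a 70B model) and the bytes read per token, roughly doubling the decode ceiling.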