Choosing the right GPU is one of the most consequential decisions in any LLM project. The table below summarizes the key specifications of the most widely used accelerators for LLM workloads as of early 2026. Specifications are for the data center (SXM or OAM) variants unless otherwise noted.
GPU Comparison
| GPU | Vendor | Memory | Memory BW | BF16 TFLOPS | FP8 TFLOPS | Interconnect | TDP |
|---|---|---|---|---|---|---|---|
| A100 SXM | NVIDIA | 80 GB HBM2e | 2.0 TB/s | 312 | N/A | NVLink 3 (600 GB/s) | 400W |
| H100 SXM | NVIDIA | 80 GB HBM3 | 3.35 TB/s | 990 | 1,979 | NVLink 4 (900 GB/s) | 700W |
| H200 SXM | NVIDIA | 141 GB HBM3e | 4.8 TB/s | 990 | 1,979 | NVLink 4 (900 GB/s) | 700W |
| B100 | NVIDIA | 192 GB HBM3e | 8.0 TB/s | 1,750 | 3,500 | NVLink 5 (1,800 GB/s) | 700W |
| B200 | NVIDIA | 192 GB HBM3e | 8.0 TB/s | 2,250 | 4,500 | NVLink 5 (1,800 GB/s) | 1,000W |
| MI300X | AMD | 192 GB HBM3 | 5.3 TB/s | 1,307 | 2,615 | Infinity Fabric (896 GB/s) | 750W |
| L40S | NVIDIA | 48 GB GDDR6 | 864 GB/s | 181 | 366 | PCIe 4.0 (64 GB/s) | 350W |
Reading the Table
- Memory = the total VRAM available for model weights, activations, and KV cache (HBM on the data center parts, GDDR6 on the L40S).
- Memory BW = bandwidth between the memory and the compute cores; this is often the bottleneck during inference.
- BF16/FP8 TFLOPS = theoretical peak dense tensor throughput in tera floating-point operations per second at each precision (vendor sheets sometimes quote 2x higher "with sparsity" figures).
- TDP = thermal design power, relevant for cooling and electricity costs.
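One way to combine the compute and bandwidth columns is a back-of-envelope roofline check: divide peak TFLOPS by memory bandwidth to get the arithmetic intensity (operations per byte read) a workload must exceed before the GPU becomes compute-bound rather than bandwidth-bound. A minimal sketch, using the H100 numbers from the table (the helper name is illustrative, not from any library):

```python
def ops_per_byte(tflops: float, bw_tbs: float) -> float:
    """Compute-to-bandwidth ratio: peak TFLOPS / bandwidth in TB/s.

    Workloads with lower arithmetic intensity than this are
    memory-bandwidth-bound; higher, compute-bound.
    """
    return tflops / bw_tbs

# H100 SXM at BF16: 990 TFLOPS, 3.35 TB/s -> ~296 ops per byte
print(f"{ops_per_byte(990, 3.35):.0f}")
```

Autoregressive decoding at batch size 1 performs roughly 2 operations per parameter while reading 2 bytes per parameter (BF16), i.e. about 1 op/byte, which is why decode sits deep in the bandwidth-bound regime on every GPU in the table.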
Key Takeaways from the Spec Sheet
- Memory capacity determines which models you can run. A 70B parameter model in BF16 requires approximately 140 GB of VRAM for the weights alone, before KV cache and activations, meaning a single A100 (80 GB) cannot hold it, but a single H200 (141 GB) or MI300X (192 GB) can.
- Memory bandwidth determines inference speed. At small batch sizes, the tokens-per-second rate during autoregressive decoding is almost entirely memory-bandwidth-bound, because each generated token requires reading the full model weights from memory.
- Compute FLOPS determine training speed. Training throughput scales with raw TFLOPS because matrix multiplications dominate the computation. The jump from H100 to B200 is roughly 2.3x in BF16.
- Interconnect matters for multi-GPU setups. Tensor parallelism across GPUs requires fast inter-GPU communication. NVLink/NVSwitch is far superior to PCIe for multi-GPU training and inference.
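The first two takeaways can be turned into a rough sizing sketch. This is a deliberate simplification: it counts weights only (no KV cache, activations, or framework overhead), and the decode estimate assumes a perfectly memory-bandwidth-bound single-stream loop, so it is a ceiling rather than a prediction. The helper names are illustrative:

```python
def weights_gb(params_b: float, bytes_per_param: int = 2) -> float:
    """Weight footprint in GB for a model with params_b billion
    parameters (BF16 = 2 bytes per parameter)."""
    return params_b * bytes_per_param  # 1e9 params * N bytes = N GB

def decode_tokens_per_s(mem_bw_tbs: float, params_b: float,
                        bytes_per_param: int = 2) -> float:
    """Upper bound on single-stream decode speed: each token reads
    every weight once, so rate <= bandwidth / model size."""
    return (mem_bw_tbs * 1000) / weights_gb(params_b, bytes_per_param)

print(weights_gb(70))                # 70B in BF16 -> 140.0 GB
print(decode_tokens_per_s(4.8, 70))  # H200 ceiling: ~34 tokens/s
```

The same arithmetic explains why quantization helps twice: FP8 weights halve both the memory footprint (70 GB for a 70B model) and the bytes read per token, roughly doubling the decode ceiling.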