Beyond NVIDIA: Groq, Cerebras, Tenstorrent, AMD MI355

Section 58.1

"There are only two hard things in computer science: cache invalidation, naming things, and bandwidth."

Adapted from Phil Karlton, with apologies, 2024
Big Picture

For ten years "what do you run an LLM on" had one answer. By 2026 it has five, each optimized for a different point on the training-inference / latency-throughput / centralized-edge surface. Understanding which silicon serves which workload is now a first-order architectural decision, not a procurement detail.

Prerequisites

This section assumes familiarity with LLM compute planning from Section 57.1 and with GPU procurement from Section 57.3. Familiarity with cross-hardware benchmarking from Section 57.4 helps when reading the spec-sheet comparisons.

For nearly a decade NVIDIA data-center GPUs (most recently the H100 and the Blackwell B200) were the only practical answer to "what do you train and serve a frontier model on". That stopped being true around the end of 2025. Non-NVIDIA inference silicon now means real money and real deployments: Cerebras CS-3, Groq LPU (now NVIDIA-owned), Tenstorrent, and AMD's MI355X each carve out a slice. Training is still dominated by NVIDIA Blackwell-class clusters, but inference, especially low-latency token streaming, is genuinely heterogeneous in 2026.

The single most consequential transaction was NVIDIA's $20B acquisition of Groq in December 2025. NVIDIA absorbed the LPU technology into the new "Vera Rubin LPX" inference rack architecture; the resulting product positions inference silicon as a first-class peer to training silicon inside NVIDIA's own line. Whether that consolidates the market or accelerates the alternatives (Cerebras IPO, AMD MI355 ramp, AWS Trainium roadmap) is one of the three open questions this section ends on.

58.1.1 Cerebras CS-3 and the wafer-scale bet

Fun Fact

Cerebras's wafer-scale chip has roughly 900,000 cores on a single piece of silicon, an engineering feat that would have been impossible without TSMC's willingness to ship a wafer with defects and let Cerebras route around them in software. Defect tolerance is a requirement rather than a bonus: almost every wafer Cerebras ships contains some defects.

Cerebras CS-3 packages 900,000 cores on a single wafer-scale chip with 44 GB of on-chip SRAM and 21 PB/s of memory bandwidth. The pitch is bandwidth-density: inference latency for large language models is bandwidth-bound, and CS-3 has more bandwidth per square centimetre than any conventional GPU cluster. In January 2026 Cerebras signed a $10B+ deal with OpenAI for 750 MW of capacity, the first hyperscaler commitment to wafer-scale silicon as a primary inference fabric. Cerebras filed for IPO in March 2026.

Under the Hood: Cerebras wafer-scale engine

A wafer-scale engine treats one entire silicon wafer as a single chip: ~900,000 small cores joined by an on-wafer mesh (Swarm) so core-to-core messages never leave the die. Each core has its own local SRAM; there is no shared HBM, so the model's weights are streamed layer-by-layer from an external memory appliance (MemoryX) while activations stay resident, and the compiler lays out each layer's matmul across a region of cores in a dataflow pipeline. Because every wafer has fab defects, a redundant routing layer maps out bad cores and reconnects the mesh, so the part ships as a fully-connected fabric. The payoff is enormous aggregate SRAM bandwidth, which is what bounds batch-1 LLM latency.
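The defect-mapping idea can be illustrated with a toy model. The sketch below is not Cerebras's actual mechanism; it assumes a hypothetical grid in which every row carries spare columns, and it simply maps each logical core onto the next healthy physical core in the same row. The function names and grid sizes are invented for illustration.

```python
def remap_row(physical_cols, defective_cols, logical_cols):
    """Map logical columns 0..logical_cols-1 onto healthy physical columns."""
    healthy = [c for c in range(physical_cols) if c not in defective_cols]
    if len(healthy) < logical_cols:
        raise ValueError("row has too many defects for the available spares")
    # Logical core i is served by the i-th healthy physical core in the row.
    return healthy[:logical_cols]


def remap_grid(rows, physical_cols, logical_cols, defects):
    """Build a per-row logical-to-physical map; defects is a set of (row, col)."""
    grid_map = []
    for row in range(rows):
        bad = {c for (r, c) in defects if r == row}
        grid_map.append(remap_row(physical_cols, bad, logical_cols))
    return grid_map


# A 3 x 6 physical grid exposing a 3 x 5 logical fabric (one spare per row).
core_map = remap_grid(rows=3, physical_cols=6, logical_cols=5,
                      defects={(0, 2), (2, 5)})
for row, cols in enumerate(core_map):
    print(f"row {row}: logical cores use physical columns {cols}")
```

Software above this layer only ever sees the logical grid, which is the sense in which a wafer with defects can still present a fully connected fabric.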

58.1.2 Groq LPU (now inside NVIDIA Vera Rubin)

The Groq LPU (Language Processing Unit) is a deterministic-latency accelerator optimized for token streaming. Through 2024-25 Groq held the public latency crown, hitting 1000+ tokens/second on small open-weight models. NVIDIA acquired Groq in December 2025 for $20B; the LPU is now sold as the "LPX" inference rack alongside the Vera Rubin training rack. Existing Groq cloud customers still buy capacity via the Groq Console; the silicon itself ships through NVIDIA channels going forward.

Under the Hood: Groq LPU deterministic dataflow

The Groq LPU achieves deterministic latency by removing the hardware that makes GPUs unpredictable. It holds weights and activations in large on-chip SRAM rather than HBM, so there are no cache misses or memory-controller queues, and it has no dynamic schedulers or speculative execution. Instead the compiler statically lays out every operation and data movement onto the chip's functional units cycle by cycle, ahead of time. Because the schedule is fixed at compile time, the chip executes the same instruction stream every run with no runtime arbitration, which is why per-token latency is both very low and essentially constant. The trade-off is limited per-chip memory, so large models are sharded across many LPUs.
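A toy scheduler makes the "fixed at compile time" point concrete. This is a minimal sketch, not Groq's compiler: the unit labels (`mem`, `mxm`, `vxm`), the operations, and the cycle counts are all invented for illustration. It assumes every operation has a fixed, known latency, which is what allows the total latency to be computed before anything runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    unit: str         # functional unit the op occupies (hypothetical label)
    cycles: int       # fixed latency, known ahead of time
    deps: tuple = ()  # names of ops whose results this op consumes


def compile_schedule(ops):
    """Assign each op a start cycle ahead of time.

    Ops must be listed in topological order. An op starts as soon as its
    inputs are ready and its functional unit is idle.
    """
    finish = {}     # op name -> cycle at which its result is available
    unit_free = {}  # unit -> first cycle at which the unit is idle
    schedule = []
    for op in ops:
        ready = max((finish[d] for d in op.deps), default=0)
        start = max(ready, unit_free.get(op.unit, 0))
        finish[op.name] = start + op.cycles
        unit_free[op.unit] = finish[op.name]
        schedule.append((start, op.unit, op.name))
    return schedule, max(finish.values())


program = [
    Op("load_w", "mem", 4),
    Op("load_x", "mem", 2),
    Op("matmul", "mxm", 8, ("load_w", "load_x")),
    Op("bias_add", "vxm", 1, ("matmul",)),
    Op("load_w2", "mem", 4),
    Op("matmul2", "mxm", 8, ("bias_add", "load_w2")),
]

schedule, total_cycles = compile_schedule(program)
for start, unit, name in sorted(schedule):
    print(f"cycle {start:2d}  {unit:3s}  {name}")
print("total cycles:", total_cycles)
```

Because nothing in the schedule depends on runtime state, compiling the same program twice yields the same cycle count, which is the property that turns into constant per-token latency on real hardware.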

58.1.3 Tenstorrent (RISC-V chiplet)

Tenstorrent takes a different bet: RISC-V cores in a chiplet architecture, developed under CEO Jim Keller. Tenstorrent raised $700M in December 2024 and shipped the Blackhole accelerator in 2025; their pitch is open-source compute, no CUDA lock-in, and a clear path to chiplet composability for custom system builds. The 2026 deployments are in research and HPC labs, not yet hyperscalers.

58.1.4 AMD MI355X and the second-source story

AMD's MI355X ships ~6 TB/s of HBM bandwidth and 288 GB of HBM3e per accelerator, the most memory of any commodity-channel chip in 2026. AMD's ROCm stack now supports PyTorch, vLLM, and most of the open-weight inference path; for many workloads MI355X is a viable second source. Hyperscalers (Microsoft, Oracle) have committed multi-billion-dollar orders. AWS's parallel story is Trainium 2 (GA December 2024) and Trainium 4 (expected late 2026).

58.1.5 Comparing the non-NVIDIA silicon

Table 58.1.1: Non-NVIDIA inference silicon, mid-2026.
| Silicon | Type | Memory | Best for | Status |
|---|---|---|---|---|
| Cerebras CS-3 | Wafer-scale | 44 GB on-chip | Bandwidth-bound LLM inference | OpenAI deal, IPO pending |
| Groq LPU / NVIDIA LPX | Deterministic-latency | per-chip SRAM | Lowest-latency token streaming | NVIDIA-owned since Dec 2025 |
| Tenstorrent Blackhole | RISC-V chiplet | 32 GB GDDR6 | Open-source compute, custom builds | Research / HPC deployments |
| AMD MI355X | GPU + ROCm | 288 GB HBM3e | Largest-memory workloads | Hyperscaler-committed |
| AWS Trainium 2 / 4 | Cloud-locked ASIC | 96 GB HBM | AWS-native training/inference | Trainium 4 late 2026 |
Bar chart of effective memory bandwidth by chip family on a log scale: A100 at 2,000 GB/s, H100 at 3,350, MI355X at 6,000, B200 at 8,000, Groq LPU at 80,000, and Cerebras CS-3 off-scale at 21,000,000 GB/s.
Figure 58.1.1a: Memory bandwidth, not FLOPs, sets batch-1 inference latency. SRAM-class silicon (Groq, Cerebras) sits at least an order of magnitude above HBM-based GPUs, which is why it holds the latency crown despite far lower nominal FLOPs.
Five frontier silicon families on training vs inference and latency vs throughput axes
Figure 58.1.2: The five 2026 frontier-silicon families positioned on the training-vs-inference axis (horizontal) and the throughput-vs-latency axis (vertical). Bubble size reflects effective memory-bandwidth class. NVIDIA Blackwell B200 and AMD MI355X dominate the training quadrant; Cerebras CS-3 anchors high-throughput inference; the Groq LPU (now NVIDIA-owned LPX since the December 2025 $20B acquisition) holds the latency quadrant. The dashed arrow is the dominant 2026 deployment pattern from the OpenAI Cerebras case study: train on Blackwell, serve on Cerebras. AWS Trainium 2/4 and Tenstorrent Blackhole serve narrower cloud-locked and research niches respectively.
Key Insight: Mental Model: memory bandwidth dominates compute

The 2026 silicon story is best understood by ignoring FLOPs and watching where the data lives. A 70B model in fp16 reads 140 GB per token at batch-1; at 1000 tokens/s that demands 140 TB/s of memory bandwidth, more than any single GPU provides. Cerebras and Groq win because they put weights physically next to the arithmetic units (wafer-scale SRAM, distributed SRAM with deterministic routing). HBM gates a B200 even though its tensor cores are over-provisioned. The right mental model: think of frontier silicon as a memory hierarchy with arithmetic units bolted on, not the reverse. Horace He's "Making Deep Learning Go Brrrr" remains the clearest treatment.
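The 140 GB and 140 TB/s figures follow from two multiplications, sketched below. This is a back-of-envelope estimate that assumes every weight is read exactly once per generated token and ignores KV-cache and activation traffic; the function names are illustrative, and the 8 TB/s input is the B200 HBM3e bandwidth quoted elsewhere in this section.

```python
TB = 1e12  # bytes


def weight_bytes(params, bytes_per_param):
    """Bytes of weights that must be read to produce one token at batch-1."""
    return params * bytes_per_param


def required_bandwidth(params, bytes_per_param, tokens_per_s):
    """Memory bandwidth (bytes/s) needed to sustain a target decode rate."""
    return weight_bytes(params, bytes_per_param) * tokens_per_s


def max_tokens_per_s(params, bytes_per_param, bandwidth):
    """Upper bound on batch-1 decode rate for a given memory bandwidth."""
    return bandwidth / weight_bytes(params, bytes_per_param)


# 70B parameters in fp16 (2 bytes each), streamed at 1000 tokens/s.
need = required_bandwidth(70e9, 2, 1000)
print(f"bandwidth needed: {need / TB:.0f} TB/s")

# The same model on a single device with 8 TB/s of HBM.
ceiling = max_tokens_per_s(70e9, 2, 8 * TB)
print(f"batch-1 ceiling at 8 TB/s: {ceiling:.1f} tokens/s")
```

The second number is a ceiling, not a prediction: real serving stacks shard the model across several devices and batch requests, both of which change the arithmetic.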

Real-World Scenario: Cerebras CS-3 + OpenAI ($10B+, January 2026)

On January 14, 2026, OpenAI signed a $10B+, 750 MW capacity commitment with Cerebras for inference-only deployment. Why inference and not training? Because OpenAI's GPT-5.5 was already trained on Blackwell, but serving it to a billion daily users had become the larger compute line item. The CS-3's 21 PB/s on-wafer bandwidth let OpenAI run frontier-class models at sub-100ms first-token latency without holding the model in HBM. This deal also explains the timing of Cerebras's March 2026 IPO filing: anchor-customer revenue was now contracted, so public-market valuation became defensible. Whether Anthropic and Google strike similar deals through 2026-27 is the open competitive question.

Warning: comparing PFLOPS is misleading without bandwidth context

A vendor benchmark sheet showing "5 PFLOPS FP16" tells you almost nothing about LLM serving latency. The question that matters is what fraction of peak FLOPs the memory subsystem can feed. Blackwell B200 is rated at 4.5 PFLOPS FP8 but only 8 TB/s of HBM3e. Batch-1 decoding performs about 2 FLOPs per weight and reads every weight once per token, so with FP8 weights the HBM can feed roughly 16 TFLOPS, well under 1% of peak. Cerebras CS-3 has 21 PB/s of on-die bandwidth and comes much closer to saturating its arithmetic. Always ask for three numbers: peak FLOPs, peak memory bandwidth, and the arithmetic intensity of your workload. The three together predict throughput; any one of them in isolation does not. MIT's HERMES paper formalises this for multi-stage pipelines.
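The three-number rule is the roofline model, and it fits in a few lines. The sketch below is an idealised decode-only estimate: it assumes about 2 FLOPs per parameter per sequence and that weights are read once per decoding step, and it ignores KV-cache traffic and kernel overheads, so measured utilisation will differ. The function names are illustrative; the 4.5 PFLOPS and 8 TB/s inputs are the B200 figures quoted in this section.

```python
def attainable_flops(peak_flops, bandwidth, intensity):
    """Roofline: throughput is capped by compute or by bandwidth * intensity."""
    return min(peak_flops, bandwidth * intensity)


def decode_intensity(batch_size, bytes_per_param):
    """Approximate FLOPs per weight byte read while decoding.

    Each parameter contributes ~2 FLOPs (multiply + add) per sequence, and
    one read of the weights is shared by every sequence in the batch.
    """
    return 2.0 * batch_size / bytes_per_param


def ridge_batch_size(peak_flops, bandwidth, bytes_per_param):
    """Batch size at which decoding stops being bandwidth-bound."""
    return (peak_flops / bandwidth) * bytes_per_param / 2.0


PEAK_FP8 = 4.5e15  # FLOP/s
HBM_BW = 8e12      # bytes/s

intensity = decode_intensity(batch_size=1, bytes_per_param=1.0)
achieved = attainable_flops(PEAK_FP8, HBM_BW, intensity)
fraction_of_peak = achieved / PEAK_FP8
ridge = ridge_batch_size(PEAK_FP8, HBM_BW, bytes_per_param=1.0)

print(f"batch-1 attainable: {achieved / 1e12:.0f} TFLOPS "
      f"({fraction_of_peak:.2%} of peak)")
print(f"bandwidth-bound until batch size ~{ridge:.0f}")
```

The ridge batch size is the practical takeaway: below it, adding FLOPs does nothing for latency, and only more bandwidth or smaller weights help.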

Tip: vendor-neutral latency dashboards

Trust independent benchmarks over vendor press releases. Artificial Analysis measures tokens/sec and time-to-first-token for the same models across providers, which is the fairest available like-for-like comparison of Groq, Cerebras, NVIDIA, and AMD serving stacks. LMArena complements it with model-quality rankings rather than speed measurements. Press-release numbers are typically large-batch throughput on a friendly workload.
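When reproducing such numbers yourself, the two headline metrics can be computed from token arrival timestamps alone. The sketch below is a minimal helper under stated assumptions: `stream_metrics` is a hypothetical name, timestamps are in seconds on a single clock, and the decode rate excludes the first token so that prefill time is not counted twice. Public dashboards may define the rate differently.

```python
def stream_metrics(request_sent, token_times):
    """Compute time-to-first-token and decode rate from arrival timestamps."""
    if not token_times:
        raise ValueError("no tokens were received")
    ttft = token_times[0] - request_sent
    decode_window = token_times[-1] - token_times[0]
    n_decoded = len(token_times) - 1  # tokens that arrived after the first
    if n_decoded > 0 and decode_window > 0:
        tokens_per_s = n_decoded / decode_window
    else:
        tokens_per_s = float("nan")   # a single token has no decode rate
    return {
        "ttft_s": ttft,
        "tokens_per_s": tokens_per_s,
        "total_s": token_times[-1] - request_sent,
    }


# Five tokens: the first after 0.5 s, then one every 0.25 s.
metrics = stream_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(metrics)
```

Run the same prompt set against each provider and compare medians rather than single runs, since network jitter affects time-to-first-token more than it affects the decode rate.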

Research Frontier: Open questions that 2027 will settle

Three questions stay open at the time of writing. First, whether the NVIDIA-Groq consolidation forecloses an independent inference-silicon ecosystem or accelerates it (Cerebras IPO, AMD MI355 ramp, AWS Trainium roadmap are the natural test cases). Second, whether bandwidth-density architectures (Cerebras CS-3) prove out at 1T+ parameter scale or whether they hit yield walls. Third, whether AMD's ROCm stack catches up enough that MI355X becomes a true second source rather than a contingency. Section 58.2 turns to the other axis of divergence: training across the public internet.

Self-Check
Q1: Why does Groq LPU outperform B200 on batch-1 latency despite having far lower nominal FLOPs?
Batch-1 inference is bandwidth-bound, not FLOPs-bound. A 70B model in fp16 must read ~140 GB of weights per token, so the throughput ceiling is set by how fast memory can feed the arithmetic units. B200 has roughly 8 TB/s of HBM3e, which leaves its tensor cores running at a small fraction of peak. Groq's LPU pins model weights into distributed on-chip SRAM with deterministic routing, hitting effective bandwidth orders of magnitude higher than HBM; the arithmetic actually saturates, and tokens per second goes up even though nominal FLOPs are far lower. This is the same roofline argument Horace He makes in "Making Deep Learning Go Brrrr."
Q2: Cerebras CS-3 has 21 PB/s of on-wafer bandwidth. Why does the same chip not yet dominate training?
Training workloads are gradient-bound across many chips, not weight-bound within one chip. A frontier training run spreads optimizer state and gradients across thousands of accelerators and is bottlenecked by cluster-level all-reduce throughput on NVLink and InfiniBand, not by single-chip memory bandwidth. The existing GPU training stack (Megatron, PyTorch FSDP) is tuned for clusters of HBM-equipped GPUs, and re-targeting it to a one-wafer-per-rack topology is a multi-year tooling investment. Cerebras's bandwidth advantage solves the wrong layer of the training problem, which is why its 2026 wins are in inference (the OpenAI $10B+ inference deal) rather than training.
Q3: What is the practical implication of NVIDIA's Groq acquisition for an inference customer in mid-2026?
Existing Groq cloud customers continue to consume capacity via the Groq Console; new procurement happens through NVIDIA channels as the "Vera Rubin LPX" inference rack. The technology has been absorbed into NVIDIA's stack, so customers who valued Groq's independence as a hedge against NVIDIA pricing have lost it. The live economic question is whether NVIDIA prices LPX defensively (to retain existing low-latency customers without cannibalizing B200 training revenue) or aggressively (to grow share against Cerebras CS-3 and AMD MI355X). Inference customers should keep Cerebras and AMD warm as second sources rather than assuming the post-acquisition LPX roadmap will stay favorable.
What's Next

Centralized accelerators define one end of the design space; the other end pushes training across geographically distributed nodes with limited bandwidth. Continue to Section 58.2: Decentralized Training: Nous Psyche, DeMo, DisTrO.

Further Reading
Cerebras Systems, "CS-3 vs Groq LPU" (2025): vendor comparison with technical specs.
NVIDIA Newsroom, "Vera Rubin LPX (Groq integration)" (Dec 2025 / Mar 2026).
Suvinay Subramanian et al., "HERMES: Heterogeneous Multi-Stage LLM Inference" (MIT, 2025).
Markman Capital, "Cerebras: Everything You Need to Know" (Q1 2026): financial / OpenAI deal coverage.
Horace He, "Making Deep Learning Go Brrrr": the canonical roofline / bandwidth intro.
AImultiple, "AI Chip Makers in 2026": continually-updated market survey.