AI Chip Design Day 24

Specialists vs. Generalists

Company	Chip	Focus	Claimed peak (TFLOPS or TOPS)	Notes
Google	TPU	Train + Infer	430	Data center
NVIDIA	H100	All compute	1,450	Expensive, flexible
Groq	LPU	Inference (LLM)	3,800	Narrow, fast
Cerebras	WSE (wafer-scale)	Training	Unknown (very large)	Experimental
SambaNova	Dataflow	Training + Infer	12,800	Limited capacity

Groq LPU (Language Processing Unit)

Extreme specialization: LLM inference only

No systolic array (unnecessary for inference)
Fixed hardware for token-by-token generation
Clock speed: 1 GHz (high power density, requires cooling)
Peak: 3,800 TFLOPS (single chip)
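Memory: on-chip SRAM rather than HBM (roughly 230 MB per chip on the first-generation GroqChip, so large models are split across many chips)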
Use case: Run LLaMA (up to the 70B variant) and Mixtral in real time

Why This Works for Inference

LLM inference pattern:
1. Load model weights (one-time, slow)
2. Feed tokens through network
3. Generate next token (parallelizable with batch)

Traditional systolic: Designed for training (multiple epochs)
Groq LPU: Designed for serving (one-pass inference)

Result: Achieves 430 tokens/sec for GPT-3 (vs GPU's 50 tokens/sec)
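
The three-step pattern above can be sketched as a toy decode loop. This is a minimal illustration under stated assumptions, not Groq's implementation: `load_weights`, `next_token`, `generate`, and `generation_seconds` are hypothetical names, and the "model" is a made-up lookup table standing in for a real network. The last helper only does the arithmetic behind the tokens/sec comparison.

```python
def load_weights(vocab_size: int) -> list[list[int]]:
    """Step 1: one-time setup. Toy 'weights': a table of scores indexed
    by (previous token, candidate token). Purely illustrative."""
    return [
        [(3 * prev + 7 * cand) % vocab_size for cand in range(vocab_size)]
        for prev in range(vocab_size)
    ]


def next_token(weights: list[list[int]], context: list[int]) -> int:
    """Steps 2-3: score every candidate given the last token and pick
    the highest-scoring one (greedy decoding)."""
    row = weights[context[-1]]
    return max(range(len(row)), key=lambda cand: row[cand])


def generate(weights: list[list[int]], prompt: list[int], n_new: int) -> list[int]:
    """Token-by-token generation: each new token is appended to the
    context before the next one is produced."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(next_token(weights, tokens))
    return tokens


def generation_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    """Time to emit n_tokens at a steady throughput."""
    return n_tokens / tokens_per_sec


weights = load_weights(vocab_size=5)       # loaded once, reused for every request
print(generate(weights, prompt=[0], n_new=4))
print(generation_seconds(500, 430), generation_seconds(500, 50))
```

At the rates quoted above, a 500-token reply would take about 1.2 s at 430 tokens/sec versus 10 s at 50 tokens/sec, a ratio of 8.6×.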

Cerebras Wafer-Scale Engine

Extreme integration: the entire wafer is a single chip

12-inch (300 mm) wafer; the WSE-2 has 2.6 trillion transistors
850,000 cores on the WSE-2, up from 400,000 on the first-generation WSE (compared to H100's ~16,000 CUDA cores)
Local memory per core (no global HBM needed)
Still experimental, limited software ecosystem
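
To make the "local memory per core" idea concrete, here is a toy sketch of a matrix-vector product in which each core keeps only its own rows of the weight matrix and never reads from a shared memory pool. It is an assumption-laden illustration, not Cerebras's programming model: `shard_rows`, `core_matvec`, and `wafer_matvec` are hypothetical names, and the cores are simulated sequentially in plain Python.

```python
def shard_rows(matrix: list[list[float]], n_cores: int) -> list[list[list[float]]]:
    """Split the rows into n_cores contiguous shards, as evenly as possible.
    Each shard models the weights held in one core's local memory."""
    base, extra = divmod(len(matrix), n_cores)
    shards, start = [], 0
    for core in range(n_cores):
        size = base + (1 if core < extra else 0)
        shards.append(matrix[start:start + size])
        start += size
    return shards


def core_matvec(local_rows: list[list[float]], x: list[float]) -> list[float]:
    """Work done by one core: dot products using only its local rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in local_rows]


def wafer_matvec(matrix: list[list[float]], x: list[float], n_cores: int) -> list[float]:
    """Broadcast x to every core, let each compute its slice of the output,
    then concatenate the slices in core order."""
    out: list[float] = []
    for local_rows in shard_rows(matrix, n_cores):
        out.extend(core_matvec(local_rows, x))
    return out


print(wafer_matvec([[1, 2], [3, 4], [5, 6]], [1, 1], n_cores=2))
```

The design point this illustrates is that the weights stay put: only the input vector and the output slices move between cores.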

SambaNova Reconfigurable Dataflow Unit (RDU)

Dataflow units that reconfigure per model:

Neither a fixed systolic array (like the TPU) nor a fixed GPU pipeline (like the H100)
Can rewire MACs for different layer sizes
Claimed: 12.8 POPS (peta-operations per second, equal to 12,800 TOPS; a rarely quoted metric that is not the same as petaFLOPS)
Challenge: Complex programming model (low adoption)
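
One way to see why rewiring MACs for different layer sizes could help is to compare utilization on a fixed array against the best shape a reconfigurable fabric could pick. The sketch below is a toy model, not SambaNova's compiler: `utilization` and `best_shape` are hypothetical names, the 128×128 array and the 96×512 layer are assumed example sizes, and real mappings involve many more constraints than padding.

```python
import math


def utilization(rows: int, cols: int, tile_rows: int, tile_cols: int) -> float:
    """Fraction of MAC slots doing useful work when a rows x cols layer is
    tiled onto a tile_rows x tile_cols array (partial tiles are padded)."""
    padded_rows = math.ceil(rows / tile_rows) * tile_rows
    padded_cols = math.ceil(cols / tile_cols) * tile_cols
    return (rows * cols) / (padded_rows * padded_cols)


def best_shape(rows: int, cols: int, total_macs: int) -> tuple[tuple[int, int], float]:
    """Try every (r, c) with r * c == total_macs and return the shape with
    the highest utilization for this layer (first one found on ties)."""
    best = None
    for r in range(1, total_macs + 1):
        if total_macs % r:
            continue
        c = total_macs // r
        u = utilization(rows, cols, r, c)
        if best is None or u > best[1]:
            best = ((r, c), u)
    return best


layer = (96, 512)                      # assumed layer size for illustration
fixed = utilization(*layer, 128, 128)  # fixed square array
shape, reconfigured = best_shape(*layer, total_macs=128 * 128)
print(fixed, shape, reconfigured)
```

With these assumed sizes the fixed 128×128 array wastes a quarter of its MAC slots on padding, while the same number of MACs arranged as 32×512 fits the layer exactly.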

The Tradeoff

Specialization gains:
- Groq: roughly 9× faster LLM inference (430 vs 50 tokens/sec, but only LLMs)
- Cerebras: Massive parallelism (but software immature)
- SambaNova: Flexibility (but hard to program)

Generalization wins:
- TPU: train + infer (good for both)
- H100: all workloads (most flexible)

Production reality: Google/NVIDIA win market share despite lower peak TFLOPS because developers know how to use them.