The Specialists vs Generalists

| Company | Chip | Focus | Peak TFLOPS | Notes |
|---|---|---|---|---|
| Google | TPU | Train + Infer | 430 | Data center |
| NVIDIA | H100 | All compute | 1,450 | Expensive, flexible |
| Groq | LPU | Inference (LLM) | 3,800 | Narrow, fast |
| Cerebras | Wafer-Scale Engine | Training | Unknown (huge) | Experimental |
| SambaNova | Dataflow | Training + Infer | 12,800 | Limited capacity |

Groq LPU (Language Processing Unit)
Extreme specialization: LLM inference only
- No systolic array (unnecessary for inference)
- Fixed hardware for token-by-token generation
- Clock speed: 1 GHz (high power density, requires cooling)
- Peak: 3,800 TFLOPS (single chip)
- Memory: on-chip SRAM rather than HBM (about 230 MB per chip, so large models are sharded across many chips)
- Use case: running LLaMA (up to 70B) and Mixtral in real time
Why This Works for Inference
LLM inference pattern:
1. Load model weights (one-time, slow)
2. Feed tokens through network
3. Generate next token (parallelizable with batch)
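
The three steps above can be sketched as a toy decode loop. This is a minimal illustration, not Groq's or any vendor's implementation: the "model" is a random bigram score table, and `load_weights`, `next_token`, `generate`, and `generate_batch` are hypothetical names chosen for this example.

```python
import random


def load_weights(vocab_size, seed=0):
    """Step 1: one-time setup (here, a random bigram score table)."""
    rng = random.Random(seed)
    # weights[prev][nxt] is an unnormalized score for `nxt` following `prev`.
    return [[rng.random() for _ in range(vocab_size)] for _ in range(vocab_size)]


def next_token(weights, tokens):
    """Steps 2-3: run the context through the 'network', pick the next token.

    A real LLM attends over all of `tokens`; this toy only uses the last one.
    """
    scores = weights[tokens[-1]]
    return max(range(len(scores)), key=scores.__getitem__)  # greedy argmax


def generate(weights, prompt, max_new_tokens):
    """Autoregressive loop: each new token depends on the tokens before it."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(weights, tokens))
    return tokens


def generate_batch(weights, prompts, max_new_tokens):
    """Requests in a batch are independent, so hardware can run them in
    parallel; this sketch simply loops over them."""
    return [generate(weights, p, max_new_tokens) for p in prompts]


weights = load_weights(vocab_size=16)  # paid once
print(generate(weights, prompt=[3, 7], max_new_tokens=5))
print(generate_batch(weights, prompts=[[1], [2]], max_new_tokens=3))
```

Within a single request the loop is strictly sequential, since token N+1 cannot be computed before token N. The parallelism mentioned in step 3 comes from the batch dimension, not from the loop itself.
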
Traditional systolic array: designed around training (many passes over large batches)
Groq LPU: designed for serving (one forward pass per generated token)
Result: a quoted 430 tokens/sec on a GPT-3-class model (vs. about 50 tokens/sec on a GPU)
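
To see what those throughput figures mean for a user, it can help to convert them into per-token latency and total response time. The sketch below uses the 430 and 50 tokens/sec figures quoted above; the 500-token response length is an assumption picked only for illustration.

```python
def per_token_latency_ms(tokens_per_sec):
    """Average time between consecutive tokens, in milliseconds."""
    return 1000.0 / tokens_per_sec


def response_time_s(num_tokens, tokens_per_sec):
    """Time to stream a full response, ignoring prompt processing."""
    return num_tokens / tokens_per_sec


LPU_TPS = 430       # tokens/sec, figure quoted in the text
GPU_TPS = 50        # tokens/sec, figure quoted in the text
RESPONSE_LEN = 500  # hypothetical response length in tokens

for name, tps in [("LPU", LPU_TPS), ("GPU", GPU_TPS)]:
    print(
        f"{name}: {per_token_latency_ms(tps):.1f} ms/token, "
        f"{response_time_s(RESPONSE_LEN, tps):.1f} s for {RESPONSE_LEN} tokens"
    )

print(f"speedup: {LPU_TPS / GPU_TPS:.1f}x")
```

Under these assumptions the LPU streams the response in about 1.2 s versus 10 s on the GPU, a speedup of 8.6×.
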
Cerebras Wafer-Scale Engine
Extreme integration: a single chip spanning an entire wafer
- 300 mm (12-inch) wafer with 2.6 trillion transistors (WSE-2)
- 850,000 cores (compared to the H100's ~16,000 CUDA cores)
- Local memory per core (no global HBM needed)
- Still experimental, limited software ecosystem
SambaNova Reconfigurable
Dataflow units that reconfigure per model:
- Neither a fixed systolic array (like the TPU) nor a fixed GPU pipeline (like the H100)
- Can rewire MACs for different layer sizes
- Claimed: 12.8 POPS (peta-operations per second, an uncommon metric that is not directly comparable to FLOPS)
- Challenge: the programming model is complex (low adoption)
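
One way to see why rewiring MACs for different layer sizes might matter is to compute how much of a fixed-size MAC array does useful work when a layer does not divide evenly into tiles. This is a generic sketch, not a model of SambaNova or TPU hardware: the 128×128 array size and the function name `fixed_array_utilization` are assumptions made for illustration.

```python
import math


def fixed_array_utilization(rows, cols, tile):
    """Fraction of MACs doing useful work when a rows x cols weight matrix
    is tiled onto a fixed tile x tile array (edge tiles are zero-padded)."""
    tiles_r = math.ceil(rows / tile)
    tiles_c = math.ceil(cols / tile)
    useful = rows * cols
    provisioned = tiles_r * tiles_c * tile * tile
    return useful / provisioned


TILE = 128  # hypothetical fixed array dimension

for rows, cols in [(128, 128), (130, 130), (200, 200), (4096, 4096)]:
    u = fixed_array_utilization(rows, cols, TILE)
    print(f"{rows}x{cols} layer on a {TILE}x{TILE} array: {u:.0%} utilization")
```

Layers that are exact multiples of the tile size reach 100%, while a 130×130 layer fills only about a quarter of the padded tiles. A fabric that can regroup its MACs to match the layer shape could, in principle, avoid that padding waste, at the cost of the more complex programming model noted above.
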
The Tradeoff
Specialization gains:
- Groq: roughly 9× faster LLM inference by the figures above (but only LLMs)
- Cerebras: Massive parallelism (but software immature)
- SambaNova: Flexibility (but hard to program)
Generalization wins:
- TPU: train + infer (good for both)
- H100: all workloads (most flexible)
Production reality: Google/NVIDIA win market share despite lower peak TFLOPS because developers know how to use them.
Day 25: Other accelerators (Qualcomm Hexagon, Intel Gaudi, AWS Trainium).