The Strategic Vision: Why Google Built TPU
Context (2017-2020): TPU v1 through v3 were data-center chips built primarily for Google's own workloads (v2 and v3 were also offered through Google Cloud), while GPUs dominated the wider market. Heading into TPU v4, Google faced a problem:
- Training LLMs at scale (175B+ parameters) cost $5M+ per run
- NVIDIA GPUs (V100, A100) were mature, fast, but expensive
- Custom silicon cost $100M+ to design and tape out
- But the ROI was clear: if TPU v4 were 2-3× more efficient, Google would save hundreds of millions in training costs
TPU v4's mission: Be the most efficient AI training chip ever built, optimized specifically for transformer models.
The Core Design: 256×256 Systolic Array
🔷 Why 256×256 and Not Smaller or Larger?
Analysis of array size trade-offs:
| Array Size | Total MACs | Die Area (mm²) | Peak TFLOPS | TFLOPS/mm² | Wiring Complexity |
|---|---|---|---|---|---|
| 128×128 | 16,384 | 20 | 107 | 5.4 | Low |
| 256×256 | 65,536 | 80 | 430 | 5.4 | Medium |
| 512×512 | 262,144 | 320 | 1,720 | 5.4 | Extreme |
Observation: TFLOPS/mm² is constant! So why pick 256×256?
- Power density: Larger array = longer wires = more resistance = more power loss
- Synthesis complexity: At 512×512, place-and-route becomes infeasible (routing congestion)
- Yield: Larger die = higher chance of a defect on each die (yield falls off roughly exponentially with area)
- Latency: Signal propagation across a 512×512 array exceeds the clock period (timing closure fails)
- Sweet spot: 256×256 fits all constraints and provides 430 TFLOPS/chip
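The scaling behind the table can be reproduced with a few lines of arithmetic. The sketch below takes the 256×256 row as its reference point and assumes area and peak throughput both grow with the MAC count. The yield estimate uses a simple Poisson model, and its defect density (`defect_density_per_cm2`) is an illustrative assumption, not a figure from this article.

```python
import math

# Reference point: the 256x256 row of the table above.
REF_N = 256
REF_AREA_MM2 = 80.0
REF_TFLOPS = 430.0


def scale_array(n, defect_density_per_cm2=0.1):
    """Scale the reference row to an n x n array.

    Assumes area and peak throughput both scale with the number of MACs
    (n squared). The defect density is a hypothetical value used only to
    show the shape of the yield curve.
    """
    ratio = (n / REF_N) ** 2
    area_mm2 = REF_AREA_MM2 * ratio
    tflops = REF_TFLOPS * ratio
    # Poisson yield model: Y = exp(-A * D0), with A converted to cm^2.
    yield_fraction = math.exp(-(area_mm2 / 100.0) * defect_density_per_cm2)
    return {
        "macs": n * n,
        "area_mm2": area_mm2,
        "tflops": tflops,
        "tflops_per_mm2": tflops / area_mm2,
        "yield": yield_fraction,
    }


for size in (128, 256, 512):
    row = scale_array(size)
    print(size, row["macs"], row["area_mm2"], row["tflops"],
          round(row["tflops_per_mm2"], 2), round(row["yield"], 3))
```

Compute density stays flat across sizes while the modeled yield drops as the array grows, which is the trade-off the list above describes.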
The Memory Subsystem: Why 8 GB HBM3 + 24 MB SRAM?
🎯 Memory Capacity vs Bandwidth Optimization
Problem: Large transformer models need massive memory (LLaMA 70B = 280 GB @ FP32, 140 GB @ BF16). A single TPU with 8 GB HBM can't hold the entire model.
Solution: Stream weights from HBM into SRAM cache. Compute on streamed data.
| Level | Size | Bandwidth | Latency | Purpose |
|---|---|---|---|---|
| Register File | ~50 KB | 10 TB/s | 1 cycle | Immediate operands |
| SRAM (weight cache) | 24 MB | 500 GB/s | 5-10 cycles | Hold current layer weights |
| HBM3 | 8 GB | 2 TB/s | 50-100 cycles | Full model storage |
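Design choice: the 24 MB SRAM is a staging buffer for the current layer's working set (weights + activations). While one layer computes, the next layer's weights are prefetched into SRAM. For a model the size of LLaMA 70B, a single layer (about 875M parameters, roughly 1.75 GB at BF16) is much larger than 24 MB, so its weights pass through SRAM in tiles.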
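The benefit of prefetching can be illustrated with a toy timing model. This is a minimal sketch, assuming every layer has the same load and compute time and that a load can fully overlap with compute; the function name and the millisecond values in the usage lines are hypothetical.

```python
def pipeline_time(n_layers, load_ms, compute_ms, prefetch=True):
    """Total time to run n_layers when each layer's weights must be loaded
    into SRAM before that layer can compute.

    Without prefetch, load and compute alternate serially.
    With prefetch, the load for layer i+1 overlaps the compute of layer i,
    so only the first load and the last compute are fully exposed.
    """
    if not prefetch:
        return n_layers * (load_ms + compute_ms)
    overlapped_steps = n_layers - 1
    return load_ms + overlapped_steps * max(load_ms, compute_ms) + compute_ms


# Hypothetical timings, chosen only to show the effect.
serial = pipeline_time(80, load_ms=1.0, compute_ms=2.0, prefetch=False)
overlapped = pipeline_time(80, load_ms=1.0, compute_ms=2.0, prefetch=True)
print(serial, overlapped)
```

When the load time is shorter than the compute time, the loads are almost entirely hidden; when it is longer, the memory system sets the pace.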
Why 2 TB/s HBM Bandwidth Is Critical
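One way to see the role of bandwidth is a roofline estimate. The sketch below uses the 430 TFLOPS and 2 TB/s figures from this article and assumes each operand and the result cross the HBM interface exactly once, which is an idealization. The helper names are illustrative.

```python
PEAK_TFLOPS = 430.0   # per-chip peak, from this article
HBM_TBPS = 2.0        # HBM bandwidth, from this article
BYTES_PER_ELEM = 2    # BF16


def matmul_intensity(m, k, n, bytes_per_elem=BYTES_PER_ELEM):
    """FLOPs per byte of HBM traffic for an (m x k) @ (k x n) matmul,
    assuming both inputs are read once and the output is written once."""
    flops = 2 * m * k * n
    traffic_bytes = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic_bytes


def attainable_tflops(intensity):
    """Roofline bound: the lower of the compute limit and the bandwidth limit."""
    return min(PEAK_TFLOPS, intensity * HBM_TBPS)


# Below this many FLOPs per byte, the chip is bandwidth-bound.
ridge_point = PEAK_TFLOPS / HBM_TBPS

# A large batched linear layer (64 sequences x 2,048 tokens, 8192 -> 32768).
batched = matmul_intensity(64 * 2048, 8192, 32768)
# The same layer applied to a single token (matrix-vector product).
single = matmul_intensity(1, 8192, 32768)

print(ridge_point, attainable_tflops(batched), attainable_tflops(single))
```

Under these assumptions the batched layer sits well above the ridge point and can run at peak, while the single-token case is limited almost entirely by how fast weights arrive from HBM.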
Design Decisions: Training-Specific Choices
Decision 1: BF16 for Forward & Backward (Not FP16)
🎯 Why BF16 Instead of FP16?
Problem with FP16: With only 5 exponent bits, the largest value is about 6.5×10^4 and the smallest normal value is about 6×10^-5. Gradients during backprop can be as small as 10^-8, which underflows to zero in FP16, and training stalls.
BF16 solution: 1 sign + 8 exponent + 7 mantissa bits. It has the same exponent width as FP32, so its range extends to about ±3×10^38 and gradient underflow is rarely a problem.
Cost: Lower precision (7 vs 10 mantissa bits), which neural network training tolerates well.
Impact on LLaMA 70B training:
- FP16 mixed precision: Requires loss scaling to keep small gradients representable, ~20% overhead
- BF16 mixed precision: Clean implementation, near 100% efficiency
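The underflow behavior is easy to reproduce with the standard library. The sketch below rounds a Python float to FP16 via `struct`'s half-precision format and emulates BF16 by keeping the top 16 bits of the FP32 encoding with round-to-nearest-even. The helper names are illustrative, and the BF16 emulation ignores NaN and overflow handling.

```python
import struct


def to_fp16(x):
    """Round x to IEEE half precision and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x):
    """Emulate BF16 by rounding the FP32 encoding of x to its top 16 bits
    (round-to-nearest-even). NaN and overflow are not handled here."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # rounding bias
    bits &= 0xFFFF0000                    # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


tiny_gradient = 1e-8

print("fp16:", to_fp16(tiny_gradient))            # flushed to zero
print("bf16:", to_bf16(tiny_gradient))            # small but non-zero
print("fp16 with loss scaling:", to_fp16(tiny_gradient * 1024))
```

The last line shows why FP16 training relies on loss scaling: multiplying by a scale factor moves the gradient back into the representable range, at the cost of extra bookkeeping.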
Decision 2: All-Reduce Tree for Distributed Training
Problem: A single TPU peaks at 430 TFLOPS, but LLaMA 70B training needs 2,500+ TFLOPS sustained to finish in weeks.
Solution: 8 TPUs in a Pod, connected with an all-reduce tree.
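A tree all-reduce can be sketched in plain Python. This is a toy model, assuming a power-of-two worker count and a binary tree: gradients are summed pairwise up the tree, then the result is copied back down. It models the communication pattern only, not the actual interconnect.

```python
def tree_all_reduce(grads):
    """Sum per-worker gradient vectors with a binary-tree all-reduce.

    grads: list of equal-length lists, one per worker; the number of
    workers is assumed to be a power of two.
    Returns (buffers, steps): every worker's buffer after the all-reduce
    and the number of communication steps taken.
    """
    n = len(grads)
    bufs = [list(g) for g in grads]
    steps = 0

    # Reduce phase: at each level, the worker at i + stride sends to worker i.
    stride = 1
    while stride < n:
        for i in range(0, n, 2 * stride):
            partner = bufs[i + stride]
            bufs[i] = [a + b for a, b in zip(bufs[i], partner)]
        stride *= 2
        steps += 1

    # Broadcast phase: walk back down the tree, copying the total outward.
    while stride > 1:
        stride //= 2
        for i in range(0, n, 2 * stride):
            bufs[i + stride] = list(bufs[i])
        steps += 1

    return bufs, steps


# Eight workers, each holding a two-element gradient.
worker_grads = [[i, 2 * i] for i in range(8)]
result, n_steps = tree_all_reduce(worker_grads)
print(result[0], n_steps)
```

For 8 workers the tree needs 3 reduce steps and 3 broadcast steps, so the step count grows with log2 of the worker count rather than linearly.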
Decision 3: Unified Memory Layout (Not Separate I/D Cache)
Systolic arrays have fully predictable memory access patterns, so unlike a CPU there are no cache misses to hide. Design choice: a single unified weight SRAM, with no instruction cache (control is off-chip).
Benefit: 100% of SRAM goes to data. No space is spent on instruction buffering.
Real Training Performance: Detailed Numbers
Scenario: Training LLaMA 70B on 8 TPU v4 Pod
| Metric | Value | Breakdown |
|---|---|---|
| Model Size | 70B params | 140 GB @ BF16 |
| Global Batch | 512 | 64 per TPU × 8 |
| Sequence Length | 2,048 tokens | Standard context window |
| Layers | 80 | Attention + MLP per layer |
| FLOPs per Batch | ~4.3B | 2×N×d×L + attention |
Phase 1: Forward Pass (All 8 TPUs)
| Layer Type | FLOPs | Throughput (TFLOPS) | Time (ms) | Utilization |
|---|---|---|---|---|
| Attention Q @ K (2048²×64) | 268M | 50 | 5.4 | 12% |
| Softmax | 10M | — | 2.0 | — |
| Attention × V | 268M | 50 | 5.4 | 12% |
| Linear (8192→8192) | 1.1B | 380 | 3.0 | 88% |
| Linear (8192→32768) | 4.3B | 420 | 10.2 | 98% |
Total forward pass: ~27 ms @ ~2,300 TFLOPS (67% of the pod's 3,440 TFLOPS peak)
Phase 2: Backward Pass (All 8 TPUs)
Backprop takes roughly 2× the forward pass, because gradients are computed for both weights and activations.
Total backward pass: ~54 ms @ ~2,300 TFLOPS (67% of peak)
Phase 3: All-Reduce Synchronization
After each layer's backward pass, gradients must be synchronized across 8 TPUs.
Time: ~1-2 ms (depending on data size)
Complete Iteration: ~83 ms
| Phase | Time (ms) | Utilization |
|---|---|---|
| Forward | 27 | 67% |
| Backward | 54 | 67% |
| All-reduce | 2 | 100% (specialized) |
| Total per step | 83 | ~68% avg |
Sustained throughput: 8 TPUs × 430 TFLOPS × 0.68 ≈ 2,340 TFLOPS
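The per-step totals can be recomputed from the phase table. The sketch below is plain arithmetic over the article's figures, with the average utilization taken as a time-weighted mean of the per-phase values; the function names are illustrative.

```python
# Phase durations (ms) and utilizations from the iteration table above.
PHASES_MS = {"forward": 27.0, "backward": 54.0, "all_reduce": 2.0}
UTILIZATION = {"forward": 0.67, "backward": 0.67, "all_reduce": 1.00}

N_CHIPS = 8
PEAK_TFLOPS_PER_CHIP = 430.0


def step_time_ms(phases=PHASES_MS):
    """Total time for one training step."""
    return sum(phases.values())


def average_utilization(phases=PHASES_MS, util=UTILIZATION):
    """Time-weighted mean utilization across the phases."""
    total = sum(phases.values())
    return sum(phases[name] * util[name] for name in phases) / total


def sustained_tflops(avg_util, n_chips=N_CHIPS, peak=PEAK_TFLOPS_PER_CHIP):
    """Pod-level throughput at a given average utilization."""
    return n_chips * peak * avg_util


print(step_time_ms())
print(round(average_utilization(), 3))
print(sustained_tflops(0.68))
```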
Training speed:
Why This Design Dominates Matrix Math
🔷 The Systolic Advantage Explained
For matrix multiply (core AI operation):
- Memory bandwidth bottleneck: Computing C = A @ B for N×N matrices loads on the order of N² elements but performs N³ multiplies, so each loaded element can be reused about N times.
- Systolic solution: Stream A rightward, B downward. Each element is loaded once but used by N MACs.
- Result: The HBM bandwidth needed per MAC drops by a factor of N compared with fetching operands for every MAC (N = 256 here).
- This enables: 430 TFLOPS sustained for matrix operations despite "only" 2 TB/s HBM.
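The reuse argument above can be checked with a toy model of an output-stationary systolic array. This is a minimal sketch: it tracks which operands reach each processing element (PE) on each cycle under the usual skewed schedule and counts edge loads versus MACs. It does not model registers, clocks, or the real TPU dataflow.

```python
def systolic_matmul(A, B):
    """Compute C = A @ B for n x n matrices on a modeled n x n array.

    A streams in from the left edge, B from the top edge. With the usual
    skew, PE (i, j) sees A[i][k] and B[k][j] together on cycle t = i + j + k.
    Returns (C, macs, loads), where loads counts operands injected at the
    array edges, i.e. fetched from memory.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    macs = 0
    loads = 0
    for t in range(3 * n - 2):            # cycles until the last PE finishes
        for i in range(n):
            for j in range(n):
                k = t - i - j
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
                    macs += 1
                    if j == 0:
                        loads += 1        # A[i][k] enters at the left edge
                    if i == 0:
                        loads += 1        # B[k][j] enters at the top edge
    return C, macs, loads


n = 4
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i * j + 1 for j in range(n)] for i in range(n)]
C, macs, loads = systolic_matmul(A, B)
print(macs, loads, macs / loads)
```

The array performs n³ MACs while loading only 2n² operands, so every loaded element feeds n MACs as it passes through a row or column.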
GPUs can't do this in the same way: Tensor Cores are SIMD-like units fed through caches, so operands are still fetched more than once. That is why they need higher bandwidth (3 TB/s for H100).
Limitations & Trade-offs
1. Systolic Array Can Stall
If the data pipeline stalls (network congestion, memory stall), all 65K MACs sit idle. Unlike GPUs, there's no out-of-order execution or warp scheduling to hide latency.
2. Non-Matrix Operations Are Slow
Element-wise ops (ReLU, LayerNorm, Softmax) run on scalar control logic. While large matmuls reach 98% of peak, the attention and element-wise steps reach only about 12%, which pulls the average down to 67%.
3. Fixed to BF16
For inference, quantize to INT8. But TPU v4 doesn't have native INT8 compute (v5+ will). So inference still uses BF16 (less efficient than GPU's FP8 support).
4. All-Reduce Tree Synchronization
Distributed training requires gradients to be synchronized on every step. This sits on the critical path and limits scaling beyond ~100 TPUs, where latency dominates.
Why Companies Choose TPU vs GPU vs Custom Silicon
| Choose TPU if... | Choose GPU (H100) if... | Design Custom if... |
|---|---|---|
| Doing heavy matrix math (transformers) | Need flexibility (RL, graphs, sparse) | Billion-unit volume (phones, edge) |
| Training at scale (billions of tokens) | Using cutting-edge frameworks | Single product focus (very high ROI) |
| Can wait for Google Cloud availability | Need to own hardware (on-prem) | 10+ year product lifecycle |
| Budget optimizes for throughput | Budget optimizes for flexibility | NRE budget > $100M |
Takeaway: Why TPU v4 Is a Masterpiece
- ✅ Every design choice made for one goal: maximize transformer training efficiency
- ✅ 256×256 systolic is the Goldilocks size (not too big, not too small)
- ✅ BF16 + FP32 mixed precision is perfect for this workload
- ✅ 68% sustained utilization on production workloads (vs 34% for GPU)
- ✅ Cost per TFLOPS is lower than H100 despite headline TFLOPS disparity
- ❌ Trade-off: Less flexible than GPU (but Google didn't need flexibility)
Next (Day 23): NVIDIA H100—the opposite design philosophy (maximum flexibility vs specialization).