Day 22 Enhanced

Google TPU v4 Complete Design

How Google's TPU v4 is designed for AI training at scale: why Google made each design choice, how the systolic core works, worked training performance numbers, and the architecture used to train PaLM and Gemini. LLaMA 70B, a Meta model, serves here only as the running sizing example.

The Strategic Vision: Why Google Built TPU

Context (2017-2020): TPU v1 was an inference chip used only inside Google's data centers; v2 and v3 added training support and were also offered through Google Cloud. GPUs dominated the market. By 2022, Google faced a problem:

TPU v4's mission: Be the most efficient AI training chip ever built, optimized specifically for transformer models.

The Core Design: 256×256 Systolic Array

🔷 Why 256×256 and Not Smaller or Larger?

Analysis of array size trade-offs:

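| Array Size | Total MACs | Die Area (mm²) | Peak TFLOPS | TFLOPS/mm² | Wiring Complexity |
|---|---|---|---|---|---|
| 128×128 | 16,384 | 20 | 107 | 5.4 | Low |
| 256×256 | 65,536 | 80 | 430 | 5.4 | Medium |
| 512×512 | 262,144 | 320 | 1,720 | 5.4 | Extreme |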

Observation: TFLOPS/mm² is constant! So why pick 256×256?

  • Power density: Larger array = longer wires = more resistance = more power loss
  • Synthesis complexity: At 512×512, place-and-route becomes infeasible (routing congestion)
  • Yield: Larger die = higher chance that any given die contains a defect; yield falls roughly exponentially with die area, so cost per good die rises steeply
  • Latency: Signal propagation in a 512×512 array exceeds clock period (timing closure fails)
  • Sweet spot: 256×256 fits all constraints and provides 430 TFLOPS/chip
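The scaling argument above can be sketched in a few lines. This is an idealized model, not Google's methodology: it takes the 256×256 row of the table as the reference point and scales MACs, area, and peak throughput with n². The Poisson yield formula and the defect density `D0_PER_MM2` are assumptions added for illustration; neither appears in the original.

```python
import math

# Reference point taken from the table above: 256x256 array, 80 mm^2, 430 TFLOPS.
REF_N = 256
REF_AREA_MM2 = 80.0
REF_TFLOPS = 430.0

# Hypothetical defect density for the yield sketch (NOT from the source).
D0_PER_MM2 = 0.001


def array_stats(n: int) -> dict:
    """Scale MAC count, area, and peak throughput with n^2 (idealized)."""
    scale = (n * n) / (REF_N * REF_N)
    area = REF_AREA_MM2 * scale
    tflops = REF_TFLOPS * scale
    return {
        "macs": n * n,
        "area_mm2": area,
        "peak_tflops": tflops,
        "tflops_per_mm2": tflops / area,
    }


def poisson_yield(area_mm2: float, d0: float = D0_PER_MM2) -> float:
    """Simple Poisson yield model: fraction of dies with zero defects."""
    return math.exp(-area_mm2 * d0)


for n in (128, 256, 512):
    s = array_stats(n)
    print(
        f"{n}x{n}: {s['macs']:>7,} MACs, {s['area_mm2']:6.1f} mm^2, "
        f"{s['peak_tflops']:7.1f} TFLOPS, {s['tflops_per_mm2']:.3f} TFLOPS/mm^2, "
        f"array-only yield {poisson_yield(s['area_mm2']):.3f}"
    )
```

Compute density comes out identical for every size, which is why the choice has to be made on the second-order effects in the list (wiring, yield, timing) and not on TFLOPS/mm².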

The Memory Subsystem: Why 8 GB HBM3 + 24 MB SRAM?

🎯 Memory Capacity vs Bandwidth Optimization

Problem: Large transformer models need massive memory (LLaMA 70B = 280 GB @ FP32, 140 GB @ BF16). A single TPU with 8 GB HBM can't hold the entire model.

Solution: Stream weights from HBM into SRAM cache. Compute on streamed data.

| Layer | Size | Bandwidth | Latency | Purpose |
|---|---|---|---|---|
| Register File | ~50 KB | 10 TB/s | 1 cycle | Immediate operands |
| SRAM (weight cache) | 24 MB | 500 GB/s | 5-10 cycles | Hold current layer weights |
| HBM3 | 8 GB | 2 TB/s | 50-100 cycles | Full model storage |

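Design choice: the 24 MB SRAM holds the weight and activation tiles currently being processed. A full LLaMA 70B layer is far larger (70B params / 80 layers ≈ 0.9B params ≈ 1.75 GB in BF16), so weights stream through SRAM tile by tile: while the array computes on one tile, the next is prefetched from HBM.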
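The benefit of prefetching can be illustrated with a small timing model. This is a sketch under simplifying assumptions (one load in flight at a time, no contention), and the chunk timings in the example are made-up values, not TPU measurements.

```python
def serial_time(load_ms, compute_ms):
    """No overlap: every chunk is loaded, then computed."""
    return sum(load_ms) + sum(compute_ms)


def prefetched_time(load_ms, compute_ms):
    """Overlap: while chunk i is computed, chunk i+1 is loaded into SRAM."""
    assert len(load_ms) == len(compute_ms) and load_ms
    total = load_ms[0]  # the very first load cannot be hidden
    for i, compute in enumerate(compute_ms):
        next_load = load_ms[i + 1] if i + 1 < len(load_ms) else 0.0
        # The step ends when both the compute and the prefetch are done.
        total += max(compute, next_load)
    return total


# Hypothetical timings: 4 chunks, 1 ms to load and 2 ms to compute each.
loads = [1.0] * 4
computes = [2.0] * 4
print("serial    :", serial_time(loads, computes), "ms")
print("prefetched:", prefetched_time(loads, computes), "ms")
```

When loading a chunk takes less time than computing on it, only the first load is exposed; when loading takes longer, the pipeline becomes bandwidth-bound and the compute time is what gets hidden.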

Why 2 TB/s HBM Bandwidth Is Critical

Roofline calculation for TPU v4:

  Peak compute: 430 TFLOPS
  Memory bandwidth: 2 TB/s
  AI knee (arithmetic intensity at the ridge point): 430 / 2 = 215 FLOPs/Byte

For matrix multiply with AI > 215, the TPU is compute-bound and can reach peak. For attention/normalization with AI < 215, the TPU is memory-bound.

Benchmark: LLaMA 70B average AI ≈ 150 FLOPs/Byte (memory-bound overall). This is why Google spent $$$ on HBM3: in the memory-bound regime, each additional GB/s of bandwidth translates directly into throughput.
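The roofline numbers above can be reproduced with a short sketch. It uses the standard roofline formula, attainable = min(peak, AI × bandwidth), with the peak and bandwidth figures quoted in this section; the function name is illustrative.

```python
PEAK_TFLOPS = 430.0   # peak compute, from the text
HBM_TB_PER_S = 2.0    # memory bandwidth, from the text


def attainable_tflops(ai_flops_per_byte: float) -> float:
    """Roofline: throughput is capped by compute or by memory traffic."""
    # TB/s * FLOPs/Byte = TFLOP/s
    return min(PEAK_TFLOPS, ai_flops_per_byte * HBM_TB_PER_S)


knee = PEAK_TFLOPS / HBM_TB_PER_S
print(f"knee: {knee:.0f} FLOPs/Byte")

for ai in (10, 150, 215, 1000):
    bound = "compute-bound" if ai >= knee else "memory-bound"
    print(f"AI={ai:>5}: {attainable_tflops(ai):6.1f} TFLOPS ({bound})")
```

At the quoted average intensity of 150 FLOPs/Byte the model tops out at 300 TFLOPS, about 70% of peak, regardless of how many MACs are available.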

Design Decisions: Training-Specific Choices

Decision 1: BF16 for Forward & Backward (Not FP16)

🎯 Why BF16 Instead of FP16?

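Problem with FP16: The largest finite value is 65,504 (about 6.5 × 10^4), and the smallest positive value is about 6 × 10^-8 (subnormal; the smallest normal value is about 6.1 × 10^-5). Gradients during backprop can fall to 10^-8 or below, where FP16 underflows: gradients become zero and training stalls unless loss scaling is used.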

BF16 solution: 1 sign + 8 exponent + 7 mantissa. Shares exponent with FP32, so range is ±10^38. No underflow!

Cost: Lower precision (7 vs 10 bits in mantissa), but neural networks don't need that precision anyway.
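The range-versus-precision trade-off can be checked directly with Python's standard library. This is a minimal sketch: FP16 rounding uses the `struct` module's half-precision format, and BF16 is emulated by rounding a float32 encoding to its top 16 bits. NaN and values near the float32 maximum are not handled specially.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision (1 sign, 5 exponent, 10 mantissa bits)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the FP16 range; model that as infinity.
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Round x to bfloat16 (1 sign, 8 exponent, 7 mantissa bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits &= 0xFFFF0000                    # keep only the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (1e-8, 1.01, 70000.0):
    print(f"{value:>10g}  fp16={to_fp16(value)!r:<14} bf16={to_bf16(value)!r}")
```

A gradient of 1e-8 flushes to zero in FP16 but survives in BF16, while 1.01 shows the cost: BF16 keeps fewer significant digits than FP16.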

Estimate: LLaMA 70B training

  • FP16 mixed precision: would require loss scaling and frequent rescaling, roughly 20% overhead
  • BF16 mixed precision: clean implementation with no loss scaling, near 100% efficiency

Decision 2: All-Reduce Tree for Distributed Training

Problem: A single TPU is 430 TFLOPS. But LLaMA 70B training needs 2,500+ TFLOPS continuously to finish in weeks.

Solution: 8 TPUs in a Pod, connected with an all-reduce tree.

All-Reduce Tree (8 TPU v4 chips, synchronous gradient updates):

  Level 0 (8 chips compute in parallel):
    [Chip 0: Gradient] [Chip 1: Gradient] ... [Chip 7: Gradient]
  Level 1 (4 sums):
    [Sum01] [Sum23] [Sum45] [Sum67]
  Level 2 (2 sums):
    [Sum0123] [Sum4567]
  Level 3 (final sum):
    [GlobalGradient]
  (Broadcast back to all chips)

Latency per iteration (illustrative):
  - Compute: ~100 ms
  - All-reduce: ~5 ms
  - Total: ~105 ms per step

Training throughput (upper bound, assuming the chips run at peak during compute): 8 TPUs × 430 TFLOPS ÷ 1.05 overhead ≈ 3.3 PFLOPS
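The reduction tree can be simulated on plain Python lists. This is a sketch of the pairwise-sum-then-broadcast pattern described above, not the actual interconnect protocol; `tree_all_reduce` is an illustrative name and the chip count is assumed to be a power of two.

```python
def tree_all_reduce(grads):
    """Sum per-chip gradient vectors pairwise up a tree, then broadcast.

    grads: list of equal-length lists, one per chip (count must be a power of two).
    Returns (per-chip results, number of reduction levels).
    """
    n = len(grads)
    assert n > 0 and n & (n - 1) == 0, "chip count must be a power of two"
    level = [list(g) for g in grads]
    levels = 0
    while len(level) > 1:
        # Each pair of neighbours is summed element-wise into one partial sum.
        level = [
            [a + b for a, b in zip(level[i], level[i + 1])]
            for i in range(0, len(level), 2)
        ]
        levels += 1
    total = level[0]
    # Broadcast: every chip receives its own copy of the global gradient.
    return [list(total) for _ in range(n)], levels


# Example: chip i holds the gradient vector [i, 2*i].
chip_grads = [[i, 2 * i] for i in range(8)]
results, levels = tree_all_reduce(chip_grads)
print("levels:", levels)
print("global gradient on chip 0:", results[0])
```

The number of reduction levels grows as log2 of the chip count, which is why the all-reduce time in the diagram stays small next to the compute time at 8 chips.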

Decision 3: Unified Memory Layout (Not Separate I/D Cache)

Systolic-array memory access patterns are fixed by the compiled schedule, so they are fully predictable. Unlike on CPUs, there is no cache-miss problem to design around. Design choice: a single unified weight SRAM, with no instruction cache (control is off-chip).

Benefit: 100% of SRAM goes to data. No wasted space on instruction buffering.

Real Training Performance: Detailed Numbers

Scenario: Training LLaMA 70B on 8 TPU v4 Pod

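| Metric | Value | Breakdown |
|---|---|---|
| Model Size | 70B params | 140 GB @ BF16 |
| Global Batch | 512 | 64 per TPU × 8 |
| Sequence Length | 2,048 tokens | Standard context window |
| Layers | 80 | Attention + MLP per layer |
| FLOPs per Batch | ~4.4 × 10^17 | ≈ 6 × 70B params × (512 × 2,048) tokens for forward + backward, plus attention |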

Phase 1: Forward Pass (All 8 TPUs)

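| Layer Type | FLOPs | Throughput (TFLOPS) | Time (ms) | Utilization |
|---|---|---|---|---|
| Attention Q @ K (2048²×64) | 268M | 50 | 5.4 | 12% |
| Softmax | 10M | 2.0 | | |
| Attention × V | 268M | 50 | 5.4 | 12% |
| Linear (8192→8192) | 1.1B | 380 | 3.0 | 88% |
| Linear (8192→32768) | 4.3B | 420 | 10.2 | 98% |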

Total forward pass: ~27 ms @ ~2,200 TFLOPS (67% peak)

Phase 2: Backward Pass (All 8 TPUs)

Backprop is roughly 2× forward pass (gradient computation for weights & activations).

Total backward pass: ~54 ms @ ~2,200 TFLOPS (67% peak)

Phase 3: All-Reduce Synchronization

After each layer's backward pass, gradients must be synchronized across 8 TPUs.

Time: ~1-2 ms (depending on data size)

Complete Iteration: ~83 ms

| Phase | Time (ms) | Utilization |
|---|---|---|
| Forward | 27 | 67% |
| Backward | 54 | 67% |
| All-reduce | 2 | 100% (specialized) |
| Total per step | 83 | ~68% avg |

Sustained throughput: 8 TPUs × 430 TFLOPS × 0.68 = 2,330 TFLOPS

Training speed:

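Using the 83 ms step time above:

  Tokens processed per second: 512 batch × 2,048 seq / 0.083 s ≈ 12.6M tokens/sec
  Time to train on 300B tokens (typical): 300B / 12.6M ≈ 23,800 s ≈ 6.6 hours
  Cost (at an assumed $50/hour for the pod): 6.6 h × $50/h ≈ $330
  Comparison: 8 × H100 at an assumed $20/hour would cost $132 for the same 6.6 hours, but with lower utilization the run takes ~12 hours, about $240.

On these figures the TPU pod is faster but not cheaper.

Scale check: the usual estimate of about 6 FLOPs per parameter per token gives 6 × 70B × 300B ≈ 1.26 × 10^23 FLOPs for the full run. At 2,330 TFLOPS sustained that is ≈ 5.4 × 10^7 s, roughly 1.7 years on 8 chips. The 83 ms step is therefore consistent only with a much smaller per-step workload than a full 70B forward and backward pass over ~1M tokens, so the hours and dollar figures are useful as a relative comparison, not as absolute training times.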
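The step-time arithmetic can be reproduced as follows. The phase times and utilizations are the figures from the tables above. The last calculation is a cross-check based on the common approximation of about 6 FLOPs per parameter per token, which is an assumption not taken from the original.

```python
CHIPS = 8
PEAK_TFLOPS_PER_CHIP = 430.0

# (time in ms, utilization) per phase, as listed in the iteration table.
PHASES = {
    "forward": (27.0, 0.67),
    "backward": (54.0, 0.67),
    "all_reduce": (2.0, 1.00),
}
BATCH, SEQ_LEN = 512, 2048
TRAIN_TOKENS = 300e9

step_ms = sum(ms for ms, _ in PHASES.values())
# Time-weighted average utilization across the three phases.
avg_util = sum(ms * util for ms, util in PHASES.values()) / step_ms
sustained_tflops = CHIPS * PEAK_TFLOPS_PER_CHIP * avg_util

tokens_per_sec = BATCH * SEQ_LEN / (step_ms / 1000.0)
train_hours = TRAIN_TOKENS / tokens_per_sec / 3600.0

print(f"step time        : {step_ms:.0f} ms")
print(f"avg utilization  : {avg_util:.1%}")
print(f"sustained        : {sustained_tflops:,.0f} TFLOPS")
print(f"tokens per second: {tokens_per_sec:,.0f}")
print(f"time for 300B    : {train_hours:.1f} hours")

# Cross-check (assumption): ~6 FLOPs per parameter per token for training.
PARAMS = 70e9
implied_pflops = tokens_per_sec * 6 * PARAMS / 1e15
print(f"implied compute  : {implied_pflops:,.0f} PFLOPS "
      f"(pod peak: {CHIPS * PEAK_TFLOPS_PER_CHIP / 1000:.2f} PFLOPS)")
```

The implied compute rate is far above the pod's peak, which suggests the 83 ms step corresponds to a much smaller workload than a full 70B training step. The relative comparison between phases still holds.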

Why This Design Dominates Matrix Math

🔷 The Systolic Advantage Explained

For matrix multiply (core AI operation):

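  • Memory bandwidth bottleneck: Computing C = A @ B for N×N matrices loads on the order of N² elements but performs N³ multiply-accumulates, so each loaded element can be reused about N times.
  • Systolic solution: Stream A rightward, B downward. Each element is loaded once and passed from MAC to MAC, so it is used by N MACs.
  • Result: The HBM bandwidth needed to keep the array busy drops by roughly a factor of N (N = 256 here) compared with fetching operands separately for every MAC.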
  • This enables: 430 TFLOPS sustained for matrix operations despite "only" 2 TB/s HBM.
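The streaming pattern in the list above can be demonstrated with a small cycle-level simulation. This is a sketch of a generic output-stationary systolic array (each cell accumulates one element of C), chosen for clarity; it is not a description of the real TPU datapath, and all names are illustrative.

```python
def systolic_matmul(A, B):
    """Simulate C = A @ B on an output-stationary systolic array.

    Rows of A enter from the left edge and move right; columns of B enter
    from the top edge and move down. Inputs are skewed by one cycle per
    row/column so matching operands meet in the right cell.
    Returns (C, number of operand loads from memory).
    """
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]     # one accumulator per cell
    a_reg = [[0] * m for _ in range(n)]   # A operand held in each cell
    b_reg = [[0] * m for _ in range(n)]   # B operand held in each cell
    loads = 0

    for t in range(n + m + k - 2):
        # Shift A operands one cell right and B operands one cell down.
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Inject new operands at the edges (row i / column j delayed by i / j).
        for i in range(n):
            kk = t - i
            a_reg[i][0] = A[i][kk] if 0 <= kk < k else 0
            loads += 1 if 0 <= kk < k else 0
        for j in range(m):
            kk = t - j
            b_reg[0][j] = B[kk][j] if 0 <= kk < k else 0
            loads += 1 if 0 <= kk < k else 0
        # Every cell performs one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, loads


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C, loads = systolic_matmul(A, B)
print("C =", C)
print("operand loads:", loads, "for", len(A) * len(B) * len(B[0]), "MACs")
```

Each input element is fetched from memory exactly once (n·k + k·m loads) while n·m·k multiply-accumulates are performed, so the reuse per load grows with the array dimension.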

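GPUs get reuse differently: Tensor Cores operate on small tiles staged through registers, shared memory, and caches, so each fetched operand is reused across fewer MACs and more traffic passes through the memory hierarchy. This is one reason GPUs are paired with higher memory bandwidth (about 3 TB/s for H100).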

Limitations & Trade-offs

1. Systolic Array Can Stall

If the data pipeline breaks (network congestion, memory stall), all 65K MACs sit idle. Unlike on GPUs, there is no warp scheduling that can hide the latency by switching to other ready work.

2. Non-Matrix Operations Are Slow

Element-wise ops (ReLU, LayerNorm, Softmax) cannot use the systolic array and run on much narrower vector and scalar logic. While the large matmuls reach 98% of peak, the attention matmuls in the table above reach only 12%. Average: 67%.

3. Fixed to BF16

For inference, models are commonly quantized to INT8, but TPU v4 as described here has no native INT8 compute path (later generations add one). Inference therefore still runs in BF16, which is less efficient than a GPU with FP8 support.

4. All-Reduce Tree Synchronization

Distributed training requires gradients to be synchronized at the end of each batch. This sits on the critical path, and with the simple tree described here it limits scaling beyond ~100 TPUs, where synchronization latency starts to dominate.

Why Companies Choose TPU vs GPU vs Custom Silicon

| Choose TPU if... | Choose GPU (H100) if... | Design Custom if... |
|---|---|---|
| Doing heavy matrix math (transformers) | Need flexibility (RL, graphs, sparse) | Billion-unit volume (phones, edge) |
| Training at scale (billions of tokens) | Using cutting-edge frameworks | Single product focus (very high ROI) |
| Can wait for Google Cloud availability | Need to own hardware (on-prem) | 10+ year product lifecycle |
| Budget optimizes for throughput | Budget optimizes for flexibility | NRE budget > $100M |

Takeaway: Why TPU v4 Is a Masterpiece

  • ✅ Every design choice made for one goal: maximize transformer training efficiency
  • ✅ 256×256 systolic is the Goldilocks size (not too big, not too small)
  • ✅ BF16 + FP32 mixed precision is perfect for this workload
  • ✅ 68% sustained utilization in the worked scenario above (vs an estimated 34% for GPU)
  • ✅ Cost per TFLOPS is lower than H100 despite headline TFLOPS disparity
  • ❌ Trade-off: Less flexible than GPU (but Google didn't need flexibility)

Next (Day 23): NVIDIA H100—the opposite design philosophy (maximum flexibility vs specialization).