Day 22

Google TPU v4

Datacenter systolic at scale. 256×256 array, HBM3, 430 TFLOPS. Training & inference for trillion-parameter models.

Why 256×256?

Systolic size tradeoff:
- 64×64: 4K MACs, easy routing (but slow for large models)
- 128×128: 16K MACs, moderate latency
- 256×256: 65K MACs, good balance ← TPU v4 choice
- 512×512: 262K MACs, extreme latency (wires too long)

Wire delay grows with the array's edge length, i.e. with the square root of the MAC count.
256×256: acceptable latency of ~5 ns (10 cycles @ 2 GHz)
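The MAC counts and the latency-to-cycle conversion above are simple arithmetic, and a short sketch makes them easy to check. The helper names `mac_count` and `ns_to_cycles` are illustrative only; the array sizes, the ~5 ns figure, and the 2 GHz clock are taken from the text.

```python
def mac_count(n: int) -> int:
    """Number of multiply-accumulate units in an n x n systolic array."""
    return n * n


def ns_to_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Convert nanoseconds to clock cycles (a 1 GHz clock gives 1 cycle per ns)."""
    return latency_ns * clock_ghz


# Array sizes compared in the tradeoff list
for n in (64, 128, 256, 512):
    print(f"{n}x{n}: {mac_count(n):,} MACs")

# Latency figure quoted for the 256x256 array
print(ns_to_cycles(5.0, 2.0), "cycles for 5 ns at 2 GHz")
```

Quadrupling the MAC count only doubles the edge length, which is why the array can grow quite far before wire delay dominates.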

TPU v4 Specs

Why BF16 for Training?

BF16 = 1 sign + 8 exponent + 7 mantissa bits

Training is gradient-heavy:
- Activations: large (up to ~10^6)
- Gradients: tiny (10^-8 to 10^-6)
- FP16 underflows: range too small
- BF16 exponent matches FP32: no underflow

Systolic efficiency:
- BF16 = 16 bits (matrix ops are native)
- FP32 accumulation (no loss scaling required, unlike FP16)
- Mixed precision: best of both worlds
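The underflow argument can be illustrated with the Python standard library alone. The sketch below rounds a gradient-sized value to FP16 with the `struct` module's half-precision format, and approximates BF16 by keeping only the top 16 bits of the FP32 encoding. Truncation is a simplification chosen for brevity (hardware typically rounds to nearest), and the function names are hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision (5 exponent bits) and convert back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16_truncated(x: float) -> float:
    """Approximate BF16 by zeroing the low 16 bits of the FP32 encoding.

    This keeps the sign, all 8 exponent bits, and the top 7 mantissa bits.
    Truncation is an assumption of this sketch, not how a TPU rounds.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


grad = 1e-8  # lower end of the gradient range quoted above
print("FP16:", to_fp16(grad))            # below FP16's smallest subnormal
print("BF16:", to_bf16_truncated(grad))  # survives, with reduced precision
```

The FP16 result is zero because 1e-8 is smaller than the smallest FP16 subnormal (about 6e-8), while the BF16 value keeps FP32's exponent range and loses only mantissa precision.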

Real Workloads

Transformer Training (LLaMA 70B)

Metric                 Value
Model size             70 billion params
Batch size             512
Sequence length        2,048 tokens
Effective throughput   300 TFLOPS (70% of peak)
Time per epoch         ~3 hours (8 TPU v4 chips)
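The figures in this table can be related to each other with a few lines of arithmetic. The sketch below treats the 300 TFLOPS figure as a per-chip rate and uses the common rule of thumb of roughly 6 FLOPs per parameter per token for a forward and backward pass. Both are assumptions of this example rather than statements from the table, and the step time is a rough estimate only.

```python
PEAK_TFLOPS = 430.0      # per-chip peak quoted in this article
ACHIEVED_TFLOPS = 300.0  # effective rate from the table (assumed per chip)
PARAMS = 70e9            # model size
BATCH_SIZE = 512
SEQ_LEN = 2048
NUM_CHIPS = 8

# Fraction of peak actually sustained
utilization = ACHIEVED_TFLOPS / PEAK_TFLOPS

# Tokens processed per optimizer step
tokens_per_batch = BATCH_SIZE * SEQ_LEN

# Assumption: ~6 FLOPs per parameter per token (forward + backward)
flops_per_step = 6 * PARAMS * tokens_per_batch
step_seconds = flops_per_step / (NUM_CHIPS * ACHIEVED_TFLOPS * 1e12)

print(f"utilization: {utilization:.1%}")
print(f"tokens per batch: {tokens_per_batch:,}")
print(f"estimated step time: {step_seconds:.1f} s")
```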

Inference: BERT Large

  • Model: 340M params
  • Batch size: 128
  • Latency: <1 ms per batch (430 TFLOPS achieved)
  • Throughput: 128K queries/sec per TPU
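The throughput figure follows directly from batch size and latency. A minimal sketch, with a hypothetical helper name, assuming batches are processed back to back with no idle time between them:

```python
def queries_per_second(batch_size: int, latency_ms: float) -> float:
    """Throughput when one batch completes every latency_ms milliseconds."""
    return batch_size * 1000.0 / latency_ms


# Batch of 128 at the 1 ms latency bound quoted above
print(queries_per_second(128, 1.0))
```

Because the quoted latency is an upper bound (<1 ms), 128K queries/sec is a lower bound on throughput under this assumption.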

TPU Pod: Multi-Chip Scaling

8 TPU v4 chips (Pod):
- 8 × 65K MACs = 524K MACs total
- All-reduce tree for distributed training
- Mesh topology (synchronization latency ~1 ms)

512 TPU v4 chips (Super Pod):
- 64 Pods networked (higher latency)
- All-reduce: ~10 ms (acceptable for large-batch training)

Production: Google trains GPT-3-scale models on 512+ TPUs
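To make the all-reduce tree concrete, here is a minimal pure-Python sketch of a binary-tree all-reduce over one value per chip. It is an illustration under simplifying assumptions (one scalar per chip, every step costs the same, no real network), not a description of Google's implementation, and `tree_all_reduce` is a hypothetical name.

```python
def tree_all_reduce(values):
    """Sum one value per chip with a binary reduction tree, then broadcast.

    Returns (per_chip_results, communication_steps).
    """
    n = len(values)
    partial = list(values)
    reduce_steps = 0
    stride = 1
    # Reduce phase: at each level, chip i absorbs the partial sum of chip i + stride
    while stride < n:
        for i in range(0, n, 2 * stride):
            j = i + stride
            if j < n:
                partial[i] += partial[j]
        stride *= 2
        reduce_steps += 1
    total = partial[0]
    # Broadcast phase mirrors the reduce phase, so it is counted rather than simulated
    return [total] * n, 2 * reduce_steps


# One gradient value per chip in an 8-chip pod
results, steps = tree_all_reduce([1, 2, 3, 4, 5, 6, 7, 8])
print(results, steps)

# The step count grows logarithmically with the number of chips
print(tree_all_reduce([1] * 512)[1])

# Total MACs in an 8-chip pod of 256x256 arrays
print(8 * 256 * 256)
```

Going from 8 to 512 chips triples the number of tree levels (3 to 9), which is one reason the larger configuration sees higher all-reduce latency.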

Day 23: NVIDIA H100 — GPU approach to AI acceleration.