Why 256×256?
Systolic size tradeoff:
- 64×64: 4K MACs, easy routing (but slow for large models)
- 128×128: 16K MACs, moderate latency
- 256×256: 65K MACs, good balance ← TPU v4 choice
- 512×512: 262K MACs, extreme latency (wires too long)
Wire length across an N×N array grows with N, i.e. with the square root of the MAC count, and wire delay grows with it.
256×256: acceptable latency ~5 ns (10 cycles @ 2 GHz)
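The size tradeoff above can be tabulated with a few lines of Python. This is a minimal sketch, assuming equally sized processing elements so that one edge of the array spans N of them; the 64×64 baseline and the function names are illustrative choices, not part of any TPU tooling.

```python
# Sketch: MAC count and relative edge length for an N x N systolic array.
# Assumption: all processing elements have the same footprint, so the edge
# length (and the longest straight wire) is proportional to N.

SIZES = [64, 128, 256, 512]
BASELINE = 64  # hypothetical reference size for the relative-length column


def mac_count(n: int) -> int:
    """Number of multiply-accumulate units in an n x n array."""
    return n * n


def relative_edge_length(n: int, baseline: int = BASELINE) -> float:
    """Edge length of an n x n array relative to the baseline array."""
    return n / baseline


for n in SIZES:
    print(
        f"{n}x{n}: {mac_count(n):>7,} MACs, "
        f"edge {relative_edge_length(n):.0f}x the {BASELINE}x{BASELINE} array"
    )
```

Each doubling of N quadruples the MAC count but only doubles the edge length, which is why the larger arrays buy compute faster than they cost wire length, up to the point where latency dominates.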
TPU v4 Specs
- Compute: 256×256 systolic array, 65,536 MACs
- Precision: BF16 (training), INT8 (inference), FP32 (accumulation)
- Peak: 430 TFLOPS (BF16), 860 TFLOPS (INT8 dense)
- Memory: 8 GB HBM3, 2 TB/s bandwidth
- SRAM: 24 MB weights + 8 MB FIFO buffers
- Power: 200W sustained (v4 Pod has 8 chips = 1.6 kW)
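One way to read the peak and bandwidth figures together is a roofline estimate: a kernel is compute-bound only if it performs more operations per byte of HBM traffic than peak divided by bandwidth. The sketch below uses only the numbers from the list above; the roofline model itself and the function names are assumptions added for illustration.

```python
# Sketch: roofline ridge point and pod power from the spec list above.
# Assumption: the simple roofline model, attainable = min(peak, bandwidth * intensity).

PEAK_BF16_FLOPS = 430e12   # FLOP/s, from the spec list
PEAK_INT8_OPS = 860e12     # op/s, from the spec list
HBM_BANDWIDTH = 2e12       # bytes/s, from the spec list
CHIP_POWER_W = 200
CHIPS_PER_POD = 8


def ridge_point(peak_ops: float, bandwidth: float) -> float:
    """Arithmetic intensity (ops per byte) at which a kernel becomes compute-bound."""
    return peak_ops / bandwidth


def attainable(peak_ops: float, bandwidth: float, intensity: float) -> float:
    """Roofline estimate of achievable throughput at a given arithmetic intensity."""
    return min(peak_ops, bandwidth * intensity)


print("BF16 ridge:", ridge_point(PEAK_BF16_FLOPS, HBM_BANDWIDTH), "FLOPs/byte")
print("INT8 ridge:", ridge_point(PEAK_INT8_OPS, HBM_BANDWIDTH), "ops/byte")
print("Pod power:", CHIP_POWER_W * CHIPS_PER_POD, "W")
```

With these figures a BF16 kernel needs about 215 FLOPs per byte fetched from HBM to reach peak, which is why the on-chip SRAM for weights matters.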
Why BF16 for Training?
BF16 = 1 sign + 8 exponent + 7 mantissa
Training is gradient-heavy:
- Activations: large (10^6 range)
- Gradients: tiny (10^-8 to 10^-6)
- FP16 underflows: 5 exponent bits, smallest subnormal ~6×10^-8
- BF16 exponent matches FP32 (8 bits): same dynamic range, so these gradients do not underflow
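The range argument can be checked with the standard library alone. This is a minimal sketch: FP16 uses the `struct` module's half-precision format, and BF16 is emulated by keeping the top 16 bits of the FP32 encoding. The truncation is a simplification (real hardware typically rounds), and the helper names are hypothetical.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE 754 half precision (FP16) and back to a Python float."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the FP16 maximum (65504); treat as infinity.
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Emulate BF16 by truncating the FP32 encoding to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


# Magnitudes taken from the list above: activations ~1e6, gradients 1e-8 to 1e-6.
for value in (1e6, 1e-6, 1e-8):
    print(f"{value:g}: fp16={to_fp16(value):g}  bf16={to_bf16(value):g}")
```

A gradient of 1e-8 becomes 0.0 in FP16 but survives in BF16 with under 1% error, and an activation of 1e6 overflows FP16 while BF16 still represents it.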
Systolic efficiency:
- BF16 = 16 bits (matrix ops are native)
- FP32 accumulation of partial sums
- Mixed precision: best of both worlds
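Why accumulate in FP32 when the inputs are BF16? The sketch below sums 1,000 identical BF16 products once with a BF16 accumulator and once with an FP32 accumulator. It is an emulation under stated assumptions: BF16 is modeled by truncating FP32 to 16 bits, and the value 0.001 and the helper names are illustrative only.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float to FP32 precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Emulate BF16 by truncating the FP32 encoding to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def accumulate(values, round_fn) -> float:
    """Sum values, rounding the running total to the accumulator format each step."""
    acc = 0.0
    for v in values:
        acc = round_fn(acc + v)
    return acc


# Stand-ins for BF16 multiplier outputs; the exact sum would be close to 1.0.
products = [to_bf16(0.001)] * 1000

bf16_sum = accumulate(products, to_bf16)
fp32_sum = accumulate(products, to_fp32)
print(f"BF16 accumulator: {bf16_sum:g}")
print(f"FP32 accumulator: {fp32_sum:g}")
```

The BF16 accumulator stalls well below 1.0 once each addend is smaller than the accumulator's spacing, while the FP32 accumulator stays close to the exact sum. That is the case for keeping 16-bit multipliers but a wider accumulator.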
Real Workloads
Transformer Training (LLaMA 70B)
| Metric | Value |
|---|---|
| Model size | 70 billion params |
| Batch size | 512 |
| Sequence length | 2,048 tokens |
| Effective TFLOPS | 300 TFLOPS (70% of peak) |
| Time per epoch | ~3 hours (8 TPU v4 chips) |
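The table's throughput figures can be turned into a rough per-step time. This is a back-of-the-envelope sketch, assuming the common rule of thumb of 6 FLOPs per parameter per token for a forward plus backward pass, and assuming the 300 TFLOPS figure is per chip; neither assumption comes from the table itself.

```python
# Sketch: utilization and time per optimizer step from the table above.
# Assumption: training cost ~ 6 FLOPs per parameter per token (rule of thumb).

PARAMS = 70e9
BATCH_SIZE = 512
SEQ_LEN = 2048
EFFECTIVE_FLOPS_PER_CHIP = 300e12  # assumed to be a per-chip figure
PEAK_FLOPS_PER_CHIP = 430e12
CHIPS = 8
FLOPS_PER_PARAM_PER_TOKEN = 6      # assumption, not from the table


def tokens_per_step(batch_size: int, seq_len: int) -> int:
    """Tokens processed in one optimizer step."""
    return batch_size * seq_len


def step_time_seconds(params: float, tokens: int, flops_per_chip: float, chips: int) -> float:
    """Estimated wall-clock time for one step at the given sustained throughput."""
    total_flops = FLOPS_PER_PARAM_PER_TOKEN * params * tokens
    return total_flops / (flops_per_chip * chips)


tokens = tokens_per_step(BATCH_SIZE, SEQ_LEN)
utilization = EFFECTIVE_FLOPS_PER_CHIP / PEAK_FLOPS_PER_CHIP
seconds = step_time_seconds(PARAMS, tokens, EFFECTIVE_FLOPS_PER_CHIP, CHIPS)
print(f"tokens/step: {tokens:,}")
print(f"utilization: {utilization:.1%}")
print(f"time/step:   {seconds:.1f} s")
```

Under these assumptions one step takes roughly 183 s. The estimate ignores attention FLOPs, recomputation, and communication, so treat it as an order-of-magnitude check only.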
Inference: BERT Large
- Model: 340M params
- Batch size: 128
- Latency: <1 ms per batch (430 TFLOPS achieved)
- Throughput: 128K queries/sec per TPU (128 queries per 1 ms batch)
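The throughput figure follows directly from batch size and latency. A minimal sketch, assuming batches run back to back with no gaps and taking the latency bound of 1 ms as the per-batch time:

```python
# Sketch: queries per second from batch size and per-batch latency.
# Assumption: batches are issued back to back, with no idle time in between.

def queries_per_second(batch_size: int, latency_ms: float) -> float:
    """Sustained throughput when each batch takes latency_ms milliseconds."""
    return batch_size * 1000 / latency_ms


print(queries_per_second(128, 1.0))
```

Since the quoted latency is an upper bound, the result is a lower bound on throughput.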
TPU Pod: Multi-Chip Scaling
8 TPU v4 chips (Pod):
- 8 × 65K MACs = 524K MACs total
- All-reduce tree for distributed training
- Mesh topology (synchronization latency ~1 ms)
512 TPU v4 chips (Super Pod):
- 64 Pods networked (higher latency)
- All-reduce: ~10 ms (acceptable for large batch training)
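To see why all-reduce cost grows slowly with chip count, here is a small simulation of a binary-tree all-reduce: partial sums flow up the tree, then the total flows back down. It is a sketch only. It assumes a power-of-two chip count, counts communication rounds rather than time, and does not model the mesh topology or the latencies quoted above.

```python
def tree_allreduce(values):
    """Simulate a binary-tree all-reduce.

    Returns (per-chip results, number of communication rounds).
    Assumption: the chip count is a power of two.
    """
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two chip count"
    data = list(values)
    rounds = 0

    # Reduce phase: in each round, chip i adds the partial sum from chip i + stride.
    stride = 1
    while stride < n:
        for i in range(0, n, 2 * stride):
            data[i] += data[i + stride]
        stride *= 2
        rounds += 1

    # Broadcast phase: chip 0 holds the total; copy it back down the same tree.
    stride = n // 2
    while stride >= 1:
        for i in range(0, n, 2 * stride):
            data[i + stride] = data[i]
        stride //= 2
        rounds += 1

    return data, rounds


result, rounds = tree_allreduce([1, 2, 3, 4, 5, 6, 7, 8])
print(result, rounds)
```

For 8 chips this takes 6 rounds (3 up, 3 down); for 512 chips it takes 18. The round count grows with the logarithm of the chip count, though real latency also depends on link speed and message size.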
Production: Google trains GPT-3-scale models on 512+ TPUs
Day 23: NVIDIA H100 — GPU approach to AI acceleration.