AI Chip Design Day 22

Why 256×256?

Systolic size tradeoff:
- 64×64: 4K MACs, easy routing (but slow for large models)
- 128×128: 16K MACs, moderate latency
- 256×256: 65K MACs, good balance ← TPU v4 choice
- 512×512: 262K MACs, extreme latency (wires too long)

Wire delay grows with sqrt(N) for an N×N array. At 256×256 the latency is an acceptable ~5 ns (10 cycles @ 2 GHz).
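The tradeoff above can be tabulated with a few lines of Python. This is only a sketch: it takes the sqrt(N) wire-delay model and the ~5 ns figure for 256×256 from these notes at face value, and the function names are made up for illustration.

```python
import math

REF_N = 256          # reference array edge taken from the notes
REF_DELAY_NS = 5.0   # ~5 ns quoted for the 256x256 array

def mac_count(n: int) -> int:
    """Number of MAC units in an n x n systolic array."""
    return n * n

def wire_delay_ns(n: int) -> float:
    """Scale the 256x256 delay by sqrt(n / 256), following the sqrt(N) model."""
    return REF_DELAY_NS * math.sqrt(n / REF_N)

for n in (64, 128, 256, 512):
    print(f"{n}x{n}: {mac_count(n):>7,} MACs, ~{wire_delay_ns(n):.1f} ns")
```

Under this model each step up in array edge quadruples the MAC count, while the delay grows by a factor of about 1.4.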

TPU v4 Specs

Compute: 256×256 systolic array, 65,536 MACs
Precision: BF16 (training), INT8 (inference), FP32 (loss scaling)
Peak: 430 TFLOPS (BF16), 860 TFLOPS (INT8 dense)
Memory: 8 GB HBM3, 2 TB/s bandwidth
SRAM: 24 MB weights + 8 MB FIFO buffers
Power: 200 W sustained (a v4 Pod of 8 chips draws 1.6 kW)
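A quick way to sanity-check spec numbers like these is to relate peak throughput to MAC count and clock (one MAC counts as 2 FLOPs), and to relate peak compute to memory bandwidth. The sketch below assumes a single array and borrows the 2 GHz clock from the latency discussion. Both are assumptions, not confirmed chip parameters.

```python
MACS = 256 * 256           # 65,536 MAC units, from the spec list
PEAK_BF16_FLOPS = 430e12   # quoted BF16 peak
HBM_BYTES_PER_S = 2e12     # quoted 2 TB/s bandwidth

def peak_flops(macs: int, clock_hz: float) -> float:
    """Peak FLOP/s for a MAC array: each MAC is one multiply plus one add."""
    return 2 * macs * clock_hz

def implied_clock_hz(peak: float, macs: int) -> float:
    """Clock a single array would need in order to reach a given peak."""
    return peak / (2 * macs)

def ridge_point(peak: float, bytes_per_s: float) -> float:
    """FLOPs a kernel must perform per byte of HBM traffic to be compute-bound."""
    return peak / bytes_per_s

print(f"Peak at 2 GHz: {peak_flops(MACS, 2e9) / 1e12:.0f} TFLOPS")
print(f"Clock implied by 430 TFLOPS: {implied_clock_hz(PEAK_BF16_FLOPS, MACS) / 1e9:.2f} GHz")
print(f"Ridge point: {ridge_point(PEAK_BF16_FLOPS, HBM_BYTES_PER_S):.0f} FLOPs/byte")
```

With one array at 2 GHz the formula gives about 262 TFLOPS, so the quoted 430 TFLOPS would correspond to a faster clock or more than one array per chip. The ridge point of 215 FLOPs/byte shows how much data reuse a kernel needs before the MACs, rather than HBM, become the bottleneck.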

Why BF16 for Training?

BF16 = 1 sign + 8 exponent + 7 mantissa bits

Training is gradient-heavy:
- Activations: large (10^6 range)
- Gradients: tiny (10^-8 to 10^-6)
- FP16 underflows: its 5-bit exponent cannot represent values below ~6×10^-8
- BF16 exponent matches FP32 (8 bits): same dynamic range, so these gradients do not underflow

Systolic efficiency:
- BF16 = 16 bits (matrix ops are native)
- FP32 accumulation (loss scaling)
- Mixed precision: best of both worlds
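The range argument is easy to check with the standard library alone. In this sketch FP16 is produced by `struct`'s half-precision format, and BF16 is emulated by keeping the top 16 bits of the FP32 encoding. That is truncation rather than the round-to-nearest used by real hardware, a simplification made here for brevity. The helper names are illustrative.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to IEEE half precision; values beyond the FP16 range become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")

def to_bf16(x: float) -> float:
    """Emulate BF16 by truncating the FP32 encoding to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# One large activation and two small gradients from the ranges in the notes.
for value in (1e6, 1e-6, 1e-8):
    print(f"{value:g}: fp16={to_fp16(value)!r}, bf16={to_bf16(value)!r}")
```

A gradient of 1e-8 flushes to zero in FP16 and an activation of 1e6 overflows. BF16 keeps both, at the cost of only two to three significant decimal digits.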

Real Workloads

Transformer Training (LLaMA 70B)

Metric	Value
Model size	70 billion params
Batch size	512
Sequence length	2,048 tokens
Effective TFLOPS	300 TFLOPS (70% of peak)
Time per epoch	~3 hours (8 TPU v4 chips)
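The numbers in the table can be combined into a rough per-step estimate. The sketch below uses the common rule of thumb of about 6 FLOPs per parameter per token for a forward plus backward pass. That rule is an assumption brought in for illustration and is not stated in these notes.

```python
PARAMS = 70e9              # model size from the table
BATCH = 512                # sequences per step
SEQ_LEN = 2048             # tokens per sequence
CHIPS = 8
EFFECTIVE_FLOPS = 300e12   # per chip, from the table
PEAK_FLOPS = 430e12        # per chip, quoted BF16 peak

def utilization(effective: float, peak: float) -> float:
    """Fraction of peak throughput actually sustained."""
    return effective / peak

def tokens_per_step(batch: int, seq_len: int) -> int:
    """Tokens processed in one optimizer step."""
    return batch * seq_len

def step_time_s(params: float, tokens: int, chips: int, flops_per_chip: float) -> float:
    """Estimated seconds per step, assuming ~6 FLOPs per parameter per token."""
    return 6 * params * tokens / (chips * flops_per_chip)

tokens = tokens_per_step(BATCH, SEQ_LEN)
print(f"Utilization: {utilization(EFFECTIVE_FLOPS, PEAK_FLOPS):.1%}")
print(f"Tokens per step: {tokens:,}")
print(f"Step time: {step_time_s(PARAMS, tokens, CHIPS, EFFECTIVE_FLOPS):.0f} s")
```

Dividing a dataset's token count by tokens per step and multiplying by the step time gives an epoch estimate under the same assumptions.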

Inference: BERT Large

Model: 340M params
Batch size: 128
Latency: <1 ms per batch (430 TFLOPS achieved)
Throughput: 128K queries/sec per TPU
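The throughput figure follows directly from batch size and latency. A minimal sketch, with an illustrative helper name:

```python
def queries_per_second(batch_size: int, latency_ms: float) -> float:
    """Queries served per second when one batch completes every latency_ms."""
    return batch_size * 1000 / latency_ms

# Batch of 128 at the 1 ms bound quoted above.
print(queries_per_second(128, 1.0))
```

Because the latency is quoted as an upper bound (<1 ms), 128K queries/sec is a lower bound on throughput under these numbers.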

TPU Pod: Multi-Chip Scaling

8 TPU v4 chips (Pod):
- 8 × 65K MACs = 524K MACs total
- All-reduce tree for distributed training
- Mesh topology (synchronization latency ~1 ms)

512 TPU v4 chips (Super Pod):
- 64 Pods networked (higher latency)
- All-reduce: ~10 ms (acceptable for large-batch training)

Production: Google trains GPT-3 scale models on 512+ TPUs
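To make the all-reduce tree concrete, here is a toy simulation in plain Python. It is a sketch of a generic binary-tree all-reduce over a power-of-two number of workers, not the actual TPU interconnect protocol. Each "chip" holds a single number standing in for its gradient.

```python
def tree_all_reduce(values):
    """Sum values across workers with a binary tree, then broadcast the total.

    Returns (per-worker results, number of communication steps).
    """
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two worker count"
    partial = list(values)
    reduce_steps = 0
    stride = 1
    # Reduce phase: at each level, worker i absorbs the sum held by worker i + stride.
    while stride < n:
        for i in range(0, n, 2 * stride):
            partial[i] += partial[i + stride]
        stride *= 2
        reduce_steps += 1
    # Broadcast phase: the root's total travels back down the same tree,
    # one level per step, so it costs as many steps as the reduce phase.
    total = partial[0]
    return [total] * n, 2 * reduce_steps

# Eight chips, each holding one local gradient value.
result, steps = tree_all_reduce([1, 2, 3, 4, 5, 6, 7, 8])
print(result, steps)
```

For 8 workers the tree needs 3 reduce steps and 3 broadcast steps. The step count grows with log2 of the worker count, which is one reason synchronization latency rises only moderately when going from 8 to 512 chips.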

Day 23: NVIDIA H100 — GPU approach to AI acceleration.