AI Chip Design Day 16

The Problem: Is Memory the Bottleneck?

For matrix multiplication C = A @ B:

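Shapes: A[N×K], B[K×N], C[N×N]
FLOPs: 2·N·K·N = 2·N²·K (one multiply and one add per inner-product term)
Memory traffic (conservative estimate, in elements): 2·N² + 2·N·K
Arithmetic intensity (AI) = FLOPs / Bytes = 2·N²·K / (2·N² + 2·N·K)
For square matrices (K = N) at one byte per element: AI = 2·N³ / (4·N²) = N / 2
For N=256: AI ≈ 128 FLOPs/byte
For N=1024: AI ≈ 512 FLOPs/byte
Note: With perfect on-chip reuse, A and B are each read once and C is written once, so traffic drops to 2·N·K + N² elements and AI rises to 2·N / 3 for K = N. Wider elements (for example 2-byte bf16) lower the FLOPs/byte figure proportionally.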
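The estimate above can be checked numerically. The following is a minimal sketch, assuming K = N and one byte per element (neither is stated explicitly in the notes); `matmul_ai` is a hypothetical helper name. Alongside the traffic estimate used in the text, it computes an ideal-reuse figure in which A and B are each read once and C is written once.

```python
from __future__ import annotations


def matmul_ai(n: int, k: int, bytes_per_element: int = 1) -> tuple[float, float]:
    """Return (simplified_ai, ideal_ai) in FLOPs/byte for C[n x n] = A[n x k] @ B[k x n]."""
    flops = 2 * n * n * k  # one multiply + one add per inner-product term
    # Traffic estimate used in the text: 2*N^2 + 2*N*K element accesses.
    simplified_bytes = (2 * n * n + 2 * n * k) * bytes_per_element
    # Ideal reuse (assumption): A and B each read once, C written once.
    ideal_bytes = (2 * n * k + n * n) * bytes_per_element
    return flops / simplified_bytes, flops / ideal_bytes


if __name__ == "__main__":
    for n in (256, 1024):
        simplified, ideal = matmul_ai(n, n)
        print(f"N={n}: simplified AI = {simplified:.0f}, "
              f"ideal-reuse AI = {ideal:.0f} FLOPs/byte")
```

Passing `bytes_per_element=2` shows how a 16-bit data type halves both figures.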

Roofline Formula

Attainable performance = min(Peak Compute, Peak Memory Bandwidth × AI)

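Peak Compute (FLOP/s): the maximum number of operations the chip can execute per second
Peak Memory Bandwidth (GB/s): the maximum rate at which data can be moved to and from memory

Example: TPU v4 (illustrative round numbers; Google's published TPU v4 figures are 275 TFLOPS in bf16 and about 1200 GB/s of HBM bandwidth)
- Peak Compute: 430 TFLOPS = 430×10^12 FLOPs/sec
- Peak Memory: 2000 GB/s (HBM bandwidth)
- AI threshold = 430×10^12 / (2000×10^9) = 215 FLOPs/byte

If AI < 215 → memory-bound (limited by bandwidth)
If AI > 215 → compute-bound (limited by hardware FLOPs)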
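The roofline formula translates directly into code. This is a minimal sketch using the example figures above (430 TFLOPS, 2000 GB/s); the function names `ridge_point`, `attainable_flops`, and `classify` are hypothetical, chosen here for illustration.

```python
def ridge_point(peak_flops: float, peak_bw: float) -> float:
    """AI threshold (FLOPs/byte) at which the memory and compute roofs meet."""
    return peak_flops / peak_bw


def attainable_flops(ai: float, peak_flops: float, peak_bw: float) -> float:
    """Roofline ceiling: min(peak compute, peak bandwidth * arithmetic intensity)."""
    return min(peak_flops, peak_bw * ai)


def classify(ai: float, peak_flops: float, peak_bw: float) -> str:
    """Label a workload by which roof limits it."""
    return "memory-bound" if ai < ridge_point(peak_flops, peak_bw) else "compute-bound"


if __name__ == "__main__":
    PEAK_FLOPS = 430e12  # FLOP/s, example figure from the text
    PEAK_BW = 2000e9     # bytes/s, example figure from the text
    print("ridge point:", ridge_point(PEAK_FLOPS, PEAK_BW), "FLOPs/byte")
    for ai in (128, 512):
        ceiling = attainable_flops(ai, PEAK_FLOPS, PEAK_BW) / 1e12
        print(f"AI={ai}: {classify(ai, PEAK_FLOPS, PEAK_BW)}, ceiling {ceiling:.0f} TFLOPS")
```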

Roofline Chart

Performance (TFLOPS)
430 |            +--------------  Compute roof: 430 TFLOPS
    |          /    Knee: AI = 215
300 |        /
200 |     /
100 |  /         Memory roof: 2000 GB/s × AI
  0 +-----+-----+-----+-----+---
    0    100   200   300   400
         Arithmetic Intensity (FLOP/byte)

Regions:
- Left of knee: Memory-bound (the ceiling rises with AI at a slope set by bandwidth)
- Right of knee: Compute-bound (flat at 430 TFLOPS)

Why This Matters for Design

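If memory-bound: Increase data reuse (better algorithm, larger systolic arrays, better caching) or raise memory bandwidth. Adding compute units does not help.
If compute-bound: Add more compute units (more MACs, wider pipelines). Adding bandwidth does not help.
Design sweet spot: Operate near the knee, where compute and memory bandwidth saturate together and neither resource is over-provisioned.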
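These design rules can be turned into a small what-if calculation. The sketch below is only an illustration: `rebalance_options` is a hypothetical helper, and it assumes the workload's arithmetic intensity is already known. For a memory-bound workload it reports how much more data reuse or bandwidth would bring it to the knee; for a compute-bound one it reports how far compute could grow before bandwidth becomes the limit.

```python
def rebalance_options(workload_ai: float, peak_flops: float, peak_bw: float) -> dict:
    """Suggest what would move a workload to the roofline knee."""
    ridge = peak_flops / peak_bw  # knee, in FLOPs/byte
    if workload_ai < ridge:
        return {
            "regime": "memory-bound",
            # Factor by which data reuse (AI) must grow to reach the knee.
            "reuse_factor_needed": ridge / workload_ai,
            # Bandwidth (bytes/s) that would saturate compute at the current AI.
            "bandwidth_needed": peak_flops / workload_ai,
        }
    return {
        "regime": "compute-bound",
        # Peak compute (FLOP/s) the current bandwidth could still feed at this AI.
        "compute_headroom_flops": peak_bw * workload_ai,
    }


if __name__ == "__main__":
    for ai in (128, 512):
        print(ai, rebalance_options(ai, peak_flops=430e12, peak_bw=2000e9))
```

With the example chip, a workload at 128 FLOPs/byte needs roughly 1.7× more reuse or about 3.4 TB/s of bandwidth to reach the knee.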

Real Examples

Google TPU v4 (Systolic Array)

Specs:
- Compute: 430 TFLOPS
- Memory: 2000 GB/s (HBM)
- AI threshold: 215 FLOP/byte

For matrix multiply (N=256):
- AI = 128 FLOP/byte (< 215)
- Roofline ceiling: 2000 GB/s × 128 FLOP/byte = 256 TFLOPS (about 60% of peak)
- Result: Memory-bound, ~200 TFLOPS achieved (46% of peak)

For matrix multiply (N=1024):
- AI = 512 FLOP/byte (> 215)
- Result: Compute-bound, with a ceiling of 430 TFLOPS (100% of peak); real kernels land somewhat below the ceiling

Insight: Systolic arrays reuse operands efficiently but are still memory-bound for small matrices. Larger matrices and larger batch sizes increase AI.
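To see where the crossover between the two cases falls, the sketch below sweeps the matrix size. It assumes the AI ≈ N/2 estimate from earlier in these notes (K = N, one byte per element) and the example figures of 430 TFLOPS and 2000 GB/s; `roofline_fraction` and `min_compute_bound_n` are hypothetical names.

```python
import math

PEAK_FLOPS = 430e12  # FLOP/s, example figure from the text
PEAK_BW = 2000e9     # bytes/s, example figure from the text


def roofline_fraction(n: int) -> float:
    """Fraction of peak compute the roofline allows for an N x N matmul."""
    ai = n / 2  # AI estimate from the text (assumes K = N, 1 byte/element)
    ceiling = min(PEAK_FLOPS, PEAK_BW * ai)
    return ceiling / PEAK_FLOPS


def min_compute_bound_n() -> int:
    """Smallest N whose estimated AI reaches the knee."""
    ridge = PEAK_FLOPS / PEAK_BW
    return math.ceil(2 * ridge)  # invert AI = N / 2


if __name__ == "__main__":
    for n in (128, 256, 512, 1024):
        print(f"N={n}: at most {roofline_fraction(n):.0%} of peak")
    print("compute-bound from N =", min_compute_bound_n())
```

Under these assumptions the ceiling is a bound, not a prediction: measured throughput such as the ~200 TFLOPS quoted for N=256 sits below it.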

Apple A17 Neural Engine

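Config	Peak Compute	Effective Memory BW	AI Knee (Compute / BW)	Status
16×16 systolic	17 TOPS	100 GB/s	170 OP/byte	Memory-bound
With 512KB cache	17 TOPS	500 GB/s	34 OP/byte	Compute-bound

The Status column assumes a workload whose arithmetic intensity lies between the two knees (34 to 170 OP/byte): raising effective bandwidth with the cache moves the knee below the workload. Apple's published Neural Engine figure for the A17 Pro is 35 TOPS; 17 TOPS matches the earlier A16.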
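The knee values in the table follow from dividing peak compute by effective bandwidth. A minimal sketch, using the table's figures and a hypothetical workload at 100 OP/byte (an assumption chosen only because it lies between the two knees); the `Config` class and `status` function are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Config:
    name: str
    peak_ops: float  # operations per second
    mem_bw: float    # effective bytes per second

    @property
    def knee(self) -> float:
        """AI (OP/byte) at which the config becomes compute-bound."""
        return self.peak_ops / self.mem_bw


def status(cfg: Config, workload_ai: float) -> str:
    """Classify a workload against a config's roofline knee."""
    return "memory-bound" if workload_ai < cfg.knee else "compute-bound"


CONFIGS = [
    Config("16x16 systolic", peak_ops=17e12, mem_bw=100e9),
    Config("With 512KB cache", peak_ops=17e12, mem_bw=500e9),
]

if __name__ == "__main__":
    WORKLOAD_AI = 100.0  # OP/byte, hypothetical workload for illustration
    for cfg in CONFIGS:
        print(f"{cfg.name}: knee {cfg.knee:.0f} OP/byte, {status(cfg, WORKLOAD_AI)}")
```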

Day 17: Cache hierarchies and memory systems.