Day 16

Roofline Model

The roofline model is the key tool for understanding whether a workload on your AI chip is compute-bound or memory-bound. It also yields design insights for systolic arrays, GPUs, and CPUs.

The Problem: Is Memory the Bottleneck?

For matrix multiplication C = A @ B:

A[N×K], B[K×N], C[N×N]

FLOPs: 2·N·K·N = 2·N²·K
Memory traffic: 2·N² (read and write C) + 2·N·K (load A and B)
Arithmetic intensity = FLOPs / Bytes = 2·N²·K / (2·N² + 2·N·K)

With K = N and one byte per element, this simplifies to 2·N³ / (4·N²) = N / 2.

For N=256: AI = 128 FLOPs/byte
For N=1024: AI = 512 FLOPs/byte
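The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that follows the traffic model used in this lesson (2·N² + 2·N·K elements moved); the function name and the `bytes_per_element` parameter are illustrative assumptions, with the default of one byte per element chosen so that the results match the figures quoted here.

```python
def matmul_arithmetic_intensity(n: int, k: int, bytes_per_element: int = 1) -> float:
    """FLOPs per byte for C[n x n] = A[n x k] @ B[k x n].

    Uses the lesson's traffic model: 2*n*n + 2*n*k elements moved.
    bytes_per_element is an assumption (1 reproduces the lesson's numbers).
    """
    flops = 2 * n * n * k                      # one multiply and one add per term
    elements_moved = 2 * n * n + 2 * n * k     # traffic model from the text
    return flops / (elements_moved * bytes_per_element)


if __name__ == "__main__":
    for n in (256, 1024):
        ai = matmul_arithmetic_intensity(n, n)
        print(f"N={n}: AI = {ai:.0f} FLOPs/byte")
```

With wider data types the intensity per byte drops proportionally: passing `bytes_per_element=2` for a 16-bit format halves each result.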

Roofline Formula

Attainable performance = min(Peak Compute, Peak Memory Bandwidth × AI)

Peak Compute (FLOPS): The maximum number of floating-point operations the chip can execute per second
Peak Memory Bandwidth (GB/s): The maximum rate at which data can move to and from memory

Example: TPU v4 (round figures used throughout this lesson)
- Peak Compute: 430 TFLOPS = 430×10^12 FLOPs/sec
- Peak Memory: 2000 GB/s (HBM bandwidth)
- AI threshold = 430×10^12 / (2000×10^9) = 215 FLOPs/byte

If AI < 215 → memory-bound (limited by bandwidth)
If AI > 215 → compute-bound (limited by hardware FLOPs)
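The formula and the threshold calculation translate directly into code. The sketch below is illustrative only: the function names are assumptions, and the constants are the lesson's example figures rather than values taken from a datasheet.

```python
def ridge_point(peak_flops: float, peak_bandwidth: float) -> float:
    """AI (FLOPs/byte) at which the memory roof meets the compute roof."""
    return peak_flops / peak_bandwidth


def attainable_flops(ai: float, peak_flops: float, peak_bandwidth: float) -> float:
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_flops, peak_bandwidth * ai)


# Example figures from the text (FLOPs/sec and bytes/sec).
PEAK_FLOPS = 430e12
PEAK_BANDWIDTH = 2000e9

if __name__ == "__main__":
    knee = ridge_point(PEAK_FLOPS, PEAK_BANDWIDTH)
    print(f"AI threshold: {knee:.0f} FLOPs/byte")
    for ai in (128, 512):
        bound = attainable_flops(ai, PEAK_FLOPS, PEAK_BANDWIDTH)
        kind = "memory-bound" if ai < knee else "compute-bound"
        print(f"AI={ai}: at most {bound / 1e12:.0f} TFLOPS ({kind})")
```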

Roofline Chart

Performance (TFLOPS)
  430 |                     +-------------------  Compute roof (430 TFLOPS)
  300 |              /      ^ Knee: AI = 215
  200 |         /
  100 |    /      Memory roof: 2000 GB/s × AI
    0 +-----------------------------------------
      0        100         215     300       400
            Arithmetic Intensity (FLOPs/byte)

Regions:
- Left of knee: Memory-bound (performance rises linearly with AI; the slope is the memory bandwidth)
- Right of knee: Compute-bound (flat at 430 TFLOPS)
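To read exact values off the chart, the roofline can be tabulated at a few arithmetic intensities. This is a small sketch using the example figures from this lesson; the helper names and the choice of sample points are assumptions made for illustration.

```python
PEAK_TFLOPS = 430.0          # compute roof from the example
BANDWIDTH_TB_PER_S = 2.0     # 2000 GB/s expressed in TB/s
KNEE = PEAK_TFLOPS / BANDWIDTH_TB_PER_S   # FLOPs/byte where the roofs meet


def roofline_tflops(ai: float) -> float:
    """Attainable TFLOPS at a given arithmetic intensity (FLOPs/byte)."""
    return min(PEAK_TFLOPS, BANDWIDTH_TB_PER_S * ai)


def region(ai: float) -> str:
    """Which side of the knee a given arithmetic intensity falls on."""
    return "memory-bound" if ai < KNEE else "compute-bound"


if __name__ == "__main__":
    for ai in (0, 100, 200, 215, 300, 400):
        print(f"AI={ai:>3}  {roofline_tflops(ai):6.1f} TFLOPS  {region(ai)}")
```

TB/s multiplied by FLOPs/byte gives TFLOPS directly, which is why the bandwidth is stored as 2.0 rather than 2000.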

Why This Matters for Design

Real Examples

Google TPU v4 (Systolic Array)

Specs:
- Compute: 430 TFLOPS
- Memory: 2000 GB/s (HBM)
- AI threshold: 215 FLOPs/byte

For matrix multiply (N=256):
- AI = 128 FLOPs/byte (< 215)
- Roofline bound: 2000 GB/s × 128 FLOPs/byte = 256 TFLOPS
- Result: Memory-bound, ~200 TFLOPS achieved (46% of peak)

For matrix multiply (N=1024):
- AI = 512 FLOPs/byte (> 215)
- Result: Compute-bound, up to ~430 TFLOPS (100% of peak)

Insight: Systolic arrays are memory-efficient but are still memory-bound for small matrices. Larger matrices and larger batch sizes increase AI.
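A natural follow-up question is how large a square matrix must be before the multiply stops being memory-bound. The sketch below assumes the lesson's simplification AI = N / 2 for square matrices and the example specs above; the function names are hypothetical. It computes roofline upper bounds only, so measured throughput (such as the ~200 TFLOPS quoted for N=256) can fall below them.

```python
import math

PEAK_TFLOPS = 430.0          # example compute roof
BANDWIDTH_TB_PER_S = 2.0     # example HBM bandwidth (2000 GB/s)


def square_matmul_ai(n: int) -> float:
    """Lesson's approximation for square matmul: AI = N / 2 FLOPs/byte."""
    return n / 2


def roofline_bound_tflops(n: int) -> float:
    """Upper bound on throughput for an N x N matmul under the roofline."""
    return min(PEAK_TFLOPS, BANDWIDTH_TB_PER_S * square_matmul_ai(n))


def min_compute_bound_n() -> int:
    """Smallest N whose AI reaches the knee (N / 2 >= peak / bandwidth)."""
    return math.ceil(2 * PEAK_TFLOPS / BANDWIDTH_TB_PER_S)


if __name__ == "__main__":
    for n in (256, 1024):
        bound = roofline_bound_tflops(n)
        print(f"N={n}: bound {bound:.0f} TFLOPS ({bound / PEAK_TFLOPS:.0%} of peak)")
    print(f"Compute-bound from N = {min_compute_bound_n()}")
```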

Apple A17 Neural Engine

| Config | Compute | Memory BW | AI Knee | Status |
|---|---|---|---|---|
| 16×16 systolic | 17 TOPS | 100 GB/s | 170 ops/byte | Memory-bound |
| With 512 KB cache | 17 TOPS | 500 GB/s | 34 ops/byte | Compute-bound |
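The knee values in the table follow from dividing compute by bandwidth. The sketch below recomputes them and classifies one workload against each configuration. The table does not say which workload its Status column refers to, so the `WORKLOAD_AI` value of 64 ops/byte is purely a hypothetical chosen to fall between the two knees; the function names are also assumptions.

```python
# (label, peak ops/sec, effective bandwidth in bytes/sec) from the table.
CONFIGS = [
    ("16x16 systolic", 17e12, 100e9),
    ("With 512KB cache", 17e12, 500e9),
]

# Hypothetical workload intensity (ops/byte); not given in the source table.
WORKLOAD_AI = 64.0


def knee(peak_ops_per_s: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity at which the memory roof meets the compute roof."""
    return peak_ops_per_s / bandwidth_bytes_per_s


def status(workload_ai: float, knee_value: float) -> str:
    """Classify a workload relative to a configuration's knee."""
    return "memory-bound" if workload_ai < knee_value else "compute-bound"


if __name__ == "__main__":
    for label, peak, bandwidth in CONFIGS:
        k = knee(peak, bandwidth)
        print(f"{label}: knee = {k:.0f} ops/byte, {status(WORKLOAD_AI, k)}")
```

Raising the effective bandwidth moves the knee left, so the same workload can shift from memory-bound to compute-bound without any change in peak compute.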

Day 17: Cache hierarchies and memory systems.