The Problem: Is Memory the Bottleneck?
For matrix multiplication C = A @ B:
A[N×K], B[K×N], C[N×N]
FLOPs: 2·N·K·N = 2·N²·K
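Memory traffic: 2·N² + 2·N·K elements (read and write C: 2·N²; load A and B once each: 2·N·K)
Arithmetic intensity (AI) = FLOPs / Bytes
= 2·N²·K / (2·N² + 2·N·K)
= N·K / (N + K)
= N / 2 for square matrices (K = N), assuming 1 byte per element (e.g. int8)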
For N=256: AI ≈ 128 FLOPs/byte
For N=1024: AI ≈ 512 FLOPs/byte
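The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that follows this section's traffic estimate of 2·N² + 2·N·K elements; the function name and the `bytes_per_element` parameter are assumptions introduced for illustration, and the default of 1 byte per element is what makes the result match the FLOPs/byte figures quoted here.

```python
def matmul_arithmetic_intensity(n: int, k: int, bytes_per_element: int = 1) -> float:
    """FLOPs per byte for C[n x n] = A[n x k] @ B[k x n] under this section's traffic model."""
    flops = 2 * n * n * k                    # one multiply and one add per inner-product term
    elements_moved = 2 * n * n + 2 * n * k   # traffic estimate used in this section
    # bytes_per_element is an assumption: 1 reproduces the numbers in the text.
    return flops / (elements_moved * bytes_per_element)


for n in (256, 1024):
    print(f"N={n}: AI = {matmul_arithmetic_intensity(n, n):.0f} FLOPs/byte")
```

Passing `bytes_per_element=2` (for example, a 16-bit format) halves the intensity, which is worth keeping in mind when comparing against a bandwidth quoted in bytes per second.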
Roofline Formula
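Attainable performance = min(Peak Compute, Peak Memory Bandwidth × AI)
Peak Compute (FLOP/s): The maximum number of floating-point operations the chip can execute per second
Peak Memory Bandwidth (GB/s): The maximum rate at which data can move between memory and the compute units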
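Example: TPU v4 (rounded, illustrative figures; Google's published per-chip peak is 275 TFLOPS bf16 with 1200 GB/s of HBM bandwidth)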
- Peak Compute: 430 TFLOPS = 430×10^12 FLOPs/sec
- Peak Memory: 2000 GB/s (HBM bandwidth)
- AI threshold = 430×10^12 / (2000×10^9) = 215 FLOPs/byte
If AI < 215 → memory-bound (limited by bandwidth)
If AI > 215 → compute-bound (limited by hardware FLOPs)
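The roofline rule can be written down directly. The sketch below uses the compute and bandwidth figures from this example; the function names (`knee`, `attainable_flops`, `classify`) are hypothetical helpers, not part of any library.

```python
PEAK_FLOPS = 430e12       # FLOP/s, compute figure from the example above
PEAK_BANDWIDTH = 2000e9   # bytes/s, HBM bandwidth figure from the example above


def knee(peak_flops: float, peak_bandwidth: float) -> float:
    """Arithmetic intensity (FLOPs/byte) where the memory roof meets the compute roof."""
    return peak_flops / peak_bandwidth


def attainable_flops(ai: float, peak_flops: float, peak_bandwidth: float) -> float:
    """Roofline upper bound on performance (FLOP/s) at arithmetic intensity ai."""
    return min(peak_flops, peak_bandwidth * ai)


def classify(ai: float, peak_flops: float, peak_bandwidth: float) -> str:
    """Label a workload by which roof limits it."""
    return "memory-bound" if ai < knee(peak_flops, peak_bandwidth) else "compute-bound"


for ai in (128, 512):
    bound = attainable_flops(ai, PEAK_FLOPS, PEAK_BANDWIDTH)
    label = classify(ai, PEAK_FLOPS, PEAK_BANDWIDTH)
    print(f"AI={ai}: at most {bound / 1e12:.0f} TFLOPS ({label})")
```

The value returned by `attainable_flops` is an upper bound; measured throughput will generally sit at or below it.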
Roofline Chart
Performance (TFLOPS)
 430 |                  +----------------  compute roof (430 TFLOPS)
     |              /
 215 |         /
     |    /   memory roof: 2000 GB/s × AI
   0 +--------+--------+--------+--------+
     0       100      200      300      400
          Arithmetic Intensity (FLOP/byte)
Knee: AI = 215 FLOP/byte, where the memory roof meets the compute roof.
Regions:
- Left of knee: Memory-bound (performance rises linearly with AI; the slope is the memory bandwidth)
- Right of knee: Compute-bound (flat at 430 TFLOPS)
Why This Matters for Design
- If memory-bound: Increase data reuse (better algorithm, larger systolic arrays, better caching)
- If compute-bound: Add more compute units (more MACs, wider pipelines)
- Design sweet spot: Operate near the knee, where compute units and memory bandwidth are both fully used and neither sits idle
Real Examples
Google TPU v4 (Systolic Array)
Specs:
- Compute: 430 TFLOPS
- Memory: 2000 GB/s (HBM)
- AI threshold: 215 FLOP/byte
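For a 256×256 matrix multiply (N = 256):
- AI = 128 FLOP/byte (< 215)
- Result: Memory-bound. The roofline allows at most 2000 GB/s × 128 FLOP/byte = 256 TFLOPS (60% of peak); ~200 TFLOPS achieved (46% of peak)
For a 1024×1024 matrix multiply (N = 1024):
- AI = 512 FLOP/byte (> 215)
- Result: Compute-bound. The roofline allows up to the full 430 TFLOPS (100% of peak), which is an upper bound rather than a guaranteed result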
Insight: Systolic arrays are memory-efficient but still memory-bound for small matrices. Large batch sizes increase AI.
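To see where small matrices stop being memory-bound, the sketch below sweeps N using this section's AI = N/2 approximation for square matrices and the TPU v4 figures quoted above. The helper names are hypothetical, and the percentages are roofline upper bounds, not measurements.

```python
PEAK_FLOPS = 430e12       # FLOP/s, compute figure quoted in this section
PEAK_BANDWIDTH = 2000e9   # bytes/s, HBM bandwidth figure quoted in this section


def roofline_fraction(n: int) -> float:
    """Fraction of peak compute the roofline allows for an N x N matmul (AI = N / 2)."""
    ai = n / 2
    return min(PEAK_FLOPS, PEAK_BANDWIDTH * ai) / PEAK_FLOPS


def smallest_compute_bound_n() -> int:
    """Smallest N whose arithmetic intensity reaches the knee."""
    knee = PEAK_FLOPS / PEAK_BANDWIDTH
    n = 1
    while n / 2 < knee:
        n += 1
    return n


for n in (128, 256, 512, 1024):
    print(f"N={n:5d}  roofline limit = {roofline_fraction(n):.0%} of peak")
print("compute-bound from N =", smallest_compute_bound_n())
```

Under these assumptions the crossover sits at N = 430, so N = 256 is capped by bandwidth while N = 1024 is well inside the compute-bound region.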
Apple A17 Neural Engine
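| Config | Compute | Memory BW | AI Knee | Status |
|---|---|---|---|---|
| 16×16 systolic | 17 TOPS | 100 GB/s | 170 OP/byte | Memory-bound |
| With 512KB cache | 17 TOPS | 500 GB/s | 34 OP/byte | Compute-bound |
The Status column holds only for a workload whose arithmetic intensity lies between the two knees (34 to 170 OP/byte): such a workload is below the knee in the first row and above it in the second.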
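The knee values in the table follow from dividing compute by bandwidth. The sketch below recomputes them and classifies one workload; `WORKLOAD_AI` is a hypothetical value chosen for illustration and does not come from the table.

```python
# (peak ops/s, effective bandwidth in bytes/s) taken from the table above
CONFIGS = {
    "16x16 systolic": (17e12, 100e9),
    "With 512KB cache": (17e12, 500e9),
}

WORKLOAD_AI = 64.0  # ops/byte; hypothetical workload, not a measured figure


def knee_ops_per_byte(peak_ops: float, bandwidth: float) -> float:
    """Arithmetic intensity at which the workload stops being bandwidth-limited."""
    return peak_ops / bandwidth


def status(workload_ai: float, peak_ops: float, bandwidth: float) -> str:
    """Which roof limits a workload with the given arithmetic intensity."""
    return "memory-bound" if workload_ai < knee_ops_per_byte(peak_ops, bandwidth) else "compute-bound"


for name, (ops, bw) in CONFIGS.items():
    print(f"{name}: knee = {knee_ops_per_byte(ops, bw):.0f} OP/byte, "
          f"{status(WORKLOAD_AI, ops, bw)} at AI = {WORKLOAD_AI:.0f}")
```

Raising the effective bandwidth does not change peak compute; it moves the knee left, so the same workload can cross from the memory-bound region into the compute-bound one.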
Day 17: Cache hierarchies and memory systems.