Why This Matters: The Critical Question
You've designed an AI chip. It can compute 1,000 TFLOPS. But in practice, does it achieve 1,000 TFLOPS on real workloads? Or does it get stuck at 400 TFLOPS?
The roofline model answers this fundamental question: Is your design limited by compute throughput or memory bandwidth?
This distinction changes everything:
- Compute-bound: Add more MACs to go faster
- Memory-bound: Faster MACs don't help; buy faster memory instead
The Mathematical Foundation
Roofline is based on one simple equation:
Attainable Performance = min(Peak Compute, Peak Memory BW × Arithmetic Intensity)
Let's break down each term:
| Term | Definition | Example |
|---|---|---|
| Peak Compute | Theoretical max FLOP/s when every ALU/MAC is busy each cycle | 430 TFLOPS for TPU v4 |
| Peak Memory BW | Theoretical max Bytes/sec from memory | 2 TB/s for TPU v4 HBM |
| Arithmetic Intensity (AI) | FLOPs per Byte loaded/stored | Depends on algorithm |
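
The roofline bound is the smaller of two numbers: peak compute, and peak memory bandwidth multiplied by AI. As a minimal sketch, that is a single `min` call. The function name `attainable_tflops` and its parameter names are illustrative, and the chip figures are the TPU v4 numbers quoted in the table, with units chosen so that TB/s times FLOPs/byte gives TFLOPS.

```python
def attainable_tflops(peak_tflops: float, peak_bw_tbs: float, ai: float) -> float:
    """Roofline bound: the lower of the compute roof and the memory roof."""
    memory_roof = peak_bw_tbs * ai  # TB/s * FLOPs/byte = TFLOPS
    return min(peak_tflops, memory_roof)


# TPU v4 figures as quoted in the table (430 TFLOPS, 2 TB/s).
low_ai = attainable_tflops(430.0, 2.0, ai=10.0)     # memory roof applies
high_ai = attainable_tflops(430.0, 2.0, ai=1000.0)  # compute roof applies
print(low_ai, high_ai)
```

At AI = 10 the memory roof (20 TFLOPS) is the limit; at AI = 1000 the compute roof (430 TFLOPS) is.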
Computing Arithmetic Intensity
Example: Matrix Multiply C = A @ B where all matrices are N×N
Calculation for 256×256 Multiply
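A sketch of the calculation, assuming the usual count of 2N³ FLOPs (roughly one multiply and one add per inner-product term) and counting only the two FP32 input matrices as memory traffic, 2 × N² × 4 bytes. The byte-counting convention is an assumption: counting the output write or using a 2-byte format changes the constant but not the linear growth with N. The helper name `matmul_ai` is illustrative.

```python
def matmul_ai(n: int, bytes_per_element: int = 4, matrices_moved: int = 2) -> float:
    """Arithmetic intensity (FLOPs/byte) of an n x n by n x n matrix multiply."""
    flops = 2 * n**3  # n^2 outputs, each needing about n multiplies and n adds
    bytes_moved = matrices_moved * n * n * bytes_per_element
    return flops / bytes_moved


n = 256
print(f"FLOPs: {2 * n**3:,}")       # 33,554,432
print(f"Bytes: {2 * n * n * 4:,}")  # 524,288
print(f"AI:    {matmul_ai(n):.1f} FLOPs/byte")
```

Under these assumptions AI = N/4, so doubling N doubles the intensity. Working from rounded inputs (33.6M FLOPs over 512K bytes) can shift the result by a percent or two.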
Notice: AI scales with N! Larger matrices have higher AI (better efficiency).
| Matrix Size | AI (FLOPs/Byte) | Category |
|---|---|---|
| 16×16 | 4 | Memory-bound |
| 64×64 | 16 | Memory-bound |
| 256×256 | 64 | Depends on roofline knee |
| 1024×1024 | 256 | Likely compute-bound |
The Roofline Graph: Visualizing the Trade-off
Calculating the Knee
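The knee is where the compute and memory limits meet: AI_knee = Peak Compute / Peak Memory BW.
For TPU v4: AI_knee = 430 TFLOPS / 2 TB/s = 215 FLOPs/B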
Interpretation: Any operation with AI < 215 is memory-bound on TPU v4. Any operation with AI > 215 can achieve peak compute.
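The same rule can be written as a short sketch. The names `knee` and `classify` are illustrative, and the chip figures are the TPU v4 numbers used in this lesson.

```python
def knee(peak_tflops: float, peak_bw_tbs: float) -> float:
    """AI (FLOPs/byte) at which the memory roof meets the compute roof."""
    return peak_tflops / peak_bw_tbs


def classify(ai: float, peak_tflops: float, peak_bw_tbs: float) -> str:
    """Label an operation by which roof limits it on the given chip."""
    return "memory-bound" if ai < knee(peak_tflops, peak_bw_tbs) else "compute-bound"


tpu_v4 = (430.0, 2.0)  # (peak TFLOPS, peak TB/s) as quoted in this lesson
print(knee(*tpu_v4))             # 215.0
print(classify(64.0, *tpu_v4))   # memory-bound
print(classify(256.0, *tpu_v4))  # compute-bound
```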
Real Examples: What's the AI for Common Operations?
| Operation | FLOPs | Bytes | AI (FLOPs/B) | On TPU v4 |
|---|---|---|---|---|
| Matrix Multiply (256×256) | 33.6M | 512K | 64 | Memory-bound (≈128 TFLOPS) |
| Matrix Multiply (1024×1024) | 2.1B | 8M | 256 | Compute-bound (430 TFLOPS) |
| Element-wise ReLU | N | 2N bytes | 0.5 | Memory-bound (memory only) |
| Layer Norm | 5N | 4N bytes | 1.25 | Memory-bound |
| Softmax (attention) | 5N² | N² bytes | 5.0 | Memory-bound |
| Conv2D (3×3 kernel) | 18HWK² | 4(H×W + K²) | ≈9 | Memory-bound |
| Batch Matmul (GEMM) | Varies | Varies | Batch-dependent | Depends on batch |
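
To make the "batch-dependent" entry concrete, here is a sketch for a single linear layer y = xW with a batch of B rows. The layer shape (4096 × 4096), the 2-byte element size, and the choice to count input, weight, and output traffic are all assumptions for illustration, and `linear_layer_ai` is a hypothetical helper name.

```python
def linear_layer_ai(batch: int, d_in: int, d_out: int, bytes_per_element: int = 2) -> float:
    """AI (FLOPs/byte) of y = x @ W with x: batch x d_in and W: d_in x d_out."""
    flops = 2 * batch * d_in * d_out
    # Traffic: read x, read W, write y. The weights are shared by every row of the batch.
    elements = batch * d_in + d_in * d_out + batch * d_out
    return flops / (elements * bytes_per_element)


for batch in (1, 8, 64, 512):
    ai = linear_layer_ai(batch, 4096, 4096)
    print(f"batch={batch:4d}  AI={ai:7.1f} FLOPs/byte")
```

At batch 1 the weight matrix is read once per row, so AI is just under 1 FLOP/byte. Larger batches amortize that read, and at batch 512 the AI is 409.6, which is above the 215 FLOPs/B knee used in this lesson.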
Case Study 1: Why TPU Dominates Matrix Math
Scenario: Multiply two 1024×1024 matrices
TPU v4 (Systolic Array)
NVIDIA H100 (Tensor Core)
Conclusion: The TPU achieves a higher percentage of its peak because its systolic architecture is optimized for exactly this access pattern.
Case Study 2: Why Attention Layers Bottleneck Transformers
Problem: Self-attention in a 1B-parameter transformer
In the transformer architecture, attention is the expensive operation:
Attention Computation
This is why attention scales poorly: softmax and the surrounding attention steps break into many small operations with low AI. The roofline predicts 13% utilization, and real measurements confirm this.
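For a memory-bound operation, the roofline's utilization prediction is simply AI divided by the knee, so a quoted utilization can be turned back into an effective AI. The sketch below does that arithmetic with the TPU v4 knee of 215 FLOPs/B used in this lesson. The function names are illustrative, and the result is a back-of-the-envelope figure, not a measurement.

```python
def predicted_utilization(ai: float, knee: float) -> float:
    """Fraction of peak compute the roofline allows at a given AI."""
    return min(1.0, ai / knee)


def implied_ai(utilization: float, knee: float) -> float:
    """Invert the memory-bound branch: the AI that yields a given utilization."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return utilization * knee


KNEE_TPU_V4 = 215.0  # FLOPs/byte, from 430 TFLOPS / 2 TB/s

ai = implied_ai(0.13, KNEE_TPU_V4)
print(f"13% utilization corresponds to an effective AI of about {ai:.0f} FLOPs/byte")
print(f"Softmax alone at AI=5: {predicted_utilization(5.0, KNEE_TPU_V4):.1%} of peak")
```

In other words, a 13% prediction implies a blended AI of roughly 28 FLOPs/byte across the attention block, well below the knee.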
Design Implication: What Should You Optimize?
For memory-bound operations (AI < knee):
- ✓ Increase memory bandwidth (HBM, 3D stacking, NVLink)
- ✓ Increase data reuse so fewer bytes are loaded per FLOP (tiling, kernel fusion)
- ✗ Adding more MACs is useless (they'll starve)
- ✗ Faster clock speed doesn't help
For compute-bound operations (AI > knee):
- ✓ Add more MACs/ALUs
- ✓ Increase clock speed
- ✗ Faster memory doesn't help (already computing faster than memory feeds)
- ✗ Larger cache doesn't help (computation is the limit)
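These rules fall straight out of the roofline formula, as a small what-if sketch shows. The baseline is the TPU v4 figures used in this lesson (430 TFLOPS, 2 TB/s); `speedup` and its scaling parameters are hypothetical names, and the model ignores everything except the two roofs.

```python
def roofline_tflops(peak_tflops: float, bw_tbs: float, ai: float) -> float:
    """Attainable TFLOPS: the lower of the compute roof and the memory roof."""
    return min(peak_tflops, bw_tbs * ai)


def speedup(ai: float, compute_scale: float = 1.0, bw_scale: float = 1.0,
            peak_tflops: float = 430.0, bw_tbs: float = 2.0) -> float:
    """Roofline speedup from scaling peak compute and/or memory bandwidth."""
    before = roofline_tflops(peak_tflops, bw_tbs, ai)
    after = roofline_tflops(peak_tflops * compute_scale, bw_tbs * bw_scale, ai)
    return after / before


for label, ai in [("memory-bound, AI=5", 5.0), ("compute-bound, AI=1000", 1000.0)]:
    print(f"{label}: 2x MACs -> {speedup(ai, compute_scale=2.0):.2f}x, "
          f"2x bandwidth -> {speedup(ai, bw_scale=2.0):.2f}x")
```

An operation only slightly above the knee, such as AI = 256 here, gains about 1.19x rather than 2x from doubling the MACs, because the higher compute roof pushes it into the memory-bound region.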
The Systolic Array Advantage: Moving the Knee
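Standard GPU (H100): AI_knee = 1450 TFLOPS / 3 TB/s ≈ 480 FLOPs/B
Systolic Array (TPU): AI_knee = 430 TFLOPS / 2 TB/s = 215 FLOPs/B
But wait: the GPU has more peak TFLOPS, even if its higher knee means an operation needs more AI to reach that peak. So why is the TPU better?
Answer: Data reuse changes the effective AI. A systolic array passes operands directly between neighboring MACs, so each byte fetched from memory feeds many multiply-accumulates, and the AI seen by HBM rises above what a naive count of loads suggests.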
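One standard way to see the effect of reuse is the blocked-matmul traffic model: if T×T tiles of the operands are held on chip while an output tile is computed, off-chip AI grows in proportion to T. The sketch below is that textbook simplification, not a description of either chip's actual dataflow. `tile` is a hypothetical on-chip block size, only input traffic is counted, and elements are assumed to be 4 bytes.

```python
def tiled_matmul_ai(n: int, tile: int, bytes_per_element: int = 4) -> float:
    """Off-chip AI of an n x n matmul when tile x tile blocks are reused on chip."""
    if n % tile != 0:
        raise ValueError("tile must divide n in this simplified model")
    tiles_per_dim = n // tile
    flops = 2 * n**3
    # Each of the tiles_per_dim^2 output tiles streams tiles_per_dim tiles of A
    # and tiles_per_dim tiles of B, each holding tile^2 elements.
    elements_loaded = tiles_per_dim**2 * 2 * tiles_per_dim * tile**2
    return flops / (elements_loaded * bytes_per_element)


for tile in (1, 16, 128, 1024):
    print(f"tile={tile:5d}  AI={tiled_matmul_ai(1024, tile):7.2f} FLOPs/byte")
```

With no reuse (tile = 1) every multiply-add fetches both operands and AI is 0.25 FLOPs/byte. With the whole matrix resident (tile = 1024) it reaches 256, so the same operation can sit on either side of the knee depending on how much reuse the hardware captures.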
Roofline in Practice: Profiling Your Design
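# Pseudo-code: measure roofline performance.
# count_flops, count_memory_bytes, and benchmark are placeholder helpers you supply.
def measure_roofline(chip, operation, size):
    flops = count_flops(operation, size)
    bytes_moved = count_memory_bytes(operation, size)
    ai = flops / bytes_moved               # arithmetic intensity, FLOPs per byte

    peak_compute = chip.peak_tflops        # TFLOPS
    peak_memory = chip.memory_bandwidth    # TB/s
    knee = peak_compute / peak_memory      # FLOPs per byte

    if ai < knee:
        bottleneck = "MEMORY"
        predicted = peak_memory * ai       # TB/s * FLOPs/byte = TFLOPS
    else:
        bottleneck = "COMPUTE"
        predicted = peak_compute

    actual = benchmark(operation, size)    # measured TFLOPS
    efficiency = actual / predicted        # fraction of the roofline achieved

    print(f"AI={ai:.1f}, Knee={knee:.1f}")
    print(f"Bottleneck: {bottleneck}")
    print(f"Predicted: {predicted:.0f} TFLOPS")
    print(f"Actual: {actual:.0f} TFLOPS ({efficiency*100:.0f}%)")

    # Return named fields so callers can look results up by key.
    return {
        "ai": ai,
        "bottleneck": bottleneck,
        "predicted": predicted,
        "actual": actual,
        "efficiency": efficiency,
    }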
# Example results:
measure_roofline(TPU_v4, "matmul_1024x1024", 1024)
# Output: AI=256, Bottleneck=COMPUTE, Predicted=430T, Actual=415T (97%)
measure_roofline(H100, "matmul_1024x1024", 1024)
# Output: AI=256, Bottleneck=COMPUTE, Predicted=1450T, Actual=920T (63%)
measure_roofline(TPU_v4, "softmax", 2048)
# Output: AI=5.2, Bottleneck=MEMORY, Predicted=10T, Actual=9T (90%)
Summary: The Roofline Framework
The roofline model is your design debugging tool:
- ✓ Understand bottlenecks before building silicon
- ✓ Know whether to optimize compute or memory
- ✓ Predict real-world performance from AI
- ✓ Justify spending on expensive components (HBM costs $$$)
- ✓ Identify which operations need algorithmic changes (fuse kernels, increase batch size to raise AI)
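As a sketch of why kernel fusion raises AI, consider a chain of element-wise operations on one tensor. The model below is an assumption-laden simplification: each operation costs one FLOP per element, elements are 4 bytes, and an unfused operation reads its input from memory and writes its output back. `chain_ai` is a hypothetical helper name.

```python
def chain_ai(num_ops: int, fused: bool, flops_per_op: int = 1,
             bytes_per_element: int = 4) -> float:
    """AI (FLOPs/byte) of num_ops element-wise ops applied in sequence."""
    flops = num_ops * flops_per_op  # per element
    # Unfused: every op makes a full read and a full write pass over the tensor.
    # Fused: one read at the start and one write at the end; intermediates stay on chip.
    passes = 1 if fused else num_ops
    bytes_moved = passes * 2 * bytes_per_element  # per element
    return flops / bytes_moved


print(f"3 ops, unfused: {chain_ai(3, fused=False):.3f} FLOPs/byte")
print(f"3 ops, fused:   {chain_ai(3, fused=True):.3f} FLOPs/byte")
```

Fusing three operations triples the AI in this model (0.125 to 0.375 FLOPs/byte). The chain is still far below the knee, but on the memory-bound slope every gain in AI is a proportional gain in attainable throughput.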
Next (Day 17): Cache hierarchies and how to reduce memory traffic.