
The Roofline Model Deep Dive

The fundamental tool for understanding AI chip bottlenecks. Is your design compute-bound or memory-bound? Complete mathematical framework with real examples from Apple to NVIDIA.

Why This Matters: The Critical Question

You've designed an AI chip. It can compute 1,000 TFLOPS. But in practice, does it achieve 1,000 TFLOPS on real workloads? Or does it get stuck at 400 TFLOPS?

The roofline model answers this fundamental question: Is your design limited by compute throughput or memory bandwidth?

This distinction changes everything:

The Mathematical Foundation

Roofline is based on one simple equation:

Attainable Performance = min( Peak Compute (FLOPS), Peak Memory Bandwidth (Bytes/sec) × Arithmetic Intensity (FLOPs/Byte) )

Let's break each term:

| Term | Definition | Example |
|---|---|---|
| Peak Compute | Theoretical max FLOPS if all ALUs/MACs fire | 430 TFLOPS for TPU v4 |
| Peak Memory BW | Theoretical max Bytes/sec from memory | 2 TB/s for TPU v4 HBM |
| Arithmetic Intensity (AI) | FLOPs per Byte loaded/stored | Depends on algorithm |
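The min() relation above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original lesson: the function name `attainable_tflops` and its parameter names are assumptions, and the 430 TFLOPS and 2 TB/s figures are the TPU v4 values quoted in the table.

```python
def attainable_tflops(peak_tflops: float, bandwidth_tbps: float, ai: float) -> float:
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity).

    peak_tflops    -- peak compute in TFLOPS
    bandwidth_tbps -- peak memory bandwidth in TB/s
    ai             -- arithmetic intensity in FLOPs per byte
    TB/s times FLOPs/byte gives TFLOPS, so both arguments of min() share units.
    """
    return min(peak_tflops, bandwidth_tbps * ai)


# Sweep a few intensities using the lesson's TPU v4 figures.
for ai in (1, 50, 215, 300):
    print(f"AI={ai:>3} FLOPs/byte -> {attainable_tflops(430.0, 2.0, ai):.0f} TFLOPS")
```

Keeping compute in TFLOPS and bandwidth in TB/s avoids carrying the 10¹² factors through every calculation.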

Computing Arithmetic Intensity

Example: Matrix Multiply C = A @ B where all matrices are N×N

📊 Calculation for 256×256 Multiply

Total FLOPs: N × N × (2N - 1) ≈ 2N³ = 2×256³ = 33,554,432 (about 33.6 million)
Memory Bytes (assuming FP32, counting only the inputs A and B): 2 matrices × N² × 4 bytes = 2×256²×4 = 524,288 bytes (512 KiB)
Arithmetic Intensity: 33,554,432 FLOPs ÷ 524,288 Bytes = 64 FLOPS/Byte

Notice: AI scales with N. Under this counting AI = 2N³ ÷ 8N² = N/4, so larger matrices have higher AI (better efficiency).

| Matrix Size | AI (FLOPS/Byte) | Category |
|---|---|---|
| 16×16 | 4 | Memory-bound |
| 64×64 | 16 | Memory-bound |
| 256×256 | 64 | Depends on roofline knee |
| 1024×1024 | 256 | Likely compute-bound |
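The per-size intensities can be recomputed with a short Python sketch. It follows the counting used in the worked example (2N³ FLOPs, FP32, only the two input matrices as traffic). The `count_output` flag is an added assumption that shows how the figure drops if the write of C is counted as well.

```python
def matmul_ai(n: int, bytes_per_element: int = 4, count_output: bool = False) -> float:
    """Arithmetic intensity (FLOPs per byte) of an N x N matrix multiply."""
    flops = 2 * n**3                         # approximates n * n * (2n - 1)
    matrices = 3 if count_output else 2      # A and B, optionally C as well
    bytes_moved = matrices * n * n * bytes_per_element
    return flops / bytes_moved


for n in (16, 64, 256, 1024):
    print(f"{n}x{n}: inputs only = {matmul_ai(n):.1f}, "
          f"with output = {matmul_ai(n, count_output=True):.1f} FLOPs/byte")
```

With inputs only, the result simplifies to N/4 for FP32, which is why intensity grows linearly with matrix size.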

The Roofline Graph: Visualizing the Trade-off

TFLOPS
  430 |                          ____________  Compute roof (430 TFLOPS)
  400 |                        /
  300 |                  /
  200 |            /               Memory roof (slope = 2 TB/s)
  100 |      /
    0 |/_____________________________________
      0     50    100   150   200   250   300
                                 ^ Knee (AI = 215)
              Arithmetic Intensity (FLOPS/Byte)

Regions:
• Left of knee (AI < 215): Memory-bound
  - Performance limited by bandwidth
  - Performance = 2 TB/s × AI (a line with slope 2 TB/s)
• Right of knee (AI > 215): Compute-bound
  - Performance = flat 430 TFLOPS
  - Memory is not the bottleneck

Calculating the Knee

The knee is where compute and memory limits meet:

AI_knee = Peak Compute / Peak Bandwidth = 430 TFLOPS / 2 TB/s = (430 × 10¹² FLOPS) / (2 × 10¹² bytes/sec) = 215 FLOPS/Byte

Interpretation: Any operation with AI < 215 is memory-bound on TPU v4. Any operation with AI > 215 can achieve peak compute.
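As a quick check of the knee calculation, the sketch below computes the knee and classifies an operation by its intensity. The helper names `knee_ai` and `classify` are illustrative assumptions, and the inputs are the TPU v4 figures used in this lesson.

```python
def knee_ai(peak_tflops: float, bandwidth_tbps: float) -> float:
    """Arithmetic intensity (FLOPs per byte) where the two roofs meet."""
    return peak_tflops / bandwidth_tbps


def classify(ai: float, knee: float) -> str:
    """Label an operation by which side of the knee its intensity falls on."""
    return "memory-bound" if ai < knee else "compute-bound"


tpu_knee = knee_ai(430.0, 2.0)
for ai in (5.0, 64.0, 256.0):
    print(f"AI={ai:>5.1f} vs knee={tpu_knee:.0f}: {classify(ai, tpu_knee)}")
```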

Real Examples: What's the AI for Common Operations?

| Operation | FLOPs | Bytes | AI (FLOPS/B) | On TPU v4 |
|---|---|---|---|---|
| Matrix Multiply (256×256) | 33.6M | 512K | 64 | Memory-bound (~128 TFLOPS) |
| Matrix Multiply (1024×1024) | 2.1B | 8M | 256 | Compute-bound (430 TFLOPS) |
| Element-wise ReLU | N | 2N bytes | 0.5 | Memory-bound (memory only) |
| Layer Norm | 5N | 4N bytes | 1.25 | Memory-bound |
| Softmax (attention) | 5N² | N² bytes | 5.0 | Memory-bound |
| Conv2D (3×3 kernel) | 18HWK² | 4(H×W + K²) | ≈9 | Memory-bound |
| Batch Matmul (GEMM) | Varies | Varies | Batch-dependent | Depends on batch |
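The ReLU row's 0.5 FLOPS/Byte corresponds to one FLOP per element with one read and one write of a 1-byte element. The sketch below is an illustration under that reading, not a figure from the lesson; the name `elementwise_ai` is hypothetical. It shows how the intensity of an element-wise operation shrinks as elements get wider.

```python
def elementwise_ai(flops_per_element: float, bytes_per_element: int,
                   reads: int = 1, writes: int = 1) -> float:
    """FLOPs per byte for an op that streams every element through memory once."""
    return flops_per_element / ((reads + writes) * bytes_per_element)


# ReLU: one comparison per element, one read and one write.
for label, width in (("1-byte", 1), ("FP16", 2), ("FP32", 4)):
    ai = elementwise_ai(1, width)
    bound = min(430.0, 2.0 * ai)   # lesson's TPU v4 figures: 430 TFLOPS, 2 TB/s
    print(f"ReLU {label}: AI={ai:.3f} FLOPs/byte, bound={bound:.2f} TFLOPS")
```

Whatever the element width, the bound stays far below the 430 TFLOPS compute roof, which is why element-wise layers are treated as pure bandwidth consumers.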

Case Study 1: Why TPU Dominates Matrix Math

Scenario: Multiply two 1024×1024 matrices

🔷 TPU v4 (Systolic Array)

Arithmetic Intensity: 256 FLOPS/Byte (> 215 knee)
Predicted Performance: 430 TFLOPS (peak)
Actual Performance: ~415 TFLOPS (97% peak!)
Why so high?: Systolic data reuse reduces real memory traffic

🔷 NVIDIA H100 (Tensor Core)

Arithmetic Intensity: 256 FLOPS/Byte (> knee)
Predicted Performance: 1,450 TFLOPS (peak)
Actual Performance: ~920 TFLOPS (63% peak)
Why lower?: Cache misses, warp divergence, not optimized for 1024×1024

Conclusion: TPU achieves a higher percentage of peak because the systolic architecture is optimized for exactly this access pattern.
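The percent-of-peak figures in this case study follow directly from the quoted numbers. A minimal sketch, where the function name `percent_of_peak` is an assumption and the inputs are the lesson's own values:

```python
def percent_of_peak(actual_tflops: float, peak_tflops: float) -> float:
    """Achieved throughput as a percentage of the chip's peak compute."""
    return 100.0 * actual_tflops / peak_tflops


# (actual, peak) pairs as quoted in the case study above.
results = {"TPU v4": (415.0, 430.0), "H100": (920.0, 1450.0)}
for chip, (actual, peak) in results.items():
    print(f"{chip}: {percent_of_peak(actual, peak):.0f}% of peak")
```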

Case Study 2: Why Attention Layers Bottleneck Transformers

Problem: Self-attention in a 1B-parameter transformer

In the transformer architecture, attention is the expensive operation:

🔶 Attention Computation

Sequence Length: L = 2,048 tokens
Attention Matrix (Q @ K^T): FLOPs = L² × d_k = 2048² × 64 = 268M
Memory (Q, K, V): Bytes = 3 × L × d × 2 (FP16) = 3 × 2048 × 768 × 2 = 9.4 MB
Arithmetic Intensity: 268M / 9.4M = 28.4 FLOPS/Byte (< 215 knee!)
On TPU v4: min(430T, 2 TB/s × 28.4) = 57 TFLOPS (13% peak!)

This is why attention scales poorly: Softmax and attention create many small operations with low AI. The roofline predicts 13% utilization, and real measurements confirm this.
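The attention arithmetic can be reproduced with the sketch below. It follows the lesson's accounting exactly as written (FLOPs from L² × d_k with d_k = 64, bytes from Q, K and V at model width d = 768 in FP16), so a different accounting would give a different intensity. The function name and parameter names are assumptions made for illustration.

```python
def attention_scores_roofline(seq_len: int, d_k: int, d_model: int,
                              bytes_per_element: int = 2,
                              peak_tflops: float = 430.0,
                              bandwidth_tbps: float = 2.0):
    """Return (AI, roofline bound in TFLOPS, bound as % of peak) for Q @ K^T."""
    flops = seq_len * seq_len * d_k                          # lesson's FLOP count
    bytes_moved = 3 * seq_len * d_model * bytes_per_element  # Q, K and V
    ai = flops / bytes_moved
    bound = min(peak_tflops, bandwidth_tbps * ai)
    return ai, bound, 100.0 * bound / peak_tflops


ai, bound_tflops, pct = attention_scores_roofline(2048, 64, 768)
print(f"AI={ai:.1f} FLOPs/byte, bound={bound_tflops:.0f} TFLOPS ({pct:.0f}% of peak)")
```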

Design Implication: What Should You Optimize?

For memory-bound operations (AI < knee):

For compute-bound operations (AI > knee):

The Systolic Array Advantage: Moving the Knee

Standard GPU (H100): AI_knee = 1450 TFLOPS / 3 TB/s ≈ 480 FLOPS/B

Systolic Array (TPU): AI_knee = 430 TFLOPS / 2 TB/s = 215 FLOPS/B

But wait: the GPU has more peak TFLOPS and a higher knee! So why is TPU better?

Answer: Data reuse changes the actual AI:

For matrix multiply, actual memory traffic on:
• H100: O(N²) bytes (limited caching helps, but still high)
• TPU: O(N²/256) bytes (systolic streams data through 256 MACs)

Result: TPU's actual AI is 256× higher for this operation!

Example (1024×1024 multiply):
• H100 roofline predicts compute-bound, but cache misses occur
• TPU roofline predicts compute-bound, and it delivers 97% peak

Systolic architecture changes the fundamental memory access pattern, improving actual AI beyond what the simple roofline predicts.
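One way to reason about data reuse is to treat it as a divisor on off-chip traffic and ask how much reduction an operation needs before it reaches the knee. The sketch below is a simplified model under that assumption. The names `effective_ai`, `reduction_to_reach_knee` and `traffic_reduction` are hypothetical, and the example uses the 256×256 FP32 multiply (inputs only) against the lesson's TPU v4 knee of 215.

```python
def effective_ai(flops: float, nominal_bytes: float,
                 traffic_reduction: float = 1.0) -> float:
    """AI after on-chip reuse cuts off-chip traffic by `traffic_reduction`."""
    return flops / (nominal_bytes / traffic_reduction)


def reduction_to_reach_knee(ai: float, knee: float) -> float:
    """Traffic reduction factor needed for an op to stop being memory-bound."""
    return max(1.0, knee / ai)


flops = 2 * 256**3            # 256 x 256 matrix multiply
nominal = 2 * 256**2 * 4      # A and B in FP32, each read once
base = effective_ai(flops, nominal)
print(f"base AI = {base:.0f} FLOPs/byte")
print(f"reduction needed to reach knee 215: {reduction_to_reach_knee(base, 215.0):.2f}x")
print(f"AI with 4x less traffic = {effective_ai(flops, nominal, 4.0):.0f} FLOPs/byte")
```

The model ignores where the reuse comes from (caches, systolic forwarding, tiling); it only shows how a given traffic reduction moves an operation along the roofline's horizontal axis.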

Roofline in Practice: Profiling Your Design

# Sketch: measure roofline performance for one operation on one chip.
# count_flops, count_memory_bytes and benchmark are placeholders for your own
# operation model and timing harness. chip must expose peak_tflops (TFLOPS)
# and memory_bandwidth (TB/s).
def measure_roofline(chip, operation, size):
    flops = count_flops(operation, size)
    bytes_moved = count_memory_bytes(operation, size)   # loads + stores
    ai = flops / bytes_moved                            # FLOPs per byte

    peak_compute = chip.peak_tflops                     # TFLOPS
    peak_memory = chip.memory_bandwidth                 # TB/s
    knee = peak_compute / peak_memory                   # FLOPs per byte

    if ai < knee:
        bottleneck = "MEMORY"
        predicted = peak_memory * ai                    # TB/s * FLOPs/byte = TFLOPS
    else:
        bottleneck = "COMPUTE"
        predicted = peak_compute

    actual = benchmark(operation, size)                 # measured TFLOPS
    efficiency = actual / predicted

    print(f"AI={ai:.1f}, Knee={knee:.1f}")
    print(f"Bottleneck: {bottleneck}")
    print(f"Predicted: {predicted:.0f} TFLOPS")
    print(f"Actual: {actual:.0f} TFLOPS ({efficiency * 100:.0f}%)")

    # Return named fields so callers can look results up by key.
    return {"ai": ai, "bottleneck": bottleneck, "predicted": predicted,
            "actual": actual, "efficiency": efficiency}

# Example results:
measure_roofline(TPU_v4, "matmul_1024x1024", 1024)
# Output: AI=256, Bottleneck=COMPUTE, Predicted=430T, Actual=415T (97%)

measure_roofline(H100, "matmul_1024x1024", 1024)
# Output: AI=256, Bottleneck=COMPUTE, Predicted=1450T, Actual=920T (63%)

measure_roofline(TPU_v4, "softmax", 2048)
# Output: AI=5.2, Bottleneck=MEMORY, Predicted=10T, Actual=9T (90%)

Summary: The Roofline Framework

The roofline model is your design debugging tool:

Next (Day 17): Cache hierarchies and how to reduce memory traffic.