AI Chip Design Day 16 Enhanced

Why This Matters: The Critical Question

You've designed an AI chip. It can compute 1,000 TFLOPS. But in practice, does it achieve 1,000 TFLOPS on real workloads? Or does it get stuck at 400 TFLOPS?

The roofline model answers this fundamental question: Is your design limited by compute throughput or memory bandwidth?

This distinction changes everything:

Compute-bound: Add more MACs to go faster
Memory-bound: Faster MACs don't help; buy faster memory instead

The Mathematical Foundation

Roofline is based on one simple equation:

Attainable Performance = min( Peak Compute (FLOPS), Peak Memory Bandwidth (Bytes/sec) × Arithmetic Intensity (FLOPS/Byte) )

Let's break each term:

Term	Definition	Example
Peak Compute	Theoretical max FLOPs if all ALUs/MACs fire	430 TFLOPS for TPU v4
Peak Memory BW	Theoretical max Bytes/sec from memory	2 TB/s for TPU v4 HBM
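Arithmetic Intensity (AI)	FLOPs per Byte loaded/stored	Depends on algorithm

Note: 430 TFLOPS and 2 TB/s are the round working figures this lesson uses for TPU v4. Google's published TPU v4 specification lists 275 TFLOPS (BF16) peak and about 1.2 TB/s of HBM bandwidth, which puts the knee at roughly 229 FLOPS/Byte, close to the 215 used here.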
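The roofline equation can be sketched in a few lines of Python. This is a minimal illustration, not a profiling tool: the function name `attainable_tflops` and its parameter names are made up for this example, and the chip figures are the working numbers this lesson uses for TPU v4.

```python
def attainable_tflops(peak_tflops: float, bandwidth_tbps: float, ai: float) -> float:
    """Roofline bound in TFLOPS.

    peak_tflops    -- peak compute in TFLOPS
    bandwidth_tbps -- peak memory bandwidth in TB/s
    ai             -- arithmetic intensity in FLOPs per byte
    """
    # TB/s * FLOPs/byte = TFLOPS, so both arguments of min() share a unit.
    return min(peak_tflops, bandwidth_tbps * ai)


# Working figures from this lesson (assumed, not measured here).
PEAK_TFLOPS = 430.0
BANDWIDTH_TBPS = 2.0

for ai in (1, 50, 215, 500):
    bound = attainable_tflops(PEAK_TFLOPS, BANDWIDTH_TBPS, ai)
    print(f"AI = {ai:>3} FLOPs/byte -> at most {bound:.0f} TFLOPS")
```

Keeping compute in TFLOPS and bandwidth in TB/s means the product of bandwidth and AI is already in TFLOPS, which avoids a common source of unit mistakes.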

Computing Arithmetic Intensity

Example: Matrix Multiply C = A @ B where all matrices are N×N

📊 Calculation for 256×256 Multiply

Total FLOPs: N × N × (2N - 1) ≈ 2N³ = 2×256³ = 33.6 million

Memory Bytes (assuming FP32, counting only the loads of A and B): 2 matrices × N² × 4 bytes = 2×256²×4 = 524,288 bytes (512 KiB)

Arithmetic Intensity: 33,554,432 FLOPs ÷ 524,288 Bytes = 64 FLOPS/Byte

Notice: AI scales with N! Under this byte count AI = 2N³ ÷ 8N² = N/4, so larger matrices have higher AI (better efficiency). Counting the store of C as well would lower it to N/6.

Matrix Size	AI (FLOPS/Byte)	Category
16×16	4	Memory-bound
64×64	16	Memory-bound
256×256	64	Depends on roofline knee
1024×1024	256	Likely compute-bound
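The arithmetic-intensity calculation for a square matrix multiply can be checked with a short script. This is a sketch under the assumptions stated above: 2N³ FLOPs and FP32 elements, with only the loads of A and B counted by default. The function name `matmul_ai` and the `count_output` flag are hypothetical, introduced only for this illustration.

```python
def matmul_ai(n: int, bytes_per_element: int = 4, count_output: bool = False) -> float:
    """Arithmetic intensity (FLOPs/byte) of an N x N by N x N matrix multiply."""
    flops = 2 * n**3                    # one multiply and one add per inner-product term
    matrices = 3 if count_output else 2  # A and B, plus C if its store is counted
    bytes_moved = matrices * n * n * bytes_per_element
    return flops / bytes_moved


for n in (16, 64, 256, 1024):
    inputs_only = matmul_ai(n)
    with_output = matmul_ai(n, count_output=True)
    print(f"{n:>4} x {n:<4}  AI = {inputs_only:6.1f} (inputs only), "
          f"{with_output:6.1f} (including C)")
```

With inputs only, the ratio reduces to N/4 for FP32, so AI grows linearly with the matrix dimension. Which byte count is appropriate depends on what actually crosses the memory interface in a given design.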

The Roofline Graph: Visualizing the Trade-off

TFLOPS
 430 |                         +-----------   <- Compute roof (430 TFLOPS); knee at AI = 215
 400 |                       /
 300 |                 /
 200 |           /                            <- Memory roof (slope = 2 TB/s bandwidth)
 100 |     /
   0 +-----+-----+-----+-----+-----+-----+-->
     0    50    100   150   200   250   300
          Arithmetic Intensity (FLOPS/Byte)

Regions:

• Left of knee (AI < 215): Memory-bound
  - Performance limited by bandwidth
  - Performance = 2 TB/s × AI (a line with slope 2 TB/s)
• Right of knee (AI > 215): Compute-bound
  - Performance = flat 430 TFLOPS
  - Memory is not the bottleneck

Calculating the Knee

The knee is where compute and memory limits meet:

AI_knee = Peak Compute / Peak Bandwidth = 430 TFLOPS / 2 TB/s = 430 × 10¹² / (2 × 10¹² bytes/sec) = 215 FLOPS/Byte

Interpretation: Any operation with AI < 215 is memory-bound on TPU v4. Any operation with AI > 215 can achieve peak compute.
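To see the knee and the two regions numerically, the following sketch sweeps arithmetic intensity and prints a crude text version of the roofline. The helper names `knee_ai` and `roofline_points` are hypothetical, and the chip figures are the lesson's working numbers for TPU v4.

```python
PEAK_TFLOPS = 430.0     # lesson's working figure
BANDWIDTH_TBPS = 2.0    # lesson's working figure


def knee_ai(peak_tflops: float, bandwidth_tbps: float) -> float:
    """Arithmetic intensity at which the memory roof meets the compute roof."""
    return peak_tflops / bandwidth_tbps


def roofline_points(peak_tflops, bandwidth_tbps, ai_values):
    """Return (ai, bound_in_tflops, region) for each arithmetic intensity."""
    knee = knee_ai(peak_tflops, bandwidth_tbps)
    points = []
    for ai in ai_values:
        bound = min(peak_tflops, bandwidth_tbps * ai)
        region = "memory-bound" if ai < knee else "compute-bound"
        points.append((ai, bound, region))
    return points


print(f"Knee at AI = {knee_ai(PEAK_TFLOPS, BANDWIDTH_TBPS):.0f} FLOPs/byte")
for ai, bound, region in roofline_points(PEAK_TFLOPS, BANDWIDTH_TBPS, range(0, 301, 50)):
    bar = "#" * int(bound // 20)   # one '#' per 20 TFLOPS
    print(f"{ai:>4} | {bar:<22} {bound:5.0f} TFLOPS ({region})")
```

The bars grow linearly up to the knee and then stay flat, which is the shape of the roofline graph.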

Real Examples: What's the AI for Common Operations?

Operation	FLOPs	Bytes	AI (FLOPS/B)	On TPU v4
Matrix Multiply (256×256)	33.6M	512 KiB	64	Memory-bound (128 TFLOPS)
Matrix Multiply (1024×1024)	2.1B	8 MiB	256	Compute-bound (430 TFLOPS)
Element-wise ReLU	N	2N bytes	0.5	Memory-bound (memory only)
Layer Norm	5N	4N bytes	1.25	Memory-bound
Softmax (attention)	5N²	N² bytes	5.0	Memory-bound
Conv2D (3×3 kernel)	18HWK²	4(H×W + K²)	≈9	Memory-bound
Batch Matmul (GEMM)	Varies	Varies	Batch-dependent	Depends on batch
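An equivalent way to read the roofline is as a lower bound on runtime: an operation can finish no sooner than the slower of its compute time and its memory time. The sketch below applies that view to the 256×256 multiply. The function name `min_runtime_seconds` is hypothetical, and the chip figures are the lesson's working numbers.

```python
def min_runtime_seconds(flops, bytes_moved, peak_flops_per_s, bandwidth_bytes_per_s):
    """Return (lower_bound, compute_time, memory_time) in seconds."""
    compute_time = flops / peak_flops_per_s
    memory_time = bytes_moved / bandwidth_bytes_per_s
    # Assumes compute and memory transfers overlap perfectly, so the slower one dominates.
    return max(compute_time, memory_time), compute_time, memory_time


n = 256
flops = 2 * n**3                 # 33,554,432 FLOPs
bytes_moved = 2 * n * n * 4      # FP32 inputs A and B only: 524,288 bytes

bound, t_compute, t_memory = min_runtime_seconds(
    flops, bytes_moved, peak_flops_per_s=430e12, bandwidth_bytes_per_s=2e12
)
limiter = "memory" if t_memory > t_compute else "compute"
print(f"compute time: {t_compute * 1e9:.1f} ns, memory time: {t_memory * 1e9:.1f} ns")
print(f"limited by {limiter}; effective rate <= {flops / bound / 1e12:.0f} TFLOPS")
```

Dividing the FLOP count by this time bound gives the same number as min(peak, bandwidth × AI), so the two views are interchangeable.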

Case Study 1: Why TPU Dominates Matrix Math

Scenario: Multiply two 1024×1024 matrices

🔷 TPU v4 (Systolic Array)

Arithmetic Intensity: 256 FLOPS/Byte (> 215 knee)

Predicted Performance: 430 TFLOPS (peak)

Actual Performance: ~415 TFLOPS (97% peak!)

Why so high?: Systolic data reuse reduces real memory traffic

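🔷 NVIDIA H100 (Tensor Core)

Arithmetic Intensity: 256 FLOPS/Byte (< knee of ≈ 483 = 1,450 TFLOPS ÷ 3 TB/s)

Predicted Performance: min(1,450, 3 TB/s × 256) = 768 TFLOPS (peak is 1,450 TFLOPS)

Actual Performance: ~920 TFLOPS (63% peak)

Why lower?: At this size the multiply sits left of the H100 knee, so bandwidth rather than the Tensor Cores sets the ceiling. Cache misses, warp divergence, and kernels not optimized for 1024×1024 widen the gap to peak. The quoted figure is above the 768 TFLOPS estimate, which suggests the measurement moved fewer bytes per FLOP than the FP32 input-only count assumes.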

Conclusion: The TPU achieves a higher percentage of peak because its systolic architecture is optimized for exactly this access pattern.

Case Study 2: Why Attention Layers Bottleneck Transformers

Problem: Self-attention in a 1B-parameter transformer

In transformer architecture, attention is the expensive operation:

🔶 Attention Computation

Sequence Length: L = 2,048 tokens

Attention Matrix (Q @ K^T): FLOPs = L² × d_k = 2048² × 64 = 268M (one head with d_k = 64, counting each multiply-accumulate as one FLOP)

Memory (Q, K, V): Bytes = 3 × L × d × 2 (FP16) = 3 × 2048 × 768 × 2 = 9.4 MB (all heads, d = 768)

Arithmetic Intensity: 268M / 9.4M ≈ 28.4 FLOPS/Byte (< 215 knee!). This is a rough estimate, since it pairs one head's operation count with the bytes for all heads.

On TPU v4: min(430 TFLOPS, 2 TB/s × 28.4) ≈ 57 TFLOPS (13% peak!)

This is why attention scales poorly: softmax and attention create many small operations with low AI. The roofline predicts about 13% utilization, and real measurements confirm this.
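The attention estimate above can be reproduced directly. This sketch simply re-runs the case study's arithmetic with the same inputs (L = 2,048, d_k = 64, d = 768, FP16). It is not a full attention cost model, and the variable names are chosen only for this example.

```python
SEQ_LEN = 2048          # L, tokens
D_K = 64                # per-head dimension used for the operation count
D_MODEL = 768           # model width used for the byte count
BYTES_PER_ELEMENT = 2   # FP16

PEAK_TFLOPS = 430.0     # lesson's working figure for TPU v4
BANDWIDTH_TBPS = 2.0    # lesson's working figure for TPU v4

# Operation count for Q @ K^T as used in the case study.
ops = SEQ_LEN**2 * D_K
# Bytes for Q, K and V as used in the case study.
bytes_moved = 3 * SEQ_LEN * D_MODEL * BYTES_PER_ELEMENT

ai = ops / bytes_moved
bound_tflops = min(PEAK_TFLOPS, BANDWIDTH_TBPS * ai)
utilization = bound_tflops / PEAK_TFLOPS

print(f"ops = {ops / 1e6:.0f}M, bytes = {bytes_moved / 1e6:.1f} MB")
print(f"AI = {ai:.1f} FLOPs/byte")
print(f"bound = {bound_tflops:.0f} TFLOPS ({utilization:.0%} of peak)")
```

Under these assumptions the ratio simplifies to L × d_k / (6 × d), so the estimated AI grows with sequence length and shrinks as the model gets wider relative to the head size.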

Design Implication: What Should You Optimize?

For memory-bound operations (AI < knee):

✅ Increase memory bandwidth (HBM, 3D stacking, NVLink)
✅ Increase data reuse (fewer loads from memory)
❌ Adding more MACs is useless (they'll starve)
❌ Faster clock speed doesn't help

For compute-bound operations (AI > knee):

✅ Add more MACs/ALUs
✅ Increase clock speed
❌ Faster memory doesn't help (already computing faster than memory feeds)
❌ Larger cache doesn't help (computation is the limit)
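These rules of thumb can be tested with a small what-if calculation: scale either the compute roof or the memory roof and see how much the bound moves for a given AI. The function `upgrade_gain` is a hypothetical helper for this illustration, using the lesson's working chip figures.

```python
def roofline_bound(peak_tflops: float, bandwidth_tbps: float, ai: float) -> float:
    """Attainable TFLOPS under the simple roofline model."""
    return min(peak_tflops, bandwidth_tbps * ai)


def upgrade_gain(peak_tflops, bandwidth_tbps, ai, compute_scale=1.0, bandwidth_scale=1.0):
    """Speedup of the roofline bound after scaling compute and/or bandwidth."""
    before = roofline_bound(peak_tflops, bandwidth_tbps, ai)
    after = roofline_bound(peak_tflops * compute_scale, bandwidth_tbps * bandwidth_scale, ai)
    return after / before


PEAK, BW = 430.0, 2.0   # lesson's working figures

for label, ai in (("memory-bound op (AI = 28.4)", 28.4), ("compute-bound op (AI = 256)", 256.0)):
    more_macs = upgrade_gain(PEAK, BW, ai, compute_scale=2.0)
    more_bw = upgrade_gain(PEAK, BW, ai, bandwidth_scale=2.0)
    print(f"{label}: 2x compute -> {more_macs:.2f}x, 2x bandwidth -> {more_bw:.2f}x")
```

One detail the checklist hides: doubling compute for the AI = 256 case gives less than 2×, because the knee moves right to 430 FLOPs/byte and the operation becomes memory-bound on the upgraded chip.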

The Systolic Array Advantage: Moving the Knee

Standard GPU (H100): AI_knee = 1450 TFLOPS / 3 TB/s ≈ 483 FLOPS/B

Systolic Array (TPU): AI_knee = 430 TFLOPS / 2 TB/s = 215 FLOPS/B

But wait: the GPU has more peak TFLOPS, and also a higher knee, so a kernel needs more arithmetic intensity before it can reach that peak. So why is the TPU better?

Answer: Data reuse changes the actual AI:

For matrix multiply, actual memory traffic on:

• H100: O(N²) bytes (limited caching helps, but still high)
• TPU: O(N²/256) bytes (systolic streams data through 256 MACs)

Result: under this simplified traffic model, the TPU's actual AI is up to 256× higher for this operation!

Example (1024×1024 multiply):

• H100: AI = 256 is below its ≈ 483 knee, so the roofline already caps it at 3 TB/s × 256 = 768 TFLOPS, and cache misses occur on top of that
• TPU: roofline predicts compute-bound, and it delivers 97% of peak

Systolic architecture changes the fundamental memory access pattern, improving actual AI beyond what the simple roofline predicts.

Roofline in Practice: Profiling Your Design

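# Sketch: compare a roofline prediction with a measured result
from dataclasses import dataclass


@dataclass
class Chip:
    name: str
    peak_tflops: float        # peak compute, TFLOPS
    memory_bandwidth: float   # peak memory bandwidth, TB/s


def measure_roofline(chip, flops, bytes_moved, actual_tflops):
    """Classify one operation and compare measured with predicted TFLOPS.

    flops and bytes_moved are counted by hand for the operation;
    actual_tflops is the throughput from your own benchmark run.
    """
    ai = flops / bytes_moved                          # FLOPs per byte
    knee = chip.peak_tflops / chip.memory_bandwidth   # TFLOPS / (TB/s) = FLOPs/byte

    if ai < knee:
        bottleneck = "MEMORY"
        predicted = chip.memory_bandwidth * ai        # TB/s x FLOPs/byte = TFLOPS
    else:
        bottleneck = "COMPUTE"
        predicted = chip.peak_tflops

    efficiency = actual_tflops / predicted

    print(f"{chip.name}: AI={ai:.1f}, Knee={knee:.1f}")
    print(f"Bottleneck: {bottleneck}")
    print(f"Predicted: {predicted:.0f} TFLOPS")
    print(f"Actual: {actual_tflops:.0f} TFLOPS ({efficiency * 100:.0f}%)")

    return {"ai": ai, "bottleneck": bottleneck, "predicted": predicted,
            "actual": actual_tflops, "efficiency": efficiency}


# Chip figures used in this lesson
TPU_v4 = Chip("TPU v4", peak_tflops=430, memory_bandwidth=2)
H100 = Chip("H100", peak_tflops=1450, memory_bandwidth=3)

# 1024x1024 matmul: 2N^3 FLOPs, FP32 inputs A and B only
n = 1024
matmul_flops, matmul_bytes = 2 * n**3, 2 * n * n * 4

# Example results:
measure_roofline(TPU_v4, matmul_flops, matmul_bytes, actual_tflops=415)
# Output: AI=256.0, Bottleneck=COMPUTE, Predicted=430T, Actual=415T (97%)

measure_roofline(H100, matmul_flops, matmul_bytes, actual_tflops=920)
# Output: AI=256.0 (< knee 483.3), Bottleneck=MEMORY, Predicted=768T, Actual=920T (120%)
# 920T is 63% of the 1450T peak; a ratio above 100% means the hand-counted
# bytes overstate the traffic behind the measurement.

# Softmax over a 2048x2048 score matrix: 5 FLOPs and 1 byte per element
s = 2048
measure_roofline(TPU_v4, 5 * s * s, s * s, actual_tflops=9)
# Output: AI=5.0, Bottleneck=MEMORY, Predicted=10T, Actual=9T (90%)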

Summary: The Roofline Framework

The roofline model is your design debugging tool:

✅ Understand bottlenecks before building silicon
✅ Know whether to optimize compute or memory
✅ Predict real-world performance from AI
✅ Justify spending on expensive components (HBM costs $$$)
✅ Identify which operations need algorithmic changes (fuse kernels, increase batch size to raise AI)
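Kernel fusion is one of the algorithmic changes mentioned in the last point. A rough way to see why it raises AI is to count bytes for a chain of element-wise operations, with and without writing intermediates back to memory. The sketch below assumes FP32 tensors and one FLOP per element per operation; `elementwise_chain_ai` is a hypothetical helper.

```python
def elementwise_chain_ai(num_ops: int, flops_per_op: int = 1,
                         bytes_per_element: int = 4, fused: bool = False) -> float:
    """Per-element arithmetic intensity of a chain of element-wise ops.

    Unfused: every op loads its input from memory and stores its output.
    Fused:   one load before the first op and one store after the last.
    """
    flops = num_ops * flops_per_op
    passes = 1 if fused else num_ops
    bytes_moved = passes * 2 * bytes_per_element   # one load + one store per pass
    return flops / bytes_moved


# Example chain: bias add -> ReLU -> scale (three element-wise ops)
unfused = elementwise_chain_ai(3)
fused = elementwise_chain_ai(3, fused=True)
print(f"unfused AI = {unfused:.3f} FLOPs/byte")
print(f"fused   AI = {fused:.3f} FLOPs/byte ({fused / unfused:.0f}x higher)")
```

Both values are still far below the knee, so the chain stays memory-bound, but for a memory-bound kernel the attainable performance scales directly with AI, so fusing three operations gives up to a 3× higher bound.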

Next (Day 17): Cache hierarchies and how to reduce memory traffic.

The Roofline Model Deep Dive