Day 17

Cache Hierarchies

L1/L2/L3 caches vs weight SRAM in systolic arrays. How to design fast, power-efficient memory for AI chips.

GPU vs Systolic Cache Design

Architecture | Cache Type | Strategy | Latency
--- | --- | --- | ---
GPU (H100) | L1/L2 hierarchy (no L3) | Caches data dynamically on a miss | 80+ cycles to HBM
Systolic (TPU) | Weight SRAM + regs | Pre-load all data before compute | 1-2 cycles to local mem
Mobile (A17) | Shared SRAM | Small buffer (512 KB), tight loops | 3-5 cycles

Systolic Array Weight SRAM

Google TPU v4: 24 MB SRAM for weights

A 256×256 systolic array needs its weights pre-loaded.
24 MB SRAM = 24M INT8 values (one byte each)

For a large model:
- Model weights: 7 GB (7B-parameter model at INT8)
- Working set: 24 MB at a time
- Loading: stream weights in tiles from HBM

Memory loop:
1. Load a 24 MB weight tile from HBM (~12 µs @ 2000 GB/s)
2. Compute on the 256×256 array for ~10k operations
3. Repeat for the next weight tile
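The tile-streaming arithmetic is easy to check with a few lines of Python. This is a minimal sketch, assuming decimal units (1 MB = 10^6 bytes), a 7 GB weight set, a 24 MB weight SRAM, and 2000 GB/s of HBM bandwidth. The function name `tile_streaming_plan` is hypothetical and used only for illustration.

```python
import math

MB = 10**6  # decimal megabyte (assumption)
GB = 10**9  # decimal gigabyte (assumption)

def tile_streaming_plan(model_bytes, sram_bytes, hbm_bytes_per_s):
    """Return (tile count, seconds to load one tile, seconds to stream all weights)."""
    num_tiles = math.ceil(model_bytes / sram_bytes)   # last tile may be partial
    tile_load_s = sram_bytes / hbm_bytes_per_s        # time to fill the SRAM once
    total_load_s = model_bytes / hbm_bytes_per_s      # time to stream every weight once
    return num_tiles, tile_load_s, total_load_s

num_tiles, tile_load_s, total_load_s = tile_streaming_plan(7 * GB, 24 * MB, 2000 * GB)
print(f"tiles: {num_tiles}")                                  # tiles: 292
print(f"per-tile load: {tile_load_s * 1e6:.1f} us")           # per-tile load: 12.0 us
print(f"full weight stream: {total_load_s * 1e3:.2f} ms")     # full weight stream: 3.50 ms
```

At these figures a single tile load takes microseconds, so the compute phase on each tile has to be long enough (or overlapped with the next load) to keep the array busy.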

Design Decision: Size vs Bandwidth

GPU L1/L2/L3 Design

H100: Hierarchical caches with coherence

L1: 256 KB per SM (combined with shared memory), tens of TB/s aggregate bandwidth across SMs
L2: 50 MB shared, 4 TB/s bandwidth
L3: Not implemented (coherent GPU mem)
HBM: 80 GB, 2 TB/s bandwidth

Design challenge: GPU threads don't know what to prefetch.
Solution: Tensor Cores operate on blocks (warp-level collective), so entire blocks can be prefetched.
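One way to see why block-level fetching helps is to count slow-memory traffic for a tiled matrix multiply. The sketch below is a simplified model, not a description of how the H100 schedules loads. It assumes square n×n matrices, n divisible by the tile size, and that only one block of A and one block of B are resident on chip at a time. `matmul_loads` is a hypothetical helper.

```python
def matmul_loads(n, tile):
    """Elements fetched from slow memory for an n x n matmul using tile x tile blocks.

    Simplified model: each block-level multiply loads one A block and one B block.
    """
    if n % tile != 0:
        raise ValueError("n must be divisible by tile")
    blocks_per_dim = n // tile
    block_multiplies = blocks_per_dim ** 3      # (i, j, k) block triples
    return block_multiplies * 2 * tile * tile   # one A block + one B block per triple

naive = matmul_loads(1024, 1)     # element-at-a-time: 2 * n^3 loads
blocked = matmul_loads(1024, 16)  # 16 x 16 blocks
print(naive)              # 2147483648
print(blocked)            # 134217728
print(naive // blocked)   # 16
```

In this model traffic falls in proportion to the tile edge, which is why fetching whole blocks and reusing them from on-chip memory matters more than guessing individual addresses.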

Power Implications

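Energy per operation:
- Access L1 cache: ~3 pJ
- Access L2 cache: ~5 pJ
- Access HBM: ~100 pJ

For 1 billion MACs (one memory access each):
- Perfect L1 hit rate: 1B × 3 pJ = 3 mJ
- 50% L1 hits, 50% L2 hits: 1B × (0.5×3 + 0.5×5) pJ = 4 mJ
- Every access goes to HBM: 1B × 100 pJ = 100 mJ ← 25× worse than the 50/50 case!

Implication: Cache design is power-critical. Systolic pre-loading wins because each weight is fetched from HBM once per tile and then reused from local SRAM.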
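The energy figures can be reproduced with a small calculator. This is a sketch under the per-access energies quoted in this section (~3 pJ for L1, ~5 pJ for L2, ~100 pJ for HBM) and the assumption of one memory access per MAC. The names `ENERGY_PJ` and `access_energy_mj` are hypothetical.

```python
# Per-access energy in picojoules, taken from the figures in this section.
ENERGY_PJ = {"L1": 3.0, "L2": 5.0, "HBM": 100.0}

def access_energy_mj(num_accesses, mix):
    """Total access energy in millijoules.

    mix maps a memory level to the fraction of accesses it serves; fractions must sum to 1.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("access fractions must sum to 1")
    pj_per_access = sum(ENERGY_PJ[level] * frac for level, frac in mix.items())
    return num_accesses * pj_per_access * 1e-12 * 1e3  # pJ -> J -> mJ

macs = 1_000_000_000
print(f"{access_energy_mj(macs, {'L1': 1.0}):.1f} mJ")             # 3.0 mJ
print(f"{access_energy_mj(macs, {'L1': 0.5, 'L2': 0.5}):.1f} mJ")  # 4.0 mJ
print(f"{access_energy_mj(macs, {'HBM': 1.0}):.1f} mJ")            # 100.0 mJ
```

Changing the mix shows how quickly even a small HBM fraction dominates: at 100 pJ per access, 10% of traffic going to HBM already costs more than all the on-chip accesses combined.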

Day 18: High-bandwidth memory (HBM) and memory co-design.