AI Chip Design Day 17: Cache Hierarchies

GPU vs Systolic Cache Design

Architecture	Cache Type	Strategy	Latency
GPU (H100)	L1/L2 hierarchy (no L3)	Caches data dynamically on a miss	80+ cycles to HBM
Systolic (TPU)	Weight SRAM + regs	Pre-load all data before compute	1-2 cycles to local mem
Mobile (A17)	Shared SRAM	Small buffer (512KB), tight loops	3-5 cycles

Systolic Array Weight SRAM

Google TPU v4: 24 MB SRAM for weights

A 256×256 systolic array needs its weights pre-loaded. 24 MB of SRAM holds about 24M INT8 values (1 byte each).

For a large model:
- Model weights: 7 GB (a 7B-parameter model at INT8)
- Working set: 24 MB at a time
- Loading: stream weights in tiles from HBM

Memory loop:
1. Load a 24 MB weight tile from HBM (~12 µs at 2000 GB/s)
2. Compute on the 256×256 array for ~10k operations
3. Repeat for the next weight tile
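The arithmetic behind the memory loop can be checked with a short script. This is a back-of-envelope sketch, not a model of any real TPU: it assumes decimal units (1 MB = 10**6 bytes) and INT8 weights, and the function name `streaming_estimate` is made up for illustration.

```python
import math

# Figures quoted in the notes; decimal units (1 MB = 10**6 bytes) are an assumption.
SRAM_BYTES = 24 * 10**6          # 24 MB weight SRAM
MODEL_BYTES = 7 * 10**9          # 7 GB of INT8 weights
HBM_BYTES_PER_S = 2000 * 10**9   # 2000 GB/s
ARRAY_DIM = 256                  # 256x256 systolic array


def streaming_estimate(model_bytes=MODEL_BYTES, sram_bytes=SRAM_BYTES,
                       hbm_bytes_per_s=HBM_BYTES_PER_S, array_dim=ARRAY_DIM):
    """Return tile count and transfer times for streaming weights through SRAM."""
    return {
        # Number of SRAM-sized tiles needed to cover the whole model once.
        "tiles": math.ceil(model_bytes / sram_bytes),
        # Time to fill the SRAM once from HBM at peak bandwidth.
        "tile_load_s": sram_bytes / hbm_bytes_per_s,
        # Time to stream every weight once at peak bandwidth.
        "full_pass_s": model_bytes / hbm_bytes_per_s,
        # How many full array-sized weight blocks one SRAM tile contains (INT8).
        "array_fills_per_tile": sram_bytes / (array_dim * array_dim),
    }


est = streaming_estimate()
print(f"tiles per pass:        {est['tiles']}")
print(f"load one tile:         {est['tile_load_s'] * 1e6:.1f} us")
print(f"stream whole model:    {est['full_pass_s'] * 1e3:.2f} ms")
print(f"256x256 blocks / tile: {est['array_fills_per_tile']:.0f}")
```

With the quoted figures this gives 292 tiles per pass, about 12 µs per tile load, and about 3.5 ms to stream all 7 GB once at peak HBM bandwidth.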

Design Decision: Size vs Bandwidth

Larger SRAM: Better reuse, but more area, longer wires (latency)
Smaller SRAM: More bandwidth needed, but compact
Sweet spot: 16-32 MB (can hold 1-2 large layers)
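To sanity-check the "1-2 large layers" figure, one can count the bytes in a weight matrix and compare against SRAM capacity. The sketch below uses a hypothetical 4096×4096 INT8 layer, which is not a shape taken from the notes, and reads "MB" as MiB (2**20 bytes). Both are assumptions.

```python
def layer_bytes(rows, cols, bytes_per_weight=1):
    """Size of one dense weight matrix; INT8 (1 byte per weight) by default."""
    return rows * cols * bytes_per_weight


def layers_that_fit(sram_mib, rows, cols, bytes_per_weight=1):
    """How many whole layers of the given shape fit in an SRAM of sram_mib MiB."""
    sram_bytes = sram_mib * 2**20  # assumption: binary megabytes
    return sram_bytes // layer_bytes(rows, cols, bytes_per_weight)


# Hypothetical layer shape, chosen only for illustration.
for sram_mib in (16, 24, 32):
    n = layers_that_fit(sram_mib, 4096, 4096)
    print(f"{sram_mib} MiB SRAM holds {n} layer(s) of 4096x4096 INT8")
```

Under these assumptions, 16 MiB and 24 MiB each hold one such layer and 32 MiB holds two, which is consistent with the range above. Wider layers or 16-bit weights shift the answer quickly.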

GPU L1/L2/L3 Design

H100: Hierarchical caches with coherence

- L1: 256 KB per SM (combined L1/shared memory), 80 GB/s bandwidth
- L2: 50 MB shared, 4 TB/s bandwidth
- L3: Not implemented (coherent GPU memory)
- HBM: 80 GB, 2 TB/s bandwidth

Design challenge: GPU threads do not know in advance what to prefetch.
Solution: Tensor Cores operate on blocks (a warp-level collective operation), so entire blocks can be prefetched.
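Why block-level prefetch helps can be shown with a toy model. The sketch below is plain Python, not CUDA, and does not model how an H100 schedules loads. It multiplies two square matrices block by block and counts how many elements are fetched from "slow" memory; the function name `blocked_matmul` and the load counter are illustrative assumptions.

```python
def blocked_matmul(a, b, block):
    """Compute a @ b for square list-of-lists matrices, one block at a time.

    Returns (result, loads), where loads counts elements copied from the
    full matrices ("slow" memory) into per-block local copies ("fast" memory).
    """
    n = len(a)
    assert n % block == 0, "block size must divide the matrix size"
    c = [[0] * n for _ in range(n)]
    loads = 0
    for i0 in range(0, n, block):
        for j0 in range(0, n, block):
            for k0 in range(0, n, block):
                # Prefetch one block of A and one block of B as a unit.
                a_blk = [row[k0:k0 + block] for row in a[i0:i0 + block]]
                b_blk = [row[j0:j0 + block] for row in b[k0:k0 + block]]
                loads += 2 * block * block
                # All block**3 multiply-accumulates reuse the local copies.
                for i in range(block):
                    for j in range(block):
                        acc = 0
                        for k in range(block):
                            acc += a_blk[i][k] * b_blk[k][j]
                        c[i0 + i][j0 + j] += acc
    return c, loads


n = 8
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i * j + 1 for j in range(n)] for i in range(n)]
c1, loads_unblocked = blocked_matmul(a, b, block=1)  # no reuse: 2*n**3 loads
c4, loads_blocked = blocked_matmul(a, b, block=4)    # 4x4 blocks
print(loads_unblocked, loads_blocked, c1 == c4)
```

In this model slow-memory traffic is 2·n³/B elements for block size B, so a block of side 4 cuts traffic by 4× while producing the same result.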

Power Implications

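Energy per access:
- L1 cache: ~3 pJ
- L2 cache: ~5 pJ
- HBM: ~100 pJ

For 1 billion MACs (one memory access per MAC):
- Every access hits L1: 1B × 3 pJ = 3 mJ
- 50% L1 hits, 50% L2 hits: 1B × (0.5×3 + 0.5×5) pJ = 4 mJ
- Every access goes to HBM: 1B × 100 pJ = 100 mJ ← 25× worse than the 50/50 case, ~33× worse than all-L1

Implication: cache design is power-critical. Systolic pre-loading wins because operands are staged in local SRAM before compute rather than fetched from HBM on demand.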
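The energy comparison is a weighted average over where accesses land. A minimal sketch, assuming one memory access per MAC and the approximate per-access energies quoted above (the function name `total_energy_mj` is illustrative):

```python
# Approximate per-access energies from the notes, in picojoules.
ENERGY_PJ = {"L1": 3.0, "L2": 5.0, "HBM": 100.0}


def total_energy_mj(n_ops, mix):
    """Energy in millijoules for n_ops accesses.

    mix maps a memory level to the fraction of accesses served there;
    the fractions must sum to 1.
    """
    assert abs(sum(mix.values()) - 1.0) < 1e-9, "fractions must sum to 1"
    pj_per_op = sum(ENERGY_PJ[level] * frac for level, frac in mix.items())
    return n_ops * pj_per_op * 1e-9  # pJ -> mJ


N = 1_000_000_000  # 1 billion MACs, one access each (assumption)
all_l1 = total_energy_mj(N, {"L1": 1.0})
half_l2 = total_energy_mj(N, {"L1": 0.5, "L2": 0.5})
all_hbm = total_energy_mj(N, {"HBM": 1.0})
print(f"all L1:   {all_l1:.1f} mJ")
print(f"50/50:    {half_l2:.1f} mJ")
print(f"all HBM:  {all_hbm:.1f} mJ ({all_hbm / half_l2:.0f}x the 50/50 case)")
```

Changing the mix, for example to 90% L1 and 10% HBM, shows how quickly even a small fraction of HBM accesses dominates the total.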

Day 18: High-bandwidth memory (HBM) and memory co-design.
