GPU vs Systolic Cache Design
| Architecture | Cache Type | Strategy | Latency |
|---|---|---|---|
| GPU (H100) | L1/L2 hierarchy (no L3) | Caches data dynamically on a miss | Hundreds of cycles to HBM |
| Systolic (TPU) | Weight SRAM + regs | Pre-load all data before compute | 1-2 cycles to local mem |
| Mobile (A17) | Shared SRAM | Small buffer (512KB), tight loops | 3-5 cycles |
Systolic Array Weight SRAM
Google TPU-style example (illustrative figures, not an exact TPU v4 spec): 24 MB SRAM for weights
256×256 systolic array needs weights pre-loaded.
24 MB SRAM = 24M INT8 values (1 byte each), or 6M FP32 values
For large model:
- Model weights: 7 GB (7B-parameter model at INT8)
- Working set: 24 MB at a time
- Loading: Stream weights in tiles from HBM
Memory loop:
1. Load 24 MB weight tile from HBM (~12 µs @ 2000 GB/s)
2. Compute on 256×256 for ~10k operations
3. Repeat for next weight tile
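The loop above can be checked with a few lines of arithmetic. This is a minimal sketch in Python, assuming decimal units (1 MB = 10^6 bytes), a 7 GB weight footprint, a 24 MB SRAM tile, and 2000 GB/s of HBM bandwidth as quoted in these notes. `stream_plan` is a hypothetical helper name, and the model ignores compute time and any overlap between loading and computing.

```python
import math

MB = 10**6  # decimal megabyte, an assumption of this sketch
GB = 10**9  # decimal gigabyte


def stream_plan(model_bytes, sram_bytes, hbm_bytes_per_s):
    """Return (tile_count, seconds_per_tile, total_seconds) for one pass over the weights."""
    tiles = math.ceil(model_bytes / sram_bytes)   # SRAM-sized tiles needed
    per_tile_s = sram_bytes / hbm_bytes_per_s     # time to fill the SRAM once
    total_s = model_bytes / hbm_bytes_per_s       # time to stream every weight once
    return tiles, per_tile_s, total_s


tiles, per_tile_s, total_s = stream_plan(7 * GB, 24 * MB, 2000 * GB)
print(f"tiles: {tiles}")
print(f"load time per tile: {per_tile_s * 1e6:.1f} us")
print(f"one full pass over the weights: {total_s * 1e3:.2f} ms")
```

Under these assumptions a full tile load takes about 12 µs, and one pass over all the weights needs 292 tiles and about 3.5 ms of pure transfer time.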
Design Decision: Size vs Bandwidth
- Larger SRAM: Better reuse, but more area, longer wires (latency)
- Smaller SRAM: More bandwidth needed, but compact
- Sweet spot: 16-32 MB (can hold 1-2 large layers)
GPU L1/L2/L3 Design
H100: Hierarchical caches with coherence
L1: 256 KB per SM (combined L1/shared memory), tens of TB/s aggregate bandwidth across all SMs
L2: 50 MB shared, 4 TB/s bandwidth
L3: None (L2 is the last-level cache in front of HBM)
HBM: 80 GB, 2 TB/s bandwidth
Design challenge: GPU threads don't know what to prefetch.
Solution: Tensor Cores operate on whole blocks (a warp-level collective operation), so software can prefetch entire blocks.
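The block-granular idea can be illustrated on the CPU. The following is a plain-Python sketch, not CUDA and not the Tensor Core API: `blocked_matmul` and `load_tile` are hypothetical names, and `load_tile` merely stands in for fetching a whole tile into fast local memory. It shows that once data moves in blocks, the fetch pattern is known in advance and each fetched tile is reused for many multiply-accumulates.

```python
def blocked_matmul(a, b, block):
    """Multiply square matrices a @ b tile by tile.

    Returns (c, block_loads), where block_loads counts whole-tile fetches.
    """
    n = len(a)
    assert n % block == 0, "sketch assumes the block size divides n"
    c = [[0] * n for _ in range(n)]
    block_loads = 0

    def load_tile(m, row0, col0):
        # Stand-in for fetching one block x block tile into local memory.
        return [row[col0:col0 + block] for row in m[row0:row0 + block]]

    for i0 in range(0, n, block):
        for j0 in range(0, n, block):
            for k0 in range(0, n, block):
                a_tile = load_tile(a, i0, k0)
                b_tile = load_tile(b, k0, j0)
                block_loads += 2
                # All block**3 MACs below reuse the two tiles just loaded.
                for i in range(block):
                    for j in range(block):
                        acc = 0
                        for k in range(block):
                            acc += a_tile[i][k] * b_tile[k][j]
                        c[i0 + i][j0 + j] += acc
    return c, block_loads


n = 8
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i * j for j in range(n)] for i in range(n)]
c, loads = blocked_matmul(a, b, block=4)
reference = [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
print("matches reference:", c == reference)
print("tile loads:", loads)
```

With `n = 8` and `block = 4` the sketch performs 16 tile loads of 16 elements each (256 element fetches), whereas `block = 1` would need 1024 single-element fetches for the same result.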
Power Implications
Energy per operation:
- Access L1 cache: ~3 pJ
- Access L2 cache: ~5 pJ
- Access HBM: ~100 pJ
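For 1 billion MACs (assuming one memory access per MAC):
- All L1 hits: 1B × 3 pJ = 3 mJ
- 50% L1 hits, 50% L2 hits: 1B × (0.5×3 + 0.5×5) pJ = 4 mJ
- Every access goes to HBM: 1B × 100 pJ = 100 mJ ← 25× worse than the 50/50 case (~33× worse than all-L1)!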
Implication: Cache design is power-critical. Systolic pre-loading wins because each weight is fetched from HBM once and then reused from local SRAM.
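The energy arithmetic above can be reproduced with a small model. This is a minimal sketch in Python, assuming one memory access per MAC and the per-access energies quoted in these notes. `access_energy_joules` and the `mix` dictionary are illustrative names, not part of any library.

```python
PJ = 1e-12  # joules per picojoule

# Per-access energies in pJ, taken from the figures quoted in these notes.
ENERGY_PJ = {"L1": 3.0, "L2": 5.0, "HBM": 100.0}


def access_energy_joules(num_accesses, mix):
    """Total energy when `mix` maps memory level -> fraction of accesses served there."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9, "fractions must sum to 1"
    per_access_pj = sum(ENERGY_PJ[level] * frac for level, frac in mix.items())
    return num_accesses * per_access_pj * PJ


n = 10**9  # one access per MAC is a simplifying assumption
all_l1 = access_energy_joules(n, {"L1": 1.0})
half_l1_half_l2 = access_energy_joules(n, {"L1": 0.5, "L2": 0.5})
all_hbm = access_energy_joules(n, {"HBM": 1.0})

for name, joules in [("all L1", all_l1),
                     ("50% L1 / 50% L2", half_l1_half_l2),
                     ("all HBM", all_hbm)]:
    print(f"{name}: {joules * 1e3:.1f} mJ")
print(f"all HBM vs all L1: {all_hbm / all_l1:.1f}x")
print(f"all HBM vs 50/50 mix: {all_hbm / half_l1_half_l2:.1f}x")
```

Changing the fractions in `mix` shows how quickly even a small share of HBM accesses dominates the total: 5% HBM traffic on top of 95% L1 hits already costs more than the 50/50 L1/L2 case.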
Day 18: High-bandwidth memory (HBM) and memory co-design.