AI Chip Day 11 Enhanced — Cache Hierarchy & Data Movement

1. Memory Hierarchy for AI Chips

Bandwidth Pyramid (typical AI accelerator):

Level 1: Register File / Local Memory. Bandwidth: 10-100 TB/s. Size: 1-10 KB. Latency: 1-2 cycles.
Level 2: On-Chip SRAM. Bandwidth: 1-5 TB/s. Size: 1-16 MB. Latency: 10-50 cycles.
Level 3: Off-Chip DRAM. Bandwidth: 100-900 GB/s. Size: 8-80 GB. Latency: 100-300 cycles.

Design Principle: Keep weights in SRAM, avoid DRAM traffic
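To get a feel for the gap between levels, the sketch below estimates how long it takes to move a tensor at each level's bandwidth range. The bandwidth figures are the ones from the pyramid; the 4 MB tensor size and the function names are assumptions for illustration. It compares bandwidth only and ignores capacity and latency, so the register-file row only makes sense for data streamed through in small pieces.

```python
# Bandwidth ranges copied from the pyramid above, in bytes per second.
LEVELS = {
    "register file": (10e12, 100e12),
    "on-chip SRAM": (1e12, 5e12),
    "off-chip DRAM": (100e9, 900e9),
}


def transfer_time_us(num_bytes, bandwidth_bytes_per_s):
    """Time to move num_bytes at a sustained bandwidth, in microseconds."""
    return num_bytes / bandwidth_bytes_per_s * 1e6


def time_range_us(num_bytes, level):
    """(best, worst) transfer time in microseconds for one pyramid level."""
    slow, fast = LEVELS[level]
    return transfer_time_us(num_bytes, fast), transfer_time_us(num_bytes, slow)


if __name__ == "__main__":
    tensor_bytes = 4_000_000  # hypothetical 4 MB tensor (assumption)
    for level in LEVELS:
        best, worst = time_range_us(tensor_bytes, level)
        print(f"{level:>14}: {best:.2f} to {worst:.2f} us")
```

Under these numbers the same tensor takes roughly ten times longer to fetch from DRAM than from SRAM, which is the motivation for the design principle above.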

2. Weight Caching Strategy

Key insight: Weights don't change during inference, so a model that fits can stay resident on chip for the whole run.

ResNet-50: ~25.6M params, ~102MB (FP32). Fits in 500MB SRAM.
BERT-Large: 340M params, ~1.36GB (FP32). Needs 2-3 chips at that SRAM size, or quantization.
GPT-3: 175B params, 700GB (FP32). Distributed inference needed.

Solution: Quantization (INT8 = 25% of FP32 size). ResNet-50 shrinks to about 26MB and BERT-Large to about 340MB.
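The sizing arithmetic above can be sketched as a small helper. This is a minimal illustration, not a sizing tool: the 500 MB SRAM budget is the figure used in the notes, the function names are hypothetical, and decimal units (1 MB = 10^6 bytes) are assumed. Activations and any quantization metadata are ignored.

```python
# Bytes per parameter for common inference precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_bytes(num_params, dtype):
    """Storage needed for the weights alone, in bytes."""
    return num_params * BYTES_PER_PARAM[dtype]


def chips_needed(num_params, dtype, sram_bytes_per_chip):
    """Minimum number of chips so that all weights stay in SRAM."""
    total = weight_bytes(num_params, dtype)
    return (total + sram_bytes_per_chip - 1) // sram_bytes_per_chip  # integer ceil


if __name__ == "__main__":
    sram = 500_000_000  # 500 MB per chip, the budget used in the notes
    models = {"BERT-Large": 340_000_000, "GPT-3": 175_000_000_000}
    for name, params in models.items():
        for dtype in ("fp32", "int8"):
            size_gb = weight_bytes(params, dtype) / 1e9
            chips = chips_needed(params, dtype, sram)
            print(f"{name} {dtype}: {size_gb:.2f} GB, {chips} chip(s)")
```

For the 340M-parameter model this gives 3 chips at FP32 and 1 chip at INT8, which is why quantization is usually the first thing to try.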

3. Data Reuse & Arithmetic Intensity

Roofline Model (Day 16 covers fully): Performance = min(compute_peak, bandwidth × intensity)

Arithmetic Intensity = MACs / Bytes Transferred

Example: Convolution layer
MACs: 100M
Weights: 1M params (4MB @ FP32)
Activations: 10MB
Total memory: 14MB
Intensity = 100M / 14M ≈ 7 ops/byte

At 100 GB/s bandwidth: Time = 14MB / 100 GB/s = 140 µs
Compute time = 100M ops / 1000 GOPS = 100 µs
→ Memory bottleneck (140 > 100)

Solution: Increase intensity through tiling, use SRAM cache
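The convolution example can be reproduced with a few lines of Python. This is a sketch under the same simplifications as the notes: one MAC counts as one op, every byte crosses the DRAM interface exactly once, and compute and memory time are compared rather than overlapped. The function name and the returned keys are hypothetical.

```python
def roofline(macs, bytes_moved, peak_ops_per_s, bandwidth_bytes_per_s):
    """Classify a layer as memory- or compute-bound with the roofline model."""
    intensity = macs / bytes_moved  # ops per byte
    attainable = min(peak_ops_per_s, bandwidth_bytes_per_s * intensity)
    memory_time_s = bytes_moved / bandwidth_bytes_per_s
    compute_time_s = macs / peak_ops_per_s
    bound = "memory" if memory_time_s > compute_time_s else "compute"
    return {
        "intensity": intensity,
        "attainable_ops_per_s": attainable,
        "memory_time_us": memory_time_s * 1e6,
        "compute_time_us": compute_time_s * 1e6,
        "bound": bound,
    }


if __name__ == "__main__":
    # Numbers from the convolution example: 100M MACs, 14 MB moved,
    # 1000 GOPS peak compute, 100 GB/s DRAM bandwidth.
    result = roofline(100e6, 14e6, 1000e9, 100e9)
    for key, value in result.items():
        print(key, value)
```

Halving the bytes moved (for example, by keeping the weights and part of the activations in SRAM) doubles the intensity and, in this example, is enough to make the layer compute-bound.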

4. Tiling & Data Reuse

Strategy: Divide large matrix multiply into smaller tiles that fit in SRAM

Load tile of weights into SRAM (once)
Stream activations through many times
Compute locally without DRAM traffic

Example: A 256×256 systolic array holds one 256×256 weight tile; each weight is loaded once and reused for every activation row streamed through it.
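The three steps above can be sketched in plain Python as a weight-stationary tiled matrix multiply. This is an illustration of the loop order only, not a model of any particular chip: the function names are hypothetical, the "SRAM" is an ordinary list, and the counter simply records how many weight elements are fetched.

```python
def tiled_matmul(a, w, tile):
    """Compute a @ w, loading each weight tile once and streaming all rows of a.

    a is M x K, w is K x N (lists of lists). Returns (result, weight_loads),
    where weight_loads counts weight elements fetched from "DRAM".
    """
    m, k, n = len(a), len(w), len(w[0])
    c = [[0] * n for _ in range(m)]
    weight_loads = 0
    for k0 in range(0, k, tile):
        for n0 in range(0, n, tile):
            # Load one weight tile into the on-chip buffer, once.
            w_tile = [row[n0:n0 + tile] for row in w[k0:k0 + tile]]
            weight_loads += sum(len(row) for row in w_tile)
            # Stream every activation row through the resident tile.
            for i in range(m):
                a_row = a[i][k0:k0 + tile]
                for jj in range(len(w_tile[0])):
                    acc = 0
                    for kk in range(len(w_tile)):
                        acc += a_row[kk] * w_tile[kk][jj]
                    c[i][n0 + jj] += acc  # accumulate partial sums locally
    return c, weight_loads


def naive_weight_fetches(m, k, n):
    """Weight fetches if every MAC re-reads its weight from DRAM (no reuse)."""
    return m * k * n


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    w = [[1, 0], [0, 1], [1, 1]]
    c, loads = tiled_matmul(a, w, tile=2)
    print(c, loads, naive_weight_fetches(2, 3, 2))
```

With this loop order the weight traffic equals the weight count regardless of how many activation rows are processed, so reuse grows with batch size.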

5. Memory Bandwidth Optimization

Coalesced access: Sequential, aligned reads that use every byte of each memory transaction (e.g. 32-byte cache lines)
Prefetching: Load next tile while computing current
Double buffering: Ping-pong between two SRAM banks
Compression: Store weights compressed, decompress on-chip
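The benefit of prefetching with double buffering can be estimated with a simple timing model. The sketch below assumes lockstep operation: while tile i is computed from one SRAM bank, tile i+1 is loaded into the other, and the next step starts only when both finish. The per-tile times and function names are hypothetical.

```python
def serial_time(load_times, compute_times):
    """Total time when each tile is loaded, then computed, with no overlap."""
    return sum(load_times) + sum(compute_times)


def double_buffered_time(load_times, compute_times):
    """Total time when the load of tile i+1 overlaps the compute of tile i."""
    n = len(load_times)
    total = load_times[0]  # the first load cannot be hidden
    for i in range(n):
        next_load = load_times[i + 1] if i + 1 < n else 0.0
        # Each step lasts as long as the slower of compute and prefetch.
        total += max(compute_times[i], next_load)
    return total


if __name__ == "__main__":
    loads = [10.0, 10.0, 10.0, 10.0]      # microseconds per tile (assumed)
    computes = [12.0, 12.0, 12.0, 12.0]   # microseconds per tile (assumed)
    print("serial:", serial_time(loads, computes))
    print("double-buffered:", double_buffered_time(loads, computes))
```

When compute per tile is at least as long as the load, every load after the first is fully hidden; when the load is longer, the pipeline runs at memory speed and only tiling or compression helps further.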

6. Real-World Memory Examples

Google TPU v4:

32 GB of in-package HBM2 (about 1.2 TB/s)
Typical batch processing: weights + activations fit in HBM
Rarely need to spill to host memory

NVIDIA H100:

80 GB HBM3 at about 3.35 TB/s (SXM version)
Multiple GPUs connected via NVLink (900 GB/s)
Distributed training spreads model across multiple chips

7. Cache Design Checklist

✅ Estimate SRAM needed: Model size (quantized) + batch activations
✅ Calculate arithmetic intensity: MACs / memory bytes
✅ Size SRAM buffer: Typically 10-20% of model size
✅ Plan data movement: Which data moves on/off chip each cycle
✅ Implement tiling: Divide compute into SRAM-resident tiles
✅ Prefetch next tile: Overlap compute and memory access
✅ Measure bandwidth utilization: Should be 70%+ of peak
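The last checklist item reduces to one division. A minimal sketch, assuming the bytes moved and the elapsed time come from a profiler or hardware counters; the measurement values and function names below are made up for illustration.

```python
def bandwidth_utilization(bytes_moved, elapsed_s, peak_bytes_per_s):
    """Fraction of peak memory bandwidth actually achieved."""
    achieved = bytes_moved / elapsed_s
    return achieved / peak_bytes_per_s


def meets_target(utilization, target=0.70):
    """True when utilization reaches the 70%+ guideline from the checklist."""
    return utilization >= target


if __name__ == "__main__":
    # Hypothetical measurement: 14 MB moved in 180 microseconds
    # on a memory system with 100 GB/s peak bandwidth.
    u = bandwidth_utilization(14e6, 180e-6, 100e9)
    print(f"utilization: {u:.1%}, meets target: {meets_target(u)}")
```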

Next (Day 12): Power and thermal management.