1. Memory Hierarchy for AI Chips
Bandwidth Pyramid (typical AI accelerator):
Level 1: Register File / Local Memory
Bandwidth: 10-100 TB/s
Size: 1-10 KB
Latency: 1-2 cycles
Level 2: On-Chip SRAM
Bandwidth: 1-5 TB/s
Size: 1-16 MB
Latency: 10-50 cycles
Level 3: Off-Chip DRAM
Bandwidth: 100-900 GB/s
Size: 8-80 GB
Latency: 100-300 cycles
Design Principle: Keep weights in SRAM, avoid DRAM traffic
2. Weight Caching Strategy
Key insight: Weights don't change during inference. Cache permanently on chip.
- ResNet-50: 100M params, 400MB (FP32). Fits in 500MB SRAM.
- BERT: 340M params, 1.3GB (FP32). Need 2-3 chips or quantize.
- GPT-3: 175B params, 700GB. Distributed inference needed.
Solution: Quantization (INT8 = 25% size). ResNet-50 fits in 100MB SRAM.
3. Data Reuse & Arithmetic Intensity
Roofline Model (Day 16 covers fully): Performance = min(compute_peak, bandwidth × intensity)
Arithmetic Intensity = MACs / Bytes Transferred
Example: Convolution layer
MACs: 100M
Weights: 1M params (4MB @ FP32)
Activations: 10MB
Total memory: 14MB
Intensity = 100M / 14M ≈ 7 ops/byte
At 100 GB/s bandwidth:
Time = 14MB / 100 GB/s = 140 µs
Compute time = 100M ops / 1000 GOPS = 100 µs
→ Memory bottleneck (140 > 100)
Solution: Increase intensity through tiling, use SRAM cache
4. Tiling & Data Reuse
Strategy: Divide large matrix multiply into smaller tiles that fit in SRAM
- Load tile of weights into SRAM (once)
- Stream activations through many times
- Compute locally without DRAM traffic
Example: 256×256 systolic array processes 256×256 tile with full reuse
5. Memory Bandwidth Optimization
- Coalesced access: Sequential memory reads (32-byte cache lines)
- Prefetching: Load next tile while computing current
- Double buffering: Ping-pong between two SRAM banks
- Compression: Store weights compressed, decompress on-chip
6. Real-World Memory Examples
Google TPU v4:
- 8GB on-chip HBM3 (1.2 TB/s)
- Typical batch processing: weights + activations fit in HBM
- Rarely need external DRAM
NVIDIA H100:
- 141 GB/s HBM3 bandwidth
- Multiple GPUs connected via NVLink (900 GB/s)
- Distributed training spreads model across multiple chips
7. Cache Design Checklist
- ✅ Estimate SRAM needed: Model size (quantized) + batch activations
- ✅ Calculate arithmetic intensity: MACs / memory bytes
- ✅ Size SRAM buffer: Typical 10-20% of model size
- ✅ Plan data movement: Which data moves on/off chip each cycle
- ✅ Implement tiling: Divide compute into SRAM-resident tiles
- ✅ Prefetch next tile: Overlap compute and memory access
- ✅ Measure bandwidth utilization: Should be 70%+ of peak
Next (Day 12): Power and thermal management.