Cache Hierarchy & Data Movement

This article covers memory design for AI chips: on-chip SRAM, the bandwidth hierarchy, weight reuse patterns, and data movement optimization.

By EcrioniX · Published June 13, 2026 · ~3500 words · 10 min read

1. Memory Hierarchy for AI Chips

Bandwidth Pyramid (typical AI accelerator):

Level 1: Register File / Local Memory
  Bandwidth: 10-100 TB/s
  Size: 1-10 KB
  Latency: 1-2 cycles

Level 2: On-Chip SRAM
  Bandwidth: 1-5 TB/s
  Size: 1-16 MB
  Latency: 10-50 cycles

Level 3: Off-Chip DRAM
  Bandwidth: 100-900 GB/s
  Size: 8-80 GB
  Latency: 100-300 cycles

Design Principle: Keep weights in SRAM and avoid DRAM traffic.
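To get a feel for the gap between levels, the sketch below estimates how long it takes to move a payload at each level's bandwidth. The bandwidth ranges come from the pyramid above; the names `LEVELS` and `transfer_time_us` are illustrative, and the model ignores latency and capacity limits, so treat the numbers as rough bounds only.

```python
# Bandwidth ranges in bytes/second, taken from the pyramid above (low, high).
LEVELS = {
    "register_file": (10e12, 100e12),
    "on_chip_sram": (1e12, 5e12),
    "off_chip_dram": (100e9, 900e9),
}


def transfer_time_us(num_bytes, level):
    """Return (best_case, worst_case) transfer time in microseconds.

    Simplifying assumption: time = bytes / bandwidth, with no latency term
    and no check that the payload actually fits at that level.
    """
    low_bw, high_bw = LEVELS[level]
    best = num_bytes / high_bw * 1e6
    worst = num_bytes / low_bw * 1e6
    return best, worst


if __name__ == "__main__":
    payload = 4e6  # 4 MB, e.g. 1M FP32 weights
    for name in ("on_chip_sram", "off_chip_dram"):
        best, worst = transfer_time_us(payload, name)
        print(f"{name}: {best:.2f} to {worst:.2f} us")
```

Under these assumptions, the same 4 MB moves roughly an order of magnitude faster from SRAM than from DRAM, which is the motivation for the design principle above.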

2. Weight Caching Strategy

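Key insight: Weights don't change during inference, so they can be cached on chip permanently.

Solution: Quantization. INT8 weights take 25% of the FP32 size. ResNet-50 (roughly 25.6M parameters) needs about 100 MB at FP32 but only about 26 MB at INT8, so it fits comfortably within 100 MB of SRAM.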
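The size arithmetic behind quantization can be sketched in a few lines. The parameter count for ResNet-50 (about 25.6M) is an approximation, and the function names and the `BYTES_PER_WEIGHT` table are illustrative rather than part of any library.

```python
# Storage width per weight for a few common formats.
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0}

# Approximate parameter count for ResNet-50 (assumption for illustration).
RESNET50_PARAMS = 25_600_000


def weight_footprint_mb(num_params, dtype):
    """Return weight storage in MB, using 1 MB = 10**6 bytes."""
    return num_params * BYTES_PER_WEIGHT[dtype] / 1e6


def fits_in_sram(num_params, dtype, sram_mb):
    """Check whether all weights fit in an SRAM budget of sram_mb megabytes.

    Ignores space needed for activations and any alignment overhead.
    """
    return weight_footprint_mb(num_params, dtype) <= sram_mb


if __name__ == "__main__":
    for dtype in ("fp32", "int8"):
        size = weight_footprint_mb(RESNET50_PARAMS, dtype)
        ok = fits_in_sram(RESNET50_PARAMS, dtype, sram_mb=100)
        print(f"{dtype}: {size:.1f} MB, fits in 100 MB SRAM: {ok}")
```

The check deliberately leaves out activation buffers, so a real design would need some headroom beyond the weight footprint.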

3. Data Reuse & Arithmetic Intensity

Roofline Model (Day 16 covers fully): Performance = min(compute_peak, bandwidth × intensity)

Arithmetic Intensity = MACs / Bytes Transferred

Example: Convolution layer
  MACs: 100M
  Weights: 1M params (4 MB @ FP32)
  Activations: 10 MB
  Total memory: 14 MB
  Intensity = 100M / 14M ≈ 7 ops/byte

At 100 GB/s bandwidth:
  Memory time = 14 MB / 100 GB/s = 140 µs
  Compute time = 100M ops / 1000 GOPS = 100 µs
  → Memory bottleneck (140 > 100)

Solution: Increase intensity through tiling, and use an SRAM cache.
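The worked example can be reproduced with a short script. This is a minimal sketch of the roofline formula as stated above, counting one MAC as one op as the example does; the function name `roofline` and its return keys are illustrative.

```python
def roofline(macs, bytes_moved, peak_ops_per_s, bandwidth_bytes_per_s):
    """Apply Performance = min(compute_peak, bandwidth * intensity).

    Treats one MAC as one op, matching the example in the text.
    """
    intensity = macs / bytes_moved  # ops per byte
    attainable = min(peak_ops_per_s, bandwidth_bytes_per_s * intensity)
    memory_time_s = bytes_moved / bandwidth_bytes_per_s
    compute_time_s = macs / peak_ops_per_s
    bound = "memory" if memory_time_s > compute_time_s else "compute"
    return {
        "intensity": intensity,
        "attainable_ops_per_s": attainable,
        "memory_time_us": memory_time_s * 1e6,
        "compute_time_us": compute_time_s * 1e6,
        "bound": bound,
    }


if __name__ == "__main__":
    # Convolution layer from the example: 100M MACs, 14 MB moved,
    # 1000 GOPS peak compute, 100 GB/s memory bandwidth.
    result = roofline(100e6, 14e6, 1000e9, 100e9)
    for key, value in result.items():
        print(key, value)
```

For the layer above this reports an intensity of about 7.1 ops/byte, a memory time of 140 µs against a compute time of 100 µs, and a memory-bound verdict. The crossover sits at peak / bandwidth = 10 ops/byte for these numbers, so any change that lifts intensity past 10 would make the layer compute-bound.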

4. Tiling & Data Reuse

Strategy: Divide a large matrix multiply into smaller tiles that fit in SRAM.

Example: A 256×256 systolic array processes a 256×256 tile with full reuse.
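A tiled matrix multiply can be sketched in plain Python to show the loop structure. The traffic estimate in `dram_elements_loaded` is a deliberately simplified model: it assumes square n×n matrices, n divisible by the tile size, one A tile and one B tile fetched per block step, and no traffic for the output. Both function names are illustrative.

```python
def matmul_tiled(a, b, tile):
    """Multiply a (n x k) by b (k x m), working on tile x tile blocks."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # One block step: these A, B, and C blocks are the working
                # set that would need to fit in SRAM together.
                for i in range(i0, min(i0 + tile, n)):
                    for kk in range(k0, min(k0 + tile, k)):
                        a_ik = a[i][kk]
                        for j in range(j0, min(j0 + tile, m)):
                            c[i][j] += a_ik * b[kk][j]
    return c


def dram_elements_loaded(n, tile):
    """Estimate A and B elements fetched from DRAM for an n x n multiply.

    Simplified model: (n / tile)**3 block steps, each loading one A tile
    and one B tile, which totals 2 * n**3 / tile elements.
    """
    blocks = n // tile
    return 2 * blocks**3 * tile**2


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(matmul_tiled(a, b, tile=1))
    print(dram_elements_loaded(1024, 1) // dram_elements_loaded(1024, 256))
```

In this model, traffic falls in proportion to the tile size, so a 256×256 tile cuts operand fetches by a factor of 256 compared with no tiling. That is one way to read the "full reuse" claim for a 256×256 array.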

5. Memory Bandwidth Optimization

6. Real-World Memory Examples

Google TPU v4:

NVIDIA H100:

7. Cache Design Checklist

Next (Day 12): Power and thermal management.