Memory Bandwidth Hierarchy
Register file: ~30 KB, ~100 TB/s bandwidth (local to ALU)
SRAM (on-chip): 1-100 MB, 5-50 TB/s
HBM3 (stacked): 8-80 GB, 2-5 TB/s
DDR5 (CPU memory): 16-256 GB, ~100 GB/s
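To make the gap between tiers concrete, here is a minimal sketch that estimates how long it would take to stream a fixed payload at each tier's bandwidth. It uses the low end of each range listed above; the 1 GB payload, the dictionary, and the function name are illustrative assumptions, and capacity limits are ignored (a register file obviously cannot hold 1 GB), so treat it purely as a bandwidth comparison.

```python
# Bandwidth per tier in bytes/s, taken from the low end of the ranges above.
TIER_BANDWIDTH = {
    "register file": 100e12,
    "on-chip SRAM": 5e12,
    "HBM3": 2e12,
    "DDR5": 100e9,
}


def stream_time_us(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Ideal time in microseconds to move num_bytes at the given bandwidth."""
    return num_bytes / bandwidth_bytes_per_s * 1e6


if __name__ == "__main__":
    payload = 1e9  # 1 GB, an arbitrary illustrative size
    for tier, bandwidth in TIER_BANDWIDTH.items():
        print(f"{tier:>14}: {stream_time_us(payload, bandwidth):10.1f} us")
```

Under these assumptions the same payload takes about 20 times longer to stream from DDR5 than from HBM3, which is the ratio that matters for memory-bound kernels.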
| Tech | Capacity | Bandwidth | Latency | Power | Use |
|---|---|---|---|---|---|
| HBM3 | 8-80 GB | 2-5 TB/s | 50-100 ns | 5-10W | GPU/TPU |
| GDDR6X | 8-24 GB | 500-1000 GB/s | 100-200 ns | 8-15W | Gaming GPU |
| LPDDR5X | 2-8 GB | 100-200 GB/s | 200-300 ns | 1-2W | Mobile |
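One way to read the table is bandwidth delivered per watt. The sketch below derives a pessimistic and an optimistic GB/s-per-watt figure from the bandwidth and power ranges of the HBM3 and LPDDR5X rows. The helper name and the pairing of range endpoints are assumptions made for illustration; real parts do not necessarily hit their lowest power at their highest bandwidth.

```python
from typing import Tuple

Range = Tuple[float, float]  # (low, high)


def gbs_per_watt(bandwidth_gbs: Range, power_w: Range) -> Range:
    """Return (pessimistic, optimistic) bandwidth per watt in GB/s per W."""
    bw_lo, bw_hi = bandwidth_gbs
    p_lo, p_hi = power_w
    # Pessimistic: lowest bandwidth at highest power; optimistic: the reverse.
    return bw_lo / p_hi, bw_hi / p_lo


# Bandwidth (GB/s) and power (W) ranges copied from the table rows above.
ROWS = {
    "HBM3": ((2000.0, 5000.0), (5.0, 10.0)),
    "LPDDR5X": ((100.0, 200.0), (1.0, 2.0)),
}

if __name__ == "__main__":
    for tech, (bandwidth, power) in ROWS.items():
        low, high = gbs_per_watt(bandwidth, power)
        print(f"{tech:>8}: {low:.0f}-{high:.0f} GB/s per W")
```

Taken at face value, the table's figures suggest HBM3 moves more data per watt than LPDDR5X, even though its absolute power draw is several times higher.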
HBM (High-Bandwidth Memory)
Used by: Google TPU, NVIDIA H100, and other datacenter accelerators
How HBM Works
Standard DRAM:
CPU (memory controller) → 64-bit parallel bus → DRAM module
Bandwidth: ~100 GB/s (limited by bus width)
HBM (3D stacking):
Processor → Many small buses (in parallel) → Stacked DRAM layers
Bandwidth: 2-4 TB/s (16-32 parallel buses × 125 GB/s each)
Stacking:
Layer 1: DRAM array
Layer 2: DRAM array
...
Layer 12: DRAM array
All connected via Through-Silicon Vias (TSVs)
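The bandwidth arithmetic above can be sketched in a few lines. The first helper multiplies the number of parallel buses by the per-bus bandwidth, using the 16 to 32 buses at 125 GB/s from the diagram. The second derives peak bandwidth from interface width and per-pin data rate; the 1024-bit HBM3 stack interface, the 64-bit DDR5 interface, and the 6.4 Gb/s pin rate are assumptions brought in for illustration and are not stated in these notes.

```python
def aggregate_bandwidth_gbs(num_buses: int, per_bus_gbs: float) -> float:
    """Total bandwidth in GB/s of several independent buses running in parallel."""
    return num_buses * per_bus_gbs


def interface_bandwidth_gbs(width_bits: int, gbit_per_s_per_pin: float) -> float:
    """Peak bandwidth in GB/s of an interface: data pins x per-pin rate / 8 bits per byte."""
    return width_bits * gbit_per_s_per_pin / 8


if __name__ == "__main__":
    # Figures from the diagram: 16 to 32 buses at 125 GB/s each.
    print(aggregate_bandwidth_gbs(16, 125.0))  # 2000.0 GB/s, i.e. 2 TB/s
    print(aggregate_bandwidth_gbs(32, 125.0))  # 4000.0 GB/s, i.e. 4 TB/s

    # Assumed example: same pin rate, very different widths.
    ddr5 = interface_bandwidth_gbs(64, 6.4)         # one 64-bit DDR5 interface
    hbm3_stack = interface_bandwidth_gbs(1024, 6.4)  # one 1024-bit HBM3 stack
    print(f"DDR5: {ddr5:.1f} GB/s, HBM3 stack: {hbm3_stack:.1f} GB/s")
```

At the same pin rate, the wide stacked interface delivers 16 times the bandwidth of the 64-bit bus, which is why HBM relies on width through TSVs and short interconnects rather than on faster pins.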
Design Tradeoffs
- Pro: roughly 20-50× higher bandwidth than DDR5 (2-5 TB/s vs ~100 GB/s), which raises the memory-bound ceiling in the roofline model
- Con: Expensive (~$20 per GB vs ~$2 per GB for GDDR6)
- Con: Thermal: stacked dies trap heat and need active cooling
- Trend: HBM3e (~6 TB/s) and HBM4 (~8 TB/s) continue to raise bandwidth
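The roofline argument behind the first bullet can be illustrated with a short sketch: attainable throughput is the smaller of peak compute and bandwidth times arithmetic intensity. The 100 TFLOP/s peak and the 10 FLOP/byte kernel are hypothetical values chosen for illustration; the bandwidths are the DDR5 and HBM figures used in these notes.

```python
def attainable_tflops(peak_tflops: float,
                      bandwidth_tb_per_s: float,
                      intensity_flop_per_byte: float) -> float:
    """Roofline model: performance is capped by compute or by memory traffic."""
    memory_bound = bandwidth_tb_per_s * intensity_flop_per_byte
    return min(peak_tflops, memory_bound)


def ridge_point(peak_tflops: float, bandwidth_tb_per_s: float) -> float:
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_tflops / bandwidth_tb_per_s


if __name__ == "__main__":
    peak = 100.0      # hypothetical accelerator peak, TFLOP/s
    intensity = 10.0  # hypothetical kernel, FLOP per byte moved

    for name, bandwidth in [("DDR5", 0.1), ("HBM", 2.0)]:
        perf = attainable_tflops(peak, bandwidth, intensity)
        ridge = ridge_point(peak, bandwidth)
        print(f"{name}: {perf:.1f} TFLOP/s attainable, ridge at {ridge:.0f} FLOP/byte")
```

Under these assumptions the kernel stays memory-bound on both memories, but HBM lifts the ceiling by the bandwidth ratio and moves the ridge point from about 1000 to 50 FLOP/byte. Higher bandwidth raises the memory-bound roof; it does not remove it.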
TPU v4 HBM Configuration
32 GB HBM2 per chip
- 1200 GB/s bandwidth
- DRAM dies stacked vertically within each HBM package
- TSV connections between layers
Cost implications:
- Bare HBM die: $150-200
- HBM packaging (BGA): $400-600
- TPU v4 package (logic die + HBM): $2000-3000
GDDR (Gaming-Optimized)
Cheaper than HBM and used in consumer GPUs such as NVIDIA RTX cards.
- Bandwidth: ~0.5-1 TB/s (below HBM's 2 TB/s, but still good)
- Latency: higher than HBM (100-200 ns)
- Cost: ~$5-8 per GB module
- Example: RTX 4090 has 24 GB GDDR6X at 1008 GB/s
LPDDR (Low-Power)
Used in mobile and edge devices (Apple A17 Pro, Qualcomm Snapdragon):
- Mounted in or on the SoC package (short traces, lower cost)
- Bandwidth: ~100 GB/s
- Power: 1-2W (critical for battery)
- A17 Pro: 8 GB LPDDR5 @ ~51 GB/s
Day 19: Network-on-Chip (NoC) for multi-tile designs.