AI Chip Design Day 18

Memory Bandwidth Hierarchy

Register file: ~30 KB, ~100 TB/s bandwidth (local to ALU)
SRAM (on-chip): 1-100 MB, 5-50 TB/s
HBM3 (stacked): 8-80 GB, 2-5 TB/s
DDR5 (CPU memory): 16-256 GB, 100 GB/s
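To make the gap between levels concrete, here is a minimal sketch that estimates how long it takes to read a fixed amount of data once at each level's bandwidth. Each range is collapsed to a single representative value, and the 1 GB working set is a hypothetical size chosen for illustration; capacity limits are ignored, so the numbers are only lower bounds on transfer time.

```python
# Approximate bandwidths from the hierarchy above, in bytes per second.
# Each range is collapsed to one representative point (an assumption).
TB = 1e12
GB = 1e9

BANDWIDTH_BYTES_PER_S = {
    "register file": 100 * TB,
    "on-chip SRAM": 20 * TB,  # assumed point within 5-50 TB/s
    "HBM3": 3 * TB,           # assumed point within 2-5 TB/s
    "DDR5": 100 * GB,
}


def stream_time_ms(num_bytes, bandwidth_bytes_per_s):
    """Lower bound on the time to read num_bytes once, in milliseconds."""
    return num_bytes / bandwidth_bytes_per_s * 1e3


working_set = 1 * GB  # hypothetical working set, ignores capacity limits
for level, bw in BANDWIDTH_BYTES_PER_S.items():
    print(f"{level:>14}: {stream_time_ms(working_set, bw):8.3f} ms")
```

With these assumed values, the same read takes about 30× longer from DDR5 than from HBM3, which is the gap the rest of these notes is about.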

Tech	Capacity	Bandwidth	Latency	Power	Use
HBM3	8-16 GB	2-5 TB/s	50-100 ns	5-10W	GPU/TPU
GDDR6X	8-24 GB	500-1000 GB/s	100-200 ns	8-15W	Gaming GPU
LPDDR5X	2-8 GB	100-200 GB/s	200-300 ns	1-2W	Mobile
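One way to compare the rows is energy per bit moved, which is power divided by bit rate. The sketch below uses one representative point inside each range of the table (an assumption, not a measurement), so the results only reflect the table's rough figures and not vendor data.

```python
# (power in watts, bandwidth in GB/s): representative points chosen from
# within the table's ranges. These are assumptions for illustration only.
REPRESENTATIVE = {
    "HBM3": (7.5, 3500.0),
    "GDDR6X": (11.5, 600.0),
    "LPDDR5X": (1.5, 150.0),
}


def picojoules_per_bit(power_w, bandwidth_gb_per_s):
    """Energy per transferred bit, in pJ, at full bandwidth."""
    bits_per_s = bandwidth_gb_per_s * 1e9 * 8
    return power_w / bits_per_s * 1e12


for tech, (power_w, bw) in REPRESENTATIVE.items():
    print(f"{tech:>8}: {picojoules_per_bit(power_w, bw):.2f} pJ/bit")
```

The absolute values depend entirely on the points picked, but the ordering shows why a technology with higher total power can still be the more efficient one per bit.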

HBM (High-Bandwidth Memory)

Used by: Google TPU, NVIDIA H100, and other datacenter accelerators. (Apple devices use LPDDR, covered below.)

How HBM Works

Standard DRAM:
CPU → 64-bit parallel bus → Memory Controller → DRAM module
Bandwidth: 100 GB/s (limited by bus width)

HBM (3D stacking):
CPU → Many small buses (in parallel) → Stacked DRAM layers
Bandwidth: 2-4 TB/s (16-32 parallel buses × 125 GB/s each)

Stacking:
Layer 1: DRAM array
Layer 2: DRAM array
...
Layer 12: DRAM array
All connected via Through-Silicon Vias (TSVs)
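The bandwidth arithmetic above is just a product of bus count and per-bus rate. A minimal sketch, using the 125 GB/s per-bus figure from these notes; the function names and the 6 TB/s target are illustrative choices:

```python
import math

PER_BUS_GB_PER_S = 125.0  # per-bus figure quoted in these notes


def aggregate_bandwidth_gb_per_s(num_buses, per_bus_gb_per_s=PER_BUS_GB_PER_S):
    """Total bandwidth when all buses transfer in parallel."""
    return num_buses * per_bus_gb_per_s


def buses_needed(target_gb_per_s, per_bus_gb_per_s=PER_BUS_GB_PER_S):
    """Smallest number of parallel buses that reaches the target."""
    return math.ceil(target_gb_per_s / per_bus_gb_per_s)


for n in (16, 32):
    print(f"{n} buses -> {aggregate_bandwidth_gb_per_s(n):.0f} GB/s")

# Hypothetical target: how many such buses would 6 TB/s require?
print(f"6000 GB/s needs {buses_needed(6000.0)} buses")
```

The point of the model is that HBM gets its bandwidth from width (many buses over short TSV and interposer links), not from pushing a single bus faster.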

Design Tradeoffs

Pro: 10-20× higher bandwidth than DDR (raises the memory-bound ceiling in the roofline model)
Con: Expensive (~$20 per GB vs $2 for GDDR6)
Con: Heat; stacked dies are harder to cool and need active cooling
Trend: HBM3e (6 TB/s), HBM4 (8 TB/s) in progress
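The roofline argument behind the bandwidth "Pro" can be sketched in a few lines: attainable throughput is the smaller of the compute peak and bandwidth times arithmetic intensity. The 400 TFLOP/s peak below is a hypothetical accelerator, not a figure from these notes; the bandwidths are the DDR and HBM numbers used earlier.

```python
def attainable_tflops(peak_tflops, bandwidth_tb_per_s, flops_per_byte):
    """Roofline bound: min(compute peak, bandwidth x arithmetic intensity)."""
    return min(peak_tflops, bandwidth_tb_per_s * flops_per_byte)


PEAK_TFLOPS = 400.0  # hypothetical accelerator peak, not from the text
MEMORIES_TB_PER_S = {"DDR5": 0.1, "HBM3": 2.0}

for name, bw in MEMORIES_TB_PER_S.items():
    ridge = PEAK_TFLOPS / bw  # FLOP/byte needed to become compute-bound
    print(f"{name}: ridge point at {ridge:.0f} FLOP/byte")
    for intensity in (10, 100, 1000):
        bound = attainable_tflops(PEAK_TFLOPS, bw, intensity)
        print(f"  {intensity:>4} FLOP/byte -> {bound:6.1f} TFLOP/s")
```

Under these assumptions, a kernel at 100 FLOP/byte is limited to 10 TFLOP/s on DDR5 but reaches 200 TFLOP/s on HBM3. Higher bandwidth moves the ridge point left, so more kernels become compute-bound.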

TPU v4 HBM Configuration

- 32 GB HBM2 per chip
- ~1200 GB/s bandwidth
- Stacked DRAM dies in each HBM package
- TSV connections between layers

Cost implications (rough estimates):
- Bare HBM die: $150-200
- HBM packaging (BGA): $400-600
- TPU v4 die (logic + HBM): $2000-3000

GDDR (Gaming-Optimized)

Cheaper than HBM; used in consumer GPUs such as NVIDIA RTX cards.

Bandwidth: ~500-1000 GB/s (below HBM's 2+ TB/s, but still good)
Latency: Higher than HBM (100-200 ns vs 50-100 ns)
Cost: ~$5-8 per GB module
Example: RTX 4090 has 24 GB GDDR6X at ~1008 GB/s (384-bit bus, 21 Gbps per pin)
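GDDR bandwidth figures follow directly from bus width and per-pin data rate: peak GB/s = bus width in bits × Gbit/s per pin / 8. A minimal sketch with two illustrative configurations; the specific width and pin-rate pairs are assumptions chosen to show the formula, not a product list.

```python
def peak_bandwidth_gb_per_s(bus_width_bits, gbit_per_s_per_pin):
    """Peak memory bandwidth in GB/s for a given bus width and pin rate."""
    return bus_width_bits * gbit_per_s_per_pin / 8


# (bus width in bits, data rate in Gbit/s per pin); illustrative values
CONFIGS = [(384, 21.0), (256, 18.0)]

for width, rate in CONFIGS:
    bw = peak_bandwidth_gb_per_s(width, rate)
    print(f"{width}-bit @ {rate:.0f} Gbps/pin -> {bw:.0f} GB/s")
```

A GDDR bus is much narrower than an HBM interface, so it has to run each pin far faster to stay within reach. That is part of why its power per bit is higher.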

LPDDR (Low-Power)

Used in mobile and edge devices (Apple A17 Pro, Qualcomm Snapdragon):

Packaged with the SoC rather than on a separate module (lower cost)
Bandwidth: ~50-100 GB/s on phone-class SoCs
Power: 1-2W (critical for battery)
A17 Pro: 8 GB LPDDR5 @ ~51 GB/s

Day 19: Network-on-Chip (NoC) for multi-tile designs.