AI Chip Design Day 29

Cost Breakdown: H100 Example

H100 cost structure ($40,000 retail): - NVIDIA R&D: $2,000 (amortized over 100k units) - Die cost: $3,000 (manufacturing) ├─ Wafer: $12,000 ├─ Good dies per wafer: 4 (80mm² die, 300mm wafer) ├─ Yield: 60% (240 dies scrap) └─ Cost per good die: $3,000 - HBM stacking: $800 - Packaging: $400 - PCB, interconnect: $300 - NVIDIA margin: $15,000+

Area Optimization Strategies

1. Systolic Array Sizing

Array Size	Die Area	Peak TFLOPS	TFLOPS/mm²
64×64	5 mm²	32	6.4
128×128	20 mm²	131	6.6
256×256	80 mm²	520	6.5
512×512	320 mm²	2,080	6.5

Insight: Efficiency is constant (roughly), but absolute cost is not. 512×512 has 4× the area but 4× the cost.

2. Memory Hierarchy

SRAM is expensive (area-wise):

Option A: Large SRAM (24 MB) - Faster, fewer HBM accesses - Area: ~10 mm² (large!) - Power: Lower (fewer HBM I/O) Option B: Small SRAM (4 MB) - Slower, more HBM accesses - Area: ~2 mm² - Power: Higher (HBM bandwidth-limited) Trade-off: 8 mm² area savings = $800 cost savings But 50% more HBM I/O = 20W more power = $5k in power supply cost Economics favor Option A (in datacenter)

Yield and Manufacturing Cost

Wafer Yield Formula

Good dies per wafer = π × (R / d) × (d / 2A)^2 × Y R = wafer radius (150 mm for 300mm wafer) d = defect density (defects/cm²) A = die area (mm²) Y = yield coefficient (0.6-0.8) Example: - Die area: 80 mm² (TPU v4) - Defect density: 0.1 defects/cm² (5nm) - Yield: 60% - Good dies: 4 per wafer Cost per good die = $12,000 wafer / 4 dies = $3,000

Cost per TFLOPS

Chip	Die Area	Cost	Peak TFLOPS	Cost/TFLOPS
A17 ANE	1.5 mm²	$100	17	$5.88
TPU v4	80 mm²	$3,000	430	$6.98
H100	815 mm²	$5,000	1,450	$3.45

Tiling for Cost Reduction

Instead of one large die, use multiple smaller dies:

Option A: Single 256×256 systolic (TPU-style) - Die area: 80 mm² - Yield: 60% - Cost: $3,000 Option B: Four 128×128 systolic (tiled) - Die area: 4 × 20 = 80 mm² - Yield: 60% per die = 84% when only 1 is needed - Cost: $2,400 (if smaller die is cheaper to fab) - Benefit: One bad die doesn't kill product Production reality: Google uses multi-chip TPU Pods (not monolithic)

Day 30: RTL to Silicon: the entire journey from HDL to tape-out.

Area & Cost Reduction