AI Chip Design Day 23

GPU vs Systolic Tradeoff

Metric	Systolic (TPU)	GPU (H100)
Peak TFLOPS	430	~989 (dense BF16)
Practical efficiency	60-70%	30-40%
Memory bandwidth	2 TB/s	3.35 TB/s (HBM3)
Flexibility	Matmul-focused	All workloads
Training/Inference	Both	Both
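
To make the peak-versus-efficiency tradeoff concrete, here is a minimal sketch that multiplies a peak figure by an efficiency range. The TPU peak and both efficiency ranges come from the table; the H100 peak of ~989 TFLOPS is an assumption (NVIDIA's published dense BF16 Tensor Core figure for the H100 SXM), and `effective_tflops` is a hypothetical helper, not part of any library.

```python
def effective_tflops(peak_tflops, efficiency_range):
    """Return (low, high) sustained TFLOPS for a peak figure and an efficiency range."""
    low, high = efficiency_range
    return peak_tflops * low, peak_tflops * high

# (peak TFLOPS, (min efficiency, max efficiency)); the H100 peak is an assumed
# dense BF16 value, the other numbers are taken from the table above.
chips = {
    "TPU (systolic)": (430, (0.60, 0.70)),
    "H100 (GPU)": (989, (0.30, 0.40)),
}

for name, (peak, eff) in chips.items():
    low, high = effective_tflops(peak, eff)
    print(f"{name}: {low:.0f}-{high:.0f} TFLOPS sustained")
```

Under these assumptions the two ranges overlap far more than the peak numbers alone suggest, which is the point of the efficiency row.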

H100 Tensor Cores

Not a true systolic array, but SIMD-like:

A warp (32 threads) cooperatively multiplies small matrix tiles (e.g. 16×16)
4 Tensor Cores per SM (Streaming Multiprocessor), 132 SMs on the H100 SXM
Shared memory acts as a software-managed local cache (up to 228 KB per SM)
Full GPU control: branches, loops, memory patterns
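
The tile-at-a-time pattern can be sketched in plain Python. This is only an illustration of the loop structure, not how Tensor Cores are actually programmed (on real hardware the inner tile operation is a single warp-level instruction). The names `matmul_tile_accumulate`, `get_tile`, and `tiled_matmul` are hypothetical, and all matrix dimensions are assumed to be multiples of 16.

```python
TILE = 16  # tile edge, matching the 16×16 tiles described above

def matmul_tile_accumulate(a_tile, b_tile, c_tile):
    """c_tile += a_tile @ b_tile for TILE×TILE lists of lists (stands in for one warp-level op)."""
    for i in range(TILE):
        for j in range(TILE):
            acc = c_tile[i][j]
            for k in range(TILE):
                acc += a_tile[i][k] * b_tile[k][j]
            c_tile[i][j] = acc

def get_tile(m, row, col):
    """Copy the TILE×TILE block of m whose top-left corner is (row, col)."""
    return [r[col:col + TILE] for r in m[row:row + TILE]]

def tiled_matmul(a, b):
    """Multiply a (M×K) by b (K×N); M, K, N are assumed to be multiples of TILE."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, TILE):          # ordinary loops: the general-purpose control part
        for j0 in range(0, n, TILE):
            c_tile = [[0.0] * TILE for _ in range(TILE)]
            for k0 in range(0, k, TILE):  # accumulate along K, one tile op per step
                matmul_tile_accumulate(get_tile(a, i0, k0), get_tile(b, k0, j0), c_tile)
            for i in range(TILE):         # write the finished output tile back
                c[i0 + i][j0:j0 + TILE] = c_tile[i]
    return c
```

The outer loops are where a GPU's branches and memory-pattern control come in; only the innermost tile multiply maps onto the matrix hardware.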

Why This Matters

TPU (systolic): Load weights, compute, store results. No flexibility.

H100 (GPU): Can do anything:
- Variable batch sizes
- Dynamic neural networks
- Attention mechanisms with loops
- Activation functions interleaved
- Layer norm, batch norm (custom kernels)

Cost: 40% silicon area for control (vs 5% for TPU)

H100 Specs

Figures for the H100 SXM, dense unless noted:

Peak FP32 (CUDA cores, no Tensor Cores): ~67 TFLOPS
Peak TF32 (Tensor Cores, used for FP32 matmuls): ~495 TFLOPS
Peak BFLOAT16: ~989 TFLOPS
Peak FP8 (new Hopper feature): ~1,979 TFLOPS
Peak Sparse (2:4 sparsity): 2× the dense Tensor Core figure (e.g. ~1,979 TFLOPS for BFLOAT16)
Memory: 80 GB HBM3 @ 3.35 TB/s
Interconnect: 900 GB/s NVLink (between GPUs)

Sparsity: 2:4 Structured (Introduced in Ampere, Kept in Hopper)

If 50% of the weights are zero in a fixed 2:4 pattern, the hardware can skip those multiplies:

Dense multiply: A[16×16] @ B[16×16] = 16×16×16 = 4,096 MACs

Sparse multiply (2:4 structured):
- In every 4 consecutive weights, 2 are zero
- Effective: 4,096 / 2 = 2,048 MACs
- H100: Same hardware, up to 2× throughput

Example: GPT-3 pruned to 2:4 sparsity:
Theoretical: 2× the dense peak
Practical: ~1.24× the dense peak (about 62% of the sparse peak)
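
The MAC arithmetic above can be checked with a small sketch. It prunes each group of 4 weights by magnitude (one common way to reach a 2:4 pattern, assumed here for illustration) and counts the multiply-accumulates that remain. `prune_2_of_4` and `count_macs` are hypothetical helpers, and the weight values are made up.

```python
def prune_2_of_4(row):
    """Zero the 2 smallest-magnitude values in each group of 4 (len(row) must be a multiple of 4)."""
    pruned = []
    for g in range(0, len(row), 4):
        group = row[g:g + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned

def count_macs(a, b, skip_zeros):
    """Count multiply-accumulates in a @ b, optionally skipping zero entries of a."""
    macs = 0
    for row in a:
        for k, a_val in enumerate(row):
            if skip_zeros and a_val == 0:
                continue
            macs += len(b[k])  # one MAC per output column
    return macs

# Made-up 16×16 weights (all nonzero) and activations.
weights = [[float((i * 16 + j) % 7 + 1) for j in range(16)] for i in range(16)]
acts = [[1.0] * 16 for _ in range(16)]
sparse_weights = [prune_2_of_4(row) for row in weights]

print(count_macs(weights, acts, skip_zeros=False))        # 4096
print(count_macs(sparse_weights, acts, skip_zeros=True))  # 2048
```

Real hardware does not test each weight for zero at run time; it stores the two kept values per group plus small index metadata, which is why the pattern has to be structured.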

Real Training: GPT-3 Scale

175 billion parameters
8 × H100 (DGX machine): ~7.9 PFLOPS combined (dense BF16 peak, 8 × ~989 TFLOPS)
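Time per epoch: ~6 hours (assumes full parallelism across a cluster far larger than one DGX node)
Cost: ~$15k per hour (at AWS rates) for a multi-node cluster; a single 8 × H100 instance is on the order of $100 per hour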
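
A rough check on wall-clock time for a single 8-GPU node, using the common approximation that training takes about 6 × parameters × tokens FLOPs. The inputs other than the parameter count are assumptions: ~300 billion training tokens, ~989 TFLOPS dense BF16 peak per GPU, and 40% utilization. `training_days` is a hypothetical helper.

```python
def training_days(params, tokens, n_gpus, peak_flops_per_gpu, utilization):
    """Estimate wall-clock days using total FLOPs ≈ 6 × params × tokens."""
    total_flops = 6 * params * tokens
    sustained_flops_per_s = n_gpus * peak_flops_per_gpu * utilization
    return total_flops / sustained_flops_per_s / 86_400  # seconds per day

days = training_days(
    params=175e9,               # GPT-3 scale, from the text
    tokens=300e9,               # assumed token count
    n_gpus=8,                   # one DGX node
    peak_flops_per_gpu=989e12,  # assumed dense BF16 peak
    utilization=0.40,           # assumed, top of the 30-40% range
)
print(f"{days:.0f} days (~{days / 365:.1f} years)")
```

Under these assumptions one node needs on the order of three years for a single pass over the data, so short training times at this scale require spreading the work across thousands of GPUs.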

Why Less Efficient Than TPU?

Control overhead (branching, dynamic memory access)
Tensor Cores only for matrix ops (rest of workload runs on general-purpose CUDA cores)
Memory hierarchy complexity (registers, L1/shared memory, L2 cache, HBM vs TPU's flat SRAM)

Day 24: Specialized ASICs (Groq, Cerebras, SambaNova).