Day 23

NVIDIA H100

The H100 is a general-purpose GPU, not a specialized systolic array. Covers Tensor Cores, sparsity, and FP8, and why GPUs are flexible but less efficient.

GPU vs Systolic Tradeoff

| Metric | Systolic (TPU) | GPU (H100) |
|---|---|---|
| Peak TFLOPS | 430 | ~989 (dense BF16) |
| Practical efficiency | 60-70% | 30-40% |
| Memory bandwidth | 2 TB/s | 3.35 TB/s (HBM3) |
| Flexibility | Matmul-focused | All workloads |
| Training/Inference | Both | Both |
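To see how peak throughput and practical efficiency combine, here is a minimal sketch that multiplies the two. The TPU peak and both efficiency ranges are taken from the comparison above; the H100 peak of about 989 TFLOPS is an assumption of this sketch (the dense BF16 figure from NVIDIA's H100 SXM datasheet), and the names `chips` and `sustained_range` are made up for illustration.

```python
def effective_tflops(peak_tflops, utilization):
    """Sustained throughput = peak throughput x fraction of peak actually achieved."""
    return peak_tflops * utilization

# Assumed inputs: (peak TFLOPS, (low, high) practical efficiency).
# The H100 peak is the dense BF16 datasheet figure, an assumption of this sketch.
chips = {
    "TPU (systolic)": (430.0, (0.60, 0.70)),
    "H100 (GPU)": (989.0, (0.30, 0.40)),
}

def sustained_range(name):
    """Return (low, high) sustained TFLOPS for one entry of `chips`."""
    peak, (low, high) = chips[name]
    return effective_tflops(peak, low), effective_tflops(peak, high)

for name in chips:
    low, high = sustained_range(name)
    print(f"{name}: {low:.0f}-{high:.0f} TFLOPS sustained")
```

Under these assumptions the two sustained ranges overlap, so the efficiency row matters about as much as the peak row when comparing the chips.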

H100 Tensor Cores

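Tensor Cores are not a true systolic array. They behave more like SIMD units: each instruction performs a small, fixed-size matrix multiply-accumulate (D = A × B + C) on operand tiles, and larger matrix multiplies are built from many such tile operations.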
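The tile-at-a-time behavior can be sketched in plain Python. This is only an illustration of the decomposition, not how Tensor Cores are programmed: the tile edge `TILE = 4` is an arbitrary assumption (real tile shapes depend on data type and architecture), and `mma_tile`, `tile`, and `tiled_matmul` are hypothetical helper names.

```python
TILE = 4  # illustrative tile edge; real Tensor Core tile shapes vary by dtype and architecture

def mma_tile(a, b, c):
    """One 'instruction': D = A @ B + C on TILE x TILE tiles (lists of lists)."""
    return [[c[i][j] + sum(a[i][k] * b[k][j] for k in range(TILE))
             for j in range(TILE)] for i in range(TILE)]

def tile(m, r, c):
    """Extract the TILE x TILE block whose top-left corner is (r, c)."""
    return [row[c:c + TILE] for row in m[r:r + TILE]]

def tiled_matmul(a, b):
    """Multiply square matrices (size a multiple of TILE) using only mma_tile."""
    n = len(a)
    out = [[0] * n for _ in range(n)]
    for i in range(0, n, TILE):
        for j in range(0, n, TILE):
            acc = [[0] * TILE for _ in range(TILE)]  # accumulator tile
            for k in range(0, n, TILE):
                acc = mma_tile(tile(a, i, k), tile(b, k, j), acc)
            for di in range(TILE):
                out[i + di][j:j + TILE] = acc[di]
    return out

n = 8
a = [[i + j for j in range(n)] for i in range(n)]
b = [[(i * j) % 5 for j in range(n)] for i in range(n)]
reference = [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
print(tiled_matmul(a, b) == reference)
```

Every output tile is produced by the same fixed operation applied to different data, which is the SIMD-like character described above; a systolic array would instead stream operands through a grid of cells.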

Why This Matters

TPU (systolic): Load weights, compute, store results. Limited flexibility.
H100 (GPU): Can do anything:
  - Variable batch sizes
  - Dynamic neural networks
  - Attention mechanisms with loops
  - Activation functions interleaved
  - Layer norm, batch norm (custom kernels)
Cost: roughly 40% of silicon area for control (vs roughly 5% for TPU)

H100 Specs

Figures for the H100 SXM, dense unless noted:
Peak FP32 (CUDA cores): ~67 TFLOPS
Peak TF32 (Tensor Core path for FP32 matmuls): ~495 TFLOPS
Peak BFLOAT16 / FP16 (Tensor Core): ~989 TFLOPS
Peak FP8 (Tensor Core, new Hopper feature): ~1,979 TFLOPS
Peak Sparse (2:4 sparsity): 2× the dense Tensor Core rate (~1,979 TFLOPS BF16, ~3,958 TFLOPS FP8)
Memory: 80 GB HBM3 @ 3.35 TB/s
Interconnect: 900 GB/s NVLink (between GPUs)
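One way to read the compute and memory figures together is a roofline estimate: a kernel only reaches peak compute if it performs enough arithmetic per byte moved. The sketch below assumes a dense BF16 peak of about 989 TFLOPS and 3.35 TB/s of HBM3 bandwidth (NVIDIA datasheet values for the SXM part); the function names and the example intensities are hypothetical.

```python
def ridge_point(peak_tflops, bandwidth_tb_s):
    """FLOPs per byte a kernel needs to be compute-bound rather than memory-bound."""
    return (peak_tflops * 1e12) / (bandwidth_tb_s * 1e12)

def attainable_tflops(intensity, peak_tflops, bandwidth_tb_s):
    """Roofline model: min(compute roof, bandwidth x arithmetic intensity)."""
    return min(peak_tflops, bandwidth_tb_s * intensity)

PEAK_BF16 = 989.0   # TFLOPS, assumed dense BF16 Tensor Core peak
HBM_BW = 3.35       # TB/s, assumed HBM3 bandwidth

print(f"ridge point: {ridge_point(PEAK_BF16, HBM_BW):.0f} FLOPs/byte")
# Assumed intensities: ~1 FLOP/byte for an elementwise op, ~1000 for a large matmul.
for label, intensity in [("elementwise op", 1.0), ("large matmul", 1000.0)]:
    t = attainable_tflops(intensity, PEAK_BF16, HBM_BW)
    print(f"{label}: up to {t:.2f} TFLOPS")
```

Under these assumptions, low-intensity kernels such as normalization or activation functions are limited by memory bandwidth and sit far below peak, which is one reason measured utilization falls well short of the headline number.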

Sparsity: 2:4 Structured (Since Ampere)

If half of the weights are zero in a fixed 2:4 pattern, the Tensor Cores can skip those multiplies:

Dense multiply: A[16×16] @ B[16×16] = 16×16×16 = 4,096 MACs
Sparse multiply (2:4 structured):
  - In every group of 4 consecutive weights, 2 are zero
  - Effective: 4,096 / 2 = 2,048 MACs
  - H100: Same hardware, up to 2× throughput
Example: GPT-3 pruned to 2:4 sparsity:
  Theoretical: 2× the dense peak (~989 → ~1,979 TFLOPS for BF16, per GPU)
  Practical: about 1.24× over dense (roughly 62% of the sparse peak, where the dense rate alone is 50%)
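The 2:4 pattern and the MAC count can be checked with a short sketch. It assumes magnitude-based selection (keep the two largest-magnitude weights in each group of four), which is a common pruning heuristic and not necessarily what any particular NVIDIA tool does; `prune_2_of_4` and `nonzero_macs` are hypothetical names.

```python
def prune_2_of_4(weights):
    """Zero the 2 smallest-magnitude values in each consecutive group of 4."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for g in range(0, len(weights), 4):
        group = weights[g:g + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned

def nonzero_macs(weight_rows, n_cols):
    """MACs needed for W @ X when multiplies by zero weights are skipped."""
    nonzero = sum(1 for row in weight_rows for w in row if w != 0.0)
    return nonzero * n_cols

# 16x16 weight matrix with all entries nonzero, multiplied by a 16x16 input.
dense_w = [[float((i * 16 + j + 1) * (-1) ** j) for j in range(16)] for i in range(16)]
sparse_w = [prune_2_of_4(row) for row in dense_w]

dense_macs = nonzero_macs(dense_w, 16)
sparse_macs = nonzero_macs(sparse_w, 16)
print(dense_macs, sparse_macs)
```

The halved MAC count is the upper bound on the speedup; the hardware still has to read the compressed weights and their position metadata, which is part of why the practical gain is smaller than 2×.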

Real Training: GPT-3 Scale

  • 175 billion parameters
  • 8 × H100 (DGX machine): about 7.9 PFLOPS combined peak (8 × ~989 TFLOPS, dense BF16)
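  • Time for one pass over ~300 billion tokens: on the order of years on a single 8-GPU node (about 3.15 × 10^23 FLOPs at 30-40% of peak), which is why runs at this scale use thousands of GPUs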
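  • Cost: roughly $100 per hour for one 8 × H100 node at AWS on-demand launch pricing (p5.48xlarge); a cluster large enough to cost ~$15k per hour would be over a thousand GPUs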
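A rough way to sanity-check training time at this scale is the common approximation of about 6 FLOPs per parameter per token. The sketch below assumes roughly 300 billion training tokens, a dense BF16 peak of about 989 TFLOPS per GPU, and 35% utilization; all three are assumptions of this example, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400

def training_flops(params, tokens):
    """Approximate total training compute: ~6 FLOPs per parameter per token."""
    return 6 * params * tokens

def training_days(params, tokens, n_gpus, peak_tflops_per_gpu, utilization):
    """Days to train, given sustained cluster throughput = GPUs x peak x utilization."""
    sustained_flops_per_s = n_gpus * peak_tflops_per_gpu * 1e12 * utilization
    return training_flops(params, tokens) / sustained_flops_per_s / SECONDS_PER_DAY

PARAMS = 175e9   # from the list above
TOKENS = 300e9   # assumed training set size
PEAK = 989.0     # TFLOPS per GPU, assumed dense BF16 peak
UTIL = 0.35      # assumed, midpoint of the 30-40% range

total = training_flops(PARAMS, TOKENS)
days_one_node = training_days(PARAMS, TOKENS, 8, PEAK, UTIL)
days_1024_gpus = training_days(PARAMS, TOKENS, 1024, PEAK, UTIL)
print(f"total compute: {total:.2e} FLOPs")
print(f"8 GPUs: {days_one_node:.0f} days, 1024 GPUs: {days_1024_gpus:.1f} days")
```

With these inputs a single 8-GPU node needs years, while about a thousand GPUs bring the estimate down to days. The estimate ignores communication overhead and whether the model and optimizer state fit in memory, so it is an optimistic lower bound.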

Why Less Efficient Than TPU?

  • Control overhead (branching, dynamic memory access)
  • Tensor Cores only handle matrix ops (the rest of the workload runs on general-purpose CUDA cores)
  • Memory hierarchy complexity (registers, L1/shared memory, L2 cache, HBM vs TPU's flat SRAM)

Day 24: Specialized ASICs (Groq, Cerebras, SambaNova).