GPU vs Systolic Tradeoff
| Metric | Systolic (TPU) | GPU (H100) |
|---|---|---|
| Peak TFLOPS | 430 | 1,450 |
| Practical efficiency | 60-70% | 30-40% |
| Memory bandwidth | 2 TB/s | 3 TB/s (HBM3) |
| Flexibility | Matmul-focused | All workloads |
| Training/Inference | Both | Both |
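
To make the table concrete, sustained throughput can be estimated as peak times practical efficiency. The sketch below uses only the table's own figures; the names `CHIPS` and `effective_tflops` are made up for illustration.

```python
# Figures copied from the comparison table; efficiency is a (low, high) range.
CHIPS = {
    "TPU (systolic)": {"peak_tflops": 430, "efficiency": (0.60, 0.70)},
    "H100 (GPU)": {"peak_tflops": 1450, "efficiency": (0.30, 0.40)},
}

def effective_tflops(peak_tflops, efficiency):
    """Return the (low, high) sustained throughput implied by an efficiency range."""
    low, high = efficiency
    return peak_tflops * low, peak_tflops * high

for name, spec in CHIPS.items():
    low, high = effective_tflops(spec["peak_tflops"], spec["efficiency"])
    print(f"{name}: {low:.0f}-{high:.0f} TFLOPS sustained")
```

With these inputs the GPU still sustains more absolute throughput (roughly 435-580 vs 258-301 TFLOPS) despite its lower efficiency, because its peak is higher.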
H100 Tensor Cores
Not a true systolic array, but SIMD-like:
- A warp (32 threads) cooperatively multiplies small matrix tiles (e.g. 16×16)
- 4 Tensor Cores per SM (Streaming Multiprocessor), 132 SMs on the H100 SXM, 528 Tensor Cores in total
- Shared memory acts as a software-managed local cache (up to 228 KB per SM)
- Full GPU control: branches, loops, memory patterns
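The tile-at-a-time pattern in the list above can be sketched in plain Python. This models only the loop structure (one 16×16×16 tile multiply-accumulate per step), not the hardware; `tiled_matmul` and `TILE` are illustrative names, and a real kernel would issue these tile operations from CUDA.

```python
TILE = 16  # tile edge, matching the 16×16 tiles described above

def tiled_matmul(a, b, tile=TILE):
    """Multiply square matrices a @ b one tile at a time, counting MACs."""
    n = len(a)
    assert n % tile == 0, "matrix size must be a multiple of the tile size"
    c = [[0.0] * n for _ in range(n)]
    macs = 0
    # Each (i0, j0, k0) triple is one tile-level multiply-accumulate,
    # the unit of work a warp would hand to a Tensor Core.
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        acc = c[i][j]
                        for k in range(k0, k0 + tile):
                            acc += a[i][k] * b[k][j]
                            macs += 1
                        c[i][j] = acc
    return c, macs

n = 32
a = [[float(i + j) for j in range(n)] for i in range(n)]
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
c, macs = tiled_matmul(a, identity)
print(macs)  # 32*32*32 = 32768 MACs, i.e. 8 tile operations of 4,096 MACs each
```

Everything outside the innermost three loops (index math, branching, loading tiles) is the part the GPU's general-purpose control logic handles.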
Why This Matters
TPU (systolic): Load weights, compute, store results. Little flexibility beyond dense matrix math.
H100 (GPU): Can do anything:
- Variable batch sizes
- Dynamic neural networks
- Attention mechanisms with loops
- Activation functions interleaved
- Layer norm, batch norm (custom kernels)
Cost: roughly 40% of silicon area goes to control logic (vs about 5% for a TPU)
H100 Specs
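Peak throughput depends on precision. Approximate dense figures from NVIDIA's H100 SXM datasheet:
Peak FP32 (CUDA cores, no Tensor Cores): ~67 TFLOPS
Peak TF32 (Tensor Cores, used for FP32 matmuls): ~494 TFLOPS
Peak BFLOAT16/FP16 (Tensor Cores): ~989 TFLOPS
Peak FP8 (Tensor Cores, new Hopper feature): ~1,979 TFLOPS
Peak Sparse (2:4 sparsity): 2× the dense Tensor Core figures (e.g. ~3,958 TFLOPS FP8)
The 1,450 TFLOPS used in the comparison table and in the examples in this section is a round working figure that sits between the dense BF16 and FP8 peaks.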
Memory: 80 GB HBM3 @ 3 TB/s
Interconnect: 900 GB/s NVLink (between GPUs)
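One way to relate the compute and memory numbers is the roofline model, which is not part of the original notes but is a common back-of-the-envelope tool. The sketch below uses the round figures from this section (1,450 TFLOPS, 3 TB/s); the function names and the two example intensities are hypothetical.

```python
def ridge_point(peak_tflops, bandwidth_tb_per_s):
    """FLOPs per byte needed before compute, not memory, is the bottleneck."""
    return (peak_tflops * 1e12) / (bandwidth_tb_per_s * 1e12)

def attainable_tflops(intensity_flops_per_byte, peak_tflops, bandwidth_tb_per_s):
    """Simple roofline: min(compute roof, bandwidth * arithmetic intensity)."""
    memory_bound = bandwidth_tb_per_s * intensity_flops_per_byte  # TB/s * FLOP/B = TFLOPS
    return min(peak_tflops, memory_bound)

PEAK_TFLOPS = 1450      # round figure from the comparison table
BANDWIDTH_TB_S = 3.0    # HBM3 bandwidth from the spec list

# Arithmetic intensity at which the chip stops being memory-bound.
print(ridge_point(PEAK_TFLOPS, BANDWIDTH_TB_S))

# Hypothetical low-intensity op (2 FLOPs per byte): limited by HBM bandwidth.
print(attainable_tflops(2.0, PEAK_TFLOPS, BANDWIDTH_TB_S))

# Hypothetical high-intensity matmul (1000 FLOPs per byte): limited by compute.
print(attainable_tflops(1000.0, PEAK_TFLOPS, BANDWIDTH_TB_S))
```

Under these assumptions a kernel needs several hundred FLOPs per byte moved before the Tensor Cores, rather than HBM, become the limit, which is why large matmuls run near peak and elementwise ops do not.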
Sparsity: Introduced in Ampere, Carried into Hopper
If 50% of the weights are zero in a fixed 2:4 pattern, the Tensor Cores skip those multiplies:
Dense multiply:
A[16×16] @ B[16×16] = 16×16×16 = 4,096 MACs
Sparse multiply (2:4 structured):
- In every group of 4 consecutive weights, at least 2 are zero
- Effective: 4,096 / 2 = 2,048 MACs
- H100: same Tensor Core hardware, up to 2× peak throughput
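A minimal sketch of the 2:4 idea, assuming magnitude-based pruning (keep the 2 largest-magnitude weights in each group of 4). The hardware stores a small per-group position code next to the kept values; here the absolute index is stored instead to keep the code short. `compress_2_of_4` and `sparse_dot` are illustrative names, not a library API.

```python
def compress_2_of_4(weights):
    """Keep the 2 largest-magnitude weights in each group of 4.

    Returns (index, value) pairs, where index is the position in the
    original dense vector. The other half of the weights is treated as zero.
    """
    assert len(weights) % 4 == 0, "length must be a multiple of 4"
    kept = []
    for start in range(0, len(weights), 4):
        group = range(start, start + 4)
        top2 = sorted(group, key=lambda i: abs(weights[i]), reverse=True)[:2]
        kept.extend((i, weights[i]) for i in sorted(top2))
    return kept

def sparse_dot(kept, activations):
    """Dot product that only touches the kept weights; returns (result, MAC count)."""
    total, macs = 0.0, 0
    for i, w in kept:
        total += w * activations[i]
        macs += 1
    return total, macs

weights = [0.9, -0.1, 0.05, -0.7, 0.0, 0.3, -0.4, 0.02]
activations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
kept = compress_2_of_4(weights)
result, macs = sparse_dot(kept, activations)
print(kept)  # 4 of the 8 weights survive
print(macs)  # 4 MACs instead of 8
```

The same halving applied to a 16×16 weight tile leaves 128 kept weights, which against 16 activation columns gives the 2,048 MACs quoted above. Note that pruning changes the result unless the dropped weights were already zero.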
Example: GPT-3 pruned to 2:4 sparsity:
Theoretical: 1,450 → 2,900 TFLOPS (1.45 → 2.9 PFLOPS)
Practical: ~1,800 TFLOPS, about 62% of the sparse peak, or roughly 1.24× the dense peak rather than the theoretical 2×
Real Training: GPT-3 Scale
- 175 billion parameters
- 8 × H100 (DGX machine): 11.6 PFLOPS combined peak (8 × 1,450 TFLOPS)
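- Time per epoch: ~6 hours (only with full parallelism across a cluster far larger than one DGX; a single node's 640 GB of HBM cannot hold the training state of a 175B-parameter model)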
- Cost: ~$15k per hour (at AWS rates)
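A rough way to sanity-check training time is the common rule of thumb of about 6 FLOPs per parameter per token. The sketch below is an estimate under stated assumptions, not a measurement: the 300B token count is the figure reported for GPT-3, the 35% utilization is the midpoint of the 30-40% range from the comparison table, and the GPU counts are arbitrary examples.

```python
def training_flops(params, tokens):
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6 * params * tokens

def training_days(params, tokens, num_gpus, peak_tflops_per_gpu, utilization):
    """Days for one pass over the data at a given sustained utilization."""
    sustained = num_gpus * peak_tflops_per_gpu * 1e12 * utilization  # FLOP/s
    return training_flops(params, tokens) / sustained / 86_400

PARAMS = 175e9          # parameter count from the list above
TOKENS = 300e9          # assumption: token count reported for GPT-3
PEAK_TFLOPS = 1450      # per-GPU round figure used in this section
UTILIZATION = 0.35      # assumption: midpoint of the 30-40% range

for gpus in (8, 1024, 8192):
    days = training_days(PARAMS, TOKENS, gpus, PEAK_TFLOPS, UTILIZATION)
    print(f"{gpus:>5} GPUs: {days:,.1f} days")
```

Under these assumptions a single 8-GPU node would need on the order of 900 days for one pass, and about a thousand GPUs bring that down to roughly a week, ignoring communication overhead.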
Why Less Efficient Than TPU?
- Control overhead (branching, dynamic memory access)
- Tensor Cores only accelerate matrix ops (the rest of the workload runs on the general-purpose CUDA cores)
- Memory hierarchy complexity (registers, L1/shared memory, L2, HBM vs TPU's flat, software-managed SRAM)
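The second point can be illustrated with an Amdahl's-law style estimate: if even a small share of the FLOPs runs on a much slower path, average utilization of the Tensor Core peak drops sharply. This is a sketch under assumptions; the 60 TFLOPS general-purpose figure and the matmul fractions are hypothetical, and `blended_utilization` is an illustrative name.

```python
def blended_utilization(matmul_fraction, tensor_tflops, general_tflops):
    """Average throughput as a fraction of Tensor Core peak.

    matmul_fraction is the share of total FLOPs that are matrix multiplies;
    the remainder runs on the slower general-purpose units.
    """
    time_matmul = matmul_fraction / tensor_tflops
    time_other = (1.0 - matmul_fraction) / general_tflops
    achieved = 1.0 / (time_matmul + time_other)  # TFLOPS over the whole workload
    return achieved / tensor_tflops

TENSOR_TFLOPS = 1450   # round Tensor Core figure used in this section
GENERAL_TFLOPS = 60    # hypothetical throughput of the non-Tensor-Core path

for frac in (1.00, 0.99, 0.95):
    util = blended_utilization(frac, TENSOR_TFLOPS, GENERAL_TFLOPS)
    print(f"{frac:.0%} matmul FLOPs -> {util:.0%} of peak")
```

With these inputs, moving just 5% of the FLOPs off the Tensor Cores is enough to cut utilization to under half of peak, before any memory or control overhead is counted.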
Day 24: Specialized ASICs (Groq, Cerebras, SambaNova).