1. Training vs Inference Hardware
Key differences:
- Inference: Read-only weights, forward pass only, optimized for latency
- Training: Read-write weights, forward + backward pass, optimized for throughput
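- Memory: Training needs roughly 3-4x more memory than inference, mainly because forward activations must be kept for the backward pass
- Compute: Training costs about 3x the compute of inference per sample (one forward pass plus two gradient computations per layer: one with respect to the inputs, one with respect to the weights)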
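The 3x compute figure can be checked by counting operations for a single fully connected layer. The sketch below is illustrative only: the function name and layer sizes are made up for this example, each multiply-add is counted as 2 FLOPs, and bias terms and activation functions are ignored.

```python
def linear_layer_flops(batch, d_in, d_out):
    """Count FLOPs for Y = X @ W with X of shape (batch, d_in), W of shape (d_in, d_out)."""
    forward = 2 * batch * d_in * d_out       # Y = X @ W
    grad_input = 2 * batch * d_out * d_in    # dX = dY @ W.T
    grad_weight = 2 * d_in * batch * d_out   # dW = X.T @ dY
    backward = grad_input + grad_weight
    return {"forward": forward, "backward": backward, "training": forward + backward}


flops = linear_layer_flops(batch=32, d_in=1024, d_out=4096)
ratio = flops["training"] / flops["forward"]
print(ratio)  # 3.0
```

The backward pass is two matrix multiplies of the same size as the forward one, which is where the factor of three comes from.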
2. Backpropagation Hardware
3. Memory Requirements for Training
| Component | Size | Notes |
|---|---|---|
| Model weights | X GB | Parameters to learn |
| Forward activations | 2-5X GB | Stored for backward pass |
| Gradients (dW, db) | X GB | Same size as weights |
| Optimizer state | 1-2X GB | Momentum, Adam second moment |
| Total | 5-10X GB | 5-10x model size needed |
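Example: GPT-3 (175B parameters) holds about 350 GB of weights in FP16, so the 5-10x rule gives roughly 1.75-3.5 TB of training memory. An H100 has 80 GB, so that is at least 22 GPUs for memory alone, and more in practice.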
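The table can be turned into a rough calculator. This is a sketch under stated assumptions, not a sizing tool: the function names are hypothetical, weights are assumed to be FP16 (2 bytes each), and the multipliers are copied from the table rows, whose upper bounds add up to 9X (the table rounds this to 10X). With FP32 weights every figure doubles.

```python
import math

# (low, high) multipliers relative to weight size, copied from the table rows.
COMPONENTS = {
    "weights": (1, 1),
    "activations": (2, 5),
    "gradients": (1, 1),
    "optimizer_state": (1, 2),
}


def training_memory_gb(num_params, bytes_per_weight=2):
    """Return (low, high) training memory in GB; bytes_per_weight=2 assumes FP16."""
    weights_gb = num_params * bytes_per_weight / 1e9
    low = sum(bounds[0] for bounds in COMPONENTS.values()) * weights_gb
    high = sum(bounds[1] for bounds in COMPONENTS.values()) * weights_gb
    return low, high


def gpus_needed(memory_gb, gpu_memory_gb=80):
    """Minimum GPU count by memory capacity alone (80 GB per GPU assumed)."""
    return math.ceil(memory_gb / gpu_memory_gb)


low_gb, high_gb = training_memory_gb(175_000_000_000)
print(low_gb, high_gb)                            # 1750.0 3150.0
print(gpus_needed(low_gb), gpus_needed(high_gb))  # 22 40
```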
4. Optimization Algorithms
SGD (Stochastic Gradient Descent)
Simple: W_new = W - lr * dW
Hardware: One multiply and one subtract per parameter, no extra state (cheap)
Momentum
Accelerates: v = β*v + dW; W = W - lr*v
Hardware: Extra memory for velocity, one extra multiply
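The two update rules above can be written out directly. This is a minimal sketch on plain Python lists; the function names and the default β = 0.9 are assumptions made for the example.

```python
def sgd_step(weights, grads, lr):
    """W_new = W - lr * dW, one multiply and one subtract per parameter."""
    return [w - lr * g for w, g in zip(weights, grads)]


def momentum_step(weights, grads, velocity, lr, beta=0.9):
    """v = beta*v + dW; W = W - lr*v. The velocity list is the extra memory."""
    new_velocity = [beta * v + g for v, g in zip(velocity, grads)]
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Minimise f(w) = w^2, whose gradient is 2w, starting from w = 1.0.
w_sgd = sgd_step([1.0], [2.0], lr=0.1)
w_mom, vel = momentum_step([1.0], [2.0], [0.0], lr=0.1)
w_mom2, vel2 = momentum_step(w_mom, [2 * w_mom[0]], vel, lr=0.1)
print(w_sgd, w_mom2, vel2)
```

Note that momentum has to carry `velocity` between steps, so it needs one extra value per parameter that plain SGD does not.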
Adam (Adaptive Moment Estimation)
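Widely used default: Maintains running estimates of the first and second moments of the gradients
Hardware cost: Two extra values per parameter (m and v, so optimizer state is 2x the weight memory), plus a square root and a divide per update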
Trade-off: Better convergence, but higher memory and compute cost
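A single Adam update shows where the extra memory and ALU work come from. This is a sketch of the standard bias-corrected update on plain Python lists; the function name is hypothetical and the hyperparameter defaults (lr = 1e-3, β1 = 0.9, β2 = 0.999, eps = 1e-8) are the commonly quoted ones, assumed here rather than taken from the text.

```python
import math


def adam_step(weights, grads, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m and v are the two extra state values per parameter."""
    new_m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grads)]
    new_v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grads)]
    # Bias correction compensates for m and v starting at zero (step counts from 1).
    m_hat = [mi / (1 - beta1 ** step) for mi in new_m]
    v_hat = [vi / (1 - beta2 ** step) for vi in new_v]
    new_w = [
        w - lr * mh / (math.sqrt(vh) + eps)
        for w, mh, vh in zip(weights, m_hat, v_hat)
    ]
    return new_w, new_m, new_v


w, m, v = adam_step([1.0], [2.0], m=[0.0], v=[0.0], step=1)
print(w, m, v)
```

On the first step the update is close to `lr * sign(gradient)`, so the weight moves by about 0.001 regardless of the gradient's magnitude.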
5. Distributed Training
Problem: No single chip can hold LLM training state (175B parameters need several TB of memory)
Solution: Distribute across multiple chips
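- Data parallelism: Each chip processes different batch samples; all chips hold the same weights and average their gradients
- Model parallelism: The model itself is split across chips (different layers, or slices of one layer, on different chips)
- Pipeline parallelism: Layers are grouped into stages on different chips, and micro-batches are staggered through the stages to overlap compute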
Hardware requirement: High-speed interconnect (NVLink, InfiniBand) to synchronize gradients
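Gradient synchronization in data parallelism can be simulated in a few lines. The sketch below is a toy, single-process stand-in for what the interconnect does: `all_reduce_mean` is a hypothetical helper that averages one value per worker, and the model is a one-parameter linear fit chosen only to keep the arithmetic visible.

```python
def local_gradient(w, shard):
    """Mean-squared-error gradient for y = w * x over one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker ends up with the average."""
    return sum(values) / len(values)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[0:2], data[2:4]]   # two workers, equal shard sizes
w = 0.0                           # identical weights on every worker

local_grads = [local_gradient(w, shard) for shard in shards]
synced_grad = all_reduce_mean(local_grads)
full_batch_grad = local_gradient(w, data)
print(local_grads, synced_grad, full_batch_grad)
```

With equal shard sizes the averaged gradient matches the full-batch gradient, so every worker applies the same update and the weight copies stay identical.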
6. Gradient Accumulation & Mixed Precision
Problem: A large batch may not fit in memory, but large batches use the hardware more efficiently and give less noisy gradients
Solution: Gradient accumulation (sum gradients over several smaller micro-batches, then apply one weight update)
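Gradient accumulation can be sketched as a loop that adds up micro-batch gradients before a single weight update. The function names and the toy one-parameter model are assumptions for illustration; only one micro-batch needs to be resident at a time.

```python
def batch_gradient(w, batch):
    """Mean-squared-error gradient for y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_step(w, micro_batches, lr):
    """Accumulate gradients over equal-sized micro-batches, then update once."""
    accum = 0.0
    for micro_batch in micro_batches:
        # Divide by the number of micro-batches so the sum is a mean.
        accum += batch_gradient(w, micro_batch) / len(micro_batches)
    return w - lr * accum


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
micro_batches = [data[0:2], data[2:4]]

w_accumulated = accumulated_step(0.0, micro_batches, lr=0.01)
w_full_batch = 0.0 - 0.01 * batch_gradient(0.0, data)
print(w_accumulated, w_full_batch)
```

Both paths produce the same update; accumulation trades extra sequential steps for a smaller peak activation footprint.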
7. Hardware Optimizations for Training
- Tensor cores: Specialized matrix multiply (faster, lower precision)
- Gradient clipping: Prevent exploding gradients (compare the gradient norm to a threshold and rescale if it is exceeded)
- Mixed precision: Forward and backward in FP16 or BF16, with FP32 master weights and accumulation (faster and saves memory)
- Checkpointing: Don't store all activations, recompute during backward pass
- Sparsity: Skip computation on zero-valued weights and gradients (complex hardware, potentially big payoff)
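One practical consequence of using FP16 during training is that small gradients can underflow to zero. The sketch below illustrates this with Python's `struct` module, which can round a value to IEEE 754 half precision. Loss scaling, the workaround shown, is a commonly paired technique that the list above does not mention; the gradient value and the scale factor of 1024 are arbitrary choices for this example.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


tiny_grad = 1e-8       # below the smallest FP16 subnormal (about 6e-8)
loss_scale = 1024.0    # assumed scale factor, a power of two

stored_directly = to_fp16(tiny_grad)            # underflows to zero
stored_scaled = to_fp16(tiny_grad * loss_scale) # representable in FP16
recovered = stored_scaled / loss_scale          # unscale in higher precision

print(stored_directly, recovered)
```

Scaling the loss before the backward pass shifts gradients into the representable range, and unscaling in a wider format before the weight update recovers the value to within FP16 rounding error.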
8. Real-World Training Examples
ResNet-50 Training
- Model size: ~25M parameters
- Batch size: 256-512
- Time: Days on a single GPU, hours on a multi-GPU server or TPU pod
- Memory: 16-32GB (1 GPU)
- Optimizer: SGD with momentum
BERT Training
- Model size: 110M-340M parameters
- Batch size: 256-1024
- Time: About 4 days on 16 TPU chips (BERT-base) or 64 TPU chips (BERT-large)
- Memory: 256GB+ (distributed)
- Optimizer: Adam (higher memory, better convergence)
GPT-3 Training
- Model size: 175B parameters
- Batch size: 3.2M tokens (distributed)
- Time: Weeks on a large GPU cluster (the original run used V100s; about a month is a typical estimate for 1024 A100s)
- Memory: Several TB total (distributed)
- Optimizer: Adam (with gradient checkpointing to save memory)
- Cost: Millions of dollars in compute
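The "weeks" figure for GPT-3 can be sanity-checked with a back-of-the-envelope estimate. Everything beyond the parameter count is an assumption here: about 6 FLOPs per parameter per token (2 forward, 4 backward, matching the 3x training rule), 300B training tokens as reported for GPT-3, 312 TFLOP/s peak FP16 tensor throughput per A100, and a guessed 40% sustained utilization.

```python
def training_days(num_params, num_tokens, num_gpus, peak_flops_per_gpu, utilization):
    """Estimate wall-clock days from total FLOPs and sustained cluster throughput."""
    total_flops = 6 * num_params * num_tokens   # ~2 forward + ~4 backward per param-token
    sustained = num_gpus * peak_flops_per_gpu * utilization
    seconds = total_flops / sustained
    return seconds / 86_400


days = training_days(
    num_params=175e9,
    num_tokens=300e9,          # assumed training set size in tokens
    num_gpus=1024,
    peak_flops_per_gpu=312e12, # assumed A100 peak, dense FP16 tensor cores
    utilization=0.4,           # assumed fraction of peak actually sustained
)
print(round(days, 1))
```

Under these assumptions the estimate lands at roughly four weeks; halving the utilization doubles the time, which is why sustained throughput matters as much as peak numbers.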
9. Training Hardware Checklist
- ✅ Support both forward and backward: Same compute units do both
- ✅ Enough memory: 5-10x model size for activations, gradients, optimizer
- ✅ Efficient gradient computation: Parallel systolic arrays
- ✅ Fast interconnect: For distributed gradient synchronization
- ✅ Mixed precision support: FP32 + FP16 for memory/speed tradeoff
- ✅ Gradient accumulation: Build large effective batches from small micro-batches
- ✅ Checkpointing capability: Recompute vs store trade-off
- ✅ Measure energy per iteration: Larger training runs consume more power
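The recompute-versus-store trade-off in the checkpointing item can be made concrete with a simple model. This sketch assumes a plain chain of equally sized layers, keeps one checkpoint per segment, and recomputes one segment at a time during the backward pass; the function name and the 64-layer example are hypothetical.

```python
import math


def checkpoint_memory(num_layers, segment_len):
    """Peak stored activations, in units of one layer's activation.

    One checkpoint is kept per segment, plus the activations of the single
    segment currently being recomputed.
    """
    num_checkpoints = math.ceil(num_layers / segment_len)
    return num_checkpoints + segment_len


num_layers = 64
store_everything = num_layers   # no checkpointing: keep every activation
best_segment = min(
    range(1, num_layers + 1),
    key=lambda k: checkpoint_memory(num_layers, k),
)
best_memory = checkpoint_memory(num_layers, best_segment)
print(store_everything, best_segment, best_memory)  # 64 8 16
```

In this model the best segment length is the square root of the layer count, cutting stored activations from 64 units to 16 at the cost of roughly one extra forward pass.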
Next (Day 5): Floating-point and precision formats.