1. Why Mixed Precision?
Problem: FP16 compute is faster (about 2x) but pure FP16 training loses precision (small gradients underflow, training can diverge)
Solution: FP16 for fast matrix multiplies, FP32 for accumulation and the master copy of the weights
Standard FP32 Training:
Forward: FP32 → FP32 (slower)
Backward: FP32 → FP32 (slower)
Weight update: FP32 (precise)
Mixed Precision:
Forward: FP32 weights → FP16 compute (2x faster)
Backward: FP16 gradients → FP32 (loss scaling prevents underflow)
Accumulation: FP32 (maintains precision)
Weight update: FP32 master weights (small updates are not rounded away)
Result: ~2x speedup, minimal accuracy loss
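
To see why the weight update stays in FP32, here is a minimal NumPy sketch. The learning rate, gradient value, and step count are illustrative assumptions, not taken from any real training run. It applies the same small update 1000 times to a weight stored in FP16 and to an FP32 copy.

```python
import numpy as np

# Illustrative values (assumptions): each update is lr * grad = 1e-4.
lr = 0.1
grad = 1e-3
steps = 1000

w_fp16 = np.float16(1.0)  # weight stored only in FP16
w_fp32 = np.float32(1.0)  # FP32 master copy of the same weight

for _ in range(steps):
    # Near 1.0, FP16 values are spaced about 5e-4 apart, so an update of
    # 1e-4 is rounded away every single time.
    w_fp16 = np.float16(w_fp16 - np.float16(lr) * np.float16(grad))
    # FP32 has enough mantissa bits to keep each small update.
    w_fp32 = np.float32(w_fp32 - np.float32(lr) * np.float32(grad))

print("FP16 weight:", float(w_fp16))  # stays at 1.0
print("FP32 weight:", float(w_fp32))  # close to 0.9
```

The FP16 weight never moves, while the FP32 copy accumulates all 1000 updates. This is the reason mixed precision keeps the weights and the update in FP32 even though the matrix multiplies run in FP16.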
2. Loss Scaling Strategy
Core issue: Small gradients lose precision in FP16 or underflow to zero
Solution: Multiply the loss by a large scale factor before backpropagation, then divide the gradients by the same factor before the weight update
Standard (FP16 only):
gradient = dL/dW (may be 1e-7, below the FP16 min normal ≈ 6.1e-5; it is stored as an imprecise subnormal, and values below about 3e-8 round to zero)
Loss Scaling:
scaled_loss = loss × 1024
scaled_gradient = dL/dW × 1024 (now ≈ 1e-4, representable in FP16)
actual_gradient = scaled_gradient / 1024 (unscaled in FP32, back to original)
update: W -= lr × actual_gradient
Hardware: Scaling is cheap (multiplying by a power of 2 only changes the floating-point exponent, so scaling and unscaling add no rounding error)
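
The arithmetic above can be checked with a short NumPy sketch. It uses the 1e-7 gradient and the scale of 1024 from the example. The second gradient of 1e-8 is an added assumption to show a value that disappears entirely. Casting to `np.float16` stands in for a gradient produced by an FP16 backward pass.

```python
import numpy as np

SCALE = np.float32(1024.0)  # power of two, as in the example above

def rel_error(approx, exact):
    """Relative error of approx with respect to exact."""
    return abs(float(approx) - float(exact)) / abs(float(exact))

true_grad = np.float32(1e-7)  # gradient from the example above
tiny_grad = np.float32(1e-8)  # hypothetical even smaller gradient

# Without scaling: store the gradients directly in FP16.
naive = np.float16(true_grad)    # lands in the subnormal range
flushed = np.float16(tiny_grad)  # too small even for a subnormal

# With scaling: the FP16 value is the scaled gradient; unscale in FP32.
recovered = np.float32(np.float16(true_grad * SCALE)) / SCALE
recovered_tiny = np.float32(np.float16(tiny_grad * SCALE)) / SCALE

print("unscaled 1e-7 in FP16:", float(naive), "rel. error", rel_error(naive, true_grad))
print("unscaled 1e-8 in FP16:", float(flushed))
print("scaled 1e-7 recovered:", float(recovered), "rel. error", rel_error(recovered, true_grad))
print("scaled 1e-8 recovered:", float(recovered_tiny), "rel. error", rel_error(recovered_tiny, tiny_grad))
```

Without scaling, the 1e-7 gradient survives only as a coarse subnormal with a relative error of roughly 19%, and the 1e-8 gradient becomes exactly zero. With the scale applied, both come back to well under 1% of their true values.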
3. Automatic Mixed Precision (AMP)
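Idea: The framework automatically chooses which ops run in FP16 and which stay in FP32
- TensorFlow: tf.keras.mixed_precision policies (automatic casting, loss scaling via LossScaleOptimizer)
- PyTorch AMP: torch.autocast plus GradScaler for loss scaling (built in since PyTorch 1.6; NVIDIA Apex was the earlier add-on)
- NVIDIA Automatic Mixed Precision (AMP): targets Tensor Cores on Volta and newer GPUs (TensorRT applies reduced precision at inference time)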
Rules:
- Matrix multiply → FP16 (fast, robust to precision loss)
- Reductions (sum, mean) → FP32 (accumulate precisely)
- Exponentials (softmax, sigmoid) → FP32 (numerical stability)
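
The last two rules can be illustrated with a small NumPy sketch. The inputs (4096 ones and a logit of 12) are chosen purely for illustration, and the explicit Python loop stands in for a reduction that keeps its accumulator in the given dtype.

```python
import numpy as np

def running_sum(values, dtype):
    """Sum values one at a time, keeping the accumulator in dtype."""
    acc = dtype(0)
    for v in values:
        acc = dtype(acc + dtype(v))
    return acc

ones = [1.0] * 4096
sum_fp16 = running_sum(ones, np.float16)  # stalls at 2048: 2048 + 1 rounds back to 2048
sum_fp32 = running_sum(ones, np.float32)  # 4096

# exp(12) is about 1.6e5, above the FP16 maximum of 65504.
logit = np.float32(12.0)
with np.errstate(over="ignore"):
    exp_fp16 = np.float16(np.exp(logit))  # overflows to inf
exp_fp32 = np.exp(logit)                  # finite in FP32

print("FP16 running sum:", float(sum_fp16))
print("FP32 running sum:", float(sum_fp32))
print("exp(12) in FP16:", float(exp_fp16), "| in FP32:", float(exp_fp32))
```

The FP16 accumulator stops growing once the sum reaches 2048, because FP16 cannot represent 2049. A single moderately large logit is also enough to overflow an FP16 exponential. Both cases are why AMP keeps these ops in FP32.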
4. Training Stability with Mixed Precision
Challenges:
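- Gradient overflow: dL/dW × scale > FP16 max (65504), producing inf/NaN gradients (rare; skip that step and lower the scale factor)
- Gradient underflow: dL/dW × scale < FP16 min (increase scale factor)
- Loss divergence: scale factor too high, so overflows become frequent; if the affected steps are not skipped, inf/NaN values reach the weights and the loss explodes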
Dynamic loss scaling: Automatically adjust the scale factor based on overflow frequency (lower it when an overflow occurs, raise it again after a run of overflow-free steps)
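
One common way to implement this is sketched below. The class name `DynamicLossScaler` and its method are hypothetical, and the policy (halve on overflow, double after a fixed number of clean steps) is an assumption about a typical scheme, not the only option. The default values mirror those used by PyTorch's GradScaler.

```python
import numpy as np

class DynamicLossScaler:
    """Hypothetical sketch: back off on overflow, grow after clean steps."""

    def __init__(self, init_scale=2.0**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = float(init_scale)
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._clean_steps = 0  # consecutive steps without inf/NaN

    def unscale_and_update(self, scaled_grads):
        """Return FP32 gradients, or None if this step must be skipped."""
        grads = [np.asarray(g, dtype=np.float32) / np.float32(self.scale)
                 for g in scaled_grads]
        overflow = any(not np.all(np.isfinite(g)) for g in grads)
        if overflow:
            # inf/NaN found: shrink the scale and skip the weight update.
            self.scale *= self.backoff_factor
            self._clean_steps = 0
            return None
        self._clean_steps += 1
        if self._clean_steps >= self.growth_interval:
            # A long run without overflow: try a larger scale again.
            self.scale *= self.growth_factor
            self._clean_steps = 0
        return grads

# Usage with a short growth interval so the behaviour is visible.
scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=2)
good = [np.array([0.5, 1.0], dtype=np.float16)]
bad = [np.array([np.inf, 1.0], dtype=np.float16)]

scaler.unscale_and_update(good)
scaler.unscale_and_update(good)
scale_after_growth = scaler.scale            # 2048.0 after two clean steps
skipped = scaler.unscale_and_update(bad)     # None: step skipped
scale_after_backoff = scaler.scale           # back to 1024.0
```

In a training loop, a return value of `None` means the optimizer step is skipped for that batch, which matches the "skip that batch" handling of overflow described above.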
5. Hardware Support
| Hardware | FP16/BF16 Speed | Matrix Unit | Mixed Precision Support |
|---|---|---|---|
| NVIDIA V100 | 2x FP32 | Tensor Cores (Volta) | FP16 only (no BF16 or TF32) |
| NVIDIA A100 | 2x FP32 | Tensor Cores (Ampere) | Full (FP16, BF16, TF32) |
| Google TPU v3 | 2x FP32 | Systolic array | Full (BF16) |
| Intel Gaudi | 8x FP32 | Matrix engine (Habana) | Full (BF16; FP8 on Gaudi2) |
6. Real-World Mixed Precision Results
ResNet-50 Training:
- FP32: 100 hours on 8 V100s
- Mixed (FP16): 50 hours (2x speedup)
- Accuracy: No loss (same final accuracy)
BERT Training:
- FP32: 96 hours on 32 TPUv3
- Mixed (BF16): 48 hours (2x speedup)
- Accuracy: < 0.1% difference
7. Mixed Precision Checklist
- ✅ Use AMP in framework: TensorFlow or PyTorch built-in
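- ✅ Enable tensor cores: Keep matrix dimensions (batch size, hidden sizes) at multiples of 8 so FP16 kernels can use the specialized hardware
- ✅ Set loss scale: Prefer dynamic scaling; for a static scale, start with 1024 and lower it on overflow (BF16 usually needs no loss scaling)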
- ✅ Monitor gradients: Check for NaN or inf values
- ✅ Test accuracy: Verify FP32 vs mixed precision convergence
- ✅ Measure speedup: Should be ~2x with minimal accuracy loss
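
For the "Monitor gradients" and "Set loss scale" items, a small diagnostic like the following can help. The function name `summarize_gradients` and the report keys are hypothetical. It assumes the gradients are available as NumPy arrays; a real training loop would first copy them out of the framework.

```python
import numpy as np

FP16_MIN_NORMAL = float(np.finfo(np.float16).tiny)  # about 6.1e-5
FP16_MAX = float(np.finfo(np.float16).max)          # 65504

def summarize_gradients(grads, scale=1.0):
    """Count NaN/inf entries and estimate how many would leave FP16 range.

    grads: iterable of arrays holding unscaled gradients.
    scale: candidate loss scale to evaluate.
    """
    flat = np.concatenate(
        [np.asarray(g, dtype=np.float32).ravel() for g in grads])
    finite = flat[np.isfinite(flat)]
    magnitude = np.abs(finite) * scale
    nonzero = magnitude[magnitude > 0]
    return {
        "num_nan": int(np.isnan(flat).sum()),
        "num_inf": int(np.isinf(flat).sum()),
        # Fraction of nonzero gradients that would be subnormal or zero in FP16.
        "frac_below_fp16_normal": float((nonzero < FP16_MIN_NORMAL).mean())
        if nonzero.size else 0.0,
        # Fraction of finite gradients that would overflow in FP16.
        "frac_above_fp16_max": float((magnitude > FP16_MAX).mean())
        if magnitude.size else 0.0,
    }

# Illustrative gradients (made-up values).
example_grads = [np.array([1e-7, 1e-3, np.nan]), np.array([np.inf, 1.0])]
report_unscaled = summarize_gradients(example_grads, scale=1.0)
report_scaled = summarize_gradients(example_grads, scale=1024.0)
print(report_unscaled)
print(report_scaled)
```

A large `frac_below_fp16_normal` suggests the loss scale is too low. Any nonzero `frac_above_fp16_max`, or NaN/inf counts in the unscaled gradients, suggests it is too high or that training is already unstable.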
Next (Day 9): Sparsity and pruning for further optimization.