AI Chip Day 8 Enhanced — Mixed Precision Training

1. Why Mixed Precision?

Problem: Pure FP16 training is roughly 2x faster than FP32, but the reduced precision lets small gradients underflow and can make training diverge.

Solution: Use FP16 for the fast matrix multiplies, and keep FP32 for gradient accumulation and weight updates.

Standard FP32 Training:
  Forward: FP32 → FP32 (slower)
  Backward: FP32 → FP32 (slower)
  Weight update: FP32 (precise)

Mixed Precision:
  Forward: FP32 → FP16 (2x faster compute)
  Backward: FP16 → FP32 (gradient scaling prevents underflow)
  Accumulation: FP32 (maintains precision)
  Weight update: FP32 (small, precise updates)

Result: ~2x speedup, minimal accuracy loss
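To see why the weight update stays in FP32, the sketch below applies the same small update 100 times to a weight held in FP16 and to an FP32 copy. It is a standard-library illustration only: the helpers `fp16` and `fp32` are hypothetical names that emulate rounding with `struct`, and the update size `1e-4` is an assumed value chosen to fall below FP16's resolution near 1.0.

```python
import struct

def fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

update = 1e-4   # assumed size of lr * gradient for one step
steps = 100

w_half = fp16(1.0)    # weight stored and updated in FP16
w_master = fp32(1.0)  # FP32 copy of the same weight

for _ in range(steps):
    # Just below 1.0 the FP16 grid spacing is about 4.9e-4, so subtracting
    # 1e-4 rounds straight back to 1.0 and the update is lost.
    w_half = fp16(w_half - update)
    # FP32 resolves the update, so it accumulates as expected.
    w_master = fp32(w_master - update)

print(w_half)    # 1.0
print(w_master)  # close to 0.99
```

The FP16 weight never moves, while the FP32 copy ends near 0.99. This is the reason mixed precision keeps accumulation and weight updates in FP32 even though the matrix multiplies run in FP16.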

2. Loss Scaling Strategy

Core issue: Small gradients in FP16 underflow to zero

Solution: Multiply the loss by a large scale factor before the backward pass, then divide the gradients by the same factor before the weight update

Standard (FP16 only):
  gradient = dL/dW (may be 1e-7, below the FP16 minimum normal value ≈ 6e-5, so it loses precision; values below ≈ 6e-8 flush to zero)

Loss Scaling:
  scaled_loss = loss × 1024
  scaled_gradient = dL/dW × 1024 (now ≈ 1e-4, representable in FP16)
  actual_gradient = scaled_gradient / 1024 (back to original, computed in FP32)
  update: W -= lr × actual_gradient

Hardware: Scaling is cheap (multiplying by a power of 2 only adjusts the floating-point exponent)
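The arithmetic above can be checked numerically. The following is a minimal sketch using only the standard library, where `fp16` and `roundtrip` are hypothetical helper names and the two gradient values are chosen for illustration: 1e-7 lands in FP16's subnormal range, and 1e-8 is below the smallest FP16 value.

```python
import struct

def fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

SCALE = 1024.0  # power of two, matching the scale factor used in this section

def roundtrip(grad: float, scale: float) -> float:
    """Store grad * scale in FP16, then unscale in higher precision."""
    return fp16(grad * scale) / scale

for grad in (1e-7, 1e-8):
    direct = fp16(grad)                 # gradient stored in FP16 without scaling
    recovered = roundtrip(grad, SCALE)  # gradient stored scaled, then unscaled
    print(f"{grad:.0e}: direct={direct:.3e}, with scaling={recovered:.3e}")
```

Without scaling, 1e-7 is stored with a large rounding error and 1e-8 becomes exactly zero. With a scale of 1024 both values come back to within a fraction of a percent of the original.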

3. Automatic Mixed Precision (AMP)

Idea: The framework automatically chooses which ops run in FP16 and which stay in FP32

TensorFlow Automatic Mixed Precision: Automatic casting
PyTorch AMP (originally NVIDIA Apex, now built into PyTorch): Automatic casting, dynamic loss scaling
NVIDIA Automatic Mixed Precision (AMP): Tensor Cores on Volta+ GPUs; TensorRT for FP16 inference

Rules:

Matrix multiply → FP16 (fast, robust to precision loss)
Reductions (sum, mean) → FP32 (accumulate precisely)
Exponentials (softmax, sigmoid) → FP32 (numerical stability)
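The rules for reductions and exponentials can be motivated with two small experiments. This is an illustrative standard-library sketch, not how any framework implements its cast lists: `fp16` is a hypothetical helper that emulates FP16 rounding, a Python float stands in for the wider accumulator, and the input values are assumptions.

```python
import math
import struct

def fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

values = [1.0] * 4096

# Reduction with an FP16 accumulator: the running sum is rounded after each add.
acc_half = 0.0
for v in values:
    acc_half = fp16(acc_half + v)

# Reduction with a wider accumulator (a Python float stands in for FP32 here).
acc_wide = 0.0
for v in values:
    acc_wide += v

print(acc_half)  # 2048.0: above 2048 the FP16 grid spacing is 2, so adding 1 is lost
print(acc_wide)  # 4096.0

# Exponentials: e^12 is about 162755, which exceeds the FP16 maximum of 65504.
FP16_MAX = 65504.0
exp_overflows = math.exp(12.0) > FP16_MAX
print(exp_overflows)  # True
```

The FP16 sum stalls at 2048 even though every input is exactly representable, and a moderate logit of 12 already overflows FP16 once exponentiated. Both effects disappear when the op runs in FP32.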

4. Training Stability with Mixed Precision

Challenges:

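Gradient overflow: dL/dW × scale > FP16 max (≈ 65504); rare, skip that batch and lower the scale factor
Gradient underflow: dL/dW × scale < FP16 min; increase the scale factor
Loss divergence: scale factor too high, so gradients overflow to inf/NaN and the loss explodes if the step is applied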

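Dynamic loss scaling: Automatically adjust the scale factor based on overflow frequency. A common policy halves the scale when a step overflows and doubles it again after a run of overflow-free steps.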
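One common way to implement dynamic loss scaling is sketched below. The class name `DynamicLossScaler` and its parameters are hypothetical and chosen for illustration; the policy (back off on overflow, grow after a number of clean steps) mirrors what framework scalers typically do, but the defaults here are assumptions rather than any library's values.

```python
import math

class DynamicLossScaler:
    """Minimal sketch: halve the scale on overflow, double it after clean steps."""

    def __init__(self, init_scale=1024.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.good_steps = 0  # consecutive steps without inf/NaN gradients

    def update(self, grads) -> bool:
        """Return True if the step should be applied, False if it is skipped."""
        if any(not math.isfinite(g) for g in grads):
            # Overflow: skip this step and reduce the scale.
            self.scale *= self.backoff_factor
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            # A long enough clean run: try a larger scale again.
            self.scale *= self.growth_factor
            self.good_steps = 0
        return True

# Usage with a short growth interval so the behaviour is visible in four steps.
scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=2)
history = [[0.1], [float("inf")], [0.2], [0.3]]
applied = [scaler.update(g) for g in history]
print(applied)       # [True, False, True, True]
print(scaler.scale)  # 1024.0 (halved to 512 on the overflow, doubled back after 2 clean steps)
```

Skipping the overflowing step keeps inf/NaN values out of the weights, and the periodic growth keeps the scale as large as the gradients allow, which protects against underflow.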

5. Hardware Support

Hardware	16-bit Speed	Matrix Unit	Mixed Precision Support
NVIDIA V100	2x FP32	Tensor Cores (Volta)	Partial (FP16 only)
NVIDIA A100	2x FP32	Tensor Cores (Ampere)	Full (FP16, TF32, BF16)
Google TPU v3	2x FP32	Systolic array (MXU)	Full (BF16)
Intel Gaudi	8x FP32	MME (matrix engine)	Full (BF16; FP8 on Gaudi 2)
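The table lists BF16 next to FP16, and the two formats trade range for precision differently. The sketch below emulates both with the standard library; `fp16` and `bf16` are hypothetical helpers, and the BF16 conversion is a simplified round-to-nearest-even on the FP32 bit pattern that ignores NaN handling.

```python
import struct

def fp16(x: float) -> float:
    """Round to IEEE 754 half precision (5 exponent bits, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def bf16(x: float) -> float:
    """Round to bfloat16 (8 exponent bits, 7 mantissa bits).

    Keeps the top 16 bits of the FP32 encoding with round-to-nearest-even.
    Simplified: NaN inputs are not handled specially.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties go to even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

tiny_grad = 1e-8
print(fp16(tiny_grad))  # 0.0: below the FP16 range
print(bf16(tiny_grad))  # about 1e-8: BF16 shares the FP32 exponent range

near_one = 1.001
print(fp16(near_one))   # 1.0009765625: FP16 has finer resolution
print(bf16(near_one))   # 1.0: BF16 cannot resolve the 0.001 step
```

BF16 keeps small gradients representable, which is why BF16 training usually runs without loss scaling, while FP16 offers more mantissa bits but needs the scaling described in section 2.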

6. Real-World Mixed Precision Results

ResNet-50 Training:

FP32: 100 hours on 8 V100s
Mixed (FP16): 50 hours (2x speedup)
Accuracy: No loss (same final accuracy)

BERT Training:

FP32: 96 hours on 32 TPUv3
Mixed (BF16): 48 hours (2x speedup)
Accuracy: < 0.1% difference

7. Mixed Precision Checklist

✅ Use AMP in framework: TensorFlow or PyTorch built-in
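✅ Enable tensor cores: Ensure the GPU actually uses its matrix hardware (on NVIDIA GPUs, sizing matrix dimensions as multiples of 8 helps FP16 ops map onto Tensor Cores)
✅ Set loss scale: Start with 1024, lower it if gradients overflow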
✅ Monitor gradients: Check for NaN or inf values
✅ Test accuracy: Verify FP32 vs mixed precision convergence
✅ Measure speedup: Should be ~2x with minimal accuracy loss
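For the "Monitor gradients" item, a small helper along the following lines can be run on the gradients of each step. It is a hypothetical sketch over a flat list of Python floats; in a real framework the same counts would be computed on tensors.

```python
import math

def gradient_report(grads):
    """Summarize a flat list of gradient values for mixed precision debugging."""
    n = len(grads)
    return {
        "nan": sum(math.isnan(g) for g in grads),  # NaN count
        "inf": sum(math.isinf(g) for g in grads),  # overflow count
        # Many exact zeros can indicate underflow (loss scale too low).
        "zero_fraction": sum(g == 0.0 for g in grads) / n if n else 0.0,
    }

report = gradient_report([0.0, 1e-3, float("inf"), float("nan")])
print(report)  # {'nan': 1, 'inf': 1, 'zero_fraction': 0.25}
```

Any NaN or inf count above zero means the step should be skipped and the loss scale reduced. A rising zero fraction suggests the scale is too small.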

Next (Day 9): Sparsity and pruning for further optimization.