
Mixed Precision Training

Optimize training with mixed precision. FP16 forward/backward, FP32 accumulation, loss scaling, and automatic mixed precision.

By EcrioniX · Published June 13, 2026 · ~3800 words · 11 min read

1. Why Mixed Precision?

Problem: Pure FP16 training is faster (about 2x) but loses precision: small gradients underflow to zero and training can diverge.

Solution: Use FP16 for the fast matrix multiplies and FP32 for gradient accumulation and weight updates.
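To see why accumulation is kept out of FP16, consider a running sum. The sketch below is a minimal stdlib-only illustration: `to_fp16` and `accumulate` are hypothetical helper names, FP16 rounding is emulated with the `struct` module's half-precision format, and a plain Python float (FP64) stands in for the FP32 accumulator.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value (IEEE 754 binary16)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def accumulate(values, fp16_accumulator):
    """Sum FP16 addends, optionally rounding the running total to FP16 too."""
    total = 0.0
    for v in values:
        total += to_fp16(v)          # each addend is an FP16 value
        if fp16_accumulator:
            total = to_fp16(total)   # running sum is stored back in FP16
    return total

values = [1.0] * 4096
fp16_total = accumulate(values, fp16_accumulator=True)
wide_total = accumulate(values, fp16_accumulator=False)

print(fp16_total)  # 2048.0: above 2048 the FP16 spacing is 2, so adding 1 is lost
print(wide_total)  # 4096.0
```

The FP16 accumulator stalls once the total reaches 2048, because neighbouring FP16 values are then 2 apart and each added 1 rounds away. The wider accumulator has no such problem at this scale.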

Standard FP32 Training:
  Forward: FP32 → FP32 (slower)
  Backward: FP32 → FP32 (slower)
  Weight update: FP32 (precise)

Mixed Precision:
  Forward: FP32 → FP16 (2x faster compute)
  Backward: FP16 → FP32 (gradient scaling prevents underflow)
  Accumulation: FP32 (maintains precision)
  Weight update: FP32 (small, precise updates)

Result: ~2x speedup, minimal accuracy loss
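The FP32 weight update matters because a typical update is tiny compared with the weight itself. The following sketch, under stated assumptions, applies the same small update 1000 times to an FP16-only weight and to an FP32 master copy. The helper names and the update size `1e-4` are illustrative, and rounding is emulated with the stdlib `struct` module rather than real hardware.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value (IEEE 754 binary16)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest FP32 value (IEEE 754 binary32)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

update = 1e-4              # hypothetical lr * gradient, small relative to the weight
w_fp16 = to_fp16(1.0)      # weight stored and updated only in FP16
w_master = to_fp32(1.0)    # FP32 master copy, as used in mixed precision

for _ in range(1000):
    w_fp16 = to_fp16(w_fp16 - to_fp16(update))
    w_master = to_fp32(w_master - to_fp32(update))

print(w_fp16)    # 1.0: every update rounds away, the weight never moves
print(w_master)  # close to 0.9, the intended result
```

Near 1.0 the closest FP16 value below is about 0.9995, so a step of 1e-4 always rounds back to 1.0. The FP32 copy has enough resolution to absorb each step.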

2. Loss Scaling Strategy

Core issue: Small gradients in FP16 underflow to zero

Solution: Multiply the loss by a large scale factor before the backward pass, then divide the gradients by the same factor before the weight update.

Standard (FP16 only):
  gradient = dL/dW (may be 1e-7, below the smallest normal FP16 value ≈ 6e-5, so it loses precision or flushes to zero)

Loss Scaling:
  scaled_loss = loss × 1024
  scaled_gradient = dL/dW × 1024 (now ≈ 1e-4, representable in FP16)
  actual_gradient = scaled_gradient / 1024 (unscaled in FP32, back to original)
  update: W -= lr × actual_gradient

Hardware: Scaling is cheap (multiplying by a power of 2 only adjusts the floating-point exponent)
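The arithmetic above can be checked with a small emulation. This is a sketch, not a training loop: `to_fp16` and `fp16_gradient` are hypothetical helpers, FP16 storage is emulated with the stdlib `struct` module, and the scale is applied directly to the gradient, which models the end result of scaling the loss (the chain rule carries the factor through to every gradient).

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value (IEEE 754 binary16)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

SCALE = 1024.0  # power-of-two scale factor, as in the listing above

def fp16_gradient(true_grad, scale=1.0):
    """Store scale * true_grad in FP16, then unscale in higher precision."""
    stored = to_fp16(true_grad * scale)  # what the FP16 backward pass would hold
    return stored / scale                # unscaling happens outside FP16

for g in (1e-7, 1e-8):
    plain = fp16_gradient(g)
    scaled = fp16_gradient(g, SCALE)
    print(f"true={g:.0e}  FP16 only={plain:.3e}  with loss scaling={scaled:.3e}")
```

Without scaling, 1e-7 survives only as a coarse subnormal (roughly 19% off) and 1e-8 becomes exactly zero. With a scale of 1024, both are recovered to within about 0.1% or better after unscaling.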

3. Automatic Mixed Precision (AMP)

Idea: The framework automatically chooses which ops run in FP16 and which stay in FP32.
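As a rough sketch of the idea, the snippet below assigns a precision to each op from two small lists. The op lists and the names `FP16_OPS`, `FP32_OPS`, `choose_precision`, and `plan_precisions` are assumptions for illustration only. Real frameworks ship their own, much longer tables, though the general split (matrix-multiply-like ops in FP16, reductions and losses in FP32) is common.

```python
# Illustrative op lists (assumptions, not any framework's actual tables).
FP16_OPS = {"matmul", "conv2d", "linear"}            # compute-bound, tolerant of FP16
FP32_OPS = {"softmax", "log", "exp", "sum", "loss"}  # range- or accumulation-sensitive

def choose_precision(op_name, input_precision):
    """Pick the precision for one op; unlisted ops follow their input."""
    if op_name in FP16_OPS:
        return "fp16"
    if op_name in FP32_OPS:
        return "fp32"
    return input_precision

def plan_precisions(ops, input_precision="fp32"):
    """Walk a sequence of ops and record the precision each would run in."""
    plan, current = [], input_precision
    for op in ops:
        current = choose_precision(op, current)
        plan.append((op, current))
    return plan

plan = plan_precisions(["linear", "relu", "matmul", "softmax", "loss"])
for op, precision in plan:
    print(f"{op:8s} -> {precision}")
```

Here `relu` is not in either list, so it simply inherits FP16 from the preceding `linear`, while `softmax` and `loss` are promoted back to FP32.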

Rules:

4. Training Stability with Mixed Precision

Challenges:

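Dynamic loss scaling: Automatically adjust the scale factor based on how often gradients overflow. Lower it when an overflow is detected, and raise it again after a run of overflow-free steps.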
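One common way to implement this is sketched below. The class `DynamicLossScaler` and its method names are hypothetical. The default parameter values (initial scale 2^16, halve on overflow, double after 2000 clean steps) loosely follow the defaults of PyTorch's `GradScaler`, but this is a simplified stand-in, not that library's implementation.

```python
import math

class DynamicLossScaler:
    """Toy dynamic loss scaler: back off on overflow, grow after a clean streak."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._clean_steps = 0

    def unscale_and_update(self, scaled_grads):
        """Return unscaled gradients, or None if this step must be skipped."""
        if any(math.isinf(g) or math.isnan(g) for g in scaled_grads):
            # Overflow: shrink the scale and skip the weight update.
            self.scale *= self.backoff_factor
            self._clean_steps = 0
            return None
        unscaled = [g / self.scale for g in scaled_grads]
        self._clean_steps += 1
        if self._clean_steps >= self.growth_interval:
            # Long enough without overflow: try a larger scale next time.
            self.scale *= self.growth_factor
            self._clean_steps = 0
        return unscaled

# Small demo with a short growth interval so the behaviour is visible.
scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=3)
skipped = scaler.unscale_and_update([float("inf"), 0.5])  # overflow step
scale_after_overflow = scaler.scale
for _ in range(3):
    grads = scaler.unscale_and_update([512.0, 256.0])     # clean steps
print(skipped, scale_after_overflow, grads, scaler.scale)
```

In the demo the overflow halves the scale from 1024 to 512 and the step is skipped. Three clean steps then double it back to 1024.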

5. Hardware Support

Hardware      | FP16 Speed | Tensor Cores | Mixed Precision Support
NVIDIA V100   | 2x FP32    | Yes (Volta)  | Partial
NVIDIA A100   | 2x FP32    | Yes (Ampere) | Full (TF32, BF16)
Google TPU v3 | 2x FP32    | Systolic     | Full (BF16)
Intel Gaudi   | 8x FP32    | Habana Gaudi | Full (BF16, FP8)

6. Real-World Mixed Precision Results

ResNet-50 Training:

BERT Training:

7. Mixed Precision Checklist

Next (Day 9): Sparsity and pruning for further optimization.