Mixed-Precision Training

Mixed-precision training uses different precisions for different parts of the model: FP32 where stability matters, FP16/BF16 where speed matters. This page covers how NVIDIA and Google do it.

The Problem: Pure FP16 Training

If you train the entire model in FP16, gradients underflow:

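FP16 range: largest finite value 65504 (about 6.5 × 10^4); smallest normal value about 6.1 × 10^-5; smallest subnormal about 6 × 10^-8
Backward-pass gradients: can be 10^-8 to 10^-6
Result: the smallest gradients flush to zero and the rest keep only a few bits of precision → weights don't update!
Solution: mixed-precision training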
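
The underflow is easy to check without a GPU. The sketch below is only an illustration: it uses the half-precision format code `e` from Python's standard `struct` module to round a value to FP16, and the helper name `to_fp16` is an assumption made for this example, not part of any training library.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

FP16_MAX = 65504.0               # largest finite FP16 value
FP16_MIN_NORMAL = 2.0 ** -14     # about 6.1e-5
FP16_MIN_SUBNORMAL = 2.0 ** -24  # about 6.0e-8

# Gradient magnitudes from the range quoted above, plus one safe value.
for grad in (1e-4, 1e-6, 1e-7, 1e-8):
    stored = to_fp16(grad)
    print(f"gradient {grad:.0e} is stored as {stored:.3e}")
```

A gradient of 1e-8 comes back as exactly 0, and 1e-7 comes back as roughly 1.19e-7 because only a couple of subnormal bits are left to represent it.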

Standard Mixed-Precision Approach

Keep FP32 master weights, compute in FP16/BF16.

Forward pass:
    weights_fp16 = weights_fp32.to(fp16)
    activations_fp16 = forward_pass(inputs, weights_fp16)
    loss = loss_function(activations_fp16, targets)

Loss scaling (prevents gradient underflow):
    scaled_loss = loss * 2^15    (scale up)

Backward pass:
    scaled_grads_fp16 = backward_pass(scaled_loss)

Update:
    gradients_fp32 = scaled_grads_fp16.to(fp32) / 2^15    (cast up first, then scale back down)
    weights_fp32 -= lr * gradients_fp32
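
The steps above can be sketched for a single scalar weight. This is a toy under stated assumptions, not a real training loop: `to_fp16` and `to_fp32` simulate storage precision with the standard `struct` module, and `grad_fn` is a hypothetical stand-in for the forward and backward passes that simply returns the exact gradient of the unscaled loss.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round x to the nearest FP32 value (Python floats are 64-bit)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def mixed_precision_step(master_w, grad_fn, lr, loss_scale=2.0 ** 15):
    """One SGD update of a scalar weight; returns (new master weight, FP32 gradient)."""
    w_fp16 = to_fp16(master_w)                                # FP16 working copy of the weight
    scaled_grad_fp16 = to_fp16(grad_fn(w_fp16) * loss_scale)  # gradient of the scaled loss, stored in FP16
    grad_fp32 = to_fp32(scaled_grad_fp16) / loss_scale        # cast up first, then unscale
    new_master_w = to_fp32(master_w - lr * grad_fp32)         # update the FP32 master weight
    return new_master_w, grad_fp32

w0 = to_fp32(1e-3)
tiny_grad = lambda w: 1e-8  # hypothetical constant gradient, small enough to underflow in FP16

w_scaled, g_scaled = mixed_precision_step(w0, tiny_grad, lr=1.0)
w_plain, g_plain = mixed_precision_step(w0, tiny_grad, lr=1.0, loss_scale=1.0)
print("with scaling:   ", g_scaled, w_scaled)
print("without scaling:", g_plain, w_plain)

# Why the master copy stays in FP32: a small update to a weight near 1.0
# is rounded away in FP16 but survives in FP32.
print(to_fp16(1.0 - 1e-4), to_fp32(1.0 - 1e-4))
```

Without scaling the gradient is stored as 0 and the weight never moves. With scaling the recovered gradient is close to 1e-8 and the master weight changes.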

Why Loss Scaling Works

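| Step | Without Scaling | With Scaling (2^15) |
|---|---|---|
| Gradient magnitude | 1e-7 (underflow!) | 3.3e-3 (safe) |
| FP16 represents it? | Only as a subnormal, about 1.19e-7 (19% error); below about 3e-8 → 0 | YES |
| After backward | Badly rounded or 0 (lost) | Accurate |
| After scaling back | n/a | 1e-7 (correct) |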
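
To put numbers on this, the sketch below measures how much of a gradient survives a round trip through FP16 with and without a scale of 2^15. It is an illustration only: FP16 storage is simulated with the standard `struct` module, and the helper names are assumptions made for this example.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round x to FP16; values too large for FP16 become +/- infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

def recovered_gradient(grad: float, scale: float) -> float:
    """Store grad * scale in FP16, then divide the scale back out in higher precision."""
    return to_fp16(grad * scale) / scale

def relative_error(grad: float, scale: float) -> float:
    return abs(recovered_gradient(grad, scale) - grad) / abs(grad)

SCALE = 2.0 ** 15
for grad in (1e-8, 1e-7, 1e-6):
    print(f"{grad:.0e}: unscaled error {relative_error(grad, 1.0):.2%}, "
          f"scaled error {relative_error(grad, SCALE):.4%}")

# The scale cannot be arbitrarily large: a gradient of 4.0 times 2^15 exceeds 65504.
print(recovered_gradient(4.0, SCALE))
```

The last line prints `inf`, which is why practical implementations adjust the scale at run time instead of fixing it.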

Hardware Support

NVIDIA Automatic Mixed Precision (AMP)

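Tensor Cores: H100 has special hardware for mixed-precision matrix math:
- TF32 (stored in 32 bits, 19 bits used: 1 sign, 8 exponent, 10 mantissa)
- FP8 (8-bit float)

AMP is the software side: it casts each operation to a suitable dtype automatically.

Usage: wrap the forward pass in `torch.cuda.amp.autocast()` (spelled `torch.amp.autocast("cuda")` in recent PyTorch releases).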
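
To get a feel for what 19 effective bits means, TF32 can be approximated on the CPU by keeping only the top 10 mantissa bits of an FP32 value. This is a rough sketch under assumptions: the helper `to_tf32` is invented for this example, and it truncates, whereas real hardware may round differently.

```python
import struct

def to_tf32(x: float) -> float:
    """Approximate TF32 by zeroing the low 13 mantissa bits of the FP32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep sign (1 bit), exponent (8 bits), top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(to_tf32(1.0 + 2.0 ** -10))  # smallest step above 1.0 that TF32 keeps
print(to_tf32(1.0 + 2.0 ** -11))  # finer detail is dropped
print(to_tf32(1e-30))             # FP32-sized exponent: tiny values do not underflow
```

TF32 has the same precision as FP16 (10 mantissa bits) but the exponent range of FP32, so small gradients are not flushed to zero.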

Google TPU Mixed-Precision

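TPU v4 multiplies in BF16 and accumulates and stores results in FP32. BF16 keeps FP32's 8-bit exponent, so it covers the same range as FP32 and usually needs no loss scaling.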
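
The split between BF16 inputs and FP32 accumulation can be imitated in plain Python. The sketch below is an assumption-laden illustration, not TPU code: `to_bf16` rounds an FP32 bit pattern to its top 16 bits (round to nearest, ties to even, NaN handling ignored), and `dot` is a hypothetical helper that lets the accumulator precision be swapped.

```python
import struct

def to_fp32(x: float) -> float:
    """Round x to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x: float) -> float:
    """Round x to BF16 by keeping the top 16 bits of its FP32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def dot(a, b, accumulate):
    """Dot product with BF16 inputs; `accumulate` rounds the running sum."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = accumulate(acc + to_bf16(x) * to_bf16(y))
    return acc

ones = [1.0] * 1000
print(dot(ones, ones, accumulate=to_fp32))  # FP32 accumulator
print(dot(ones, ones, accumulate=to_bf16))  # BF16 accumulator stalls once the sum needs a 9th bit
print(to_bf16(1e-8))                        # tiny values survive, unlike in FP16
```

With an FP32 accumulator the sum reaches 1000. With a BF16 accumulator it stops growing at 256, because 257 is not representable in 8 significant bits.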

Which Layers Need Which Precision?

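| Layer | Precision | Reason |
|---|---|---|
| Attention (softmax) | FP32 | Softmax needs stability; the attention matmuls themselves can run in FP16/BF16 |
| Linear (weights) | FP16/BF16 | Doesn't hurt accuracy |
| Layer norm | FP32 | Normalizes by variance |
| Loss | FP32 | Scaling needs headroom |
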
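
A small experiment shows why reductions such as the variance in layer norm are kept in FP32. The values are picked for illustration only, FP16 is simulated with the standard `struct` module, and `sum_of_squares` is a hypothetical helper, not how any framework implements layer norm.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round x to FP16; values too large for FP16 become +/- infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

def to_fp32(x: float) -> float:
    """Round x to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def sum_of_squares(values, cast):
    """Accumulate sum(v * v), rounding every intermediate result with `cast`."""
    acc = 0.0
    for v in values:
        acc = cast(acc + cast(v * v))
    return acc

activations = [300.0, -250.0, 280.0, 310.0]  # illustrative values
print(sum_of_squares(activations, to_fp16))  # 300 squared already exceeds 65504
print(sum_of_squares(activations, to_fp32))
print(to_fp16(math.exp(12.0)))               # exp of a logit of 12 also overflows FP16
```

In FP16 the sum of squares overflows to infinity, while in FP32 it is exact. Matrix multiplies tolerate FP16 inputs much better because the hardware accumulates their partial sums in higher precision.
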
Example: PyTorch Mixed-Precision
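    from torch.cuda.amp import autocast, GradScaler

    # model, loss_fn, optimizer, and dataloader are assumed to be defined already
    scaler = GradScaler()

    for inputs, targets in dataloader:
        optimizer.zero_grad()              # clear gradients from the previous step
        with autocast():                   # forward in FP16 where it is safe
            output = model(inputs)
            loss = loss_fn(output, targets)
        scaler.scale(loss).backward()      # backward in FP16 on the scaled loss
        scaler.step(optimizer)             # unscale, skip the step on inf/NaN, else update in FP32
        scaler.update()                    # adjust the scale for the next iteration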
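
`GradScaler` does not use a fixed 2^15. It adjusts the scale during training. The class below is a toy sketch of that grow and back-off policy, not PyTorch's implementation: the class name and method are invented for this example, and the default values are assumed to mirror the documented `GradScaler` defaults.

```python
import math

class DynamicLossScaler:
    """Toy loss-scale policy: back off on overflow, grow after a run of clean steps."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.clean_steps = 0

    def update(self, scaled_grads) -> bool:
        """Return True if the optimizer step should run for these gradients."""
        if any(not math.isfinite(g) for g in scaled_grads):
            self.scale *= self.backoff_factor   # overflow: shrink the scale
            self.clean_steps = 0
            return False                        # and skip this optimizer step
        self.clean_steps += 1
        if self.clean_steps >= self.growth_interval:
            self.scale *= self.growth_factor    # long clean run: try a larger scale
            self.clean_steps = 0
        return True

scaler = DynamicLossScaler(growth_interval=2)   # short interval so the demo is quick
for grads in ([math.inf, 1.0], [0.5], [0.25]):
    ok = scaler.update(grads)
    print(f"step applied: {ok}, next scale: {scaler.scale}")
```

Skipping the update on overflow costs one batch, but it keeps an infinite or NaN gradient from ever reaching the FP32 master weights.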

Day 15: Production quantization in real chips (Apple, Google, NVIDIA).