Two Approaches
1. Post-Training Quantization (PTQ)
When: You have a trained FP32 model and need to quantize it quickly.
How:
Step 1: Take FP32 weights
Step 2: Collect activation statistics (min/max values) on validation set
Step 3: Compute scale factors per layer
Step 4: Quantize weights to INT8
Step 5: Test on validation set
Pro: Fast (no retraining)
Con: Typically 0.5-2% accuracy loss
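The weight-quantization steps above can be sketched in a few lines. This is a minimal illustration, assuming symmetric per-tensor quantization and a made-up weight list; `compute_scale`, `quantize`, and `dequantize` are hypothetical helper names, not any library's API. Activation calibration is left out here.

```python
def compute_scale(values, num_bits=8):
    """Symmetric per-tensor scale: the largest magnitude maps to the top level."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Round each value to the nearest integer level and clamp to the INT8 range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map integer levels back to approximate float values."""
    return [q * scale for q in q_values]


# Illustrative FP32 weights (assumed values, not from a real model)
weights = [0.423, -1.27, 0.081, 0.912, -0.337]

scale = compute_scale(weights)
q_weights = quantize(weights, scale)
restored = dequantize(q_weights, scale)

# Rounding error per weight is bounded by half a quantization step
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q_weights, max_error)
```

The round-trip error is what Step 5 measures indirectly: if it is too large for sensitive layers, validation accuracy drops.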
2. Quantization-Aware Training (QAT)
When: You need < 0.5% accuracy loss, or model is sensitive to quantization.
How:
Step 1: Start with trained FP32 model
Step 2: Insert fake quantization ops in forward pass
(simulate INT8 rounding without actually quantizing)
Step 3: Fine-tune on training data (10-20 epochs, lower learning rate)
Step 4: Weights learn to be quantization-friendly
Step 5: Export to actual INT8 model
Pro: Better accuracy retention (< 0.5% loss)
Con: Requires retraining, slower
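A rough sketch of what a fake quantization op does in the forward pass, assuming a symmetric INT8 grid and a hand-picked scale. `fake_quantize` and `linear_forward` are hypothetical names for illustration. Gradient handling is omitted; training frameworks usually pass gradients through the rounding step with a straight-through estimator.

```python
def fake_quantize(x, scale, qmax=127):
    """Snap x to the INT8 grid, then return it as a float again."""
    q = max(-qmax, min(qmax, round(x / scale)))
    return q * scale


def linear_forward(inputs, weights, scale):
    """Dot product in float arithmetic using weights that carry rounding error."""
    fq_weights = [fake_quantize(w, scale) for w in weights]
    return sum(i * w for i, w in zip(inputs, fq_weights))


# Assumed toy values for illustration
weights = [0.503, -0.248, 0.120]
inputs = [1.0, 2.0, 3.0]
scale = 0.01  # hypothetical per-tensor scale

fp32_out = sum(i * w for i, w in zip(inputs, weights))
qat_out = linear_forward(inputs, weights, scale)

# The gap between the two outputs is the error the fine-tuning loss gets to see
print(fp32_out, qat_out)
```

Because the loss is computed on `qat_out`, fine-tuning nudges weights toward values that survive rounding, which is the effect described in Step 4.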
Calibration: Finding the Right Scale
The hardest part: picking the scale factor that minimizes quantization error.
Min-Max Calibration
Simple: scale = max(|activations|) / 127
Problem: A single outlier stretches the range and wastes most of the INT8 levels
Activations: [0.001, 0.05, 0.02, ..., 5.0] (one extreme outlier)
Scale = 5.0 / 127 = 0.0394
Result: 0.001 / 0.0394 ≈ 0.03, which rounds to INT8(0) - lost precision!
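The outlier problem can be reproduced directly. The sketch below uses only the four activation values listed above (the elided ones are left out) and plain symmetric min-max scaling.

```python
# Activation values from the example above; the elided entries are omitted
activations = [0.001, 0.05, 0.02, 5.0]

# Min-max calibration: the largest magnitude maps to level 127
scale = max(abs(a) for a in activations) / 127

# Quantize each activation to its nearest INT8 level
quantized = [max(-127, min(127, round(a / scale))) for a in activations]

# Levels actually used by the non-outlier values
small_levels = sorted(set(quantized[:-1]))

print(scale, quantized, small_levels)
```

Everything except the outlier collapses onto the lowest levels, so values that differ by a factor of 20 or more become nearly indistinguishable after quantization.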
Entropy Calibration (Better)
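Use KL divergence to find the clipping threshold that best preserves the activation distribution. Extreme outliers are clipped instead of stretching the range.
Approach: Test different clipping thresholds on validation set (scale = threshold / 127)
- Threshold = 2.0: KL_div = 0.05
- Threshold = 3.0: KL_div = 0.03 ← Best!
- Threshold = 4.0: KL_div = 0.08
Pick threshold = 3.0 (minimizes information loss), giving scale = 3.0 / 127 ≈ 0.0236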
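The search can be sketched as a histogram-based procedure: for each candidate clipping point, compare the clipped reference distribution with a coarsened version of it, and keep the candidate with the lowest KL divergence. This is a simplified illustration in the spirit of common entropy calibrators, not any specific tool's implementation. The function names, the 8-bin histogram, and the 2 quantization levels are assumptions chosen so the result can be checked by hand. Real INT8 calibration would use 128 levels and far more bins.

```python
import math


def normalize(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL(P || Q); infinite if Q has no mass where P does."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total


def quantize_histogram(hist, num_levels):
    """Merge bins into num_levels groups, then spread each group's
    count evenly over its non-empty bins."""
    n = len(hist)
    q = [0.0] * n
    for level in range(num_levels):
        start = level * n // num_levels
        stop = (level + 1) * n // num_levels
        group = hist[start:stop]
        nonzero = sum(1 for c in group if c > 0)
        if nonzero:
            avg = sum(group) / nonzero
            for j in range(start, stop):
                if hist[j] > 0:
                    q[j] = avg
    return q


def entropy_calibrate(hist, bin_width, num_levels):
    """Return (threshold, kl) for the clipping point with the lowest KL."""
    best_threshold, best_kl = None, math.inf
    for i in range(num_levels, len(hist) + 1):
        if sum(hist[:i]) == 0:
            continue
        p = list(hist[:i])
        p[-1] += sum(hist[i:])  # clipped outliers land in the last kept bin
        q = quantize_histogram(hist[:i], num_levels)
        kl = kl_divergence(normalize(p), normalize(q))
        if kl < best_kl:
            best_threshold, best_kl = i * bin_width, kl
    return best_threshold, best_kl


# Toy histogram of |activation| over [0, 5.0]: bulk near zero, one outlier bin
hist = [40, 30, 20, 9, 0, 0, 0, 1]
bin_width = 5.0 / len(hist)

best_threshold, best_kl = entropy_calibrate(hist, bin_width, num_levels=2)
print(best_threshold, best_kl)
```

On this toy histogram the search keeps only the first three bins, so the chosen threshold sits well below the outlier. For INT8 the scale would then be the threshold divided by 127.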
Comparison Table
| Method | Accuracy Loss | Time Required | Effort | Best For |
|---|---|---|---|---|
| PTQ | 0.5-2% | Minutes | Low | Fast prototypes |
| QAT | 0-0.5% | Hours | High | Production |
| Per-channel quant | 0.1-0.5% | Hours | Medium | CNNs, Transformers |
Apple Neural Engine: Uses QAT. Models are quantized offline with retraining and deployed as INT8 on A-series chips, with little to no accuracy loss relative to FP32 after fine-tuning.
Google TPU: Supports both PTQ (fast) and QAT (accurate). Production models use QAT.
Day 14: Mixed-precision: when to use different precisions in the same model.