1. Quantization Overview
Goal: Reduce model size and compute by using lower precision (INT8 instead of FP32)
Trade-off: Some accuracy loss in exchange for 4x less memory and up to about 4x faster compute (the actual speedup depends on the hardware's INT8 support)
Key insight: Neural networks are robust to quantization. ResNet-50 FP32→INT8 loses < 0.5% accuracy.
2. Post-Training Quantization (PTQ)
Process: Train the model in FP32, then convert it to INT8 without retraining, using a small calibration set to choose the quantization ranges
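A minimal sketch of the conversion step, assuming symmetric per-tensor INT8 and plain NumPy. The names `quantize_int8` and `dequantize` and the random stand-in weights are illustrative assumptions, not part of any library or of these notes.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization of a float array to INT8."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map INT8 codes back to approximate float values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
# Stand-in for trained FP32 weights (assumption for illustration).
weights = rng.normal(0.0, 0.1, size=(64, 64)).astype(np.float32)

q_weights, scale = quantize_int8(weights)
max_error = float(np.max(np.abs(weights - dequantize(q_weights, scale))))

print(weights.nbytes // q_weights.nbytes)  # storage ratio FP32 : INT8
print(max_error, scale / 2)                # rounding error is bounded by half a step
```

No retraining happens here: the only "fitting" is choosing `scale` from the observed weight range.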
3. Quantization-Aware Training (QAT)
Better accuracy: Simulate quantization during training, let model adapt
Process:
- Start with pre-trained FP32 model
- Insert fake quantization ops (simulate rounding)
- Train for a few epochs (with a lower learning rate than in the initial training)
- Weights learn to work well in INT8 space
- Convert to an actual INT8 model for the target hardware
Trade-off: Slower than PTQ (hours vs minutes), but better accuracy
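The process above can be sketched on a toy linear model. This is an assumption-laden illustration: `fake_quantize`, the toy data, and the use of a straight-through estimator (treating rounding as the identity in the backward pass) are choices made for this example, not details given in the notes.

```python
import numpy as np

def fake_quantize(w, num_bits=8):
    """Round to the INT grid but keep float dtype (simulated quantization)."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))
true_w = rng.normal(size=8)
y = x @ true_w

# Stand-in for pre-trained FP32 weights (assumption): true weights plus noise.
w = true_w + rng.normal(scale=0.5, size=8)
lr = 0.05
losses = []
for step in range(200):
    w_q = fake_quantize(w)                 # forward pass sees quantized weights
    err = x @ w_q - y
    losses.append(float(np.mean(err ** 2)))
    grad_wq = 2.0 * x.T @ err / len(x)     # gradient w.r.t. the quantized weights
    w -= lr * grad_wq                      # straight-through: d(w_q)/d(w) treated as 1

print(losses[0], losses[-1])
```

The float weights `w` stay in full precision during training; only the forward pass uses their quantized version, which is what lets the model adapt to the rounding.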
4. Symmetric vs Asymmetric Quantization
| Type | Formula | Range | Accuracy | Hardware |
|---|---|---|---|---|
| Symmetric | q = round(x / scale) | [-127, 127] | Slightly lower on skewed data | Simpler (no zero-point) |
| Asymmetric | q = round(x / scale) + z | [-128, 127] | Better on skewed data (uses full range) | Needs zero-point calc |
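Example: ReLU output is [0, 6.5] (all non-negative)
- Symmetric: Range [-6.5, 6.5] maps to [-127, 127], but values only land in [0, 127]; the negative half is wasted
- Asymmetric: Range [0, 6.5] maps to [-128, 127], uses all 256 levels, so the step size is about half as large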
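A short sketch of the ReLU example, assuming a signed INT8 target of [-128, 127] for the asymmetric case and the convention q = round(x / scale) + z. The helper names are hypothetical.

```python
import numpy as np

def symmetric_params(x_min, x_max):
    """Scale for a range centered on zero; zero-point is always 0."""
    scale = max(abs(x_min), abs(x_max)) / 127.0
    return scale, 0

def asymmetric_params(x_min, x_max, q_min=-128, q_max=127):
    """Scale and zero-point that map [x_min, x_max] onto [q_min, q_max]."""
    scale = (x_max - x_min) / (q_max - q_min)
    zero_point = int(round(q_min - x_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, q_min, q_max):
    return np.clip(np.round(x / scale) + zero_point, q_min, q_max).astype(np.int8)

x = np.linspace(0.0, 6.5, 1000)           # ReLU-like activations
sym_scale, _ = symmetric_params(0.0, 6.5)
asym_scale, zp = asymmetric_params(0.0, 6.5)

sym_q = quantize(x, sym_scale, 0, -127, 127)
asym_q = quantize(x, asym_scale, zp, -128, 127)

print(sym_q.min(), sym_q.max())    # 0 127
print(asym_q.min(), asym_q.max())  # -128 127
print(sym_scale / asym_scale)      # ratio of step sizes, about 2
```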
5. Calibration Methods
MinMax Calibration
Simple: Use the min/max of the activations observed on the calibration set
Problem: Outliers can skew the range. One sample with max=100 while the others lie in [0, 5] stretches the range so that most quantization levels go unused.
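To make the outlier problem concrete, here is a small sketch using synthetic activations (the uniform data and the single outlier at 100 are assumptions mirroring the example above; `minmax_step` is a hypothetical helper).

```python
import numpy as np

def minmax_step(activations, num_levels=256):
    """Quantization step size when the range is set by the observed min/max."""
    lo, hi = float(activations.min()), float(activations.max())
    return (hi - lo) / (num_levels - 1)

rng = np.random.default_rng(0)
typical = rng.uniform(0.0, 5.0, size=10_000)
with_outlier = np.append(typical, 100.0)

clean_step = minmax_step(typical)          # roughly 5 / 255
skewed_step = minmax_step(with_outlier)    # roughly 100 / 255

# Number of the 256 levels that actually cover the typical [0, 5] values.
levels_used = int(np.ceil(5.0 / skewed_step)) + 1
print(clean_step, skewed_step, levels_used)
```

With the outlier included, only a small fraction of the 256 levels ends up covering the values that make up nearly all of the data.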
KL Divergence Calibration
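Better: Find the clipping range that minimizes the KL divergence between the FP32 activation distribution and its INT8-quantized version
Process: Sweep candidate clipping thresholds, measure the KL divergence for each, and choose the threshold with the lowest value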
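The sweep can be sketched with histograms. This is a simplified illustration, not a reference implementation: the bin count, the stride, the 128 magnitude levels, and the names `kl_divergence` and `kl_calibrate` are assumptions chosen for this example.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) for two unnormalized histograms of equal length."""
    p = p / p.sum()
    q = np.maximum(q / q.sum(), eps)   # clamp so empty Q bins give a finite penalty
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def kl_calibrate(values, num_bins=2048, num_levels=128, stride=16):
    """Return the clipping threshold (on |x|) with the lowest KL divergence."""
    abs_values = np.abs(values)
    hist, edges = np.histogram(abs_values, bins=num_bins,
                               range=(0.0, float(abs_values.max())))
    hist = hist.astype(np.float64)
    best_kl, best_threshold = np.inf, float(edges[-1])
    for i in range(num_levels, num_bins + 1, stride):
        # Reference P: first i bins, clipped outliers folded into the last bin.
        p = hist[:i].copy()
        p[-1] += hist[i:].sum()
        # Candidate Q: merge the i bins into num_levels levels, then spread
        # each level's mass evenly over that level's non-empty bins.
        q = np.zeros(i)
        for group in np.array_split(np.arange(i), num_levels):
            nonzero = hist[group] > 0
            if nonzero.any():
                q[group[nonzero]] = hist[group].sum() / nonzero.sum()
        kl = kl_divergence(p, q)
        if kl < best_kl:
            best_kl, best_threshold = kl, float(edges[i])
    return best_threshold

rng = np.random.default_rng(0)
activations = np.append(rng.normal(0.0, 1.5, size=10_000), 100.0)
threshold = kl_calibrate(activations)
print(threshold, float(np.max(np.abs(activations))))
```

On this synthetic data the chosen threshold sits far below the single outlier at 100, which is the behavior MinMax calibration cannot provide.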
Entropy-Based Calibration
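Closely related: Choose the range that minimizes the information (relative entropy) lost by quantization. In some toolkits, "entropy calibration" is simply the name for the KL-divergence method.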
6. Per-Channel vs Per-Layer Quantization
Per-layer (also called per-tensor): Single scale for the entire layer (simple, less accurate)
Per-channel: Different scale for each output channel (better accuracy, but more scales to store and more kernel/hardware support needed)
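A small comparison on synthetic weights, assuming four output channels with very different magnitudes (the data and the helper `quantization_error` are illustrative assumptions).

```python
import numpy as np

def quantization_error(w, scale):
    """Mean squared error after symmetric INT8 quantize/dequantize."""
    q = np.clip(np.round(w / scale), -127, 127)
    return float(np.mean((w - q * scale) ** 2))

rng = np.random.default_rng(0)
# Hypothetical weights: 4 output channels, each with a different spread.
channel_std = np.array([0.01, 0.1, 1.0, 5.0])
weights = rng.normal(size=(4, 256)) * channel_std[:, None]

# One scale for the whole tensor vs. one scale per output channel (row).
per_layer_scale = np.max(np.abs(weights)) / 127.0
per_channel_scale = np.max(np.abs(weights), axis=1, keepdims=True) / 127.0

per_layer_mse = quantization_error(weights, per_layer_scale)
per_channel_mse = quantization_error(weights, per_channel_scale)
print(per_layer_mse, per_channel_mse)
```

With a single scale, the largest channel dictates the step size and the small-magnitude channels collapse toward zero; per-channel scales avoid that at the cost of storing one scale per row.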
7. Mixed-Bit Quantization
Idea: Not all layers need INT8. Some can be INT4 (even better compression).
- Early layers: INT8 (more sensitive to quantization)
- Middle layers: INT4 (robust to quantization)
- Late layers: INT8 (output accuracy critical)
Result: Average of about 6 bits/weight instead of 8 (a parameter-weighted average, e.g. when half the weights sit in INT4 layers), still good accuracy
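The average bit width depends on how many parameters each group of layers holds. A quick check, using hypothetical parameter counts (only the bit widths come from the scheme above):

```python
def average_bits(layers):
    """Parameter-weighted average bit width for (num_params, bits) pairs."""
    total_params = sum(n for n, _ in layers)
    return sum(n * bits for n, bits in layers) / total_params

# Hypothetical split: early and late layers INT8, middle layers INT4.
balanced = [(1_000_000, 8), (2_000_000, 4), (1_000_000, 8)]
middle_heavy = [(1_000_000, 8), (6_000_000, 4), (1_000_000, 8)]

print(average_bits(balanced))      # 6.0
print(average_bits(middle_heavy))  # 5.0
```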
8. Accuracy Preservation Strategies
- Fine-tuning: QAT with low learning rate (1/100 of original)
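- Activation clipping: Clip extreme activation outliers so they do not stretch the quantization range
- Weight clipping: Clip rare large-magnitude weights to narrow the range; they often matter less than the resolution gained
- Hessian-weighted quantization: Use Hessian-based sensitivity to set precision; weights or layers with low Hessian values are less sensitive and can be quantized more aggressively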
- Distillation: Train INT8 student model from FP32 teacher
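As one example from this list, the distillation objective can be sketched as a temperature-softened KL divergence between teacher and student outputs. The temperature value, the `T**2` scaling, and the function names are common conventions assumed here, not details from the notes.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1)
    return float(np.mean(kl) * temperature ** 2)

# Hypothetical logits: FP32 teacher vs. a quantized student that disagrees.
teacher = np.array([[3.0, 2.0, 1.0]])
student = np.array([[1.0, 2.0, 3.0]])
print(distillation_loss(student, teacher))   # positive: student disagrees
print(distillation_loss(teacher, teacher))   # zero: identical outputs
```

In a QAT setup this term would typically be added to the ordinary task loss while the student runs with fake-quantized weights.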
9. Real-World Examples
Mobile Deployment (Apple)
ResNet-50 on iPhone:
- Original: 100MB (FP32)
- Quantized: 25MB (INT8)
- Speedup: 4x
- Accuracy: 75.8% (vs 76.1% FP32, a 0.3 percentage point loss)
- Power: 50% reduction
Server Inference (Google)
BERT for search ranking:
- Original: 340M params, 1.3GB (FP32)
- Quantized: 340M params, 330MB (INT8)
- Speedup: 3-4x
- Accuracy: Minimal impact on ranking
- Cost: Up to 4x more inferences per GPU
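The size figures in both examples follow from parameter count times bytes per parameter. A rough check, counting weights only and assuming about 25.6M parameters for ResNet-50 (that count is an assumption added here; the 340M figure is from the notes):

```python
def model_size_mb(num_params, bits_per_param):
    """Weight storage in decimal megabytes (weights only, no metadata)."""
    return num_params * bits_per_param / 8 / 1e6

resnet50_params = 25_600_000   # approximate, assumed for illustration
bert_params = 340_000_000      # from the example above

for name, n in [("ResNet-50", resnet50_params), ("BERT", bert_params)]:
    fp32 = model_size_mb(n, 32)
    int8 = model_size_mb(n, 8)
    print(name, fp32, int8, fp32 / int8)
```

Small differences from the quoted figures are expected, depending on rounding and on whether sizes are reported in decimal or binary units.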
10. Quantization Checklist
- ✅ Start with PTQ: Easy, fast, often good enough
- ✅ Measure baseline: FP32 accuracy before quantizing
- ✅ Try calibration methods: MinMax first, then KL if accuracy low
- ✅ Per-channel quantization: For CNNs, especially early layers
- ✅ If accuracy drops > 1%: Use QAT (fine-tune)
- ✅ Test on real hardware: Simulator ≠ actual device
- ✅ Document quantization method: PTQ or QAT, calibration set, accuracy
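The PTQ-first rule in the checklist can be written as a tiny helper. The function name and the default 1-point threshold are illustrative; the threshold mirrors the "> 1%" item above.

```python
def choose_method(fp32_accuracy, ptq_accuracy, max_drop=1.0):
    """Keep PTQ if the accuracy drop (percentage points) is acceptable, else use QAT."""
    drop = fp32_accuracy - ptq_accuracy
    return "PTQ" if drop <= max_drop else "QAT"

print(choose_method(76.1, 75.8))  # small drop, PTQ is good enough
print(choose_method(76.1, 74.0))  # larger drop, fall back to QAT
```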
Next (Day 7): Systolic arrays. Then Days 8-15...