Day 13

Quantization Techniques

Post-Training Quantization (PTQ) vs Quantization-Aware Training (QAT). How to reduce FP32 models to INT8 with minimal accuracy loss.

Two Approaches

1. Post-Training Quantization (PTQ)

When: You have a trained FP32 model and need to quantize it quickly.

How:

Step 1: Take FP32 weights
Step 2: Collect activation statistics (min/max values) on validation set
Step 3: Compute scale factors per layer
Step 4: Quantize weights to INT8
Step 5: Test on validation set

Pro: Fast (no retraining)
Con: 1-2% accuracy loss
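The five PTQ steps can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the weights and calibration batches are made-up toy values, the helper names (`compute_scale`, `quantize`, `dequantize`) are hypothetical, and symmetric per-layer quantization is assumed.

```python
def compute_scale(values, qmax=127):
    """Symmetric scale: map the largest magnitude onto qmax."""
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, qmax=127):
    """Round to the nearest integer step and clamp to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map INT8 codes back to approximate float values."""
    return [q * scale for q in q_values]


# Step 1: FP32 weights of one layer (toy values, assumed for illustration)
weights = [0.423, -1.27, 0.081, 0.912, -0.338]

# Step 2: activation statistics from a few calibration batches (toy values)
calibration_batches = [[0.1, 0.9, 2.3], [0.4, 1.8, 0.05]]
act_max = max(abs(a) for batch in calibration_batches for a in batch)

# Step 3: one scale for the weights, one for the activations
w_scale = compute_scale(weights)
a_scale = act_max / 127

# Step 4: quantize the weights to INT8 codes
q_weights = quantize(weights, w_scale)

# Step 5: round-trip error as a stand-in for a real validation run
max_err = max(abs(w - d) for w, d in zip(weights, dequantize(q_weights, w_scale)))
print(q_weights, w_scale, a_scale, max_err)
```

In a real model, Step 5 would compare task accuracy before and after quantization; the round-trip error here only shows that each weight lands within half a quantization step of its original value.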

2. Quantization-Aware Training (QAT)

When: You need < 0.5% accuracy loss, or the model is sensitive to quantization.

How:

Step 1: Start with trained FP32 model
Step 2: Insert fake quantization ops in forward pass (simulate INT8 rounding without actually quantizing)
Step 3: Fine-tune on training data (10-20 epochs, lower learning rate)
Step 4: Weights learn to be quantization-friendly
Step 5: Export to actual INT8 model

Pro: Better accuracy retention (< 0.5% loss)
Con: Requires retraining, slower
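The core of QAT is the fake quantization op plus a rule for passing gradients through the rounding step. The sketch below assumes the common straight-through estimator (STE) for that rule, which the steps above do not name, and trains a single toy weight instead of a network. The function names and the target value are hypothetical.

```python
def fake_quantize(x, scale, qmax=127):
    """Simulate INT8: round to the integer grid, clamp, map straight back to float."""
    q = max(-qmax, min(qmax, round(x / scale)))
    return q * scale


def ste_grad(x, scale, qmax=127):
    """Straight-through estimator: treat rounding as identity inside the
    clipping range and block the gradient outside it."""
    return 1.0 if abs(x) <= qmax * scale else 0.0


def finetune(w, target, scale, lr=0.1, steps=200):
    """Fit a single weight so that fake_quantize(w) approaches target."""
    for _ in range(steps):
        w_q = fake_quantize(w, scale)            # forward pass sees the rounded weight
        grad_out = 2.0 * (w_q - target)          # d(loss)/d(w_q) for loss = (w_q - target)^2
        w -= lr * grad_out * ste_grad(w, scale)  # backward pass updates the FP32 weight
    return w


scale = 1.0 / 127                                # INT8 grid covering [-1, 1]
w = finetune(w=0.8, target=0.3, scale=scale)
print(w, fake_quantize(w, scale))
```

The weight stays in FP32 throughout training; only the forward pass sees the rounded value. After fine-tuning, the quantized weight sits within one grid step of the target, which is what "quantization-friendly" means in practice.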

Calibration: Finding the Right Scale

The hardest part: picking the scale factor that minimizes quantization error.

Min-Max Calibration

Simple: scale = max(|activations|) / 127

Problem: Outliers ruin everything

Activation values: [0.001, 0.05, 0.02, ..., 5.0] (one extreme outlier)
Scale = 5.0 / 127 = 0.0394
Result: 0.001 / 0.0394 ≈ 0.03, which rounds to INT8 value 0. Precision lost!
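The outlier problem is easy to reproduce. The snippet below uses only the four values spelled out in the example (the elided ones are left out, so this is an assumed subset) and applies min-max scaling directly.

```python
activations = [0.001, 0.05, 0.02, 5.0]   # the last value is the outlier

# Min-max calibration: the largest magnitude sets the scale
scale = max(abs(a) for a in activations) / 127

# Quantize to INT8 codes, then map back to floats to see what survived
quantized = [round(a / scale) for a in activations]
dequantized = [q * scale for q in quantized]

for a, q, d in zip(activations, quantized, dequantized):
    print(f"{a:>6} -> code {q:>3} -> {d:.4f}")
```

The three small values end up on codes 0 and 1 only, so 0.02 and 0.05 become indistinguishable after quantization, while the single outlier gets the whole top of the range to itself.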

Entropy Calibration (Better)

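Use KL divergence between the original FP32 activation distribution and its quantized version to find the best scale. Extreme outliers get clipped instead of stretching the range.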

Approach: Test different clipping thresholds on validation set (scale = threshold / 127)
- Threshold = 2.0: KL_div = 0.05
- Threshold = 3.0: KL_div = 0.03 ← Best!
- Threshold = 4.0: KL_div = 0.08

Pick threshold = 3.0 (minimizes information loss)
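One way to run this search is a histogram-based procedure similar in spirit to the entropy calibrators found in inference toolkits. The sketch below is a simplified version under assumptions of my own: 2048 histogram bins, 128 quantization levels, synthetic Gaussian activations with one injected outlier, and hypothetical function names. It is not any specific library's implementation.

```python
import math
import random


def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) for two histograms of equal length (normalized internally)."""
    p_total, q_total = sum(p), sum(q)
    if p_total == 0 or q_total == 0:
        return float("inf")
    kl = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            p_n = p_i / p_total
            q_n = max(q_i / q_total, eps)        # avoid log(0)
            kl += p_n * math.log(p_n / q_n)
    return kl


def quantized_histogram(hist, num_levels):
    """Merge hist into num_levels groups, then spread each group's mass
    evenly over the bins in that group that were originally non-empty."""
    n = len(hist)
    q = [0.0] * n
    for level in range(num_levels):
        start = level * n // num_levels
        stop = (level + 1) * n // num_levels
        group = hist[start:stop]
        nonzero = sum(1 for c in group if c > 0)
        if nonzero:
            share = sum(group) / nonzero
            for j in range(start, stop):
                if hist[j] > 0:
                    q[j] = share
    return q


def pick_threshold(values, candidates, num_bins=2048, num_levels=128):
    """Return (threshold, kl) for the candidate with the lowest KL divergence."""
    magnitudes = [abs(v) for v in values]
    top = max(magnitudes)
    if top == 0:
        raise ValueError("all activations are zero")
    width = top / num_bins
    hist = [0] * num_bins
    for m in magnitudes:
        hist[min(int(m / width), num_bins - 1)] += 1

    best = None
    for t in candidates:
        cut = max(1, min(num_bins, math.ceil(t / width)))
        q = quantized_histogram(hist[:cut], num_levels)  # what INT8 can represent
        p = hist[:cut]
        p[-1] += sum(hist[cut:])                 # clipped outliers saturate into the last bin
        kl = kl_divergence(p, q)
        if best is None or kl < best[1]:
            best = (t, kl)
    return best


random.seed(0)
activations = [random.gauss(0.0, 1.0) for _ in range(5000)] + [5.0]
candidates = [2.0, 3.0, 4.0]
best_t, best_kl = pick_threshold(activations, candidates)
print(best_t, best_kl, "scale =", best_t / 127)
```

A small threshold clips a lot of mass into the last bin, and a large one merges many fine bins into each INT8 level. Both raise the KL divergence, so the minimum marks the trade-off between clipping error and rounding error.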

Comparison Table

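| Method            | Accuracy Loss | Speed   | Effort | Best For           |
|-------------------|---------------|---------|--------|--------------------|
| PTQ               | 0.5-2%        | Minutes | Low    | Fast prototypes    |
| QAT               | 0-0.5%        | Hours   | High   | Production         |
| Per-channel quant | 0.1-0.5%      | Hours   | Medium | CNNs, Transformers |
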
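Per-channel quantization gives each output channel its own scale instead of sharing one scale across the whole weight tensor. The toy comparison below uses made-up weights with two channels of very different magnitude and a hypothetical helper, `max_roundtrip_error`, to show why that helps.

```python
def max_roundtrip_error(values, scale, qmax=127):
    """Largest |v - dequantize(quantize(v))| over the given values."""
    worst = 0.0
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        worst = max(worst, abs(v - q * scale))
    return worst


weights = [
    [0.011, -0.020, 0.015],   # channel 0: small weights
    [1.500, -2.000, 0.700],   # channel 1: large weights
]

# Per-tensor: one scale, dominated by the largest weight anywhere
per_tensor_scale = max(abs(w) for row in weights for w in row) / 127

# Per-channel: one scale per row (output channel)
per_channel_scales = [max(abs(w) for w in row) / 127 for row in weights]

err_tensor = max_roundtrip_error(weights[0], per_tensor_scale)
err_channel = max_roundtrip_error(weights[0], per_channel_scales[0])
print(err_tensor, err_channel)
```

With a shared scale, the small channel is squeezed onto one or two integer codes; with its own scale it uses the full INT8 range, so its round-trip error drops sharply.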
Apple Neural Engine: Uses QAT. Models are quantized offline with retraining and deployed as INT8 on A-series chips, with close to zero accuracy loss compared to FP32 after fine-tuning.

Google TPU: Supports both PTQ (fast) and QAT (accurate). Production models use QAT.

Day 14: Mixed-precision: when to use different precisions in the same model.