AI Chip Day 6 Enhanced — Quantization Techniques

1. Quantization Overview

Goal: Reduce model size and compute by using lower precision (INT8 instead of FP32)

Trade-off: A small accuracy loss in exchange for 4x memory saving and up to about 4x compute speedup (the actual speedup depends on the hardware's INT8 support)

Key insight: Neural networks are robust to quantization. ResNet-50 FP32→INT8 loses < 0.5% accuracy.

2. Post-Training Quantization (PTQ)

Process: Train model in FP32, then convert to INT8 without retraining

PTQ Steps:

Step 1: Collect activation statistics
  Gather min/max of activations on a calibration dataset
  Example: softmax layer output range: [0, 1]

Step 2: Calculate quantization scale/zero-point
  For symmetric quantization: scale = max(|value|) / 127
  For asymmetric: scale = (max_value - min_value) / 255
                  zero_point = round(-min_value / scale)

Step 3: Convert weights
  quantized_weight = round(original_weight / weight_scale)

Step 4: Inference
  quantized_output = quantized_weight × quantized_input (INT8 operands, INT32 accumulation)
  actual_output = quantized_output × weight_scale × input_scale (dequantize)

Time: Minutes (no retraining needed)
Accuracy: 0.1-1% loss (depends on model, layer)
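The four PTQ steps can be sketched in NumPy as follows. This is a minimal illustration, not a production pipeline: it assumes symmetric quantization with the convention real_value ≈ code × scale, and the random `weights` and `calib` arrays are hypothetical stand-ins for a real layer and a real calibration set.

```python
import numpy as np

def symmetric_scale(x, qmax=127):
    """Scale chosen so the largest magnitude in x maps to code qmax."""
    return float(np.max(np.abs(x))) / qmax

def quantize(x, scale, qmax=127):
    """Map real values to INT8 codes: q = clip(round(x / scale))."""
    return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.5, size=(4, 8)).astype(np.float32)  # stand-in layer weights
calib = rng.normal(0.0, 1.0, size=(8, 16)).astype(np.float32)   # stand-in calibration inputs

# Steps 1-2: collect statistics and derive one scale per tensor.
w_scale = symmetric_scale(weights)
x_scale = symmetric_scale(calib)

# Step 3: convert weights (inputs are quantized the same way at run time).
w_q = quantize(weights, w_scale)
x_q = quantize(calib, x_scale)

# Step 4: integer matmul with INT32 accumulation, then dequantize
# by multiplying with both scales.
acc = w_q.astype(np.int32) @ x_q.astype(np.int32)
y_int8_path = acc.astype(np.float32) * (w_scale * x_scale)

# Compare against the FP32 reference.
y_fp32 = weights @ calib
max_abs_err = float(np.max(np.abs(y_int8_path - y_fp32)))
print(max_abs_err)
```

Accumulating in INT32 matters here: products of two INT8 codes overflow INT8 immediately, so the narrow type is used only for storage and operands.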

3. Quantization-Aware Training (QAT)

Better accuracy: Simulate quantization during training, let model adapt

Process:

Start with a pre-trained FP32 model
Insert fake quantization ops (simulate rounding and clipping in the forward pass)
Train for a few epochs (with a lower learning rate than in the initial training)
Weights learn to work well in INT8 space
Convert to real INT8 weights for deployment on the target hardware
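A fake quantization op can be sketched as a quantize-then-dequantize round trip that stays in floating point. The backward rule shown here is the straight-through estimator, which is a common choice but an assumption on our part, since these notes do not say how gradients are handled. The function names and the example `scale` are hypothetical.

```python
import numpy as np

def fake_quantize(x, scale, qmax=127):
    """Forward pass: round to the INT8 grid, then map straight back to float."""
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

def ste_grad(x, upstream_grad, scale, qmax=127):
    """Backward pass (straight-through estimator): treat round() as identity
    and pass gradients only where x was inside the clipping range."""
    inside = np.abs(x) <= qmax * scale
    return upstream_grad * inside

w = np.array([-2.0, -0.26, 0.0, 0.26, 0.9], dtype=np.float32)
scale = 1.0 / 127                      # representable range is about [-1, 1]

w_fq = fake_quantize(w, scale)         # -2.0 is clipped to about -1.0
grad = ste_grad(w, np.ones_like(w), scale)
print(w_fq, grad)
```

Because the forward pass already contains the rounding error, the loss seen during fine-tuning reflects INT8 behaviour, which is what lets the weights adapt.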

Trade-off: Longer than PTQ (hours vs minutes), but better accuracy

4. Symmetric vs Asymmetric Quantization

Type	Formula	Range	Accuracy	Hardware
Symmetric	q = round(x / scale)	[-127, 127]	Slightly lower	Simpler (no zero-point)
Asymmetric	q = round(x / scale) + z	[0, 255] (or [-128, 127] signed)	Better (uses full range)	Needs zero-point calc

Example: ReLU output is [0, 6.5] (all positive)

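Symmetric: Range [-6.5, 6.5] maps to [-127, 127]; the negative codes are never used, so only about half the levels (0 to 127) carry information
Asymmetric: Range [0, 6.5] maps to all 256 levels (for example [0, 255]), which halves the step size and gives better precision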
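The ReLU example can be checked numerically. The sketch below assumes an unsigned 8-bit code range [0, 255] for the asymmetric case and uses an evenly spaced grid as a stand-in for real activations.

```python
import numpy as np

x = np.linspace(0.0, 6.5, 1001)        # ReLU-like activations, all non-negative

# Symmetric: zero maps to code 0, so only codes 0..127 are ever used here.
s_sym = 6.5 / 127
x_sym = np.clip(np.round(x / s_sym), -127, 127) * s_sym

# Asymmetric: all 256 codes are spread across [0, 6.5].
s_asym = (6.5 - 0.0) / 255
zero_point = int(round(-0.0 / s_asym))  # 0 for this range
q_asym = np.clip(np.round(x / s_asym) + zero_point, 0, 255)
x_asym = (q_asym - zero_point) * s_asym

err_sym = float(np.max(np.abs(x - x_sym)))
err_asym = float(np.max(np.abs(x - x_asym)))
print(s_sym, s_asym, err_sym, err_asym)
```

The asymmetric step size is about half the symmetric one (6.5/255 versus 6.5/127), and the worst-case rounding error shrinks accordingly.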

5. Calibration Methods

MinMax Calibration

Simple: Use min/max of activation range on calibration set

Problem: Outliers can skew the range. One sample with max=100 while others are [0, 5] wastes quantization range.
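The outlier problem is easy to reproduce. In this sketch the activations are synthetic (uniform in [0, 5] plus a single value of 100, mirroring the example above), and `minmax_scale` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.uniform(0.0, 5.0, size=10_000)   # typical activations in [0, 5]
acts_outlier = np.append(acts, 100.0)       # one extreme calibration sample

def minmax_scale(x, levels=255):
    """Asymmetric MinMax: spread `levels` steps across the observed range."""
    return (float(x.max()) - float(x.min())) / levels

scale_clean = minmax_scale(acts)
scale_outlier = minmax_scale(acts_outlier)

# How many distinct codes do the typical values in [0, 5] still occupy?
codes_used = len(np.unique(np.round(acts / scale_outlier)))
print(scale_clean, scale_outlier, codes_used)
```

With the outlier included, the step size grows by roughly 20x and the bulk of the data is squeezed into a small handful of the 256 available codes.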

KL Divergence Calibration

Better: Find range that minimizes KL divergence between FP32 and INT8 distributions

Process: Vary the clipping range, measure KL divergence, choose best
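The search over clipping ranges might look like the sketch below. It is a deliberately simplified version of the idea: candidate thresholds are restricted to multiples of 128 histogram bins, merged counts are spread evenly back over their fine bins, and a small epsilon avoids division by zero. Real toolkits differ in these details, and all names here are hypothetical.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) for two unnormalized histograms (q must be > 0 where p > 0)."""
    p = p / p.sum()
    q = q / q.sum()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def kl_calibrate(samples, num_bins=2048, num_levels=128):
    """Return the clipping threshold whose quantized histogram is closest to FP32."""
    hist, edges = np.histogram(np.abs(samples), bins=num_bins)
    hist = hist.astype(np.float64)
    best_kl, best_threshold = np.inf, float(edges[-1])
    for i in range(num_levels, num_bins + 1, num_levels):
        # Reference distribution: clip at bin i, folding outliers into the last bin.
        p = hist[:i].copy()
        p[-1] += hist[i:].sum()
        # Candidate: merge the i fine bins into num_levels coarse bins,
        # then expand back so both histograms have the same length.
        group = i // num_levels
        coarse = hist[:i].reshape(num_levels, group).sum(axis=1)
        q = np.repeat(coarse / group, group) + 1e-12
        kl = kl_divergence(p, q)
        if kl < best_kl:
            best_kl, best_threshold = kl, float(edges[i])
    return best_threshold

rng = np.random.default_rng(0)
acts = np.append(rng.normal(0.0, 1.0, size=10_000), 100.0)  # bulk plus one outlier
threshold = kl_calibrate(acts)
scale = threshold / 127
print(threshold, scale)
```

Unlike MinMax, the chosen threshold is allowed to clip the outlier, trading a small clipping error for much finer resolution on the bulk of the values.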

Entropy-Based Calibration

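Alternative: Choose the range that minimizes information loss (information-theoretic approach). This is closely related to KL divergence calibration, since KL divergence is itself a relative-entropy measure.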

6. Per-Channel vs Per-Layer Quantization

Per-layer: Single scale for entire layer (simple, less accurate)

Per-channel: Different scale for each output channel (better accuracy, more hardware support required)

Example: Convolution with 64 output channels

Per-layer:
  single scale derived from max(all 64 channels)
  All channels quantized with the same scale
  Accuracy: 74.5% (ResNet-50)

Per-channel:
  64 different scales (one per channel)
  Each channel quantized independently
  Accuracy: 76.1% (ResNet-50)

Improvement: 1.6 percentage points of accuracy, at the cost of storing 64 scales instead of 1 (negligible next to the weights themselves)
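The effect of per-channel scales can be seen on synthetic weights. This sketch assumes a hypothetical 64-channel convolution flattened to shape (64, 576), with channel magnitudes deliberately spread over two orders of magnitude. It measures weight reconstruction error, not model accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
channel_std = np.logspace(-2, 0, 64)                  # 0.01 ... 1.0
w = rng.normal(size=(64, 576)) * channel_std[:, None]

def quant_dequant(x, scale):
    """Round to the symmetric INT8 grid and map back to float."""
    return np.clip(np.round(x / scale), -127, 127) * scale

# Per-layer: one scale from the largest magnitude across all 64 channels.
layer_scale = np.abs(w).max() / 127
err_layer = float(np.mean((w - quant_dequant(w, layer_scale)) ** 2))

# Per-channel: one scale per output channel, shape (64, 1) for broadcasting.
channel_scale = np.abs(w).max(axis=1, keepdims=True) / 127
err_channel = float(np.mean((w - quant_dequant(w, channel_scale)) ** 2))

print(err_layer, err_channel)
```

With a single scale, the small-magnitude channels are rounded onto a grid sized for the largest channel, which is where most of the extra error comes from.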

7. Mixed-Bit Quantization

Idea: Not all layers need INT8. Some can be INT4 (even better compression).

Early layers: INT8 (more sensitive to quantization)
Middle layers: INT4 (robust to quantization)
Late layers: INT8 (output accuracy critical)

Result: About 6 bits/weight on average instead of 8 (when roughly half the weights sit in INT4 layers), with accuracy still close to the INT8 model
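The average bit width is a parameter-weighted mean, so it depends on how many weights each group of layers holds. The parameter counts below are hypothetical and chosen only to illustrate the arithmetic.

```python
# (name, parameter count, bits per weight); counts are illustrative only.
layers = [
    ("early",  2_500_000, 8),
    ("middle", 5_000_000, 4),
    ("late",   2_500_000, 8),
]

total_params = sum(n for _, n, _ in layers)
total_bits = sum(n * bits for _, n, bits in layers)

avg_bits = total_bits / total_params      # 6.0 for this split
size_mb = total_bits / 8 / 1e6            # 7.5 MB, versus 10 MB at uniform INT8
print(avg_bits, size_mb)
```

If the middle layers held a smaller share of the weights, the average would sit closer to 8 bits and the saving would shrink.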

8. Accuracy Preservation Strategies

Fine-tuning: QAT with low learning rate (1/100 of original)
Activation clipping: Avoid extreme outliers
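Weight clipping: Clip rare outlier weights so the quantization range fits the bulk of the distribution
Hessian-weighted quantization: Use Hessian (second-order) sensitivity to decide precision; weights or layers with low sensitivity tolerate coarser quantization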
Distillation: Train INT8 student model from FP32 teacher
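As one illustration from the list above, a distillation loss compares the student's softened output distribution with the teacher's. This is a minimal sketch: the temperature `T=2.0`, the T² scaling, and the function names are assumptions borrowed from common practice, not details given in these notes.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis (numerically stabilized)."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Mean KL(teacher || student) on softened distributions, scaled by T^2."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)
    return float(np.mean(kl) * T * T)

teacher = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])   # FP32 teacher logits
student = np.array([[1.6, 0.7, -0.8], [0.0, 1.2, 0.6]])   # INT8 student logits

loss = distillation_loss(student, teacher)
print(loss)
```

In training, this term would typically be combined with the ordinary task loss, so the quantized student learns from both the labels and the teacher's output distribution.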

9. Real-World Examples

Mobile Deployment (Apple)

ResNet-50 on iPhone:

Original: 100MB (FP32)
Quantized: 25MB (INT8)
Speedup: 4x
Accuracy: 75.8% (vs 76.1% FP32, a 0.3 percentage point loss)
Power: 50% reduction

Server Inference (Google)

BERT for search ranking:

Original: 340M params, 1.3GB (FP32)
Quantized: 340M params, 330MB (INT8)
Speedup: 3-4x
Accuracy: Minimal impact on ranking
Cost: 3-4x more inferences per GPU (in line with the speedup)
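The model sizes quoted in both examples follow from parameter count times bytes per parameter. In the sketch below, the BERT parameter count comes from the text above, while the ResNet-50 count of about 25.6M is an assumption based on the commonly cited figure.

```python
def model_size_mb(num_params, bits_per_param):
    """Weight storage in megabytes (1 MB = 1e6 bytes), ignoring metadata."""
    return num_params * bits_per_param / 8 / 1e6

resnet50_params = 25_600_000    # assumption: approximate public figure
bert_params = 340_000_000       # from the example above

resnet_fp32 = model_size_mb(resnet50_params, 32)   # about 100 MB
resnet_int8 = model_size_mb(resnet50_params, 8)    # about 25 MB
bert_fp32 = model_size_mb(bert_params, 32)         # about 1.3-1.4 GB
bert_int8 = model_size_mb(bert_params, 8)          # about 330-340 MB

print(resnet_fp32, resnet_int8, bert_fp32, bert_int8)
```

The 4x ratio is exact for the weights alone; a deployed model file also carries scales, zero-points, and any layers left in higher precision.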

10. Quantization Checklist

✅ Start with PTQ: Easy, fast, often good enough
✅ Measure baseline: FP32 accuracy before quantizing
✅ Try calibration methods: MinMax first, then KL if accuracy low
✅ Per-channel quantization: For CNNs, especially early layers
✅ If accuracy drops > 1%: Use QAT (fine-tune)
✅ Test on real hardware: Simulator ≠ actual device
✅ Document quantization method: PTQ or QAT, calibration set, accuracy

Next (Day 7): Systolic arrays. Then Days 8-15...