Quantization Techniques

Post-training quantization, quantization-aware training, calibration methods, and accuracy preservation strategies for neural networks.

By EcrioniX · Published June 13, 2026 · ~4000 words · 12 min read

1. Quantization Overview

Goal: Reduce model size and compute by using lower precision (INT8 instead of FP32)

Trade-off: A small accuracy loss in exchange for 4x less memory (1 byte per value instead of 4) and up to 4x faster compute on hardware with native INT8 support

Key insight: Neural networks are robust to quantization. ResNet-50 FP32→INT8 loses < 0.5% accuracy.
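The 4x memory figure follows directly from the byte widths. A minimal sketch of the arithmetic, assuming a parameter count of about 25.6 million (roughly ResNet-50) and ignoring the small overhead of storing scales; the function name is illustrative:

```python
BYTES_PER_PARAM = {"fp32": 4, "int8": 1}

def model_size_mb(num_params: int, dtype: str) -> float:
    """Weight storage in megabytes (1 MB = 10**6 bytes), ignoring scales and metadata."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6

# Assumed parameter count, roughly that of ResNet-50.
num_params = 25_600_000

fp32_mb = model_size_mb(num_params, "fp32")
int8_mb = model_size_mb(num_params, "int8")
print(f"FP32: {fp32_mb:.1f} MB, INT8: {int8_mb:.1f} MB, ratio: {fp32_mb / int8_mb:.0f}x")
```

The compute speedup is less predictable than the memory saving, since it depends on whether the target hardware has INT8 arithmetic units.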

2. Post-Training Quantization (PTQ)

Process: Train model in FP32, then convert to INT8 without retraining

PTQ Steps:

  Step 1: Collect activation statistics
    Gather the min/max of each layer's activations on a calibration dataset
    Example: softmax layer output range: [0, 1]

  Step 2: Calculate quantization scale/zero-point
    For symmetric quantization: scale = max(|value|) / 127
    For asymmetric: scale = (max_value - min_value) / 255
                    zero_point = round(-min_value / scale)
                    (for codes in [0, 255]; subtract 128 for the signed range [-128, 127])

  Step 3: Convert weights
    quantized_weight = round(original_weight / weight_scale)

  Step 4: Inference
    quantized_output = quantized_weight × quantized_input (INT8 operands, typically accumulated in INT32)
    actual_output = quantized_output × weight_scale × input_scale (dequantize)

  Time: Minutes (no retraining needed)
  Accuracy: 0.1-1% loss (depends on model, layer)
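The four steps can be sketched on a single dot product. This is a minimal illustration rather than a production flow: it assumes symmetric quantization for both weights and inputs, divides by the scale to quantize and multiplies by the scales to dequantize, and uses made-up values and helper names.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` to `qmax`."""
    return max(abs(v) for v in values) / qmax

def quantize(values, scale, qmax=127):
    """Round to the nearest integer code and clamp to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

# Made-up FP32 values; `inputs` stands in for calibration-time activations.
weights = [0.4, -1.0, 0.2]
inputs = [2.0, 0.6, 0.9]

# Steps 1-2: statistics and scales
w_scale = symmetric_scale(weights)
x_scale = symmetric_scale(inputs)

# Step 3: convert to integer codes
w_q = quantize(weights, w_scale)
x_q = quantize(inputs, x_scale)

# Step 4: integer dot product (hardware would typically accumulate in INT32),
# then dequantize by multiplying with both scales
acc = sum(w * x for w, x in zip(w_q, x_q))
y = acc * w_scale * x_scale

y_ref = sum(w * x for w, x in zip(weights, inputs))
print(f"quantized result: {y:.4f}, FP32 reference: {y_ref:.4f}")
```

The quantized result differs from the FP32 reference only by accumulated rounding error, which is the accuracy loss PTQ accepts.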

3. Quantization-Aware Training (QAT)

Better accuracy: Simulate quantization during training, let model adapt

Process:

  1. Start with pre-trained FP32 model
  2. Insert fake quantization ops (simulate rounding)
  3. Train for a few epochs (with a lower learning rate than in the initial training)
  4. Weights learn to work well in INT8 space
  5. Convert to actual INT8 for deployment on the target hardware

Trade-off: Longer than PTQ (hours vs minutes), but better accuracy
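A fake quantization op can be sketched as "quantize, then immediately dequantize", with the gradient passed through the rounding step unchanged (a straight-through estimator). The toy below fits a single weight; the fixed scale, learning rate, data, and function names are all assumptions chosen for illustration, not taken from any framework.

```python
def fake_quant(w, scale, qmax=127):
    """Quantize then dequantize, so the forward pass sees the rounding error."""
    q = max(-qmax, min(qmax, round(w / scale)))
    return q * scale

def train_step(w, x, y, scale, lr):
    """One SGD step on squared error using a straight-through estimator."""
    w_q = fake_quant(w, scale)      # forward pass uses the rounded weight
    err = w_q * x - y
    grad = 2.0 * err * x            # STE: treat d(w_q)/d(w) as 1
    return w - lr * grad            # update the latent FP32 weight

scale = 0.05                        # assumed fixed weight scale (coarse on purpose)
w = 0.0                             # latent FP32 weight
data = [(1.0, 0.73), (2.0, 1.46), (-1.0, -0.73)]   # samples of y = 0.73 * x

for _ in range(50):
    for x, y in data:
        w = train_step(w, x, y, scale, lr=0.05)

w_q = fake_quant(w, scale)
print(f"latent weight: {w:.4f}, weight seen at inference: {w_q:.2f}")
```

The latent weight settles near a grid boundary and the deployed weight lands on a grid point adjacent to the ideal 0.73, which is the best a grid with step 0.05 can represent.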

4. Symmetric vs Asymmetric Quantization

Type | Formula | Range | Accuracy | Hardware
Symmetric | q = round(x / scale) | [-127, 127] | Slightly lower | Simpler (no zero-point)
Asymmetric | q = round(x / scale) + z | [-128, 127] | Better (uses full range) | Needs zero-point calc

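Example: ReLU output is [0, 6.5] (no negative values). Symmetric quantization must cover [-6.5, 6.5], so the negative codes go unused and the step size is 6.5 / 127 ≈ 0.051. Asymmetric quantization maps [0, 6.5] onto all 256 codes, for a step size of 6.5 / 255 ≈ 0.025.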
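The ReLU case can be checked numerically. This sketch assumes the convention q = round(x / scale) + zero_point with signed 8-bit codes; the helper names are illustrative.

```python
def symmetric_params(lo, hi, qmax=127):
    """Scale for a range centred on zero; the zero-point is always 0."""
    scale = max(abs(lo), abs(hi)) / qmax
    return scale, 0

def asymmetric_params(lo, hi, qmin=-128, qmax=127):
    """Scale and zero-point that map [lo, hi] onto [qmin, qmax]."""
    scale = (hi - lo) / (qmax - qmin)
    zero_point = qmin - round(lo / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, qmin, qmax):
    return max(qmin, min(qmax, round(x / scale) + zero_point))

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

lo, hi = 0.0, 6.5                   # ReLU output range from the example
s_scale, s_zp = symmetric_params(lo, hi)
a_scale, a_zp = asymmetric_params(lo, hi)

print(f"symmetric step:  {s_scale:.4f}, codes used: "
      f"{quantize(lo, s_scale, s_zp, -127, 127)}..{quantize(hi, s_scale, s_zp, -127, 127)}")
print(f"asymmetric step: {a_scale:.4f}, codes used: "
      f"{quantize(lo, a_scale, a_zp, -128, 127)}..{quantize(hi, a_scale, a_zp, -128, 127)}")
```

For a one-sided range like this, the asymmetric step is about half the symmetric one, so the worst-case rounding error is roughly halved.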

5. Calibration Methods

MinMax Calibration

Simple: Use min/max of activation range on calibration set

Problem: Outliers can skew the range. One sample with max=100 while others are [0, 5] wastes quantization range.
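The outlier problem is easy to reproduce. A small sketch with synthetic data, assuming 255 steps across the observed range; the sample values and names are made up for illustration.

```python
def minmax_scale(samples, steps=255):
    """Step size when the observed [min, max] is spread over `steps` intervals."""
    lo, hi = min(samples), max(samples)
    return (hi - lo) / steps

# Synthetic activations: 101 values evenly spread over [0, 5] ...
typical = [i * 0.05 for i in range(101)]
# ... plus one outlier sample
with_outlier = typical + [100.0]

clean = minmax_scale(typical)
skewed = minmax_scale(with_outlier)

# Number of steps left to represent the typical [0, 5] range
codes_for_typical = round(5.0 / skewed)

print(f"step without outlier: {clean:.4f}")
print(f"step with outlier:    {skewed:.4f}")
print(f"steps covering [0, 5] with outlier: {codes_for_typical} of 255")
```

With the outlier included, the step size grows 20x and nearly all typical values share a handful of codes.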

KL Divergence Calibration

Better: Find range that minimizes KL divergence between FP32 and INT8 distributions

Process: Vary the clipping range, measure KL divergence, choose best
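One way to sketch this search is with histograms: for each candidate clipping point, fold the clipped tail into the last kept bin to form the reference distribution, coarsen the kept bins to the number of available codes, and compare the two. The version below is a simplified illustration for non-negative activations with a positive maximum; the bin count, the epsilon guard, and all function names are assumptions, not a specific toolchain's implementation.

```python
import math

def histogram(samples, num_bins, hi):
    """Counts of non-negative samples in `num_bins` equal-width bins over [0, hi]."""
    counts = [0] * num_bins
    for s in samples:
        idx = min(int(s / hi * num_bins), num_bins - 1)
        counts[idx] += 1
    return counts

def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) for two unnormalized histograms; eps guards empty Q bins."""
    p_sum, q_sum = sum(p), sum(q)
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            p_n = p_i / p_sum
            q_n = max(q_i / q_sum, eps)
            total += p_n * math.log(p_n / q_n)
    return total

def coarsen(hist, levels):
    """Merge bins into `levels` groups, then spread each group's mass evenly
    over its non-empty bins, imitating a quantizer with `levels` codes."""
    n = len(hist)
    out = [0.0] * n
    for g in range(levels):
        start, end = g * n // levels, (g + 1) * n // levels
        nonzero = [j for j in range(start, end) if hist[j] > 0]
        if nonzero:
            share = sum(hist[start:end]) / len(nonzero)
            for j in nonzero:
                out[j] = share
    return out

def kl_threshold(samples, num_bins=512, levels=128):
    """Clipping threshold whose coarsened histogram is closest to the reference."""
    hi = max(samples)
    hist = histogram(samples, num_bins, hi)
    best_i, best_kl = num_bins, float("inf")
    for i in range(levels, num_bins + 1):
        p = hist[:i]
        p[-1] += sum(hist[i:])          # fold clipped outliers into the last kept bin
        q = coarsen(hist[:i], levels)
        if sum(q) == 0:
            continue                    # no samples below this candidate threshold
        kl = kl_divergence(p, q)
        if kl < best_kl:
            best_i, best_kl = i, kl
    return best_i * hi / num_bins

samples = [i * 0.01 for i in range(500)] + [100.0]   # synthetic data with one outlier
print(f"chosen clipping threshold: {kl_threshold(samples):.2f}")
```

Which threshold wins depends on the shape of the activation distribution and on the histogram resolution, so the result should be validated against end-to-end accuracy rather than trusted on its own.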

Entropy-Based Calibration

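Alternative: Choose the clipping range that loses the least information, as measured by entropy (an information-theoretic approach). Some toolchains use "entropy calibration" as the name for the KL-divergence search described above.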

6. Per-Channel vs Per-Layer Quantization

Per-layer: Single scale for entire layer (simple, less accurate)

Per-channel: Different scale for each output channel (better accuracy, more hardware complexity)

Example: Convolution with 64 output channels

  Per-layer:
    Single scale, set by the largest weight magnitude across all 64 channels
    All channels quantized with same scale
    Accuracy: 74.5% (ResNet-50)

  Per-channel:
    64 different scales (one per channel)
    Each channel quantized independently
    Accuracy: 76.1% (ResNet-50)

  Improvement: 1.6 percentage points of accuracy, for 64x more storage for scales (64 values instead of 1)
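The reason per-channel scales help is that channels often differ widely in weight magnitude. A minimal sketch with two hypothetical channels (the weight values and helper name are made up) compares the round-trip error under a shared scale and under per-channel scales:

```python
def quant_error(values, scale, qmax=127):
    """Mean absolute error after a quantize/dequantize round trip."""
    err = 0.0
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        err += abs(q * scale - v)
    return err / len(values)

# Two hypothetical output channels with very different weight magnitudes
channels = [
    [0.9, -0.6, 0.3, -0.1],
    [0.011, -0.007, 0.004, -0.002],
]

# Per-layer: one scale from the largest magnitude across all channels
layer_scale = max(abs(w) for ch in channels for w in ch) / 127
per_layer = [quant_error(ch, layer_scale) for ch in channels]

# Per-channel: one scale per channel
channel_scales = [max(abs(w) for w in ch) / 127 for ch in channels]
per_channel = [quant_error(ch, s) for ch, s in zip(channels, channel_scales)]

for i, (a, b) in enumerate(zip(per_layer, per_channel)):
    print(f"channel {i}: per-layer error {a:.6f}, per-channel error {b:.6f}")
```

The large-magnitude channel is unaffected, while the small-magnitude channel, which a shared scale squeezes into two or three codes, recovers most of its precision.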

7. Mixed-Bit Quantization

Idea: Not all layers need INT8. Some can be INT4 (even better compression).

Result: Average 6 bits/weight instead of 8, still good accuracy
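The average is weighted by parameter count, so large layers dominate it. A short sketch with hypothetical layer sizes and bit assignments (chosen here so that the result is 6 bits):

```python
def average_bits(layers):
    """Parameter-weighted mean bit width over (num_params, bits) pairs."""
    total_params = sum(n for n, _ in layers)
    total_bits = sum(n * b for n, b in layers)
    return total_bits / total_params

# Hypothetical model: sensitive first and last layers stay at INT8,
# a large middle block drops to INT4.
layers = [(1_000_000, 8), (2_000_000, 4), (1_000_000, 8)]

avg = average_bits(layers)
size_mb = sum(n * b for n, b in layers) / 8 / 1e6
print(f"average bits/weight: {avg:.1f}, weight storage: {size_mb:.1f} MB")
```

In this made-up case the weights take 3 MB instead of the 4 MB needed at uniform INT8.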

8. Accuracy Preservation Strategies

9. Real-World Examples

Mobile Deployment (Apple)

ResNet-50 on iPhone:

Server Inference (Google)

BERT for search ranking:

10. Quantization Checklist

Next (Day 7): Systolic arrays