Day 15

Production Quantization

Real-world quantization workflows for the Apple A-series, Google TPU, and NVIDIA H100, from training to deployment.

Apple Neural Engine (A17 Pro): INT8 Inference Pipeline

Workflow

1. Train model on GPU (FP32)
   → ResNet-50, MobileNetV3, etc.
2. Quantize on macOS (QAT)
   → Use Apple's Core ML Tools
   → Input: .pt, .pb (PyTorch/TensorFlow)
   → Output: .mlmodel (quantized)
3. Benchmark accuracy
   → If accuracy loss is < 0.5%, proceed
   → If accuracy loss is > 0.5%, retrain with more epochs
4. Deploy to iOS/iPadOS
   → ANE runs INT8 inference
   → ~2W power, ~35 TOPS

Example model sizes:
- MobileNetV3: 4.2 MB (FP32) → 1.2 MB (INT8) [71% reduction]
- ResNet-50: 102 MB (FP32) → 26 MB (INT8) [about 75% reduction]
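
The quantize-then-check loop in steps 2 and 3 can be sketched in plain Python. This is a minimal illustration of symmetric per-tensor INT8 quantization, not what Core ML Tools does internally; the function names, the 0.5-point gate, and the 25.6M parameter count used for ResNet-50 are assumptions chosen to match the figures above.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8: map [-amax, amax] onto [-127, 127]."""
    amax = max(abs(w) for w in weights)
    scale = amax / 127 if amax > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate FP values from the INT8 codes."""
    return [v * scale for v in q]


def size_mb(num_params, bits_per_weight):
    """Weight storage in MB (10^6 bytes), ignoring metadata overhead."""
    return num_params * bits_per_weight / 8 / 1e6


def passes_accuracy_gate(fp32_acc, int8_acc, max_drop=0.5):
    """Step 3: accept the quantized model if accuracy drops by at most max_drop points."""
    return (fp32_acc - int8_acc) <= max_drop


weights = [0.5, -1.0, 0.25, 0.003]          # hypothetical weight values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, [round(r, 4) for r in restored])
print(size_mb(25.6e6, 32), size_mb(25.6e6, 8))  # ResNet-50, approx. 25.6M params
```

Each restored weight differs from the original by at most half a quantization step (scale / 2), which is why small weights in a tensor with one large outlier lose the most relative precision.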

Quantization Details

Google TPU (Cloud): BF16 Training + INT8 Inference

Production Workflow

1. Train on TPU (BF16 mixed precision)
   → PyTorch/JAX/TensorFlow
   → Loss scaling is usually unnecessary: BF16 keeps FP32's 8-bit exponent range
   → 2-4× faster than GPU
2. Validate on TPU (BF16)
   → Check model quality
3. Export to SavedModel format
   → Includes quantization params
4. Deploy to Cloud TPU
   → Inference in INT8 or BF16
   → 275 TFLOPS peak (BF16 or INT8) and 32 GB HBM per v4 chip

Example: LLaMA 7B model
- Training (fine-tuning run): 3.5 hours on a TPU v4 slice (8 chips)
- FP32 size: 28 GB → BF16: 14 GB → INT8: 7 GB
- Deployment: BF16 (14 GB) or INT8 (7 GB) weights fit on a single TPU v4 chip (32 GB HBM)
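
The size arithmetic in the LLaMA 7B example is just parameter count times bytes per parameter. A minimal sketch follows; the helper names and the list of HBM capacities are illustrative assumptions, and the check covers weights only (activations and any KV cache need additional headroom).

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_footprint_gb(num_params, dtype):
    """Weight storage in GB (10^9 bytes) for a given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def fits_in_hbm(num_params, dtype, hbm_gb):
    """True if the weights alone fit in the given HBM capacity."""
    return weight_footprint_gb(num_params, dtype) <= hbm_gb


NUM_PARAMS = 7e9  # LLaMA 7B
for hbm_gb in (8.0, 16.0, 32.0):  # illustrative per-chip capacities
    for dtype in BYTES_PER_PARAM:
        gb = weight_footprint_gb(NUM_PARAMS, dtype)
        print(f"{dtype}: {gb:.0f} GB, fits in {hbm_gb:.0f} GB HBM: "
              f"{fits_in_hbm(NUM_PARAMS, dtype, hbm_gb)}")
```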

TPU-Specific Optimizations

NVIDIA H100: FP8 Quantization

Hopper Architecture Support

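| Format | Bits | Layout | Speed vs FP32 |
|---|---|---|---|
| FP32 | 32 | S[1] E[8] M[23] | 1× (baseline) |
| TF32 | 19 (stored in 32) | S[1] E[8] M[10] | 1.5× |
| FP16 | 16 | S[1] E[5] M[10] | |
| FP8 (E4M3) | 8 | S[1] E[4] M[3] | |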
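
The layout column determines each format's width and precision directly. The sketch below derives total bits, relative step size (2^-M), and the largest finite value under standard IEEE-style rules; the helper names are assumptions for illustration. Note that FP8 E4M3 as used on Hopper deviates from the IEEE rule by reclaiming most of the top exponent code, so its real maximum is 448 rather than the 240 the generic formula gives.

```python
# (exponent bits, mantissa bits) for each format in the table above
FORMATS = {
    "FP32": (8, 23),
    "TF32": (8, 10),
    "FP16": (5, 10),
    "FP8 E4M3": (4, 3),
}


def total_bits(exp_bits, man_bits):
    """Sign bit + exponent bits + mantissa bits."""
    return 1 + exp_bits + man_bits


def epsilon(man_bits):
    """Spacing between 1.0 and the next representable value."""
    return 2.0 ** -man_bits


def ieee_max(exp_bits, man_bits):
    """Largest finite value if the top exponent code is reserved for inf/NaN."""
    bias = 2 ** (exp_bits - 1) - 1
    return (2.0 - 2.0 ** -man_bits) * 2.0 ** bias


for name, (e, m) in FORMATS.items():
    print(f"{name}: {total_bits(e, m)} bits, eps={epsilon(m):.3g}, "
          f"IEEE-style max={ieee_max(e, m):.3g}")
```

TF32 and FP16 share the same 10-bit mantissa, so they have the same relative precision; TF32's advantage is the 8-bit exponent it inherits from FP32.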

H100 Workflow

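1. Train on H100 (FP32 with AMP)
   → Automatic Mixed Precision (torch.autocast) runs matrix ops in BF16/FP16
   → Remaining FP32 matrix ops can use TF32 Tensor Cores
   → 3-4× faster than pure FP32
2. Fine-tune with FP8
   → Build the model from NVIDIA Transformer Engine layers (e.g. te.Linear) and wrap the forward pass in transformer_engine.pytorch.fp8_autocast
   → PyTorch's FP8 dtypes are torch.float8_e4m3fn and torch.float8_e5m2; there is no torch.float8 or torch.distributed.quantize
   → 5-10 extra epochs
3. Deploy
   → Serve with an FP8-capable runtime such as Transformer Engine or TensorRT-LLM (torch.compile takes no dtype argument)
   → Up to 4× memory savings
   → Up to 4× throughput vs FP32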
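
To make the FP8 step concrete, here is a pure-Python simulation of casting values to the E4M3 grid with per-tensor scaling. It is a sketch of the numerics only, assuming round-to-nearest-even and saturation at 448; it is not the Transformer Engine implementation, and the function names are hypothetical. NaN handling is omitted.

```python
import math

E4M3_MAX = 448.0       # largest finite E4M3 value
MANTISSA_BITS = 3
MIN_NORMAL_EXP = -6    # smallest normal exponent (bias 7)


def round_to_e4m3(x):
    """Round x to the nearest representable E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    _, e = math.frexp(mag)                 # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, MIN_NORMAL_EXP)       # clamp into the subnormal range
    step = 2.0 ** (exp - MANTISSA_BITS)    # grid spacing in this binade
    # Python's round() is round-half-to-even, matching the hardware default.
    return sign * min(round(mag / step) * step, E4M3_MAX)


def fp8_fake_quantize(values):
    """Scale so the largest magnitude maps to 448, round to E4M3, scale back."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return list(values), 1.0
    scale = E4M3_MAX / amax
    return [round_to_e4m3(v * scale) / scale for v in values], scale


values = [0.5, -2.0, 0.01]  # hypothetical activations
quantized, scale = fp8_fake_quantize(values)
print(scale, quantized)
```

With three mantissa bits, normal values carry a relative rounding error of up to 2^-4 (6.25%), which is why FP8 workflows track a per-tensor scale and usually need a few fine-tuning epochs to recover accuracy.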

Comparison: Real Hardware

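| Chip | Quantization | Accuracy Loss | Speedup | Power |
|---|---|---|---|---|
| Apple A17 | INT8 | 0.3-0.5% | | 2W |
| Google TPU v4 | INT8 + BF16 | 0.1-0.3% | 2-4× | 275 TFLOPS peak (throughput; power not given) |
| NVIDIA H100 | FP8 | 0.5-1% | | 700W |
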
Why different quantizations?
- Apple: Mobile power budget → INT8 only
- Google: Data center scale → BF16 for training, INT8 for inference
- NVIDIA: Flexible workloads → FP8 (floating-point dynamic range at 8-bit cost, usable for both training and inference)

Key insight: Production quantization is not just a math operation; it's a chip-specific optimization aligned with hardware compute units (Tensor Cores, systolic arrays, MAC units).

Next Phase: Memory & Bandwidth

You now understand precision. Days 16-20 cover memory, the actual bottleneck for AI chips: how to design memory hierarchies, use high-bandwidth memory (HBM), and apply roofline models.