AI Chip Design Day 15

Apple Neural Engine (A17 Pro): INT8 Inference Pipeline

Workflow

1. Train model on GPU (FP32)
   → ResNet-50, MobileNetV3, etc.
2. Quantize on macOS (QAT)
   → Use Apple's Core ML Tools
   → Input: .pt, .pb (PyTorch/TensorFlow)
   → Output: .mlmodel (quantized)
3. Benchmark accuracy
   → If accuracy loss is under 0.5%, proceed
   → If accuracy loss is over 0.5%, retrain with more epochs
4. Deploy to iOS/iPadOS
   → ANE runs INT8 inference
   → ~2W power, ~35 TOPS (Apple's stated figure for the A17 Pro; 17 TOPS was the A16)

Example model sizes:
- MobileNetV3: 4.2 MB (FP32) → 1.2 MB (INT8) [71% reduction]
- ResNet-50: 102 MB (FP32) → 26 MB (INT8) [~75% reduction]
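The size figures and the step 3 accuracy gate are simple arithmetic. A minimal sketch, assuming accuracy is measured in percentage points; the function names and the example accuracies are hypothetical, not part of Core ML Tools:

```python
def size_reduction_pct(fp32_mb: float, int8_mb: float) -> float:
    """Percentage of the FP32 file size removed by quantization."""
    return 100.0 * (1.0 - int8_mb / fp32_mb)


def accuracy_gate(fp32_acc: float, int8_acc: float, max_drop: float = 0.5) -> str:
    """Return 'deploy' if the accuracy drop (percentage points) is within budget."""
    drop = fp32_acc - int8_acc
    return "deploy" if drop < max_drop else "retrain"


# Size figures quoted in the workflow above.
mobilenet_cut = size_reduction_pct(4.2, 1.2)
resnet_cut = size_reduction_pct(102.0, 26.0)
print(f"MobileNetV3: {mobilenet_cut:.1f}% smaller")
print(f"ResNet-50:   {resnet_cut:.1f}% smaller")

# Hypothetical top-1 accuracies, for illustration only.
print(accuracy_gate(fp32_acc=76.1, int8_acc=75.8))  # drop of about 0.3 points
print(accuracy_gate(fp32_acc=76.1, int8_acc=75.2))  # drop of about 0.9 points
```

A pure 4-byte to 1-byte conversion would give exactly 75%; smaller reductions usually mean some tensors or metadata stay in higher precision.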

Quantization Details

Format: Symmetric INT8 ([-128, 127])
Per-layer or per-channel: Per-channel for weights (better accuracy)
Activations: INT8, calibrated on representative data
Verification: Side-by-side inference on GPU vs ANE
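Why per-channel scales help can be seen on a toy weight matrix. A minimal sketch of symmetric INT8 quantization in plain Python, assuming the scale is chosen so the largest magnitude maps to 127; the weights and helper names are hypothetical:

```python
def scale_for(values):
    """Symmetric scale: the largest magnitude maps to code 127."""
    amax = max(abs(v) for v in values)
    return amax / 127.0 if amax > 0 else 1.0


def quantize_symmetric(values, scale):
    """Map floats to INT8 codes in [-128, 127] with zero-point 0."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def roundtrip_error(row, scale):
    """Largest absolute error after quantize -> dequantize."""
    restored = [c * scale for c in quantize_symmetric(row, scale)]
    return max(abs(a - b) for a, b in zip(row, restored))


# Hypothetical weights: output channel 0 is large, channel 1 is tiny.
weights = [
    [2.50, -1.75, 0.90, -2.10],
    [0.012, -0.007, 0.004, -0.010],
]

# Per-layer (per-tensor): one scale shared by every channel.
tensor_scale = scale_for([v for row in weights for v in row])

errors = []
for i, row in enumerate(weights):
    per_tensor_err = roundtrip_error(row, tensor_scale)
    per_channel_err = roundtrip_error(row, scale_for(row))
    errors.append((per_tensor_err, per_channel_err))
    print(f"channel {i}: per-tensor err={per_tensor_err:.6f}, "
          f"per-channel err={per_channel_err:.6f}")
```

With a shared scale, the small channel collapses to codes of -1, 0, or 1; with its own scale it uses the full INT8 range.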

Google TPU (Cloud): BF16 Training + INT8 Inference

Production Workflow

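1. Train on TPU (BF16 mixed precision; loss scaling is generally unnecessary because BF16 keeps FP32's exponent range)
   → PyTorch/JAX/TensorFlow
   → 2-4× faster than GPU (workload-dependent)
2. Validate on TPU (BF16)
   → Check model quality
3. Export to SavedModel format
   → Includes quantization params
4. Deploy to Cloud TPU
   → Inference in INT8 or BF16
   → 275 TFLOPS peak and 32 GB HBM per chip (v4)

Example: LLaMA 7B model
- Training: 3.5 hours on a TPU v4 slice (8 chips); this is a fine-tuning-scale run, since full pretraining takes far longer
- FP32 size: 28 GB → BF16: 14 GB → INT8: 7 GB
- Deployment: INT8 or BF16 weights fit on a single TPU v4 chip (32 GB HBM)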
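The LLaMA 7B sizes follow directly from bytes per parameter. A minimal sketch, assuming decimal gigabytes and counting weights only (activations, KV cache, and optimizer state are extra); the 16 GB budget is a hypothetical value, so pass the real HBM capacity of the target chip:

```python
BYTES_PER_PARAM = {"FP32": 4, "BF16": 2, "INT8": 1}


def weight_footprint_gb(num_params: int, fmt: str) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


def fits(num_params: int, fmt: str, hbm_gb: float) -> bool:
    """True if the weights alone fit in the given memory budget."""
    return weight_footprint_gb(num_params, fmt) <= hbm_gb


params = 7_000_000_000      # LLaMA 7B
budget_gb = 16.0            # hypothetical memory budget, not a real chip spec

for fmt in ("FP32", "BF16", "INT8"):
    size = weight_footprint_gb(params, fmt)
    verdict = "fits" if fits(params, fmt, budget_gb) else "does not fit"
    print(f"{fmt}: {size:.0f} GB, {verdict} in {budget_gb:.0f} GB")
```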

TPU-Specific Optimizations

Systolic compute: 128×128 MXU arrays (TPU v2-v4; the original TPU v1 used a single 256×256 array) run BF16 and INT8 matmuls natively
Weight layout: Transposed to match systolic flow (Row-major for A, column-major for B)
Batch size: Multiples of 128 (fills systolic array efficiently)
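The batch-size rule can be illustrated with a simplified padding model. This sketch assumes the batch dimension is padded up to the next multiple of a 128-wide tile and that padded rows do no useful work; real compiler tiling is more involved, and the helper names are hypothetical:

```python
def padded_batch(batch: int, tile: int = 128) -> int:
    """Round the batch up to the next multiple of the tile width."""
    return -(-batch // tile) * tile  # ceiling division, then scale back up


def utilization(batch: int, tile: int = 128) -> float:
    """Fraction of the padded rows that carry real data."""
    return batch / padded_batch(batch, tile)


for batch in (64, 128, 129, 200, 256):
    print(f"batch={batch:4d} padded={padded_batch(batch):4d} "
          f"utilization={utilization(batch):.2f}")
```

A batch of 129 is the worst case here: it pads to 256, so roughly half of the rows are wasted.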

NVIDIA H100: FP8 Quantization

Hopper Architecture Support

Format	Bits	Layout	Approx. speed vs FP32
FP32	32	S[1] E[8] M[23]	1×
TF32	19 (stored in 32)	S[1] E[8] M[10]	1.5×
FP16	16	S[1] E[5] M[10]	2×
FP8 (E4M3)	8	S[1] E[4] M[3]	4×
FP8 (E5M2)	8	S[1] E[5] M[2]	4×
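To make the S[1] E[4] M[3] layout concrete, here is a minimal decoder for one FP8 byte. It assumes the E4M3 convention used on Hopper-class hardware: exponent bias 7, no infinities, and only the all-ones pattern reserved for NaN, which gives a maximum finite value of 448. The function name is hypothetical:

```python
def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 byte into a Python float."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF   # 4 exponent bits
    mantissa = byte & 0x7          # 3 mantissa bits
    if exponent == 0xF and mantissa == 0x7:
        return float("nan")        # the only NaN pattern (per sign)
    if exponent == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        return sign * (mantissa / 8.0) * 2.0 ** (1 - 7)
    # Normal: implicit leading 1, exponent bias of 7.
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


print(decode_e4m3(0x38))  # exponent field 7, mantissa 0
print(decode_e4m3(0x7E))  # largest finite value
print(decode_e4m3(0x01))  # smallest positive subnormal
```

With only 3 mantissa bits there are 8 steps per power of two, which is why FP8 training relies on per-tensor scaling rather than raw casts.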

H100 Workflow

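1. Train on H100 (mixed precision)
   → Automatic Mixed Precision (torch.autocast) runs matrix ops in BF16/FP16
   → Remaining FP32 matmuls can use TF32 (torch.backends.cuda.matmul.allow_tf32 = True)
   → 3-4× faster than pure FP32
2. Fine-tune with FP8
   → Use an FP8-aware library such as NVIDIA Transformer Engine (fp8_autocast); core PyTorch exposes the storage dtypes torch.float8_e4m3fn and torch.float8_e5m2 but no one-line FP8 quantize call
   → 5-10 extra epochs
3. Deploy
   → Export to an FP8-capable runtime such as TensorRT, or call torch.compile(model) on the FP8 model (torch.compile takes no dtype argument)
   → 4× memory savings for weights vs FP32
   → Up to 4× throughput vs FP32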
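FP8 fine-tuning depends on scaling each tensor into FP8's narrow range before casting. The sketch below simulates per-tensor scaling in plain Python, assuming the E4M3 format with a maximum of 448 and round-to-nearest; it models values only, not the hardware path, and the helper names and sample activations are hypothetical. Libraries such as Transformer Engine manage these scale factors automatically.

```python
import math

E4M3_MAX = 448.0  # largest finite E4M3 value


def round_to_e4m3(x: float) -> float:
    """Round a float to the nearest value representable in E4M3 (saturating)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)        # saturate instead of overflowing
    _, exp = math.frexp(a)           # a = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)             # -6 is the smallest normal exponent
    step = 2.0 ** (e - 3)            # 3 mantissa bits: 8 steps per binade
    return sign * min(round(a / step) * step, E4M3_MAX)


def fp8_quantize(values):
    """Per-tensor scaling: stretch the tensor so its amax lands on E4M3_MAX."""
    amax = max(abs(v) for v in values)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    return [round_to_e4m3(v * scale) for v in values], scale


def fp8_dequantize(codes, scale):
    """Undo the scaling to compare against the original values."""
    return [c / scale for c in codes]


# Hypothetical activations spanning a wide dynamic range.
activations = [0.02, -1.3, 0.75, 3.2, -0.004]
codes, scale = fp8_quantize(activations)
restored = fp8_dequantize(codes, scale)

for original, approx in zip(activations, restored):
    rel = abs(approx - original) / abs(original)
    print(f"{original:+.4f} -> {approx:+.6f} (relative error {rel:.2%})")
```

For values that stay in the normal range, the relative error is bounded by half a mantissa step, 1/16 or 6.25%.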

Comparison: Real Hardware

Chip	Quantization	Accuracy Loss	Speedup	Power
Apple A17 Pro	INT8	0.3-0.5%	4×	~2W (ANE)
Google TPU v4	INT8 + BF16	0.1-0.3%	2-4×	~170W typical
NVIDIA H100	FP8	0.5-1%	4×	700W (SXM TDP)

Why different quantizations?
- Apple: Mobile power budget → INT8 only
- Google: Data center scale → BF16 for training, INT8 for inference
- NVIDIA: Flexible workloads → FP8 (new, hybrid approach)

Key insight: Production quantization is not just a math operation; it's a chip-specific optimization aligned with hardware compute units (Tensor Cores, systolic arrays, MAC units).

Next Phase: Memory & Bandwidth

You now understand precision. Days 16-20 cover memory, the actual bottleneck for AI chips: how to design memory hierarchies, use high-bandwidth memory (HBM), and apply roofline models.
