Apple Neural Engine (A17 Pro): INT8 Inference Pipeline
Workflow
1. Train model on GPU (FP32)
→ ResNet-50, MobileNetV3, etc.
2. Quantize on macOS (post-training quantization; fall back to QAT if accuracy suffers)
→ Use Apple's Core ML Tools (coremltools)
→ Input: .pt, .pb (PyTorch/TensorFlow)
→ Output: .mlpackage or .mlmodel (quantized)
3. Benchmark accuracy
→ If accuracy drops by less than 0.5% relative to the FP32 model, proceed
→ If it drops by 0.5% or more, retrain with more epochs
4. Deploy to iOS/iPadOS
→ ANE runs INT8 inference
→ ~2W power, ~35 TOPS (Apple's stated peak for the A17 Pro Neural Engine)
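The accuracy gate in step 3 can be sketched as a small helper. This is a minimal illustration, assuming accuracies are reported as top-1 percentages; the function names `accuracy_drop` and `next_step` are hypothetical and not part of Core ML Tools.

```python
def accuracy_drop(fp32_acc: float, int8_acc: float) -> float:
    """Absolute accuracy drop in percentage points (FP32 minus INT8)."""
    return fp32_acc - int8_acc


def next_step(fp32_acc: float, int8_acc: float, max_drop: float = 0.5) -> str:
    """Decide what to do after benchmarking the quantized model.

    max_drop is the 0.5% budget from the workflow above.
    """
    if accuracy_drop(fp32_acc, int8_acc) < max_drop:
        return "deploy"
    return "retrain"


if __name__ == "__main__":
    print(next_step(76.1, 75.8))  # small drop: within budget
    print(next_step(76.1, 75.0))  # large drop: over budget
```

Measuring both models on the same held-out set keeps the comparison fair; the threshold itself is a project decision rather than a hardware constant.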
Example model sizes:
- MobileNetV3: 4.2 MB (FP32) → 1.2 MB (INT8) [71% reduction]
- ResNet-50: 102 MB (FP32) → 26 MB (INT8) [~75% reduction]
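The reduction percentages follow directly from the two sizes. A minimal sketch, with `reduction_pct` as a hypothetical helper name:

```python
def reduction_pct(fp32_mb: float, int8_mb: float) -> float:
    """Size reduction of the quantized model, as a percentage of the FP32 size."""
    return (fp32_mb - int8_mb) / fp32_mb * 100.0


if __name__ == "__main__":
    for name, fp32_mb, int8_mb in [("MobileNetV3", 4.2, 1.2), ("ResNet-50", 102.0, 26.0)]:
        print(f"{name}: {reduction_pct(fp32_mb, int8_mb):.1f}% smaller")
```

Going from 4 bytes to 1 byte per weight gives at most a 75% reduction, so figures below that suggest some tensors or metadata are stored at higher precision.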
Quantization Details
- Format: Symmetric INT8 (zero point fixed at 0; range [-128, 127], often restricted to [-127, 127])
- Per-layer or per-channel: Per-channel for weights (better accuracy)
- Activations: INT8, calibrated on representative data
- Verification: Side-by-side inference on GPU vs ANE
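Symmetric per-channel weight quantization can be sketched in a few lines. This is an illustration in plain Python, not how Core ML Tools implements it: each output channel (one row here) gets its own scale, chosen so the largest magnitude in that row maps to 127. The function names are hypothetical.

```python
def quantize_per_channel(weights):
    """Symmetric INT8 quantization with one scale per row (output channel).

    weights: list of rows of floats. Returns (int8_rows, scales).
    """
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # Map the largest magnitude in this channel to 127; avoid a zero scale.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_row = [max(-128, min(127, round(w / scale))) for w in row]
        q_rows.append(q_row)
        scales.append(scale)
    return q_rows, scales


def dequantize_per_channel(q_rows, scales):
    """Recover approximate float weights: w ≈ q * scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


if __name__ == "__main__":
    w = [[0.5, -1.0, 0.25],      # channel with large weights
         [0.01, 0.02, -0.04]]    # channel with small weights
    q, s = quantize_per_channel(w)
    print(q, s)
    print(dequantize_per_channel(q, s))
```

With a single per-layer scale, the second channel above would share the scale of the first and collapse onto only a few integer levels; per-channel scales keep its resolution, which is why they tend to preserve accuracy better.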
Google TPU (Cloud): BF16 Training + INT8 Inference
Production Workflow
1. Train on TPU (BF16 mixed precision; loss scaling is normally unnecessary because BF16 keeps FP32's exponent range)
→ PyTorch/JAX/TensorFlow
→ Can be 2-4× faster than a comparable GPU setup, depending on model and batch size
2. Validation on TPU (BF16)
→ Check model quality
3. Export to SavedModel format
→ Includes quantization params
4. Deploy to Cloud TPU
→ Inference in INT8 or BF16
→ 275 TFLOPS peak BF16 and 32 GB HBM per chip (v4)
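BF16 is effectively the top 16 bits of an FP32 value: the same sign bit and 8-bit exponent, with the mantissa cut to 7 bits. The sketch below emulates that conversion with round-to-nearest-even using only the standard library; the helper names are hypothetical and NaN inputs are not handled.

```python
import struct


def to_bf16_bits(x: float) -> int:
    """Round an FP32 value to BF16 and return the 16-bit pattern.

    Uses round-to-nearest-even on the 16 bits that get dropped.
    NaN inputs are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def from_bf16_bits(b: int) -> float:
    """Expand a BF16 bit pattern back to a Python float (via FP32)."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


if __name__ == "__main__":
    for value in (1.0, 3.14159, 1e-30):
        approx = from_bf16_bits(to_bf16_bits(value))
        print(f"{value!r} -> {approx!r}")
```

A value as small as 1e-30 survives the round trip because the exponent field is unchanged; only precision is lost (about 2 to 3 significant decimal digits remain).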
Example: LLaMA 7B model
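- Training: 3.5 hours on a TPU v4 slice (8 chips); realistic only for a short fine-tuning run, since full pretraining takes orders of magnitude longer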
- FP32 size: 28 GB → BF16: 14 GB → INT8: 7 GB
- Deployment: INT8 (7 GB) or BF16 (14 GB) weights fit on a single TPU v4 chip (32 GB HBM), with room left for activations and the KV cache
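The size figures are parameter count times bytes per parameter. The sketch below reproduces them and adds a rough fit check; the 20% headroom for activations and runtime buffers is an assumption chosen for illustration, and HBM capacity is passed in because it differs between accelerator generations.

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_size_gb(num_params: int, dtype: str) -> float:
    """Weight storage in GB (10^9 bytes) for a given precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def fits(num_params: int, dtype: str, hbm_gb: float, headroom: float = 0.2) -> bool:
    """True if weights plus an assumed headroom fraction fit in HBM."""
    return weight_size_gb(num_params, dtype) * (1.0 + headroom) <= hbm_gb


if __name__ == "__main__":
    params = 7_000_000_000  # 7B-parameter model
    for dtype in ("fp32", "bf16", "int8"):
        size = weight_size_gb(params, dtype)
        print(f"{dtype}: {size:.0f} GB, "
              f"fits in 8 GB: {fits(params, dtype, 8)}, "
              f"fits in 32 GB: {fits(params, dtype, 32)}")
```

With any headroom at all, 7 GB of INT8 weights is a tight squeeze in 8 GB of memory, so weight size alone is not a sufficient deployment check.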
TPU-Specific Optimizations
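- Systolic compute: each matrix unit (MXU) is a 128×128 systolic array on TPU v2 through v4 (the inference-only TPU v1 used a 256×256 INT8 array); native INT8 support varies by generation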
- Weight layout: Transposed to match systolic flow (Row-major for A, column-major for B)
- Batch size: Multiples of 128 (fills systolic array efficiently)
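The batch-size rule amounts to rounding up to the next multiple of the array dimension. A minimal sketch, with hypothetical helper names, that also shows how much of the padded batch is real work:

```python
def pad_to_multiple(n: int, multiple: int = 128) -> int:
    """Smallest multiple of `multiple` that is >= n (for n >= 1)."""
    return ((n + multiple - 1) // multiple) * multiple


def utilization(n: int, multiple: int = 128) -> float:
    """Fraction of the padded batch occupied by real examples."""
    return n / pad_to_multiple(n, multiple)


if __name__ == "__main__":
    for batch in (100, 128, 129, 192):
        print(batch, pad_to_multiple(batch), f"{utilization(batch):.2f}")
```

A batch of 129 pads to 256 and wastes almost half of the second tile, which is why batch sizes just above a multiple of 128 are worth avoiding.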
NVIDIA H100: FP8 Quantization
Hopper Architecture Support
| Format | Bits | Layout | Speed vs FP32 |
|---|---|---|---|
| FP32 | 32 | S[1] E[8] M[23] | 1× |
| TF32 | 19 (stored in 32) | S[1] E[8] M[10] | 1.5× |
| FP16 | 16 | S[1] E[5] M[10] | 2× |
| FP8 (E4M3) | 8 | S[1] E[4] M[3] | 4× |
| FP8 (E5M2) | 8 | S[1] E[5] M[2] | 4× |
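To make the S[1] E[4] M[3] layout concrete, here is a small decoder for one FP8 byte. It assumes the E4M3 variant used for FP8 on Hopper-class hardware (exponent bias 7, no infinities, a single NaN pattern per sign); the function name `decode_e4m3` is hypothetical.

```python
import math


def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 byte: 1 sign bit, 4 exponent bits, 3 mantissa bits."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan                       # the only NaN encoding (per sign)
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6  # subnormal: no implicit leading 1
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


if __name__ == "__main__":
    print(decode_e4m3(0x38))  # 1.0
    print(decode_e4m3(0x7E))  # 448.0, the largest finite value
    print(decode_e4m3(0x01))  # 0.001953125, the smallest subnormal (2**-9)
```

The largest finite value is only 448, so tensors are normally multiplied by a scale factor before being cast to FP8.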
H100 Workflow
1. Train on H100 (FP32 with AMP)
→ Automatic Mixed Precision (torch.autocast) runs matmuls in FP16/BF16
→ Remaining FP32 matmuls can use TF32 Tensor Cores
→ 3-4× faster than pure FP32
2. Fine-tune with FP8
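→ Use NVIDIA Transformer Engine: build the model from TE layers (e.g. te.Linear) and run the forward pass under transformer_engine.pytorch.fp8_autocast(enabled=True)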
→ 5-10 extra epochs
3. Deploy
→ Export to an FP8 inference engine (e.g. TensorRT or TensorRT-LLM), or keep serving the Transformer Engine model under fp8_autocast
→ 4× smaller weights than FP32
→ Up to 4× throughput vs FP32
Comparison: Real Hardware
| Chip | Quantization | Accuracy Loss | Speedup | Power |
|---|---|---|---|---|
| Apple A17 Pro | INT8 | 0.3-0.5% | 4× | 2W |
| Google TPU v4 | INT8 + BF16 | 0.1-0.3% | 2-4× | ~170W mean (192W max) |
| NVIDIA H100 | FP8 | 0.5-1% | 4× | 700W |
Why different quantizations?
- Apple: Mobile power budget → INT8 only
- Google: Data center scale → BF16 for training, INT8 for inference
- NVIDIA: Flexible workloads → FP8 (new, hybrid approach)
Key insight: Production quantization is not just a math operation; it is a chip-specific optimization aligned with the hardware's compute units (Tensor Cores, systolic arrays, MAC units).
Next Phase: Memory & Bandwidth
You now understand precision. Days 16-20 cover memory, which is often the real bottleneck for AI chips: how to design memory hierarchies, how high-bandwidth memory (HBM) works, and how to use roofline models.