Constraint: 2W Power Budget
iPhones are battery-powered. ANE must fit in 2W sustained.
A17 Pro power breakdown:
- CPU cores: 3W
- GPU: 4W
- Neural Engine: 2W ← Hard constraint
- Memory/control: 1W
- Total: 10W (max, brief)
ANE Architecture
- 16×16 systolic array: 256 MACs per core
- 16 neural cores: 4,096 total MACs (parallelism via batching)
- Precision: INT8 only (no FP32, no training)
- Memory: 512 KB shared SRAM per core
- Peak: 17 TOPS @ 2 GHz (16 cores × 256 MACs × 2 GHz / 1000)
Why 16×16?
Design tradeoff analysis:
| Size | MACs | Area | Power (est.) | Fit in A17? |
|---|---|---|---|---|
| 8×8 | 64 | 0.1 mm² | 0.5W | Yes (too small) |
| 16×16 | 256 | 0.4 mm² | 2W | Yes (perfect) |
| 32×32 | 1,024 | 1.6 mm² | 8W | No (too hot) |
Key Design Choices
1. Shared SRAM (not HBM)
iPhone doesn't have room for stacked memory. Shares LPDDR5X with CPU/GPU.
LPDDR5X: 120 GB/s bandwidth
A17 peak compute: 17 TOPS = 17 × 10^12 FLOPs/s
Arithmetic intensity threshold: 17 × 10^12 / (120 × 10^9) = 142 FLOP/byte
For 16×16 arrays (AI = 64 FLOP/byte):
→ Memory-bound (roofline limited)
Mitigation: 512 KB per-core SRAM (1-layer working buffer)
2. INT8 Only (No Training)
Training happens offline on GPU. A17 ANE runs inference only.
- Integer multiply: 2× faster hardware than FP
- 2× power savings
- No loss scaling needed
- Quantized models: Core ML format with scale/zero-point
3. Per-Core Autonomy
16 cores can run independent 16×16 systolic ops in parallel:
Core 0: Inference on layer 0 (16×16)
Core 1: Inference on layer 1 (16×16)
...
Core 15: Inference on layer 15 (16×16)
All cores share: L1 fetch queue, LPDDR5X bus
Benefit: Reduces per-core power (smaller clock domains)
Real Performance: Vision Models
- MobileNetV3: 10-15 ms (17 TOPS sustained)
- EfficientNet-B0: 12-18 ms
- Vision Transformer (ViT): 50-100 ms (attention bottleneck)
vs TPU vs H100
| Chip | Peak TOPS | Power | TOPS/W | Context |
|---|---|---|---|---|
| A17 ANE | 17 | 2W | 8.5 | Inference, mobile |
| TPU v4 | 430 | 200W | 2.15 | Training, datacenter |
| H100 | 1,450 | 700W | 2.07 | All workloads |
Day 22: Google TPU v4 — systolic at datacenter scale.