Day 21

Apple Neural Engine A17 Pro

An inference-only systolic array for the iPhone 15 Pro, designed for power efficiency rather than peak throughput. Key themes: INT8 arithmetic, shared memory, and thermal design.

Constraint: 2W Power Budget

iPhones are battery-powered, so the ANE must fit within a sustained 2W budget.

A17 Pro power breakdown:
- CPU cores: 3W
- GPU: 4W
- Neural Engine: 2W ← Hard constraint
- Memory/control: 1W
- Total: 10W (max, brief)
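The breakdown above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the wattages quoted in this section; the dictionary keys and function names are illustrative, not from any Apple tooling.

```python
# Power figures quoted in the breakdown above, in watts.
POWER_BUDGET_W = {
    "cpu_cores": 3.0,
    "gpu": 4.0,
    "neural_engine": 2.0,   # the hard constraint for the ANE
    "memory_control": 1.0,
}


def total_power(budget):
    """Sum of all block budgets in watts."""
    return sum(budget.values())


def share(budget, block):
    """Fraction of the total budget assigned to one block."""
    return budget[block] / total_power(budget)


if __name__ == "__main__":
    print(f"total: {total_power(POWER_BUDGET_W):.1f} W")
    print(f"ANE share: {share(POWER_BUDGET_W, 'neural_engine'):.0%}")
```

Under these figures the Neural Engine gets one fifth of the chip's brief 10W peak.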

ANE Architecture

Why 16×16?

Design tradeoff analysis:

Size | MACs | Area | Power (est.) | Fit in A17?
8×8 | 64 | 0.1 mm² | 0.5W | Yes (too small)
16×16 | 256 | 0.4 mm² | 2W | Yes (perfect)
32×32 | 1,024 | 1.6 mm² | 8W | No (too hot)
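The rows of the tradeoff table are consistent with area and power growing linearly with the MAC count, i.e. quadratically with the array dimension. The sketch below reproduces them under that assumption. The per-MAC constants are back-derived from the 16×16 row, not taken from Apple data, and `estimate` is a hypothetical helper.

```python
# Assumption: area and power scale linearly with the number of MAC units.
# Constants are back-derived from the 16x16 row of the table.
REF_MACS = 16 * 16
AREA_PER_MAC_MM2 = 0.4 / REF_MACS
POWER_PER_MAC_W = 2.0 / REF_MACS
POWER_LIMIT_W = 2.0  # ANE budget from the constraint section


def estimate(n):
    """Return (macs, area_mm2, power_w, fits) for an n x n array."""
    macs = n * n
    area = macs * AREA_PER_MAC_MM2
    power = macs * POWER_PER_MAC_W
    return macs, area, power, power <= POWER_LIMIT_W


for n in (8, 16, 32):
    macs, area, power, fits = estimate(n)
    print(f"{n}x{n}: {macs} MACs, {area:.1f} mm^2, {power:.1f} W, fits={fits}")
```

Doubling the array edge quadruples the MAC count, which is why 32×32 lands at four times the 2W budget.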

Key Design Choices

1. Shared Memory (not HBM)

An iPhone has no room for stacked memory, so the ANE shares LPDDR5X with the CPU and GPU.

LPDDR5X: 120 GB/s bandwidth
A17 peak compute: 17 TOPS = 17 × 10^12 ops/s (INT8 operations, not FLOPs)
Arithmetic intensity threshold: 17 × 10^12 / (120 × 10^9) ≈ 142 ops/byte
For 16×16 arrays (AI = 64 ops/byte):
→ Memory-bound (roofline limited)
Mitigation: 512 KB per-core SRAM (1-layer working buffer)
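The roofline argument can be sketched in a few lines. This uses only the bandwidth, peak compute, and arithmetic intensity figures quoted above; the function names are illustrative.

```python
PEAK_OPS_PER_S = 17e12        # 17 TOPS, as quoted above
MEM_BW_BYTES_PER_S = 120e9    # 120 GB/s, as quoted above


def ridge_point(peak_ops, bandwidth):
    """Arithmetic intensity (ops/byte) where compute and memory limits meet."""
    return peak_ops / bandwidth


def attainable_ops(intensity, peak_ops=PEAK_OPS_PER_S,
                   bandwidth=MEM_BW_BYTES_PER_S):
    """Roofline model: the lower of peak compute and bandwidth * intensity."""
    return min(peak_ops, bandwidth * intensity)


ridge = ridge_point(PEAK_OPS_PER_S, MEM_BW_BYTES_PER_S)
tile_intensity = 64.0  # ops/byte figure the text gives for a 16x16 array
achieved = attainable_ops(tile_intensity)

print(f"ridge point: {ridge:.1f} ops/byte")
print(f"attainable at {tile_intensity:.0f} ops/byte: {achieved / 1e12:.2f} TOPS")
print("memory-bound" if tile_intensity < ridge else "compute-bound")
```

With these inputs the model caps out near 7.7 TOPS when operands stream from DRAM, under half the quoted peak. That gap is what the per-core SRAM buffer is meant to close, by raising the effective intensity through on-chip reuse.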

2. INT8 Only (No Training)

Training happens offline on GPUs; the A17 ANE runs inference only.

  • Integer multiply: roughly 2× faster in hardware than FP
  • 2× power savings
  • No loss scaling needed
  • Quantized models: Core ML format with scale/zero-point
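The scale/zero-point scheme mentioned in the last bullet can be illustrated with generic affine INT8 quantization. This is a minimal pure-Python sketch, not the Core ML API. The helper names are hypothetical, and a real toolchain would typically calibrate ranges per tensor or per channel.

```python
def quant_params(x_min, x_max, qmin=-128, qmax=127):
    """Derive (scale, zero_point) mapping [x_min, x_max] onto the int8 range."""
    # The representable range must include 0 so that 0.0 quantizes exactly.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:
        return 1.0, 0
    zero_point = round(qmin - x_min / scale)
    return scale, max(qmin, min(qmax, zero_point))


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """real -> int8: q = clamp(round(x / scale) + zero_point)."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """int8 -> real: x ~= (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in q_values]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
scale, zp = quant_params(min(weights), max(weights))
q = quantize(weights, scale, zp)
restored = dequantize(q, scale, zp)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"scale={scale:.5f} zero_point={zp} max_error={max_err:.5f}")
```

The round-trip error stays on the order of half a quantization step, which is the accuracy cost traded for integer-only arithmetic.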

3. Per-Core Autonomy

16 cores can run independent 16×16 systolic ops in parallel:

Core 0: Inference on layer 0 (16×16)
Core 1: Inference on layer 1 (16×16)
...
Core 15: Inference on layer 15 (16×16)
All cores share: L1 fetch queue, LPDDR5X bus
Benefit: Reduces per-core power (smaller clock domains)
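To make "independent 16×16 systolic op" concrete, here is a cycle-level sketch of one such operation, repeated for 16 cores. It assumes an output-stationary dataflow, where each processing element accumulates one output value. The text does not specify the ANE's actual dataflow, so treat this as an illustration of the technique only. The cores are simulated one after another rather than in parallel.

```python
import random

N = 16          # array dimension from the text
NUM_CORES = 16  # core count from the text


def systolic_matmul(a, b):
    """Output-stationary systolic multiply; returns (result, cycles).

    Rows of `a` enter from the left and columns of `b` from the top, each
    skewed by one cycle per row/column, so PE (i, j) sees a[i][k] and b[k][j]
    together at cycle t = i + j + k.
    """
    n, k_dim, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]   # one wide accumulator per PE
    cycles = k_dim + n + m - 2          # fill + compute + drain
    for t in range(cycles):
        for i in range(n):
            for j in range(m):
                k = t - i - j
                if 0 <= k < k_dim:
                    acc[i][j] += a[i][k] * b[k][j]
    return acc, cycles


def reference_matmul(a, b):
    """Plain triple-loop matrix multiply used to check the simulation."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def random_int8_matrix(rng, rows, cols):
    return [[rng.randint(-128, 127) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
# One independent 16x16 operation per core.
jobs = [(random_int8_matrix(rng, N, N), random_int8_matrix(rng, N, N))
        for _ in range(NUM_CORES)]
results = [systolic_matmul(a, b) for a, b in jobs]
all_match = all(out == reference_matmul(a, b)
                for (a, b), (out, _) in zip(jobs, results))
print(f"all cores match reference: {all_match}")
print(f"cycles per 16x16 op: {results[0][1]}")
```

The skewed schedule is why a 16×16×16 multiply takes more than 16 cycles: the array needs extra cycles to fill and drain, which is amortized only when tiles are streamed back to back.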

Real Performance: Vision Models

  • MobileNetV3: 10-15 ms (17 TOPS sustained)
  • EfficientNet-B0: 12-18 ms
  • Vision Transformer (ViT): 50-100 ms (attention bottleneck)

A17 ANE vs TPU v4 vs H100

Chip | Peak TOPS | Power | TOPS/W | Context
A17 ANE | 17 | 2W | 8.5 | Inference, mobile
TPU v4 | 430 | 200W | 2.15 | Training, datacenter
H100 | 1,450 | 700W | 2.07 | All workloads
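The TOPS/W column is simply peak throughput divided by power. A short sketch using only the figures from the table (the names are illustrative):

```python
# (peak TOPS, power in watts), as listed in the comparison table.
CHIPS = {
    "A17 ANE": (17, 2),
    "TPU v4": (430, 200),
    "H100": (1450, 700),
}


def tops_per_watt(peak_tops, power_w):
    """Efficiency metric used in the table."""
    return peak_tops / power_w


efficiency = {name: tops_per_watt(*spec) for name, spec in CHIPS.items()}
for name, value in sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.2f} TOPS/W")
```

Note that the comparison mixes peak figures across different precisions and workloads, so the ratio is best read as a rough indicator of design priorities rather than a like-for-like benchmark.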

Day 22: Google TPU v4 — systolic at datacenter scale.