AI Chip Design Day 21

Constraint: 2W Power Budget

iPhones are battery-powered, so the Apple Neural Engine (ANE) must fit within a sustained 2W budget.

A17 Pro power breakdown:

- CPU cores: 3W
- GPU: 4W
- Neural Engine: 2W ← Hard constraint
- Memory/control: 1W
- Total: 10W (max, brief)

ANE Architecture

16×16 systolic array: 256 MACs per core
16 neural cores: 4,096 total MACs (parallelism via batching)
Precision: INT8 only (no FP32, no training)
Memory: 512 KB shared SRAM per core
Peak: 17 TOPS @ 2 GHz (16 cores × 256 MACs × 2 ops per MAC × 2 GHz / 1000 ≈ 16.4 TOPS, rounded up to 17)
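The peak figure can be checked with a few lines of arithmetic. This is a minimal sketch that assumes each MAC counts as two operations (one multiply, one accumulate), which is the usual TOPS convention; the function name `peak_tops` is illustrative and not taken from any Apple tooling.

```python
def peak_tops(cores: int, macs_per_core: int, clock_hz: float, ops_per_mac: int = 2) -> float:
    """Peak throughput in tera-operations per second (TOPS)."""
    return cores * macs_per_core * ops_per_mac * clock_hz / 1e12


MACS_PER_CORE = 16 * 16  # one 16x16 systolic array per core

# Assumption: 2 ops per MAC (multiply + accumulate).
print(f"{peak_tops(16, MACS_PER_CORE, 2e9):.2f} TOPS")
# Counting each MAC as a single operation halves the figure.
print(f"{peak_tops(16, MACS_PER_CORE, 2e9, ops_per_mac=1):.2f} TMAC/s")
```

Under that assumption the result lands at about 16.4 TOPS, close to the quoted 17.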

Why 16×16?

Design tradeoff analysis:

Size	MACs	Area	Power (est.)	Fit in A17?
8×8	64	0.1 mm²	0.5W	Yes (too small)
16×16	256	0.4 mm²	2W	Yes (perfect)
32×32	1,024	1.6 mm²	8W	No (too hot)
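The table follows a simple pattern: MAC count grows with the square of the array side, and the area and power estimates scale linearly with MAC count. The sketch below reproduces the rows under that assumption, taking the per-MAC area and power from the 8×8 row. The constant and function names are hypothetical, and real silicon would not scale this cleanly.

```python
# Per-MAC costs derived from the 8x8 row of the table (assumed to scale linearly).
AREA_PER_MAC_MM2 = 0.1 / 64
POWER_PER_MAC_W = 0.5 / 64
POWER_BUDGET_W = 2.0  # the Neural Engine budget stated earlier


def array_estimate(n: int) -> tuple[int, float, float]:
    """Return (MACs, area in mm^2, power in W) for an n x n systolic array."""
    macs = n * n
    return macs, macs * AREA_PER_MAC_MM2, macs * POWER_PER_MAC_W


def fits_budget(n: int) -> bool:
    """True if the estimated power stays within the 2W budget."""
    return array_estimate(n)[2] <= POWER_BUDGET_W


for n in (8, 16, 32):
    macs, area, power = array_estimate(n)
    print(f"{n}x{n}: {macs} MACs, {area:.1f} mm^2, {power:.1f} W, fits={fits_budget(n)}")
```

Doubling the side quadruples every cost, which is why 32×32 overshoots the budget by 4× rather than 2×.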

Key Design Choices

1. Shared SRAM (not HBM)

An iPhone has no room for stacked memory, so the ANE shares LPDDR5X with the CPU and GPU.

LPDDR5X bandwidth: 120 GB/s
A17 peak compute: 17 TOPS = 17 × 10^12 ops/s
Arithmetic intensity threshold: 17 × 10^12 / (120 × 10^9) ≈ 142 ops/byte
For 16×16 arrays (AI = 64 ops/byte): 64 < 142 → memory-bound (roofline limited)
Mitigation: 512 KB per-core SRAM (1-layer working buffer)
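The roofline reasoning above can be sketched numerically. The code uses only the figures from the text (17 TOPS peak, 120 GB/s, arithmetic intensity of 64 per byte); the function names are illustrative.

```python
PEAK_OPS_PER_S = 17e12         # 17 TOPS
BANDWIDTH_BYTES_PER_S = 120e9  # 120 GB/s LPDDR5X


def ridge_point(peak: float, bandwidth: float) -> float:
    """Arithmetic intensity (ops/byte) at which compute and memory limits meet."""
    return peak / bandwidth


def attainable_ops_per_s(intensity: float, peak: float, bandwidth: float) -> float:
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak, intensity * bandwidth)


ridge = ridge_point(PEAK_OPS_PER_S, BANDWIDTH_BYTES_PER_S)
attainable = attainable_ops_per_s(64, PEAK_OPS_PER_S, BANDWIDTH_BYTES_PER_S)

print(f"ridge point: {ridge:.0f} ops/byte")
print(f"attainable at AI=64: {attainable / 1e12:.2f} TOPS "
      f"({100 * attainable / PEAK_OPS_PER_S:.0f}% of peak)")
```

At an intensity of 64 per byte, DRAM traffic alone caps throughput at 7.68 TOPS, under half of peak, which is the gap the per-core SRAM is meant to close by keeping a layer's working set on chip.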

2. INT8 Only (No Training)

Training happens offline on GPU. A17 ANE runs inference only.

Integer multiply: 2× faster hardware than FP
2× power savings
No loss scaling needed
Quantized models: Core ML format with scale/zero-point
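To make the scale/zero-point idea concrete, here is a minimal sketch of standard affine INT8 quantization in plain Python. It illustrates the general scheme only; it is not Core ML's actual implementation, and all function names are hypothetical.

```python
def choose_qparams(x_min: float, x_max: float, qmin: int = -128, qmax: int = 127):
    """Pick scale and zero-point so [x_min, x_max] maps onto the INT8 range."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)  # range must contain 0.0
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:  # degenerate all-zero tensor
        scale = 1.0
    zero_point = int(round(qmin - x_min / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Real values -> INT8 codes: q = clamp(round(x / scale) + zero_point)."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(codes, scale, zero_point):
    """INT8 codes -> approximate real values: x ~= scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in codes]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
scale, zero_point = choose_qparams(min(weights), max(weights))
codes = quantize(weights, scale, zero_point)
restored = dequantize(codes, scale, zero_point)
print(codes)
print([round(w, 3) for w in restored])
```

Each restored value differs from the original by at most about one quantization step (`scale`), and 0.0 is represented exactly, which matters for zero padding.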

3. Per-Core Autonomy

16 cores can run independent 16×16 systolic ops in parallel:

Core 0: Inference on layer 0 (16×16)
Core 1: Inference on layer 1 (16×16)
...
Core 15: Inference on layer 15 (16×16)

All cores share: L1 fetch queue, LPDDR5X bus
Benefit: Reduces per-core power (smaller clock domains)
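One way to picture independent 16×16 operations running in parallel is to cut a larger output matrix into 16×16 tiles and hand them to the cores round-robin. The sketch below does exactly that. It is a hypothetical scheduler for illustration only: how the ANE actually assigns work is not described here, and the names `assign_tiles`, `TILE`, and `NUM_CORES` are assumptions.

```python
from math import ceil

TILE = 16       # systolic array side
NUM_CORES = 16  # neural cores


def assign_tiles(rows: int, cols: int, tile: int = TILE, num_cores: int = NUM_CORES):
    """Map each (tile_row, tile_col) of a rows x cols output to a core, round-robin."""
    schedule = {core: [] for core in range(num_cores)}
    tiles_per_row = ceil(cols / tile)
    total_tiles = ceil(rows / tile) * tiles_per_row
    for idx in range(total_tiles):
        schedule[idx % num_cores].append(divmod(idx, tiles_per_row))
    return schedule


# Example: a 64 x 100 output needs 4 x 7 = 28 tiles.
schedule = assign_tiles(64, 100)
rounds = max(len(tiles) for tiles in schedule.values())
print(f"tiles: {sum(len(t) for t in schedule.values())}, rounds: {rounds}")
print(f"core 0 handles: {schedule[0]}")
```

With 28 tiles on 16 cores the work takes two rounds, and four cores sit idle in the second one, so output shapes that are not multiples of 16 × 16 tiles waste some of the peak throughput.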

Real Performance: Vision Models

MobileNetV3: 10-15 ms (17 TOPS sustained)
EfficientNet-B0: 12-18 ms
Vision Transformer (ViT): 50-100 ms (attention bottleneck)

ANE vs TPU vs H100

Chip	Peak TOPS	Power	TOPS/W	Context
A17 ANE	17	2W	8.5	Inference, mobile
TPU v4	430	200W	2.15	Training, datacenter
H100	1,450	700W	2.07	All workloads
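The TOPS/W column is just peak throughput divided by power. A short sketch that recomputes it from the table's own figures, taken as given (the chips are rated at different precisions and for different workloads, so the ratio is a rough comparison at best):

```python
# (peak TOPS, power in W), copied from the table above.
chips = {
    "A17 ANE": (17, 2),
    "TPU v4": (430, 200),
    "H100": (1450, 700),
}


def tops_per_watt(peak_tops: float, power_w: float) -> float:
    """Efficiency as peak tera-operations per second per watt."""
    return peak_tops / power_w


efficiency = {name: tops_per_watt(*spec) for name, spec in chips.items()}
for name, value in efficiency.items():
    print(f"{name}: {value:.2f} TOPS/W")
```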

Day 22: Google TPU v4 — systolic at datacenter scale.

Apple Neural Engine A17 Pro