AI Chip Design Day 25

Mobile NPU Landscape

OEM	Name	Peak TOPS	Power	Use Case
Apple	Neural Engine	17	2W	iPhone (A11 and later)
Qualcomm	Hexagon DSP	4-8	0.5-1W	Android flagship
MediaTek	APU (AI Processor)	2-4	0.3-0.5W	Budget/mid-range
Samsung	NPU (Exynos)	1-2	0.2W	Galaxy A/M series
Huawei	Da Vinci (Kirin)	8	1W	Restricted (sanctions)

Qualcomm Hexagon DSP

Digital Signal Processor, not systolic

Architecture

128-byte-wide (1024-bit) SIMD vector unit
Integer + floating-point (FP32, FP16)
No special matrix hardware (generic DSP)
Integrated in Snapdragon (same die as CPU/GPU)
Power: ~1W peak, <100 mW idle
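
To make the wide SIMD unit listed above concrete, here is a minimal sketch of how a vector unit walks through an INT8 dot product one vector at a time. It assumes 128 lanes of one 8-bit element each; `LANES` and `simd_dot_int8` are hypothetical names for illustration, not part of any Qualcomm SDK, and the inner `sum` stands in for what the hardware would do in a single vector multiply-accumulate.

```python
LANES = 128  # assumption: 128 INT8 elements per vector operation


def simd_dot_int8(a, b, lanes=LANES):
    """Dot product processed in chunks of `lanes` elements.

    Returns (result, vector_steps). Each loop iteration models one
    vector multiply-accumulate; a real DSP would do it in hardware.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    acc = 0          # wide accumulator, so INT8 products cannot overflow
    vector_steps = 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        acc += sum(x * y for x, y in zip(chunk_a, chunk_b))
        vector_steps += 1
    return acc, vector_steps


if __name__ == "__main__":
    result, steps = simd_dot_int8([1] * 300, [2] * 300)
    print(result, steps)
```

The step count is simply the vector length divided by the lane count, rounded up. The last chunk may be only partly filled, which is the SIMD version of under-utilization.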

Why Not Systolic?

Systolic arrays assume:
- Large, regular matrix multiplies
- Batch processing

Mobile use:
- Small models (MobileNetV3, pruned ResNet-50)
- One image at a time (batch=1)
- Variable layer sizes
- Tight latency budget (<10 ms)

Result: a SIMD DSP is more flexible, even if it delivers less peak throughput.
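
The batch=1 argument can be put into rough numbers with a first-order utilization model. This is only a sketch under stated assumptions: a weight-stationary array, one weight tile loaded at a time, and a fill/drain cost of rows + cols - 2 cycles per tile. The function name `systolic_utilization` and the 128×128 default array size are hypothetical choices for illustration, not figures from any specific chip.

```python
import math


def systolic_utilization(m, k, n, rows=128, cols=128):
    """Fraction of MAC slots doing useful work for an (m x k) @ (k x n) multiply.

    Assumed model: the k x n weight matrix is cut into tiles of rows x cols;
    each tile streams m input rows and pays rows + cols - 2 cycles of
    pipeline fill and drain.
    """
    tiles = math.ceil(k / rows) * math.ceil(n / cols)
    cycles = tiles * (m + rows + cols - 2)
    useful_macs = m * k * n
    return useful_macs / (rows * cols * cycles)


if __name__ == "__main__":
    for batch_rows in (1, 16, 1024):
        u = systolic_utilization(batch_rows, 128, 128)
        print(f"m={batch_rows:5d}  utilization={u:.1%}")
```

Under this model a single input row keeps the array well under 1% busy, while about a thousand rows push it to roughly 80%. The real numbers depend on the dataflow and on how convolutions are mapped to matrix multiplies, so treat the output as an intuition aid only.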

Power Budget Reality

Smartphone power consumption (active use):
- Screen: 2-3W
- CPU: 1-2W
- GPU: 2-3W
- Modem: 0.5W
- NPU: 0.2-1W ← This is the constraint!

Battery capacity: 3,000-4,000 mAh (about 11-15 Wh at a nominal 3.7-3.85 V)
Target endurance: 10+ hours

NPU for facial recognition: ~10 ms per frame @ 30 fps
→ 30% duty cycle, so ~0.3W average at ~1W peak if running continuously
→ roughly 3-5% of the 5.7-9.5W active total listed above (acceptable)
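
The duty-cycle arithmetic behind the facial recognition estimate can be sketched as follows. The 1 W peak comes from the Hexagon figure earlier in these notes; the 3.7 V nominal cell voltage and all function names are assumptions made for illustration.

```python
def duty_cycle(latency_ms, fps):
    """Fraction of each second the NPU is busy (capped at 1.0)."""
    return min(1.0, latency_ms * fps / 1000.0)


def average_power_w(peak_w, latency_ms, fps, idle_w=0.0):
    """Average power: peak while busy, idle_w (power-gated) otherwise."""
    d = duty_cycle(latency_ms, fps)
    return peak_w * d + idle_w * (1.0 - d)


def battery_wh(capacity_mah, voltage_v=3.7):
    """Battery energy in watt-hours; 3.7 V nominal is an assumption."""
    return capacity_mah / 1000.0 * voltage_v


def share_of_total(part_w, total_w):
    """Part of the power budget taken by one component."""
    return part_w / total_w


if __name__ == "__main__":
    npu_avg = average_power_w(peak_w=1.0, latency_ms=10, fps=30)
    print(f"NPU average power: {npu_avg:.2f} W")
    print(f"Battery energy (3,000 mAh): {battery_wh(3000):.1f} Wh")
    # Low and high sums of the component ranges listed above.
    for total in (5.7, 9.5):
        print(f"Share of {total} W total: {share_of_total(npu_avg, total):.1%}")
```

The share comes out in the low single-digit percent range whichever end of the component ranges is summed, which is why continuous face detection is affordable while a continuously busy GPU is not.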

Real Mobile AI Workloads

Common Use Cases

Face recognition: 10-20 ms (MobileNetV2 backbone)
Object detection: 50-100 ms (SSD-MobileNet)
Scene understanding: 100-200 ms (semantic segmentation)
Speech recognition: Real-time (DSP or CPU)
Generative AI: Not yet (<100M param models only)

Model Sizes

Model	Params	Size (FP32)	Size (INT8)	Device
MobileNetV3	5.4M	22 MB	5.4 MB	Any phone
ResNet-50	25M	100 MB	25 MB	Flagship
BERT-base	110M	440 MB	110 MB	Rare (storage)
LLaMA-7B	7B	28 GB	7 GB	Not feasible
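
Weight storage follows directly from parameter count times bytes per parameter, so the size of a model depends heavily on the assumed precision. The sketch below counts weights only (no activations, no file-format overhead) and uses decimal units; the helper names are hypothetical.

```python
# Bytes per parameter for common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_size_bytes(params, precision):
    """Storage for the weights alone at the given precision."""
    return params * BYTES_PER_PARAM[precision]


def human(n_bytes):
    """Format a byte count using decimal units (1 MB = 10**6 bytes)."""
    for unit in ("B", "KB", "MB", "GB"):
        if n_bytes < 1000 or unit == "GB":
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1000


if __name__ == "__main__":
    models = {
        "MobileNetV3": 5_400_000,
        "ResNet-50": 25_000_000,
        "BERT-base": 110_000_000,
        "LLaMA-7B": 7_000_000_000,
    }
    for name, params in models.items():
        sizes = ", ".join(
            f"{p}: {human(weight_size_bytes(params, p))}"
            for p in ("fp32", "int8", "int4")
        )
        print(f"{name:12s} {sizes}")
```

Running it makes the precision dependence explicit: a 7B-parameter model needs about 7 GB at INT8 and about 3.5 GB at INT4, either of which is a large share of a phone's RAM.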

Integration: SoC Perspective

Mobile NPUs are on the same chip as CPU/GPU, sharing memory and power rails:

Reduced latency (no external I/O)
Shared HBM? No (size + cost constraints); mobile SoCs share LPDDR DRAM instead
Shared cache? Partial (L3 sometimes shared)
Power gating: All NPU components can be disabled when idle

Day 26: Practical design: building a simple 4×4 systolic MAC in Verilog. From theory to HDL.

Mobile Accelerators