Mobile NPU Landscape
| OEM | NPU | Peak TOPS | Power (W) | Devices / availability |
|---|---|---|---|---|
| Apple | Neural Engine | 17 | 2 | iPhone (A11 Bionic and later) |
| Qualcomm | Hexagon DSP | 4-8 | 0.5-1 | Android flagship |
| MediaTek | APU (AI Processing Unit) | 2-4 | 0.3-0.5 | Budget/mid-range |
| Samsung | NPU (Exynos) | 1-2 | 0.2 | Galaxy A/M series |
| Huawei | Da Vinci (Kirin) | 8 | 1 | Restricted (sanctions) |
Qualcomm Hexagon DSP
Hexagon is a digital signal processor (DSP), not a systolic array.
Architecture
- 128-wide SIMD (vector unit)
- Integer + floating-point (FP32, FP16)
- No special matrix hardware (generic DSP)
- Integrated in Snapdragon (same die as CPU/GPU)
- Power: ~1W peak, <100 mW idle
Why Not Systolic?
Systolic arrays assume:
- Large, regular matrix multiplies
- Batch processing
Mobile use:
- Small models (MobileNetV3, pruned ResNet-50)
- One image at a time (batch=1)
- Variable layer sizes
- Tight latency budget (<10 ms)
Result: a SIMD DSP is more flexible, even though it delivers less peak throughput.
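The batch=1 argument can be made concrete with a rough utilization estimate. The sketch below is a simplified model, not a description of any real accelerator: it assumes a hypothetical square array of `array_dim` x `array_dim` processing elements, maps output rows and columns onto the array, and counts padded tile slots as wasted work. The function name and the layer shape (a 1280-to-1000 fully connected layer) are illustrative assumptions.

```python
import math


def systolic_utilization(m: int, k: int, n: int, array_dim: int) -> float:
    """Estimate the fraction of PE work that is useful for (m x k) @ (k x n).

    Assumed mapping: output rows (m) and output columns (n) are tiled onto an
    array_dim x array_dim array; partially filled tiles still occupy the
    whole array for k cycles.
    """
    tiles_m = math.ceil(m / array_dim)
    tiles_n = math.ceil(n / array_dim)
    useful_macs = m * k * n
    occupied_slots = (tiles_m * array_dim) * (tiles_n * array_dim) * k
    return useful_macs / occupied_slots


# Same layer, batch of 1 versus batch of 128, on a hypothetical 128x128 array.
for batch in (1, 128):
    util = systolic_utilization(m=batch, k=1280, n=1000, array_dim=128)
    print(f"batch={batch:>3}: utilization = {util:.1%}")
```

Under these assumptions a single input leaves almost all rows of the array idle, while a full batch keeps it nearly saturated. Convolution layers fare better than this at batch=1 because the spatial positions of the feature map fill the row dimension, so treat the number as a worst-case illustration.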
Power Budget Reality
Smartphone power consumption (active use):
- Screen: 2-3W
- CPU: 1-2W
- GPU: 2-3W
- Modem: 0.5W
- NPU: 0.2-1W ← This is the constraint!
Battery capacity: 3,000-4,000 mAh (10-15 Wh)
Target endurance: 10+ hours
NPU for facial recognition: ~10 ms per frame @ 30 fps
→ 30% duty cycle, so ~0.3W average at ~1W peak if running continuously
→ roughly 3-5% of the ~6-9.5W active-use total listed above (acceptable)
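The duty-cycle arithmetic behind this estimate can be written out as a minimal sketch. It assumes the NPU draws about 1W while active and is fully power-gated between frames (`idle_w=0.0`); the component ranges are copied from the list above, and the function names are made up for illustration. The resulting share depends on which total is assumed.

```python
def duty_cycle(latency_ms: float, fps: float) -> float:
    """Fraction of each second the NPU spends computing."""
    duty = latency_ms * fps / 1000.0
    if duty > 1.0:
        raise ValueError("inference is too slow to sustain this frame rate")
    return duty


def average_power_w(latency_ms: float, fps: float,
                    active_w: float, idle_w: float = 0.0) -> float:
    """Time-weighted average of active and idle power."""
    duty = duty_cycle(latency_ms, fps)
    return duty * active_w + (1.0 - duty) * idle_w


# (low, high) active-use power in watts, taken from the list above.
BUDGET_W = {
    "screen": (2.0, 3.0),
    "cpu": (1.0, 2.0),
    "gpu": (2.0, 3.0),
    "modem": (0.5, 0.5),
    "npu": (0.2, 1.0),
}


def budget_share(npu_avg_w: float) -> tuple[float, float]:
    """Return the NPU share against the high and the low budget totals."""
    low_total = sum(low for low, _ in BUDGET_W.values())
    high_total = sum(high for _, high in BUDGET_W.values())
    return npu_avg_w / high_total, npu_avg_w / low_total


avg = average_power_w(latency_ms=10, fps=30, active_w=1.0)
share_min, share_max = budget_share(avg)
print(f"average NPU power: {avg:.2f} W")
print(f"share of active-use power: {share_min:.1%} to {share_max:.1%}")
```

Passing a non-zero `idle_w` (for example 0.1 for the <100 mW idle figure) shows how much imperfect power gating adds to the average.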
Real Mobile AI Workloads
Common Use Cases
- Face recognition: 10-20 ms (MobileNetV2 backbone)
- Object detection: 50-100 ms (SSD-MobileNet)
- Scene understanding: 100-200 ms (semantic segmentation)
- Speech recognition: Real-time (DSP or CPU)
- Generative AI: Not yet (<100M param models only)
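To relate these latencies to a camera pipeline, the sketch below converts per-inference latency into a maximum sustainable frame rate and checks it against a 30 fps budget. The latency ranges are the ones listed above; the 30 fps target and the helper names are assumptions for illustration.

```python
def max_fps(latency_ms: float) -> float:
    """Highest frame rate sustainable if inferences run back to back."""
    return 1000.0 / latency_ms


def fits_frame_budget(latency_ms: float, target_fps: float) -> bool:
    """True if one inference finishes within one frame period."""
    return latency_ms <= 1000.0 / target_fps


# (best case, worst case) latency in milliseconds, from the list above.
WORKLOADS_MS = {
    "face recognition": (10, 20),
    "object detection": (50, 100),
    "scene understanding": (100, 200),
}

for name, (best_ms, worst_ms) in WORKLOADS_MS.items():
    verdict = "every frame" if fits_frame_budget(worst_ms, 30) else "must skip frames"
    print(f"{name}: {max_fps(worst_ms):.0f}-{max_fps(best_ms):.0f} fps ({verdict} at 30 fps)")
```

Only the face-recognition range fits inside a 33 ms frame period; the heavier workloads would need to run on a subset of frames or at a lower resolution.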
Model Sizes
| Model | Params | Size (INT8, 1 byte/param) | Device |
|---|---|---|---|
| MobileNetV3 | 5.4M | ~5.4 MB | Any phone |
| ResNet-50 | 25M | ~25 MB | Flagship |
| BERT-base | 110M | ~110 MB | Rare (storage) |
| LLaMA-7B | 7B | ~7 GB (~3.5 GB at INT4) | Not feasible |
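Weight storage is approximately the parameter count multiplied by the bytes per parameter, so the footprint at any precision can be estimated directly. The sketch below ignores activations, quantization scales, and file-format overhead; the `BYTES_PER_PARAM` table and the function name are illustrative assumptions.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

# Approximate parameter counts from the table above.
MODELS = {
    "MobileNetV3": 5.4e6,
    "ResNet-50": 25e6,
    "BERT-base": 110e6,
    "LLaMA-7B": 7e9,
}


def weight_size_mb(params: float, dtype: str) -> float:
    """Approximate weight storage in megabytes (1 MB = 1e6 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e6


for name, params in MODELS.items():
    sizes = ", ".join(
        f"{dtype}={weight_size_mb(params, dtype):,.1f} MB"
        for dtype in ("fp32", "int8", "int4")
    )
    print(f"{name}: {sizes}")
```

Comparing the columns shows why quantization matters on phones: moving from FP32 to INT8 cuts storage and memory traffic by 4x, and INT4 halves it again.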
Integration: SoC Perspective
Mobile NPUs are on the same chip as CPU/GPU, sharing memory and power rails:
- Reduced latency (no external I/O)
- Shared HBM? No (size + cost constraints); the NPU shares the phone's LPDDR DRAM with the CPU/GPU
- Shared cache? Partial (L3 sometimes shared)
- Power gating: All NPU components can be disabled when idle
Day 26: Practical design: building a simple 4×4 systolic MAC in Verilog. From theory to HDL.