What You Sacrifice for Efficiency
NPU Trade-off #1: Inference Only
Can it train? Most mobile NPUs: NO. They're inference-only machines.
Why? Training needs gradient computation, backpropagation, and more precision than inference (typically FP32, or FP16/BF16 mixed precision). Overkill for mobile.
Result: Apple's A18 can run a 3B-parameter LLM in 50 ms, but it can't fine-tune that model on device.
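To make the "overkill" point concrete, here is a rough back-of-the-envelope sketch of weight memory for the 3B-parameter model above. The byte counts are illustrative assumptions, not figures from any specific chip: INT8 or FP32 weights for inference, and FP32 weights plus gradients plus two Adam optimizer moments for training. Activations and caches are ignored.

```python
# Rough memory estimate. All byte-per-parameter figures are assumptions
# chosen for illustration; real deployments vary.
PARAMS = 3_000_000_000  # the 3B-parameter model mentioned above


def inference_bytes(params: int, bytes_per_weight: int) -> int:
    """Weights only; ignores activations and any caches."""
    return params * bytes_per_weight


def training_bytes(params: int) -> int:
    """Assumed FP32 training with Adam: 16 bytes per parameter."""
    weights = 4 * params            # FP32 weights
    gradients = 4 * params          # FP32 gradients
    adam_moments = 2 * 4 * params   # two FP32 moment estimates
    return weights + gradients + adam_moments


def gigabytes(num_bytes: int) -> float:
    return num_bytes / 1e9


print(f"INT8 inference:     {gigabytes(inference_bytes(PARAMS, 1)):.0f} GB")
print(f"FP32 inference:     {gigabytes(inference_bytes(PARAMS, 4)):.0f} GB")
print(f"FP32 Adam training: {gigabytes(training_bytes(PARAMS)):.0f} GB")
```

Under these assumptions, training needs about 16 times the weight memory of INT8 inference, before counting activations.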
NPU Trade-off #2: Fixed Operations
An NPU is built around essentially one operation: matrix multiplication on a systolic array.
Can't run:
- Custom layers you add later
- Complex branching logic
- General-purpose code
Workaround: Hybrid chip. CPU handles control, NPU handles compute.
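The hybrid split can be sketched as a tiny dispatcher. This is a toy model, not a real NPU runtime: `NPU_SUPPORTED`, `npu_matmul`, `cpu_custom_clip`, and `run` are hypothetical names, and the "NPU" op is plain Python standing in for the hardware.

```python
from typing import Callable, List, Tuple

Matrix = List[List[float]]
Op = Tuple[str, Callable[[Matrix], Matrix]]

# Hypothetical: the only op this toy "NPU" supports natively.
NPU_SUPPORTED = {"matmul"}


def npu_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Stand-in for the NPU's matrix multiply (plain Python here)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def cpu_custom_clip(a: Matrix, limit: float) -> Matrix:
    """A 'custom layer' with branching logic, so it stays on the CPU."""
    return [[limit if v > limit else (-limit if v < -limit else v)
             for v in row] for row in a]


def run(graph: List[Op], x: Matrix) -> Tuple[Matrix, List[Tuple[str, str]]]:
    """Run ops in order, recording which device each one is assigned to."""
    placements = []
    for name, fn in graph:
        device = "npu" if name in NPU_SUPPORTED else "cpu"
        placements.append((name, device))
        x = fn(x)
    return x, placements


weights = [[1.0, 2.0], [3.0, 4.0]]
graph = [
    ("matmul", lambda x: npu_matmul(x, weights)),
    ("custom_clip", lambda x: cpu_custom_clip(x, 5.0)),
]
out, placements = run(graph, [[1.0, 1.0]])
print(out)         # [[4.0, 5.0]]
print(placements)  # [('matmul', 'npu'), ('custom_clip', 'cpu')]
```

The point of the sketch is the placement rule: anything outside the supported set falls back to the CPU, which is why unsupported layers cost extra data movement in practice.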
NPU Trade-off #3: Precision
INT8 quantization costs a little accuracy.
Top-1 accuracy of ResNet-50:
- FP32: 76.1%
- INT8: 75.8% (minimal loss)
For inference, that loss is negligible. For training, INT8 is unacceptable: small gradient updates get rounded away.
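Where the small loss comes from can be shown with a minimal sketch of symmetric per-tensor INT8 quantization. This is one common scheme, assumed here for illustration; the source does not say which scheme any particular NPU uses, and the weight values are made up.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map [-max_abs, max_abs] to [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the INT8 codes."""
    return [x * scale for x in q]


# Made-up example weights, not taken from any real model.
weights = [0.023, -0.517, 0.314, 1.27, -0.999]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)  # [2, -52, 31, 127, -100]
print(f"scale={scale:.4f}, max rounding error={max_error:.4f}")
```

Each value moves by at most half a quantization step (`scale / 2`). That is tolerable for a forward pass, but a gradient update smaller than half a step would round to zero.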
NPU vs GPU: The Design Decision
Choose NPU when:
• Inference only
• Energy budget tight (mobile, edge)
• Single workload (AI)
• Scale (billions of units)
Choose GPU when:
• Training required
• Multiple workloads
• Flexibility needed
• High power budget available
Tomorrow (Day 6): the building block, multiply-accumulate (MAC) units.