The Problem
In 2023, you ran a neural network on your phone. Image recognition, voice transcription, ChatGPT queries: all of them are powered by special-purpose accelerators rather than a general-purpose CPU. Why?
Answer: Matrix multiplication is roughly 97% of the compute in deep learning. A general-purpose CPU is terrible at it. A GPU is better. A custom neural processor is 100-1000× better per watt.
CPU vs GPU vs NPU
CPU (General Purpose)
What it's good at: One instruction at a time. Complex branching. Cache-friendly random access.
For AI: Terrible.
// CPU trying to multiply two 256x256 matrices
for (int i = 0; i < 256; i++) {
    for (int j = 0; j < 256; j++) {
        float sum = 0;
        for (int k = 0; k < 256; k++) {
            sum += A[i][k] * B[k][j]; // One multiply per cycle
        }
        C[i][j] = sum;
    }
}
// Total: 256^3 = 16.7 million cycles
// On a 3 GHz CPU = 5.6 milliseconds
Reality: Modern CPUs do run this faster (out-of-order, pipelining, AVX). Still, you get maybe 50-100 GFLOPS for matrix multiply.
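The arithmetic behind those two estimates can be written out as a minimal sketch. It assumes the same idealization as the loop (one multiply-accumulate per cycle) and counts each multiply-accumulate as two floating-point operations when converting a GFLOPS figure into time. The function names are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply-accumulate (MAC) steps in a naive n x n x n matrix multiply. */
uint64_t matmul_macs(uint64_t n) {
    return n * n * n;
}

/* Milliseconds if the CPU retires exactly one MAC per clock cycle
   (the simplifying assumption used in the loop's comments). */
double naive_ms(uint64_t n, double clock_hz) {
    assert(clock_hz > 0.0);
    return (double)matmul_macs(n) / clock_hz * 1e3;
}

/* Milliseconds at a sustained throughput given in GFLOPS,
   counting each MAC as 2 floating-point operations. */
double sustained_ms(uint64_t n, double gflops) {
    assert(gflops > 0.0);
    return 2.0 * (double)matmul_macs(n) / (gflops * 1e9) * 1e3;
}
```

Under these assumptions, `naive_ms(256, 3e9)` is about 5.6 ms, while `sustained_ms(256, 50.0)` and `sustained_ms(256, 100.0)` come out near 0.67 ms and 0.34 ms.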
GPU (Graphics Processor)
What it's good at: Thousands of identical operations in parallel. Massive memory bandwidth. Designed for matrix math (pixel shaders = SIMD).
For AI: Actually pretty good.
• 65 TFLOPS (FP16 on tensor cores) for matrix multiply
• 300 GB/sec memory bandwidth
• Can run a large neural network in ~10ms
• But uses 70W of power (hot, needs cooling)
NPU (Neural Processing Unit) — Specialized
What it's good at: Only one thing: matrix multiplication with low-precision values (INT8, BF16). Systolic arrays. No cache misses.
For AI: Phenomenal.
• 17 TOPS (INT8) for inference
• Uses only 2W of power
• Can run a full language model in ~50ms
• Roughly 10× more efficient per watt than the GPU above
Matrix Multiplication: The Core Problem
Deep learning is about 97% matrix multiply. Everything else (activation functions, normalization, softmax) is noise.
// A transformer layer is basically:
output = softmax(Q @ K.T) @ V // 2 matrix muls (plus the Q, K, V projections)
output = ffn(output) // feed-forward (MLP) block: 2 matrix muls
// All the rest (softmax, layer norm, etc) = only a few percent of compute
So the question becomes: Can we design a chip that does matrix multiply incredibly fast and efficiently, and nothing else?
Answer: Yes, and it's called a systolic array.
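To make the idea concrete, here is a small software simulation of one common systolic dataflow (output-stationary): each processing element keeps one entry of C, values of A move one step to the right per cycle, values of B move one step down, and every element performs one multiply-accumulate per cycle. This is a sketch under stated assumptions, not the design of any particular chip; the array size `N` and the function name `systolic_matmul` are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define N 4  /* hypothetical array size: N x N processing elements (PEs) */

/* Simulates C = A * B (row-major N x N ints) on an output-stationary
   systolic array and returns the number of clock cycles used. */
int systolic_matmul(const int *A, const int *B, int *C) {
    int a_reg[N][N] = {{0}};  /* A operand currently latched in each PE */
    int b_reg[N][N] = {{0}};  /* B operand currently latched in each PE */
    const int cycles = 3 * N - 2;  /* PE (N-1, N-1) gets its last operands at t = 3N - 3 */

    assert(A && B && C);
    memset(C, 0, sizeof(int) * N * N);

    for (int t = 0; t < cycles; t++) {
        /* Shift phase: walk from the bottom-right corner so every PE
           reads its neighbour's value from the previous cycle. */
        for (int i = N - 1; i >= 0; i--) {
            for (int j = N - 1; j >= 0; j--) {
                int ka = t - i;  /* column of A entering row i (row i is delayed by i cycles) */
                int kb = t - j;  /* row of B entering column j (column j is delayed by j cycles) */
                if (j > 0) a_reg[i][j] = a_reg[i][j - 1];
                else       a_reg[i][j] = (ka >= 0 && ka < N) ? A[i * N + ka] : 0;
                if (i > 0) b_reg[i][j] = b_reg[i - 1][j];
                else       b_reg[i][j] = (kb >= 0 && kb < N) ? B[kb * N + j] : 0;
            }
        }
        /* Compute phase: all N*N PEs do one MAC in the same cycle. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i * N + j] += a_reg[i][j] * b_reg[i][j];
    }
    return cycles;
}
```

For N = 4 the simulated array finishes all 64 multiply-accumulates in 3N - 2 = 10 cycles, versus 64 cycles at one multiply-accumulate per cycle. The operands are never fetched by address inside the array; they simply arrive from a neighbour.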
Energy Efficiency & Performance
Why a custom chip wins:
- ✅ No instruction fetch/decode: One operation: MAC (multiply-accumulate)
- ✅ No cache hierarchy: Data flows directly through systolic array
- ✅ No branch prediction: No branches at all
- ✅ Quantization: INT8 instead of FP32 = 4× less data to move, and far cheaper arithmetic per operation
- ✅ Dataflow architecture: Data streams, not stored in memory
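The quantization point can be illustrated with a minimal sketch of symmetric per-tensor INT8 quantization, which is one simple scheme among several and is assumed here only for illustration. Values are scaled into [-127, 127], multiplied as 8-bit integers, accumulated in a 32-bit integer, and rescaled to a float at the end. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scale that maps the largest magnitude in x onto the int8 value 127. */
float int8_scale(const float *x, size_t n) {
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > max_abs) max_abs = a;
    }
    return max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
}

/* q[i] = round(x[i] / scale), clamped to [-127, 127]. */
void quantize_int8(const float *x, int8_t *q, size_t n, float scale) {
    assert(scale > 0.0f);
    for (size_t i = 0; i < n; i++) {
        float v = x[i] / scale;
        int r = (int)(v + (v >= 0.0f ? 0.5f : -0.5f));  /* round half away from zero */
        if (r > 127) r = 127;
        if (r < -127) r = -127;
        q[i] = (int8_t)r;
    }
}

/* Dot product done entirely in integers: int8 x int8 products are
   accumulated in int32, and the two scales are applied once at the end. */
float dot_int8(const int8_t *a, const int8_t *b, size_t n,
               float scale_a, float scale_b) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return (float)acc * scale_a * scale_b;
}
```

Each stored value shrinks from 4 bytes to 1, and the inner loop is a pure integer multiply-accumulate. The cost is a small rounding error in the result, which inference workloads usually tolerate.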
Performance Per Watt
| Processor | Peak TFLOPS (TOPS for INT8) | Power | Efficiency (TFLOPS/W) | Use Case |
|---|---|---|---|---|
| Intel Xeon (CPU) | 0.1 | 200W | 0.0005 | General compute |
| RTX 4090 (GPU) | 650 | 450W | 1.4 | Gaming + AI (overkill) |
| Google TPU v4 | 430 | 150W | 2.9 | Data center AI |
| Apple Neural Engine | 17 | 2W | 8.5 | Mobile inference |
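The efficiency column is simply peak throughput divided by power. The sketch below recomputes it from the table's own figures, which mix precisions and are peak numbers, so the ratios are rough guides rather than measurements. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the table; the numbers are copied from it unchanged. */
struct chip {
    const char *name;
    double peak_tera_ops;  /* TFLOPS, or TOPS for INT8 parts */
    double watts;
};

const struct chip CHIPS[] = {
    {"Intel Xeon (CPU)",      0.1, 200.0},
    {"RTX 4090 (GPU)",      650.0, 450.0},
    {"Google TPU v4",       430.0, 150.0},
    {"Apple Neural Engine",  17.0,   2.0},
};
const size_t NUM_CHIPS = sizeof CHIPS / sizeof CHIPS[0];

/* Peak tera-operations per second delivered per watt. */
double efficiency(const struct chip *c) {
    assert(c->watts > 0.0);
    return c->peak_tera_ops / c->watts;
}

/* How many times more efficient chip a is than chip b. */
double efficiency_ratio(const struct chip *a, const struct chip *b) {
    return efficiency(a) / efficiency(b);
}
```

With these figures the Neural Engine row works out to 8.5 per watt and the RTX 4090 row to about 1.4, so `efficiency_ratio(&CHIPS[3], &CHIPS[1])` is roughly 6.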
Real Examples in Your Pocket
iPhone 15 Pro: A17 Pro chip. Neural Engine inside.
- Camera: Real-time portrait mode (AI segmentation)
- Siri: On-device voice recognition
- Photos: Smart search, object detection
- Live Translate: On-device language models
All running on 2 watts. Try that on a GPU.
Key Takeaways
- ✅ AI is matrix multiplication. 97% of the compute.
- ✅ General-purpose CPUs are slow. Designed for flexibility, not speed at one task.
- ✅ GPUs are better but power-hungry. Still fetch instructions, manage caches, handle general compute.
- ✅ Custom NPUs dominate on efficiency. Do one thing perfectly: MAC operations.
- ✅ The tradeoff: NPUs can't run general code. But for AI inference, they're unbeatable.
Tomorrow (Day 2): The problem in detail—why general-purpose chips will never win at AI.