Every time you talk to ChatGPT, unlock your phone with your face, or get a real-time translation, a specialized piece of silicon does the heavy lifting. Not a CPU. Not a GPU. A Neural Processing Unit (NPU) — built from the ground up to run AI as fast and cheaply as physics allows. This is how it works.
Running a single forward pass through GPT-4 requires roughly 1.8 trillion multiply-add operations. A high-end laptop CPU performs about 500 billion FP32 operations per second. The math tells you it would take several seconds per token — completely unusable.
But the deeper problem isn't raw speed. CPUs are designed for unpredictable, branchy code — the kind humans write. They have huge out-of-order execution engines, branch predictors, and caches all designed to handle code that jumps around. Neural networks are the exact opposite: perfectly regular, embarrassingly parallel matrix multiplications with zero branches. A CPU wastes most of its transistors on machinery that AI simply never uses.
GPUs solved this partially — thousands of simple cores work well for AI training. But for inference on the edge (your phone, your laptop), GPUs are too power-hungry. That's the gap NPUs fill: all the compute you need, a fraction of the power.
Strip away the theory. At the silicon level, running a neural network is almost entirely matrix multiplication: multiply an input vector by a weight matrix, add a bias, apply an activation function, repeat hundreds of times.
Each layer of a transformer or CNN is: output = activation(W × input + b). The weight matrix W can be millions of numbers. You need to multiply every input value by every weight and accumulate the results — that's a Multiply-Accumulate (MAC) operation, and you need to do billions of them per inference.
A systolic array is the cleverest piece of hardware in modern AI chips. It's a 2D grid of MAC units that passes data rhythmically (like a heartbeat — "systolic") from cell to cell, so every weight gets reused many times without re-fetching it from memory. This is the key: data reuse beats raw compute speed, because memory bandwidth is always the bottleneck.
A systolic array is like a perfectly optimised factory assembly line. Each worker (PE) has one job: multiply two numbers and add to a running total. The input slides down the belt; the worker grabs it, does their step, passes it on. Every worker is busy every clock cycle. No one is waiting for a manager to decide what to do next — zero control flow overhead.
Training a large model produces weights at FP32 precision — 32 bits per number. Running inference at FP32 is wasteful: the model is nearly as accurate at INT8 (8-bit integer) with 4× smaller weights and 4× faster MAC operations. This is quantization.
Modern techniques go even further: INT4 (4 bits) and GPTQ quantization can compress large language models to run on laptop NPUs with under 1% quality degradation. The entire 7-billion-parameter Llama model — quantized to INT4 — fits in 4 GB of RAM and runs at real-time speed on an Apple M-series chip.
The biggest constraint in AI inference isn't compute — it's memory bandwidth. Loading a 70-billion parameter model requires trillions of bytes of data to flow from storage to the MAC units. NPUs solve this with a carefully designed on-chip memory hierarchy:
The goal is to keep as many weights as possible in on-chip SRAM and avoid fetching from off-chip DRAM. Apple's Unified Memory Architecture — where CPU, GPU, and Neural Engine share one pool of LPDDR5 — is a major advantage: no copying between separate memory pools.
When you ask your phone's on-device AI model a question, here's what happens in hardware:
| Step | What happens | Where it runs |
|---|---|---|
| 1. Tokenize | Convert text to integer token IDs | CPU (fast, sequential) |
| 2. Embedding lookup | Map each token to a 4096-dim vector from a lookup table | NPU unified buffer |
| 3. Attention | Query × Key matrix multiply (most compute-intensive) | NPU systolic array (INT8) |
| 4. MLP layers | Two dense matrix multiplications per transformer layer | NPU systolic array (INT8) |
| 5. Softmax / RoPE | Special functions (element-wise) | NPU vector unit or CPU |
| 6. Sampling | Pick the next token from the probability distribution | CPU |
| 7. Decode token | Convert token ID back to text | CPU |
Steps 3 and 4 are where 95%+ of the time is spent — and exactly what the systolic array is built for.
Everything above is about inference — running a finished model. Training is a completely different beast:
| Training | Inference | |
|---|---|---|
| Goal | Adjust weights to minimise loss | Run model on new input |
| Precision | FP32 / BF16 (high precision needed) | INT8 / FP16 (low precision fine) |
| Hardware | NVIDIA H100/H200 clusters | NPUs, edge chips |
| Scale | Months, thousands of GPUs | Milliseconds, one chip |
| Cost | $50M–$500M+ per frontier model | Fraction of a cent per query |
When Anthropic trains Claude or OpenAI trains GPT-5, they run thousands of NVIDIA H100 GPUs for months. When you use the model, a single NPU or small GPU cluster handles your query in milliseconds.
AI chips are not magic — they are extremely optimised matrix multiplication machines. The entire "intelligence" of a model lives in billions of learned weights stored in memory. An NPU's job is to multiply those weights by input vectors as fast and cheaply as possible. The cleverness is in the systolic array (maximum data reuse), quantization (4× more weights in on-chip memory), and unified memory (no copying bottleneck). Everything else is engineering execution on that foundation.
A Neural Processing Unit — a chip designed specifically for neural network inference. Built around systolic arrays of MAC units running at INT8/FP16 precision with large on-chip SRAM.
A 2D grid of MAC processing elements where data flows rhythmically through the grid. Inputs flow right, weights are pre-loaded; every PE multiplies and accumulates each cycle. Maximum data reuse, minimal memory fetches.
CPUs are designed for branchy, unpredictable code. AI is regular, parallel matrix math — the CPU wastes most transistors on hardware AI never uses. NPUs dedicate every transistor to MAC operations.
Reducing weight precision from FP32 to INT8 (or INT4). 4× smaller weights, 4× more fit in SRAM, 4× faster MAC units — with <1% accuracy loss when done correctly.
Training adjusts weights using FP32 on GPU clusters (months, $millions). Inference runs the trained model on new input using INT8 on NPUs (milliseconds, near-free).
As of 2026: phone NPUs range 35–50 TOPS (Apple A18, Qualcomm Snapdragon 8 Elite, MediaTek 9400). Data center chips like NVIDIA H100 reach 3,958 TOPS INT8.