A Neural Processing Unit (NPU) is a specialized processor designed specifically to accelerate the matrix multiplications and convolutions that make up neural network inference. Unlike CPUs (which optimize for single-thread latency) or GPUs (which optimize for parallel floating-point throughput), NPUs are optimized for the specific arithmetic patterns of AI — enormous numbers of multiply-accumulate (MAC) operations at low precision (INT8, FP16) with minimal data movement.

Why can't CPUs run AI well?

Running a single prompt through a large language model requires hundreds of billions of multiply-add operations. A modern CPU can execute perhaps 1-2 trillion FP32 operations per second, but it was designed for branchy, unpredictable code with complex control flow. AI workloads are the opposite: highly regular, predictable, embarrassingly parallel matrix operations. NPUs achieve 10-100x better performance per watt on these patterns because nearly all of their silicon is dedicated to MAC units and the memory that feeds them.

What is quantization in AI?

Quantization reduces the numerical precision of model weights and activations. A weight trained at FP32 (32-bit float) can be compressed to INT8 (8-bit integer) with minimal accuracy loss. This quadruples the number of weights that fit in on-chip SRAM, multiplies throughput (INT8 MACs are cheaper to build than FP32), and reduces power consumption. Most on-device NPUs run INT8 or INT4 quantized models.

AI HARDWARE DEEP DIVE

How AI Chips Work — Inside an NPU

Q: What is a systolic array?

A systolic array is a grid of processing elements (PEs) that passes data rhythmically from one cell to the next like blood pumping through a heart. In a matrix multiplication context, input values flow from the left and weight values flow from the top; each PE multiplies two numbers and accumulates the result. The key insight is that data is reused — each value passes through many PEs before leaving, dramatically reducing the memory bandwidth needed.

Q: What is the difference between training and inference?

Training is the process of adjusting billions of model weights to minimize prediction error. It requires higher-precision arithmetic (FP32/BF16), enormous datasets, and massive compute clusters (for example, NVIDIA H100 GPUs). Inference is running the trained model on new input data. It can use lower precision (INT8/FP16), can run on the edge (phones, laptops), and is what NPUs are optimized for. Most people interact only with inference: every time you ask ChatGPT a question or unlock your phone with Face ID.

By EcrioniX · Updated Jun 12, 2026

Every time you talk to ChatGPT, unlock your phone with your face, or get a real-time translation, a specialized piece of silicon does the heavy lifting. Not a CPU. Not a GPU. A Neural Processing Unit (NPU) — built from the ground up to run AI as fast and cheaply as physics allows. This is how it works.

Figure — Three processor types, three different design philosophies. NPUs win at AI inference efficiency by 10–100×.

1. Why a CPU can't run AI fast enough

Running a single forward pass through a GPT-4-class model takes roughly 1.8 trillion multiply-add operations (an estimate; OpenAI has not published the figure). A high-end laptop CPU performs about 500 billion FP32 operations per second. Divide one by the other and you get about 3.6 seconds per token, which is completely unusable.
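The arithmetic behind that claim is a single division. Here is a minimal sketch using the two figures quoted above; the function and constant names are made up for illustration, and the result is a best case that assumes the CPU sustains its peak rate.

```python
def seconds_per_token(ops_per_token: float, ops_per_second: float) -> float:
    """Best-case time for one forward pass at perfect hardware utilisation."""
    return ops_per_token / ops_per_second

# Figures quoted in the text above.
OPS_PER_FORWARD_PASS = 1.8e12   # multiply-add operations per token
CPU_OPS_PER_SECOND = 500e9      # high-end laptop CPU, FP32

cpu_seconds = seconds_per_token(OPS_PER_FORWARD_PASS, CPU_OPS_PER_SECOND)
print(f"CPU: {cpu_seconds:.1f} s per token")   # CPU: 3.6 s per token
```

Real CPUs rarely sustain peak throughput on large matrices, so the practical number would likely be worse.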

But the deeper problem isn't raw speed. CPUs are designed for unpredictable, branchy code — the kind humans write. They have huge out-of-order execution engines, branch predictors, and caches all designed to handle code that jumps around. Neural networks are the exact opposite: perfectly regular, embarrassingly parallel matrix multiplications with zero branches. A CPU wastes most of its transistors on machinery that AI simply never uses.

GPUs solved this partially — thousands of simple cores work well for AI training. But for inference on the edge (your phone, your laptop), GPUs are too power-hungry. That's the gap NPUs fill: all the compute you need, a fraction of the power.

⚡ The numbers: Apple's A17 Pro Neural Engine is rated at 35 TOPS (trillion operations per second) for AI inference. The CPU cores on the same chip do perhaps 0.5 TOPS. That gives the NPU roughly 70× the throughput on AI math, while using far less power.

2. What a neural network actually does (in hardware terms)

Strip away the theory. At the silicon level, running a neural network is almost entirely matrix multiplication: multiply an input vector by a weight matrix, add a bias, apply an activation function, repeat hundreds of times.

Each layer of a transformer or CNN is: output = activation(W × input + b). The weight matrix W can be millions of numbers. You need to multiply every input value by every weight and accumulate the results — that's a Multiply-Accumulate (MAC) operation, and you need to do billions of them per inference.
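To make the MAC idea concrete, here is a small sketch of one dense layer written as explicit loops, with a counter for the multiply-accumulates. ReLU is an assumed choice of activation (the text does not name one), and `dense_layer` and the sample values are illustrative only.

```python
def relu(x: float) -> float:
    return x if x > 0.0 else 0.0

def dense_layer(weights, inputs, bias):
    """Compute activation(W x input + b) and count the MACs performed."""
    outputs = []
    mac_count = 0
    for row, b in zip(weights, bias):
        acc = b                      # accumulator starts at the bias
        for w, x in zip(row, inputs):
            acc += w * x             # one multiply-accumulate (MAC)
            mac_count += 1
        outputs.append(relu(acc))
    return outputs, mac_count

W = [[1.0, -2.0, 0.5],
     [0.0, 1.0, 1.0]]
x = [2.0, 1.0, 4.0]
b = [0.5, -10.0]

y, macs = dense_layer(W, x, b)
print(y, macs)   # [2.5, 0.0] 6
```

A 2×3 weight matrix costs 6 MACs. Scale the matrix to 4096×4096 and a single layer already needs about 16.8 million.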

Figure — Every neural network layer is matrix multiplication. Each output element is a chain of MAC (Multiply-Accumulate) operations across the input and weight row/column.

3. The systolic array — the heart of every NPU

A systolic array is the cleverest piece of hardware in modern AI chips. It's a 2D grid of MAC units that passes data rhythmically (like a heartbeat, hence "systolic") from cell to cell, so every weight gets reused many times without re-fetching it from memory. This is the key: data reuse beats raw compute speed, because memory bandwidth is usually the bottleneck.

Figure — A 4×4 systolic array. Input values flow rightward; weights are pre-loaded in each Processing Element (PE). Every PE does one MAC per clock cycle. Google's first-generation TPU uses a 256×256 systolic array: 65,536 MAC units firing simultaneously.

💡 The assembly line analogy

A systolic array is like a perfectly optimised factory assembly line. Each worker (PE) has one job: multiply two numbers and add to a running total. The input slides down the belt; the worker grabs it, does their step, passes it on. Every worker is busy every clock cycle. No one is waiting for a manager to decide what to do next — zero control flow overhead.
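The dataflow described above can be simulated in a few lines. This is a simplified cycle-by-cycle model, assuming a weight-stationary design in which inputs move right and partial sums move down; real NPUs differ in the details, and `systolic_matmul` is a hypothetical name.

```python
def systolic_matmul(X, W):
    """Cycle-by-cycle model of a weight-stationary systolic array computing X @ W."""
    T, R, C = len(X), len(W), len(W[0])
    a_reg = [[0] * C for _ in range(R)]   # inputs latched by each PE, moving right
    p_reg = [[0] * C for _ in range(R)]   # partial sums latched by each PE, moving down
    Y = [[0] * C for _ in range(T)]
    total_cycles = T + R + C - 2
    for cycle in range(total_cycles):
        new_a = [[0] * C for _ in range(R)]
        new_p = [[0] * C for _ in range(R)]
        for r in range(R):
            for c in range(C):
                if c == 0:
                    t = cycle - r                     # row r is fed r cycles late (skew)
                    a_in = X[t][r] if 0 <= t < T else 0
                else:
                    a_in = a_reg[r][c - 1]            # value handed over by the left neighbour
                p_in = p_reg[r - 1][c] if r > 0 else 0
                new_p[r][c] = p_in + a_in * W[r][c]   # the PE's one MAC this cycle
                new_a[r][c] = a_in
        a_reg, p_reg = new_a, new_p
        for c in range(C):
            t = cycle - (R - 1) - c                   # which input row finishes in column c
            if 0 <= t < T:
                Y[t][c] = p_reg[R - 1][c]
    return Y, total_cycles

result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, cycles)   # [[19, 22], [43, 50]] 4
```

Each weight is read once and then stays put, while each input value is used by every PE in its row. That is the data reuse that saves memory bandwidth.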

4. Quantization — how weights shrink by 4×

Large models are typically trained at FP32 or BF16 precision, up to 32 bits per number. Running inference at FP32 is wasteful: with calibration, the model is nearly as accurate at INT8 (8-bit integer), with 4× smaller weights and MAC units that are far cheaper in area and power. This is quantization.

Figure — Quantization maps FP32 weights to INT8. Accuracy loss is typically <1% with calibration. Most on-device NPUs run INT8 or INT4.
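Here is a minimal sketch of the FP32-to-INT8 mapping, assuming the simplest scheme: symmetric, with one scale for the whole tensor. Production toolchains usually add calibration data, per-channel scales, or zero points, none of which is shown here, and the function names are illustrative.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: float values -> INT8 codes plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float values from the INT8 codes."""
    return [q * scale for q in codes]

weights = [0.50, -1.27, 0.03, 1.00]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)   # [50, -127, 3, 100]
```

Each weight now occupies one byte instead of four, and the rounding error per weight is at most half of `scale`.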

Modern techniques go even further: 4-bit formats (INT4), produced by methods such as GPTQ, can compress large language models enough to run on laptop-class hardware with only a small quality loss. A 7-billion-parameter Llama model quantized to INT4 needs about 3.5 GB for its weights, so it fits in 4 GB of RAM and can generate text at interactive speed on an Apple M-series chip.
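The 4 GB figure follows from parameter count times bits per weight. This sketch counts weight storage only; quantization scales, activations, and the KV cache are left out, and `weight_footprint_gb` is a hypothetical helper.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    """Storage for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

PARAMS_7B = 7e9
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_footprint_gb(PARAMS_7B, bits):.1f} GB")
# FP32: 28.0 GB
# FP16: 14.0 GB
# INT8: 7.0 GB
# INT4: 3.5 GB
```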

5. The NPU memory hierarchy

The biggest constraint in AI inference is usually not compute but memory bandwidth. A 70-billion-parameter model holds about 140 GB of weights at FP16 (70 GB at INT8), and generating each new token means streaming essentially all of them to the MAC units. NPUs address this with a carefully designed on-chip memory hierarchy:

Figure — NPU memory hierarchy. Data must climb up to the register file to be computed. Keeping weights in on-chip SRAM eliminates expensive off-chip DRAM access.

The goal is to keep as many weights as possible in on-chip SRAM and avoid fetching from off-chip DRAM. Apple's Unified Memory Architecture — where CPU, GPU, and Neural Engine share one pool of LPDDR5 — is a major advantage: no copying between separate memory pools.
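A rough way to see why bandwidth dominates: if every weight has to be read once per generated token, memory bandwidth caps the token rate regardless of how many MAC units exist. The sketch below applies that bound to a 70-billion-parameter model at INT8. The 100 GB/s bandwidth is a hypothetical round number, not a spec for any particular chip.

```python
def max_tokens_per_second(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed when every weight is read once per token."""
    return bandwidth_bytes_per_s / weight_bytes

WEIGHT_BYTES_70B_INT8 = 70e9        # 70 billion parameters at 1 byte each
ASSUMED_DRAM_BANDWIDTH = 100e9      # hypothetical 100 GB/s, for illustration only

bound = max_tokens_per_second(WEIGHT_BYTES_70B_INT8, ASSUMED_DRAM_BANDWIDTH)
print(f"at most {bound:.2f} tokens/s")   # at most 1.43 tokens/s
```

Halving the bytes per weight doubles this ceiling, which is one reason quantization and on-chip SRAM matter as much as raw TOPS.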

6. Real AI chips — who builds what

Apple

Neural Engine (A18 Pro)

35 TOPS · 3nm process · integrated in SoC with CPU + GPU + ISP · powers Face ID, Siri, on-device LLM

Qualcomm

Hexagon NPU (Snapdragon 8 Elite)

45 TOPS · 3nm · on-device Llama inference · image enhancement · powers most Android flagships

Google

TPU v5e

197 TFLOPS BF16 (393 TOPS INT8) · 128×128 systolic arrays · used for large-scale inference across Google services such as Search, Translate, and Gemini

NVIDIA

Tensor Core (H100)

3,958 TOPS INT8 (with sparsity) · 4th-gen Tensor Cores · 80 GB HBM3 · widely used to train and serve frontier models

MediaTek

NPU 890 (Dimensity 9400)

50 TOPS · 3nm · on-device AI for flagship Android phones · powers camera AI and LLM chat

Intel

NPU 4 (Core Ultra 200V)

48 TOPS · laptop AI PC · powers Windows Copilot+ features locally on x86 laptops

7. The full inference pipeline

When you ask your phone's on-device AI model a question, here's what happens in hardware:

Step	What happens	Where it runs
1. Tokenize	Convert text to integer token IDs	CPU (fast, sequential)
2. Embedding lookup	Map each token to a 4096-dim vector from a lookup table	NPU unified buffer
3. Attention	Query × Key and attention × Value matrix multiplies (cost grows with context length)	NPU systolic array (INT8)
4. MLP layers	Two dense matrix multiplications per transformer layer	NPU systolic array (INT8)
5. Softmax / RoPE	Special functions (element-wise)	NPU vector unit or CPU
6. Sampling	Pick the next token from the probability distribution	CPU
7. Decode token	Convert token ID back to text	CPU

Steps 3 and 4 are where 95%+ of the time is spent — and exactly what the systolic array is built for.
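The seven steps can be sketched end to end with a toy model. Everything here is hypothetical: a four-word vocabulary, 2-dimensional embeddings, and a single matrix standing in for all the attention and MLP layers. It shows how the stages hand data to each other and is not how a real runtime is structured.

```python
import math

VOCAB = ["<unk>", "hello", "world", "npu"]                  # hypothetical vocabulary
EMBED = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.5]]    # one 2-dim vector per token
W_LAYER = [[0.0, 1.0], [1.0, 0.0]]                          # stand-in for attention + MLP

def tokenize(text):                       # step 1 (CPU)
    return [VOCAB.index(w) if w in VOCAB else 0 for w in text.lower().split()]

def matvec(matrix, vector):               # steps 3-4: the MAC-heavy part (NPU)
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(logits):                      # step 5: element-wise special function
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token(text):
    token_ids = tokenize(text)
    hidden = EMBED[token_ids[-1]]         # step 2: embedding lookup for the last token
    hidden = matvec(W_LAYER, hidden)      # steps 3-4 collapsed into one matrix multiply
    logits = matvec(EMBED, hidden)        # score every vocabulary entry
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)   # step 6: greedy sampling
    return VOCAB[best]                    # step 7: decode back to text

prediction = next_token("hello")
print(prediction)   # world
```

Only `matvec` would land on the systolic array. The rest is bookkeeping, which is why it can stay on the CPU or a small vector unit.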

8. Training vs inference — two different problems

Everything above is about inference — running a finished model. Training is a completely different beast:

	Training	Inference
Goal	Adjust weights to minimise loss	Run model on new input
Precision	FP32 / BF16 (high precision needed)	INT8 / FP16 (low precision fine)
Hardware	NVIDIA H100/H200 clusters	NPUs, edge chips
Scale	Months, thousands of GPUs	Milliseconds, one chip
Cost	$50M–$500M+ per frontier model	Fraction of a cent per query

When a lab such as Anthropic or OpenAI trains a frontier model, it runs thousands of accelerators (NVIDIA H100-class GPUs, for example) for months. When you use the model, a single NPU or small GPU cluster handles your query in milliseconds.
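The difference shows up even in a one-weight toy model. In the sketch below, training loops over data and updates the weight with gradient descent, while inference is a single multiply with the weight frozen. The data, learning rate, and function names are all made up for illustration.

```python
def predict(weight: float, x: float) -> float:
    """Inference: one forward pass with a fixed weight."""
    return weight * x

def train(samples, learning_rate=0.05, epochs=50):
    """Training: repeated forward pass, gradient computation, and weight update."""
    weight = 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = predict(weight, x) - target
            gradient = 2 * error * x          # derivative of (w*x - target)**2 w.r.t. w
            weight -= learning_rate * gradient
    return weight

samples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]   # toy data following y = 3x
learned = train(samples)
print(round(learned, 3), round(predict(learned, 4.0), 3))   # 3.0 12.0
```

Training ran 150 forward passes plus 150 gradient updates to find one number. Using that number afterwards costs one multiplication.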

The big picture

AI chips are not magic — they are extremely optimised matrix multiplication machines. The entire "intelligence" of a model lives in billions of learned weights stored in memory. An NPU's job is to multiply those weights by input vectors as fast and cheaply as possible. The cleverness is in the systolic array (maximum data reuse), quantization (4× more weights in on-chip memory), and unified memory (no copying bottleneck). Everything else is engineering execution on that foundation.

🎯 Key takeaways

NPU = specialized chip for neural network inference — optimized for MAC operations at low precision and power.
Systolic array — grid of MAC PEs that reuses data flowing through, eliminating memory bandwidth waste.
Quantization — INT8 weights are 4× smaller than FP32; 4× faster MACs; minimal accuracy loss.
Memory hierarchy: keep weights in on-chip SRAM — every off-chip DRAM fetch is a performance and power penalty.
Training = FP32/BF16 on GPU clusters (months, $ millions). Inference = INT8 on NPU (milliseconds, near-free).
Every AI chip — Apple, Qualcomm, Google, NVIDIA — is a variation on these same principles.

FAQ

What is an NPU?

A Neural Processing Unit — a chip designed specifically for neural network inference. Built around systolic arrays of MAC units running at INT8/FP16 precision with large on-chip SRAM.

What is a systolic array?

A 2D grid of MAC processing elements where data flows rhythmically through the grid. Inputs flow right, weights are pre-loaded; every PE multiplies and accumulates each cycle. Maximum data reuse, minimal memory fetches.

Why can't a CPU run AI fast?

CPUs are designed for branchy, unpredictable code. AI is regular, parallel matrix math, so the CPU wastes most of its transistors on hardware AI never uses. NPUs dedicate nearly all of their silicon to MAC units and the memory that feeds them.

What is quantization?

Reducing weight precision from FP32 to INT8 (or INT4). 4× smaller weights, 4× more fit in SRAM, 4× faster MAC units — with <1% accuracy loss when done correctly.

What is the difference between training and inference?

Training adjusts weights using FP32/BF16 on GPU clusters (months, $ millions). Inference runs the trained model on new input using INT8 on NPUs (milliseconds, near-free).

How many TOPS does a good NPU have?

As of 2026, phone NPUs range from roughly 35 to 50 TOPS (Apple A18 Pro, Qualcomm Snapdragon 8 Elite, MediaTek Dimensity 9400). Data center chips like the NVIDIA H100 reach 3,958 TOPS INT8 with sparsity.
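A peak TOPS rating is usually just MAC count times clock speed, with each MAC counted as two operations (one multiply, one add). This sketch applies that to the 65,536-unit array mentioned in the systolic array section; the 700 MHz clock is an assumption for illustration, and `peak_tops` is a hypothetical helper.

```python
def peak_tops(mac_units: int, clock_hz: float) -> float:
    """Peak tera-operations per second, counting each MAC as two operations."""
    return mac_units * 2 * clock_hz / 1e12

MAC_UNITS = 256 * 256          # the 65,536-unit array mentioned earlier
ASSUMED_CLOCK_HZ = 700e6       # assumed 700 MHz clock, for illustration

tops = peak_tops(MAC_UNITS, ASSUMED_CLOCK_HZ)
print(f"{tops:.1f} TOPS peak")   # 91.8 TOPS peak
```

Peak figures assume every MAC unit is busy on every cycle, so sustained throughput on a real model is normally lower.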