AI HARDWARE DEEP DIVE

How AI Chips Work — Inside an NPU

By EcrioniX · Updated Jun 12, 2026

Every time you talk to an on-device assistant, unlock your phone with your face, or get a real-time translation, a specialized piece of silicon does the heavy lifting. On a phone or laptop that is usually not the CPU and not the GPU, but a Neural Processing Unit (NPU), built from the ground up to run AI as fast and cheaply as physics allows. This is how it works.

Diagram: CPU vs GPU vs NPU, what each is built for

| Processor | Design | Best for |
|---|---|---|
| CPU | Few cores (8–32), high clock, branch predictor | General code |
| GPU | Thousands of small cores, FP32/FP16 parallel throughput | AI training, graphics |
| NPU | Systolic array of MAC units, INT8/FP16, huge SRAM | AI inference at low power |
Figure — Three processor types, three different design philosophies. NPUs win at AI inference efficiency by 10–100×.

1. Why a CPU can't run AI fast enough

Running a single forward pass through GPT-4 is estimated to require roughly 1.8 trillion multiply-add operations. A high-end laptop CPU performs about 500 billion FP32 operations per second. Even at peak throughput that works out to about 3.6 seconds per token, and real workloads rarely reach peak. That is completely unusable.
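The arithmetic behind that estimate can be sketched in a few lines. The two constants are the figures quoted above (estimates, not measurements), and the function name `seconds_per_token` is made up for this illustration. The second call shows how the answer changes if a multiply-add is counted as two operations rather than one.

```python
def seconds_per_token(macs_per_token, ops_per_second, ops_per_mac=1):
    """Time for one forward pass if the processor sustained its peak rate."""
    return macs_per_token * ops_per_mac / ops_per_second


# Figures quoted in the text (estimates, not measurements).
MACS_PER_TOKEN = 1.8e12      # multiply-adds per forward pass
CPU_OPS_PER_SECOND = 500e9   # FP32 operations per second

# Best case: one multiply-add counts as a single operation.
best_case = seconds_per_token(MACS_PER_TOKEN, CPU_OPS_PER_SECOND)

# If a multiply-add counts as two operations (one multiply, one add),
# the estimate doubles.
two_op_case = seconds_per_token(MACS_PER_TOKEN, CPU_OPS_PER_SECOND, ops_per_mac=2)

print(f"{best_case:.1f} s to {two_op_case:.1f} s per token")
```

Either way the result is seconds per token, which is why peak CPU throughput alone rules out interactive use.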

But the deeper problem isn't raw speed. CPUs are designed for unpredictable, branchy code — the kind humans write. They have huge out-of-order execution engines, branch predictors, and caches all designed to handle code that jumps around. Neural networks are the exact opposite: perfectly regular, embarrassingly parallel matrix multiplications with zero branches. A CPU wastes most of its transistors on machinery that AI simply never uses.

GPUs solved this partially — thousands of simple cores work well for AI training. But for inference on the edge (your phone, your laptop), GPUs are too power-hungry. That's the gap NPUs fill: all the compute you need, a fraction of the power.

⚡ The numbers: Apple's A17 Pro Neural Engine is rated at 35 TOPS (trillion operations per second) for AI inference. The CPU cores on the same chip manage perhaps 0.5 TOPS. That gives the NPU roughly 70× the throughput on AI math, while using far less power.

2. What a neural network actually does (in hardware terms)

Strip away the theory. At the silicon level, running a neural network is almost entirely matrix multiplication: multiply an input vector by a weight matrix, add a bias, apply an activation function, repeat hundreds of times.

Each layer of a transformer or CNN is: output = activation(W × input + b). The weight matrix W can be millions of numbers. You need to multiply every input value by every weight and accumulate the results — that's a Multiply-Accumulate (MAC) operation, and you need to do billions of them per inference.

Diagram: matrix multiplication, the core of every neural network layer. A 4-element input vector (x₀ to x₃) is multiplied by a 4×4 weight matrix (w₀₀ to w₃₃; real models hold billions of weights) to give a 4-element output (y₀ to y₃).

Each output is a MAC chain: y₀ = x₀×w₀₀ + x₁×w₀₁ + x₂×w₀₂ + x₃×w₀₃ (multiply, then accumulate, then multiply, then accumulate, and so on). This is all an NPU does, but billions of times per second.

MAC = Multiply-Accumulate: acc += A × B
Figure — Every neural network layer is matrix multiplication. Each output element is a chain of MAC (Multiply-Accumulate) operations across the input and weight row/column.
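The layer formula output = activation(W × input + b) can be written out as explicit MAC loops. This is a minimal sketch in plain Python, assuming ReLU as the activation function (the article does not name one); `dense_layer` and the small example matrix are hypothetical and only illustrate the counting.

```python
def dense_layer(weights, inputs, bias):
    """Compute relu(W x + b) using explicit multiply-accumulate steps.

    Returns the output vector and the number of MAC operations performed.
    """
    outputs = []
    mac_count = 0
    for row, b in zip(weights, bias):
        acc = b                      # accumulator starts at the bias
        for w, x in zip(row, inputs):
            acc += w * x             # one MAC: acc += A * B
            mac_count += 1
        outputs.append(max(0.0, acc))  # ReLU activation (assumed)
    return outputs, mac_count


W = [[1.0, 0.0, -1.0, 2.0],
     [0.5, 0.5, 0.5, 0.5],
     [-1.0, -1.0, -1.0, -1.0],
     [0.0, 2.0, 0.0, -2.0]]
x = [1.0, 2.0, 3.0, 4.0]
b = [0.0, 1.0, 0.0, 0.5]

y, macs = dense_layer(W, x, b)
print(y, macs)
```

A 4×4 layer costs 16 MACs. A layer with a 4096×4096 weight matrix costs about 16.8 million, and a model stacks many such layers, which is where the billions per inference come from.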

3. The systolic array — the heart of every NPU

A systolic array is the cleverest piece of hardware in modern AI chips. It's a 2D grid of MAC units that passes data rhythmically from cell to cell (like a heartbeat, hence "systolic"), so every weight gets reused many times without being re-fetched from memory. This is the key: data reuse matters more than raw compute speed, because memory bandwidth, not arithmetic, is usually the bottleneck.

Diagram: a 4×4 systolic array and how matrix multiplication flows through it. Weights are preloaded in each PE, inputs flow in from the left, each PE computes acc += a×b, and results leave as out₀ to out₃.

Why systolic arrays win:
• Input flows right, reused by every PE in its row
• Weights stay in each PE, so there is no re-fetching
• N² MACs with N² PEs = maximum data reuse
Figure — A 4×4 systolic array. Input values flow rightward; weights are pre-loaded in each Processing Element (PE). Every PE does one MAC per clock cycle. Google's first-generation TPU used a 256×256 systolic array: 65,536 MAC units firing simultaneously.
💡 The assembly line analogy

A systolic array is like a perfectly optimised factory assembly line. Each worker (PE) has one job: multiply two numbers and add to a running total. The input slides down the belt; the worker grabs it, does their step, passes it on. Every worker is busy every clock cycle. No one is waiting for a manager to decide what to do next — zero control flow overhead.
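To make the dataflow concrete, here is a small cycle-by-cycle simulation of a weight-stationary systolic array. It is a sketch under stated assumptions, not a model of any specific chip: weights sit still in the PEs, activations move one PE to the right per cycle, partial sums move one PE down per cycle, and each row's input is delayed by one extra cycle (the usual skew). The names `systolic_matmul` and `reference_matmul` are hypothetical, and `W[k][n]` is indexed as (input index, output index).

```python
def systolic_matmul(X, W):
    """Simulate Y = X @ W on a weight-stationary K x N grid of PEs.

    X is T x K (one input vector per row); W is K x N and stays in the PEs.
    Returns the result and the number of clock cycles simulated.
    """
    T, K, N = len(X), len(W), len(W[0])
    a_reg = [[0] * N for _ in range(K)]  # activation latched in each PE (moves right)
    p_reg = [[0] * N for _ in range(K)]  # partial sum latched in each PE (moves down)
    Y = [[0] * N for _ in range(T)]
    cycles = T + K + N - 2
    for c in range(cycles):
        new_a = [[0] * N for _ in range(K)]
        new_p = [[0] * N for _ in range(K)]
        for k in range(K):
            for n in range(N):
                if n == 0:
                    t = c - k  # row k is fed k cycles late (skewed input)
                    a_in = X[t][k] if 0 <= t < T else 0
                else:
                    a_in = a_reg[k][n - 1]       # from the left neighbour
                p_in = p_reg[k - 1][n] if k > 0 else 0  # from the PE above
                new_a[k][n] = a_in
                new_p[k][n] = p_in + a_in * W[k][n]     # one MAC per PE per cycle
        a_reg, p_reg = new_a, new_p
        # Finished sums leave the bottom row; column n lags column 0 by n cycles.
        for n in range(N):
            t = c - (K - 1) - n
            if 0 <= t < T:
                Y[t][n] = p_reg[K - 1][n]
    return Y, cycles


def reference_matmul(X, W):
    """Plain triple-loop matrix multiply used to check the simulation."""
    return [[sum(x[k] * W[k][n] for k in range(len(W))) for n in range(len(W[0]))]
            for x in X]


W = [[1, 0, 2, -1],
     [0, 1, 3, 2],
     [2, -1, 0, 1],
     [1, 1, 1, 0]]
X = [[1, 2, 3, 4],
     [0, -1, 2, 5],
     [3, 3, 0, 1]]

Y, cycles = systolic_matmul(X, W)
print(Y, "in", cycles, "cycles")
```

Each weight is loaded once and then used for every input vector that streams past it. Only the activations and the finished sums cross the edge of the grid, which is the data reuse the section describes.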

4. Quantization — how weights shrink by 4×

Training a large model typically produces weights at FP32 precision: 32 bits per number. Running inference at FP32 is wasteful. With calibration, the model is usually nearly as accurate at INT8 (8-bit integer), with 4× smaller weights and MAC operations that can run up to about 4× faster. This is quantization.

Diagram: quantization, shrinking weights without losing accuracy
• FP32 (32 bits/weight): sign(1) + exp(8) + mantissa(23); 4 bytes, about 7 decimal digits
• INT8 (8 bits/weight): integer from -128 to +127; 1 byte, scale factor stored separately
• INT8 wins: 4× more weights in SRAM, 4× faster MAC, 4× less power
Figure — Quantization maps FP32 weights to INT8. Accuracy loss is typically <1% with calibration. Most on-device NPUs run INT8 or INT4.
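One common way to do the FP32-to-INT8 mapping is symmetric per-tensor quantization, sketched below. It is one scheme among several and not necessarily what any particular NPU toolchain uses. The function names and sample weights are hypothetical, and the range is clamped to [-127, 127] so that it stays symmetric around zero.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # stored separately from the weights
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floating-point weights from INT8 values."""
    return [scale * v for v in q]


weights = [0.42, -1.27, 0.05, 0.9, -0.33]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print("scale:", scale)
print("largest round-trip error:", max_error)
```

The round-trip error of each weight is at most half a quantization step (scale / 2). Calibration in real toolchains is largely about choosing scales so that this error lands where the model is least sensitive.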

Modern techniques go even further: INT4 (4 bits) and GPTQ quantization can compress large language models to run on laptop NPUs with under 1% quality degradation. The entire 7-billion-parameter Llama model — quantized to INT4 — fits in 4 GB of RAM and runs at real-time speed on an Apple M-series chip.
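The 4 GB figure follows directly from the parameter count and the bits per weight. A quick check, counting weights only (activations, the KV cache, and per-group scale factors add to this in practice); `weight_footprint_gb` is a name made up for the example.

```python
def weight_footprint_gb(num_params, bits_per_weight):
    """Storage for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


LLAMA_7B_PARAMS = 7e9  # parameter count quoted in the text

for name, bits in (("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)):
    print(f"{name}: {weight_footprint_gb(LLAMA_7B_PARAMS, bits):.1f} GB")
```

At INT4 the weights take 3.5 GB, which is why the model fits in 4 GB, while the FP32 version would need 28 GB.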

5. The NPU memory hierarchy

The biggest constraint in AI inference is usually not compute but memory bandwidth. A 70-billion-parameter model holds about 140 GB of weights at FP16 (35 GB at INT4), and generating each token means streaming essentially all of them from memory to the MAC units. NPUs tackle this with a carefully designed on-chip memory hierarchy:

Diagram: NPU memory hierarchy, speed vs capacity (fastest first, largest last)
• Register file: 256 KB
• L1 SRAM (weights buffer): 4–32 MB
• Unified SRAM buffer: 32–256 MB (Apple: 192 MB)
• HBM / LPDDR5 DRAM: GBs (shared with CPU/GPU)
Figure — NPU memory hierarchy. Data must climb up to the register file to be computed. Keeping weights in on-chip SRAM eliminates expensive off-chip DRAM access.

The goal is to keep as many weights as possible in on-chip SRAM and avoid fetching from off-chip DRAM. Apple's Unified Memory Architecture — where CPU, GPU, and Neural Engine share one pool of LPDDR5 — is a major advantage: no copying between separate memory pools.
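A rough way to see why bandwidth dominates: if every weight has to be read from DRAM once per generated token, token rate cannot exceed bandwidth divided by model size. The sketch below applies that bound. The 100 GB/s bandwidth is an assumed round number for illustration, not the specification of any chip, and `max_tokens_per_second` is a hypothetical helper.

```python
def max_tokens_per_second(num_params, bits_per_weight, bandwidth_bytes_per_s):
    """Upper bound on token rate when all weights are read once per token."""
    model_bytes = num_params * bits_per_weight / 8
    return bandwidth_bytes_per_s / model_bytes


ASSUMED_BANDWIDTH = 100e9  # 100 GB/s, an illustrative value only

for params, label in ((7e9, "7B"), (70e9, "70B")):
    for bits in (16, 8, 4):
        rate = max_tokens_per_second(params, bits, ASSUMED_BANDWIDTH)
        print(f"{label} model at {bits}-bit: at most {rate:.1f} tokens/s")
```

Halving the bits per weight doubles the bound, and any weights that stay resident in on-chip SRAM do not count against DRAM bandwidth at all. Both effects are what the hierarchy above is designed to exploit.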

6. Real AI chips — who builds what

Apple
Neural Engine (A18 Pro)
35 TOPS · 3nm process · integrated in SoC with CPU + GPU + ISP · powers Face ID, Siri, on-device LLM
Qualcomm
Hexagon NPU (Snapdragon 8 Elite)
45 TOPS · 3nm · on-device Llama inference · image enhancement · powers most Android flagships
Google
TPU v5e
197 TFLOPS BF16 (394 TOPS INT8) · 128×128 systolic arrays · serves large-scale inference in Google's data centers
NVIDIA
Tensor Core (H100)
3,958 TOPS INT8 with sparsity (about 1,979 dense) · 4th-gen Tensor Cores · 80 GB HBM3 · widely used to train and serve large language models
MediaTek
NPU 890 (Dimensity 9400)
50 TOPS · 3nm · on-device AI for flagship Android · powers camera AI and LLM chat
Intel
NPU 4 (Core Ultra 200V)
48 TOPS · laptop AI PC · powers Windows Copilot+ features locally on x86 laptops

7. The full inference pipeline

When you ask your phone's on-device AI model a question, here's what happens in hardware:

| Step | What happens | Where it runs |
|---|---|---|
| 1. Tokenize | Convert text to integer token IDs | CPU (fast, sequential) |
| 2. Embedding lookup | Map each token to a 4096-dim vector from a lookup table | NPU unified buffer |
| 3. Attention | Query × Key matrix multiply (most compute-intensive) | NPU systolic array (INT8) |
| 4. MLP layers | Two dense matrix multiplications per transformer layer | NPU systolic array (INT8) |
| 5. Softmax / RoPE | Special functions (element-wise) | NPU vector unit or CPU |
| 6. Sampling | Pick the next token from the probability distribution | CPU |
| 7. Decode token | Convert token ID back to text | CPU |

Steps 3 and 4 are where 95%+ of the time is spent — and exactly what the systolic array is built for.
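The attention step and its softmax can be sketched for a single query vector as below. This is the standard scaled dot-product form written in plain Python, not the code any particular NPU runs. It uses one head, no masking, and floating point rather than INT8, and the function names are chosen for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(query)
    # Query x Key: one dot product (a MAC chain) per key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)  # the special-function step
    # Weighted sum of the value vectors: more MAC chains.
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return out, weights


query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0]]      # identical keys give equal attention weights
values = [[2.0, 10.0], [4.0, 20.0]]

out, weights = attention(query, keys, values)
print(weights, out)
```

The dot products and the weighted sum are exactly the MAC chains a systolic array accelerates. The exponentials and the division in `softmax` are the part that falls to a vector unit or the CPU.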

8. Training vs inference — two different problems

Everything above is about inference — running a finished model. Training is a completely different beast:

| | Training | Inference |
|---|---|---|
| Goal | Adjust weights to minimise loss | Run model on new input |
| Precision | FP32 / BF16 (high precision needed) | INT8 / FP16 (low precision fine) |
| Hardware | NVIDIA H100/H200 clusters | NPUs, edge chips |
| Scale | Months, thousands of GPUs | Milliseconds, one chip |
| Cost | $50M–$500M+ per frontier model | Fraction of a cent per query |

When Anthropic trains Claude or OpenAI trains GPT-5, they run thousands of accelerators such as NVIDIA H100 GPUs for months. When you use the model, a single NPU or a small GPU cluster handles your query, producing each token in milliseconds.

The big picture

AI chips are not magic — they are extremely optimised matrix multiplication machines. The entire "intelligence" of a model lives in billions of learned weights stored in memory. An NPU's job is to multiply those weights by input vectors as fast and cheaply as possible. The cleverness is in the systolic array (maximum data reuse), quantization (4× more weights in on-chip memory), and unified memory (no copying bottleneck). Everything else is engineering execution on that foundation.


FAQ

What is an NPU?

A Neural Processing Unit — a chip designed specifically for neural network inference. Built around systolic arrays of MAC units running at INT8/FP16 precision with large on-chip SRAM.

What is a systolic array?

A 2D grid of MAC processing elements where data flows rhythmically through the grid. Inputs flow right, weights are pre-loaded; every PE multiplies and accumulates each cycle. Maximum data reuse, minimal memory fetches.

Why can't a CPU run AI fast?

CPUs are designed for branchy, unpredictable code. AI is regular, parallel matrix math, so the CPU spends most of its transistors on hardware AI never uses. NPUs devote most of their silicon to MAC units and the memory that feeds them.

What is quantization?

Reducing weight precision from FP32 to INT8 (or INT4). 4× smaller weights, 4× more fit in SRAM, 4× faster MAC units — with <1% accuracy loss when done correctly.

What is the difference between training and inference?

Training adjusts weights using FP32 on GPU clusters (months, $millions). Inference runs the trained model on new input using INT8 on NPUs (milliseconds, near-free).

How many TOPS does a good NPU have?

As of 2026: phone NPUs range 35–50 TOPS (Apple A18, Qualcomm Snapdragon 8 Elite, MediaTek 9400). Data center chips like NVIDIA H100 reach 3,958 TOPS INT8 with sparsity, or about half that for dense math.
