Q: Which processor is best for AI inference at the edge?

NPUs are usually the best choice for edge inference: they deliver the best performance-per-watt for quantized neural network inference. Examples include the Apple Neural Engine, Qualcomm Hexagon, and MediaTek APU found in smartphones.

CPU vs GPU vs TPU vs NPU — Architecture, Differences & Use Cases

Q: What is the main difference between CPU and GPU?

A CPU has a few powerful cores (typically 4–192) optimized for low-latency sequential tasks with large caches and branch prediction. A GPU has thousands of smaller cores designed for massively parallel workloads like graphics rendering and matrix math for AI.

Q: What is a TPU and how is it different from a GPU?

A TPU (Tensor Processing Unit) is Google's custom ASIC built around a systolic array for matrix multiply operations. Unlike a GPU which is general-purpose parallel hardware, a TPU is purpose-built for TensorFlow/JAX tensor operations, delivering higher throughput per watt for specific ML workloads.

Q: What is an NPU used for?

An NPU (Neural Processing Unit) is a low-power AI accelerator integrated into SoCs for mobile and edge devices. It handles on-device inference tasks like face recognition, voice commands, and camera AI using INT8/INT4 quantized models, at a fraction of the power a GPU would consume.

Q: Which processor is best for AI training?

GPUs (especially NVIDIA H100/A100) and TPUs dominate AI training because they provide high memory bandwidth and massive parallel compute for matrix operations. CPUs are too slow for large-scale training; NPUs are designed for inference only.

Q: Can a CPU replace a GPU for deep learning?

Not for large-scale training. A CPU can run small neural networks but will be 10–100× slower than a GPU for matrix-heavy workloads like transformer training. For inference of small models, modern CPUs with AVX-512 or AMX extensions can be competitive.

The Four Processor Architectures

Each chip is optimised for a fundamentally different execution model

CPU (Central Processing Unit)

The brain of every computer. Designed for low-latency sequential execution with complex control flow, branch prediction, and deep cache hierarchies. A master of one task at a time.

Core count: 4 – 192
Clock speed: 3 – 6 GHz
Cache (L3): 16 – 256 MB
Memory BW: 50 – 300 GB/s
TDP: 5 – 350 W
Parallelism: MIMD / OoO

Examples: Intel Core Ultra 9, AMD Ryzen 9 9950X, Apple M4 Pro CPU, ARM Cortex-X4

GPU (Graphics Processing Unit)

Thousands of small shader cores executing the same instruction on different data simultaneously. Originally built for pixels, now the workhorse of AI training and scientific simulation.

Core count: 2,000 – 18,432
Clock speed: 1 – 3.5 GHz
VRAM: 8 – 192 GB (GDDR or HBM)
Memory BW: 900 – 3,350 GB/s
TDP: 75 – 1,000 W
Parallelism: SIMT / SIMD

Examples: NVIDIA H100, AMD MI300X, NVIDIA RTX 5090, Apple M4 GPU

TPU (Tensor Processing Unit)

Google's custom ASIC built around a systolic array matrix multiply unit (MXU). Not programmable for arbitrary code: it is purpose-built to execute XLA-compiled tensor graphs (TensorFlow, JAX, and PyTorch via XLA) with extreme efficiency.

MXU size: 128×128 systolic
Precision: bfloat16 / int8
HBM: 16 – 32 GB / chip
Memory BW: 1,200 GB/s (v4)
TDP: ~170 W / chip
Deployment: Google Cloud only (data-centre TPUs)

Examples: Google TPU v4 (pod: 4,096 chips), Google TPU v5p, Edge TPU (Coral)

NPU (Neural Processing Unit)

A power-sipping AI accelerator embedded in a SoC for smartphones, laptops, and IoT devices. Runs quantized neural networks for on-device inference without draining your battery.

TOPS: 10 – 100+ TOPS
Precision: INT4 / INT8
Power: 0.5 – 5 W
Memory: Shared LPDDR5
TDP: < 5 W total SoC
Task: Inference only

Examples: Apple Neural Engine (38 TOPS), Qualcomm Hexagon (75 TOPS), Intel AI Boost (11 TOPS)

Full Spec Comparison

Side-by-side technical breakdown

Feature	CPU	GPU	TPU	NPU
Full Name	Central Processing Unit	Graphics Processing Unit	Tensor Processing Unit	Neural Processing Unit
Core count	4 – 192	2,000 – 18,432	MXU 128×128	MAC arrays (dedicated)
Clock speed	3 – 6 GHz	1 – 3.5 GHz	~1,050 MHz (v4)	Variable (power-gated)
Numeric precision	FP64 / FP32 / INT	FP64 / FP32 / FP16 / BF16 / INT8	BF16 / INT8 (native)	INT8 / INT4 (quantized)
Memory type	DDR5 / LPDDR5	HBM2e / HBM3e	HBM2e	Shared LPDDR5
Memory bandwidth	50 – 300 GB/s	900 – 3,350 GB/s	1,200 GB/s	50 – 130 GB/s
Peak FP16 TFLOPS	~1–4 TFLOPS	~990 TFLOPS (H100 SXM)	~275 TFLOPS (v4)	~0.5–2 TFLOPS
AI TOPS (INT8)	~1–10 TOPS	~2,000 TOPS	~550 TOPS	10 – 100+ TOPS
TDP / Power	5 – 350 W	75 – 1,000 W	~170 W / chip	< 5 W
Parallelism model	MIMD, OoO, superscalar	SIMT / SIMD	Systolic dataflow	Fixed dataflow pipelines
Programmability	Any language / OS	CUDA / HIP / OpenCL	XLA (TensorFlow / JAX / PyTorch)	Fixed ops / NNAPI
Deployment	Universal	Server / Desktop	Google Cloud only	Mobile / Edge SoC
Best for	OS, databases, latency-critical	AI training, gaming, HPC	Large-scale ML training	On-device inference
Worst for	Parallel ML training	Power-constrained inference	Non-tensor workloads	Training, general code
Example chips	Intel Core Ultra, AMD Ryzen, Apple M4	NVIDIA H100, RTX 5090, AMD MI300X	Google TPU v4, v5p	Apple ANE, Qualcomm Hexagon

Architecture Deep Dive

CPU: Why does a CPU have so few cores compared to a GPU?

CPU cores are complex. Each one contains an out-of-order execution engine, branch predictor, multi-level TLB, L1/L2 cache, speculative execution units, and a superscalar pipeline that can retire multiple instructions per cycle. A single modern core such as AMD's Zen 5 occupies several mm² of die area.

GPUs trade all of that complexity for simplicity. A CUDA core is essentially a floating-point ALU: no branch prediction, no OoO, no large private cache. It's tiny, so a large GPU die can hold around 18,000 of them. The trade-off: each individual CUDA core is far slower than a CPU core on a single instruction stream. Taking ~100× slower as an illustrative figure, 18,000 × (1/100) = 180 CPU-core equivalents of parallel throughput, which is how a GPU ends up beating a multi-core CPU by 10–50× on parallel workloads.

The fundamental insight: CPUs minimise the latency of each task, while GPUs maximise throughput across many tasks.
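The core-count arithmetic above can be sketched as a toy model. The function name, the 16-core CPU, and the "100× slower per core" figure are illustrative assumptions, not measurements; the point is only that the GPU's advantage depends on how many independent tasks are available.

```python
def effective_throughput(cores: int, per_core_speed: float, parallel_tasks: int) -> float:
    """Throughput in 'CPU-core equivalents'.

    per_core_speed is relative to one CPU core (1.0). Only as many cores as
    there are independent tasks can do useful work.
    """
    busy_cores = min(cores, parallel_tasks)
    return busy_cores * per_core_speed


# Illustrative, assumed figures: a 16-core CPU and a GPU with 18,000 cores
# that are each 100x slower than a CPU core.
CPU = dict(cores=16, per_core_speed=1.0)
GPU = dict(cores=18_000, per_core_speed=1 / 100)

# Massively parallel job (e.g. a large matrix multiply): the GPU wins.
cpu_parallel = effective_throughput(**CPU, parallel_tasks=1_000_000)
gpu_parallel = effective_throughput(**GPU, parallel_tasks=1_000_000)
print(f"parallel: CPU {cpu_parallel:.0f}, GPU {gpu_parallel:.0f}, "
      f"GPU/CPU = {gpu_parallel / cpu_parallel:.2f}x")

# Strictly sequential job: only one core is busy, so the CPU wins.
cpu_serial = effective_throughput(**CPU, parallel_tasks=1)
gpu_serial = effective_throughput(**GPU, parallel_tasks=1)
print(f"serial:   CPU {cpu_serial:.2f}, GPU {gpu_serial:.2f}")
```

With these assumed numbers the GPU comes out about 11× ahead on the parallel job and 100× behind on the sequential one, which is the latency-versus-throughput trade-off in miniature.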

TPU: What is a systolic array and why is it perfect for matrix multiply?

A systolic array is a grid of simple multiply-accumulate (MAC) cells. Data flows through the grid like blood through a heart — each cell receives values from its neighbours, multiplies, adds to its accumulator, and passes the result onward. No memory fetch, no cache, no branch: pure dataflow.

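For a matrix multiply C = A × B, the A matrix flows horizontally across rows, B flows vertically down columns, and each cell accumulates one element of C. Once its pipeline is full, a 128×128 systolic array completes a 128×128 matrix multiply in roughly 128 cycles (an isolated product also pays a few hundred cycles of fill and drain), versus many thousands of cycles on a CPU core that has to fetch operands through its cache hierarchy.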
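To make the dataflow concrete, here is a minimal cycle-by-cycle simulation of an output-stationary systolic array in plain Python. It is a sketch under stated assumptions, not a model of any real TPU: the skewed input schedule (row i of A delayed by i cycles, column j of B delayed by j cycles) and the function names are illustrative choices. In this particular scheme one isolated n×n product takes 3n − 2 cycles including fill and drain.

```python
def cycles_needed(n: int) -> int:
    """Cycles for one isolated n x n product in this skewed scheme."""
    return 3 * n - 2


def systolic_matmul(a, b):
    """Multiply two n x n matrices on a simulated n x n grid of MAC cells."""
    n = len(a)
    acc = [[0] * n for _ in range(n)]    # each cell accumulates one C[i][j]
    a_reg = [[0] * n for _ in range(n)]  # A value currently held in each cell
    b_reg = [[0] * n for _ in range(n)]  # B value currently held in each cell

    for t in range(cycles_needed(n)):
        # Shift: A values move one cell right, B values move one cell down.
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Inject skewed inputs at the left and top edges (zeros when idle).
        for i in range(n):
            k = t - i
            a_reg[i][0] = a[i][k] if 0 <= k < n else 0
        for j in range(n):
            k = t - j
            b_reg[0][j] = b[k][j] if 0 <= k < n else 0
        # Every cell performs one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = systolic_matmul(A, B)
print(C, "in", cycles_needed(2), "cycles")
```

Because of the skew, cell (i, j) sees `a[i][k]` and `b[k][j]` together at cycle k + i + j, so every product lands in the right accumulator without any cell addressing memory. When products are streamed back to back, fill and drain overlap and the cost per product can approach n cycles.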

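A TPU v4 MXU performs a full 128×128 grid of BF16 multiply-accumulates every clock cycle: 16,384 MACs, or 32,768 floating-point operations, per cycle. At a clock of about 1 GHz that is roughly 33 TFLOPS per MXU, and the chip-level figure of ~275 TFLOPS comes from several MXUs working in parallel. Operands are reused as they flow from cell to cell rather than being fetched from memory for every operation. This is why TPUs are so efficient per chip for transformer training, where almost everything reduces to matrix math.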
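The peak-throughput arithmetic can be checked with a few lines of Python. The helper name is hypothetical, and the figures of 8 MXUs per chip and a 1,050 MHz clock are assumptions that are not stated in the text above; each multiply-accumulate is counted as two floating-point operations, which is the usual convention for peak FLOPS.

```python
def peak_tflops(mac_rows: int, mac_cols: int, units: int, clock_hz: float) -> float:
    """Peak TFLOPS for `units` systolic arrays of mac_rows x mac_cols MAC cells."""
    macs_per_cycle = mac_rows * mac_cols * units
    flops_per_cycle = 2 * macs_per_cycle  # one multiply + one add per MAC
    return flops_per_cycle * clock_hz / 1e12


# One 128x128 array at 940 MHz: about 30.8 TFLOPS.
single_mxu = peak_tflops(128, 128, units=1, clock_hz=940e6)

# Assumed chip configuration: 8 arrays at 1,050 MHz gives about 275 TFLOPS.
whole_chip = peak_tflops(128, 128, units=8, clock_hz=1_050e6)

print(f"single MXU: {single_mxu:.1f} TFLOPS, chip: {whole_chip:.1f} TFLOPS")
```

The gap between the two numbers shows that a single 128×128 array cannot reach ~275 TFLOPS at around 1 GHz on its own; the chip-level figure needs several arrays running in parallel.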

NPU: Why does an NPU use INT8 instead of FP32 like a CPU?

An FP32 multiplier needs far more transistors than an INT8 multiplier (an order of magnitude or more in silicon area) and uses on the order of 16× more energy per operation. On a battery-powered device, FP32 for inference would drain your phone in hours.

Quantization solves this: after training in FP32, the model weights are mapped to INT8 (or INT4) with a scale factor. For most vision/NLP inference tasks the accuracy loss at INT8 is typically under 1%, which is invisible to the end user. The power savings are substantial: compared with an FP32 design, an INT8 MAC array can do roughly 4× more operations per mm² and per watt.
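The "map weights to INT8 with a scale factor" step can be sketched as follows. This uses symmetric per-tensor quantization, which is one common scheme and an assumption here; real toolchains often use per-channel scales, zero points, or calibration data. The function names are illustrative.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by q * scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    # Round to the nearest integer step and clamp to the INT8 range used here.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate FP values from INT8 codes."""
    return [v * scale for v in q]


weights = [0.503, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 codes:", q)          # [50, -127, 0, 100]
print("scale:", scale)
print("max abs error:", max_error)
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half a quantization step (`scale / 2`), which is why accuracy usually degrades only slightly.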

Apple's Neural Engine on the A18 Pro is rated at 38 TOPS; assuming it draws roughly 0.5 W, that is 76 TOPS/W. An NVIDIA H100 at 700 W achieves ~2,000 TOPS INT8, or only about 2.9 TOPS/W. On those figures the NPU wins on power efficiency by a factor of about 26× for inference.
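The efficiency comparison is simple division, shown here with the figures quoted above. The 0.5 W NPU power draw is a rough estimate, so the resulting ratio should be read as an order-of-magnitude indication rather than a benchmark.

```python
def tops_per_watt(tops: float, watts: float) -> float:
    """Efficiency in tera-operations per second per watt."""
    return tops / watts


npu_eff = tops_per_watt(tops=38, watts=0.5)     # NPU figures quoted in the text
gpu_eff = tops_per_watt(tops=2000, watts=700)   # H100 INT8 figures quoted in the text
advantage = npu_eff / gpu_eff

print(f"NPU: {npu_eff:.1f} TOPS/W")
print(f"GPU: {gpu_eff:.2f} TOPS/W")
print(f"NPU advantage: {advantage:.1f}x")
```

Without intermediate rounding the ratio comes out at 26.6×; rounding the GPU figure to 2.9 TOPS/W first gives the 26× quoted in the text.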

GPU: What are Tensor Cores and how are they different from CUDA cores?

CUDA cores are general-purpose FP32 ALUs: one multiply-add per clock. Tensor Cores (introduced in Volta, 2017) are dedicated matrix-multiply units. A first-generation Tensor Core computes a 4×4×4 matmul in one cycle, which is 64 FP16 MACs versus 1 for a CUDA core, and later generations process larger tiles per cycle. That's why H100 jumps from ~66 TFLOPS FP32 (CUDA cores) to ~990 TFLOPS FP16 (Tensor Cores).

Tensor Cores are only used when matrix-multiply and convolution kernels run in an eligible data type (FP16, BF16, TF32, INT8) on compatible shapes, for example through torch.matmul. Poorly shaped tensors, whose dimensions are not multiples of 8 for FP16 or 16 for INT8, may run on slower kernels and lose much of the ~15× speedup. This is why practitioners obsess over "tensor core alignment" in model design.
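In practice, "tensor core alignment" usually means padding awkward dimensions up to a friendly multiple. The helpers below are a minimal sketch with hypothetical names; the right multiple depends on the data type and GPU generation, so the default of 16 is an assumption, and the vocabulary size of 50,257 is just an example of an awkward dimension.

```python
def pad_to_multiple(dim: int, multiple: int = 16) -> int:
    """Round a tensor dimension up to the next multiple."""
    return ((dim + multiple - 1) // multiple) * multiple


def is_aligned(m: int, n: int, k: int, multiple: int = 16) -> bool:
    """True if all three matmul dimensions (M x K times K x N) are aligned."""
    return all(d % multiple == 0 for d in (m, n, k))


vocab_size = 50_257                       # example of an awkward dimension
padded_16 = pad_to_multiple(vocab_size)   # 50272
padded_64 = pad_to_multiple(vocab_size, 64)  # 50304

print(padded_16, padded_64)
print(is_aligned(4096, vocab_size, 1024))   # False
print(is_aligned(4096, padded_16, 1024))    # True
```

The padding rows or columns are simply unused, so the cost is a small amount of wasted memory and compute in exchange for staying on the fast matrix-multiply path.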

CPU: When does a CPU beat a GPU for AI inference?

GPUs have high launch overhead — dispatching a CUDA kernel takes microseconds. For tiny models or single-sample inference (batch size = 1), the GPU is idle most of the time and a CPU can win on latency.
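A toy latency model shows where the crossover sits. Everything here is an assumption for illustration: latency is modelled as a fixed overhead plus a constant cost per sample, and the microsecond figures are made up rather than measured on any real hardware.

```python
def latency_us(fixed_overhead_us: float, per_sample_us: float, batch: int) -> float:
    """Toy model: fixed dispatch/transfer overhead plus linear per-sample cost."""
    return fixed_overhead_us + per_sample_us * batch


def crossover_batch(cpu, gpu, max_batch: int = 1024):
    """Smallest batch size at which the GPU is at least as fast as the CPU."""
    for batch in range(1, max_batch + 1):
        if latency_us(*gpu, batch) <= latency_us(*cpu, batch):
            return batch
    return None  # the GPU never catches up within max_batch


# Hypothetical (fixed overhead, per-sample cost) pairs in microseconds.
CPU_COST = (5.0, 200.0)    # almost no dispatch cost, slow per sample
GPU_COST = (500.0, 10.0)   # launch + transfer overhead, fast per sample

print("batch 1:", latency_us(*CPU_COST, 1), "us on CPU vs",
      latency_us(*GPU_COST, 1), "us on GPU")
print("GPU catches up at batch size", crossover_batch(CPU_COST, GPU_COST))
```

With these made-up numbers the CPU answers a single request in 205 µs against 510 µs for the GPU, and the GPU only pulls ahead from batch size 3 onward. Real crossover points depend heavily on the model, the runtime, and whether data already lives on the device.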

Modern CPUs with AVX-512 or Intel's AMX (Advanced Matrix Extensions) can execute 512-bit SIMD operations and small tile-based matrix multiply natively. For distilled models under ~1B parameters running at batch-1, a high-end CPU like Xeon Sapphire Rapids with AMX is competitive with a data-centre GPU — at a fraction of the cost and power.

For sparse models and models with dynamic control flow (tree-of-thought reasoning, variable-length decoding), the CPU's branch prediction and cheap handling of divergent code paths can also beat a GPU, which pays a penalty for irregular computation.

Frequently Asked Questions

What is the main difference between CPU and GPU?

A CPU has a few powerful cores optimized for low-latency sequential tasks with large caches and branch prediction. A GPU has thousands of smaller cores designed for massively parallel workloads like graphics rendering and matrix math for AI. CPUs excel at "do this complex thing once." GPUs excel at "do this simple thing ten thousand times simultaneously."

What is a TPU and how is it different from a GPU?

A TPU is Google's custom ASIC built around a systolic array for matrix multiply, the core operation of neural networks. Unlike a GPU, which is general-purpose parallel hardware you can program with CUDA, a TPU is purpose-built to run tensor graphs compiled through XLA from frameworks such as TensorFlow and JAX. It delivers higher throughput per watt for those specific ML workloads, but cannot run arbitrary code.

What is an NPU used for?

An NPU is a low-power AI accelerator integrated into mobile and edge SoCs. It handles on-device inference: face recognition, voice commands, camera scene detection, real-time translation. It uses INT8/INT4 quantized models and delivers the best performance-per-watt of any processor type for these fixed workloads — critical when you're running on a phone battery.

Which processor is best for AI training?

GPUs (NVIDIA H100/A100) and TPUs dominate AI training. Both provide high memory bandwidth and massive parallel compute for the matrix operations that dominate transformer training. CPUs are 10–100× slower on these workloads. NPUs don't support training at all: they're inference-only by design.

Can a CPU replace a GPU for deep learning?

Not for large-scale training. A CPU will be 10–100× slower for matrix-heavy workloads. For inference of small models, modern CPUs with AVX-512 or Intel AMX extensions can be surprisingly competitive at batch size 1, beating a GPU on latency due to lower kernel launch overhead. But for anything resembling production training, GPUs or TPUs are required.

What does TOPS mean for NPUs?

TOPS stands for Tera Operations Per Second: one trillion operations per second, typically quoted at INT8 precision. It's the standard headline figure for NPU performance. The Apple A18 Pro Neural Engine delivers 38 TOPS. Qualcomm Snapdragon 8 Elite's Hexagon NPU delivers 75 TOPS. Higher TOPS generally means faster on-device AI inference, but the numbers are only comparable at the same precision (INT8 and INT4 figures for the same chip can differ substantially).
