Architecture Deep Dive

CPU vs GPU vs TPU vs NPU

Four processor architectures, four completely different philosophies. Understand the silicon-level decisions that make each one dominate its domain — and fail at everyone else's.

CPU · General Purpose | GPU · Parallel Compute | TPU · Tensor Math | NPU · Edge AI
The Four Processor Architectures
Each chip is optimised for a fundamentally different execution model
Central Processing Unit
CPU
The brain of every computer. Designed for low-latency sequential execution with complex control flow, branch prediction, and deep cache hierarchies. A master of one task at a time.
Core count: 4 – 192
Clock speed: 3 – 6 GHz
Cache (L3): 16 – 256 MB
Memory BW: 50 – 300 GB/s
TDP: 5 – 350 W
Parallelism: MIMD / OoO
Examples: Intel Core Ultra 9, AMD Ryzen 9 9950X, Apple M4 Pro CPU, ARM Cortex-X4
Graphics Processing Unit
GPU
Thousands of small shaders executing the same instruction on different data simultaneously. Originally for pixels, now the workhorse of AI training and scientific simulation.
Core count: 2,000 – 18,432
Clock speed: 1 – 3.5 GHz
VRAM: 8 – 192 GB (GDDR or HBM)
Memory BW: 900 – 3,350 GB/s
TDP: 75 – 1,000 W
Parallelism: SIMT / SIMD
Examples: NVIDIA H100, AMD MI300X, NVIDIA RTX 5090, Apple M4 GPU
Tensor Processing Unit
TPU
Google's custom ASIC built around a systolic array matrix multiply unit (MXU). Not programmable for arbitrary code — purpose-built to execute TensorFlow and JAX tensor graphs with extreme efficiency.
MXU size: 128×128 systolic
Precision: bfloat16 / int8
HBM: 16 – 32 GB / chip
Memory BW: 1,200 GB/s (v4)
TDP: ~170 W / chip
Deployment: Google Cloud only
Examples: Google TPU v4 (pod: 4,096 chips), Google TPU v5p, Edge TPU (Coral)
Neural Processing Unit
NPU
A power-sipping AI accelerator embedded in a SoC for smartphones, laptops, and IoT devices. Runs quantized neural networks for on-device inference without draining your battery.
TOPS: 10 – 100+
Precision: INT4 / INT8
Power: 0.5 – 5 W
Memory: Shared LPDDR5
TDP: < 5 W total SoC
Task: Inference only
Examples: Apple Neural Engine (38 TOPS), Qualcomm Hexagon (75 TOPS), Intel AI Boost (11 TOPS)
Architecture Diagrams
How the silicon is actually organised inside each chip

CPU Die Layout

Cores 0–3: each with L1$ + L2$ + OoO engine
Shared L3 Cache (32–256 MB)
Memory Controller · PCIe / I/O Hub
DDR5 / LPDDR5

Few powerful OoO cores + large shared cache. Low latency, high IPC.

GPU Die Layout

Streaming Multiprocessors (×132 on H100): SM SM SM SM SM SM SM SM SM SM … ×122 more SMs …
L2 Cache (50 MB)
HBM3 (80 GB · 3.35 TB/s)
NVLink 900 GB/s · PCIe 5.0

Thousands of SIMT cores + HBM. Maximises throughput over latency.

TPU Die Layout

Matrix Multiply Unit (MXU): 128 × 128 systolic array
Vector Unit: activation / norm
Scalar Unit: control flow
HBM2e (32 GB · 1.2 TB/s)
ICI Interconnect (pod: 4096 chips)

Systolic MXU feeds data through fixed multiply-accumulate cells — no cache misses, no branch prediction.

NPU Die Layout

SoC (e.g. Apple A18 Pro)
CPU Cluster: 2P + 4E cores · GPU Cluster: 6-core GPU
Neural Engine (NPU): 35 TOPS · INT8/INT4 · MAC arrays + activation units
ISP · Secure Enc · 5G Modem
Shared LPDDR5X (8 GB · ~60 GB/s)
On-chip Fabric / NoC

Dedicated MAC array tile inside an SoC. Shares LPDDR with CPU/GPU. Ultra low-power inference only.

Performance Benchmarks
Relative scores — higher is better (normalised per category)
Sequential / Single-Thread Speed (branch prediction, OoO, high IPC)
CPU: 95 · GPU: 30 · TPU: 10 · NPU: 12
AI Training Throughput (Transformer / LLM training, TFLOPS)
CPU: 5 · GPU: 90 · TPU: 95 · NPU: 2
Edge Inference Efficiency (perf / watt for INT8 quantized models)
CPU: 15 · GPU: 45 · TPU: 60 · NPU: 98
Memory Bandwidth (GB/s, data-hungry workloads)
CPU: 300 GB/s · GPU: 3.35 TB/s · TPU: 1.2 TB/s · NPU: 120 GB/s
Programmability / Flexibility (how easily it handles new workloads)
CPU: 100 · GPU: 80 · TPU: 35 · NPU: 20
Full Spec Comparison
Side-by-side technical breakdown
Feature | CPU | GPU | TPU | NPU
Full name | Central Processing Unit | Graphics Processing Unit | Tensor Processing Unit | Neural Processing Unit
Core count | 4 – 192 | 2,000 – 18,432 | MXU 128×128 | MAC arrays (dedicated)
Clock speed | 3 – 6 GHz | 1 – 3.5 GHz | ~1,050 MHz (v4) | Variable (power-gated)
Numeric precision | FP64 / FP32 / INT | FP64 / FP32 / FP16 / BF16 / INT8 | BF16 / INT8 (native) | INT8 / INT4 (quantized)
Memory type | DDR5 / LPDDR5 | HBM2e / HBM3e | HBM2e | Shared LPDDR5
Memory bandwidth | 50 – 300 GB/s | 900 – 3,350 GB/s | 1,200 GB/s | 50 – 130 GB/s
Peak FP16 TFLOPS | ~1–4 TFLOPS | ~990 TFLOPS (H100 SXM) | ~275 TFLOPS (v4, BF16) | ~0.5–2 TFLOPS
AI TOPS (INT8) | ~1–10 TOPS | ~2,000 TOPS | ~275 TOPS (v4) | 10 – 100+ TOPS
TDP / Power | 5 – 350 W | 75 – 1,000 W | ~170 W / chip | < 5 W
Parallelism model | MIMD, OoO, superscalar | SIMT / SIMD | Systolic dataflow | Fixed dataflow pipelines
Programmability | Any language / OS | CUDA / HIP / OpenCL | TensorFlow / JAX only | Fixed ops / NNAPI
Deployment | Universal | Server / Desktop | Google Cloud only | Mobile / Edge SoC
Best for | OS, databases, latency-critical | AI training, gaming, HPC | Large-scale ML training | On-device inference
Worst for | Parallel ML training | Power-constrained inference | Non-tensor workloads | Training, general code
Example chips | Intel Core Ultra, AMD Ryzen, Apple M4 | NVIDIA H100, RTX 5090, AMD MI300X | Google TPU v4, v5p | Apple ANE, Qualcomm Hexagon
Which Processor Wins Each Use Case?
Pick the right hardware for the right job
🧠 LLM Training · Winner: TPU / GPU
CPU: ❌ Too slow
GPU: ✅ Winner (H100)
TPU: ✅ Winner (pods)
NPU: ❌ Inference only
📱 On-device AI · Winner: NPU
CPU: ⚡ Fallback
GPU: ⚡ Possible
TPU: ❌ Cloud only
NPU: ✅ Best perf/W
🎮 Gaming / Graphics · Winner: GPU
CPU: ⚡ Integrated only
GPU: ✅ Dominant
TPU: ❌ No rasterization
NPU: ❌ Not designed
🗄️ Database / OS · Winner: CPU
CPU: ✅ Dominant
GPU: ❌ No OS support
TPU: ❌ Not programmable
NPU: ❌ Fixed ops only
🔬 HPC / Simulation · Winner: GPU
CPU: ⚡ MPI clusters
GPU: ✅ CUDA / HIP
TPU: ⚡ JAX possible
NPU: ❌ Too limited
🌐 LLM Inference (cloud) · Winner: GPU
CPU: ❌ Low throughput
GPU: ✅ vLLM / TRT-LLM
TPU: ✅ JAX serving
NPU: ❌ No cloud deploy

Architecture Deep Dive

CPU: Why does a CPU have so few cores compared to a GPU?

CPU cores are complex. Each one contains an out-of-order execution engine, branch predictor, multi-level TLB, L1/L2 cache, speculative execution units, and a superscalar pipeline that can retire multiple instructions per cycle. A single high-performance core such as AMD's Zen 5 occupies several square millimetres of die area.

GPUs trade all of that complexity for simplicity. A CUDA core is essentially a floating-point ALU lane: no branch prediction, no OoO, no large private cache. It's tiny, so a single GPU die can hold around 18,000 of them. The trade-off: taken alone, each CUDA core is far slower than a CPU core, say ~100× on a single instruction stream. But 18,000 × (1/100) = 180 CPU-core equivalents of aggregate throughput, which is how a GPU beats a 4–16-core CPU by 10–50× on parallel workloads.
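The arithmetic above can be sketched in a few lines. This is a back-of-the-envelope model only: the 18,000-core count and the ~100× per-core slowdown are the article's illustrative figures, while the 16-core CPU and the function names are assumptions made for this example.

```python
def cpu_core_equivalents(gpu_cores: int, slowdown_per_core: float) -> float:
    """Aggregate GPU throughput, expressed as a number of CPU cores."""
    return gpu_cores / slowdown_per_core


def parallel_speedup(gpu_cores: int, slowdown_per_core: float, cpu_cores: int) -> float:
    """Idealised speedup on a perfectly parallel workload (no memory limits)."""
    return cpu_core_equivalents(gpu_cores, slowdown_per_core) / cpu_cores


# Illustrative figures from the text; the 16-core CPU is an assumption.
equivalents = cpu_core_equivalents(18_000, 100.0)
speedup = parallel_speedup(18_000, 100.0, cpu_cores=16)

print(f"{equivalents:.0f} CPU-core equivalents, {speedup:.2f}x over a 16-core CPU")
```

The model ignores memory bandwidth, SIMD units on the CPU, and workloads that do not parallelise, so real speedups can land well above or below this estimate.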

The fundamental insight: work done = throughput × time. CPUs minimise latency, the time each individual task takes. GPUs maximise throughput, the number of tasks finished per second.

TPU: What is a systolic array and why is it perfect for matrix multiply?

A systolic array is a grid of simple multiply-accumulate (MAC) cells. Data flows through the grid like blood through a heart: on every clock tick each cell receives values from its neighbours, multiplies them, adds the product to its accumulator, and passes the values onward to the next cell. Inside the array there is no per-operation memory fetch, no cache, no branch: pure dataflow.

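For a matrix multiply C = A × B, the A matrix flows horizontally across rows, B flows vertically down columns, and each cell accumulates one element of C. A 128×128 systolic array works through a 128×128 matrix multiply in a few hundred cycles (about 3 × 128 including pipeline fill and drain, approaching 128 per tile when tiles stream back to back), versus tens of thousands of cycles or more on a CPU core, before counting cache misses.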
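The dataflow described above can be simulated cycle by cycle. The sketch below is a toy model of an output-stationary array, not a description of any real TPU's internals: rows of A enter from the left and columns of B from the top, each skewed by one cycle per row or column, and every cell multiplies whatever passes through it. The function name and the skewing scheme are assumptions chosen for illustration.

```python
def systolic_matmul(A, B):
    """Simulate an n x n output-stationary systolic array computing A @ B."""
    n = len(A)
    acc = [[0] * n for _ in range(n)]    # one accumulator per cell (holds C)
    a_reg = [[0] * n for _ in range(n)]  # A values currently inside each cell
    b_reg = [[0] * n for _ in range(n)]  # B values currently inside each cell
    cycles = 3 * n - 2                   # n products plus fill and drain of the skew

    for t in range(cycles):
        # Shift A one cell to the right and B one cell down.
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Inject new operands at the edges; row i and column j are delayed
        # by i and j cycles so matching operands meet in the right cell.
        for i in range(n):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0
        for j in range(n):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0
        # Every cell performs one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C, cycles = systolic_matmul(A, B)
print(C, cycles)
```

In this model a single n×n product takes 3n − 2 steps because of the skewed fill and drain, which for n = 128 is 382 steps. Hardware that streams tiles back to back hides most of that overhead.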

Each TPU v4 MXU performs a 128×128 grid of BF16 multiply-accumulates per clock cycle: 16,384 MACs, or 32,768 floating-point operations. A v4 chip carries eight such MXUs, so at ~1,050 MHz the peak is 8 × 32,768 × 1.05 GHz ≈ 275 TFLOPS, with operands streaming between cells rather than being fetched from DRAM for each operation. This is why TPUs are so efficient per chip for transformer training, where almost everything reduces to matrix math.
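As a sanity check on the peak figure, the calculation can be written out. The sketch assumes eight 128×128 MXUs per chip and a ~1.05 GHz clock, values taken from Google's published TPU v4 figures rather than derived here, and it counts each multiply-accumulate as two floating-point operations. The function name is hypothetical.

```python
def peak_tflops(rows: int, cols: int, mxu_count: int, clock_hz: float) -> float:
    """Peak throughput of systolic MXUs, counting one MAC as 2 FLOPs."""
    macs_per_cycle = rows * cols * mxu_count
    flops_per_cycle = 2 * macs_per_cycle
    return flops_per_cycle * clock_hz / 1e12


# Assumed TPU v4 parameters: 8 MXUs of 128x128 cells at ~1.05 GHz.
one_mxu = peak_tflops(128, 128, mxu_count=1, clock_hz=1.05e9)
chip = peak_tflops(128, 128, mxu_count=8, clock_hz=1.05e9)

print(f"one MXU: {one_mxu:.1f} TFLOPS, whole chip: {chip:.1f} TFLOPS")
```

A single MXU accounts for only about an eighth of the chip's headline number, and sustained throughput on real models sits below this peak.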

NPU: Why does an NPU use INT8 instead of FP32 like a CPU?

An FP32 multiplier needs far more transistors than an INT8 multiplier (commonly cited estimates put it in the tens of times more silicon area) and consumes roughly 16× more energy per operation. On a battery-powered device, running inference in FP32 would drain your phone in hours.

Quantization solves this: after training in FP32, the model weights are mapped to INT8 (or INT4) with a scale factor. The accuracy loss for most vision/NLP inference tasks is under 1% — invisible to the end user. But the power savings are enormous: an INT8 MAC array can do 4× more operations per mm² and per watt.
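The mapping to INT8 with a scale factor can be illustrated with a minimal sketch. It assumes the simplest scheme, symmetric per-tensor quantization, where one scale covers the whole tensor. Real toolchains often use per-channel scales, zero points, and calibration data. The function names and sample weights are made up for the example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range used here.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floating-point weights from INT8 codes."""
    return [scale * v for v in q]


weights = [0.52, -1.27, 0.003, 0.9]   # hypothetical FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q, scale, max_error)
```

The worst-case rounding error per weight is half a quantization step (scale / 2), which is why tensors with a few large outliers quantize poorly under a single shared scale.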

Apple's Neural Engine on the M4 is rated at 38 TOPS. If it draws roughly 0.5 W (an estimate rather than a published figure), that's 76 TOPS/W. An NVIDIA H100 at 700 W achieves ~2,000 TOPS INT8, only about 2.9 TOPS/W. On those figures the NPU wins on power efficiency by a factor of roughly 26× for inference.
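The efficiency comparison is simple division, reproduced here so the inputs are explicit. The 0.5 W figure for the Neural Engine is the article's rough estimate, and both TOPS values are peak ratings, so the result is a sketch under those assumptions rather than a measurement.

```python
def tops_per_watt(tops: float, watts: float) -> float:
    """Peak efficiency: rated tera-operations per second per watt."""
    return tops / watts


npu_eff = tops_per_watt(38, 0.5)      # Neural Engine, estimated power draw
gpu_eff = tops_per_watt(2000, 700)    # H100, INT8 peak at full TDP
ratio = npu_eff / gpu_eff

print(f"NPU {npu_eff:.1f} TOPS/W, GPU {gpu_eff:.1f} TOPS/W, ratio {ratio:.1f}x")
```

The unrounded ratio comes out slightly above 26× because the text rounds the GPU figure to 2.9 TOPS/W before dividing.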

GPU: What are Tensor Cores and how are they different from CUDA cores?

CUDA cores are general-purpose FP32 ALUs: one fused multiply-add per clock each. Tensor Cores (introduced in Volta, 2017) are small matrix-multiply hardware units. A first-generation Tensor Core computes a 4×4×4 matmul per cycle: 64 FP16 MACs versus 1 for a CUDA core. Later generations, including the 4th-gen units in H100, process larger tiles per cycle. That's why H100 jumps from ~67 TFLOPS FP32 (CUDA cores) to ~990 TFLOPS FP16 (Tensor Cores).

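Tensor Cores are used only when the work is a matrix multiply or convolution in a supported precision (FP16, BF16, TF32, FP8 or INT8), for example torch.matmul on half-precision tensors. Poorly shaped tensors (dimensions that are not multiples of 8 for FP16, or 16 for INT8) can run on slower kernels and give up much of the ~15× speedup. This is why practitioners obsess over "tensor core alignment" in model design.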
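In practice, alignment usually means rounding a dimension up to the next friendly multiple and padding the tensor. The helper below is a framework-free sketch. The right multiple depends on precision and GPU generation, so it is passed as a parameter rather than asserted, and the 50,257-entry vocabulary is just a familiar example of an awkward size.

```python
def round_up(value: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= value."""
    return ((value + multiple - 1) // multiple) * multiple


def padded_shape(rows: int, cols: int, multiple: int) -> tuple:
    """Shape after padding both matrix dimensions to the alignment."""
    return round_up(rows, multiple), round_up(cols, multiple)


vocab_size = 50_257                      # example of an unaligned dimension
aligned_vocab = round_up(vocab_size, 16)
wasted = aligned_vocab - vocab_size      # padding rows that hold no real tokens

print(aligned_vocab, wasted, padded_shape(100, 30, 16))
```

The padding costs a handful of unused rows or columns, which is typically a small price next to the throughput lost on a misaligned matrix multiply.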

CPU: When does a CPU beat a GPU for AI inference?

GPUs have high launch overhead — dispatching a CUDA kernel takes microseconds. For tiny models or single-sample inference (batch size = 1), the GPU is idle most of the time and a CPU can win on latency.
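A toy latency model shows where the crossover comes from. Every number below is hypothetical: a CPU sustaining 1 TFLOP/s, a GPU sustaining 100 TFLOP/s, 50 kernel launches per forward pass, and 5 µs of overhead per launch. Real systems overlap launches with execution and can batch them (for example with CUDA Graphs), so treat this as an illustration of the shape of the trade-off only.

```python
def cpu_latency_us(flops: float, cpu_flops_per_us: float) -> float:
    """CPU time: pure compute, no dispatch overhead in this model."""
    return flops / cpu_flops_per_us


def gpu_latency_us(flops: float, gpu_flops_per_us: float,
                   kernels: int, launch_overhead_us: float) -> float:
    """GPU time: fixed per-kernel launch cost plus compute."""
    return kernels * launch_overhead_us + flops / gpu_flops_per_us


# Hypothetical hardware parameters (1 TFLOP/s equals 1e6 FLOP per microsecond).
CPU_RATE = 1e6
GPU_RATE = 1e8
KERNELS = 50
LAUNCH_US = 5.0

small_model = 1e7    # FLOPs per inference, tiny batch-1 model
large_model = 1e10   # FLOPs per inference, large or batched workload

for name, flops in (("small", small_model), ("large", large_model)):
    cpu = cpu_latency_us(flops, CPU_RATE)
    gpu = gpu_latency_us(flops, GPU_RATE, KERNELS, LAUNCH_US)
    print(f"{name}: CPU {cpu:.1f} us, GPU {gpu:.1f} us")
```

With these assumed numbers the small model finishes on the CPU long before the GPU has paid its launch overhead, while the large workload amortises that fixed cost and the GPU wins comfortably.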

Modern CPUs with AVX-512 or Intel's AMX (Advanced Matrix Extensions) can execute 512-bit SIMD operations and small tile-based matrix multiply natively. For distilled models under ~1B parameters running at batch-1, a high-end CPU like Xeon Sapphire Rapids with AMX is competitive with a data-centre GPU — at a fraction of the cost and power.

For sparse models and models with dynamic control flow (tree-of-thought reasoning, variable-length decoding), the CPU's branch predictors handle irregular computation cheaply, whereas a GPU pays a divergence penalty whenever threads in the same warp take different paths.

Frequently Asked Questions

What is the main difference between CPU and GPU?
A CPU has a few powerful cores optimized for low-latency sequential tasks with large caches and branch prediction. A GPU has thousands of smaller cores designed for massively parallel workloads like graphics rendering and matrix math for AI. CPUs excel at "do this complex thing once." GPUs excel at "do this simple thing ten thousand times simultaneously."
What is a TPU and how is it different from a GPU?
A TPU is Google's custom ASIC built around a systolic array for matrix multiply — the core operation of neural networks. Unlike a GPU which is general-purpose parallel hardware you can program with CUDA, a TPU is purpose-built only for TensorFlow and JAX tensor graphs. It delivers higher throughput per watt for those specific ML workloads, but cannot run arbitrary code.
What is an NPU used for?
An NPU is a low-power AI accelerator integrated into mobile and edge SoCs. It handles on-device inference: face recognition, voice commands, camera scene detection, real-time translation. It uses INT8/INT4 quantized models and delivers the best performance-per-watt of any processor type for these fixed workloads — critical when you're running on a phone battery.
Which processor is best for AI training?
GPUs (NVIDIA H100/A100) and TPUs dominate AI training. Both provide high memory bandwidth and massive parallel compute for the matrix operations that dominate transformer training. CPUs are 10–100× too slow. NPUs don't support training at all — they're inference-only by design.
Can a CPU replace a GPU for deep learning?
Not for large-scale training. A CPU will be 10–100× slower for matrix-heavy workloads. For inference of small models, modern CPUs with AVX-512 or Intel AMX extensions can be surprisingly competitive at batch size 1, beating a GPU on latency due to lower kernel launch overhead. But for anything resembling production training, GPUs or TPUs are required.
What does TOPS mean for NPUs?
TOPS stands for Tera Operations Per Second: one trillion operations per second, typically measured at INT8 precision. It's the standard headline figure for NPU performance. Apple's M4 Neural Engine is rated at 38 TOPS. Qualcomm's Hexagon NPU in the Snapdragon 8 Elite is quoted at 75 TOPS. Higher TOPS means faster on-device AI inference, but the number is only meaningful when comparing at the same precision (INT8 and INT4 figures are not interchangeable).