Four processor architectures, four completely different philosophies. Understand the silicon-level decisions that make each one dominate its domain — and fail at everyone else's.
Few powerful OoO cores + large shared cache. Low latency, high IPC.
Thousands of SIMT cores + HBM. Maximises throughput over latency.
Systolic MXU feeds data through fixed multiply-accumulate cells — no cache misses, no branch prediction.
Dedicated MAC array tile inside an SoC. Shares LPDDR with CPU/GPU. Ultra low-power inference only.
| Feature | CPU | GPU | TPU | NPU |
|---|---|---|---|---|
| Full Name | Central Processing Unit | Graphics Processing Unit | Tensor Processing Unit | Neural Processing Unit |
| Core count | 4 – 192 | 2,000 – 18,432 | MXU 128×128 | MAC arrays (dedicated) |
| Clock speed | 3 – 6 GHz | 1 – 3.5 GHz | ~1,050 MHz (v4) | Variable (power-gated) |
| Numeric precision | FP64 / FP32 / INT | FP64 / FP32 / FP16 / BF16 / INT8 | BF16 / INT8 (native) | INT8 / INT4 (quantized) |
| Memory type | DDR5 / LPDDR5 | HBM2e / HBM3e | HBM2e | Shared LPDDR5 |
| Memory bandwidth | 50 – 300 GB/s | 900 – 3,350 GB/s | 1,200 GB/s | 50 – 130 GB/s |
| Peak FP16 TFLOPS | ~1–4 TFLOPS | ~990 TFLOPS (H100 SXM) | ~275 TFLOPS (v4) | ~0.5–2 TFLOPS |
| AI TOPS (INT8) | ~1–10 TOPS | ~2,000 TOPS | ~550 TOPS | 10 – 100+ TOPS |
| TDP / Power | 5 – 350 W | 75 – 1,000 W | ~170 W / chip | < 5 W |
| Parallelism model | MIMD, OoO, superscalar | SIMT / SIMD | Systolic dataflow | Fixed dataflow pipelines |
| Programmability | Any language / OS | CUDA / HIP / OpenCL | JAX / TensorFlow / PyTorch (via XLA) | Fixed ops / NNAPI |
| Deployment | Universal | Server / Desktop | Google Cloud only | Mobile / Edge SoC |
| Best for | OS, databases, latency-critical | AI training, gaming, HPC | Large-scale ML training | On-device inference |
| Worst for | Parallel ML training | Power-constrained inference | Non-tensor workloads | Training, general code |
| Example chips | Intel Core Ultra, AMD Ryzen, Apple M4 | NVIDIA H100, RTX 5090, AMD MI300X | Google TPU v4, v5p | Apple ANE, Qualcomm Hexagon |
CPU cores are complex. Each one contains an out-of-order execution engine, branch predictor, multi-level TLB, L1/L2 cache, speculative execution units, and a superscalar pipeline that can retire multiple instructions per cycle. A single AMD Zen 5 core, with its private L2, occupies on the order of a few mm² of die area.
GPUs trade all of that complexity for simplicity. A CUDA core is just a floating-point ALU: no branch prediction, no OoO, no large private cache. It's tiny, so a single large die can hold around 18,000 of them. The trade-off: each individual CUDA core is ~100× slower than a CPU core at running a serial instruction stream. But 18,000 × (1/100) = 180 CPU-core equivalents of aggregate throughput, which is how a GPU beats a CPU with a handful to a few dozen cores by 10–50× on parallel workloads.
The fundamental insight: throughput = work in flight ÷ latency (Little's law). CPUs win by minimising latency. GPUs win by maximising the amount of work in flight.
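As a rough sketch of that trade-off, the snippet below applies Little's law (throughput = work in flight ÷ latency) to two made-up machines. The lane counts and latencies are illustrative assumptions chosen to mirror the ~100× per-core gap described above, not measurements of any real chip.

```python
def throughput_ops_per_ns(ops_in_flight: int, latency_ns: float) -> float:
    """Little's law rearranged: throughput = work in flight / latency."""
    return ops_in_flight / latency_ns


# Hypothetical CPU: 16 fast cores, each finishing an operation in ~1 ns.
cpu = throughput_ops_per_ns(ops_in_flight=16, latency_ns=1.0)

# Hypothetical GPU: 18,000 simple lanes, each ~100x slower per operation.
gpu = throughput_ops_per_ns(ops_in_flight=18_000, latency_ns=100.0)

print(f"CPU: {cpu:.0f} ops/ns, GPU: {gpu:.0f} ops/ns, ratio: {gpu / cpu:.2f}x")
```

With these assumed numbers the GPU comes out roughly 11× ahead, even though every individual operation takes 100× longer to complete.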
A systolic array is a grid of simple multiply-accumulate (MAC) cells. Data flows through the grid like blood through a heart — each cell receives values from its neighbours, multiplies, adds to its accumulator, and passes the result onward. No memory fetch, no cache, no branch: pure dataflow.
For a matrix multiply C = A × B, the A matrix flows horizontally across rows, B flows vertically down columns, and each cell accumulates one element of C. A 128×128 systolic array can stream a 128×128 matrix multiply through in about 128 cycles once the pipeline is full (roughly three times that if you count fill and drain), versus many thousands of cycles on a CPU with cache misses.
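The dataflow can be sketched as a small cycle-by-cycle simulation. This is a toy model of an output-stationary array as described above (A moving right, B moving down, each cell keeping one element of C). The function name `systolic_matmul` and the skewed input schedule are illustrative assumptions and do not describe any specific chip's implementation.

```python
def systolic_matmul(a, b):
    """Simulate an n x n output-stationary systolic array computing a @ b.

    Returns the result matrix and the number of cycles simulated.
    """
    n = len(a)
    a_reg = [[0] * n for _ in range(n)]  # value of A currently held in each cell
    b_reg = [[0] * n for _ in range(n)]  # value of B currently held in each cell
    acc = [[0] * n for _ in range(n)]    # per-cell accumulator (one element of C)
    cycles = 3 * n - 2                   # fill + compute + drain for this schedule

    for t in range(cycles):
        # A values move one cell to the right; row i is fed with a delay of i cycles.
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i
            a_reg[i][0] = a[i][k] if 0 <= k < n else 0
        # B values move one cell down; column j is fed with a delay of j cycles.
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j
            b_reg[0][j] = b[k][j] if 0 <= k < n else 0
        # Every cell does one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, cycles)
```

In this model a single n×n multiply takes 3n − 2 cycles end to end. Back-to-back multiplies can overlap fill and drain, which is where a steady-state figure of about n cycles per matrix comes from.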
Each of the TPU v4's MXUs performs a 128×128 grid of BF16 multiply-accumulates every clock cycle: 16,384 MACs, or 32,768 floating-point operations, per cycle. With eight MXUs per chip at ~1,050 MHz that's ~275 TFLOPS, with operands flowing from cell to cell rather than being fetched from memory for every operation. This is why TPUs are so efficient per chip for transformer training, where almost everything reduces to matrix math.
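The peak figure can be reproduced with simple arithmetic. The sketch below assumes eight 128×128 MXUs per chip and a clock of about 1.05 GHz. Both values are assumptions used to make the numbers line up with the ~275 TFLOPS quoted above, and `peak_tflops` is a hypothetical helper.

```python
def peak_tflops(rows: int, cols: int, clock_hz: float, units: int = 1) -> float:
    """Peak throughput of `units` MAC arrays, counting each MAC as 2 FLOPs."""
    macs_per_cycle = rows * cols * units
    flops_per_cycle = 2 * macs_per_cycle  # one multiply + one add per MAC
    return flops_per_cycle * clock_hz / 1e12


one_mxu = peak_tflops(128, 128, clock_hz=1.05e9)             # a single array
whole_chip = peak_tflops(128, 128, clock_hz=1.05e9, units=8)  # assumed 8 arrays

print(f"one MXU: {one_mxu:.1f} TFLOPS, chip: {whole_chip:.1f} TFLOPS")
```

A single array at this clock gives about 34 TFLOPS, so the headline number only appears once all the arrays on the chip are counted.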
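An FP32 multiplier needs far more transistors than an INT8 multiplier (commonly cited estimates range from tens of times to ~100×, depending on the design) and consumes roughly 16× more energy per operation. On a battery-powered device, running inference in FP32 would drain your phone in hours.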
Quantization solves this: after training in FP32, the model weights are mapped to INT8 (or INT4) with a scale factor. The accuracy loss for most vision/NLP inference tasks is typically under 1%, which is invisible to the end user. But the power savings are enormous: an INT8 MAC array can do at least 4× more operations per mm² and per watt than an FP32 one.
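A minimal sketch of the weight mapping, assuming symmetric per-tensor quantization with a single scale factor. Production toolchains typically add per-channel scales, zero points, and calibration data, none of which is shown here, and the function names are illustrative.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0  # avoid dividing by zero
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the quantized integers."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.003, 1.0]  # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print(q, scale, max_err)
```

The rounding error per weight is bounded by half the scale factor, which is why tensors with a few large outliers (and therefore a large scale) quantize less gracefully.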
Apple quotes roughly 35–38 TOPS for its recent Neural Engines. Taking 38 TOPS at an estimated ~0.5 W (Apple does not publish NPU power) gives about 76 TOPS/W. An NVIDIA H100 at 700 W achieves ~2,000 TOPS INT8, or only about 2.9 TOPS/W. On these peak ratings, the NPU wins on inference power efficiency by a factor of roughly 26×.
CUDA cores are general-purpose FP32 ALUs that perform one multiply-add per clock. Tensor Cores (introduced in Volta, 2017) are dedicated matrix-multiply units. A first-generation Tensor Core computes a 4×4×4 matmul in one cycle, which is 64 FP16 MACs versus 1 for a CUDA core, and later generations do considerably more per clock. That's why H100 jumps from ~66 TFLOPS FP32 (CUDA cores) to ~990 TFLOPS FP16 (Tensor Cores).
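Tensor Cores are used only when a matrix operation (torch.matmul, linear and convolution layers, and so on) runs in a supported precision such as FP16, BF16, or TF32, on shapes the library kernels can tile efficiently. Dimensions that are not multiples of 8 for FP16 (16 for INT8) may be padded or routed to slower kernels, giving up much of the ~15× speedup. This is why practitioners obsess over "tensor core alignment" in model design.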
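One common way to act on this is to round layer dimensions up to an aligned size before building the model. The helper below is a hypothetical sketch. The right multiple depends on precision, GPU generation, and library version, so the values used here are assumptions to check against vendor guidance.

```python
def pad_to_multiple(dim: int, multiple: int) -> int:
    """Round `dim` up to the nearest multiple of `multiple`."""
    return -(-dim // multiple) * multiple  # ceiling division, then scale back up


vocab_size = 50_257  # example of an awkward, unaligned dimension
hidden_size = 768    # already aligned

print(pad_to_multiple(vocab_size, 8))    # 50264
print(pad_to_multiple(vocab_size, 16))   # 50272
print(pad_to_multiple(vocab_size, 64))   # 50304
print(pad_to_multiple(hidden_size, 8))   # 768
```

The extra rows or columns are simply unused, and the small amount of wasted memory is usually far cheaper than running the whole layer on a slower kernel.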
GPUs have high launch overhead — dispatching a CUDA kernel takes microseconds. For tiny models or single-sample inference (batch size = 1), the GPU is idle most of the time and a CPU can win on latency.
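A simple cost model makes the crossover visible: total latency is a fixed overhead plus compute time. The peak rates and overheads below are round, hypothetical numbers picked for illustration, not benchmarks of any particular CPU or GPU.

```python
def latency_us(flops: float, peak_flops_per_us: float, overhead_us: float) -> float:
    """Fixed dispatch overhead plus time spent computing at peak rate."""
    return overhead_us + flops / peak_flops_per_us


# Hypothetical devices: ~1 TFLOPS CPU with negligible dispatch cost,
# ~100 TFLOPS GPU with ~10 us of launch and transfer overhead.
CPU = {"peak_flops_per_us": 1e6, "overhead_us": 0.1}
GPU = {"peak_flops_per_us": 1e8, "overhead_us": 10.0}

small_cpu = latency_us(1e6, **CPU)   # tiny model, batch size 1
small_gpu = latency_us(1e6, **GPU)
large_cpu = latency_us(1e10, **CPU)  # large model or large batch
large_gpu = latency_us(1e10, **GPU)

# Work size at which both devices take the same time.
break_even_flops = (GPU["overhead_us"] - CPU["overhead_us"]) / (
    1 / CPU["peak_flops_per_us"] - 1 / GPU["peak_flops_per_us"]
)

print(small_cpu, small_gpu, large_cpu, large_gpu, break_even_flops)
```

Under these assumptions the CPU is faster for anything below roughly 10 million FLOPs per call, and the GPU pulls ahead quickly above that.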
Modern CPUs with AVX-512 or Intel's AMX (Advanced Matrix Extensions) can execute 512-bit SIMD operations and small tile-based matrix multiplies natively. For distilled models under ~1B parameters running at batch-1, a high-end CPU like Xeon Sapphire Rapids with AMX can be competitive with a data-centre GPU on latency, at a fraction of the cost and power.
For sparse models and models with dynamic control flow (tree-of-thought reasoning, variable-length decoding), the CPU's branch predictors and cheap per-thread control flow can also beat a GPU, which pays a warp-divergence penalty on irregular computation.