Under the buzzwords, an AI chip does one thing astonishingly fast: multiply matrices. This is the deep dive: MAC arrays, systolic arrays, HBM, quantization, dataflow, TOPS/W and the future of the silicon behind modern AI.
An AI chip (or AI accelerator) is a processor specialised for the mathematics of neural networks. And that math, stripped of mystique, is overwhelmingly one operation: multiply-accumulate (acc += a × b), performed billions of times and arranged as matrix multiplications.
A CPU has a few powerful, general cores optimised for branchy, sequential logic. A neural network instead wants the same simple arithmetic on enormous, uniform data. So an AI chip throws out the generality and packs in thousands of small multiply-accumulate (MAC) units, surrounds them with wide on-chip memory, and feeds them from high-bandwidth external memory. GPUs, TPUs, NPUs and custom ASICs are all points on this same idea.
A neural-network layer computes, in essence, Y = activation(W · X + b): a matrix of weights W multiplied by a vector/matrix of inputs X. Multiplying matrices means, for every output element, a dot product: a long sequence of multiply-then-add. This dense matrix-multiply is called GEMM (General Matrix Multiply), and modern transformers are >90% GEMM.
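To make the "dot product per output element" idea concrete, here is a minimal pure-Python sketch of a matrix multiply written as explicit multiply-accumulates. The function name `gemm_mac` and the tiny 2×2 matrices are illustrative assumptions; real hardware runs these MACs in parallel rather than one at a time.

```python
def gemm_mac(W, X):
    """Multiply W (m x k) by X (k x n) using explicit multiply-accumulates."""
    m, k, n = len(W), len(X), len(X[0])
    Y = [[0] * n for _ in range(m)]
    macs = 0  # count how many multiply-accumulates were needed
    for i in range(m):
        for j in range(n):
            acc = 0
            for p in range(k):
                acc += W[i][p] * X[p][j]  # one MAC: acc += a * b
                macs += 1
            Y[i][j] = acc
    return Y, macs


# Illustrative inputs (not taken from any real model)
W = [[1, 2], [3, 4]]
X = [[5, 6], [7, 8]]
Y, macs = gemm_mac(W, X)
print(Y, macs)
```

The MAC count is always m × k × n, which is why the work grows so quickly as layers get wider.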
One multiply-accumulate is trivial. The catch is scale: a large language model performs roughly one MAC per weight for every token it processes, which adds up to billions or even trillions of MACs per token. The entire job of an AI chip is to perform that ocean of MACs with maximum throughput per watt.
An AI chip is a machine for doing matrix multiplication, i.e. massive numbers of multiply-accumulate operations, as fast and as efficiently as physics allows.
Want to feel why parallel hardware crushes this? Play the interactive GPU Lab: the CPU-vs-GPU race is the same principle behind every AI chip.
The atom of an AI chip is the MAC unit: a multiplier feeding an adder that accumulates a running sum. Put thousands of them on a die and you can do thousands of multiply-accumulates every clock cycle.
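A back-of-the-envelope sketch of what "thousands of MACs every clock cycle" means for peak throughput. It assumes the common convention that one MAC counts as two operations (a multiply and an add). The 128×128 array and 1 GHz clock are made-up numbers, not the specification of any real chip.

```python
def peak_tops(mac_units, clock_hz, ops_per_mac=2):
    """Theoretical peak in tera-operations per second, assuming every MAC unit
    is busy on every clock cycle (real chips rarely sustain this)."""
    return mac_units * ops_per_mac * clock_hz / 1e12


# Hypothetical chip: a 128 x 128 MAC array clocked at 1 GHz
example_tops = peak_tops(mac_units=128 * 128, clock_hz=1e9)
print(f"{example_tops:.3f} TOPS peak")
```

This is an upper bound only: it says nothing about whether the memory system can keep the array fed.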
But raw MACs aren't enough: you have to feed them, and memory bandwidth is the enemy. The elegant answer is the systolic array: a 2-D grid of MAC units (processing elements, PEs) through which data flows rhythmically, like a heartbeat (systole). Each PE multiplies, accumulates, and passes operands to its neighbours. Crucially, each value entering the array is reused by an entire row or column of PEs instead of being re-fetched, slashing memory traffic. Google's TPU is built around a large systolic array.
▶ See it run: the interactive Systolic Array Lab lets you edit two matrices and step the array cycle by cycle, so you can watch the data flow and the accumulators build up.
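For readers who prefer code to animation, the following is a minimal cycle-by-cycle simulation of one possible systolic arrangement, an output-stationary grid in which each PE keeps its own accumulator. The function name `systolic_matmul` and the input-skewing scheme are assumptions chosen for illustration; this is not a model of the TPU or any other specific chip.

```python
def systolic_matmul(A, B):
    """Simulate C = A @ B on an m x n grid of PEs (output-stationary sketch).

    Rows of A enter from the left edge and move right; columns of B enter
    from the top edge and move down. Row i and column j are delayed by i and
    j cycles so that matching operands meet in PE (i, j).
    """
    m, k, n = len(A), len(B), len(B[0])
    acc = [[0] * n for _ in range(m)]    # one accumulator per PE
    a_reg = [[0] * n for _ in range(m)]  # A value currently held by each PE
    b_reg = [[0] * n for _ in range(m)]  # B value currently held by each PE
    cycles = m + n + k - 2               # cycles until the last operands meet

    for t in range(cycles):
        # Walk from bottom-right to top-left so each PE reads the value its
        # neighbour held in the previous cycle, before that neighbour updates.
        for i in reversed(range(m)):
            for j in reversed(range(n)):
                if j == 0:               # left edge: inject skewed row of A
                    p = t - i
                    a_in = A[i][p] if 0 <= p < k else 0
                else:                    # otherwise take it from the left PE
                    a_in = a_reg[i][j - 1]
                if i == 0:               # top edge: inject skewed column of B
                    p = t - j
                    b_in = B[p][j] if 0 <= p < k else 0
                else:                    # otherwise take it from the PE above
                    b_in = b_reg[i - 1][j]
                a_reg[i][j] = a_in
                b_reg[i][j] = b_in
                acc[i][j] += a_in * b_in  # the MAC
    return acc, cycles


C, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C, "in", cycles, "cycles")
```

Each element of A is read from outside the array once and then handed along a whole row of PEs, which is the reuse the paragraph above describes.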
"AI chip" is an umbrella. The members trade flexibility against efficiency:
| Type | What it is | Trade-off |
|---|---|---|
| GPU | Thousands of programmable cores + dedicated tensor/matrix units | Most flexible; huge software ecosystem; the default for training |
| TPU | ASIC built around a large systolic array for tensor math | Less flexible, very efficient at GEMM; data-centre scale |
| NPU | Compact accelerator inside phones/SoCs/edge devices | Optimised for low-power on-device inference |
| FPGA | Reconfigurable logic mapped to a custom dataflow | Flexible hardware, lower cost at low volumes, good for evolving models |
| Custom ASIC | Fixed silicon for one workload (e.g. an inference appliance) | Highest efficiency, least flexible, high NRE cost |
They all accelerate the same MACs. A GPU keeps programmability (and so dominates fast-moving research/training); a TPU/ASIC hard-wires the dataflow for efficiency; an NPU shrinks it for the edge. See the GPU side hands-on →
Here's the secret most explainers skip: AI chips are usually limited by memory, not compute. A MAC array can demand operands faster than DRAM can deliver them, which is the famous memory wall. Moving a number from external memory can cost orders of magnitude more energy than the multiply itself. So AI-chip design is largely a memory-movement problem.
Two design levers fight the wall:
Neural networks are remarkably tolerant of low numerical precision. That's a gift: smaller numbers mean less memory, less bandwidth, and far more MACs per mm² and per watt. AI chips support a ladder of formats:
| Format | Bits | Typical use |
|---|---|---|
| FP32 | 32 | Reference precision; older training |
| TF32 / FP16 | 19 / 16 | Training (faster, near-FP32 quality) |
| BF16 | 16 | Training favourite: wide (FP32-like) range, easy to use |
| FP8 | 8 | Modern training & inference |
| INT8 | 8 | Inference workhorse (after quantization) |
| INT4 / lower | 4 | Aggressive edge/LLM inference |
Quantization is the process of converting a trained FP model to a lower-precision one (e.g. INT8) with minimal accuracy loss. Halving the bit-width roughly doubles effective throughput and memory capacity, which is why FP8 and INT8 are central to today's AI hardware. A chip's headline TOPS number is always quoted at a precision (e.g. "X TOPS INT8").
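As a rough illustration of the FP-to-INT8 conversion, here is a sketch of symmetric per-tensor quantization, one simple scheme among several. The helper names `quantize_int8` and `dequantize`, the clamp to [-127, 127], and the sample weights are all assumptions made for the example; production toolchains typically add calibration data, per-channel scales and other refinements.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: real value is approximately scale * q,
    with q an integer clamped to [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate real values."""
    return [scale * x for x in q]


# Made-up weights for illustration
weights = [0.5, -1.27, 0.3, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale, restored)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the accuracy being traded for 4× less storage than FP32.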
Because memory movement dominates energy, the dataflow (the order in which data is loaded, reused and stored) is a first-class architectural decision. Classic strategies keep one operand "stationary" in the PEs to maximise reuse:
The goal is always the same: fetch each value from expensive memory once, then reuse it as many times as possible close to the MACs.
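The effect of loop order on reuse can be sketched in a few lines. The example below compares a naive ordering with a weight-stationary ordering, a commonly described strategy in which each weight is fetched once and held while every input column streams past it. Function names and matrices are hypothetical, and the fetch counter tracks weight reads only, ignoring input and partial-sum traffic for simplicity.

```python
def matmul_naive_order(W, X):
    """Y = W @ X, re-fetching the weight for every single MAC."""
    m, k, n = len(W), len(X), len(X[0])
    Y = [[0] * n for _ in range(m)]
    weight_fetches = 0
    for i in range(m):
        for j in range(n):
            for p in range(k):
                w = W[i][p]              # fetched again for each output column
                weight_fetches += 1
                Y[i][j] += w * X[p][j]
    return Y, weight_fetches


def matmul_weight_stationary(W, X):
    """Same result, but each weight is fetched once and reused across columns."""
    m, k, n = len(W), len(X), len(X[0])
    Y = [[0] * n for _ in range(m)]
    weight_fetches = 0
    for i in range(m):
        for p in range(k):
            w = W[i][p]                  # fetched once, then held "stationary"
            weight_fetches += 1
            for j in range(n):
                Y[i][j] += w * X[p][j]
    return Y, weight_fetches


W = [[1, 2], [3, 4]]
X = [[1, 2, 3], [4, 5, 6]]
Y_naive, fetches_naive = matmul_naive_order(W, X)
Y_ws, fetches_ws = matmul_weight_stationary(W, X)
print(fetches_naive, "vs", fetches_ws, "weight fetches")
```

Both orderings do the same MACs and produce the same output; only the number of trips to weight memory changes, by a factor equal to the number of input columns.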
Many neural-network weights and activations are zero (or can be pruned to zero). Multiplying by zero is wasted work, so modern AI chips exploit sparsity: hardware that skips zero operands, or supports structured sparsity (e.g. 2-of-4 patterns) to roughly double effective throughput. Combined with quantization, sparsity is a major lever for inference efficiency.
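A 2-of-4 pattern means that in every group of four consecutive weights, at most two are non-zero. The sketch below produces such a pattern by keeping the two largest-magnitude weights per group. Magnitude-based selection is an assumption here (a common heuristic), and `prune_2_of_4` is a hypothetical helper, not an API from any library.

```python
def prune_2_of_4(weights):
    """Zero out the two smallest-magnitude values in each group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest magnitudes in this group
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


# Made-up weights for illustration
dense = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.1]
sparse = prune_2_of_4(dense)
print(sparse)
```

Because the hardware knows exactly half of each group is zero, it can store only the two survivors plus their positions and skip the other two MACs.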
The roofline model ties it together: performance is capped either by compute (the flat "roof") or by memory bandwidth (the sloped part), depending on a workload's arithmetic intensity (operations per byte fetched). Low-intensity workloads (like small-batch LLM inference) are memory-bound, which is exactly why HBM and on-chip SRAM matter so much. Peak TOPS alone is marketing; sustained TOPS at high utilisation is engineering.
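The roofline bound itself is a one-line formula: attainable throughput is the smaller of peak compute and bandwidth × arithmetic intensity. A minimal sketch follows; the peak and bandwidth figures are invented round numbers for a hypothetical chip, not measurements.

```python
def attainable_ops_per_s(peak_ops_per_s, bandwidth_bytes_per_s, intensity_ops_per_byte):
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * intensity_ops_per_byte)


# Hypothetical chip: 100 TOPS peak, 1 TB/s of memory bandwidth
PEAK = 100e12
BANDWIDTH = 1e12

low = attainable_ops_per_s(PEAK, BANDWIDTH, 2.0)     # low intensity: memory-bound
high = attainable_ops_per_s(PEAK, BANDWIDTH, 500.0)  # high intensity: compute-bound
print(low / 1e12, "TOPS vs", high / 1e12, "TOPS")
```

With these numbers the crossover sits at 100 operations per byte. A workload at 2 ops/byte reaches only 2% of the headline figure, no matter how many MAC units the chip has.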
| | Training | Inference |
|---|---|---|
| Job | Learn the weights | Run the trained model |
| Precision | BF16 / FP16 / FP8 | INT8 / FP8 / INT4 |
| Priorities | Throughput, memory, scale-out | Latency, power, cost |
| Interconnect | Critical (1000s of chips) | Often single-chip / small |
| Where | Data centre | Cloud → edge → phone NPU |
Training is a brutal, multi-week, multi-thousand-accelerator effort dominated by compute and interconnect. Inference is about serving predictions cheaply and quickly, and is increasingly pushed to the edge (your phone's NPU) for latency and privacy.
A single die has reached the reticle limit (the max size a lithography tool can print). So AI chips scale outward:
This is why the AI boom is a whole-supply-chain event: logic, memory, packaging and networking all at once. More on that here →
An AI chip is a matrix-multiply engine: thousands of MAC units, often arranged as a systolic array, fed by HBM and on-chip SRAM. Its real challenge is moving data, so it leans on data reuse, low-precision formats and sparsity to maximise TOPS per watt. GPUs, TPUs, NPUs and ASICs are different flexibility-vs-efficiency answers to that same problem.
A processor specialised for neural-network math (mostly matrix multiply / multiply-accumulate), packing thousands of MAC units, wide on-chip memory and HBM for high throughput per watt. GPUs, TPUs, NPUs and ASICs are all AI chips.
A grid of MAC units through which data flows rhythmically, each reusing operands and passing partial sums to neighbours: an extremely efficient way to do matrix multiply. The TPU is built on one.
GPU = flexible, programmable, dominant for training. TPU = systolic-array ASIC, very efficient at tensor math. NPU = compact edge/phone accelerator for low-power inference.
Networks tolerate it; smaller numbers mean less memory/bandwidth and far more MACs per watt. Training favours BF16/FP8; inference often uses INT8 after quantization.
Trillions of Operations Per Second at a given precision. TOPS/W is efficiency. Real performance depends on memory bandwidth and utilisation, so peak TOPS can mislead.
Related: How a Quantum Chip Works · GPU Lab (interactive) · Why AI Needs So Many Chips · Transistor Evolution · VLSI Hub