An AI chip, or AI accelerator, is a processor specialised for the mathematics of neural networks - overwhelmingly matrix multiplication and multiply-accumulate (MAC) operations. Instead of a few general-purpose cores, it packs thousands of small MAC units, wide on-chip memory and high-bandwidth external memory (HBM) so it can perform the enormous number of identical arithmetic operations a neural network needs far faster and more energy-efficiently than a CPU. GPUs, TPUs, NPUs and custom ASICs are all types of AI chip.

What is the difference between a GPU, a TPU and an NPU?

A GPU has thousands of programmable cores and is flexible across graphics and AI. A TPU (Tensor Processing Unit) is an ASIC built around a systolic array dedicated to tensor/matrix math, trading flexibility for efficiency. An NPU (Neural Processing Unit) is a smaller AI accelerator integrated into phones and edge devices for efficient on-device inference. All accelerate the same core matrix operations but at different points on the flexibility-versus-efficiency spectrum.

Why do AI chips use lower precision like INT8 or FP16?

Neural networks tolerate reduced numerical precision, so AI chips use formats such as FP16, BF16, INT8, FP8 and even INT4 instead of FP32. Lower precision means smaller data, less memory bandwidth and far more MAC operations per unit area and per watt, dramatically increasing throughput and efficiency. Training often uses BF16 or FP16, while inference frequently uses INT8 after quantization.

What does TOPS mean for an AI chip?

TOPS means Tera (trillion) Operations Per Second, a measure of an AI chip's raw throughput, usually quoted at a given precision such as INT8. TOPS per watt (TOPS/W) measures energy efficiency. Real performance also depends heavily on memory bandwidth and how well the workload keeps the MAC units busy (utilisation), so TOPS alone can be misleading.

What is the difference between training and inference chips?

Training chips run the compute-heavy process of learning model weights, needing more numerical precision than inference (typically BF16/FP16), huge memory and fast chip-to-chip interconnect to scale across thousands of accelerators. Inference chips run an already-trained model to make predictions, optimised for low latency, low power and lower precision (often INT8), and range from data-centre accelerators to tiny edge NPUs.

What Is an AI Chip? A Deep Technical Guide (Architecture, Systolic Arrays, TOPS)

What is a systolic array?

A systolic array is a grid of small processing elements (MAC units) through which data flows rhythmically, each element multiplying and accumulating and passing partial results to its neighbour. It implements matrix multiplication with very high efficiency because each piece of data is reused across many elements instead of being re-fetched from memory. Google's TPU is built around a large systolic array.

Definition

What an AI chip actually is

An AI chip (or AI accelerator) is a processor specialised for the mathematics of neural networks. And that math, stripped of mystique, is overwhelmingly one operation: multiply-accumulate — acc += a × b — performed billions of times, arranged as matrix multiplications.

A CPU has a few powerful, general cores optimised for branchy, sequential logic. A neural network instead wants the same simple arithmetic on enormous, uniform data. So an AI chip throws out the generality and packs in thousands of small multiply-accumulate (MAC) units, surrounds them with wide on-chip memory, and feeds them from high-bandwidth external memory. GPUs, TPUs, NPUs and custom ASICs are all points on this same idea.

IN THIS GUIDE:
The core math
MAC & systolic arrays
GPU vs TPU vs NPU vs ASIC
Memory & the memory wall
Number formats & quantization
Dataflow & reuse
Sparsity
TOPS, TOPS/W, roofline
Training vs inference
Scaling & packaging
The future

The core operation

It's all matrix multiply (GEMM)

A neural-network layer computes, in essence, Y = activation(W · X + b): a matrix of weights W multiplied by a vector or matrix of inputs X, plus a bias b. Multiplying matrices means computing a dot product for every output element, which is a long sequence of multiply-then-add steps. This dense matrix multiply is called GEMM (General Matrix Multiply), and it accounts for the large majority of the arithmetic in modern transformers (figures above 90% are commonly quoted).
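To make the "it's all MACs" claim concrete, here is a minimal sketch of one layer written as explicit multiply-accumulates. The function name `dense_layer`, the choice of ReLU as the activation, and the tiny matrices are illustrative assumptions, not anything a real chip or framework prescribes.

```python
def dense_layer(W, X, b):
    """Compute relu(W . X + b) with explicit multiply-accumulates.

    W is M x K, X is K x N, b has length M. Returns (Y, mac_count).
    ReLU is an assumed activation, chosen only for illustration.
    """
    M, K, N = len(W), len(X), len(X[0])
    Y = [[0.0] * N for _ in range(M)]
    macs = 0
    for i in range(M):
        for j in range(N):
            acc = b[i]                       # accumulator starts at the bias
            for k in range(K):
                acc += W[i][k] * X[k][j]     # one multiply-accumulate
                macs += 1
            Y[i][j] = max(acc, 0.0)          # ReLU
    return Y, macs


W = [[1.0, -2.0], [0.5, 0.5], [3.0, 1.0]]    # 3 x 2 weights
X = [[2.0, 1.0], [1.0, 4.0]]                 # 2 x 2 inputs
b = [0.0, 1.0, -1.0]

Y, macs = dense_layer(W, X, b)
print(Y)      # [[0.0, 0.0], [2.5, 3.5], [6.0, 6.0]]
print(macs)   # 12, i.e. M * K * N
```

The MAC count is always M × K × N, which is why the work grows so quickly with layer size.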

One multiply-accumulate is trivial. The catch is scale: a dense large language model performs roughly one MAC per weight for every token it processes, which means billions to trillions of MACs per token. The entire job of an AI chip is to perform that ocean of MACs with maximum throughput per watt.

✅ The one sentence to remember

An AI chip is a machine for doing matrix multiplication — i.e. massive numbers of multiply-accumulate operations — as fast and as efficiently as physics allows.

The building block

MAC units and the systolic array

The atom of an AI chip is the MAC unit: a multiplier feeding an adder that accumulates a running sum. Put thousands of them on a die and you can do thousands of multiply-accumulates every clock cycle.

But raw MACs aren't enough — you have to feed them, and memory bandwidth is the enemy. The elegant answer is the systolic array: a 2-D grid of MAC units (processing elements, PEs) through which data flows rhythmically, like a heartbeat (systole). Each PE multiplies, accumulates, and passes operands to its neighbours. Crucially, each value entering the array is reused by an entire row or column of PEs instead of being re-fetched — slashing memory traffic. Google's TPU is built around a large systolic array.

Figure 1 — A systolic array. Inputs flow in from the left, weights from the top; each PE does a MAC and passes data on, so every value is reused many times before leaving the array.
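The data movement in Figure 1 can be sketched as a small cycle-by-cycle simulation. This is a simplified model under stated assumptions: accumulators stay inside each PE (an output-stationary arrangement), rows of A enter from the left and columns of B from the top with a one-cycle skew per row or column, and the function name `systolic_matmul` is made up for this example. It is not a description of how any particular chip schedules its array.

```python
def systolic_matmul(A, B):
    """Simulate an M x N grid of PEs computing A (M x K) times B (K x N).

    Each cycle, every PE takes an operand from its left neighbour and one
    from its top neighbour, does a single MAC, and holds both operands so
    the neighbours to its right and below can use them next cycle.
    """
    M, K, N = len(A), len(B), len(B[0])
    acc = [[0] * N for _ in range(M)]      # one accumulator per PE
    a_reg = [[0] * N for _ in range(M)]    # operands travelling rightwards
    b_reg = [[0] * N for _ in range(M)]    # operands travelling downwards
    cycles = M + N + K - 2                 # time for the last operands to meet

    for t in range(cycles):
        new_a = [[0] * N for _ in range(M)]
        new_b = [[0] * N for _ in range(M)]
        for i in range(M):
            for j in range(N):
                if j == 0:                 # left edge: row i is skewed by i cycles
                    k = t - i
                    a_in = A[i][k] if 0 <= k < K else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:                 # top edge: column j is skewed by j cycles
                    k = t - j
                    b_in = B[k][j] if 0 <= k < K else 0
                else:
                    b_in = b_reg[i - 1][j]
                acc[i][j] += a_in * b_in   # the MAC
                new_a[i][j] = a_in
                new_b[i][j] = b_in
        a_reg, b_reg = new_a, new_b
    return acc, cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result)   # [[19, 22], [43, 50]]
print(cycles)   # 4
```

Each value is read from outside the array once, at an edge, and is then handed from PE to PE. That hand-off is the reuse the caption describes.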

The landscape

GPU vs TPU vs NPU vs FPGA vs ASIC

"AI chip" is an umbrella. The members trade flexibility against efficiency:

Type	What it is	Trade-off
GPU	Thousands of programmable cores + dedicated tensor/matrix units	Most flexible; huge software ecosystem; the default for training
TPU	ASIC built around a large systolic array for tensor math	Less flexible, very efficient at GEMM; data-centre scale
NPU	Compact accelerator inside phones/SoCs/edge devices	Optimised for low-power on-device inference
FPGA	Reconfigurable logic mapped to a custom dataflow	Flexible hardware, cheaper at low volumes (no mask-set cost), good for evolving models
Custom ASIC	Fixed silicon for one workload (e.g. an inference appliance)	Highest efficiency, least flexible, high NRE cost

They all accelerate the same MACs. A GPU keeps programmability (and so dominates fast-moving research/training); a TPU/ASIC hard-wires the dataflow for efficiency; an NPU shrinks it for the edge.

The real bottleneck

Memory and the memory wall

Here's the secret most explainers skip: AI chips are usually limited by memory, not compute. A MAC array can demand operands faster than DRAM can deliver them — the famous memory wall. Moving a number from external memory can cost orders of magnitude more energy than the multiply itself. So AI-chip design is largely a memory-movement problem.

Registers / PE accumulators — fastest, tiny
On-chip SRAM (scratchpad / "tensor memory") — fast, KB–MB
HBM stacked DRAM — TB/s bandwidth, GBs
Host / network memory — largest, slowest

Two design levers fight the wall:

HBM (High-Bandwidth Memory): DRAM dies stacked with through-silicon vias next to the compute die on a silicon interposer, delivering terabytes per second.
On-chip SRAM + data reuse: keep data on-chip and reuse it many times (the systolic array's whole point) so you touch HBM as little as possible.
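The second lever can be illustrated by counting how many words have to come from external memory for the same matrix multiply. The sketch below is a toy model with assumed simplifications: square matrices whose size is a multiple of the tile size T, output partial sums treated as free, and a made-up function name `tiled_matmul`. With T = 1 it behaves like a design with no on-chip reuse at all.

```python
def tiled_matmul(A, B, T):
    """Multiply two n x n matrices in T x T tiles.

    Returns (C, words_loaded), where words_loaded counts operand words
    copied from 'external memory' (the full A and B lists) into the
    'on-chip' tiles. Assumes n is a multiple of T.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    words_loaded = 0
    for i0 in range(0, n, T):
        for j0 in range(0, n, T):
            for k0 in range(0, n, T):
                # Load one tile of A and one tile of B into on-chip SRAM.
                a_tile = [row[k0:k0 + T] for row in A[i0:i0 + T]]
                b_tile = [row[j0:j0 + T] for row in B[k0:k0 + T]]
                words_loaded += 2 * T * T
                # Every loaded word is now reused T times from the tiles.
                for i in range(T):
                    for j in range(T):
                        acc = C[i0 + i][j0 + j]
                        for k in range(T):
                            acc += a_tile[i][k] * b_tile[k][j]
                        C[i0 + i][j0 + j] = acc
    return C, words_loaded


n = 8
A = [[(i + j) % 5 for j in range(n)] for i in range(n)]
B = [[(i * j) % 3 for j in range(n)] for i in range(n)]

C_naive, loads_naive = tiled_matmul(A, B, 1)   # no reuse: 2 words per MAC
C_tiled, loads_tiled = tiled_matmul(A, B, 4)   # 4 x 4 tiles held on-chip

print(loads_naive)          # 1024
print(loads_tiled)          # 256
print(C_naive == C_tiled)   # True
```

In this model the external traffic drops by a factor of T, so a larger on-chip buffer directly buys fewer trips to HBM.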

Make the numbers smaller

Number formats & quantization

Neural networks are remarkably tolerant of low numerical precision. That's a gift: smaller numbers mean less memory, less bandwidth, and far more MACs per mm² and per watt. AI chips support a ladder of formats:

Format	Bits	Typical use
FP32	32	Reference precision; older training
TF32 / FP16	19 / 16	Training (faster, near-FP32 quality)
BF16	16	Training favourite: same exponent range as FP32, so conversion is simple
FP8	8	Modern training & inference
INT8	8	Inference workhorse (after quantization)
INT4 / lower	4	Aggressive edge/LLM inference
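One way to see why BF16 has such a wide range is that it is the top 16 bits of an FP32 value: the sign bit, all 8 exponent bits and 7 mantissa bits. The sketch below converts by plain truncation, which is an assumption made for brevity (hardware usually rounds), and the helper names are hypothetical.

```python
import math
import struct


def to_bf16_bits(x):
    """Return the 16-bit BF16 pattern for x by truncating its FP32 encoding."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 16                  # keep sign, 8 exponent bits, 7 mantissa bits


def from_bf16_bits(b):
    """Expand a 16-bit BF16 pattern back to a Python float."""
    (x,) = struct.unpack(">f", struct.pack(">I", b << 16))
    return x


pi_bf16 = from_bf16_bits(to_bf16_bits(math.pi))
big_bf16 = from_bf16_bits(to_bf16_bits(3.0e38))

print(pi_bf16)    # 3.140625: only about 2-3 decimal digits survive
print(big_bf16)   # still close to 3e38, far beyond FP16's maximum of 65504
```

The trade is visible in the two results: precision is coarse, but very large (and very small) magnitudes survive, which tends to matter more for gradients during training.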

Quantization is the process of converting a trained floating-point model to a lower-precision one (e.g. INT8) with minimal accuracy loss. Halving the bit-width roughly doubles effective throughput and the number of values that fit in memory, which is why FP8 and INT8 are central to today's AI hardware. A chip's headline TOPS number is always quoted at a precision (e.g. "X TOPS INT8").
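As a rough illustration, here is symmetric per-tensor INT8 quantization, one of the simplest schemes. Production toolchains typically add per-channel scales, calibration data and zero-points. All function names and sample values below are assumptions for the example.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integers."""
    return [x * scale for x in q]


def int8_dot(qa, scale_a, qb, scale_b):
    """Dot product done with integer MACs, rescaled to a float once at the end."""
    acc = 0                            # integer accumulator (wider than 8 bits)
    for x, y in zip(qa, qb):
        acc += x * y
    return acc * scale_a * scale_b


weights = [0.5, -1.27, 0.02, 1.0]
acts = [2.54, -1.0, 0.0, 0.5]

q_w, s_w = quantize_int8(weights)
q_a, s_a = quantize_int8(acts)

print(q_w)                             # [50, -127, 2, 100]
print(q_a)                             # [127, -50, 0, 25]
print(int8_dot(q_w, s_w, q_a, s_a))    # close to the exact float result, 3.04
```

The inner loop touches only small integers. The floating-point scales are applied once per output, which is what lets the hardware use compact integer MAC units.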

How data moves

Dataflow & data reuse

Because memory movement dominates energy, the dataflow — the order in which data is loaded, reused and stored — is a first-class architectural decision. Classic strategies keep one operand "stationary" in the PEs to maximise reuse:

Weight-stationary — keep weights in the PEs, stream activations through (common in systolic arrays/TPUs).
Output-stationary — keep the accumulating output in place, stream inputs & weights.
Row-stationary — balance reuse of weights, inputs and partial sums (e.g. the Eyeriss research design).

The goal is always the same: fetch each value from expensive memory once, then reuse it as many times as possible close to the MACs.
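The difference between two of these strategies is essentially a difference in loop order. The sketch below is a single-PE simplification with assumed names (`weight_stationary`, `output_stationary`) and an assumed accounting: it counts how often each kind of operand is brought into the PE, and says nothing about energy or timing on real silicon.

```python
def weight_stationary(W, X):
    """Hold one weight in the PE and stream every activation past it."""
    M, K, N = len(W), len(X), len(X[0])
    Y = [[0] * N for _ in range(M)]
    traffic = {"weight": 0, "input": 0, "psum": 0}
    for i in range(M):
        for k in range(K):
            w = W[i][k]                  # weight fetched once, then held
            traffic["weight"] += 1
            for j in range(N):
                Y[i][j] += w * X[k][j]
                traffic["input"] += 1
                traffic["psum"] += 1     # partial sum lives outside the PE
    return Y, traffic


def output_stationary(W, X):
    """Hold one output's accumulator in the PE and stream operands to it."""
    M, K, N = len(W), len(X), len(X[0])
    Y = [[0] * N for _ in range(M)]
    traffic = {"weight": 0, "input": 0, "psum": 0}
    for i in range(M):
        for j in range(N):
            acc = 0                      # accumulator stays in the PE
            for k in range(K):
                acc += W[i][k] * X[k][j]
                traffic["weight"] += 1
                traffic["input"] += 1
            Y[i][j] = acc
            traffic["psum"] += 1         # written out once when finished
    return Y, traffic


W = [[1, 2, 3], [4, 5, 6]]                        # 2 x 3
X = [[1, 0, 2, 1], [0, 1, 1, 2], [3, 1, 0, 1]]    # 3 x 4

Y_ws, t_ws = weight_stationary(W, X)
Y_os, t_os = output_stationary(W, X)

print(Y_ws == Y_os)   # True
print(t_ws)           # {'weight': 6, 'input': 24, 'psum': 24}
print(t_os)           # {'weight': 24, 'input': 24, 'psum': 8}
```

Both orders do the same 24 MACs and produce the same result. What changes is which operand gets to sit still, and therefore which kind of traffic is minimised.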

Skip the zeros

Sparsity

Many neural-network weights and activations are zero (or can be pruned to zero). Multiplying by zero is wasted work — so modern AI chips exploit sparsity: hardware that skips zero operands, or supports structured sparsity (e.g. 2-of-4 patterns) to roughly double effective throughput. Combined with quantization, sparsity is a major lever for inference efficiency.
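A 2-of-4 pattern can be sketched in a few lines: in every group of four weights keep the two with the largest magnitude, store only those with their positions, and skip the rest in the dot product. Magnitude-based selection and the function names are assumptions for this example. Real pruning flows usually fine-tune the model afterwards to recover accuracy.

```python
def prune_2_of_4(w):
    """Zero the two smallest-magnitude weights in every group of four."""
    assert len(w) % 4 == 0
    out = []
    for g in range(0, len(w), 4):
        group = w[g:g + 4]
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


def compress(w_pruned):
    """Keep only the non-zero weights, each paired with its position."""
    return [(i, v) for i, v in enumerate(w_pruned) if v != 0.0]


def sparse_dot(pairs, x):
    """Dot product that only spends MACs on the stored non-zero weights."""
    acc, macs = 0.0, 0
    for idx, v in pairs:
        acc += v * x[idx]
        macs += 1
    return acc, macs


w = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

w_pruned = prune_2_of_4(w)
pairs = compress(w_pruned)
acc, macs = sparse_dot(pairs, x)

print(w_pruned)   # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
print(macs)       # 4 MACs instead of the 8 a dense dot product would need
```

Because the pattern is regular (exactly two survivors per group of four), hardware can plan for half the work up front instead of hunting for zeros at run time.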

Measuring it

TOPS, TOPS/W and the roofline

TOPS / TFLOPS: Tera (trillion) Operations / Floating-point Operations Per Second. Raw throughput, quoted at a precision.
TOPS/W: efficiency, in trillions of operations per second per watt (equivalently, trillions of operations per joule). The metric that actually matters at data-centre scale and on the edge.
Utilisation: what fraction of the MACs are actually busy. Real workloads often hit only a fraction of peak TOPS because of memory stalls.
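These three numbers are linked by simple arithmetic. The sketch below uses the common convention that one MAC counts as two operations (a multiply and an add). The chip parameters are invented for illustration and do not describe any real product.

```python
def peak_tops(num_macs, clock_hz, ops_per_mac=2):
    """Peak throughput in TOPS, counting each MAC as ops_per_mac operations."""
    return num_macs * ops_per_mac * clock_hz / 1e12


def sustained_tops(peak, utilisation):
    """Throughput actually delivered when only a fraction of MACs are busy."""
    return peak * utilisation


def tops_per_watt(tops, watts):
    """Energy efficiency: TOPS divided by power draw."""
    return tops / watts


# Hypothetical accelerator: a 128 x 128 MAC array at 1 GHz drawing 50 W.
peak = peak_tops(num_macs=128 * 128, clock_hz=1e9)
real = sustained_tops(peak, utilisation=0.40)

print(round(peak, 3))                     # 32.768
print(round(real, 3))                     # 13.107
print(round(tops_per_watt(real, 50), 3))  # 0.262
```

The gap between the first two lines is the whole story of utilisation: the datasheet quotes the first number, the workload experiences the second.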

The roofline model ties it together: performance is capped either by compute (the flat "roof") or by memory bandwidth (the sloped part), depending on a workload's arithmetic intensity (operations per byte fetched). Low-intensity workloads (like small-batch LLM inference) are memory-bound — which is exactly why HBM and on-chip SRAM matter so much. Peak TOPS alone is marketing; sustained TOPS at high utilisation is engineering.
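The roofline itself is a one-line formula: attainable throughput is the smaller of the compute peak and bandwidth times arithmetic intensity. The sketch below applies it to a hypothetical chip (400 TOPS peak, 2 TB/s of memory bandwidth). The intensity of about 2 operations per byte for batch-1 INT8 inference assumes each one-byte weight is fetched once and used for a single MAC.

```python
def attainable_tops(peak_tops, bandwidth_tb_s, intensity_ops_per_byte):
    """Roofline: min(compute roof, memory bandwidth x arithmetic intensity)."""
    return min(peak_tops, bandwidth_tb_s * intensity_ops_per_byte)


def ridge_point(peak_tops, bandwidth_tb_s):
    """Intensity (ops/byte) above which a workload becomes compute-bound."""
    return peak_tops / bandwidth_tb_s


PEAK = 400.0      # hypothetical peak, in TOPS
BANDWIDTH = 2.0   # hypothetical memory bandwidth, in TB/s

print(ridge_point(PEAK, BANDWIDTH))             # 200.0 ops per byte
print(attainable_tops(PEAK, BANDWIDTH, 2.0))    # 4.0   (memory-bound, 1% of peak)
print(attainable_tops(PEAK, BANDWIDTH, 500.0))  # 400.0 (compute-bound)
```

On these made-up numbers, a workload doing 2 operations per byte can never exceed 4 TOPS however many MAC units the chip has. Only more bandwidth or more reuse raises that ceiling.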

Two different jobs

Training chips vs inference chips

	Training	Inference
Job	Learn the weights	Run the trained model
Precision	BF16 / FP16 / FP8	INT8 / FP8 / INT4
Priorities	Throughput, memory, scale-out	Latency, power, cost
Interconnect	Critical (1000s of chips)	Often single-chip / small
Where	Data centre	Cloud → edge → phone NPU

Training is a brutal, multi-week, multi-thousand-accelerator effort dominated by compute and interconnect. Inference is about serving predictions cheaply and quickly — and is increasingly pushed to the edge (your phone's NPU) for latency and privacy.

Beyond one die

Scaling: chiplets, packaging & interconnect

The largest accelerator dies are already at the reticle limit (the maximum area a lithography tool can expose in a single shot), so one die cannot simply keep growing. AI chips therefore scale outward:

Chiplets & advanced packaging — multiple dies (compute + HBM stacks) joined on a silicon interposer (2.5D) or stacked (3D). Packaging throughput is now a key constraint on how many accelerators ship.
Chip-to-chip interconnect — high-speed links (e.g. NVLink-class fabrics) let thousands of accelerators act as one giant machine for training.
Networking & power — at cluster scale, the bottleneck shifts again to network bandwidth, power delivery and cooling.

This is why the AI boom is a whole-supply-chain event: logic, memory, packaging and networking all at once.

What's next

The frontier

In-memory / near-memory computing: do the MACs inside the memory array to kill data movement (the biggest energy cost).
Analog & mixed-signal compute: perform multiply-accumulate with physics (charge/current) for huge efficiency, trading exactness.
Photonic computing: matrix multiply with light; near-zero interconnect energy, still maturing.
Wafer-scale & 3D stacking: ever-bigger integrated compute with memory stacked directly on logic.
Sparsity & sub-8-bit formats: squeezing more useful work per joule.

✅ The whole guide in three sentences

An AI chip is a matrix-multiply engine: thousands of MAC units, often arranged as a systolic array, fed by HBM and on-chip SRAM. Its real challenge is moving data, so it leans on data reuse, low-precision formats and sparsity to maximise TOPS per watt. GPUs, TPUs, NPUs and ASICs are different flexibility-vs-efficiency answers to that same problem.

Reference

FAQ

What is an AI chip?

A processor specialised for neural-network math — mostly matrix multiply / multiply-accumulate — packing thousands of MAC units, wide on-chip memory and HBM for high throughput per watt. GPUs, TPUs, NPUs and ASICs are all AI chips.

What is a systolic array?

A grid of MAC units through which data flows rhythmically, each reusing operands and passing partial sums to neighbours — an extremely efficient way to do matrix multiply. The TPU is built on one.

GPU vs TPU vs NPU?

GPU = flexible, programmable, dominant for training. TPU = systolic-array ASIC, very efficient at tensor math. NPU = compact edge/phone accelerator for low-power inference.

Why low precision (INT8/FP8)?

Networks tolerate it; smaller numbers mean less memory/bandwidth and far more MACs per watt. Training favours BF16/FP8; inference often uses INT8 after quantization.

What is TOPS?

Trillions of Operations Per Second at a given precision. TOPS/W is efficiency. Real performance depends on memory bandwidth and utilisation, so peak TOPS can mislead.
