Q: How does a systolic array do matrix multiplication?

To compute C = A × B, the rows of A are fed into the left edge of the grid and the columns of B into the top edge, both skewed in time. As the values march right and down through the array, the processing element at position (i, j) accumulates the dot product of row i of A and column j of B, which is exactly C[i][j]. After enough cycles every element holds one entry of the result.

Q: What is a MAC unit?

A MAC (multiply-accumulate) unit is the basic building block of an AI chip: a multiplier followed by an adder that keeps a running sum, computing acc = acc + a × b in one step. A systolic array is a grid of these units working together, thousands of them in a real accelerator.

Q: How many cycles does a systolic array take?

For an N×N array with inner dimension K, the full matrix product completes in about 2N + K − 2 cycles, because many processing elements perform a multiply-accumulate in the same cycle rather than the array doing one multiply at a time.

Systolic Array Lab — Watch Matrix Multiply Flow Through a TPU's Core

Q: What is a systolic array?

A systolic array is a grid of small processing elements (MAC units) through which data flows rhythmically, like a heartbeat. Each element multiplies its two incoming values, adds the product to a running accumulator, and passes the values on to its neighbours. Because each input is reused across many elements instead of being re-fetched from memory, a systolic array performs matrix multiplication with very high efficiency. Google's TPU is built around a large systolic array.

Q: Why are systolic arrays used in AI chips?

Neural networks are dominated by matrix multiplication, and the main cost in hardware is moving data from memory. A systolic array maximises data reuse: each value entering the grid is used by an entire row or column of MAC units before leaving, so memory traffic is minimised. This makes it extremely efficient in operations per watt, which is why it sits at the heart of accelerators like the TPU.

Q: What is the difference between output-stationary and weight-stationary systolic arrays?

In an output-stationary array each result (partial sum) stays fixed in its processing element while inputs stream through, which is what this lab demonstrates. In a weight-stationary array the weights are pre-loaded into the processing elements and held there while activations stream through, maximising weight reuse; this is the classic TPU style for inference. A third dataflow, row-stationary, balances the reuse of weights, inputs and partial sums.

Q: Who invented the systolic array?

The systolic array was introduced in 1978 by H. T. Kung and Charles Leiserson. The name comes from systole, the rhythmic contraction of the heart, because data pulses through the array in a regular beat. It became central to AI hardware decades later when Google built the Tensor Processing Unit (TPU) around a large systolic array.

What you're looking at

This array computes C = A × B. The rows of A are fed into the left edge and the columns of B into the top edge — both skewed in time (notice each row/column starts one cycle later, forming a staircase). As values march right and down, the processing element at position (i, j) keeps a running sum of a × b — and that sum is exactly C[i][j].

The magic is data reuse: each value entering the grid is used by an entire row or column of MAC units before it leaves, instead of being re-fetched from memory. Minimising memory traffic is the whole reason this structure is so energy-efficient — and why it sits at the heart of Google's TPU.
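To put a number on that reuse, here is a rough back-of-the-envelope sketch. It assumes a naive design re-fetches both operands for every MAC, while the systolic array reads each element of A and B from memory exactly once at the edge of the grid. The function name `operand_fetches` is made up for this illustration, and the model ignores caches, result write-back and tiling.

```python
def operand_fetches(n: int, k: int) -> dict:
    """Count operand reads for an n x n output with inner dimension k."""
    macs = n * n * k
    naive = 2 * macs        # assumption: every MAC re-fetches both operands
    systolic = 2 * n * k    # each element of A and of B enters the grid once
    return {"macs": macs, "naive": naive, "systolic": systolic,
            "reuse": naive // systolic}

print(operand_fetches(2, 2))               # {'macs': 8, 'naive': 16, 'systolic': 8, 'reuse': 2}
print(operand_fetches(256, 256)["reuse"])  # 256
```

Under these assumptions the reuse factor is simply N: every value is used by N processing elements on its way across the grid.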

How the dataflow works

Each PE (processing element) holds a MAC unit: acc += a × b.
Value A[i][k] is injected at the left of row i at cycle k + i; B[k][j] at the top of column j at cycle k + j (that's the skew).
They meet at PE(i, j) at cycle k + i + j, so each term of the dot product arrives at the right moment and accumulates.
After 2N + K − 2 cycles, every PE holds one element of the result.
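The four steps above can be sketched as a small register-level simulation. This is a minimal model, not the lab's actual source: it assumes a square N×N grid, A of shape N×K, B of shape K×N, and it represents pipeline bubbles as zeros. The name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing A @ B.

    A is N x K and B is K x N (lists of lists). Returns (C, cycles).
    """
    n, k_dim = len(A), len(B)
    acc = [[0] * n for _ in range(n)]     # one accumulator per PE
    a_reg = [[0] * n for _ in range(n)]   # A value latched in each PE
    b_reg = [[0] * n for _ in range(n)]   # B value latched in each PE
    cycles = 2 * n + k_dim - 2
    for t in range(cycles):
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:                # left edge: inject A[i][k] at cycle k + i
                    k = t - i
                    a = A[i][k] if 0 <= k < k_dim else 0
                else:                     # otherwise take it from the left neighbour
                    a = a_reg[i][j - 1]
                if i == 0:                # top edge: inject B[k][j] at cycle k + j
                    k = t - j
                    b = B[k][j] if 0 <= k < k_dim else 0
                else:                     # otherwise take it from the neighbour above
                    b = b_reg[i - 1][j]
                acc[i][j] += a * b        # the MAC
                new_a[i][j], new_b[i][j] = a, b
        a_reg, b_reg = new_a, new_b       # all registers update on the clock edge
    return acc, cycles

C, cycles = systolic_matmul([[2, 1], [0, 3]], [[1, 4], [2, 1]])
print(C, cycles)   # [[4, 9], [6, 3]] 4
```

The old register values are read before the new ones are written, which is what makes every PE appear to update simultaneously, as it would on a real clock edge.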

This is an output-stationary systolic array — the result stays put in each PE while inputs stream through. (Weight-stationary and row-stationary are other common dataflows; see the AI-chip guide.)

Why it matters for AI chips

In most neural networks the bulk of the arithmetic is matrix multiplication (figures above 90% are commonly quoted), and the dominant cost in hardware is moving data, not the multiply itself. A systolic array attacks exactly that: maximum reuse, minimum memory traffic, thousands of MACs working in lock-step. Scale this little grid to 256×256 or larger, feed it from HBM, run it in INT8, and you have the core of a modern AI accelerator.

Want the bigger picture? Read "What Is an AI Chip?" and try the "GPU Lab" for the parallelism intuition.

Anatomy of a processing element (PE)

Each cell in the grid is a processing element, and inside it is almost nothing — which is the point. A PE contains just:

A multiplier that multiplies the value arriving from the left (a) by the value arriving from the top (b).
An adder and a small accumulator register that keeps the running sum: acc ← acc + a×b.
Two pipeline registers that latch a and b and pass them to the right and bottom neighbours on the next clock edge.

That tiny footprint is why you can fit a huge number of them on one die. A PE is deliberately "dumb": it has no instruction fetch, no branch logic and no cache, none of the machinery a CPU core carries. It just multiplies, adds, and forwards, every single clock cycle. Spend the silicon of a few dozen complex CPU cores on tens of thousands of these trivial PEs and you get an enormous jump in multiply-accumulates per second per watt, which is exactly what matrix-heavy neural networks crave.
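As a rough software picture of one PE, the sketch below models the three pieces listed above: the multiplier and adder, the accumulator, and the two pipeline registers. The class and method names (`ProcessingElement`, `clock`) are invented for illustration and plain Python integers stand in for fixed-width hardware values.

```python
from dataclasses import dataclass

@dataclass
class ProcessingElement:
    acc: int = 0     # accumulator register holding the running sum
    a_out: int = 0   # pipeline register feeding the right neighbour
    b_out: int = 0   # pipeline register feeding the bottom neighbour

    def clock(self, a_in: int, b_in: int) -> None:
        """One clock edge: multiply, accumulate, latch inputs for forwarding."""
        self.acc += a_in * b_in
        self.a_out = a_in
        self.b_out = b_in

pe = ProcessingElement()
pe.clock(2, 1)   # acc becomes 2
pe.clock(1, 2)   # acc becomes 4
print(pe.acc, pe.a_out, pe.b_out)   # 4 1 2
```

There is no instruction stream anywhere in this model: the only "program" is the order in which values arrive.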

✅ The key insight

A systolic array trades flexibility (a PE can only MAC) for density and efficiency (you can have a sea of them, all busy, with almost no control overhead and minimal memory traffic).

A worked example — 2×2, cycle by cycle

Let's trace the array computing A × B with the lab's default 2×2 values, so you can see exactly what each PE does on each clock cycle.

A = [[2, 1], [0, 3]] B = [[1, 4], [2, 1]] → expected C = [[4, 9], [6, 3]]

Cycle	P00	P01	P10	P11	What's happening
0	2×1 → 2	–	–	–	A[0][0] meets B[0][0] at the top-left PE
1	1×2 → 4	2×4 → 8	0×1 → 0	–	data has marched one step right & down
2	–	1×1 → 9	3×2 → 6	0×4 → 0	the wavefront sweeps diagonally across
3	–	–	–	3×1 → 3	last term lands in the bottom-right PE

Read the accumulators after cycle 3: P00 = 4, P01 = 9, P10 = 6, P11 = 3 — exactly C = [[4, 9], [6, 3]]. Notice how each value of A and B was reused as it travelled through more than one PE, and how the active MACs form a diagonal wavefront sweeping across the grid. Step the lab above with these numbers and watch the green flashes trace that same diagonal.
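The table above can be regenerated from the timing rule alone, without simulating any registers. The sketch below assumes the term with index k reaches PE(i, j) at cycle k + i + j and simply lists every MAC in cycle order. The helper name `mac_schedule` is hypothetical.

```python
def mac_schedule(A, B):
    """Return (cycle, i, j, a, b, running_sum) for every MAC, in cycle order."""
    n, k_dim = len(A), len(B)
    acc = [[0] * n for _ in range(n)]
    events = []
    for t in range(2 * n + k_dim - 2):
        for i in range(n):
            for j in range(n):
                k = t - i - j   # which dot-product term reaches PE(i, j) now
                if 0 <= k < k_dim:
                    acc[i][j] += A[i][k] * B[k][j]
                    events.append((t, i, j, A[i][k], B[k][j], acc[i][j]))
    return events

A = [[2, 1], [0, 3]]
B = [[1, 4], [2, 1]]
for t, i, j, a, b, total in mac_schedule(A, B):
    print(f"cycle {t}: P{i}{j}  {a}x{b} -> {total}")
```

Each printed line corresponds to one filled cell of the table, for example `cycle 2: P01  1x1 -> 9`.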

The skew: why inputs enter staggered

If you fed every row of A and every column of B in at the same time, the values wouldn't meet at the right PEs at the right moments — the dot products would add up the wrong terms. The fix is skewing: each row of A is delayed by its row index, and each column of B by its column index.

Stream	Cycle 0	Cycle 1	Cycle 2
A row 0 (into PE row 0)	A[0][0]	A[0][1]	A[0][2]
A row 1 (delayed 1)	–	A[1][0]	A[1][1]
A row 2 (delayed 2)	–	–	A[2][0]

That staircase is exactly what you see on the left and top edges of the lab. Formally, A[i][k] enters row i at cycle k + i, and B[k][j] enters column j at cycle k + j. Each value then moves one PE per cycle, so they arrive together at PE(i, j) at cycle k + i + j, and every term of the dot product Σ A[i][k]·B[k][j] lands in the correct accumulator at a distinct cycle. The last term reaches the bottom-right PE at cycle (K − 1) + 2(N − 1), so the whole machine finishes in 2N + K − 2 cycles. That is far fewer than the N²·K cycles a scalar CPU would need doing one multiply at a time, because up to N² MACs fire in the same cycle along the diagonal wavefronts.
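The staircase can be generated mechanically from the injection rule. This sketch builds, for each cycle, the value injected at the edge of every row (or column), with `None` marking a bubble where nothing enters. The function names `skew_rows` and `skew_cols` are assumptions made for this example.

```python
def skew_rows(A):
    """Per cycle, the value injected at the left edge of each array row.

    Row i is delayed by i cycles; None marks a bubble.
    """
    n, k_dim = len(A), len(A[0])
    cycles = n + k_dim - 1   # the last row injects its last value at cycle (n-1)+(k_dim-1)
    return [[A[i][t - i] if 0 <= t - i < k_dim else None for i in range(n)]
            for t in range(cycles)]

def skew_cols(B):
    """Same staircase for the top edge: column j of B is delayed by j cycles."""
    return skew_rows([list(col) for col in zip(*B)])

for t, injected in enumerate(skew_rows([[1, 2, 3], [4, 5, 6], [7, 8, 9]])):
    print(t, injected)
# 0 [1, None, None]
# 1 [2, 4, None]
# 2 [3, 5, 7]
# 3 [None, 6, 8]
# 4 [None, None, 9]
```

Reading the output column by column gives the rows of the table above: row 0 starts at cycle 0, row 1 at cycle 1, row 2 at cycle 2.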

Dataflows: output-, weight- and row-stationary

This lab shows an output-stationary array — each PE's accumulator (one element of C) stays put while inputs stream through. But "stationary" can apply to different operands, and the choice changes which data gets reused most, which matters because data movement, not arithmetic, dominates energy.

Dataflow	What stays put	Best reuse / used by
Output-stationary	The partial sum (output) in each PE	Reuses partial sums; what this lab shows
Weight-stationary	The weights, pre-loaded into the PEs	Reuses weights heavily; classic TPU style
Row-stationary	A balance of weights, inputs & sums	Balances all reuse; the Eyeriss research design

In a weight-stationary TPU, the weight matrix is loaded into the array first and held there; activations then stream through and results drop out the bottom. Because the same weights are reused across an entire batch of inputs, weight memory traffic almost disappears — ideal when weights are large and reused many times, as in inference.
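For contrast with the lab's output-stationary grid, here is a minimal sketch of the weight-stationary idea. It assumes PE(k, j) permanently holds W[k][j], activations enter from the left with the same one-cycle skew per row, and partial sums travel downward until they leave the bottom row. The function name, the zero-valued bubbles and the cycle count `batch + K + N - 2` belong to this simplified model, not to any particular TPU.

```python
def weight_stationary_matmul(X, W):
    """Compute X @ W with the weights held in the PEs.

    X is batch x K (activations), W is K x N (weights). Returns (Y, cycles).
    """
    batch, k_dim, n = len(X), len(W), len(W[0])
    a_reg = [[0] * n for _ in range(k_dim)]   # activations moving right
    p_reg = [[0] * n for _ in range(k_dim)]   # partial sums moving down
    Y = [[0] * n for _ in range(batch)]
    cycles = batch + k_dim + n - 2
    for t in range(cycles):
        new_a = [[0] * n for _ in range(k_dim)]
        new_p = [[0] * n for _ in range(k_dim)]
        for k in range(k_dim):
            for j in range(n):
                if j == 0:                    # left edge: row k is skewed by k cycles
                    b = t - k
                    a_in = X[b][k] if 0 <= b < batch else 0
                else:
                    a_in = a_reg[k][j - 1]
                p_in = p_reg[k - 1][j] if k > 0 else 0
                new_a[k][j] = a_in
                new_p[k][j] = p_in + a_in * W[k][j]   # the weight never moves
        a_reg, p_reg = new_a, new_p
        for j in range(n):                    # finished sums drop out of the bottom row
            b = t - (k_dim - 1) - j
            if 0 <= b < batch:
                Y[b][j] = p_reg[k_dim - 1][j]
    return Y, cycles

Y, cycles = weight_stationary_matmul([[2, 1], [0, 3]], [[1, 4], [2, 1]])
print(Y, cycles)   # [[4, 9], [6, 3]] 4
```

Note that `W` is only ever read in place: a larger batch adds more rows to `X` and more cycles, but no extra weight traffic.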

From this toy grid to a real TPU

The lab uses a 2×2 or 3×3 array so you can follow every value. Real accelerators scale the exact same structure up dramatically:

Size — Google's first TPU used a 256×256 systolic array: 65,536 MAC units doing 8-bit multiply-accumulates, capable of tens of trillions of operations per second.
Precision — instead of the integers here, real arrays run INT8, FP8 or BF16, packing far more MACs into the same silicon and bandwidth.
Feeding it — a 256×256 array is hungry; it's fed from on-chip SRAM backed by HBM delivering terabytes per second, because the array is useless if it starves for data.
Utilisation — a big square array is most efficient when the matrices are large and "square-ish". Small or oddly-shaped layers leave PEs idle, so real compilers tile and pad work to keep the array busy.
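The utilisation point can be made concrete with a little arithmetic. This sketch assumes the work is padded up to whole array-sized tiles and that each PE slot in a tile is either useful or wasted. It ignores fill and drain time and any compiler cleverness, and `tiled_utilisation` is a made-up helper name.

```python
import math

def tiled_utilisation(rows: int, cols: int, array: int = 256) -> float:
    """Fraction of PE slots doing useful work when a rows x cols matrix
    is padded up to whole array x array tiles."""
    tiles = math.ceil(rows / array) * math.ceil(cols / array)
    return (rows * cols) / (tiles * array * array)

print(tiled_utilisation(256, 256))   # 1.0
print(tiled_utilisation(300, 256))   # 0.5859375
print(tiled_utilisation(1, 256))     # 0.00390625
```

A matrix just 44 rows taller than the array needs a second, mostly empty tile, and a single-row workload keeps only one row of PEs busy.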

So the mental model is simple: this lab is a TPU's compute core in miniature. Everything else — HBM, the on-chip buffers, the compiler — exists to keep an array like this fed and busy.

Where systolic arrays struggle

They're not a universal answer. Their weaknesses are the flip side of their strengths:

Shape sensitivity — a fixed N×N array runs at peak only for well-matched matrix sizes. Tiny matrices, tall-thin shapes, or batch size 1 (common in real-time LLM inference) leave many PEs idle, so real utilisation can be far below peak TOPS.
Low flexibility — a PE only does multiply-accumulate. Operations that aren't matrix multiply (activations, softmax, normalisation, control flow) need separate hardware around the array.
Fill and drain latency — those 2N + K − 2 cycles include time to fill the pipeline and drain it; for a single small multiply that overhead is relatively large (great for big streaming workloads, less so for one tiny op).
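The fill-and-drain cost follows directly from the cycle count. As a hedged estimate for a single isolated multiply on an output-stationary N×N array: the useful work is N²·K MACs, while the array spends N² PE slots on each of its 2N + K − 2 cycles. The helper name `pipeline_efficiency` is hypothetical, and real designs hide much of this overhead by overlapping consecutive tiles.

```python
def pipeline_efficiency(n: int, k: int) -> float:
    """Useful MAC slots divided by all PE-cycles for one n x n, inner-k multiply."""
    total_cycles = 2 * n + k - 2
    # (n*n*k useful MACs) / (n*n PEs * total_cycles) simplifies to k / total_cycles
    return k / total_cycles

print(pipeline_efficiency(2, 2))                  # 0.5
print(round(pipeline_efficiency(256, 256), 3))    # 0.334
print(round(pipeline_efficiency(256, 65536), 3))  # 0.992
```

With a short inner dimension the array is mostly filling and draining; with a long stream of data the overhead fades away.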

This is why a GPU — with thousands of programmable cores — stays popular for fast-changing research, while fixed systolic ASICs win on efficiency for stable, large, matrix-dominated workloads.

A short history

The systolic array isn't new. It was introduced in 1978 by H. T. Kung and Charles Leiserson, who coined the name from systole — the rhythmic contraction of the heart — because data pulses through the array in a regular beat. For decades it was mostly an elegant academic idea, used in signal-processing chips.

Then deep learning made matrix multiplication the most important computation on Earth, and the systolic array's near-perfect data reuse suddenly made it the ideal engine. Google's Tensor Processing Unit (TPU), deployed from 2015, put a large systolic array at its heart, and a nearly 40-year-old idea became the backbone of modern AI infrastructure. A reminder that in computer architecture, the right idea sometimes just has to wait for the right workload.

Glossary

MAC: Multiply-accumulate, acc += a×b, the core operation.

PE: Processing element, one cell of the array, holding one MAC.

Systole: The "heartbeat"; data pulses through one step per cycle.

Skew: Staggering inputs in time so they meet at the right PE.

Output-stationary: The result stays in each PE; inputs stream through.

Weight-stationary: Weights are held in PEs; activations stream through.

GEMM: General Matrix Multiply, the workload an array accelerates.

Utilisation: Fraction of PEs actually busy; real performance depends on it.

FAQ

What is a systolic array?

A grid of MAC units through which data flows rhythmically; each reuses its inputs and passes them to neighbours, computing matrix multiply very efficiently. The TPU is built on one.

How does it multiply matrices?

Rows of A enter from the left and columns of B from the top, both skewed in time. PE(i, j) accumulates the dot product of row i of A and column j of B, which is C[i][j].

Why use it in AI chips?

It maximises data reuse and minimises memory traffic — the main energy cost — giving excellent operations-per-watt for the matrix math that dominates neural networks.

What is a MAC unit?

A multiply-accumulate unit — a multiplier plus an adder/accumulator computing acc += a×b in one step. A systolic array is a grid of thousands of them.

What is the difference between output-stationary and weight-stationary?

Output-stationary keeps each result (partial sum) fixed in its PE while inputs stream through — what this lab shows. Weight-stationary pre-loads the weights into the PEs and streams activations through, maximising weight reuse — the classic TPU style for inference.

How many cycles does it take?

For an N×N array multiplying with inner dimension K, the array produces the full result in about 2N + K − 2 cycles, because many PEs perform a multiply-accumulate in the same cycle instead of the array doing one multiply at a time.

Who invented the systolic array?

H. T. Kung and Charles Leiserson described it in 1978. It became central to AI hardware decades later when Google built the TPU around a large systolic array.
