HomeToolsSystolic Array Lab
🔬 INTERACTIVE LAB

Systolic Array Lab — Watch a TPU Multiply Matrices

Edit two matrices, then step the array cycle by cycle. Watch values flow through the MAC grid, accumulators build up, and the result verified against plain multiplication. This is the engine inside a TPU — made visible.

Size:

Matrix A (rows → enter left)

Matrix B (cols → enter top)

A value (flows right) B value (flows down) PE accumulator (MAC fired)
Cycle: 0 / 0 Ready — press Step or Run

What you're looking at

This array computes C = A × B. The rows of A are fed into the left edge and the columns of B into the top edge — both skewed in time (notice each row/column starts one cycle later, forming a staircase). As values march right and down, the processing element at position (i, j) keeps a running sum of a × b — and that sum is exactly C[i][j].

The magic is data reuse: each value entering the grid is used by an entire row or column of MAC units before it leaves, instead of being re-fetched from memory. Minimising memory traffic is the whole reason this structure is so energy-efficient — and why it sits at the heart of Google's TPU.

How the dataflow works

This is an output-stationary systolic array — the result stays put in each PE while inputs stream through. (Weight-stationary and row-stationary are other common dataflows; see the AI-chip guide.)

Why it matters for AI chips

Neural networks are >90% matrix multiply, and the dominant cost in hardware is moving data, not the multiply itself. A systolic array attacks exactly that: maximum reuse, minimum memory traffic, thousands of MACs working in lock-step. Scale this little grid to 256×256 or larger, feed it from HBM, run it in INT8, and you have the core of a modern AI accelerator.

Want the bigger picture? Read What Is an AI Chip? and try the GPU Lab for the parallelism intuition.

Anatomy of a processing element (PE)

Each cell in the grid is a processing element, and inside it is almost nothing — which is the point. A PE contains just:

That tiny footprint is why you can fit a huge number of them on one die. A PE is deliberately "dumb": it has no instruction fetch, no branch logic, no cache — none of the machinery a CPU core carries. It just multiplies, adds, and forwards, every single clock cycle. Replace a few thousand complex CPU cores with hundreds of thousands of these trivial PEs and you get an enormous jump in multiply-accumulates per second per watt — exactly what matrix-heavy neural networks crave.

✅ The key insight

A systolic array trades flexibility (a PE can only MAC) for density and efficiency (you can have a sea of them, all busy, with almost no control overhead and minimal memory traffic).

A worked example — 2×2, cycle by cycle

Let's trace the array computing A × B with the lab's default 2×2 values, so you can see exactly what each PE does on each clock cycle.

A = [[2, 1], [0, 3]]    B = [[1, 4], [2, 1]]  →  expected C = [[4, 9], [6, 3]]

CycleP00P01P10P11What's happening
02×1 → 2A[0][0] meets B[0][0] at the top-left PE
11×2 → 42×4 → 80×1 → 0data has marched one step right & down
21×1 → 93×2 → 60×4 → 0the wavefront sweeps diagonally across
33×1 → 3last term lands in the bottom-right PE

Read the accumulators after cycle 3: P00 = 4, P01 = 9, P10 = 6, P11 = 3 — exactly C = [[4, 9], [6, 3]]. Notice how each value of A and B was reused as it travelled through more than one PE, and how the active MACs form a diagonal wavefront sweeping across the grid. Step the lab above with these numbers and watch the green flashes trace that same diagonal.

The skew: why inputs enter staggered

If you fed every row of A and every column of B in at the same time, the values wouldn't meet at the right PEs at the right moments — the dot products would add up the wrong terms. The fix is skewing: each row of A is delayed by its row index, and each column of B by its column index.

StreamCycle 0Cycle 1Cycle 2
A row 0 (into PE row 0)A[0][0]A[0][1]A[0][2]
A row 1 (delayed 1)A[1][0]A[1][1]
A row 2 (delayed 2)A[2][0]

That staircase is exactly what you see on the left and top edges of the lab. Formally, A[i][k] enters row i at cycle k + i, and B[k][j] enters column j at cycle k + j. They therefore arrive together at PE(i, j) at cycle k + i + j, so every term of the dot product Σ A[i][k]·B[k][j] lands in the correct accumulator at a distinct cycle. The whole machine finishes in 2N + K − 2 cycles — far fewer than the N²·K multiplies done one-at-a-time on a scalar CPU, because the array does a whole diagonal of MACs every cycle.

Dataflows: output-, weight- and row-stationary

This lab shows an output-stationary array — each PE's accumulator (one element of C) stays put while inputs stream through. But "stationary" can apply to different operands, and the choice changes which data gets reused most, which matters because data movement, not arithmetic, dominates energy.

DataflowWhat stays putBest reuse / used by
Output-stationaryThe partial sum (output) in each PEReuses partial sums; what this lab shows
Weight-stationaryThe weights, pre-loaded into the PEsReuses weights heavily; classic TPU style
Row-stationaryA balance of weights, inputs & sumsBalances all reuse; the Eyeriss research design

In a weight-stationary TPU, the weight matrix is loaded into the array first and held there; activations then stream through and results drop out the bottom. Because the same weights are reused across an entire batch of inputs, weight memory traffic almost disappears — ideal when weights are large and reused many times, as in inference.

From this toy grid to a real TPU

The lab uses a 2×2 or 3×3 array so you can follow every value. Real accelerators scale the exact same structure up dramatically:

So the mental model is simple: this lab is a TPU's compute core in miniature. Everything else — HBM, the on-chip buffers, the compiler — exists to keep an array like this fed and busy.

Where systolic arrays struggle

They're not a universal answer. Their weaknesses are the flip side of their strengths:

This is why a GPU — with thousands of programmable cores — stays popular for fast-changing research, while fixed systolic ASICs win on efficiency for stable, large, matrix-dominated workloads.

A short history

The systolic array isn't new. It was introduced in 1978 by H. T. Kung and Charles Leiserson, who coined the name from systole — the rhythmic contraction of the heart — because data pulses through the array in a regular beat. For decades it was mostly an elegant academic idea, used in signal-processing chips.

Then deep learning made matrix multiplication the most important computation on Earth, and the systolic array's near-perfect data reuse suddenly made it the ideal engine. Google's Tensor Processing Unit (TPU), deployed from 2015, put a large systolic array at its heart — and a 40-year-old idea became the backbone of modern AI infrastructure. A reminder that in computer architecture, the right idea sometimes just has to wait for the right workload.

Glossary

MACMultiply-accumulate: acc += a×b, the core operation.
PEProcessing element — one cell of the array, holding one MAC.
SystoleThe "heartbeat" — data pulses through one step per cycle.
SkewStaggering inputs in time so they meet at the right PE.
Output-stationaryThe result stays in each PE; inputs stream through.
Weight-stationaryWeights are held in PEs; activations stream through.
GEMMGeneral Matrix Multiply — the workload an array accelerates.
UtilisationFraction of PEs actually busy; real perf depends on it.

FAQ

What is a systolic array?

A grid of MAC units through which data flows rhythmically; each reuses its inputs and passes them to neighbours, computing matrix multiply very efficiently. The TPU is built on one.

How does it multiply matrices?

Rows of A enter the left, columns of B enter the top (skewed in time). PE(i,j) accumulates the dot product of row i and column j = C[i][j].

Why use it in AI chips?

It maximises data reuse and minimises memory traffic — the main energy cost — giving excellent operations-per-watt for the matrix math that dominates neural networks.

What is a MAC unit?

A multiply-accumulate unit — a multiplier plus an adder/accumulator computing acc += a×b in one step. A systolic array is a grid of thousands of them.

What is the difference between output-stationary and weight-stationary?

Output-stationary keeps each result (partial sum) fixed in its PE while inputs stream through — what this lab shows. Weight-stationary pre-loads the weights into the PEs and streams activations through, maximising weight reuse — the classic TPU style for inference.

How many cycles does it take?

For an N×N array multiplying with inner dimension K, the array produces the full result in about 2N + K − 2 cycles, doing a whole diagonal of multiply-accumulates every cycle instead of one multiply at a time.

Who invented the systolic array?

H. T. Kung and Charles Leiserson described it in 1978. It became central to AI hardware decades later when Google built the TPU around a large systolic array.

Related: What Is an AI Chip? · GPU Lab · AI Semiconductor Boom