Edit two matrices, then step the array cycle by cycle. Watch values flow through the MAC grid, accumulators build up, and the result verified against plain multiplication. This is the engine inside a TPU — made visible.
This array computes C = A × B. The rows of A are fed into the left edge and the columns of B into the top edge — both skewed in time (notice each row/column starts one cycle later, forming a staircase). As values march right and down, the processing element at position (i, j) keeps a running sum of a × b — and that sum is exactly C[i][j].
The magic is data reuse: each value entering the grid is used by an entire row or column of MAC units before it leaves, instead of being re-fetched from memory. Minimising memory traffic is the whole reason this structure is so energy-efficient — and why it sits at the heart of Google's TPU.
acc += a × b.A[i][k] is injected at the left of row i at cycle k + i; B[k][j] at the top of column j at cycle k + j (that's the skew).k + i + j, so each term of the dot product arrives at the right moment and accumulates.2N + K − 2 cycles, every PE holds one element of the result.This is an output-stationary systolic array — the result stays put in each PE while inputs stream through. (Weight-stationary and row-stationary are other common dataflows; see the AI-chip guide.)
Neural networks are >90% matrix multiply, and the dominant cost in hardware is moving data, not the multiply itself. A systolic array attacks exactly that: maximum reuse, minimum memory traffic, thousands of MACs working in lock-step. Scale this little grid to 256×256 or larger, feed it from HBM, run it in INT8, and you have the core of a modern AI accelerator.
Want the bigger picture? Read What Is an AI Chip? and try the GPU Lab for the parallelism intuition.
Each cell in the grid is a processing element, and inside it is almost nothing — which is the point. A PE contains just:
a) by the value arriving from the top (b).acc ← acc + a×b.a and b and pass them to the right and bottom neighbours on the next clock edge.That tiny footprint is why you can fit a huge number of them on one die. A PE is deliberately "dumb": it has no instruction fetch, no branch logic, no cache — none of the machinery a CPU core carries. It just multiplies, adds, and forwards, every single clock cycle. Replace a few thousand complex CPU cores with hundreds of thousands of these trivial PEs and you get an enormous jump in multiply-accumulates per second per watt — exactly what matrix-heavy neural networks crave.
A systolic array trades flexibility (a PE can only MAC) for density and efficiency (you can have a sea of them, all busy, with almost no control overhead and minimal memory traffic).
Let's trace the array computing A × B with the lab's default 2×2 values, so you can see exactly what each PE does on each clock cycle.
A = [[2, 1], [0, 3]] B = [[1, 4], [2, 1]] → expected C = [[4, 9], [6, 3]]
| Cycle | P00 | P01 | P10 | P11 | What's happening |
|---|---|---|---|---|---|
| 0 | 2×1 → 2 | – | – | – | A[0][0] meets B[0][0] at the top-left PE |
| 1 | 1×2 → 4 | 2×4 → 8 | 0×1 → 0 | – | data has marched one step right & down |
| 2 | – | 1×1 → 9 | 3×2 → 6 | 0×4 → 0 | the wavefront sweeps diagonally across |
| 3 | – | – | – | 3×1 → 3 | last term lands in the bottom-right PE |
Read the accumulators after cycle 3: P00 = 4, P01 = 9, P10 = 6, P11 = 3 — exactly C = [[4, 9], [6, 3]]. Notice how each value of A and B was reused as it travelled through more than one PE, and how the active MACs form a diagonal wavefront sweeping across the grid. Step the lab above with these numbers and watch the green flashes trace that same diagonal.
If you fed every row of A and every column of B in at the same time, the values wouldn't meet at the right PEs at the right moments — the dot products would add up the wrong terms. The fix is skewing: each row of A is delayed by its row index, and each column of B by its column index.
| Stream | Cycle 0 | Cycle 1 | Cycle 2 |
|---|---|---|---|
| A row 0 (into PE row 0) | A[0][0] | A[0][1] | A[0][2] |
| A row 1 (delayed 1) | – | A[1][0] | A[1][1] |
| A row 2 (delayed 2) | – | – | A[2][0] |
That staircase is exactly what you see on the left and top edges of the lab. Formally, A[i][k] enters row i at cycle k + i, and B[k][j] enters column j at cycle k + j. They therefore arrive together at PE(i, j) at cycle k + i + j, so every term of the dot product Σ A[i][k]·B[k][j] lands in the correct accumulator at a distinct cycle. The whole machine finishes in 2N + K − 2 cycles — far fewer than the N²·K multiplies done one-at-a-time on a scalar CPU, because the array does a whole diagonal of MACs every cycle.
This lab shows an output-stationary array — each PE's accumulator (one element of C) stays put while inputs stream through. But "stationary" can apply to different operands, and the choice changes which data gets reused most, which matters because data movement, not arithmetic, dominates energy.
| Dataflow | What stays put | Best reuse / used by |
|---|---|---|
| Output-stationary | The partial sum (output) in each PE | Reuses partial sums; what this lab shows |
| Weight-stationary | The weights, pre-loaded into the PEs | Reuses weights heavily; classic TPU style |
| Row-stationary | A balance of weights, inputs & sums | Balances all reuse; the Eyeriss research design |
In a weight-stationary TPU, the weight matrix is loaded into the array first and held there; activations then stream through and results drop out the bottom. Because the same weights are reused across an entire batch of inputs, weight memory traffic almost disappears — ideal when weights are large and reused many times, as in inference.
The lab uses a 2×2 or 3×3 array so you can follow every value. Real accelerators scale the exact same structure up dramatically:
So the mental model is simple: this lab is a TPU's compute core in miniature. Everything else — HBM, the on-chip buffers, the compiler — exists to keep an array like this fed and busy.
They're not a universal answer. Their weaknesses are the flip side of their strengths:
2N + K − 2 cycles include time to fill the pipeline and drain it; for a single small multiply that overhead is relatively large (great for big streaming workloads, less so for one tiny op).This is why a GPU — with thousands of programmable cores — stays popular for fast-changing research, while fixed systolic ASICs win on efficiency for stable, large, matrix-dominated workloads.
The systolic array isn't new. It was introduced in 1978 by H. T. Kung and Charles Leiserson, who coined the name from systole — the rhythmic contraction of the heart — because data pulses through the array in a regular beat. For decades it was mostly an elegant academic idea, used in signal-processing chips.
Then deep learning made matrix multiplication the most important computation on Earth, and the systolic array's near-perfect data reuse suddenly made it the ideal engine. Google's Tensor Processing Unit (TPU), deployed from 2015, put a large systolic array at its heart — and a 40-year-old idea became the backbone of modern AI infrastructure. A reminder that in computer architecture, the right idea sometimes just has to wait for the right workload.
acc += a×b, the core operation.A grid of MAC units through which data flows rhythmically; each reuses its inputs and passes them to neighbours, computing matrix multiply very efficiently. The TPU is built on one.
Rows of A enter the left, columns of B enter the top (skewed in time). PE(i,j) accumulates the dot product of row i and column j = C[i][j].
It maximises data reuse and minimises memory traffic — the main energy cost — giving excellent operations-per-watt for the matrix math that dominates neural networks.
A multiply-accumulate unit — a multiplier plus an adder/accumulator computing acc += a×b in one step. A systolic array is a grid of thousands of them.
Output-stationary keeps each result (partial sum) fixed in its PE while inputs stream through — what this lab shows. Weight-stationary pre-loads the weights into the PEs and streams activations through, maximising weight reuse — the classic TPU style for inference.
For an N×N array multiplying with inner dimension K, the array produces the full result in about 2N + K − 2 cycles, doing a whole diagonal of multiply-accumulates every cycle instead of one multiply at a time.
H. T. Kung and Charles Leiserson described it in 1978. It became central to AI hardware decades later when Google built the TPU around a large systolic array.
Related: What Is an AI Chip? · GPU Lab · AI Semiconductor Boom