Systolic Array Design PE Architecture, Dataflow & Matrix Multiply
By EcrioniX · Updated June 2026 · ~45 min read
Processing Element (PE)Systolic ArrayWeight-StationaryOutput-StationaryMatrix MultiplyMAC UnitDataflow
What is a Systolic Array?
A systolic array is a 2D grid of simple Processing Elements (PEs) where data flows rhythmically from one PE to its neighbors on every clock edge — like blood pumped rhythmically by the heart (Greek: systole). Each PE does one thing: a multiply-accumulate (MAC) operation. The magic is that data flows through the array, being reused at every PE it passes, without returning to memory.
Google's TPU, NVIDIA's Tensor Core, and virtually every AI inference chip uses a systolic array as its heart. Understanding this architecture is foundational for accelerator design.
Why Systolic Arrays Win
Matrix multiply on a CPU requires loading each operand multiple times from memory. A 4×4 systolic array reuses each input element across 4 PEs before discarding it. For an N×N array: compute = N³ MACs, memory accesses = O(N²). Arithmetic intensity = N — it grows with array size, making systolic arrays memory-bandwidth efficient.
Processing Element (PE) Design
Each PE in a weight-stationary systolic array holds one weight value and computes: acc += weight × input. Inputs flow east (right), partial sums flow south (down).
Single PE — weight-stationary
Verilog — Processing Element (PE), weight-stationary
// Computes C = A × B where A,B are 2×2 matrices// Row activations flow east; partial sums flow southmodulesystolic_2x2 #(parameterW=8, parameterACC=32) (
inputclk, rst, load_w,
input [W-1:0] w[0:1][0:1], // weight matrix Binput [W-1:0] a_row0[0:1], // row 0 of Ainput [W-1:0] a_row1[0:1], // row 1 of Aoutput [ACC-1:0] C[0:1][0:1] // result matrix C
);
// Inter-PE wireswire [W-1:0] a_mid[0:1]; // activation between col0→col1wire [ACC-1:0] p_mid[0:1]; // psum between row0→row1// Row 0
pe #(W,ACC) pe00 (.clk(clk),.rst(rst),.load_w(load_w),
.w_in(w[0][0]),.a_in(a_row0[0]),.psum_in(0),
.a_out(a_mid[0]),.psum_out(p_mid[0]));
pe #(W,ACC) pe01 (.clk(clk),.rst(rst),.load_w(load_w),
.w_in(w[0][1]),.a_in(a_mid[0]),.psum_in(0),
.a_out(),.psum_out(p_mid[1]));
// Row 1
pe #(W,ACC) pe10 (.clk(clk),.rst(rst),.load_w(load_w),
.w_in(w[1][0]),.a_in(a_row1[0]),.psum_in(p_mid[0]),
.a_out(a_mid[1]),.psum_out(C[0][0]));
pe #(W,ACC) pe11 (.clk(clk),.rst(rst),.load_w(load_w),
.w_in(w[1][1]),.a_in(a_mid[1]),.psum_in(p_mid[1]),
.a_out(),.psum_out(C[0][1]));
endmodule
Dataflow Comparison
Dataflow
What's Stationary
Data Flowing
Used In
Best For
Weight-Stationary
Weights (B matrix)
Activations east, psums south
Google TPU v1
Inference — weights loaded once
Output-Stationary
Partial sums (C matrix)
Both A and B flow in
NVDLA, many edge AI chips
When accumulator bandwidth is limited
Input-Stationary
Input activations (A matrix)
Weights and psums flow
Less common
Streaming inference
No Local Reuse
Nothing
All operands stream in
GPUs (register file)
Flexible but memory-bandwidth hungry
Latency & Throughput Analysis
For an N×N systolic array multiplying N×N matrices:
Fill latency = 2N - 1 cycles (time to fill the pipeline)
Compute cycles = N cycles (once filled, N outputs/cycle)
Total latency = 3N - 1 cycles (fill + drain)
Throughput (steady state) = N² MACs / cycle
Example: 128×128 array at 1 GHz:
Throughput = 128² × 2 = 32,768 GFlops = 32.7 TFlops
(×2 for FMA = multiply + add counted as 2 flops)
Compare to: CPU core ≈ 0.05 TFlops, GPU ≈ 1000 TFlops (due to 1000s of cores)
Day 4 — Interview Questions
Q1What is a systolic array and why is it efficient for matrix multiply?
A systolic array is a 2D grid of Processing Elements (PEs) where data flows rhythmically from PE to PE each clock cycle. Each PE does a MAC (multiply-accumulate) and passes its inputs to adjacent PEs. For N×N matrix multiply, a systolic array performs N³ MACs with only O(N²) memory accesses — arithmetic intensity scales as N. This means as the array size grows, the ratio of compute to memory bandwidth improves, making systolic arrays highly efficient for the compute-bound workloads found in neural network inference and training.
Q2Explain weight-stationary dataflow in a systolic array.
In weight-stationary dataflow, the weight matrix B is preloaded into the PE array — each PE holds exactly one weight element that remains fixed throughout the computation. Input activations (from matrix A) flow horizontally (east) through the array row by row. Partial sums accumulate vertically (south) as the activations pass through each PE. The key advantage: each weight value is loaded once into its PE and reused for every input vector, maximizing weight reuse and minimising weight memory bandwidth. Google's TPU v1 uses weight-stationary dataflow.
Q3What is the fill latency of an N×N systolic array?
The fill latency is 2N−1 cycles. The first input element reaches PE[0][0] at cycle 1, but PE[N-1][N-1] (bottom-right) only receives its first activation at cycle 2N−1 because it takes N−1 cycles for data to traverse N PEs horizontally and N−1 cycles to traverse vertically, staggered. After the fill phase (2N−1 cycles), the array is in steady state producing N outputs per cycle. Total latency for one matrix multiply is 3N−1 cycles (fill + N compute + drain). This pipeline latency is a one-time cost amortized over the computation.
Q4What is the difference between a MAC unit and a PE in a systolic array?
A MAC (Multiply-Accumulate) unit is a pure arithmetic circuit: output = A × B + C. A PE (Processing Element) in a systolic array wraps a MAC unit with registers for pipelining and local state storage (the stationary weight). The PE also includes input/output registers to pass activations east and partial sums south with correct timing alignment. A PE is the complete building block of a systolic array; a MAC is just the arithmetic core inside it.
Q5How does a systolic array handle matrices larger than the array size?
Larger matrices are processed by tiling. The matrix is divided into tiles matching the array dimensions (N×N). The accelerator loads one tile of B into the array as stationary weights, then streams multiple tiles of A through it, accumulating partial sums. After processing all A tiles for a B tile, the partial sums are written to memory and the next B tile is loaded. The software (or DMA controller) manages the tiling schedule. Tiling adds software complexity but allows any matrix size to be processed on a fixed-size array.
Q6Why do AI accelerators use INT8 instead of FP32 in systolic arrays?
INT8 (8-bit integer) MACs are 4× smaller in silicon area than FP32 MACs, consume significantly less power, and can achieve 4× higher throughput on the same die area. Neural network inference is tolerant of quantisation noise — models can be quantised from FP32 training weights to INT8 with less than 1% accuracy loss when done carefully. For a fixed die area, an INT8 systolic array fits 16× more PEs than FP32 (4× per dimension in a 2D array), enabling 16× higher throughput. This is why TPUs, Apple Neural Engine, and most edge AI chips use INT8 or INT4 for inference.