HomeRISC-V + AcceleratorDay 4 — Systolic Array Design
RISC-V + Accelerator · Day 04 of 15

Systolic Array Design
PE Architecture, Dataflow & Matrix Multiply

By EcrioniX · Updated June 2026 · ~45 min read
Processing Element (PE)Systolic ArrayWeight-Stationary Output-StationaryMatrix MultiplyMAC UnitDataflow

What is a Systolic Array?

A systolic array is a 2D grid of simple Processing Elements (PEs) where data flows rhythmically from one PE to its neighbors on every clock edge — like blood pumped rhythmically by the heart (Greek: systole). Each PE does one thing: a multiply-accumulate (MAC) operation. The magic is that data flows through the array, being reused at every PE it passes, without returning to memory.

Google's TPU, NVIDIA's Tensor Core, and virtually every AI inference chip uses a systolic array as its heart. Understanding this architecture is foundational for accelerator design.

Why Systolic Arrays Win

Matrix multiply on a CPU requires loading each operand multiple times from memory. A 4×4 systolic array reuses each input element across 4 PEs before discarding it. For an N×N array: compute = N³ MACs, memory accesses = O(N²). Arithmetic intensity = N — it grows with array size, making systolic arrays memory-bandwidth efficient.

Processing Element (PE) Design

Each PE in a weight-stationary systolic array holds one weight value and computes: acc += weight × input. Inputs flow east (right), partial sums flow south (down).

Single PE — weight-stationary
PE acc += w × in w stored in reg in in_out psum_in psum_out w loaded at startup
Verilog — Processing Element (PE), weight-stationary
module pe #(parameter W=8, parameter ACC=32) (
  input               clk, rst, load_w,
  input  [W-1:0]      w_in,      // weight to load
  input  [W-1:0]      a_in,      // activation flowing east
  input  [ACC-1:0]    psum_in,   // partial sum flowing south
  output reg [W-1:0]  a_out,     // activation passed east
  output reg [ACC-1:0] psum_out  // accumulated sum passed south
);
  reg [W-1:0] weight;

  always @(posedge clk) begin
    if (rst) begin
      weight <= 0; a_out <= 0; psum_out <= 0;
    end else begin
      if (load_w) weight <= w_in;   // preload weight
      a_out    <= a_in;               // pass activation east
      psum_out <= psum_in + (weight * a_in);  // MAC + pass south
    end
  end
endmodule

2×2 Systolic Array

Verilog — 2×2 systolic array for matrix multiply
// Computes C = A × B where A,B are 2×2 matrices
// Row activations flow east; partial sums flow south
module systolic_2x2 #(parameter W=8, parameter ACC=32) (
  input              clk, rst, load_w,
  input  [W-1:0]   w[0:1][0:1],  // weight matrix B
  input  [W-1:0]   a_row0[0:1],    // row 0 of A
  input  [W-1:0]   a_row1[0:1],    // row 1 of A
  output [ACC-1:0] C[0:1][0:1]     // result matrix C
);
  // Inter-PE wires
  wire [W-1:0]   a_mid[0:1];   // activation between col0→col1
  wire [ACC-1:0] p_mid[0:1];   // psum between row0→row1

  // Row 0
  pe #(W,ACC) pe00 (.clk(clk),.rst(rst),.load_w(load_w),
    .w_in(w[0][0]),.a_in(a_row0[0]),.psum_in(0),
    .a_out(a_mid[0]),.psum_out(p_mid[0]));

  pe #(W,ACC) pe01 (.clk(clk),.rst(rst),.load_w(load_w),
    .w_in(w[0][1]),.a_in(a_mid[0]),.psum_in(0),
    .a_out(),.psum_out(p_mid[1]));

  // Row 1
  pe #(W,ACC) pe10 (.clk(clk),.rst(rst),.load_w(load_w),
    .w_in(w[1][0]),.a_in(a_row1[0]),.psum_in(p_mid[0]),
    .a_out(a_mid[1]),.psum_out(C[0][0]));

  pe #(W,ACC) pe11 (.clk(clk),.rst(rst),.load_w(load_w),
    .w_in(w[1][1]),.a_in(a_mid[1]),.psum_in(p_mid[1]),
    .a_out(),.psum_out(C[0][1]));
endmodule

Dataflow Comparison

DataflowWhat's StationaryData FlowingUsed InBest For
Weight-StationaryWeights (B matrix)Activations east, psums southGoogle TPU v1Inference — weights loaded once
Output-StationaryPartial sums (C matrix)Both A and B flow inNVDLA, many edge AI chipsWhen accumulator bandwidth is limited
Input-StationaryInput activations (A matrix)Weights and psums flowLess commonStreaming inference
No Local ReuseNothingAll operands stream inGPUs (register file)Flexible but memory-bandwidth hungry

Latency & Throughput Analysis

For an N×N systolic array multiplying N×N matrices: Fill latency = 2N - 1 cycles (time to fill the pipeline) Compute cycles = N cycles (once filled, N outputs/cycle) Total latency = 3N - 1 cycles (fill + drain) Throughput (steady state) = N² MACs / cycle Example: 128×128 array at 1 GHz: Throughput = 128² × 2 = 32,768 GFlops = 32.7 TFlops (×2 for FMA = multiply + add counted as 2 flops) Compare to: CPU core ≈ 0.05 TFlops, GPU ≈ 1000 TFlops (due to 1000s of cores)

Day 4 — Interview Questions

Q1What is a systolic array and why is it efficient for matrix multiply?
A systolic array is a 2D grid of Processing Elements (PEs) where data flows rhythmically from PE to PE each clock cycle. Each PE does a MAC (multiply-accumulate) and passes its inputs to adjacent PEs. For N×N matrix multiply, a systolic array performs N³ MACs with only O(N²) memory accesses — arithmetic intensity scales as N. This means as the array size grows, the ratio of compute to memory bandwidth improves, making systolic arrays highly efficient for the compute-bound workloads found in neural network inference and training.
Q2Explain weight-stationary dataflow in a systolic array.
In weight-stationary dataflow, the weight matrix B is preloaded into the PE array — each PE holds exactly one weight element that remains fixed throughout the computation. Input activations (from matrix A) flow horizontally (east) through the array row by row. Partial sums accumulate vertically (south) as the activations pass through each PE. The key advantage: each weight value is loaded once into its PE and reused for every input vector, maximizing weight reuse and minimising weight memory bandwidth. Google's TPU v1 uses weight-stationary dataflow.
Q3What is the fill latency of an N×N systolic array?
The fill latency is 2N−1 cycles. The first input element reaches PE[0][0] at cycle 1, but PE[N-1][N-1] (bottom-right) only receives its first activation at cycle 2N−1 because it takes N−1 cycles for data to traverse N PEs horizontally and N−1 cycles to traverse vertically, staggered. After the fill phase (2N−1 cycles), the array is in steady state producing N outputs per cycle. Total latency for one matrix multiply is 3N−1 cycles (fill + N compute + drain). This pipeline latency is a one-time cost amortized over the computation.
Q4What is the difference between a MAC unit and a PE in a systolic array?
A MAC (Multiply-Accumulate) unit is a pure arithmetic circuit: output = A × B + C. A PE (Processing Element) in a systolic array wraps a MAC unit with registers for pipelining and local state storage (the stationary weight). The PE also includes input/output registers to pass activations east and partial sums south with correct timing alignment. A PE is the complete building block of a systolic array; a MAC is just the arithmetic core inside it.
Q5How does a systolic array handle matrices larger than the array size?
Larger matrices are processed by tiling. The matrix is divided into tiles matching the array dimensions (N×N). The accelerator loads one tile of B into the array as stationary weights, then streams multiple tiles of A through it, accumulating partial sums. After processing all A tiles for a B tile, the partial sums are written to memory and the next B tile is loaded. The software (or DMA controller) manages the tiling schedule. Tiling adds software complexity but allows any matrix size to be processed on a fixed-size array.
Q6Why do AI accelerators use INT8 instead of FP32 in systolic arrays?
INT8 (8-bit integer) MACs are 4× smaller in silicon area than FP32 MACs, consume significantly less power, and can achieve 4× higher throughput on the same die area. Neural network inference is tolerant of quantisation noise — models can be quantised from FP32 training weights to INT8 with less than 1% accuracy loss when done carefully. For a fixed die area, an INT8 systolic array fits 16× more PEs than FP32 (4× per dimension in a 2D array), enabling 16× higher throughput. This is why TPUs, Apple Neural Engine, and most edge AI chips use INT8 or INT4 for inference.
← Day 3: RoCC Interface Day 5: RoCC + Systolic →