What is weight-stationary dataflow in a systolic array?

In weight-stationary dataflow, the weight matrix is preloaded into the PE array and stays stationary (fixed) in each PE throughout the computation. Input activations flow horizontally through the array, and partial sums accumulate vertically. Each PE holds one weight value and accumulates: acc += weight × input. This maximizes weight reuse — each weight is loaded once and used for every input row. Google's TPU uses weight-stationary dataflow. The alternative is output-stationary (each PE accumulates one output element) or input-stationary (inputs stay fixed).

RISC-V Accelerator Day 4 — Systolic Array Design: PE, Dataflow & Matrix Multiply

Q: What is a systolic array?

A systolic array is a network of Processing Elements (PEs) where data flows rhythmically from one PE to the next in synchrony with a clock — like the rhythmic pumping of a heart (systole). Each PE performs a multiply-accumulate (MAC) operation and passes its inputs to adjacent PEs. Systolic arrays are ideal for matrix multiplication because the same data is reused across multiple PEs without going back to memory, dramatically reducing memory bandwidth requirements. Google's TPU and most AI accelerators use systolic arrays as their core compute engine.

What is a Systolic Array?

A systolic array is a 2D grid of simple Processing Elements (PEs) where data flows rhythmically from one PE to its neighbors on every clock edge — like blood pumped rhythmically by the heart (Greek: systole). Each PE does one thing: a multiply-accumulate (MAC) operation. The magic is that data flows through the array, being reused at every PE it passes, without returning to memory.

Google's TPU, NVIDIA's Tensor Core, and virtually every AI inference chip uses a systolic array as its heart. Understanding this architecture is foundational for accelerator design.

Why Systolic Arrays Win

Matrix multiply on a CPU requires loading each operand multiple times from memory. A 4×4 systolic array reuses each input element across 4 PEs before discarding it. For an N×N array: compute = N³ MACs, memory accesses = O(N²). Arithmetic intensity = N — it grows with array size, making systolic arrays memory-bandwidth efficient.

Processing Element (PE) Design

Each PE in a weight-stationary systolic array holds one weight value and computes: acc += weight × input. Inputs flow east (right), partial sums flow south (down).

Single PE — weight-stationary

Verilog — Processing Element (PE), weight-stationary

module pe #(parameter W=8, parameter ACC=32) (
  input               clk, rst, load_w,
  input  [W-1:0]      w_in,      // weight to load
  input  [W-1:0]      a_in,      // activation flowing east
  input  [ACC-1:0]    psum_in,   // partial sum flowing south
  output reg [W-1:0]  a_out,     // activation passed east
  output reg [ACC-1:0] psum_out  // accumulated sum passed south
);
  reg [W-1:0] weight;

  always @(posedge clk) begin
    if (rst) begin
      weight <= 0; a_out <= 0; psum_out <= 0;
    end else begin
      if (load_w) weight <= w_in;   // preload weight
      a_out    <= a_in;               // pass activation east
      psum_out <= psum_in + (weight * a_in);  // MAC + pass south
    end
  end
endmodule

2×2 Systolic Array

Verilog — 2×2 systolic array for matrix multiply

// Computes C = A × B where A,B are 2×2 matrices
// Row activations flow east; partial sums flow south
module systolic_2x2 #(parameter W=8, parameter ACC=32) (
  input              clk, rst, load_w,
  input  [W-1:0]   w[0:1][0:1],  // weight matrix B
  input  [W-1:0]   a_row0[0:1],    // row 0 of A
  input  [W-1:0]   a_row1[0:1],    // row 1 of A
  output [ACC-1:0] C[0:1][0:1]     // result matrix C
);
  // Inter-PE wires
  wire [W-1:0]   a_mid[0:1];   // activation between col0→col1
  wire [ACC-1:0] p_mid[0:1];   // psum between row0→row1

  // Row 0
  pe #(W,ACC) pe00 (.clk(clk),.rst(rst),.load_w(load_w),
    .w_in(w[0][0]),.a_in(a_row0[0]),.psum_in(0),
    .a_out(a_mid[0]),.psum_out(p_mid[0]));

  pe #(W,ACC) pe01 (.clk(clk),.rst(rst),.load_w(load_w),
    .w_in(w[0][1]),.a_in(a_mid[0]),.psum_in(0),
    .a_out(),.psum_out(p_mid[1]));

  // Row 1
  pe #(W,ACC) pe10 (.clk(clk),.rst(rst),.load_w(load_w),
    .w_in(w[1][0]),.a_in(a_row1[0]),.psum_in(p_mid[0]),
    .a_out(a_mid[1]),.psum_out(C[0][0]));

  pe #(W,ACC) pe11 (.clk(clk),.rst(rst),.load_w(load_w),
    .w_in(w[1][1]),.a_in(a_mid[1]),.psum_in(p_mid[1]),
    .a_out(),.psum_out(C[0][1]));
endmodule

Dataflow Comparison

Dataflow	What's Stationary	Data Flowing	Used In	Best For
Weight-Stationary	Weights (B matrix)	Activations east, psums south	Google TPU v1	Inference — weights loaded once
Output-Stationary	Partial sums (C matrix)	Both A and B flow in	NVDLA, many edge AI chips	When accumulator bandwidth is limited
Input-Stationary	Input activations (A matrix)	Weights and psums flow	Less common	Streaming inference
No Local Reuse	Nothing	All operands stream in	GPUs (register file)	Flexible but memory-bandwidth hungry

Latency & Throughput Analysis

For an N×N systolic array multiplying N×N matrices: Fill latency = 2N - 1 cycles (time to fill the pipeline) Compute cycles = N cycles (once filled, N outputs/cycle) Total latency = 3N - 1 cycles (fill + drain) Throughput (steady state) = N² MACs / cycle Example: 128×128 array at 1 GHz: Throughput = 128² × 2 = 32,768 GFlops = 32.7 TFlops (×2 for FMA = multiply + add counted as 2 flops) Compare to: CPU core ≈ 0.05 TFlops, GPU ≈ 1000 TFlops (due to 1000s of cores)

Day 4 — Interview Questions

Q1What is a systolic array and why is it efficient for matrix multiply?

A systolic array is a 2D grid of Processing Elements (PEs) where data flows rhythmically from PE to PE each clock cycle. Each PE does a MAC (multiply-accumulate) and passes its inputs to adjacent PEs. For N×N matrix multiply, a systolic array performs N³ MACs with only O(N²) memory accesses — arithmetic intensity scales as N. This means as the array size grows, the ratio of compute to memory bandwidth improves, making systolic arrays highly efficient for the compute-bound workloads found in neural network inference and training.

Q2Explain weight-stationary dataflow in a systolic array.

In weight-stationary dataflow, the weight matrix B is preloaded into the PE array — each PE holds exactly one weight element that remains fixed throughout the computation. Input activations (from matrix A) flow horizontally (east) through the array row by row. Partial sums accumulate vertically (south) as the activations pass through each PE. The key advantage: each weight value is loaded once into its PE and reused for every input vector, maximizing weight reuse and minimising weight memory bandwidth. Google's TPU v1 uses weight-stationary dataflow.

Q3What is the fill latency of an N×N systolic array?

The fill latency is 2N−1 cycles. The first input element reaches PE[0][0] at cycle 1, but PE[N-1][N-1] (bottom-right) only receives its first activation at cycle 2N−1 because it takes N−1 cycles for data to traverse N PEs horizontally and N−1 cycles to traverse vertically, staggered. After the fill phase (2N−1 cycles), the array is in steady state producing N outputs per cycle. Total latency for one matrix multiply is 3N−1 cycles (fill + N compute + drain). This pipeline latency is a one-time cost amortized over the computation.

Q4What is the difference between a MAC unit and a PE in a systolic array?

A MAC (Multiply-Accumulate) unit is a pure arithmetic circuit: output = A × B + C. A PE (Processing Element) in a systolic array wraps a MAC unit with registers for pipelining and local state storage (the stationary weight). The PE also includes input/output registers to pass activations east and partial sums south with correct timing alignment. A PE is the complete building block of a systolic array; a MAC is just the arithmetic core inside it.

Q5How does a systolic array handle matrices larger than the array size?

Larger matrices are processed by tiling. The matrix is divided into tiles matching the array dimensions (N×N). The accelerator loads one tile of B into the array as stationary weights, then streams multiple tiles of A through it, accumulating partial sums. After processing all A tiles for a B tile, the partial sums are written to memory and the next B tile is loaded. The software (or DMA controller) manages the tiling schedule. Tiling adds software complexity but allows any matrix size to be processed on a fixed-size array.

Q6Why do AI accelerators use INT8 instead of FP32 in systolic arrays?

INT8 (8-bit integer) MACs are 4× smaller in silicon area than FP32 MACs, consume significantly less power, and can achieve 4× higher throughput on the same die area. Neural network inference is tolerant of quantisation noise — models can be quantised from FP32 training weights to INT8 with less than 1% accuracy loss when done carefully. For a fixed die area, an INT8 systolic array fits 16× more PEs than FP32 (4× per dimension in a 2D array), enabling 16× higher throughput. This is why TPUs, Apple Neural Engine, and most edge AI chips use INT8 or INT4 for inference.

← Day 3: RoCC Interface Day 5: RoCC + Systolic →

Systolic Array DesignPE Architecture, Dataflow & Matrix Multiply