AI Chip Design Day 8 — 4×4 Systolic Array Deep Dive

Problem: Multiply 4×4 Matrix

Compute C[4,4] = A[4,4] @ B[4,4] using a 4×4 systolic array.

The Setup

Input A (4×4): Input B (4×4): Compute C = A @ B [1 2 3 4] [1 0 0 0] [5 6 7 8] [0 1 0 0] [9 10 11 12] × [0 0 1 0] = ? [13 14 15 16] [0 0 0 1] Identity matrix, so C = A (simple case for visualization)

Systolic Array Hardware Layout

A inputs (left edge) ↓ ↓ ↓ ↓ [MAC] [MAC] [MAC] [MAC] → B outputs (right) ↓ ↓ ↓ ↓ [MAC] [MAC] [MAC] [MAC] → B outputs ↓ ↓ ↓ ↓ [MAC] [MAC] [MAC] [MAC] → B outputs ↓ ↓ ↓ ↓ [MAC] [MAC] [MAC] [MAC] → B outputs ↓ C outputs (bottom edge) Data flow: - A values enter from LEFT, flow RIGHT - B values enter from TOP, flow DOWN - Partial sums move RIGHT with data - Final C values exit at BOTTOM

Cycle-by-Cycle Execution

Initialization: Pipelined loading of A (rows), B (columns)

Loading Phase (Cycles 0-6)

Before computation starts, we must load A and B values into the systolic array. They propagate through the pipeline:

Cycle 0: A[0,0]=1 enters MAC[0,0] Cycle 1: A[0,0] → MAC[0,1], A[1,0]=5 → MAC[0,0] B[0,0]=1 enters MAC[0,0] Cycle 2: A[0,0] → MAC[0,2], A[1,0] → MAC[0,1], A[2,0]=9 → MAC[0,0] B[0,0] → MAC[1,0], B[1,0]=0 → MAC[0,0] ... (After 6 cycles, all 16 A values and 16 B values are queued up)

Compute Phase (Cycles 7-22)

All 16 MACs performing multiplications simultaneously.

Cycle	MAC[0,0]	MAC[0,1]	MAC[0,2]	MAC[0,3]	Comment
7	1×1=1	1×0=0	2×0=0	3×0=0	1st column of A meets 1st row of B
8	2×0+1=1	2×1=2	3×1=3	4×1=4	2nd column of A meets row 2 of B
9	3×0+1=1	3×0=0	4×0=0	...	Partial sums accumulate down column
10	4×0+1=1	...	...	...	Final sum for C[0,0] = 1

Key Property: Wavefront Execution

"Wavefront" = diagonal of active computations moving through array.

Time T=5 (5 MAC operations active): [A] [A] [ ] [ ] [A] [A] [A] [ ] [ ] [A] [A] [A] [ ] [ ] [A] [A] Time T=8 (all 16 MACs active): [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] Time T=12 (still 16 active, draining output): [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] Time T=22 (last result exits): [ ] [ ] [ ] [ ] [ ] [ ] [ ] [A] [ ] [ ] [A] [A] [ ] [A] [A] [A]

Output Generation

C values exit through the bottom edge over cycles 10-22 (after pipeline fills).

Memory Access Pattern

A values: Each A[i,k] loaded once, used by MACs in row i across time (row-wise broadcast)
B values: Each B[k,j] loaded once, used by MACs in column j across time (column-wise broadcast)
Memory bandwidth: O(N) for N×N matrix (vs O(N²) for naive approach)

Implementation: SystemVerilog

module systolic_4x4 (
  input clk, reset,
  input [7:0] A_in [0:3],   // 4 A values per cycle (top edge)
  input [7:0] B_in [0:3],   // 4 B values per cycle (left edge)
  output [15:0] C_out [0:3] // C results (bottom edge)
);

  // Array of processing elements
  reg [7:0] a_mem [0:3][0:3];  // A pipeline registers
  reg [7:0] b_mem [0:3][0:3];  // B pipeline registers
  reg [15:0] c_mem [0:3][0:3]; // Partial sums (accumulators)

  always @(posedge clk) begin
    if (reset) begin
      // Clear all accumulators
      for (int i=0; i<4; i++)
        for (int j=0; j<4; j++)
          c_mem[i][j] <= 0;
    end else begin
      // Propagate A leftward through array
      for (int i=0; i<4; i++) begin
        a_mem[i][3] <= a_mem[i][2];
        a_mem[i][2] <= a_mem[i][1];
        a_mem[i][1] <= a_mem[i][0];
        a_mem[i][0] <= A_in[i];
      end

      // Propagate B downward through array
      for (int j=0; j<4; j++) begin
        b_mem[3][j] <= b_mem[2][j];
        b_mem[2][j] <= b_mem[1][j];
        b_mem[1][j] <= b_mem[0][j];
        b_mem[0][j] <= B_in[j];
      end

      // MAC operation: C[i,j] += A[i,k] * B[k,j]
      for (int i=0; i<4; i++)
        for (int j=0; j<4; j++)
          c_mem[i][j] <= c_mem[i][j] + (a_mem[i][j] * b_mem[i][j]);
    end
  end

  // Output C results
  assign C_out = c_mem[3]; // Bottom row outputs (partial sums flow out)
endmodule