HomeDay 8

4×4 Systolic Array Deep Dive

Cycle-by-cycle execution of a real systolic array computing C = A @ B. Understand the heartbeat of GPU/TPU acceleration.

Problem: Multiply 4×4 Matrix

Compute C[4,4] = A[4,4] @ B[4,4] using a 4×4 systolic array.

The Setup

Input A (4×4): Input B (4×4): Compute C = A @ B [1 2 3 4] [1 0 0 0] [5 6 7 8] [0 1 0 0] [9 10 11 12] × [0 0 1 0] = ? [13 14 15 16] [0 0 0 1] Identity matrix, so C = A (simple case for visualization)

Systolic Array Hardware Layout

A inputs (left edge) ↓ ↓ ↓ ↓ [MAC] [MAC] [MAC] [MAC] → B outputs (right) ↓ ↓ ↓ ↓ [MAC] [MAC] [MAC] [MAC] → B outputs ↓ ↓ ↓ ↓ [MAC] [MAC] [MAC] [MAC] → B outputs ↓ ↓ ↓ ↓ [MAC] [MAC] [MAC] [MAC] → B outputs ↓ C outputs (bottom edge) Data flow: - A values enter from LEFT, flow RIGHT - B values enter from TOP, flow DOWN - Partial sums move RIGHT with data - Final C values exit at BOTTOM

Cycle-by-Cycle Execution

Initialization: Pipelined loading of A (rows), B (columns)

Loading Phase (Cycles 0-6)

Before computation starts, we must load A and B values into the systolic array. They propagate through the pipeline:

Cycle 0: A[0,0]=1 enters MAC[0,0] Cycle 1: A[0,0] → MAC[0,1], A[1,0]=5 → MAC[0,0] B[0,0]=1 enters MAC[0,0] Cycle 2: A[0,0] → MAC[0,2], A[1,0] → MAC[0,1], A[2,0]=9 → MAC[0,0] B[0,0] → MAC[1,0], B[1,0]=0 → MAC[0,0] ... (After 6 cycles, all 16 A values and 16 B values are queued up)

Compute Phase (Cycles 7-22)

All 16 MACs performing multiplications simultaneously.

CycleMAC[0,0]MAC[0,1]MAC[0,2]MAC[0,3]Comment
71×1=11×0=02×0=03×0=01st column of A meets 1st row of B
82×0+1=12×1=23×1=34×1=42nd column of A meets row 2 of B
93×0+1=13×0=04×0=0...Partial sums accumulate down column
104×0+1=1.........Final sum for C[0,0] = 1

Key Property: Wavefront Execution

"Wavefront" = diagonal of active computations moving through array.

Time T=5 (5 MAC operations active): [A] [A] [ ] [ ] [A] [A] [A] [ ] [ ] [A] [A] [A] [ ] [ ] [A] [A] Time T=8 (all 16 MACs active): [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] Time T=12 (still 16 active, draining output): [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] [A] Time T=22 (last result exits): [ ] [ ] [ ] [ ] [ ] [ ] [ ] [A] [ ] [ ] [A] [A] [ ] [A] [A] [A]

Output Generation

C values exit through the bottom edge over cycles 10-22 (after pipeline fills).

Memory Access Pattern

Implementation: SystemVerilog

module systolic_4x4 ( input clk, reset, input [7:0] A_in [0:3], // 4 A values per cycle (top edge) input [7:0] B_in [0:3], // 4 B values per cycle (left edge) output [15:0] C_out [0:3] // C results (bottom edge) ); // Array of processing elements reg [7:0] a_mem [0:3][0:3]; // A pipeline registers reg [7:0] b_mem [0:3][0:3]; // B pipeline registers reg [15:0] c_mem [0:3][0:3]; // Partial sums (accumulators) always @(posedge clk) begin if (reset) begin // Clear all accumulators for (int i=0; i<4; i++) for (int j=0; j<4; j++) c_mem[i][j] <= 0; end else begin // Propagate A leftward through array for (int i=0; i<4; i++) begin a_mem[i][3] <= a_mem[i][2]; a_mem[i][2] <= a_mem[i][1]; a_mem[i][1] <= a_mem[i][0]; a_mem[i][0] <= A_in[i]; end // Propagate B downward through array for (int j=0; j<4; j++) begin b_mem[3][j] <= b_mem[2][j]; b_mem[2][j] <= b_mem[1][j]; b_mem[1][j] <= b_mem[0][j]; b_mem[0][j] <= B_in[j]; end // MAC operation: C[i,j] += A[i,k] * B[k,j] for (int i=0; i<4; i++) for (int j=0; j<4; j++) c_mem[i][j] <= c_mem[i][j] + (a_mem[i][j] * b_mem[i][j]); end end // Output C results assign C_out = c_mem[3]; // Bottom row outputs (partial sums flow out) endmodule