Problem: Multiply 4×4 Matrix
Compute C[4,4] = A[4,4] @ B[4,4] using a 4×4 systolic array.
The Setup
Input A (4×4): Input B (4×4): Compute C = A @ B
[1 2 3 4] [1 0 0 0]
[5 6 7 8] [0 1 0 0]
[9 10 11 12] × [0 0 1 0] = ?
[13 14 15 16] [0 0 0 1]
Identity matrix, so C = A (simple case for visualization)
Systolic Array Hardware Layout
A inputs (left edge)
↓ ↓ ↓ ↓
[MAC] [MAC] [MAC] [MAC] → B outputs (right)
↓ ↓ ↓ ↓
[MAC] [MAC] [MAC] [MAC] → B outputs
↓ ↓ ↓ ↓
[MAC] [MAC] [MAC] [MAC] → B outputs
↓ ↓ ↓ ↓
[MAC] [MAC] [MAC] [MAC] → B outputs
↓
C outputs (bottom edge)
Data flow:
- A values enter from LEFT, flow RIGHT
- B values enter from TOP, flow DOWN
- Partial sums move RIGHT with data
- Final C values exit at BOTTOM
Cycle-by-Cycle Execution
Initialization: Pipelined loading of A (rows), B (columns)
Loading Phase (Cycles 0-6)
Before computation starts, we must load A and B values into the systolic array. They propagate through the pipeline:
Cycle 0: A[0,0]=1 enters MAC[0,0]
Cycle 1: A[0,0] → MAC[0,1], A[1,0]=5 → MAC[0,0]
B[0,0]=1 enters MAC[0,0]
Cycle 2: A[0,0] → MAC[0,2], A[1,0] → MAC[0,1], A[2,0]=9 → MAC[0,0]
B[0,0] → MAC[1,0], B[1,0]=0 → MAC[0,0]
...
(After 6 cycles, all 16 A values and 16 B values are queued up)
Compute Phase (Cycles 7-22)
All 16 MACs performing multiplications simultaneously.
| Cycle | MAC[0,0] | MAC[0,1] | MAC[0,2] | MAC[0,3] | Comment |
|---|---|---|---|---|---|
| 7 | 1×1=1 | 1×0=0 | 2×0=0 | 3×0=0 | 1st column of A meets 1st row of B |
| 8 | 2×0+1=1 | 2×1=2 | 3×1=3 | 4×1=4 | 2nd column of A meets row 2 of B |
| 9 | 3×0+1=1 | 3×0=0 | 4×0=0 | ... | Partial sums accumulate down column |
| 10 | 4×0+1=1 | ... | ... | ... | Final sum for C[0,0] = 1 |
Key Property: Wavefront Execution
"Wavefront" = diagonal of active computations moving through array.
Time T=5 (5 MAC operations active):
[A] [A] [ ] [ ]
[A] [A] [A] [ ]
[ ] [A] [A] [A]
[ ] [ ] [A] [A]
Time T=8 (all 16 MACs active):
[A] [A] [A] [A]
[A] [A] [A] [A]
[A] [A] [A] [A]
[A] [A] [A] [A]
Time T=12 (still 16 active, draining output):
[A] [A] [A] [A]
[A] [A] [A] [A]
[A] [A] [A] [A]
[A] [A] [A] [A]
Time T=22 (last result exits):
[ ] [ ] [ ] [ ]
[ ] [ ] [ ] [A]
[ ] [ ] [A] [A]
[ ] [A] [A] [A]
Output Generation
C values exit through the bottom edge over cycles 10-22 (after pipeline fills).
Memory Access Pattern
- A values: Each A[i,k] loaded once, used by MACs in row i across time (row-wise broadcast)
- B values: Each B[k,j] loaded once, used by MACs in column j across time (column-wise broadcast)
- Memory bandwidth: O(N) for N×N matrix (vs O(N²) for naive approach)
Implementation: SystemVerilog
module systolic_4x4 (
input clk, reset,
input [7:0] A_in [0:3], // 4 A values per cycle (top edge)
input [7:0] B_in [0:3], // 4 B values per cycle (left edge)
output [15:0] C_out [0:3] // C results (bottom edge)
);
// Array of processing elements
reg [7:0] a_mem [0:3][0:3]; // A pipeline registers
reg [7:0] b_mem [0:3][0:3]; // B pipeline registers
reg [15:0] c_mem [0:3][0:3]; // Partial sums (accumulators)
always @(posedge clk) begin
if (reset) begin
// Clear all accumulators
for (int i=0; i<4; i++)
for (int j=0; j<4; j++)
c_mem[i][j] <= 0;
end else begin
// Propagate A leftward through array
for (int i=0; i<4; i++) begin
a_mem[i][3] <= a_mem[i][2];
a_mem[i][2] <= a_mem[i][1];
a_mem[i][1] <= a_mem[i][0];
a_mem[i][0] <= A_in[i];
end
// Propagate B downward through array
for (int j=0; j<4; j++) begin
b_mem[3][j] <= b_mem[2][j];
b_mem[2][j] <= b_mem[1][j];
b_mem[1][j] <= b_mem[0][j];
b_mem[0][j] <= B_in[j];
end
// MAC operation: C[i,j] += A[i,k] * B[k,j]
for (int i=0; i<4; i++)
for (int j=0; j<4; j++)
c_mem[i][j] <= c_mem[i][j] + (a_mem[i][j] * b_mem[i][j]);
end
end
// Output C results
assign C_out = c_mem[3]; // Bottom row outputs (partial sums flow out)
endmodule