The computational heart of every neural network. Build a tiled 4×4 GEMM engine on FPGA with pipelined MAC arrays, DSP48 packing, output-stationary dataflow, and a complete Verilog implementation.
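Over 90% of the compute in modern deep networks is matrix multiplication (GEMM, General Matrix Multiply). Every major operation maps to it: fully connected layers are matrix products directly, convolutions become GEMM after an im2col rearrangement, and attention is a chain of matrix products. GEMM engines differ mainly in their dataflow, that is, which operand stays resident in the compute array while the others stream through: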
| Dataflow | What Stays Fixed | What Streams | Best For | Used In |
|---|---|---|---|---|
| Output-Stationary | Accumulator C[i][j] | Rows of A, Cols of B | Low output memory BW | Simple FPGA engines |
| Weight-Stationary | Weight tile (B) | Activations (A), outputs (C) | Weight reuse across batches | Google TPU, systolic arrays |
| Input-Stationary | Activation tile (A) | Weights (B), outputs (C) | High input reuse | Edge inference engines |
| Row-Stationary | One row of computation | All other data | Minimize all memory traffic | MIT Eyeriss accelerator |
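For our Day 3 implementation, we use output-stationary dataflow: each accumulator C[i][j] stays in place while elements of A and B stream past it. This is the most natural scheme to write in Verilog, and it maps cleanly onto the accumulator in each DSP48 slice.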
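As a sketch of what output-stationary means for the 4×4 tile, the update rule for each accumulator can be written out. The indices i, j, k and the tile depth K follow the Verilog parameters used in the engine; the superscript notation for the cycle count is our own.

$$
c_{ij}^{(0)} = 0, \qquad
c_{ij}^{(k+1)} = c_{ij}^{(k)} + A_{ik}\,B_{kj}, \qquad
0 \le i, j < 4,\quad 0 \le k < K
$$

After K valid cycles, each accumulator holds the finished dot product:

$$
c_{ij}^{(K)} = \sum_{k=0}^{K-1} A_{ik}\,B_{kj}
$$

Each cycle consumes column k of A (4 values) and row k of B (4 values), so 8 operand reads feed 16 MACs, and the 16 results leave the array only once per tile.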
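The DSP48E2 slice has a 27-bit × 18-bit signed multiplier. An INT8 × INT8 product needs only 16 bits, so two INT8 multiplications that share one operand can be packed into a single DSP slice: the two unshared operands occupy separate bit fields of the 27-bit input, and the two products are recovered from separate bit fields of the result.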
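The packing idea can be sketched behaviorally as follows. This is a minimal illustration, assuming a signed 27×18 multiplier and two products that share the operand `b`. The module name `int8_pack_mul`, its ports, and the 18-bit field offset are our illustrative choices and are not part of the tutorial's engine. Whether a synthesis tool maps this onto exactly one DSP slice is not guaranteed.

```verilog
// int8_pack_mul.v: two INT8 products from one 27x18 signed multiply.
// Hypothetical helper for illustration; not part of gemm_4x4.
module int8_pack_mul (
    input  wire signed [7:0]  a0,  // first INT8 operand
    input  wire signed [7:0]  a1,  // second INT8 operand
    input  wire signed [7:0]  b,   // operand shared by both products
    output wire signed [15:0] p0,  // a0 * b
    output wire signed [15:0] p1   // a1 * b
);
    // Sign-extend to the multiplier port widths.
    wire signed [26:0] a0_ext = a0;
    wire signed [26:0] a1_ext = a1;
    wire signed [17:0] b_ext  = b;

    // Pack: a1 sits 18 bits above a0, so packed_a = a1 * 2^18 + a0.
    wire signed [26:0] packed_a = (a1_ext <<< 18) + a0_ext;

    // One wide multiply: prod = (a1*b) * 2^18 + (a0*b).
    wire signed [44:0] prod = packed_a * b_ext;

    // The low field holds a0*b directly (it fits in 16 signed bits).
    assign p0 = prod[15:0];

    // A negative a0*b borrows 1 from the upper field. prod[17] is the
    // sign of the low field, so adding it back restores a1*b.
    assign p1 = prod[33:18] + prod[17];
endmodule

// Self-checking testbench: sweeps a0 and b exhaustively, a1 in steps of 17.
module int8_pack_mul_tb;
    reg  signed [7:0]  a0, a1, b;
    wire signed [15:0] p0, p1;
    integer ia0, ia1, ib, errors;

    int8_pack_mul dut (.a0(a0), .a1(a1), .b(b), .p0(p0), .p1(p1));

    initial begin
        errors = 0;
        for (ia0 = -128; ia0 < 128; ia0 = ia0 + 1)
            for (ia1 = -128; ia1 < 128; ia1 = ia1 + 17)
                for (ib = -128; ib < 128; ib = ib + 1) begin
                    a0 = ia0; a1 = ia1; b = ib;
                    #1;  // let the combinational outputs settle
                    if (p0 !== a0 * b || p1 !== a1 * b)
                        errors = errors + 1;
                end
        $display("errors = %0d", errors);
        $finish;
    end
endmodule
```

In the 4×4 engine, `b` would correspond to one `b_col[j]` value and `a0`, `a1` to two different `a_row` entries, so one multiplier would serve two of the 16 MACs. The sketch covers only the multiply: accumulating packed products inside the DSP additionally needs guard bits between the two fields, which is not addressed here.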
```verilog
// gemm_4x4.v: 4×4 output-stationary GEMM engine (INT8)
// Computes C[4][4] = A[4][K] × B[K][4], one k step per valid cycle.
// 16 parallel MAC units, one per output element.
// Note: unpacked array ports and i++ are SystemVerilog constructs,
// so compile this file in SystemVerilog mode.
module gemm_4x4 #(
    parameter K  = 64,  // inner dimension (tile depth)
    parameter DW = 8,   // data width (INT8)
    parameter AW = 32   // accumulator width
)(
    input  wire                 clk, rst_n,
    input  wire                 start,         // clear accumulators and begin
    input  wire signed [DW-1:0] a_row [0:3],   // A[i][k] for the current k, i = 0..3
    input  wire signed [DW-1:0] b_col [0:3],   // B[k][j] for the current k, j = 0..3
    input  wire                 valid_in,      // a_row/b_col are valid this cycle
    output reg  signed [AW-1:0] c [0:3][0:3],  // 4×4 output tile
    output reg                  done           // one-cycle pulse: tile complete
);
    // Inner-dimension (k) counter
    reg [$clog2(K):0] k_cnt;
    reg computing;
    integer i, j;

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            for (i = 0; i < 4; i++)
                for (j = 0; j < 4; j++)
                    c[i][j] <= 0;
            k_cnt     <= 0;
            computing <= 0;
            done      <= 0;
        end else begin
            done <= 0;
            if (start) begin
                // Clear accumulators and begin a new tile
                for (i = 0; i < 4; i++)
                    for (j = 0; j < 4; j++)
                        c[i][j] <= 0;
                k_cnt     <= 0;
                computing <= 1;
            end else if (computing && valid_in) begin
                // Accumulate: 16 MACs in parallel.
                // Operands are sign-extended to AW bits before the multiply.
                for (i = 0; i < 4; i++)
                    for (j = 0; j < 4; j++)
                        c[i][j] <= c[i][j] + ({{(AW-DW){a_row[i][DW-1]}}, a_row[i]}
                                            * {{(AW-DW){b_col[j][DW-1]}}, b_col[j]});
                k_cnt <= k_cnt + 1;
                if (k_cnt == K - 1) begin
                    computing <= 0;
                    done      <= 1;  // tile complete
                end
            end
        end
    end
endmodule
```

The testbench `gemm_4x4_tb.v` verifies that Identity × Data = Data:
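```verilog
`timescale 1ns/1ps
module gemm_4x4_tb;
    reg clk = 0, rst_n = 0, start = 0, valid_in = 0;
    reg  signed [7:0]  a_row [0:3], b_col [0:3];
    wire signed [31:0] c [0:3][0:3];
    wire done;

    gemm_4x4 #(.K(4)) uut (.*);

    always #5 clk = ~clk;

    integer i, j, errors;

    initial begin
        rst_n = 0; #20; rst_n = 1;
        // A = Identity 4×4, B = [[1,2,3,4],[5,6,7,8],...]
        // Expected: C = B (identity × B = B)
        // Inputs change on the falling edge so the DUT samples stable
        // values on the rising edge (avoids a testbench/DUT race).
        @(negedge clk); start = 1;
        @(negedge clk); start = 0;
        for (i = 0; i < 4; i++) begin
            // Step k = i: column i of A and row i of B
            a_row[0] = (i == 0) ? 1 : 0;  a_row[1] = (i == 1) ? 1 : 0;
            a_row[2] = (i == 2) ? 1 : 0;  a_row[3] = (i == 3) ? 1 : 0;
            b_col[0] = i*4 + 1;  b_col[1] = i*4 + 2;
            b_col[2] = i*4 + 3;  b_col[3] = i*4 + 4;
            valid_in = 1;
            @(negedge clk);
        end
        valid_in = 0;
        wait (done); #10;
        // Spot checks
        $display("C[0][0]=%0d (exp 1)",  c[0][0]);
        $display("C[1][1]=%0d (exp 6)",  c[1][1]);
        $display("C[3][3]=%0d (exp 16)", c[3][3]);
        // Full check: every C[i][j] must equal B[i][j] = 4*i + j + 1
        errors = 0;
        for (i = 0; i < 4; i++)
            for (j = 0; j < 4; j++)
                if (c[i][j] !== 4*i + j + 1) errors = errors + 1;
        $display("%0d mismatches", errors);
        $finish;
    end
endmodule
```

| Configuration | Array Size | DSPs Used | TOPS | BRAM (tiles) | Freq |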
|---|---|---|---|---|---|
| Small (this tutorial) | 4×4 | 16 | 0.0064 | 2 BRAM18 | 200 MHz |
| Medium | 16×16 | 256 | 0.102 | 8 BRAM36 | 200 MHz |
| Large | 32×32 | 1024 | 0.41 | 32 BRAM36 | 200 MHz |
| Max Alveo U250 | 64×64 | 4096 | 1.64 | 128 BRAM36 | 200 MHz |
| Max + DSP packing | 64×64 | 2048 | 1.64 | 128 BRAM36 | 200 MHz |
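
The TOPS column can be reproduced with the usual peak-throughput estimate, which counts each MAC as two operations (one multiply, one add). That convention is our assumption, inferred from the 16×16 through 64×64 rows:

$$
\text{TOPS}_{\text{peak}} = \frac{2 \cdot N_{\text{MAC}} \cdot f_{\text{clk}}}{10^{12}}
$$

For the 16×16 array at 200 MHz:

$$
\frac{2 \cdot 256 \cdot 200 \times 10^{6}}{10^{12}} = 0.1024 \approx 0.102
$$

For the 64×64 array:

$$
\frac{2 \cdot 4096 \cdot 200 \times 10^{6}}{10^{12}} = 1.6384 \approx 1.64
$$

These are peak figures. Sustained throughput also depends on keeping the array fed from BRAM every cycle. With DSP packing, each DSP slice contributes two MACs, so N_MAC can be up to twice the DSP count.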
Next — Day 4: Systolic Array Architecture — the dataflow engine inside Google's TPU, weight-stationary computation, and an 8×8 systolic array in Verilog.