HomeFPGA Neural NetworkDay 3 — Matrix Multiply Accelerator

Matrix Multiply
Accelerator (GEMM)

The computational heart of every neural network. Build a tiled 4×4 GEMM engine on FPGA with pipelined MAC arrays, DSP48 packing, output-stationary dataflow, and a complete Verilog implementation.

By EcrioniX Engineering Team · Published June 14, 2026 · ~4,800 words · 15 min read

1. Why GEMM is the Core of Neural Networks

Over 90% of the compute in modern deep networks is matrix multiplication (GEMM — General Matrix Multiply). Every major operation maps to it:

🧠
Fully Connected
Y = W × X + b
Direct GEMM
🔲
Convolution
im2col → GEMM
~95% of CNN ops
🔄
Attention
Q×Kᵀ, ×V
3 GEMMs per head
GEMM Definition: C = A × B + C (where A, B, C are matrices) For matrices: A[M×K], B[K×N] → C[M×N] Each element: C[i][j] = Σ(k=0 to K-1) A[i][k] × B[k][j] Total operations = M × N × K multiply-accumulates (MACs) ResNet-50 GEMM workload: Layer M K N MACs Conv1 (7×7) 64 147 3136 28.9M FC-1000 1000 2048 1 2.05M Total MACs: ~4.1 Billion per image That's 4.1 billion INT8 MACs per ResNet-50 inference. At 100 MHz with 1,024 MACs/cycle: 40µs compute time.

2. Naive vs Tiled Matrix Multiply

Naive Implementation — Memory Bottleneck

❌ Naive GEMM — Poor Memory Reuse
A [M×K] Row i (loaded K times) row i+1 row i+2 B [K×N] Col j (K loads) C [M×N] C[i,j] Problem: A[i][k] loaded M×N×K times total from DDR! For 256×256 matrix: 16M DDR reads — bandwidth limited, not compute limited

Tiled GEMM — The FPGA Way

✅ Tiled GEMM — High Data Reuse in BRAM
Matrix A (DDR) Tile A Matrix B (DDR) Tile B BRAM (On-chip) Tile A copy × Tile B copy = Tile C accum → 100% reuse! Matrix C (DDR) Tile C
Key insight: Each tile (e.g. 16×16 INT8 = 256 bytes) loaded once from DDR, then reused for all MAC operations in BRAM at full speed. Arithmetic intensity jumps from ~1 OPS/byte to 32+ OPS/byte — firmly compute-bound.

3. Dataflow Strategies

DataflowWhat Stays FixedWhat StreamsBest ForUsed In
Output-StationaryAccumulator C[i][j]Rows of A, Cols of BLow output memory BWSimple FPGA engines
Weight-StationaryWeight tile (B)Activations (A), outputs (C)Weight reuse across batchesGoogle TPU, systolic arrays
Input-StationaryActivation tile (A)Weights (B), outputs (C)High input reuseEdge inference engines
Row-StationaryOne row of computationAll other dataMinimize all memory trafficMIT Eyeriss accelerator

For our Day 3 implementation, we use output-stationary dataflow — it's the most natural to implement in Verilog and maps cleanly to DSP58 accumulators.

4. 4×4 MAC Array Architecture

4×4 MAC Array — Parallel Output-Stationary GEMM
B col 0 B col 1 B col 2 B col 3 A row 0 A row 1 A row 2 A row 3 MAC[0,0] C[0][0] MAC[0,1] C[0][1] MAC[0,2] C[0][2] MAC[0,3] C[0][3] MAC[1,0] C[1][0] MAC[1,1] C[1][1] MAC[1,2] C[1][2] MAC[1,3] C[1][3] MAC[2,0..3] — 4 parallel accumulators MAC[3,0..3] — 4 parallel accumulators A rows (broadcast horizontally) B cols (broadcast vertically)
16 parallel MACs compute all 16 output elements of a 4×4 tile simultaneously. Each MAC accumulates K partial products. Total throughput: 16 MACs/cycle → for K=256, tile completes in 256 cycles.

5. Throughput Analysis

GEMM Engine Throughput: Array size: 4×4 = 16 MAC units Clock frequency: 200 MHz (achievable on Xilinx UltraScale+) MACs per cycle: 16 Throughput: 16 × 200M = 3.2 GMAC/s = 3.2 TOPS (INT8) DSP utilization: 16 DSP58E2 blocks (out of 12,288 on Alveo U250) → Only 0.13% of DSPs used! Scale-up potential: 32×32 array: 1024 MACs × 200MHz = 204.8 TOPS 64×64 array: 4096 MACs × 200MHz = 819.2 TOPS (Limited by BRAM for tile storage and routing) Optimal array size for Alveo U250: Rule of thumb: √(BRAM_size / element_size) √(54MB / 1 byte) = √(56.6M) ≈ 7,500 → 64×64 is feasible (4,096 DSPs, 2MB BRAM for tiles) Efficiency metric: TOPS / DSP = 3.2 TOPS / 16 DSPs = 0.2 TOPS/DSP Target: maximize this ratio → pack 2 INT8 MACs per DSP (next section)

6. DSP48 Packing — 2× Throughput Trick

The DSP58E2 has a 27-bit × 18-bit multiplier. Since INT8 only needs 8 bits, we can pack two INT8 multiplications into a single DSP block using careful bit-field arrangement.

DSP48 Packing — 2 INT8 MACs per DSP
❌ Unpacked (1 MAC/DSP) a[7:0] used 27-bit input — 19 bits WASTED b[7:0] 10 wasted ✅ Packed (2 MACs/DSP) a1[8:0] hi a0[8:0] lo 9 bits guard b1[7:0] b0[7:0] Packed product[47:0]: hi 24 bits = a1×b1, lo 24 bits = a0×b0 (extract with bit select) Result: 2× INT8 MACs with 1 DSP block → doubles effective throughput!

7. Complete 4×4 GEMM Engine in Verilog

// gemm_4x4.v — 4×4 Output-Stationary GEMM Engine (INT8) // Computes C[4][4] += A[4][K] × B[K][4] // 16 parallel MAC units, one per output element module gemm_4x4 #( parameter K = 64, // inner dimension (tile size) parameter DW = 8, // data width (INT8) parameter AW = 32 // accumulator width )( input wire clk, rst_n, input wire start, // begin computation input wire signed [DW-1:0] a_row [0:3], // A row broadcast (4 elements) input wire signed [DW-1:0] b_col [0:3], // B col broadcast (4 elements) input wire valid_in, // data valid output reg signed [AW-1:0] c [0:3][0:3], // 4×4 output tile output reg done // computation complete ); // Inner product counters reg [$clog2(K):0] k_cnt; reg computing; integer i, j; always @(posedge clk or negedge rst_n) begin if (!rst_n) begin for (i = 0; i < 4; i++) for (j = 0; j < 4; j++) c[i][j] <= 0; k_cnt <= 0; computing <= 0; done <= 0; end else begin done <= 0; if (start) begin // Clear accumulators, begin for (i = 0; i < 4; i++) for (j = 0; j < 4; j++) c[i][j] <= 0; k_cnt <= 0; computing <= 1; end else if (computing && valid_in) begin // Accumulate: 16 MACs in parallel for (i = 0; i < 4; i++) for (j = 0; j < 4; j++) c[i][j] <= c[i][j] + ({{(AW-DW){a_row[i][DW-1]}}, a_row[i]} * {{(AW-DW){b_col[j][DW-1]}}, b_col[j]}); k_cnt <= k_cnt + 1; if (k_cnt == K - 1) begin computing <= 0; done <= 1; // tile complete end end end end endmodule

Testbench — Verify 4×4 GEMM

// gemm_4x4_tb.v — Verify: Identity × Data = Data `timescale 1ns/1ps module gemm_4x4_tb; reg clk=0, rst_n=0, start=0, valid_in=0; reg signed [7:0] a_row[0:3], b_col[0:3]; wire signed [31:0] c[0:3][0:3]; wire done; gemm_4x4 #(.K(4)) uut(.*); always #5 clk=~clk; integer i; initial begin rst_n=0; #20; rst_n=1; #10; // A = Identity 4×4, B = [[1,2,3,4],[5,6,7,8],...] // Expected: C = B (identity × B = B) start=1; @(posedge clk); start=0; for (i=0; i<4; i++) begin a_row[0]=i==0?1:0; a_row[1]=i==1?1:0; a_row[2]=i==2?1:0; a_row[3]=i==3?1:0; b_col[0]=i*4+1; b_col[1]=i*4+2; b_col[2]=i*4+3; b_col[3]=i*4+4; valid_in=1; @(posedge clk); end valid_in=0; wait(done); #10; $display("C[0][0]=%0d (exp 1)", c[0][0]); $display("C[1][1]=%0d (exp 6)", c[1][1]); $display("C[3][3]=%0d (exp 16)", c[3][3]); $finish; end endmodule

8. Tiling Strategy for Large Matrices

Tiling Loop Structure — Outer Controller
for mt = 0 to M/4 - 1:   // tile rows of A,C
for nt = 0 to N/4 - 1:   // tile cols of B,C
for kt = 0 to K/T - 1:   // tile inner dim
Load A_tile[mt][kt] → BRAM_A  // DDR read once
Load B_tile[kt][nt] → BRAM_B  // DDR read once
gemm_4x4.start()           // compute tile
wait(done); C_tile += result  // accumulate
Reuse ratio: Each tile of A is reused N/4 times (across all column tiles). Each tile of B reused M/4 times. Total DDR bandwidth: O(M×K + K×N) not O(M×K×N) — massive savings!

9. Performance Numbers

ConfigurationArray SizeDSPs UsedTOPSBRAM (tiles)Freq
Small (this tutorial)4×4160.00322 BRAM18200 MHz
Medium16×162560.1028 BRAM36200 MHz
Large32×3210240.4132 BRAM36200 MHz
Max Alveo U25064×6440961.64128 BRAM36200 MHz
Max + DSP packing64×6420483.28128 BRAM36200 MHz

Day 3 — Key Takeaways

Next — Day 4: Systolic Array Architecture — the dataflow engine inside Google's TPU, weight-stationary computation, and an 8×8 systolic array in Verilog.

← Previous
Day 2: Fixed-Point & Quantization
Next →
Day 4: Systolic Array