Why is matrix multiplication the core of neural networks?

Every fully-connected layer computes Y = W×X + b, a matrix multiplication followed by a bias add. Convolution can be converted to GEMM using the im2col transformation. Attention computes Q×Kᵀ and then multiplies the resulting scores by V, both of which are matrix multiplications. More than 90% of the compute in modern deep networks is GEMM, which makes an efficient GEMM engine the most important hardware block to design.

What is tiled matrix multiplication on FPGA?

Tiled GEMM divides large matrices into small tiles that fit in on-chip BRAM. Each tile is loaded once and reused for multiple dot products, increasing arithmetic intensity and reducing DDR bandwidth pressure. For example, a 16×16 tile of INT8 values is 256 bytes, which fits in one 18 Kb BRAM18 block, and each of its elements is reused 16 times when the tile is multiplied against another 16×16 tile.

How does DSP48 packing work for INT8 GEMM?

The Xilinx DSP48E2 has a 27×18-bit multiplier. By packing two INT8 values into the 27-bit input (one of them shifted left by 18 bits) and multiplying by a shared INT8 value on the 18-bit input, two INT8 multiplications that share one operand can be performed in a single DSP block. The two products are then separated from the wide result with a small sign correction, which roughly doubles throughput per DSP block.

What is output-stationary dataflow?

In output-stationary dataflow, each output element C[i][j] stays in a register (accumulator) while the corresponding row of A and column of B stream through. This minimizes writes to memory and reuses the accumulator register for the entire dot product computation before writing the final result.

Matrix Multiply Accelerator on FPGA — GEMM Engine in Verilog

1. Why GEMM is the Core of Neural Networks

Over 90% of the compute in modern deep networks is matrix multiplication (GEMM — General Matrix Multiply). Every major operation maps to it:

Fully Connected: Y = W × X + b (direct GEMM)
Convolution: im2col → GEMM (~95% of CNN ops)
Attention: Q×Kᵀ, ×V (3 GEMMs per head)
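To make the im2col mapping concrete, here is a minimal pure-Python sketch. It assumes a single-channel image, stride 1 and no padding, and the helper names (`im2col`, `matmul`, `conv2d_direct`, `conv2d_gemm`) are illustrative rather than taken from any library. The point is only that the unfolded image times the flattened kernel gives the same result as a sliding-window convolution.

```python
def im2col(image, kh, kw):
    """Unfold a 2-D image into a matrix with one column per output pixel.

    Assumptions for brevity: single channel, stride 1, no padding.
    """
    h, w = len(image), len(image[0])
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = []  # one row per kernel tap (dy, dx), one column per output pixel
    for dy in range(kh):
        for dx in range(kw):
            cols.append([image[y + dy][x + dx]
                         for y in range(out_h) for x in range(out_w)])
    return cols, out_h, out_w


def matmul(a, b):
    """Plain GEMM: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def conv2d_direct(image, kernel):
    """Sliding-window correlation, used only as a cross-check."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[y + dy][x + dx] * kernel[dy][dx]
                 for dy in range(kh) for dx in range(kw))
             for x in range(out_w)] for y in range(out_h)]


def conv2d_gemm(image, kernel):
    """Same convolution expressed as a single GEMM on the unfolded image."""
    kh, kw = len(kernel), len(kernel[0])
    cols, out_h, out_w = im2col(image, kh, kw)
    weights = [[kernel[dy][dx] for dy in range(kh) for dx in range(kw)]]  # 1 x (kh*kw)
    flat = matmul(weights, cols)[0]                                       # 1 x (out_h*out_w)
    return [flat[y * out_w:(y + 1) * out_w] for y in range(out_h)]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
print(conv2d_gemm(image, kernel))  # [[6, 8], [12, 14]]
```

With several filters, `weights` simply gains one row per filter, which is where the M dimension of the GEMM comes from.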

GEMM Definition: C = A × B + C (where A, B, C are matrices)

For matrices: A[M×K], B[K×N] → C[M×N]
Each element: C[i][j] = Σ(k=0 to K-1) A[i][k] × B[k][j]
Total operations = M × N × K multiply-accumulates (MACs)

ResNet-50 GEMM workload:

Layer	M	K	N	MACs
Conv1 (7×7)	64	147	12544	118.0M
FC-1000	1000	2048	1	2.05M

(For Conv1, K = 7×7×3 input values per filter and N = 112×112 output pixels.)

Total MACs: ~4.1 billion per image. That's 4.1 billion INT8 MACs per ResNet-50 inference. At 100 MHz with 1,024 MACs/cycle (102.4 GMAC/s), that is about 40 ms of compute time, assuming every MAC unit is busy on every cycle.
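The MAC counts and the compute-time estimate follow directly from M × K × N. The sketch below redoes that arithmetic; the function names are made up for illustration, the Conv1 shape assumes a 112×112 output map, and the time estimate assumes 100% utilization of the MAC array.

```python
def gemm_macs(m, k, n):
    """Multiply-accumulates needed for A[M x K] times B[K x N]."""
    return m * k * n


def compute_time_s(total_macs, macs_per_cycle, clock_hz):
    """Ideal compute time, assuming every MAC unit is busy every cycle."""
    return total_macs / (macs_per_cycle * clock_hz)


fc_macs = gemm_macs(1000, 2048, 1)       # 2,048,000
conv1_macs = gemm_macs(64, 147, 12544)   # 118,013,952 (64 filters, 7*7*3 taps, 112*112 pixels)

# Whole network: ~4.1e9 MACs on 1,024 MACs/cycle at 100 MHz
t = compute_time_s(4.1e9, 1024, 100e6)
print(f"{t * 1e3:.1f} ms")               # 40.0 ms
```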

2. Naive vs Tiled Matrix Multiply

Naive Implementation — Memory Bottleneck

❌ Naive GEMM — Poor Memory Reuse

Tiled GEMM — The FPGA Way

✅ Tiled GEMM — High Data Reuse in BRAM

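Key insight: Each tile (e.g. 16×16 INT8 = 256 bytes) is loaded once from DDR, then reused for all of its MAC operations out of BRAM at full speed. Arithmetic intensity rises from ~1 OPS/byte for the naive loop to roughly T OPS/byte for T×T tiles: about 16 OPS/byte at 16×16 and 32+ OPS/byte at 32×32 and larger, which is usually enough to make the engine compute-bound rather than DDR-bound.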
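The arithmetic-intensity figures can be checked with a small traffic model. This is a simplified sketch under stated assumptions: INT8 operands (1 byte each), each MAC counted as 2 operations, only A and B reads counted (C writes ignored), and matrix dimensions that divide evenly by the tile size. The function names are illustrative.

```python
def naive_bytes(m, k, n):
    """Operand bytes read when every MAC fetches A[i][k] and B[k][j] from DDR."""
    return 2 * m * k * n


def tiled_bytes(m, k, n, t):
    """Operand bytes read when one T x T tile of A and of B is loaded per tile product."""
    tile_products = (m // t) * (n // t) * (k // t)
    return tile_products * 2 * t * t


def ops_per_byte(m, k, n, bytes_read):
    """Arithmetic intensity, counting one MAC as two operations."""
    return 2 * m * k * n / bytes_read


M = K = N = 64
print(ops_per_byte(M, K, N, naive_bytes(M, K, N)))      # 1.0
print(ops_per_byte(M, K, N, tiled_bytes(M, K, N, 16)))  # 16.0
print(ops_per_byte(M, K, N, tiled_bytes(M, K, N, 32)))  # 32.0
```

Under this model the intensity equals the tile edge T, so the tile size is the main knob for trading BRAM capacity against DDR bandwidth.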

3. Dataflow Strategies

Dataflow	What Stays Fixed	What Streams	Best For	Used In
Output-Stationary	Accumulator C[i][j]	Rows of A, Cols of B	Low output memory BW	Simple FPGA engines
Weight-Stationary	Weight tile (B)	Activations (A), outputs (C)	Weight reuse across batches	Google TPU, systolic arrays
Input-Stationary	Activation tile (A)	Weights (B), outputs (C)	High input reuse	Edge inference engines
Row-Stationary	One row of computation	All other data	Minimize all memory traffic	MIT Eyeriss accelerator

For our Day 3 implementation, we use output-stationary dataflow: it is the most natural to implement in Verilog and maps cleanly to the accumulator in each DSP48E2 block.

4. 4×4 MAC Array Architecture

4×4 MAC Array — Parallel Output-Stationary GEMM

16 parallel MACs compute all 16 output elements of a 4×4 tile simultaneously. Each MAC accumulates K partial products. Total throughput: 16 MACs/cycle → for K=256, tile completes in 256 cycles.
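A cycle-level software model of the array is a handy golden reference before writing any RTL. The sketch below is an assumption-laden model, not the hardware itself: on each step k it receives the four values A[0..3][k] and the four values B[k][0..3], updates all 16 accumulators, and counts one cycle per step. The name `mac_array_4x4` is hypothetical.

```python
def mac_array_4x4(a_stream, b_stream):
    """Output-stationary 4x4 array: 16 accumulators stay put, operands stream past.

    a_stream[k] holds A[0..3][k]; b_stream[k] holds B[k][0..3].
    Returns the 4x4 accumulator tile and the number of cycles used.
    """
    acc = [[0] * 4 for _ in range(4)]
    cycles = 0
    for a_col, b_row in zip(a_stream, b_stream):
        for i in range(4):          # in hardware these two loops are fully parallel
            for j in range(4):
                acc[i][j] += a_col[i] * b_row[j]
        cycles += 1
    return acc, cycles


# Identity times B should return B after K = 4 cycles.
a_stream = [[1 if i == k else 0 for i in range(4)] for k in range(4)]
b_stream = [[4 * k + j + 1 for j in range(4)] for k in range(4)]
tile, cycles = mac_array_4x4(a_stream, b_stream)
print(tile[1][1], tile[3][3], cycles)  # 6 16 4
```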

5. Throughput Analysis

GEMM Engine Throughput:

Array size: 4×4 = 16 MAC units
Clock frequency: 200 MHz (achievable on Xilinx UltraScale+)
MACs per cycle: 16
Throughput: 16 × 200M = 3.2 GMAC/s = 6.4 GOPS = 0.0064 TOPS (INT8, counting each MAC as 2 operations)
DSP utilization: 16 DSP48E2 blocks (out of 12,288 on Alveo U250) → only 0.13% of DSPs used!

Scale-up potential:

32×32 array: 1,024 MACs × 200 MHz = 204.8 GMAC/s ≈ 0.41 TOPS
64×64 array: 4,096 MACs × 200 MHz = 819.2 GMAC/s ≈ 1.64 TOPS
(Limited by BRAM for tile storage and routing)

Optimal array size for Alveo U250:

Rule of thumb: √(BRAM_size / element_size)
√(54 MB / 1 byte) = √(56.6M) ≈ 7,500
→ 64×64 is feasible (4,096 DSPs, 2 MB BRAM for tiles)

Efficiency metric:

TOPS / DSP = 0.0064 TOPS / 16 DSPs = 0.0004 TOPS/DSP (0.4 GOPS per DSP)
Target: maximize this ratio → pack 2 INT8 MACs per DSP (next section)
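Peak throughput is just array size times clock rate, so it is easy to script. The sketch below assumes each MAC counts as 2 operations and that every MAC unit is busy every cycle; `peak_tops` and its parameters are illustrative names, and the results are theoretical peaks rather than measured numbers.

```python
def peak_tops(rows, cols, clock_hz, macs_per_dsp=1, ops_per_mac=2):
    """Theoretical peak in tera-operations per second for a rows x cols DSP array."""
    macs_per_cycle = rows * cols * macs_per_dsp
    return macs_per_cycle * clock_hz * ops_per_mac / 1e12


def dsp_utilization(dsps_used, dsps_available=12288):
    """Fraction of the device's DSP blocks in use (default: 12,288 as on Alveo U250)."""
    return dsps_used / dsps_available


for size in (4, 16, 32, 64):
    print(f"{size}x{size}: {peak_tops(size, size, 200e6):.4f} TOPS")
# 4x4: 0.0064 TOPS
# 16x16: 0.1024 TOPS
# 32x32: 0.4096 TOPS
# 64x64: 1.6384 TOPS

print(f"{dsp_utilization(16) * 100:.2f}% of DSPs")  # 0.13% of DSPs
```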

6. DSP48 Packing — 2× Throughput Trick

The DSP48E2 has a 27-bit × 18-bit multiplier. Since an INT8 operand needs only 8 bits, two INT8 multiplications that share one operand can be packed into a single DSP block using careful bit-field arrangement.

DSP48 Packing — 2 INT8 MACs per DSP
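The bit-field arithmetic behind packing can be checked in software before committing to RTL. The sketch below models one possible arrangement, stated here as an assumption rather than as the article's exact circuit: operand `a` sits 18 bits above operand `d` on the wide multiplier input, the shared operand `c` drives the narrow input, and a one-bit correction recovers the upper product when the lower product is negative. The name `packed_mul` is hypothetical.

```python
def packed_mul(a, d, c):
    """Compute a*c and d*c with one wide multiply (a, d, c are signed INT8 values).

    Models (a << 18) + d on the wide multiplier input and c on the narrow input.
    """
    packed = (a << 18) + d            # fits in 27 signed bits for INT8 a and d
    p = packed * c                    # the single multiply one DSP block would perform

    low = p & 0x3FFFF                 # lower 18 bits hold d*c in two's complement
    if low & 0x20000:                 # sign-extend the 18-bit field
        low -= 1 << 18

    # A negative lower product borrows 1 from the upper field, so add it back.
    high = (p >> 18) + (1 if low < 0 else 0)
    return high, low


print(packed_mul(7, -3, 5))        # (35, -15)
print(packed_mul(-128, -128, -128))  # (16384, 16384)
```

The 18-bit spacing leaves guard bits above the 16-bit product, which is what keeps the two results from overlapping. Accumulating several packed products before unpacking consumes those guard bits, so the number of accumulation steps between corrections has to be bounded.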

7. Complete 4×4 GEMM Engine in Verilog

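// gemm_4x4.sv: 4×4 Output-Stationary GEMM Engine (INT8)
// Computes C[4][4] = A[4][K] × B[K][4] (accumulators are cleared on start)
// 16 parallel MAC units, one per output element
// Uses unpacked array ports and i++, so compile as SystemVerilog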

module gemm_4x4 #(
  parameter K     = 64,   // inner dimension (tile size)
  parameter DW    = 8,    // data width (INT8)
  parameter AW    = 32    // accumulator width
)(
  input  wire        clk, rst_n,
  input  wire        start,           // begin computation
  input  wire signed [DW-1:0] a_row [0:3], // a_row[i] = A[i][k]: one A element per output row at step k
  input  wire signed [DW-1:0] b_col [0:3], // b_col[j] = B[k][j]: one B element per output column at step k
  input  wire        valid_in,        // data valid
  output reg  signed [AW-1:0] c [0:3][0:3], // 4×4 output tile
  output reg         done             // computation complete
);

  // Inner product counters
  reg [$clog2(K):0] k_cnt;
  reg computing;

  integer i, j;

  always @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      for (i = 0; i < 4; i++) for (j = 0; j < 4; j++) c[i][j] <= 0;
      k_cnt     <= 0;
      computing <= 0;
      done      <= 0;
    end else begin
      done <= 0;
      if (start) begin
        // Clear accumulators, begin
        for (i = 0; i < 4; i++) for (j = 0; j < 4; j++) c[i][j] <= 0;
        k_cnt     <= 0;
        computing <= 1;
      end else if (computing && valid_in) begin
        // Accumulate: 16 MACs in parallel
        for (i = 0; i < 4; i++)
          for (j = 0; j < 4; j++)
            // All operands are signed, so they are sign-extended to AW bits here
            c[i][j] <= c[i][j] + a_row[i] * b_col[j];
        k_cnt <= k_cnt + 1;
        if (k_cnt == K - 1) begin
          computing <= 0;
          done      <= 1;  // tile complete
        end
      end
    end
  end

endmodule
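The `AW` parameter has to be wide enough for the worst-case sum of K products. The sketch below works that bound out for signed operands; it assumes the worst case is K products of (-2^(DW-1))², which is 16,384 per product for INT8, and the helper names are illustrative.

```python
def accumulator_bits(k, dw=8):
    """Signed accumulator width needed to sum k products of dw-bit signed operands."""
    worst = k * (1 << (dw - 1)) ** 2   # k * (-128 * -128) for INT8
    return worst.bit_length() + 1      # +1 for the sign bit


def max_inner_dim(aw=32, dw=8):
    """Largest k whose worst-case sum still fits in an aw-bit signed accumulator."""
    return ((1 << (aw - 1)) - 1) // (1 << (dw - 1)) ** 2


print(accumulator_bits(64))   # 22
print(max_inner_dim(32))      # 131071
```

For K = 64 a 22-bit accumulator would already suffice, so the 32-bit default leaves generous headroom for accumulating across many inner tiles.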

Testbench — Verify 4×4 GEMM

// gemm_4x4_tb.sv: Verify Identity × Data = Data (compile as SystemVerilog)
`timescale 1ns/1ps
module gemm_4x4_tb;
  reg clk=0, rst_n=0, start=0, valid_in=0;
  reg signed [7:0] a_row[0:3], b_col[0:3];
  wire signed [31:0] c[0:3][0:3];
  wire done;

  gemm_4x4 #(.K(4)) uut(.*);
  always #5 clk=~clk;

  integer i;
  initial begin
    rst_n=0; #20; rst_n=1;
    // A = Identity 4×4, B = [[1,2,3,4],[5,6,7,8],...]
    // Expected: C = B (identity × B = B)
    // Drive stimulus on the falling edge so it is stable at the DUT's rising
    // edge; changing it at the rising edge races with the DUT's sampling
    @(negedge clk); start=1;
    @(negedge clk); start=0;
    for (i=0; i<4; i++) begin
      a_row[0]=i==0?1:0; a_row[1]=i==1?1:0;
      a_row[2]=i==2?1:0; a_row[3]=i==3?1:0;
      b_col[0]=i*4+1; b_col[1]=i*4+2;
      b_col[2]=i*4+3; b_col[3]=i*4+4;
      valid_in=1; @(negedge clk);
    end
    valid_in=0;
    wait(done); #10;
    $display("C[0][0]=%0d (exp 1)", c[0][0]);
    $display("C[1][1]=%0d (exp 6)", c[1][1]);
    $display("C[3][3]=%0d (exp 16)", c[3][3]);
    $finish;
  end
endmodule

8. Tiling Strategy for Large Matrices

Tiling Loop Structure — Outer Controller

for mt = 0 to M/4 - 1:                  // tile rows of A,C
  for nt = 0 to N/4 - 1:                // tile cols of B,C
    for kt = 0 to K/T - 1:              // tile inner dim
      Load A_tile[mt][kt] → BRAM_A      // DDR read once
      Load B_tile[kt][nt] → BRAM_B      // DDR read once
      gemm_4x4.start()                  // compute tile
      wait(done); C_tile += result      // accumulate

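Reuse ratio: Each tile of A is used N/4 times (once per column tile) and each tile of B is used M/4 times. If tiles are re-fetched from DDR on every use, as in the loop written above, operand traffic is about M×K×N/2 bytes for INT8, already 4× less than the naive 2×M×K×N. If an A row panel and the B matrix stay resident in BRAM across iterations, traffic approaches O(M×K + K×N), a massive saving over the naive O(M×K×N).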
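The outer controller can be prototyped in software to check both the result and the DDR traffic. The sketch below mirrors the three-level tile loop with 4×4 output tiles and an inner tile depth `t_k`. It assumes all dimensions divide evenly, that operands are 1 byte each, and that tiles are re-fetched on every use (no panel buffering); `tiled_gemm` and `reference_gemm` are illustrative names.

```python
def reference_gemm(a, b):
    """Straightforward C = A x B, used as the golden result."""
    return [[sum(a[i][x] * b[x][j] for x in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def tiled_gemm(a, b, t_k):
    """Tile loop with 4x4 output tiles; returns (C, operand bytes read from DDR)."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    ddr_bytes = 0
    for mt in range(0, m, 4):                 # tile rows of A, C
        for nt in range(0, n, 4):             # tile cols of B, C
            for kt in range(0, k, t_k):       # tile inner dimension
                a_tile = [row[kt:kt + t_k] for row in a[mt:mt + 4]]   # 4 x t_k
                b_tile = [row[nt:nt + 4] for row in b[kt:kt + t_k]]   # t_k x 4
                ddr_bytes += 4 * t_k + t_k * 4                        # INT8: 1 byte each
                for i in range(4):            # stands in for one gemm_4x4 run
                    for j in range(4):
                        c[mt + i][nt + j] += sum(a_tile[i][x] * b_tile[x][j]
                                                 for x in range(t_k))
    return c, ddr_bytes


a = [[(i + j) % 5 - 2 for j in range(8)] for i in range(8)]
b = [[(2 * i + j) % 7 - 3 for j in range(8)] for i in range(8)]
c, traffic = tiled_gemm(a, b, t_k=4)
print(c == reference_gemm(a, b), traffic)  # True 256
```

For this 8×8×8 case the naive loop would read 2×8×8×8 = 1,024 operand bytes, the tiled loop reads 256, and the ideal with full on-chip buffering is 8×8 + 8×8 = 128.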

9. Performance Numbers

Configuration	Array Size	DSPs Used	TOPS	BRAM (tiles)	Freq
Small (this tutorial)	4×4	16	0.0064	2 BRAM18	200 MHz
Medium	16×16	256	0.102	8 BRAM36	200 MHz
Large	32×32	1024	0.41	32 BRAM36	200 MHz
Max Alveo U250	64×64	4096	1.64	128 BRAM36	200 MHz
Max + DSP packing	64×64 (2 MACs per DSP)	4096	3.28	128 BRAM36	200 MHz

Day 3 — Key Takeaways

✅ GEMM is >90% of neural network compute — mastering it is the key skill
✅ Naive GEMM is memory-bound; tiled GEMM moves to compute-bound regime
✅ Output-stationary dataflow — keep C[i][j] in accumulator, stream A rows and B cols
✅ 4×4 MAC array computes 16 output elements in parallel per cycle
✅ DSP48 packing doubles throughput: 2 INT8 MACs per DSP block
✅ Tiling outer loop controller orchestrates DDR → BRAM → compute → writeback
✅ Arithmetic intensity rises from ~1 OPS/byte to roughly T OPS/byte for T×T tiles (32+ at 32×32 and above)
✅ A 64×64 DSP array with packing (4,096 DSPs, 2 MACs each) peaks at a theoretical 3.28 TOPS on Alveo U250 at 200 MHz

Next — Day 4: Systolic Array Architecture — the dataflow engine inside Google's TPU, weight-stationary computation, and an 8×8 systolic array in Verilog.
