The core operation of every CNN-based vision model. Build a streaming 2D convolution engine on FPGA with line buffers, sliding-window logic, kernel weight storage, im2col, and a complete 3×3 conv2D in Verilog.
Convolution slides a small kernel (filter) across an input image, computing a weighted sum at each position. It's the operation that lets CNNs detect edges, textures, and ultimately objects. On FPGA, the challenge is doing this at one output pixel per clock without re-reading the image from slow external memory.
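Written out for the 3×3 case, the weighted sum looks like the expression below. This is a sketch under stated assumptions: stride 1, no padding, and the usual CNN convention of not flipping the kernel (strictly a cross-correlation). The symbols x (input image), k (kernel), y (output), H and W (image height and width) are notation introduced here, not names from the original text.

```latex
y[r,c] \;=\; \sum_{i=0}^{2}\sum_{j=0}^{2} k[i,j]\; x[r+i,\,c+j],
\qquad 0 \le r \le H-3,\quad 0 \le c \le W-3
```

Each output pixel therefore costs 9 multiplies and 8 additions, and under these assumptions a 224×224 input produces a 222×222 output.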
A 3×3 convolution needs 9 pixels at once — but the image streams in one pixel per clock, row by row. The sliding window holds the current 3×3 patch and shifts right each cycle.
The problem: a 3×3 window touches 3 different rows, but the image arrives one row at a time. The solution: two line buffers store the previous two rows in on-chip memory (typically BRAM) so all 3 rows are available simultaneously.
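As a rough illustration of a single line buffer, here is a minimal sketch written so that synthesis tools can map it to block RAM, which generally requires a registered (synchronous) read. The module name `line_buffer` and its ports `en`, `col`, `din`, and `dout` are hypothetical names chosen for this example.

```verilog
// line_buffer.v: one image row of delay, coded for block-RAM inference.
// Hypothetical helper for illustration only.
module line_buffer #(
    parameter IMG_W = 224,  // pixels per row
    parameter DW    = 8     // bits per pixel
)(
    input  wire                     clk,
    input  wire                     en,    // advance by one pixel
    input  wire [$clog2(IMG_W)-1:0] col,   // current column index
    input  wire signed [DW-1:0]     din,   // pixel from the current row
    output reg  signed [DW-1:0]     dout   // same column, one row earlier
);
    reg signed [DW-1:0] mem [0:IMG_W-1];

    always @(posedge clk) begin
        if (en) begin
            dout     <= mem[col];  // read-before-write: value stored one row ago
            mem[col] <= din;       // overwrite with the current row's pixel
        end
    end
endmodule
```

Because the read is registered, `dout` appears one clock after `din` was presented, so a consumer has to delay the current-row pixel by one cycle to keep the rows aligned. Chaining two such buffers gives the three rows a 3×3 window needs.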
This streaming 3×3 conv engine uses two line buffers, a 3×3 shift-register window, and 9 parallel MACs.
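// conv3x3.v: Streaming 3×3 Convolution Engine (INT8, single channel)
// The unpacked array port `kernel` needs SystemVerilog; for plain Verilog-2001,
// flatten it into one 9*DW-bit vector.
module conv3x3 #(
    parameter IMG_W = 224,  // image width
    parameter DW    = 8     // data width
)(
    input  wire                   clk, rst_n,
    input  wire                   pix_valid,     // input pixel valid
    input  wire signed [DW-1:0]   pix_in,        // streaming pixel
    input  wire signed [DW-1:0]   kernel [0:8],  // 9 kernel weights, row-major
    output reg  signed [2*DW+4:0] conv_out,      // convolution result
    output reg                    out_valid
);
    // Two line buffers (store previous 2 rows).
    // The asynchronous reads below typically infer distributed (LUT) RAM;
    // block RAM needs a registered read and one extra cycle of alignment.
    reg signed [DW-1:0] lb1 [0:IMG_W-1];  // row N-2
    reg signed [DW-1:0] lb2 [0:IMG_W-1];  // row N-1
    reg [$clog2(IMG_W):0] col;
    reg [1:0] row;        // saturates at 2: both line buffers hold real rows
    reg       win_valid;  // window registers hold a complete 3×3 patch

    // 3×3 window shift registers (3 taps per row)
    reg signed [DW-1:0] w00,w01,w02, w10,w11,w12, w20,w21,w22;
    wire signed [DW-1:0] r0 = lb1[col];  // top row pixel
    wire signed [DW-1:0] r1 = lb2[col];  // middle row pixel
    wire signed [DW-1:0] r2 = pix_in;    // current row pixel

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            col <= 0; row <= 0; win_valid <= 0; out_valid <= 0;
        end else begin
            // Stage 2: 9-tap MAC over the registered window. The result appears
            // one clock after the window fills, so out_valid follows win_valid.
            conv_out <= w00*kernel[0] + w01*kernel[1] + w02*kernel[2]
                      + w10*kernel[3] + w11*kernel[4] + w12*kernel[5]
                      + w20*kernel[6] + w21*kernel[7] + w22*kernel[8];
            out_valid <= win_valid;
            // Stage 1: the window is complete once three columns of the current
            // row are shifted in (col >= 2) and two full rows are buffered.
            win_valid <= pix_valid && (col >= 2) && (row == 2);
            if (pix_valid) begin
                // shift window contents left, bring in new column (r0,r1,r2);
                // the window itself moves one pixel right across the image
                w00<=w01; w01<=w02; w02<=r0;
                w10<=w11; w11<=w12; w12<=r1;
                w20<=w21; w21<=w22; w22<=r2;
                // update line buffers: lb1 gets old lb2, lb2 gets new pixel
                lb1[col] <= lb2[col];
                lb2[col] <= pix_in;
                if (col == IMG_W-1) begin
                    col <= 0;
                    if (row != 2) row <= row + 1;  // no frame sync: reset between images
                end else begin
                    col <= col + 1;
                end
            end
        end
    end
endmodule

im2col unrolls each convolution window into a column, turning the whole layer into one big matrix multiply, which lets you reuse the Day 3 GEMM engine or Day 4 systolic array.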
| Approach | Memory | Hardware Reuse | Best For |
|---|---|---|---|
| Line-buffer (direct) | Low (K−1 rows) | Dedicated conv engine | Streaming, edge, low latency |
| im2col + GEMM | High (pixel duplication) | Reuses systolic/GEMM array | Datacenter, large batch |
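To put a number on the "pixel duplication" entry in the table, here is a rough worked example. It assumes a single-channel 224×224 image (the width matches IMG_W in the listing; the height is an assumption), a 3×3 kernel, stride 1, and no padding.

```latex
\begin{aligned}
\text{input pixels}    &= 224 \times 224 = 50{,}176 \\
\text{windows}         &= 222 \times 222 = 49{,}284 \\
\text{im2col entries}  &= 9 \times 49{,}284 = 443{,}556 \\
\text{duplication}     &= 443{,}556 \,/\, 50{,}176 \approx 8.84\times
\end{aligned}
```

By comparison, the line-buffer approach under the same assumptions stores only 2 × 224 = 448 pixels, plus the 9 window registers.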
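Real CNN layers have many input and output channels. For each output channel, one single-channel engine with its own 3×3 kernel runs per input channel, and the per-channel results are summed into a single output pixel. The whole arrangement is then repeated for every output channel.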
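The accumulation across input channels can be sketched as a small adder stage. This is a minimal illustration, not the series' actual multi-channel design: the module name `chan_accum`, the parameters `C_IN` and `PW`, and the flattened input `partial_flat` are hypothetical, and it assumes all per-channel results arrive in the same cycle.

```verilog
// chan_accum.v: sum per-channel 3x3 results into one output-channel pixel.
// Hypothetical helper; names and widths are assumptions for illustration.
module chan_accum #(
    parameter C_IN = 3,   // number of input channels (assumed)
    parameter PW   = 21   // width of one conv3x3 result (2*DW+5 for DW = 8)
)(
    input  wire                              clk,
    input  wire                              in_valid,      // all partials valid together
    input  wire [C_IN*PW-1:0]                partial_flat,  // channel c in bits [c*PW +: PW]
    output reg  signed [PW+$clog2(C_IN)-1:0] sum_out,       // grows by log2(C_IN) bits
    output reg                               out_valid
);
    integer c;
    reg signed [PW+$clog2(C_IN)-1:0] acc;

    // Combinational sum of all channels, sign-extending each partial result
    always @(*) begin
        acc = 0;
        for (c = 0; c < C_IN; c = c + 1)
            acc = acc + $signed(partial_flat[c*PW +: PW]);
    end

    // Register the sum and pass the valid flag along with it
    always @(posedge clk) begin
        sum_out   <= acc;
        out_valid <= in_valid;
    end
endmodule
```

For large channel counts, a pipelined adder tree or a time-multiplexed accumulator would likely be a better fit than this single-cycle sum.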
The conv engine feeds its output into the activation unit (Day 7) and pooling (Day 8). Weights and feature maps are streamed via the memory architecture from Day 6. This is how the full CNN pipeline (Day 9) comes together.
Next — Day 6: Memory Architecture — BRAM vs DDR4, bandwidth bottleneck analysis, ping-pong buffering, and the AXI4 interface for streaming weights and activations.