
Convolution Engine for CNNs

The core of every vision model. Build a streaming 2D convolution engine on FPGA with line buffers, sliding-window logic, kernel weight storage, im2col, and a complete 3×3 conv2D in Verilog.

By EcrioniX Engineering Team · Published June 14, 2026 · ~4,600 words · 14 min read

1. What is 2D Convolution?

Convolution slides a small kernel (filter) across an input image, computing a weighted sum at each position. It's the operation that lets CNNs detect edges, textures, and ultimately objects. On FPGA, the challenge is doing this at one output pixel per clock without re-reading the image from slow external memory.

2D Convolution (single channel, 3×3 kernel):

    out[y][x] = Σ(i=0..2) Σ(j=0..2) in[y+i][x+j] × kernel[i][j]

    → 9 multiply-accumulates per output pixel

Full layer MAC count:

    MACs = K×K × Cin × Cout × H_out × W_out

Example (3×3, 64→64, 56×56 output):

    MACs = 9 × 64 × 64 × 3136 ≈ 116 million MACs per layer
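
To make the formula and the MAC count concrete, here is a minimal pure-Python reference model. It is only a sketch: the names `conv2d_3x3` and `layer_macs` are illustrative and not part of the course code, and it assumes a "valid" convolution (no padding), since border handling is not specified above.

```python
def conv2d_3x3(image, kernel):
    """Valid (no padding) 3x3 convolution over a 2D list of ints.

    out[y][x] = sum over i, j in 0..2 of image[y+i][x+j] * kernel[i][j]
    """
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - 2):
        row = []
        for x in range(w - 2):
            acc = 0
            for i in range(3):
                for j in range(3):
                    acc += image[y + i][x + j] * kernel[i][j]  # one MAC
            row.append(acc)
        out.append(row)
    return out


def layer_macs(k, c_in, c_out, h_out, w_out):
    """MACs = K*K * Cin * Cout * H_out * W_out."""
    return k * k * c_in * c_out * h_out * w_out


if __name__ == "__main__":
    image = [[1, 2, 3, 4],
             [5, 6, 7, 8],
             [9, 10, 11, 12],
             [13, 14, 15, 16]]
    identity = [[0, 0, 0],
                [0, 1, 0],
                [0, 0, 0]]
    print(conv2d_3x3(image, identity))    # [[6, 7], [10, 11]]
    print(layer_macs(3, 64, 64, 56, 56))  # 115605504
```

With the identity kernel each output equals the centre pixel of its window, which makes the indexing easy to check by eye. The exact MAC count for the example layer is 115,605,504, which rounds to the 116 million quoted above.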

2. The Sliding Window

A 3×3 convolution needs 9 pixels at once — but the image streams in one pixel per clock, row by row. The sliding window holds the current 3×3 patch and shifts right each cycle.

[Figure: 3×3 sliding window over the input image. The window slides right by 1 pixel per clock; the kernel's 9 weights are stored in BRAM; each window position yields 1 output pixel = Σ(window × kernel).]

3. Line Buffers — The Key to Streaming

The problem: a 3×3 window touches 3 different rows, but the image arrives one row at a time. The solution: line buffers store the previous rows in on-chip BRAM so all 3 rows are available simultaneously.

[Figure: Line buffer architecture (3×3 conv). Components: pixel stream in; Line Buffer 1 (BRAM) storing row N−2; Line Buffer 2 (BRAM) storing row N−1; current row N taken directly from the stream; a 3×3 window register whose 9 pixels feed the MAC array for 1 output per clock.]
Result: for a K×K kernel you need K−1 line buffers. Each holds one image row (width × bytes). After the buffers fill, the engine produces one output pixel every clock cycle — true streaming.
Line Buffer Sizing:

    Buffers needed   = K − 1 (for K×K kernel)
    Each buffer size = image_width × bytes_per_pixel

Example (3×3 kernel, 224-wide image, INT8):

    Buffers:   2
    Size each: 224 × 1 byte = 224 bytes → 1 BRAM18 each
    Total:     2 BRAM18 blocks (tiny!)

Latency before first output:

    ≈ (K−1) × image_width + K cycles
    = 2 × 224 + 3 = 451 cycles (then 1 output/clock)
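
The line-buffer scheme can be modelled in software before committing it to hardware. The following Python sketch is a behavioural model only, not cycle-accurate RTL: it consumes pixels in raster order, keeps two row buffers plus a 3×3 window, and emits one result per full window. The names `stream_conv3x3` and `line_buffer_budget` are hypothetical, and skipping the border (no padding) is an assumption.

```python
def stream_conv3x3(pixels, img_w, kernel):
    """Consume pixels in raster order; yield one result per full 3x3 window."""
    lb1 = [0] * img_w                   # line buffer 1: row N-2
    lb2 = [0] * img_w                   # line buffer 2: row N-1
    win = [[0] * 3 for _ in range(3)]   # win[r][c]; column 2 is the newest
    col = row = 0
    for pix in pixels:
        column = (lb1[col], lb2[col], pix)  # top, middle, current row
        for r in range(3):                  # shift window left, insert new column
            win[r] = [win[r][1], win[r][2], column[r]]
        lb1[col], lb2[col] = lb2[col], pix  # lb1 takes old lb2, lb2 takes new pixel
        if row >= 2 and col >= 2:           # window holds 9 real pixels
            yield sum(win[r][c] * kernel[r][c]
                      for r in range(3) for c in range(3))
        col += 1
        if col == img_w:
            col, row = 0, row + 1


def line_buffer_budget(k, img_w, bytes_per_pixel=1):
    """Return (buffers, bytes per buffer, pixels consumed before first output)."""
    return k - 1, img_w * bytes_per_pixel, (k - 1) * img_w + k


if __name__ == "__main__":
    flat = list(range(1, 17))   # 4x4 image, values 1..16, raster order
    identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
    print(list(stream_conv3x3(flat, 4, identity)))  # [6, 7, 10, 11]
    print(line_buffer_budget(3, 224))               # (2, 224, 451)
```

In this model the first result appears when the pixel at row 2, column 2 arrives, that is after 2 × 224 + 3 = 451 pixels for a 224-wide image, which agrees with the latency estimate above. Real hardware adds a few pipeline cycles on top.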

4. Convolution Engine in Verilog

This streaming 3×3 conv engine uses two line buffers, a 3×3 shift-register window, and 9 parallel MACs.

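    // conv3x3.v: Streaming 3×3 Convolution Engine (INT8, single channel)
    // Verilog-2005 (uses $clog2). A result appears 2 clocks after the pixel
    // that completes its window. Output is "valid" convolution (no padding).
    module conv3x3 #(
        parameter IMG_W = 224,  // image width
        parameter IMG_H = 224,  // image height (lets border logic restart each frame)
        parameter DW    = 8     // data width
    )(
        input  wire                   clk,
        input  wire                   rst_n,
        input  wire                   pix_valid,   // input pixel valid
        input  wire signed [DW-1:0]   pix_in,      // streaming pixel
        input  wire [9*DW-1:0]        kernel_flat, // 9 weights; weight i in bits [i*DW +: DW]
        output reg  signed [2*DW+4:0] conv_out,    // convolution result
        output reg                    out_valid
    );
        // Unpack weights into signed nets (a bare part-select would be unsigned)
        wire signed [DW-1:0] k [0:8];
        genvar g;
        generate
            for (g = 0; g < 9; g = g + 1) begin : unpack
                assign k[g] = kernel_flat[g*DW +: DW];
            end
        endgenerate

        // Two line buffers (store previous 2 rows).
        // Reads are asynchronous, so tools typically map these to LUT RAM;
        // BRAM inference needs a registered read.
        reg signed [DW-1:0] lb1 [0:IMG_W-1];  // row N-2
        reg signed [DW-1:0] lb2 [0:IMG_W-1];  // row N-1
        reg [$clog2(IMG_W)-1:0] col;
        reg [$clog2(IMG_H)-1:0] row;

        // 3×3 window shift registers (3 taps per row)
        reg signed [DW-1:0] w00,w01,w02, w10,w11,w12, w20,w21,w22;
        reg win_valid;  // window currently holds 9 real pixels

        wire signed [DW-1:0] r0 = lb1[col];  // top row pixel
        wire signed [DW-1:0] r1 = lb2[col];  // middle row pixel
        wire signed [DW-1:0] r2 = pix_in;    // current row pixel

        // 9-tap MAC over the current window contents
        wire signed [2*DW+4:0] mac_sum =
            w00*k[0] + w01*k[1] + w02*k[2] +
            w10*k[3] + w11*k[4] + w12*k[5] +
            w20*k[6] + w21*k[7] + w22*k[8];

        // Datapath (no reset): window shift, line-buffer update, output register
        always @(posedge clk) begin
            if (pix_valid) begin
                // shift window left, bring in new column (r0,r1,r2)
                w00<=w01; w01<=w02; w02<=r0;
                w10<=w11; w11<=w12; w12<=r1;
                w20<=w21; w21<=w22; w22<=r2;
                // update line buffers: lb1 gets old lb2, lb2 gets new pixel
                lb1[col] <= lb2[col];
                lb2[col] <= pix_in;
            end
            conv_out <= mac_sum;
        end

        // Control: position counters and valid pipeline
        always @(posedge clk or negedge rst_n) begin
            if (!rst_n) begin
                col <= 0; row <= 0; win_valid <= 1'b0; out_valid <= 1'b0;
            end else begin
                // skip border warm-up: need 2 full rows and 3 columns of this row
                win_valid <= pix_valid && (row >= 2) && (col >= 2);
                out_valid <= win_valid;  // aligned with conv_out
                if (pix_valid) begin
                    if (col == IMG_W-1) begin
                        col <= 0;
                        row <= (row == IMG_H-1) ? 0 : row + 1;
                    end else begin
                        col <= col + 1;
                    end
                end
            end
        end
    endmodule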
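
The `conv_out` port is declared `[2*DW+4:0]`, which is 21 bits for DW = 8. A short worst-case calculation suggests this is enough for a 9-tap INT8 sum. The sketch below is a hypothetical helper (`acc_bits_needed` is not part of the course code) and assumes two's-complement inputs and weights that can both reach their most negative value.

```python
def acc_bits_needed(data_bits, taps):
    """Smallest signed width that holds a sum of `taps` products of two
    signed `data_bits`-wide values, with no overflow."""
    lo = -(1 << (data_bits - 1))   # most negative operand, e.g. -128
    worst = taps * lo * lo         # largest possible sum (all products positive)
    bits = 1
    while worst > (1 << (bits - 1)) - 1:
        bits += 1
    return bits


if __name__ == "__main__":
    print(acc_bits_needed(8, 9))       # 19
    print(acc_bits_needed(8, 9 * 64))  # 25
```

For a single-channel 3×3 INT8 window, 19 bits are required, so the 21-bit port leaves 2 bits of headroom. Accumulating 64 input channels into one sum would need 25 bits, so a multi-channel accumulator has to be wider than this port.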

5. The im2col Alternative

im2col unrolls each convolution window into a column, turning the whole layer into one big matrix multiply — letting you reuse the Day 3 GEMM engine or Day 4 systolic array.

[Figure: im2col, convolution as GEMM. Input windows are unrolled into an im2col matrix (9 × N_windows); weights (Cout × 9) × im2col matrix = output feature map. This reuses the GEMM/systolic hardware but duplicates overlapping pixels (memory cost).]
Approach             | Memory                   | Hardware Reuse             | Best For
Line-buffer (direct) | Low (K−1 rows)           | Dedicated conv engine      | Streaming, edge, low latency
im2col + GEMM        | High (pixel duplication) | Reuses systolic/GEMM array | Datacenter, large batch
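
The im2col transformation and its memory cost can be illustrated in a few lines of pure Python. This is a sketch under assumptions: single input channel, 3×3 kernel, no padding, and the helper names `im2col_3x3` and `matmul` are hypothetical stand-ins for whatever GEMM engine is actually used.

```python
def im2col_3x3(image):
    """Return a 9 x N_windows matrix; column n is the flattened n-th 3x3 window."""
    h, w = len(image), len(image[0])
    cols = [[] for _ in range(9)]
    for y in range(h - 2):
        for x in range(w - 2):
            for i in range(3):
                for j in range(3):
                    cols[3 * i + j].append(image[y + i][x + j])
    return cols


def matmul(a, b):
    """Plain matrix multiply: (M x K) times (K x N) gives (M x N)."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))]
            for r in range(len(a))]


if __name__ == "__main__":
    image = [[1, 2, 3, 4],
             [5, 6, 7, 8],
             [9, 10, 11, 12],
             [13, 14, 15, 16]]
    weights = [[0, 0, 0, 0, 1, 0, 0, 0, 0]]   # Cout x 9, one identity filter
    cols = im2col_3x3(image)
    print(matmul(weights, cols))              # [[6, 7, 10, 11]]
    print(sum(len(r) for r in cols))          # 36 values stored for 16 pixels
```

Each row of `weights` is one flattened filter, so the product has shape Cout × N_windows. Even on this 4×4 image the unrolled matrix holds 36 values for 16 input pixels; for large images the duplication approaches 9× for a 3×3 kernel.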

6. Multi-Channel Convolution

Real CNN layers have many input and output channels. The single-channel engine above is replicated and accumulated across input channels.

Multi-channel conv (per output pixel, per output channel):

    out[co][y][x] = Σ(ci=0..Cin-1) conv3x3(in[ci], kernel[co][ci])

Hardware options:

    1. Channel parallelism: Cin conv engines in parallel → sum
    2. Channel sequencing: 1 engine, accumulate over Cin cycles
    3. Output parallelism: Cout engines for different filters

Resource estimate (3×3, 64 input channels, INT8):

    Full parallel: 64 × 9 = 576 MACs (576 DSPs)
    With DSP packing (2 INT8/DSP): 288 DSPs
    Alveo U250 has 12,288 DSPs → plenty of room
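
The accumulation across input channels and the DSP arithmetic can be checked with a small Python sketch. It mirrors the channel-sequencing option (one pass per input channel, accumulated into the same output plane). The names `conv_multichannel` and `dsp_estimate` are illustrative, no padding is assumed, and the estimate counts only the multipliers for one output channel.

```python
def conv_multichannel(inputs, kernels):
    """inputs[ci][y][x], kernels[co][ci][i][j] -> out[co][y][x] (valid 3x3)."""
    c_in = len(inputs)
    h, w = len(inputs[0]), len(inputs[0][0])
    out = []
    for k_co in kernels:                         # one filter set per output channel
        plane = [[0] * (w - 2) for _ in range(h - 2)]
        for ci in range(c_in):                   # accumulate over input channels
            for y in range(h - 2):
                for x in range(w - 2):
                    plane[y][x] += sum(
                        inputs[ci][y + i][x + j] * k_co[ci][i][j]
                        for i in range(3) for j in range(3))
        out.append(plane)
    return out


def dsp_estimate(c_in, k=3, macs_per_dsp=1):
    """Return (parallel MACs, DSP slices) for one output channel."""
    macs = c_in * k * k
    return macs, -(-macs // macs_per_dsp)        # ceiling division


if __name__ == "__main__":
    ones = [[1] * 3 for _ in range(3)]
    twos = [[2] * 3 for _ in range(3)]
    identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
    # 2 input channels, 1 output channel: 9*1 (box filter) + 2 (centre pixel)
    print(conv_multichannel([ones, twos], [[ones, identity]]))  # [[[11]]]
    print(dsp_estimate(64))                  # (576, 576)
    print(dsp_estimate(64, macs_per_dsp=2))  # (576, 288)
```

The two `dsp_estimate` calls reproduce the 576 and 288 figures in the resource estimate. Adding output parallelism multiplies both by the number of output channels computed at once.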

Connecting the Course

The conv engine feeds its output into the activation unit (Day 7) and pooling (Day 8). Weights and feature maps are streamed via the memory architecture from Day 6. This is how the full CNN pipeline (Day 9) comes together.

Day 5 — Key Takeaways

Next — Day 6: Memory Architecture — BRAM vs DDR4, bandwidth bottleneck analysis, ping-pong buffering, and the AXI4 interface for streaming weights and activations.
