What is a line buffer in an FPGA convolution engine?

A line buffer is on-chip BRAM that stores the most recent rows of an input image so a 2D convolution window can access multiple rows simultaneously. For a 3×3 kernel you need 2 line buffers (plus the current row), letting the engine produce one output pixel per clock as the image streams in, without re-reading rows from external DDR.

What is im2col and why is it used on FPGA?

im2col (image-to-column) unrolls each convolution window into a column vector, turning convolution into a single large matrix multiply (GEMM). This lets a CNN reuse the highly optimized systolic/GEMM hardware. The trade-off is memory expansion — overlapping windows duplicate pixels — so direct line-buffer convolution is often preferred for streaming FPGA designs.

How many multiply-accumulates does a 3x3 convolution need?

A single 3×3 convolution window over one input channel needs 9 MACs. For a full layer with Cin input channels, Cout output channels, and an H×W output, total MACs = 9 × Cin × Cout × H × W. For example, a 3×3 layer with 64→64 channels on a 56×56 feature map needs 9 × 64 × 64 × 3136 ≈ 116 million MACs.

Convolution Engine for CNNs on FPGA — Line Buffers & im2col

1. What is 2D Convolution?

Convolution slides a small kernel (filter) across an input image, computing a weighted sum at each position. It's the operation that lets CNNs detect edges, textures, and ultimately objects. On FPGA, the challenge is doing this at one output pixel per clock without re-reading the image from slow external memory.

2D Convolution (single channel, 3×3 kernel):

  out[y][x] = Σ(i=0..2) Σ(j=0..2) in[y+i][x+j] × kernel[i][j]

  → 9 multiply-accumulates per output pixel

Full layer MAC count:

  MACs = K×K × Cin × Cout × H_out × W_out

Example (3×3, 64→64, 56×56 output):

  MACs = 9 × 64 × 64 × 3136 ≈ 116 million MACs per layer
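As a quick software cross-check of the formulas above, here is a minimal Python sketch. The function names `conv2d_valid` and `layer_macs` are illustrative and not part of the course code. It assumes CNN-style convolution (no kernel flip) with no padding, so a K×K kernel shrinks each dimension by K−1.

```python
def conv2d_valid(image, kernel):
    """Direct 2D convolution (CNN-style: no kernel flip, no padding)."""
    k = len(kernel)
    h_out = len(image) - k + 1
    w_out = len(image[0]) - k + 1
    out = [[0] * w_out for _ in range(h_out)]
    for y in range(h_out):
        for x in range(w_out):
            acc = 0
            for i in range(k):
                for j in range(k):
                    acc += image[y + i][x + j] * kernel[i][j]  # one MAC
            out[y][x] = acc
    return out


def layer_macs(k, c_in, c_out, h_out, w_out):
    """Total multiply-accumulates for one convolution layer."""
    return k * k * c_in * c_out * h_out * w_out


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
box = [[1, 1, 1]] * 3                  # 3x3 box filter: sums each window
print(conv2d_valid(image, box))        # [[54, 63], [90, 99]]
print(layer_macs(3, 64, 64, 56, 56))   # 115605504
```

The exact count, 115,605,504, is where the "≈ 116 million" figure comes from.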

2. The Sliding Window

A 3×3 convolution needs 9 pixels at once — but the image streams in one pixel per clock, row by row. The sliding window holds the current 3×3 patch and shifts right each cycle.

[Figure: 3×3 sliding window over input image]

3. Line Buffers — The Key to Streaming

The problem: a 3×3 window touches 3 different rows, but the image arrives one row at a time. The solution: line buffers store the previous rows in on-chip BRAM so all 3 rows are available simultaneously.

[Figure: Line buffer architecture (3×3 conv)]

Result: for a K×K kernel you need K−1 line buffers, each holding one image row (image_width × bytes_per_pixel). Once the buffers fill, the engine produces one output pixel every clock cycle, which is what makes the design truly streaming.
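The behaviour described above can be modelled in software before writing any RTL. The following is a minimal Python sketch, not the course's implementation: `stream_conv` is a hypothetical name, pixels are assumed to arrive in raster order, and no padding is applied. It keeps K−1 row buffers plus a K×K window and emits a result only when the window lies fully inside the image.

```python
def stream_conv(pixels, width, kernel):
    """Consume pixels in raster order; yield one result per full K x K window."""
    k = len(kernel)
    line_bufs = [[0] * width for _ in range(k - 1)]  # K-1 previous rows, oldest first
    window = [[0] * k for _ in range(k)]             # K x K shift-register window
    row = col = 0
    for pix in pixels:
        # Column entering the window: one pixel per buffered row, plus the new pixel
        column = [buf[col] for buf in line_bufs] + [pix]
        for r in range(k):
            window[r] = window[r][1:] + [column[r]]  # shift left, append new column
        # Age this column of the line buffers by one row, then store the new pixel
        for b in range(k - 2):
            line_bufs[b][col] = line_bufs[b + 1][col]
        if line_bufs:
            line_bufs[-1][col] = pix
        if row >= k - 1 and col >= k - 1:            # window fully inside the image
            yield sum(window[i][j] * kernel[i][j]
                      for i in range(k) for j in range(k))
        col += 1
        if col == width:
            col, row = 0, row + 1


# 4x4 image with values 1..16, 3x3 box filter
print(list(stream_conv(range(1, 17), 4, [[1, 1, 1]] * 3)))  # [54, 63, 90, 99]
```

Each pixel is read exactly once, which is the property that removes the need to re-read rows from external memory.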

Line Buffer Sizing:

  Buffers needed   = K − 1 (for K×K kernel)
  Each buffer size = image_width × bytes_per_pixel

Example (3×3 kernel, 224-wide image, INT8):

  Buffers:   2
  Size each: 224 × 1 byte = 224 bytes → 1 BRAM18 each
  Total:     2 BRAM18 blocks (tiny!)

Latency before first output:

  ≈ (K−1) × image_width + K cycles
  = 2 × 224 + 3 = 451 cycles (then 1 output/clock)
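The sizing rules above are easy to wrap in a small helper for trying other kernel sizes and image widths. This is a rough sketch: `line_buffer_budget` is an illustrative name, and `BRAM18_BYTES = 2048` is an assumption that each 18 Kb block is used in a 2K × 8-bit data configuration. Real mappings depend on the device and the synthesis tool.

```python
import math

BRAM18_BYTES = 2048  # assumption: one 18 Kb block used as 2K x 8-bit data


def line_buffer_budget(k, image_width, bytes_per_pixel=1):
    """Return (buffers, bytes per buffer, BRAM18 blocks, warm-up cycles)."""
    buffers = k - 1
    bytes_each = image_width * bytes_per_pixel
    bram18_total = buffers * math.ceil(bytes_each / BRAM18_BYTES)
    warmup_cycles = (k - 1) * image_width + k
    return buffers, bytes_each, bram18_total, warmup_cycles


print(line_buffer_budget(3, 224))    # (2, 224, 2, 451)
print(line_buffer_budget(5, 1920))   # (4, 1920, 4, 7685)
```

Even a 5×5 kernel on a 1920-wide INT8 stream stays at four blocks under this assumption, so line buffers are rarely the limiting resource.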

4. Convolution Engine in Verilog

This streaming 3×3 conv engine uses two line buffers, a 3×3 shift-register window, and 9 parallel MACs.

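// conv3x3.v: Streaming 3×3 Convolution Engine (INT8, single channel)
// No padding: outputs start at row 2, column 2 of the frame.
module conv3x3 #(
  parameter IMG_W = 224,   // image width
  parameter DW    = 8      // data width
)(
  input  wire                   clk, rst_n,
  input  wire                   pix_valid,     // input pixel valid
  input  wire signed [DW-1:0]   pix_in,        // streaming pixel (raster order)
  input  wire [9*DW-1:0]        kernel_flat,   // 9 weights; weight i in bits [i*DW +: DW]
  output reg  signed [2*DW+4:0] conv_out,      // convolution result
  output reg                    out_valid
);

  // Unpack the weights. Unpacked-array ports need SystemVerilog, so the
  // port is a flat vector to keep this file plain Verilog.
  wire signed [DW-1:0] kernel [0:8];
  genvar g;
  generate
    for (g = 0; g < 9; g = g + 1) begin : unpack
      assign kernel[g] = kernel_flat[g*DW +: DW];
    end
  endgenerate

  // Two line buffers (store previous 2 rows).
  // The combinational reads below usually infer distributed RAM; a
  // registered read (one more pipeline stage) is typically needed for BRAM.
  reg signed [DW-1:0] lb1 [0:IMG_W-1];   // row N-2
  reg signed [DW-1:0] lb2 [0:IMG_W-1];   // row N-1
  reg [$clog2(IMG_W):0] col;
  reg [1:0] rows_done;                   // completed rows, saturates at 2
  reg       win_valid;                   // window holds a full in-image patch

  // 3×3 window shift registers (3 taps per row)
  reg signed [DW-1:0] w00,w01,w02, w10,w11,w12, w20,w21,w22;

  wire signed [DW-1:0] r0 = lb1[col];  // top row pixel
  wire signed [DW-1:0] r1 = lb2[col];  // middle row pixel
  wire signed [DW-1:0] r2 = pix_in;    // current row pixel

  // Control: column/row tracking and the valid pipeline
  always @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      col <= 0; rows_done <= 0; win_valid <= 0; out_valid <= 0;
    end else begin
      // window is full once 2 rows are buffered and 3 columns are shifted in
      win_valid <= pix_valid && (rows_done == 2) && (col >= 2);
      out_valid <= win_valid;            // conv_out lags the window by 1 cycle
      if (pix_valid) begin
        col <= (col == IMG_W-1) ? 0 : col + 1;
        // saturating count: reset between frames (no start-of-frame input here)
        if (col == IMG_W-1 && rows_done != 2) rows_done <= rows_done + 1;
      end
    end
  end

  // Datapath (no reset needed; results are qualified by out_valid)
  always @(posedge clk) begin
    if (pix_valid) begin
      // shift window left, bring in new column (r0,r1,r2)
      w00<=w01; w01<=w02; w02<=r0;
      w10<=w11; w11<=w12; w12<=r1;
      w20<=w21; w21<=w22; w22<=r2;

      // update line buffers: lb1 gets old lb2, lb2 gets new pixel
      lb1[col] <= lb2[col];
      lb2[col] <= pix_in;
    end

    // 9-tap MAC over the registered window
    conv_out <= w00*kernel[0] + w01*kernel[1] + w02*kernel[2]
              + w10*kernel[3] + w11*kernel[4] + w12*kernel[5]
              + w20*kernel[6] + w21*kernel[7] + w22*kernel[8];
  end
endmodule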
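One detail worth checking in the module above is the width of `conv_out`, declared as `[2*DW+4:0]` (21 bits for DW = 8). The short Python sketch below computes the worst-case signed accumulator width for a sum of products; `acc_bits_signed` is a hypothetical helper, and it assumes every operand can take the most negative two's-complement value.

```python
def acc_bits_signed(data_width, taps):
    """Bits needed for a signed sum of `taps` products of signed data_width-bit values."""
    # Worst case: every pixel and weight equals -2^(DW-1), so each product is +2^(2*DW-2)
    max_mag = taps * (1 << (data_width - 1)) ** 2
    return max_mag.bit_length() + 1   # magnitude bits plus one sign bit


needed = acc_bits_signed(8, 9)
declared = 2 * 8 + 5                  # width of conv_out for DW = 8
print(needed, declared)               # 19 21
```

Under these assumptions the 21-bit output leaves two bits of headroom for a single 3×3 INT8 window; accumulating across many input channels would need a wider accumulator.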

5. The im2col Alternative

im2col unrolls each convolution window into a column, turning the whole layer into one big matrix multiply — letting you reuse the Day 3 GEMM engine or Day 4 systolic array.

[Figure: im2col, convolution as GEMM]

Approach	Memory	Hardware Reuse	Best For
Line-buffer (direct)	Low (K−1 rows)	Dedicated conv engine	Streaming, edge, low latency
im2col + GEMM	High (pixel duplication)	Reuses systolic/GEMM array	Datacenter, large batch
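To make the im2col trade-off concrete, here is a small pure-Python sketch for a single channel with no padding. The names `im2col` and `conv_as_gemm` are illustrative. It unrolls every K×K window into a column, computes the convolution as a row-vector-times-matrix product, and reports how much the unrolled data grows.

```python
def im2col(image, k):
    """Unroll every k x k window of a 2D image into one column of k*k values."""
    h_out = len(image) - k + 1
    w_out = len(image[0]) - k + 1
    return [[image[y + i][x + j] for i in range(k) for j in range(k)]
            for y in range(h_out) for x in range(w_out)]


def conv_as_gemm(image, kernel):
    """Convolution as a (1 x k*k) by (k*k x windows) matrix product."""
    k = len(kernel)
    flat_kernel = [w for row in kernel for w in row]
    return [sum(a * b for a, b in zip(flat_kernel, col))
            for col in im2col(image, k)]


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
cols = im2col(image, 3)
print(len(cols), len(cols[0]))               # 4 9
print(conv_as_gemm(image, [[1, 1, 1]] * 3))  # [54, 63, 90, 99]
print(len(cols) * len(cols[0]) / 16)         # 2.25
```

The 16-pixel image becomes 36 stored values, a 2.25× expansion. For large feature maps with a 3×3 kernel and stride 1 the factor approaches 9×, which is the memory cost listed in the table.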

6. Multi-Channel Convolution

Real CNN layers have many input and output channels. The single-channel engine above is replicated and accumulated across input channels.

Multi-channel conv (per output pixel, per output channel):

  out[co][y][x] = Σ(ci=0..Cin-1) conv3x3(in[ci], kernel[co][ci])

Hardware options:

  1. Channel parallelism: Cin conv engines in parallel → sum
  2. Channel sequencing: 1 engine, accumulate over Cin cycles
  3. Output parallelism: Cout engines for different filters

Resource estimate (3×3, 64 input channels, INT8):

  Full parallel: 64 × 9 = 576 MACs (576 DSPs)
  With DSP packing (2 INT8/DSP): 288 DSPs
  Alveo U250 has 12,288 DSPs → plenty of room
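The resource estimate above can be turned into a small calculator for comparing the three hardware options. This is a back-of-the-envelope sketch with hypothetical helper names (`dsp_estimate`, `layer_cycles`). It assumes one MAC per DSP unless packing is specified, and an ideal pipeline with no stalls.

```python
def dsp_estimate(k, c_in_parallel, c_out_parallel=1, macs_per_dsp=1):
    """Return (parallel MACs, DSP slices) for a K x K engine array."""
    macs = k * k * c_in_parallel * c_out_parallel
    dsps = -(-macs // macs_per_dsp)          # ceiling division
    return macs, dsps


def layer_cycles(k, c_in, c_out, h_out, w_out, parallel_macs):
    """Ideal cycle count: total layer MACs divided by MACs issued per clock."""
    total = k * k * c_in * c_out * h_out * w_out
    return -(-total // parallel_macs)


print(dsp_estimate(3, 64))                    # (576, 576)
print(dsp_estimate(3, 64, macs_per_dsp=2))    # (576, 288)
print(layer_cycles(3, 64, 64, 56, 56, 576))   # 200704
```

With all 64 input channels in parallel, the 64→64 layer on a 56×56 output takes 64 × 3136 = 200,704 ideal cycles, one output channel at a time. Adding output parallelism divides that figure further at the cost of more DSPs.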

Connecting the Course

The conv engine feeds its output into the activation unit (Day 7) and pooling (Day 8). Weights and feature maps are streamed via the memory architecture from Day 6. This is how the full CNN pipeline (Day 9) comes together.

Day 5 — Key Takeaways

✅ 2D convolution = sliding weighted sum; 9 MACs per pixel for 3×3
✅ Sliding window holds the current K×K patch, shifts 1 pixel/clock
✅ Line buffers (K−1 of them) store previous rows in BRAM for streaming
✅ Latency ≈ (K−1)×width before first output, then 1 output/clock
✅ im2col turns conv into GEMM — reuses systolic hardware at a memory cost
✅ Multi-channel: parallelize, sequence, or replicate across Cin/Cout
✅ A 3×3 64→64 layer on 56×56 needs ~116M MACs; throughput is DSP-bound, and a 576-MAC parallel engine fits easily within the Alveo U250's 12,288 DSPs

Next — Day 6: Memory Architecture — BRAM vs DDR4, bandwidth bottleneck analysis, ping-pong buffering, and the AXI4 interface for streaming weights and activations.
