How to keep thousands of MACs busy every cycle. Layer pipelining, inter-layer FIFOs, the throughput-vs-latency trade-off, initiation interval, and data vs model parallelism.
An FPGA's superpower is spatial parallelism — but only if you keep the hardware busy. A conv engine with 1,000 MACs that sits idle 80% of the time delivers the same throughput as 200 fully-utilized MACs. Pipelining and parallelism are what turn raw DSP count into real inference performance.
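Three levers control performance: pipelining (overlap consecutive operations), unrolling (replicate hardware to do more per cycle), and layer-level dataflow (run every layer's engine concurrently).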
A pipeline overlaps the stages of consecutive operations, like an assembly line. While stage 3 works on item A, stage 2 works on item B and stage 1 on item C. Once the pipeline fills, one result comes out every clock, provided a new input can be accepted every clock.
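To put numbers on the assembly-line picture, here is a minimal sketch of the usual cycle-count model. It assumes a stall-free pipeline; `depth` (number of stages), `ii` (initiation interval, the clocks between accepted inputs), and `n` (item count) are illustrative parameters, not names from any tool.

```cpp
#include <cstdint>

// Clocks to push n items through a stall-free pipeline:
// `depth` clocks until the first result, then one more result every `ii` clocks.
std::uint64_t pipeline_cycles(std::uint64_t depth, std::uint64_t ii, std::uint64_t n) {
    if (n == 0) return 0;
    return depth + (n - 1) * ii;
}

// Clocks for the same work with no overlap: every item pays the full depth.
std::uint64_t sequential_cycles(std::uint64_t depth, std::uint64_t n) {
    return depth * n;
}
```

Under these assumptions, a 3-stage pipeline handling 1,000 items needs 1,002 clocks when it accepts an input every clock, against 3,000 clocks with no overlap. If it can only accept an input every 3 clocks, the count climbs back to 3,000, which is why the initiation interval matters as much as the depth.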
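The initiation interval (II) is the number of clocks between accepting consecutive inputs. II=1 is the holy grail: a new input accepted every clock. The two most common things that break it are resource conflicts (two operations need the same DSP/BRAM port) and loop-carried dependencies (this iteration needs last iteration's result). The table lists these and two further causes.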
| II Killer | Cause | Fix |
|---|---|---|
| Resource conflict | One BRAM port, two reads/cycle | Array partition / dual-port BRAM |
| Loop-carried dependency | acc += x[i] (acc depends on prev) | Partial sums / tree reduction |
| Long combinational path | Too much logic between registers | Add pipeline registers (retiming) |
| Variable-latency op | Divide, sqrt with data-dependent cycles | Fixed-latency LUT/CORDIC |
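A naive acc += data[i] creates a loop-carried dependency: the adder must finish iteration i before starting i+1. If the adder takes 3 cycles, you get II=3. The fix is multiple partial accumulators: interleave the inputs across at least as many sums as the adder has cycles of latency (4 sums comfortably covers a 3-cycle adder), then combine them at the end. Each accumulator is then revisited only after its previous add has finished, which restores II=1 at the cost of a few extra registers and a short final reduction.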
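The partial-sum idea can be sketched in plain C++ as follows. This is a behavioral illustration only, not synthesizable HLS code with pragmas; `LANES`, `sum_naive`, and `sum_interleaved` are hypothetical names chosen for the example, and integer data is assumed so that both orderings give identical results (with floating point, reordering the additions can change rounding).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Naive form: every iteration reads the `acc` written by the previous one.
std::int64_t sum_naive(const std::vector<std::int32_t>& data) {
    std::int64_t acc = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        acc += data[i];
    }
    return acc;
}

// Number of interleaved accumulators (assumed >= adder latency in cycles).
constexpr std::size_t LANES = 4;

// Interleaved form: element i goes to accumulator i % LANES, so each
// accumulator is updated only once every LANES iterations.
std::int64_t sum_interleaved(const std::vector<std::int32_t>& data) {
    std::array<std::int64_t, LANES> partial{};  // zero-initialized
    for (std::size_t i = 0; i < data.size(); ++i) {
        partial[i % LANES] += data[i];
    }
    // Final reduction: combine the partial sums once, after the loop.
    std::int64_t total = 0;
    for (std::int64_t p : partial) {
        total += p;
    }
    return total;
}
```

In hardware terms, the dependency distance grows from 1 iteration to `LANES` iterations, which is what gives a multi-cycle adder time to finish before its result is needed again.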
Unrolling replicates hardware to do more work per cycle. Instead of one MAC processing 64 channels over 64 cycles, instantiate 64 MACs and do it in one cycle. It's a direct area-for-speed trade.
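As a rough illustration of the area-for-speed trade, the sketch below models the cycle count for a given unroll factor and shows the loop shape that unrolling corresponds to. It assumes one multiply-accumulate per MAC unit per clock and enough memory ports to feed every unit; `mac_cycles`, `dot_unrolled`, and `UNROLL` are hypothetical names for this example.

```cpp
#include <cstddef>
#include <cstdint>

// Clocks to perform `channels` MACs with `unroll` parallel MAC units.
// Precondition: unroll >= 1.
std::size_t mac_cycles(std::size_t channels, std::size_t unroll) {
    return (channels + unroll - 1) / unroll;  // ceiling division
}

// Illustrative unroll factor: how many MACs run side by side.
constexpr std::size_t UNROLL = 4;

// Dot product with the loop body replicated UNROLL times. Each pass of the
// outer loop stands for one clock in which UNROLL independent MACs fire.
std::int32_t dot_unrolled(const std::int8_t* x, const std::int8_t* w, std::size_t n) {
    std::int32_t lane[UNROLL] = {};  // one accumulator per MAC unit
    std::size_t i = 0;
    for (; i + UNROLL <= n; i += UNROLL) {
        for (std::size_t u = 0; u < UNROLL; ++u) {
            lane[u] += std::int32_t(x[i + u]) * std::int32_t(w[i + u]);
        }
    }
    for (; i < n; ++i) {  // leftover elements when n is not a multiple of UNROLL
        lane[0] += std::int32_t(x[i]) * std::int32_t(w[i]);
    }
    std::int32_t total = 0;
    for (std::size_t u = 0; u < UNROLL; ++u) {
        total += lane[u];
    }
    return total;
}
```

With this model, 64 channels take 64 clocks on one MAC, 4 clocks on 16 MACs, and 1 clock on 64 MACs. The speedup only materializes if the memories can deliver that many operands per clock, which is where array partitioning comes in.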
The most powerful FPGA architecture: run every layer's engine simultaneously, connected by FIFOs. Layer 1 streams into Layer 2's FIFO while Layer 2 processes the previous frame's data — the whole network is one giant pipeline.
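The decoupling role of an inter-layer FIFO can be explored with a small behavioral model. This is a sketch under simplifying assumptions: the producer offers one item per clock, the consumer attempts one read every `consumer_period` clocks, and reads are serviced before writes within a clock. `StreamFifo` and `producer_stalls` are hypothetical names, and the model is not cycle-exact for any particular RTL.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bounded FIFO: push fails when full, pop fails when empty.
// Precondition: depth >= 1.
class StreamFifo {
public:
    explicit StreamFifo(std::size_t depth) : mem_(depth) {}

    bool full() const { return count_ == mem_.size(); }
    bool empty() const { return count_ == 0; }

    bool push(std::uint8_t value) {
        if (full()) return false;  // upstream layer must stall
        mem_[tail_] = value;
        tail_ = (tail_ + 1) % mem_.size();
        ++count_;
        return true;
    }

    bool pop(std::uint8_t& out) {
        if (empty()) return false;  // downstream layer must stall
        out = mem_[head_];
        head_ = (head_ + 1) % mem_.size();
        --count_;
        return true;
    }

private:
    std::vector<std::uint8_t> mem_;
    std::size_t head_ = 0, tail_ = 0, count_ = 0;
};

// Counts the clocks on which the producer wanted to write but found the FIFO full.
// Preconditions: depth >= 1, consumer_period >= 1.
std::size_t producer_stalls(std::size_t depth, std::size_t items, std::size_t consumer_period) {
    StreamFifo fifo(depth);
    std::size_t sent = 0, received = 0, stalls = 0;
    for (std::size_t clk = 0; received < items; ++clk) {
        if (clk % consumer_period == 0) {  // consumer tries to read
            std::uint8_t value;
            if (fifo.pop(value)) ++received;
        }
        if (sent < items) {  // producer tries to write
            if (fifo.push(static_cast<std::uint8_t>(sent))) ++sent;
            else ++stalls;
        }
    }
    return stalls;
}
```

In this model, a 4-item burst into a depth-2 FIFO with a half-rate consumer stalls the producer for one clock, while a depth-4 FIFO absorbs the whole burst. A FIFO only hides short-term mismatch, though: if the consumer is slower on average, any finite depth eventually fills.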
In a layer pipeline, overall throughput is set by the slowest stage. If Conv3 takes twice as long as the others, the whole pipeline runs at half speed, so PD/HLS engineers balance per-layer parallelism (more MACs for heavy layers) until every stage finishes in roughly the same time.
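The balancing argument can be made concrete with a small model. It assumes each layer's time per frame is simply its MAC operations divided by its MAC units (rounded up), ignoring memory stalls and pipeline fill; the `Layer` struct, the function names, and the numbers in the usage note are all hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Layer {
    std::uint64_t mac_ops;  // multiply-accumulates needed per frame
    std::uint64_t macs;     // parallel MAC units allocated (>= 1)
};

// Clocks one layer engine needs per frame, assuming one op per MAC per clock.
std::uint64_t layer_cycles(const Layer& layer) {
    return (layer.mac_ops + layer.macs - 1) / layer.macs;
}

// Clocks between finished frames: the slowest stage sets the pace.
std::uint64_t frame_interval(const std::vector<Layer>& layers) {
    std::uint64_t worst = 0;
    for (const Layer& layer : layers) {
        worst = std::max(worst, layer_cycles(layer));
    }
    return worst;
}

// MAC units a layer needs to finish within `target_cycles` (>= 1) per frame.
std::uint64_t macs_for_target(std::uint64_t mac_ops, std::uint64_t target_cycles) {
    return (mac_ops + target_cycles - 1) / target_cycles;
}
```

For example, three layers with 1,000, 2,000, and 1,000 operations and 10 MACs each give stage times of 100, 200, and 100 clocks, so a frame completes every 200 clocks. `macs_for_target(2000, 100)` returns 20; giving the middle layer 20 MACs brings the interval down to 100 clocks.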
The classic FPGA architecture choice. Two extreme styles, with a spectrum between:
| Architecture | Hardware | Latency | Throughput | Best For |
|---|---|---|---|---|
| Single engine (folded) | 1 reused engine, all layers time-shared | Higher (sequential) | Lower | Small FPGA, low cost, large models |
| Layer-pipelined (dataflow) | One engine per layer, all concurrent | Low (streaming) | Very high | Real-time, fixed model, large FPGA |
| Hybrid | Pipeline groups of layers | Medium | High | Balanced resource budgets |
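The throughput column can be illustrated with a first-order model, offered as a sketch rather than a sizing method. It assumes the same per-layer cycle counts in both styles, which is not an equal-resource comparison since a folded design might give its single engine more MACs; `folded_interval`, `dataflow_interval`, and `frames_per_second` are hypothetical names.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Folded: one engine runs the layers back to back, so the clocks between
// finished frames are the sum of all layer times.
std::uint64_t folded_interval(const std::vector<std::uint64_t>& layer_cycles) {
    return std::accumulate(layer_cycles.begin(), layer_cycles.end(), std::uint64_t{0});
}

// Layer-pipelined: all engines run concurrently, so the clocks between
// finished frames equal the slowest layer's time.
std::uint64_t dataflow_interval(const std::vector<std::uint64_t>& layer_cycles) {
    if (layer_cycles.empty()) return 0;
    return *std::max_element(layer_cycles.begin(), layer_cycles.end());
}

// Steady-state frame rate for a given clock. Precondition: interval_cycles >= 1.
double frames_per_second(double clock_hz, std::uint64_t interval_cycles) {
    return clock_hz / static_cast<double>(interval_cycles);
}
```

With layer times of 100, 200, and 100 clocks, the folded interval is 400 clocks and the dataflow interval is 200 clocks. At an assumed 200 MHz clock that works out to 500,000 versus 1,000,000 frames per second in this simplified model.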
```verilog
// stream_fifo.v: synchronous FIFO decoupling two layer engines.
// DEPTH must be a power of two so the pointer wrap-around works.
module stream_fifo #(
    parameter DW    = 8,
    parameter DEPTH = 512
)(
    input  wire          clk,
    input  wire          rst_n,
    // producer (upstream layer)
    input  wire [DW-1:0] wr_data,
    input  wire          wr_en,
    output wire          full,
    // consumer (downstream layer)
    output reg  [DW-1:0] rd_data,   // registered: updates on the clock edge that accepts rd_en
    input  wire          rd_en,
    output wire          empty
);
    localparam AW = $clog2(DEPTH);

    reg [DW-1:0] mem [0:DEPTH-1];
    reg [AW:0]   wptr, rptr;        // extra MSB distinguishes full from empty

    // Same address with different wrap bits means full; identical pointers mean empty.
    assign full  = (wptr[AW] != rptr[AW]) && (wptr[AW-1:0] == rptr[AW-1:0]);
    assign empty = (wptr == rptr);

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            wptr <= 0;
            rptr <= 0;
        end else begin
            if (wr_en && !full) begin
                mem[wptr[AW-1:0]] <= wr_data;
                wptr <= wptr + 1;
            end
            if (rd_en && !empty) begin
                rd_data <= mem[rptr[AW-1:0]];
                rptr <= rptr + 1;
            end
        end
    end
endmodule
// Producer stalls when full; consumer stalls when empty.
// FIFO absorbs short-term rate mismatch between layer engines.
```

Next up, Day 10: Vitis HLS. Write CNN layers in C++, synthesize to RTL with PIPELINE / UNROLL / ARRAY_PARTITION pragmas, and read latency/area reports.