What is initiation interval (II) in FPGA pipelining?

Initiation interval (II) is the number of clock cycles between starting consecutive operations in a pipeline. II=1 (the ideal) means a new input is accepted every cycle, giving maximum throughput. II=2 means one new input every two cycles, halving throughput. Achieving II=1 requires resolving all resource conflicts and loop-carried dependencies.

What is the difference between data and model parallelism on FPGA?

Data parallelism processes multiple inputs (images) simultaneously through copies of the same hardware — boosting throughput. Model parallelism splits one model across hardware: different layers run on different engines (layer pipelining), or one layer's channels are computed in parallel. FPGAs typically combine both: parallel channels within a layer plus a pipeline across layers.

Why use inter-layer FIFOs in a CNN accelerator?

Inter-layer FIFOs decouple adjacent layer engines so they can run concurrently at slightly different rates without stalling each other. When a producing layer runs faster than the consumer, the FIFO buffers the surplus; when it runs slower, the FIFO drains. This smooths out rate mismatches and keeps all pipeline stages busy.

Pipelining & Parallelism in FPGA Neural Networks — Layer Pipelines & Dataflow

1. Why Pipelining Is Everything

An FPGA's superpower is spatial parallelism — but only if you keep the hardware busy. A conv engine with 1,000 MACs that sits idle 80% of the time delivers the same throughput as 200 fully-utilized MACs. Pipelining and parallelism are what turn raw DSP count into real inference performance.

Three levers control performance:

Pipelining — overlap operations in time so a new result emerges every cycle
Spatial parallelism — replicate hardware to process more data at once
Dataflow — connect layer engines so they all run concurrently

2. The Pipeline Concept

A pipeline overlaps the stages of consecutive operations — like an assembly line. While stage 3 works on item A, stage 2 works on item B, and stage 1 on item C. After the pipeline fills, one result comes out every clock.

Pipelined Execution — Fill, Steady-State, Drain

Pipeline Performance:
  Latency    = depth (cycles to traverse the pipeline)
  Throughput = 1 / II (results per cycle)
  II         = Initiation Interval = cycles between new inputs

  II = 1 → ideal (new input every cycle) → max throughput
  II = 2 → half throughput
  II = k → 1/k throughput

For a batch of N = 1000 images through a depth-50, II=1 pipeline:
  Total cycles  = depth + (N-1)×II = 50 + 999×1 = 1049 cycles
  Non-pipelined = N × depth = 1000 × 50 = 50,000 cycles
  → ~48× speedup from pipelining alone!
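The batch arithmetic above can be checked in a few lines of Python. This is a minimal sketch; the function names `pipelined_cycles` and `sequential_cycles` are illustrative and not taken from any tool or library.

```python
def pipelined_cycles(depth: int, n_items: int, ii: int = 1) -> int:
    """Cycles to push n_items through a pipeline: fill once, then one item every II cycles."""
    return depth + (n_items - 1) * ii


def sequential_cycles(depth: int, n_items: int) -> int:
    """Cycles when each item must fully finish before the next one starts."""
    return depth * n_items


piped = pipelined_cycles(depth=50, n_items=1000, ii=1)
serial = sequential_cycles(depth=50, n_items=1000)
speedup = serial / piped
print(piped, serial, round(speedup, 1))  # 1049 50000 47.7
```

Calling `pipelined_cycles(50, 1000, ii=2)` gives 2048 cycles, so the same pipeline at II=2 yields only about a 24× speedup.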

3. Achieving II = 1

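II=1 is the holy grail: a new input accepted every clock. The two most common obstacles are resource conflicts (two operations need the same DSP or BRAM port in the same cycle) and loop-carried dependencies (this iteration needs the previous iteration's result). Long combinational paths and variable-latency operations can also force a higher II.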

II Killer	Cause	Fix
Resource conflict	One BRAM port, two reads/cycle	Array partition / dual-port BRAM
Loop-carried dependency	acc += x[i] (acc depends on prev)	Partial sums / tree reduction
Long combinational path	Too much logic between registers	Add pipeline registers or retime existing ones
Variable-latency op	Divide, sqrt with data-dependent cycles	Fixed-latency LUT/CORDIC
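The first two rows of the table can be turned into a rough lower bound on II, in the way modulo schedulers commonly reason about it: a resource bound (accesses per iteration divided by available ports, rounded up) and a recurrence bound (latency of the dependency chain divided by the iteration distance it spans, rounded up). The sketch below illustrates that reasoning under those assumptions. `min_ii` and its parameters are hypothetical names, and a real HLS tool may report a higher II than this bound.

```python
from math import ceil


def min_ii(accesses_per_iter: int, ports: int,
           recurrence_latency: int, recurrence_distance: int = 1) -> int:
    """Lower bound on II from one shared resource and one loop-carried dependency."""
    resource_bound = ceil(accesses_per_iter / ports)
    recurrence_bound = ceil(recurrence_latency / recurrence_distance)
    return max(1, resource_bound, recurrence_bound)


# Two reads per iteration from a single-port BRAM: II cannot be below 2.
print(min_ii(accesses_per_iter=2, ports=1, recurrence_latency=1))  # 2
# Same loop with a dual-port BRAM: II = 1 becomes possible.
print(min_ii(accesses_per_iter=2, ports=2, recurrence_latency=1))  # 1
# acc += x[i] with a 3-cycle adder: the recurrence forces II >= 3.
print(min_ii(accesses_per_iter=1, ports=1, recurrence_latency=3))  # 3
# Four interleaved partial sums: each accumulator is reused every 4th iteration.
print(min_ii(accesses_per_iter=1, ports=1, recurrence_latency=3,
             recurrence_distance=4))  # 1
```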

The Accumulator Trap

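A naive acc += data[i] creates a loop-carried dependency: the add for iteration i must finish before the add for iteration i+1 can start. If the adder takes 3 cycles, you get II=3. The fix is multiple partial accumulators: interleave the inputs across at least as many sums as the adder has latency cycles (4 is a convenient choice for a 3-cycle adder) and combine them at the end. This restores II=1 at the cost of a few extra registers and a short final reduction.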
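The accumulator trap can be modelled at cycle level in software. The sketch below assumes a pipelined adder with a fixed latency and counts when each add can be issued. `accumulate_cycles` is a hypothetical helper, and the final reduction of the partial sums is not included in the cycle count.

```python
def accumulate_cycles(data, adder_latency, n_partials):
    """Return (total, cycles) for a sum using n_partials interleaved accumulators.

    An accumulator can accept a new addend only after its previous add,
    which takes adder_latency cycles, has finished.
    """
    partial = [0] * n_partials
    ready_at = [0] * n_partials          # cycle at which each accumulator is free again
    cycle = 0
    for i, x in enumerate(data):
        k = i % n_partials               # round-robin over the partial sums
        cycle = max(cycle, ready_at[k])  # stall until accumulator k is free
        partial[k] += x                  # issue the add in this cycle
        ready_at[k] = cycle + adder_latency
        cycle += 1                       # at most one new input per cycle
    finish = max(ready_at)               # cycle in which the last add completes
    return sum(partial), finish


data = list(range(1, 13))                # 12 samples, sum = 78
print(accumulate_cycles(data, adder_latency=3, n_partials=1))  # (78, 36)
print(accumulate_cycles(data, adder_latency=3, n_partials=4))  # (78, 14)
```

With one accumulator a new input is issued only every 3 cycles; with four, one input is issued per cycle. The totals match exactly here because the data are integers. With floating-point data, reordering the additions can change the rounding slightly.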

4. Spatial Parallelism — Unrolling

Unrolling replicates hardware to do more work per cycle. Instead of one MAC processing 64 channels over 64 cycles, instantiate 64 MACs and do it in one cycle. It's a direct area-for-speed trade.

Parallelism Dimensions in a Conv Layer

Input-channel parallel

Process all Cin channels of one window at once → Cin MACs summed in a tree

Output-channel parallel

Compute Cout filters at once → Cout independent MAC arrays share inputs

Kernel parallel

All K×K taps in parallel (the Day 5 conv engine — 9 MACs for 3×3)

Pixel/batch parallel

Multiple output pixels or images at once → data parallelism

Total parallelism = product of the unroll factors in all dimensions. A design unrolled 16× over input channels and 16× over output channels has 16 × 16 = 256 MACs working in parallel; at II=1 it sustains 256 MAC operations per cycle.
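The product rule is easy to turn into a quick estimate. The sketch below is an approximation that ignores pipeline fill and memory stalls. The dictionary keys (`cin`, `cout`, `kernel`, `pixel`) and the example layer shape are assumptions made for illustration.

```python
from math import ceil, prod


def macs_per_cycle(unroll: dict) -> int:
    """Parallel MAC units = product of the unroll factor in each dimension."""
    return prod(unroll.values())


def layer_cycles(h_out, w_out, c_in, c_out, k, unroll, ii=1):
    """Approximate cycles for one conv layer, ignoring pipeline fill and memory stalls."""
    steps = (ceil(c_in / unroll.get("cin", 1))
             * ceil(c_out / unroll.get("cout", 1))
             * ceil(k * k / unroll.get("kernel", 1))
             * ceil(h_out * w_out / unroll.get("pixel", 1)))
    return steps * ii


unroll = {"cin": 16, "cout": 16}
print(macs_per_cycle(unroll))                    # 256
# Hypothetical layer: 56x56 output, 64 -> 64 channels, 3x3 kernel.
print(layer_cycles(56, 56, 64, 64, 3, unroll))   # 451584
```

Adding `"kernel": 9` to `unroll` would raise the MAC count to 2304 and cut the cycle estimate by 9×, which is the area-for-speed trade in numbers.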

5. Layer Pipelining with Inter-Layer FIFOs

The most powerful FPGA architecture: run every layer's engine simultaneously, connected by FIFOs. Layer 1 streams into Layer 2's FIFO while Layer 2 is still processing data that arrived earlier, so the whole network behaves as one giant pipeline.

Layer-Pipelined CNN with Inter-Layer FIFOs

Throughput = Slowest Stage

In a layer pipeline, overall throughput equals the throughput of the slowest stage. If Conv3 takes twice as long as the others, the whole pipeline runs at half speed, so PD/HLS engineers carefully balance per-layer parallelism (more MACs for heavy layers) until every stage finishes in roughly the same time.
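A small model makes the bottleneck effect and the benefit of balancing concrete. The per-layer workloads below are made-up illustrative numbers, and `balanced_alloc` is a hypothetical helper that simply splits a MAC budget in proportion to work. A real design would also round to unroll factors that divide the channel counts.

```python
from math import ceil

# Hypothetical per-layer workloads in MAC operations per frame.
layer_macs = {"conv1": 10_000_000, "conv2": 20_000_000,
              "conv3": 40_000_000, "conv4": 10_000_000}


def stage_cycles(macs: int, parallel_macs: int) -> int:
    """Cycles one stage needs per frame at II=1."""
    return ceil(macs / parallel_macs)


def frame_interval(alloc: dict) -> int:
    """Steady-state cycles per frame = cycles of the slowest stage."""
    return max(stage_cycles(layer_macs[name], alloc[name]) for name in layer_macs)


def balanced_alloc(total_mac_budget: int) -> dict:
    """Split a MAC budget in proportion to each layer's workload."""
    total_work = sum(layer_macs.values())
    return {name: max(1, work * total_mac_budget // total_work)
            for name, work in layer_macs.items()}


uniform = {name: 64 for name in layer_macs}   # 256 MACs, same for every layer
balanced = balanced_alloc(256)                # 256 MACs, proportional to work

print(frame_interval(uniform))    # 625000
print(balanced)                   # {'conv1': 32, 'conv2': 64, 'conv3': 128, 'conv4': 32}
print(frame_interval(balanced))   # 312500
```

Both allocations use 256 MACs, yet the balanced one doubles the frame rate because no stage waits on conv3.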

6. Throughput vs Latency Trade-off

The classic FPGA architecture choice. Two extreme styles, with a spectrum between:

Architecture	Hardware	Latency	Throughput	Best For
Single engine (folded)	1 reused engine, all layers time-shared	Higher (sequential)	Lower	Small FPGA, low cost, large models
Layer-pipelined (dataflow)	One engine per layer, all concurrent	Low (streaming)	Very high	Real-time, fixed model, large FPGA
Hybrid	Pipeline groups of layers	Medium	High	Balanced resource budgets

7. Inter-Layer FIFO in Verilog

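// stream_fifo.v: synchronous FIFO decoupling two layer engines
// DEPTH must be a power of two (pointer wrap-around relies on it).
// rd_data is registered: it is valid one cycle after rd_en && !empty.
module stream_fifo #(parameter DW = 8, parameter DEPTH = 512)(
  input  wire           clk, rst_n,
  // producer (upstream layer)
  input  wire [DW-1:0]  wr_data,
  input  wire           wr_en,
  output wire           full,
  // consumer (downstream layer)
  output reg  [DW-1:0]  rd_data,
  input  wire           rd_en,
  output wire           empty
);
  localparam AW = $clog2(DEPTH);
  reg [DW-1:0] mem [0:DEPTH-1];
  reg [AW:0]   wptr, rptr;             // extra bit for full/empty distinction

  // full: same address, different wrap bit; empty: pointers identical
  assign full  = (wptr[AW] != rptr[AW]) && (wptr[AW-1:0] == rptr[AW-1:0]);
  assign empty = (wptr == rptr);

  wire do_wr = wr_en && !full;         // accepted write
  wire do_rd = rd_en && !empty;        // accepted read

  // Pointers: asynchronous active-low reset
  always @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin wptr <= 0; rptr <= 0; end
    else begin
      if (do_wr) wptr <= wptr + 1;
      if (do_rd) rptr <= rptr + 1;
    end
  end

  // Memory and read register: kept out of the reset block, which helps
  // synthesis infer block RAM
  always @(posedge clk) begin
    if (do_wr) mem[wptr[AW-1:0]] <= wr_data;
    if (do_rd) rd_data <= mem[rptr[AW-1:0]];
  end
endmodule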

// Producer stalls when full; consumer stalls when empty.
// FIFO absorbs short-term rate mismatch between layer engines.
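To see how depth affects stalls, the FIFO's behaviour can be modelled in software. The sketch below is a simplified cycle-level model, not a simulation of the RTL above. `simulate`, `bursty_producer` and `steady_consumer` are hypothetical names, and the traffic pattern (bursts of 4 writes every 8 cycles, one read every 2 cycles) is an assumption chosen so both sides average the same rate.

```python
from collections import deque


def simulate(depth, n_items, producer_ready, consumer_ready, max_cycles=10_000):
    """Model producer -> FIFO -> consumer.

    Returns (cycles, producer_stalls, consumer_stalls, received).
    """
    fifo = deque()
    sent = 0
    received = []
    producer_stalls = 0
    consumer_stalls = 0
    cycle = 0
    while len(received) < n_items and cycle < max_cycles:
        # Flags are sampled before this cycle's transfers, as with registered pointers.
        full = len(fifo) == depth
        empty = len(fifo) == 0
        if consumer_ready(cycle):
            if empty:
                consumer_stalls += 1     # consumer wanted data, none available
            else:
                received.append(fifo.popleft())
        if sent < n_items and producer_ready(cycle):
            if full:
                producer_stalls += 1     # producer wanted to write, no room
            else:
                fifo.append(sent)
                sent += 1
        cycle += 1
    return cycle, producer_stalls, consumer_stalls, received


def bursty_producer(cycle):
    return cycle % 8 < 4                 # 4 writes, then 4 idle cycles


def steady_consumer(cycle):
    return cycle % 2 == 1                # one read every other cycle


shallow = simulate(1, 16, bursty_producer, steady_consumer)
deep = simulate(4, 16, bursty_producer, steady_consumer)
print(shallow[:3])   # (60, 15, 14)
print(deep[:3])      # (32, 0, 0)
```

Both sides average 0.5 items per cycle, but with a depth of 1 the bursts cannot be buffered and the run takes almost twice as long. A depth of 4 absorbs each burst and removes every stall in this traffic pattern.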

8. Real-World Pipeline Numbers

MobileNetV2 on Xilinx Kria (layer-pipelined dataflow):
  Architecture: dataflow, all 53 layers pipelined
  Clock: 300 MHz
  Bottleneck stage: 224×224 first conv (~150k cycles)
  Steady-state throughput: 1 frame / 150k cycles = 300M / 150k = 2000 FPS theoretical
  Measured: ~400 FPS (memory + overhead limited)
  Latency: ~0.9ms (pipeline fill, single frame)

Single-engine (folded) alternative on same chip:
  Latency: ~3ms (layers run sequentially)
  Throughput: ~330 FPS

→ dataflow trades area for latency + throughput
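The figures quoted above can be cross-checked with simple arithmetic. The snippet only re-derives numbers from the values already stated (300 MHz, ~150k cycles, ~400 FPS measured, ~3 ms folded latency) and introduces no new measurements.

```python
clock_hz = 300e6            # 300 MHz clock
bottleneck_cycles = 150e3   # slowest stage, cycles per frame

theoretical_fps = clock_hz / bottleneck_cycles
bottleneck_ms = bottleneck_cycles * 1e3 / clock_hz
folded_fps = 1 / 3e-3       # folded engine: one frame every ~3 ms

print(theoretical_fps)          # 2000.0
print(bottleneck_ms)            # 0.5
print(round(folded_fps))        # 333
print(400 / theoretical_fps)    # 0.2
```

So the measured ~400 FPS is about 20% of the bottleneck-stage bound, and the folded design's ~330 FPS is what a 3 ms sequential latency implies when frames are not overlapped.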

Day 9 — Key Takeaways

✅ Pipelining overlaps stages → 1 result/cycle after fill; ~48× over non-pipelined
✅ Initiation Interval (II): II=1 is ideal; throughput = 1/II
✅ II killers: resource conflicts, loop-carried deps, long paths, variable latency
✅ Accumulator trap: use partial sums / tree reduction to keep II=1
✅ Parallelism dimensions: input-ch × output-ch × kernel × pixel — multiply them
✅ Layer pipelining + FIFOs: all layer engines run concurrently, decoupled
✅ Throughput = slowest stage — balance per-layer parallelism
✅ Dataflow vs folded: area for latency+throughput, or reuse for cost

Next — Day 10: Vitis HLS — write CNN layers in C++, synthesize to RTL with PIPELINE / UNROLL / ARRAY_PARTITION pragmas, and read latency/area reports.
