DAY 25 · OPTIMISATION · FINAL LESSON

Optimisation: Pipelining, Area & Speed

By EcrioniX · Updated Jun 11, 2026

You have built UART, BRAM, SPI, I2C, VGA, AXI, and DSP blocks. Now it is time to close the loop: understanding how to make designs fast and small. Pipelining, resource sharing, retiming, and register balancing are the four pillars of FPGA performance optimisation. This final lesson compares a combinational multiplier against a 2-stage pipelined version and teaches you how the synthesis tool can help — and when you must act yourself.

1. Combinational vs pipelined multiplier

A 16×16-bit combinational multiplier produces its result within a single clock cycle, but its critical path runs through every carry-save adder row: typically 4–7 ns on a 7-series FPGA, which limits Fmax to roughly 150–250 MHz. A 2-stage pipelined version registers the inputs and the product so that the tool can split the partial-product tree across the stages, cutting the critical path to ~2–3 ns and raising Fmax to roughly 350–450 MHz.

Throughput vs latency

Pipelining does NOT reduce throughput — once the pipeline is full, you still get one result per clock cycle. It only increases latency (time from input to output). For streaming data (ADC samples, filter taps) latency is usually acceptable. For loop-carried dependencies (result feeds back to input), pipeline depth adds critical latency to the loop.
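To make the loop-carried case concrete, here is a minimal sketch of a multiply-accumulate. The module name mac_pipe, the vld input, and the 40-bit accumulator width are assumptions made for this illustration and are not part of the course code. The multiply sits in the feed-forward path, so it can be pipelined freely; the accumulator feeds back into itself, so its adder has to finish within one clock cycle.

```verilog
// mac_pipe.v : hypothetical multiply-accumulate used only as an illustration
module mac_pipe (
    input  wire        clk,
    input  wire        rst,
    input  wire        vld,      // a/b carry a sample this cycle
    input  wire [15:0] a,
    input  wire [15:0] b,
    output reg  [39:0] acc       // 32-bit products plus 8 guard bits
);

reg [15:0] a_r, b_r;
reg [31:0] prod_r;
reg        vld_r, vld_rr;        // vld delayed to line up with prod_r

always @(posedge clk) begin
    if (rst) begin
        a_r    <= 16'd0;
        b_r    <= 16'd0;
        prod_r <= 32'd0;
        vld_r  <= 1'b0;
        vld_rr <= 1'b0;
        acc    <= 40'd0;
    end else begin
        // Feed-forward path: extra registers here only add latency
        a_r    <= a;
        b_r    <= b;
        vld_r  <= vld;
        prod_r <= a_r * b_r;
        vld_rr <= vld_r;
        // Feedback path: acc depends on its own previous value, so this
        // addition must complete in a single cycle
        if (vld_rr)
            acc <= acc + prod_r;
    end
end

endmodule
```

Adding a register inside the acc feedback path would change the arithmetic, not just the latency, which is why loop-carried paths are usually the ones that set Fmax.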

2. mult_comb.v — combinational 16×16 multiplier

mult_comb.v
// mult_comb.v — Combinational 16x16 unsigned multiplier
// Result available same clock cycle (0 latency)
// Critical path: full partial product tree — limits Fmax
// Synthesis: maps to DSP48 or LUT multiplier depending on tool settings

module mult_comb (
    input  wire [15:0] a,
    input  wire [15:0] b,
    output wire [31:0] result
);

assign result = a * b;   // combinational — no registers

endmodule

3. mult_pipe.v — 2-stage pipelined multiplier

mult_pipe.v
// mult_pipe.v — 2-stage pipelined 16x16 unsigned multiplier
// Stage 1: register inputs a and b
// Stage 2: multiply the registered inputs and register the product
// Latency: 2 clock cycles
// Fmax: ~2x higher than mult_comb (estimate) due to shorter critical path per stage

module mult_pipe (
    input  wire        clk,
    input  wire        rst,
    input  wire [15:0] a,
    input  wire [15:0] b,
    output reg  [31:0] result
);

// ---- Stage 1: register inputs ----
// Registering a and b gives the multiplier inputs that are stable for a full
// clock period and lets synthesis absorb the registers into the DSP48.
reg [15:0] a_r, b_r;

always @(posedge clk) begin
    if (rst) begin
        a_r    <= 16'd0;
        b_r    <= 16'd0;
        result <= 32'd0;
    end else begin
        // Stage 1: capture inputs
        a_r <= a;
        b_r <= b;
        // Stage 2: multiply registered inputs and register the product
        // (an extra register here would make the latency 3 cycles, not 2)
        result <= a_r * b_r;
    end
end

endmodule

4. Synthesis comparison

Metric | mult_comb | mult_pipe (2-stage)
Latency | 0 cycles (combinational) | 2 clock cycles
Throughput | 1 result/cycle (if registered externally) | 1 result/cycle (after fill)
Critical path | ~4–7 ns (full multiplier tree) | ~2–3 ns (input reg + partial tree)
Fmax (7-series est.) | ~150–250 MHz | ~350–450 MHz
Flip-flops | 0 (pure combinational) | ~64 (input + output regs)
DSP48 slices | 1 (if tool infers) | 1 (pipeline maps cleanly)
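If you want to explore the latency/Fmax trade-off beyond two stages, one option is to make the number of product registers a parameter. The sketch below is an assumption-laden illustration: mult_pipe_n and STAGES are hypothetical names, the module has no reset, and whether the extra registers are absorbed into the DSP48 or retimed into LUT logic depends on your tool settings.

```verilog
// mult_pipe_n.v : hypothetical multiplier with a configurable number of
// product registers. Latency is STAGES + 1 clock cycles.
module mult_pipe_n #(
    parameter STAGES = 3           // product register stages, must be >= 1
) (
    input  wire        clk,
    input  wire [15:0] a,
    input  wire [15:0] b,
    output wire [31:0] result
);

reg [15:0] a_r, b_r;               // input registers
reg [31:0] p [0:STAGES-1];         // chain of product registers
integer i;

always @(posedge clk) begin
    a_r  <= a;
    b_r  <= b;
    p[0] <= a_r * b_r;             // multiply feeds the first product register
    for (i = 1; i < STAGES; i = i + 1)
        p[i] <= p[i-1];            // remaining registers only delay the result
end

assign result = p[STAGES-1];

endmodule
```

With STAGES = 1 this behaves like a 2-cycle multiplier without reset. Larger values give the tool more registers to distribute through the multiplier, at one extra cycle of latency each; compare the timing reports rather than assuming a gain.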

5. Testbench — tb_mult_pipe.v

The testbench accounts for the 2-cycle pipeline latency by delaying a copy of the inputs by two clock cycles and comparing the product of the delayed inputs with result.

tb_mult_pipe.v
// tb_mult_pipe.v — self-checking testbench for mult_pipe
// Accounts for 2-cycle pipeline latency
`timescale 1ns/1ps

module tb_mult_pipe;

reg        clk = 0;
reg        rst = 1;
reg [15:0] a   = 0;
reg [15:0] b   = 0;
wire[31:0] result;

mult_pipe dut(.clk(clk),.rst(rst),.a(a),.b(b),.result(result));

always #5 clk = ~clk;

// Test vector pairs
reg [15:0] a_vec [0:7];
reg [15:0] b_vec [0:7];
reg [31:0] exp_vec [0:7];

initial begin
    // Build test vectors
    a_vec[0]=16'h0001; b_vec[0]=16'h0001; exp_vec[0]=32'h00000001;
    a_vec[1]=16'h0002; b_vec[1]=16'h0003; exp_vec[1]=32'h00000006;
    a_vec[2]=16'hFFFF; b_vec[2]=16'h0001; exp_vec[2]=32'h0000FFFF;
    a_vec[3]=16'hFFFF; b_vec[3]=16'hFFFF; exp_vec[3]=32'hFFFE0001;
    a_vec[4]=16'h1234; b_vec[4]=16'h5678; exp_vec[4]=32'h06260060;
    a_vec[5]=16'h0100; b_vec[5]=16'h0100; exp_vec[5]=32'h00010000;
    a_vec[6]=16'hABCD; b_vec[6]=16'h1234; exp_vec[6]=32'h0C374FA4;
    a_vec[7]=16'h0000; b_vec[7]=16'hFFFF; exp_vec[7]=32'h00000000;
end

integer pass_cnt = 0, fail_cnt = 0;
integer i;

// vld marks the cycles in which a/b carry a test vector (testbench-only flag)
reg vld = 0;

initial begin
    $dumpfile("tb_mult_pipe.vcd");
    $dumpvars(0, tb_mult_pipe);

    repeat(4) @(posedge clk);
    rst <= 0;   // non-blocking, so the release cannot race the DUT's clocked block
    repeat(2) @(posedge clk);

    // Send one vector per clock; each result appears 2 cycles later
    // and is checked by the monitoring block below
    for (i = 0; i < 8; i = i + 1) begin
        @(posedge clk); a <= a_vec[i]; b <= b_vec[i]; vld <= 1;
    end
    @(posedge clk); a <= 0; b <= 0; vld <= 0;  // flush pipeline

    // Wait for the last results to drain (2-cycle latency plus margin)
    repeat(6) @(posedge clk);

    if (fail_cnt == 0)
        $display("\nALL TESTS PASSED (%0d/%0d)", pass_cnt, pass_cnt+fail_cnt);
    else
        $display("\nFAILED: %0d passed, %0d failed", pass_cnt, fail_cnt);

    $finish;
end

// Self-checking: delay the inputs and vld by 2 cycles so they line up with result
reg [15:0] a_d1 = 0, b_d1 = 0, a_d2 = 0, b_d2 = 0;
reg        vld_d1 = 0, vld_d2 = 0;
reg [31:0] exp_result;
integer check_num = 0;

always @(posedge clk) begin
    a_d1 <= a;    b_d1 <= b;    vld_d1 <= vld;
    a_d2 <= a_d1; b_d2 <= b_d1; vld_d2 <= vld_d1;
    if (vld_d2) begin            // only check cycles that carry a test vector
        exp_result = a_d2 * b_d2;
        if (result === exp_result) begin
            $display("PASS [%0d]: %0d × %0d = %0d", check_num, a_d2, b_d2, result);
            pass_cnt = pass_cnt + 1;
        end else begin
            $display("FAIL [%0d]: got %0d exp %0d", check_num, result, exp_result);
            fail_cnt = fail_cnt + 1;
        end
        check_num = check_num + 1;
    end
end

initial #5000 begin $display("TIMEOUT"); $finish; end

endmodule

6. Expected output

PASS [0]: 1 × 1 = 1
PASS [1]: 2 × 3 = 6
PASS [2]: 65535 × 1 = 65535
PASS [3]: 65535 × 65535 = 4294836225
PASS [4]: 4660 × 22136 = 103153760
PASS [5]: 256 × 256 = 65536
PASS [6]: 43981 × 4660 = 204951460
PASS [7]: 0 × 65535 = 0

ALL TESTS PASSED (8/8)

7. Other optimisation techniques

Resource sharing

If your design uses the same multiplier with different operands at different times, the synthesis tool can share one physical multiplier (or DSP48) by placing a mux on its operands. This saves area at the cost of throughput. You can help the tool by time-multiplexing explicitly in RTL rather than writing several multiplies of which only one is active at a time.
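As a rough sketch of explicit time-multiplexing, the module below routes two operand pairs through a single multiply. The module name shared_mult, the sel input, and the port names are hypothetical and chosen only for this example; a real design would usually drive sel from an FSM or a counter.

```verilog
// shared_mult.v : hypothetical example of one multiplier shared by two products
module shared_mult (
    input  wire        clk,
    input  wire        sel,      // 0: compute a0*b0, 1: compute a1*b1
    input  wire [15:0] a0,
    input  wire [15:0] b0,
    input  wire [15:0] a1,
    input  wire [15:0] b1,
    output reg  [31:0] p0,
    output reg  [31:0] p1
);

// Operand muxes in front of the single multiplier
wire [15:0] op_a = sel ? a1 : a0;
wire [15:0] op_b = sel ? b1 : b0;
wire [31:0] prod = op_a * op_b;  // the only '*' in the module

// Route the shared result to the register that owns it this cycle
always @(posedge clk) begin
    if (sel) p1 <= prod;
    else     p0 <= prod;
end

endmodule
```

Each product is now refreshed only on the cycles where sel selects it, which is the throughput cost mentioned above; in exchange the design needs one multiplier plus two 16-bit muxes instead of two multipliers.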

Retiming

Retiming moves registers across combinational logic to balance pipeline stage delays. If stage A takes 6 ns and stage B takes 2 ns, the clock period is limited by the 6 ns stage (~167 MHz). Retiming moves some of A's logic into B so that both take ~4 ns, raising Fmax to ~250 MHz, a 1.5× gain. In Vivado, enable it globally with the synth_design -retiming option (named -global_retiming in newer releases) or in the synthesis settings, or per module with: set_property BLOCK_SYNTH.RETIMING 1 [get_cells ...].
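What the tool does during retiming can also be done by hand. The two modules below are a hypothetical illustration (sum4_unbalanced and sum4_retimed are made-up names): both compute a + b + c + d with a latency of two cycles and produce identical outputs on every cycle, but the second one has its first register stage moved back across the final adder so that each stage contains one adder level.

```verilog
// Unbalanced: all three additions sit in stage 1, stage 2 is a bare register
module sum4_unbalanced (
    input  wire        clk,
    input  wire [15:0] a, b, c, d,
    output reg  [17:0] y
);
reg [17:0] s;
always @(posedge clk) begin
    s <= a + b + c + d;   // stage 1: long path through the whole sum
    y <= s;               // stage 2: no logic at all
end
endmodule

// Retimed by hand: the stage-1 register moved back across the final adder
module sum4_retimed (
    input  wire        clk,
    input  wire [15:0] a, b, c, d,
    output reg  [17:0] y
);
reg [16:0] ab, cd;        // 17 bits keep the carry of each partial sum
always @(posedge clk) begin
    ab <= a + b;          // stage 1: one adder level
    cd <= c + d;
    y  <= ab + cd;        // stage 2: one adder level
end
endmodule
```

The retimed version needs 52 flip-flops instead of 36, which is the usual price of moving a register backward across logic that merges several signals. How much Fmax it gains depends on the device and on how the tool would have mapped the unbalanced version.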

Register balancing

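Related to retiming, but applied to fanout: a register that drives many logic cones can be duplicated so that each copy drives fewer loads, which reduces routing congestion and improves timing. Vivado does this automatically in most cases. To request it explicitly, put a (* MAX_FANOUT = N *) attribute on the register. Setting (* KEEP_HIERARCHY = "FALSE" *) on a module does not force duplication by itself, but it allows the tool to restructure logic across hierarchy boundaries.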
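A small sketch of how a fanout limit can be requested in RTL follows. The module fanout_en and the limit of 16 are assumptions chosen for illustration; MAX_FANOUT is a Vivado synthesis attribute, simulators ignore it, and the tool is free to treat it as a hint.

```verilog
// fanout_en.v : hypothetical example of a high-fanout enable register
module fanout_en (
    input  wire        clk,
    input  wire        en,
    input  wire [63:0] d,
    output reg  [63:0] q
);

// en_r drives the clock enable of all 64 q flip-flops. The attribute asks
// synthesis to replicate en_r so that no copy drives more than 16 loads.
(* max_fanout = 16 *) reg en_r = 1'b0;

always @(posedge clk) begin
    en_r <= en;
    if (en_r)
        q <= d;
end

endmodule
```

After synthesis you would expect to see several copies of en_r in the netlist, each feeding a subset of the q registers. Check the utilisation and timing reports to confirm that the replication happened and that it helped.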

LUT packing and logic duplication

Vivado's placer can pack multiple small functions into one LUT6. For critical paths, it may also duplicate a fanout-heavy register to reduce routing delays. This trades area for speed — acceptable for timing-critical paths.

8. Optimisation decision guide

Goal | Technique | Cost
Higher Fmax | Add pipeline stages | More registers, more latency
Lower area | Resource sharing, reduce parallelism | Lower throughput
Both Fmax and area | Retiming, register balancing | Toolflow complexity
Lower power | Clock gating, operand isolation | More control logic
Reduce LUT depth | Restructure logic tree, add intermediate registers | More flip-flops

Course Complete

You have completed all 25 days of FPGA from Scratch — from what an FPGA is, through LUTs, clocking, FSMs, UART, BRAM, timing constraints, CDC, PLLs, DSP, SPI, I2C, VGA, AXI, and optimisation. You now have the foundation to build real FPGA systems. Keep building — every project makes the next one faster.

Key Takeaways — Day 25 & Course Summary

Frequently Asked Questions

What is the trade-off with pipelining?

Pipelining increases Fmax by shortening each stage's critical path, but adds latency (clock cycles before the first result). Throughput stays at one result per cycle once the pipeline is full. For streaming applications this is usually the right trade-off.

What is resource sharing?

Using one hardware unit (multiplier, adder) for multiple operations by time-multiplexing. Instead of N parallel multipliers for N operations, one multiplier handles them sequentially, saving area at the cost of throughput. Synthesis tools can do this automatically.

What is retiming?

Moving registers across combinational logic to balance pipeline stage delays. If one stage has a 6 ns path and another has 2 ns, retiming can redistribute logic to make both ~4 ns, raising Fmax from ~167 MHz to ~250 MHz. Enable it in the Vivado synthesis settings or per module with the BLOCK_SYNTH.RETIMING property.
