You have built UART, BRAM, SPI, I2C, VGA, AXI, and DSP blocks. Now it is time to close the loop: understanding how to make designs fast and small. Pipelining, resource sharing, retiming, and register balancing are the four pillars of FPGA performance optimisation. This final lesson compares a combinational multiplier against a 2-stage pipelined version and teaches you how the synthesis tool can help — and when you must act yourself.
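A 16×16-bit combinational multiplier produces its result within a single clock cycle, but its critical path runs through every carry-save adder row of the partial-product tree: typically 4–7 ns on a 7-series FPGA, which limits Fmax to roughly 150–250 MHz. A 2-stage pipelined version registers the inputs and the product. When the tool absorbs those registers into a DSP48 slice's internal pipeline registers, or retimes them into a LUT-based multiplier, the longest register-to-register path shrinks to roughly 2–3 ns and Fmax rises to roughly 350–450 MHz. Treat these figures as rough estimates and confirm them with your own timing report.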
Pipelining does NOT reduce throughput — once the pipeline is full, you still get one result per clock cycle. It only increases latency (time from input to output). For streaming data (ADC samples, filter taps) latency is usually acceptable. For loop-carried dependencies (result feeds back to input), pipeline depth adds critical latency to the loop.
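As a rough worked example of the latency/throughput distinction, assume a pipeline with a latency of $L$ cycles and a clock period $T_{\text{clk}}$ that accepts one input per cycle. The first result is available $L$ periods after the first input is presented, and each later result follows one period behind the previous one, so the last of $N$ results is available after

$$T_{\text{total}} = (L + N - 1)\,T_{\text{clk}}$$

With the illustrative values $L = 2$, $N = 1000$ and $T_{\text{clk}} = 2.5\ \text{ns}$ (400 MHz), this gives $T_{\text{total}} = 1001 \times 2.5\ \text{ns} = 2502.5\ \text{ns}$. A combinational multiplier clocked at 200 MHz ($T_{\text{clk}} = 5\ \text{ns}$, one result per period) needs about $1000 \times 5\ \text{ns} = 5000\ \text{ns}$ for the same work. The two clock rates are assumptions chosen to be in line with the estimates in this lesson, not measurements.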
```verilog
// mult_comb.v: combinational 16x16 unsigned multiplier
// Result available in the same clock cycle (0 latency)
// Critical path: full partial product tree, which limits Fmax
// Synthesis: maps to DSP48 or LUT multiplier depending on tool settings
module mult_comb (
    input  wire [15:0] a,
    input  wire [15:0] b,
    output wire [31:0] result
);
    assign result = a * b;  // combinational, no registers
endmodule
```
```verilog
// mult_pipe.v: 2-stage pipelined 16x16 unsigned multiplier
// Stage 1: register the inputs
// Stage 2: multiply the registered inputs and register the product
// Latency: 2 clock cycles
// Fmax: ~2x higher than mult_comb (estimate) when the tool can absorb
//       these registers into the DSP48 or retime them into the multiplier
module mult_pipe (
    input  wire        clk,
    input  wire        rst,
    input  wire [15:0] a,
    input  wire [15:0] b,
    output reg  [31:0] result
);
    // Registered copies of A and B. Registers placed directly in front of
    // the multiply can be absorbed into the DSP48 input registers.
    reg [15:0] a_r, b_r;

    always @(posedge clk) begin
        if (rst) begin
            a_r    <= 16'd0;
            b_r    <= 16'd0;
            result <= 32'd0;
        end else begin
            // Stage 1: capture inputs
            a_r <= a;
            b_r <= b;
            // Stage 2: multiply the registered inputs, register the product
            result <= a_r * b_r;
        end
    end
endmodule
```
| Metric | mult_comb | mult_pipe (2-stage) |
|---|---|---|
| Latency | 0 cycles (combinational) | 2 clock cycles |
| Throughput | 1 result/cycle (if clocked externally) | 1 result/cycle (after fill) |
| Critical path (est.) | ~4–7 ns (full multiplier tree) | ~2–3 ns (register to register) |
| Fmax (7-series est.) | ~150–250 MHz | ~350–450 MHz |
| Flip-flops | 0 (pure combinational) | ~64 (input + output regs) |
| DSP48 slices | 1 (if tool infers) | 1 (pipeline maps cleanly) |
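Because results lag the inputs by two cycles, downstream logic usually needs to know which output cycles carry real data. One common approach is to delay a valid flag by the same number of stages as the data. The wrapper below is a minimal sketch of that idea. The module name `mult_pipe_valid` and the ports `in_valid` and `out_valid` are hypothetical and not part of the lesson's code, and the sketch assumes `mult_pipe` has exactly 2 cycles of latency.

```verilog
// mult_pipe_valid.v: hypothetical wrapper, not part of the original lesson.
// Assumes mult_pipe has exactly 2 cycles of latency.
module mult_pipe_valid (
    input  wire        clk,
    input  wire        rst,
    input  wire        in_valid,   // high when a/b carry real operands
    input  wire [15:0] a,
    input  wire [15:0] b,
    output wire        out_valid,  // high when result carries a real product
    output wire [31:0] result
);
    // One flag bit per pipeline stage, shifted in step with the data.
    reg [1:0] valid_sr;

    always @(posedge clk) begin
        if (rst)
            valid_sr <= 2'b00;
        else
            valid_sr <= {valid_sr[0], in_valid};
    end

    assign out_valid = valid_sr[1];

    mult_pipe u_mult (
        .clk(clk), .rst(rst), .a(a), .b(b), .result(result)
    );
endmodule
```

If the pipeline depth of the multiplier changes, the shift register length has to change with it, otherwise `out_valid` and `result` drift out of alignment.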
The testbench accounts for the 2-cycle pipeline latency by delaying a copy of the inputs by two clock cycles and comparing the DUT output against the product of those delayed inputs.
```verilog
// tb_mult_pipe.v: self-checking testbench for mult_pipe
// Accounts for the 2-cycle pipeline latency
`timescale 1ns/1ps
module tb_mult_pipe;
    reg         clk = 0;
    reg         rst = 1;
    reg  [15:0] a   = 0;
    reg  [15:0] b   = 0;
    reg         vld = 0;    // high while a/b carry a test vector
    wire [31:0] result;

    mult_pipe dut (.clk(clk), .rst(rst), .a(a), .b(b), .result(result));

    always #5 clk = ~clk;   // 10 ns clock period

    // Test vectors with precomputed expected products
    reg [15:0] a_vec   [0:7];
    reg [15:0] b_vec   [0:7];
    reg [31:0] exp_vec [0:7];
    initial begin
        a_vec[0]=16'h0001; b_vec[0]=16'h0001; exp_vec[0]=32'h00000001;
        a_vec[1]=16'h0002; b_vec[1]=16'h0003; exp_vec[1]=32'h00000006;
        a_vec[2]=16'hFFFF; b_vec[2]=16'h0001; exp_vec[2]=32'h0000FFFF;
        a_vec[3]=16'hFFFF; b_vec[3]=16'hFFFF; exp_vec[3]=32'hFFFE0001;
        a_vec[4]=16'h1234; b_vec[4]=16'h5678; exp_vec[4]=32'h06260060;
        a_vec[5]=16'h0100; b_vec[5]=16'h0100; exp_vec[5]=32'h00010000;
        a_vec[6]=16'hABCD; b_vec[6]=16'h1234; exp_vec[6]=32'h0C374FA4;
        a_vec[7]=16'h0000; b_vec[7]=16'hFFFF; exp_vec[7]=32'h00000000;
    end

    integer pass_cnt = 0, fail_cnt = 0;
    integer i;

    initial begin
        $dumpfile("tb_mult_pipe.vcd");
        $dumpvars(0, tb_mult_pipe);
        repeat (4) @(posedge clk);
        rst <= 0;   // nonblocking, so clocked blocks see it on the next edge
        repeat (2) @(posedge clk);
        // Send one vector per clock; each result appears 2 cycles later
        for (i = 0; i < 8; i = i + 1) begin
            @(posedge clk);
            a <= a_vec[i]; b <= b_vec[i]; vld <= 1;
        end
        @(posedge clk);
        a <= 0; b <= 0; vld <= 0;   // flush pipeline
        repeat (6) @(posedge clk);  // let the last results drain
        if (fail_cnt == 0 && pass_cnt == 8)
            $display("\nALL TESTS PASSED (%0d/%0d)", pass_cnt, pass_cnt+fail_cnt);
        else
            $display("\nFAILED: %0d passed, %0d failed", pass_cnt, fail_cnt);
        $finish;
    end

    // Self-checking: delay the inputs and the valid flag by 2 cycles so
    // they line up with the result coming out of the pipeline
    reg [15:0] a_d1 = 0, b_d1 = 0, a_d2 = 0, b_d2 = 0;
    reg        vld_d1 = 0, vld_d2 = 0;
    reg [31:0] exp_result;
    integer    check_num = 0;
    always @(posedge clk) begin
        a_d1 <= a;    b_d1 <= b;    vld_d1 <= vld;
        a_d2 <= a_d1; b_d2 <= b_d1; vld_d2 <= vld_d1;
        if (vld_d2) begin
            exp_result = a_d2 * b_d2;
            // Check against both the runtime product and the stored constant
            if (result === exp_result && result === exp_vec[check_num]) begin
                $display("PASS [%0d]: %0d × %0d = %0d", check_num, a_d2, b_d2, result);
                pass_cnt = pass_cnt + 1;
            end else begin
                $display("FAIL [%0d]: got %0d exp %0d", check_num, result, exp_result);
                fail_cnt = fail_cnt + 1;
            end
            check_num = check_num + 1;
        end
    end

    initial #5000 begin $display("TIMEOUT"); $finish; end
endmodule
```
```text
PASS [0]: 1 × 1 = 1
PASS [1]: 2 × 3 = 6
PASS [2]: 65535 × 1 = 65535
PASS [3]: 65535 × 65535 = 4294836225
PASS [4]: 4660 × 22136 = 103153760
PASS [5]: 256 × 256 = 65536
PASS [6]: 43981 × 4660 = 204951460
PASS [7]: 0 × 65535 = 0

ALL TESTS PASSED (8/8)
```
Resource sharing applies when a design uses a multiplier with different operands at different times: the synthesis tool can then share one physical multiplier (or DSP48) through operand multiplexers. This saves area at the cost of throughput. You can help the tool by time-multiplexing explicitly in RTL, with one multiply fed by muxed operands, rather than writing several multiplies of which only one is active at a time.
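A minimal sketch of explicit sharing is shown here. The module name `shared_mult` and its ports are hypothetical, and the example assumes the two products are never needed in the same cycle.

```verilog
// shared_mult.v: hypothetical example of explicit resource sharing.
// Two products are needed, but never in the same cycle, so one
// multiplier is time-multiplexed between the two operand pairs.
module shared_mult (
    input  wire        clk,
    input  wire        sel,       // 0: compute a0*b0, 1: compute a1*b1
    input  wire [15:0] a0, b0,
    input  wire [15:0] a1, b1,
    output reg  [31:0] product
);
    // Operand muxes in front of the single multiplier
    wire [15:0] op_a = sel ? a1 : a0;
    wire [15:0] op_b = sel ? b1 : b0;

    // Only one "*" appears in the RTL, so only one multiplier is inferred
    always @(posedge clk)
        product <= op_a * op_b;
endmodule
```

Writing `product <= sel ? a1 * b1 : a0 * b0;` describes the same function, but whether the tool then builds one multiplier or two depends on its sharing heuristics. Muxing the operands yourself removes that uncertainty.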
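Retiming moves registers across combinational logic to balance pipeline stage delays. If stage A takes 6 ns and stage B takes 2 ns, the clock period is limited by the 6 ns stage (about 167 MHz). Retiming moves some of A's logic into B so that both take about 4 ns, which raises Fmax to about 250 MHz, a 1.5× improvement. In Vivado, retiming can be enabled globally with the -retiming option of synth_design (also exposed in the synthesis settings) or per block with set_property BLOCK_SYNTH.RETIMING 1 [get_cells ...]. Check the synthesis user guide for your Vivado version, since the available controls differ between releases.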
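A common way to give retiming something to work with is to place a spare register stage next to a slow block and let the tool move it. The sketch below is one such candidate. The module name `mac_retime` is hypothetical, and whether the tool actually moves the register depends on retiming being enabled and on the target device.

```verilog
// mac_retime.v: hypothetical candidate for register retiming.
// All logic sits in front of the first register and the second register
// has no logic in front of it, so the two stages are badly unbalanced.
module mac_retime (
    input  wire        clk,
    input  wire [15:0] a,
    input  wire [15:0] b,
    input  wire [31:0] c,
    output reg  [32:0] y
);
    reg [32:0] sum_r;

    always @(posedge clk) begin
        sum_r <= a * b + c;  // long path: multiply, then add
        y     <= sum_r;      // empty stage the tool may borrow from
    end
endmodule
```

Latency is 2 cycles before and after retiming, since the tool only changes where the registers sit inside the logic and not how many there are on each path. The registers here have no reset, which in general leaves the tool more freedom to move them.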
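Register balancing, as the term is used here, is similar to retiming but applied to fanout: a register that drives many logic cones can be duplicated so that each copy drives fewer loads, which reduces routing congestion and improves timing. Vivado does this automatically in many cases. To request it explicitly, put a fanout limit such as (* MAX_FANOUT = 16 *) on the register. Setting (* KEEP_HIERARCHY = "FALSE" *) on a module does not force duplication by itself, but it lets the tool restructure logic across that hierarchy boundary.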
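The fanout case can be sketched with a fanout-limit attribute on the register. The module name `fanout_hint` and the limit of 16 are illustrative assumptions; `max_fanout` is a Vivado synthesis attribute, and other tools use different names.

```verilog
// fanout_hint.v: hypothetical example of a fanout limit on a register.
// en_r drives the enable of all 64 data flops. The attribute asks the
// tool to replicate en_r so that no copy drives more than 16 loads.
module fanout_hint (
    input  wire        clk,
    input  wire        en_in,
    input  wire [63:0] d,
    output reg  [63:0] q
);
    (* max_fanout = 16 *) reg en_r;

    always @(posedge clk) begin
        en_r <= en_in;
        if (en_r)
            q <= d;
    end
endmodule
```

The attribute is a request to the tool and not a guarantee, so it is worth confirming the replication in the post-synthesis netlist or the utilisation report.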
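LUT packing works in the opposite direction: Vivado can combine two small functions that share inputs into a single LUT6 by using both of its outputs, which saves area. On critical paths the tool may instead duplicate a fanout-heavy register to reduce routing delays. That trades area for speed, which is acceptable for timing-critical paths.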
| Goal | Technique | Cost |
|---|---|---|
| Higher Fmax | Add pipeline stages | More registers, more latency |
| Lower area | Resource sharing, reduce parallelism | Lower throughput |
| Both Fmax and area | Retiming, register balancing | Toolflow complexity |
| Lower power | Clock gating, operand isolation | More control logic |
| Reduce LUT depth | Restructure logic tree, add intermediate registers | More flip-flops |
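The last row of the table can be illustrated with a four-input adder. The sketch below restructures a chain of three adders into a two-level tree and places a register between the levels. The module name `add4_tree` is hypothetical.

```verilog
// add4_tree.v: hypothetical example of restructuring a chain into a tree.
// Chain form, three adders in series:  ((a + b) + c) + d
// Tree form, two adder levels:         (a + b) + (c + d)
module add4_tree (
    input  wire        clk,
    input  wire [15:0] a, b, c, d,
    output reg  [17:0] sum
);
    reg [16:0] ab_r, cd_r;  // intermediate registers after level 1

    always @(posedge clk) begin
        ab_r <= a + b;        // level 1: both adders work in parallel
        cd_r <= c + d;
        sum  <= ab_r + cd_r;  // level 2: one adder
    end
endmodule
```

Each register-to-register path now contains a single adder, at the cost of 34 extra flip-flops and one extra cycle of latency. Synthesis tools often rebalance such expressions on their own, so the explicit parenthesisation is mainly a way to make the intent visible in the RTL.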
You have completed all 25 days of FPGA from Scratch — from what an FPGA is, through LUTs, clocking, FSMs, UART, BRAM, timing constraints, CDC, PLLs, DSP, SPI, I2C, VGA, AXI, and optimisation. You now have the foundation to build real FPGA systems. Keep building — every project makes the next one faster.
Pipelining increases Fmax by shortening each stage's critical path, but adds latency (clock cycles before the first result). Throughput stays at one result per cycle once the pipeline is full. For streaming applications this is usually the right trade-off.
Resource sharing means using one hardware unit (multiplier, adder) for several operations by time-multiplexing. Instead of N parallel multipliers for N operations, one multiplier handles them sequentially, saving area at the cost of throughput. Synthesis tools can do this automatically in some cases.
Retiming means moving registers across combinational logic to balance pipeline stage delays. If one stage has a 6 ns path and another has 2 ns, retiming can redistribute logic to make both about 4 ns, raising Fmax from about 167 MHz to about 250 MHz (1.5×). Enable it in the Vivado synthesis settings or with the -retiming option of synth_design.