Your design simulates perfectly but fails mysteriously on real hardware. The culprit is almost always timing. This lesson demystifies setup time, hold time, critical paths, and WNS/TNS. You will write real XDC constraints and build a pipelined adder that demonstrates how pipeline registers break critical paths and push Fmax higher.
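Every flip-flop has two timing requirements around its clock edge:
- Setup time (Tsu): the minimum time data must be stable before the clock edge.
- Hold time (Th): the minimum time data must remain stable after the clock edge.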
For a path from flip-flop A to flip-flop B:
Slack = Tclk − (Tco_A + Tcomb + Tsu_B)
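Here Tclk is the clock period, Tco_A is the clock-to-output delay of A, Tcomb is the combinational logic delay between A and B, and Tsu_B is the setup time of B. This simplified form ignores clock skew and clock uncertainty.
If slack < 0, the data arrives too late to be captured reliably at the target clock period, and the tool reports the path as a setup violation (negative slack).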
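A worked instance of the slack formula may help. The delay values below are made up purely for illustration and are not taken from any datasheet or timing report:

$$
\begin{aligned}
T_{clk} &= 10\,\text{ns},\quad T_{co,A} = 0.5\,\text{ns},\quad T_{comb} = 7.0\,\text{ns},\quad T_{su,B} = 0.1\,\text{ns} \\
\text{Slack} &= 10 - (0.5 + 7.0 + 0.1) = 2.4\,\text{ns} \quad \text{(timing met)} \\
\text{If } T_{comb} = 10.2\,\text{ns}:\quad \text{Slack} &= 10 - (0.5 + 10.2 + 0.1) = -0.8\,\text{ns} \quad \text{(setup violation)}
\end{aligned}
$$

In the second case the path would need 0.8 ns less logic delay, or a clock period at least 0.8 ns longer, to pass.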
| Metric | Meaning | Target |
|---|---|---|
| WNS | Worst Negative Slack — the single worst timing path | ≥ 0 ns |
| TNS | Total Negative Slack — sum of all negative-slack paths | 0 ns |
| WHS | Worst Hold Slack | ≥ 0 ns |
| THS | Total Hold Slack | 0 ns |
| Fmax | Maximum achievable clock frequency | ≥ your target |
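The timing summary reports slack, so Fmax is usually estimated from the constrained period and WNS rather than read directly. A sketch of that estimate, assuming a 10 ns constraint and hypothetical WNS values:

$$
\begin{aligned}
F_{max} &\approx \frac{1}{T_{clk} - \text{WNS}} \\
\text{WNS} = +2.4\,\text{ns}: \quad F_{max} &\approx \frac{1}{10 - 2.4\,\text{ns}} = \frac{1}{7.6\,\text{ns}} \approx 131.6\,\text{MHz} \\
\text{WNS} = -0.8\,\text{ns}: \quad F_{max} &\approx \frac{1}{10 + 0.8\,\text{ns}} = \frac{1}{10.8\,\text{ns}} \approx 92.6\,\text{MHz}
\end{aligned}
$$

This is only an estimate: the tools optimise towards the constraint they are given, so a tighter constraint can produce a different placement and a different result.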
# constraints.xdc — example timing constraints for Basys3/Nexys4 (100 MHz)
# 1. Define the clock on port clk (pin W5 is the Basys3 100 MHz oscillator;
#    the pin itself is assigned separately with set_property PACKAGE_PIN)
# Period = 10 ns (100 MHz)
create_clock -period 10.000 -name sys_clk -waveform {0.000 5.000} [get_ports clk]
# 2. Input delay — tell the tool how late inputs arrive after the clock edge
# If external data is valid 2 ns after the board clock edge:
set_input_delay -clock sys_clk -max 2.0 [get_ports {a[*] b[*]}]
set_input_delay -clock sys_clk -min 0.5 [get_ports {a[*] b[*]}]
# 3. Output delay — tell the tool when output must be valid before next clock
# If external device needs data 2 ns before next clock edge:
set_output_delay -clock sys_clk -max 2.0 [get_ports {result[*]}]
set_output_delay -clock sys_clk -min 0.5 [get_ports {result[*]}]
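# 4. False path: only appropriate when rst is asynchronous to sys_clk
# (for example a push-button). pipe_add uses rst as a synchronous reset, so if
# rst is driven synchronously to sys_clk, drop this line and constrain rst with
# set_input_delay like the other inputs.
set_false_path -from [get_ports rst]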
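The effect of the I/O delays above on the internal timing budget can be sketched as follows. This is a simplified view that ignores clock skew and uncertainty; the symbols T_in,max and T_out,max simply stand for the `-max` values passed to set_input_delay and set_output_delay:

$$
\begin{aligned}
\text{Input path:}\quad \text{Slack} &= T_{clk} - (T_{in,max} + T_{comb} + T_{su}) \\
&\Rightarrow\; T_{comb} + T_{su} \le 10 - 2.0 = 8.0\,\text{ns} \\
\text{Output path:}\quad \text{Slack} &= T_{clk} - (T_{co} + T_{comb} + T_{out,max}) \\
&\Rightarrow\; T_{co} + T_{comb} \le 10 - 2.0 = 8.0\,\text{ns}
\end{aligned}
$$

In other words, each 2 ns of external delay is subtracted from the 10 ns period, leaving 8 ns for the logic inside the FPGA on that path.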
| Port | Dir | Width | Description |
|---|---|---|---|
| clk | IN | 1 | System clock |
| rst | IN | 1 | Synchronous reset |
| a | IN | 16 | First addend, captured at stage 1 |
| b | IN | 16 | Second addend, captured at stage 1 |
| valid_in | IN | 1 | Marks a and b as valid in the current cycle |
| result | OUT | 17 | Sum a+b, valid 2 cycles after inputs (2-stage pipeline) |
| valid_out | OUT | 1 | valid_in delayed by 2 cycles, aligned with result |
// pipe_add.v — 2-stage pipelined 16-bit adder
// Stage 1: capture inputs and compute partial sum (lower 8 bits + carry)
// Stage 2: complete upper 8 bits + combine — result appears 2 cycles later
// This shows how pipeline registers break the critical path.
module pipe_add (
input wire clk,
input wire rst,
input wire [15:0] a,
input wire [15:0] b,
input wire valid_in,
output reg [16:0] result,
output reg valid_out
);
// ---- Stage 1 registers ----
// Latch inputs and compute low-byte sum
reg [7:0] s1_sum_lo; // a[7:0] + b[7:0]
reg s1_carry; // carry out from low byte
reg [7:0] s1_a_hi; // upper bytes passed through
reg [7:0] s1_b_hi;
reg s1_valid;
always @(posedge clk) begin
if (rst) begin
s1_sum_lo <= 0;
s1_carry <= 0;
s1_a_hi <= 0;
s1_b_hi <= 0;
s1_valid <= 0;
end else begin
{s1_carry, s1_sum_lo} <= a[7:0] + b[7:0]; // critical path: 8-bit adder
s1_a_hi <= a[15:8];
s1_b_hi <= b[15:8];
s1_valid <= valid_in;
end
end
// ---- Stage 2 registers ----
// Compute upper-byte sum using carry from stage 1
always @(posedge clk) begin
if (rst) begin
result <= 0;
valid_out <= 0;
end else begin
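// Zero-extend the upper bytes to 9 bits: operands inside a concatenation are
// self-determined, so an 8-bit sum here would silently drop the carry into result[16].
result <= {({1'b0, s1_a_hi} + {1'b0, s1_b_hi} + s1_carry), s1_sum_lo}; // 8-bit adder + carry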
valid_out <= s1_valid;
end
end
endmodule
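Without pipelining, a 16-bit adder has a carry chain that ripples through 16 full-adder stages. On Xilinx 7-series this might take ~3 ns, which would limit Fmax to roughly 333 MHz if the adder dominated the path delay (Tco and Tsu ignored). Splitting at bit 8 leaves each stage with only 8 full-adder delays (~1.5 ns), roughly doubling Fmax at the cost of 1 extra cycle of latency. These figures are illustrative; the real numbers come from the post-route timing report.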
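For comparison, here is a minimal sketch of the non-pipelined baseline that the paragraph above describes. The module name `flat_add` is hypothetical and not part of the lesson's files; it keeps the same ports as pipe_add so that the same constraints apply, and its result appears 1 cycle after the inputs instead of 2.

```verilog
// flat_add.v: hypothetical single-stage baseline for comparison with pipe_add
module flat_add (
    input  wire        clk,
    input  wire        rst,        // synchronous reset, as in pipe_add
    input  wire [15:0] a,
    input  wire [15:0] b,
    input  wire        valid_in,
    output reg  [16:0] result,
    output reg         valid_out
);
    always @(posedge clk) begin
        if (rst) begin
            result    <= 17'd0;
            valid_out <= 1'b0;
        end else begin
            // The full 16-bit carry chain sits in front of this register.
            // The 17-bit left-hand side widens the addition, so the carry is kept.
            result    <= a + b;
            valid_out <= valid_in;
        end
    end
endmodule
```

Running both modules through implementation with the same create_clock period and comparing WNS in the timing summary is one way to see how much the extra register stage actually buys on a given device.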
// tb_pipe_add.v — self-checking testbench for pipe_add
// Accounts for 2-cycle pipeline latency
`timescale 1ns/1ps
module tb_pipe_add;
reg clk = 0;
reg rst = 1;
reg [15:0] a = 0, b = 0;
reg valid_in = 0;
wire[16:0] result;
wire valid_out;
pipe_add dut(.clk(clk),.rst(rst),.a(a),.b(b),.valid_in(valid_in),
.result(result),.valid_out(valid_out));
always #5 clk = ~clk;
integer pass_cnt = 0, fail_cnt = 0;
// Optional expected-value queue and helper task (not used by the directed test below)
reg [16:0] exp_q [0:3];
reg vld_q [0:3];
integer i;
task apply;
input [15:0] ta, tb;
begin
@(posedge clk);
a <= ta; b <= tb; valid_in <= 1;
// Push expected into queue
exp_q[0] <= ta + tb;
vld_q[0] <= 1;
end
endtask
initial begin
$dumpfile("tb_pipe_add.vcd");
$dumpvars(0, tb_pipe_add);
repeat(4) @(posedge clk);
rst <= 0; // nonblocking, so the release does not race the DUT's posedge sampling
// Send a series of inputs
// Results appear 2 cycles later on valid_out
@(posedge clk); a<=16'h0001; b<=16'h0002; valid_in<=1;
@(posedge clk); a<=16'hFFFF; b<=16'h0001; valid_in<=1;
@(posedge clk); a<=16'h1234; b<=16'h5678; valid_in<=1;
@(posedge clk); a<=16'hAAAA; b<=16'h5555; valid_in<=1;
@(posedge clk); valid_in<=0;
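// Wait for the pipeline to drain: 2 cycles of latency plus margin
repeat(4) @(posedge clk);
// The monitor below checks the results and normally ends the run first.
// Reaching this point means fewer than 4 valid results were seen.
$display("FAILED: only %0d of 4 results checked", pass_cnt + fail_cnt);
$finish;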
end
// Self-checking monitor
reg [16:0] expected_vals [0:3];
integer check_idx = 0;
initial begin
expected_vals[0] = 17'h00003; // 1+2
expected_vals[1] = 17'h10000; // 0xFFFF+1 = 0x10000
expected_vals[2] = 17'h068AC; // 0x1234+0x5678
expected_vals[3] = 17'h0FFFF; // 0xAAAA+0x5555
end
always @(posedge clk) begin
if (valid_out && check_idx < 4) begin
if (result === expected_vals[check_idx]) begin
$display("PASS [%0d]: result=0x%05X", check_idx, result);
pass_cnt = pass_cnt + 1;
end else begin
$display("FAIL [%0d]: got=0x%05X exp=0x%05X", check_idx, result, expected_vals[check_idx]);
fail_cnt = fail_cnt + 1;
end
check_idx = check_idx + 1;
if (check_idx == 4) begin
if (fail_cnt == 0)
$display("\nALL TESTS PASSED (%0d/4)", pass_cnt);
else
$display("\nFAILED: %0d passed, %0d failed", pass_cnt, fail_cnt);
$finish;
end
end
end
initial #2000 begin $display("TIMEOUT"); $finish; end
endmodule
PASS [0]: result=0x00003
PASS [1]: result=0x10000
PASS [2]: result=0x068AC
PASS [3]: result=0x0FFFF

ALL TESTS PASSED (4/4)
create_clock is mandatory: without it the tools assume an infinite period and make no effort to optimise timing.
Use set_input_delay and set_output_delay for correct I/O timing analysis.
Setup time (Tsu) is the minimum time data must be stable before the clock edge. Hold time (Th) is the minimum time data must remain stable after the clock edge. Violating setup time causes wrong data capture. Violating hold time can cause the flip-flop to latch a transitioning value.
WNS (Worst Negative Slack) is the slack of the worst setup path. If it is negative, the design cannot run reliably at the target clock. TNS (Total Negative Slack) is the sum of the slack of all failing paths, so it is never positive. Timing closure requires WNS ≥ 0 and TNS = 0.
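As a small illustration of how the two metrics relate, suppose a design had just four setup paths with the hypothetical slacks below (real reports cover many thousands of paths):

$$
\begin{aligned}
\text{Path slacks} &= \{+1.2,\; -0.3,\; -0.5,\; +0.8\}\,\text{ns} \\
\text{WNS} &= \min(+1.2,\, -0.3,\, -0.5,\, +0.8) = -0.5\,\text{ns} \\
\text{TNS} &= (-0.3) + (-0.5) = -0.8\,\text{ns}
\end{aligned}
$$

WNS shows how far the worst path is from passing, while TNS gives a sense of how widespread the failures are: a TNS close to WNS means only a few paths fail, whereas a TNS many times larger than WNS points to a broader problem.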
Pipelining inserts registers between combinational stages, giving each stage a shorter critical path. The clock can run faster (higher Fmax) at the cost of extra clock cycles of latency before results appear.