Setup violations mean your critical paths are too slow for the target clock frequency. Fixing them is the central activity of timing closure — an iterative, systematic process that demands understanding which technique to apply to which type of violation. The goal is to reduce path delay until slack ≥ 0 at every corner, while minimising the collateral damage to power, area, and hold timing.
Setup closure is never done in one pass. It is an iterative loop that converges as violations are progressively fixed.
Each iteration should address a specific category of violations, not random individual paths. Fixing one path often shifts the critical path to the next-worst, so working on the worst group of paths in each iteration converges faster than fixing paths one by one.
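The "worst group first" idea can be sketched in Python under a few assumptions. Setup slack is modelled as required time (clock period plus capture-clock latency minus setup time) minus data arrival time. The path records and the category labels (`wire_dominated`, `logic_depth`) are hypothetical. Groups are ranked by total negative slack (TNS), which is the sum of the negative slacks in each group.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TimingPath:
    name: str
    category: str              # hypothetical violation-category label
    arrival_ns: float          # data arrival at the capture FF's D pin, from the launch edge
    period_ns: float           # clock period
    capture_latency_ns: float = 0.0
    setup_ns: float = 0.0


def setup_slack(p: TimingPath) -> float:
    """Slack = required time - arrival time (negative means a setup violation)."""
    required = p.period_ns + p.capture_latency_ns - p.setup_ns
    return required - p.arrival_ns


def worst_group(paths: list[TimingPath]) -> tuple[str, float]:
    """Return (category, TNS) for the category with the most negative total negative slack."""
    tns: dict[str, float] = defaultdict(float)
    for p in paths:
        s = setup_slack(p)
        if s < 0:
            tns[p.category] += s
    if not tns:
        return "", 0.0
    worst = min(tns, key=lambda c: tns[c])
    return worst, tns[worst]


paths = [
    TimingPath("p1", "wire_dominated", arrival_ns=1.10, period_ns=1.0, setup_ns=0.05),
    TimingPath("p2", "logic_depth", arrival_ns=1.30, period_ns=1.0, setup_ns=0.05),
    TimingPath("p3", "wire_dominated", arrival_ns=1.20, period_ns=1.0, setup_ns=0.05),
    TimingPath("p4", "wire_dominated", arrival_ns=0.90, period_ns=1.0, setup_ns=0.05),
]
print(worst_group(paths))
```

In this example `p2` is the single worst path. The wire-dominated group still carries more total negative slack, so under this model it is the group to work on first.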
| Technique | Delay reduction | Area cost | Power cost | When to use |
|---|---|---|---|---|
| Buffer removal on critical path | 20–50ps per buffer | Saves area | Saves power | When excessive buffering (for fanout/SI) is on the critical path |
| Cell upsizing | 30–100ps per cell | +10–30% | +10–30% | First response for setup violations on critical cells |
| LVT cell swap (SVT→LVT) | 50–150ps | Same | +300–1000% leakage | Last resort for critical cells when upsizing is insufficient |
| Useful skew | 20–80ps effective | CTS change | Negligible | When timing budget can be shifted from one path pair to another |
| Logic restructuring | Path-specific | Variable | Variable | When logic depth is the fundamental limit (re-synthesis) |
| Placement optimisation | 30–200ps (wire) | No change | No change | When wire delay dominates the critical path |
| Pipeline insertion | Full path split | +FF area | +FF power | When path fundamentally exceeds one clock cycle even after all other fixes |
Cell upsizing replaces a cell with a larger drive-strength version of the same function. A stronger driver has lower output resistance, which charges the output capacitance faster. This reduces both cell delay and output slew, which also speeds up subsequent cells in the path.
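A first-order RC model shows why upsizing helps and where the gain stops. This sketch assumes a switch-level model in which a size-`k` cell has drive resistance `R_unit / k` and input capacitance `k * C_unit`. All values are hypothetical. Stage delay is approximated as an intrinsic delay plus `0.69 * R * C_load`. Upsizing speeds up the cell's own stage, but its larger input capacitance slows the stage that drives it.

```python
LN2 = 0.69  # ~ln(2): the 50% point of an RC step response

R_UNIT_KOHM = 4.0    # hypothetical drive resistance of a size-1 cell
C_IN_UNIT_FF = 2.0   # hypothetical input capacitance of a size-1 cell
C_WIRE_FF = 30.0     # hypothetical wire + fanout load on the resized cell's output
R_PREV_KOHM = 1.0    # hypothetical drive resistance of the previous stage


def stage_delay_ps(r_drive_kohm: float, c_load_ff: float, intrinsic_ps: float = 10.0) -> float:
    # kOhm * fF = ps
    return intrinsic_ps + LN2 * r_drive_kohm * c_load_ff


def two_stage_delay_ps(size: int) -> float:
    """Previous stage (loaded by our input cap) + our stage (driving the wire load)."""
    prev = stage_delay_ps(R_PREV_KOHM, size * C_IN_UNIT_FF)
    ours = stage_delay_ps(R_UNIT_KOHM / size, C_WIRE_FF)
    return prev + ours


for size in (2, 4, 8, 16):
    print(f"X{size}: {two_stage_delay_ps(size):.1f} ps")
```

With these made-up numbers the total delay is lowest around X8. Going to X16 loads the previous stage enough to cancel the gain, which is why upsizing is best evaluated together with the neighbouring stages rather than one cell in isolation.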
```tcl
## Identify cells contributing most delay on the worst path
report_timing -nworst 1 -path_type full

## Target the cell with the highest Incr value on the critical path
## Example: AND2X2 has 0.09 ns delay; try AND2X8 (stronger)
size_cell [get_cells u_core/U_and1] AND2X8

## Verify improvement
update_timing
report_timing -from [get_cells u_core/ff_A] -to [get_cells u_core/ff_B] -nworst 1

## Automated sizing of the worst N paths
## (command name and options are tool-specific)
optimize_critical_paths -slack_threshold 0 -max_paths 100

## Check for hold side effects from upsizing
## Upsizing = faster cell = shorter min delay = more hold risk
## (-max_paths sets the total path count; -nworst is paths per endpoint)
report_timing -delay_type min -max_paths 10

## Size multiple cells at once (ECO approach): cell / target pairs
foreach {cell target} {u_core/U1 BUFX8 u_core/U2 OR2X8 u_core/U4 AND3X4} {
    set cur [get_attribute [get_cells $cell] ref_name]
    puts "Sizing $cell from $cur to $target"
    size_cell [get_cells $cell] $target
}
```
Low-Vt (LVT) cells are the same logic function as standard-Vt (SVT) but with a lower threshold voltage. Lower Vt means transistors turn on earlier, giving faster switching at the cost of dramatically higher subthreshold leakage.
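The size of the leakage penalty follows from the exponential dependence of subthreshold current on threshold voltage, roughly I_leak ∝ 10^(-Vt/S). Here S is the subthreshold swing: about 60 mV/decade in the ideal room-temperature limit and higher in real devices. The sketch below is a minimal illustration that assumes S = 85 mV/decade and a few hypothetical SVT-to-LVT Vt steps.

```python
def leakage_ratio(delta_vt_mv: float, swing_mv_per_decade: float = 85.0) -> float:
    """Factor by which subthreshold leakage grows when Vt drops by delta_vt_mv."""
    return 10 ** (delta_vt_mv / swing_mv_per_decade)


for dvt in (50, 70, 90):
    print(f"Vt lowered by {dvt} mV -> leakage x{leakage_ratio(dvt):.1f}")
```

A Vt step of around 70 mV lands in the 5 to 10× range quoted later in this section. The real factor depends on the process, corner, and temperature.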
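```tcl
## Identify cells on timing-critical paths for LVT swap
## Typical: swap cells contributing >50% of path delay

## List cells on the worst path: collect the pins from the path's timing points
set worst_path [get_timing_paths -delay_type max -nworst 1]
set path_pins {}
foreach_in_collection point [get_attribute $worst_path points] {
    set obj [get_attribute $point object]
    ## Skip top-level ports; only pins belong to cells
    if {[get_attribute $obj object_class] eq "pin"} {
        append_to_collection path_pins $obj
    }
}
set critical_cells [get_cells -of_objects $path_pins]

## Check current Vt of cells (collections need foreach_in_collection, not foreach)
foreach_in_collection cell $critical_cells {
    puts "[get_object_name $cell]: [get_attribute $cell ref_name]"
}

## Swap specific cells from SVT to LVT
## Library naming in this example: BUFX4 -> BUFX4_LVT (varies by library)
size_cell [get_cells u_core/U_crit] BUFX4_LVT
size_cell [get_cells u_core/U_and2] AND2X4_LVT

## After LVT swap, verify leakage impact
report_power -cell [get_cells {u_core/U_crit u_core/U_and2}]

## Check hold: LVT is faster, so hold slack decreases
report_timing -delay_type min -through [get_cells u_core/U_crit] -max_paths 5
```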
Paradoxically, removing buffers can improve setup timing. Buffers added for fanout control or signal integrity add delay to the path. If the fanout situation has changed (for example after placement optimisation), or the buffer was never really needed, removing it saves 20–50ps per stage.
Look for buffers in the timing report where: (1) the buffer’s fanout is 1 (only one receiver), or (2) the net capacitance after the buffer is small (the buffer wasn’t needed), or (3) the buffer was inserted during CTS but doesn’t carry a clock signal.
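The three criteria above can be expressed as a simple filter. This is a hypothetical sketch. The `Buffer` record, its fields, and the instance names stand in for data you would pull from the timing report or netlist database. The 5 fF threshold for a "small" load is an assumption, not a rule.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    name: str
    fanout: int
    downstream_cap_ff: float
    inserted_by_cts: bool
    drives_clock: bool


SMALL_LOAD_FF = 5.0  # assumed threshold for "the buffer wasn't needed"


def removal_reasons(b: Buffer) -> list[str]:
    """Return the criteria a buffer meets for removal (empty list = keep it)."""
    reasons = []
    if b.fanout == 1:
        reasons.append("fanout of 1")
    if b.downstream_cap_ff < SMALL_LOAD_FF:
        reasons.append("small downstream load")
    if b.inserted_by_cts and not b.drives_clock:
        reasons.append("CTS buffer on a non-clock net")
    return reasons


bufs = [
    Buffer("u_core/BUF_12", fanout=1, downstream_cap_ff=3.2, inserted_by_cts=False, drives_clock=False),
    Buffer("u_core/BUF_40", fanout=12, downstream_cap_ff=48.0, inserted_by_cts=False, drives_clock=False),
    Buffer("u_core/CTS_BUF_7", fanout=6, downstream_cap_ff=20.0, inserted_by_cts=True, drives_clock=False),
]
for b in bufs:
    print(b.name, removal_reasons(b) or "keep")
```

A flagged buffer is only a candidate. After removal, the path still has to be re-timed and checked against max-transition and max-capacitance limits.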
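When a path has too many logic levels (gates in series), reducing the gate count through restructuring is more powerful than any ECO fix. Common approaches include re-synthesising the critical logic cone with restructuring enabled, and retiming registers across it (compile_ultra -retime in Design Compiler).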
```tcl
## In Design Compiler: list the worst paths and inspect their logic depth
report_timing -path_type full -max_paths 10

## For a specific block, re-synthesise with restructuring enabled
compile_ultra -retime          ; # retime registers across paths
compile_ultra -incremental     ; # incremental optimisation only

## Target a specific path: give it its own path group, then re-optimise
group_path -name critical_adder \
    -from [get_cells u_alu/add_in_reg/*] \
    -to [get_cells u_alu/result_reg/*]
compile_ultra -incremental

## In Genus: raise optimisation effort, re-optimise, and re-check timing
set_db syn_opt_effort high
syn_opt
report_timing -max_paths 10    ; # check if restructuring helped
```
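The payoff from restructuring comes from logic depth. As a rough illustration, assume every two-input operator costs one level of delay. Summing N operands as a left-to-right chain then takes N-1 levels, while a balanced tree takes ceil(log2 N) levels.

```python
import math


def chain_depth(n_operands: int) -> int:
    """Levels for ((a + b) + c) + ...: one operator per extra operand, all in series."""
    return n_operands - 1


def tree_depth(n_operands: int) -> int:
    """Levels for a balanced tree of two-input operators."""
    return math.ceil(math.log2(n_operands))


for n in (4, 8, 16, 32):
    print(f"{n} operands: chain {chain_depth(n)} levels, tree {tree_depth(n)} levels")
```

Real arithmetic is more involved (carry chains, carry-save and parallel-prefix structures). The model only shows why depth, not per-cell speed, sets a floor that ECO fixes cannot get below.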
If a path fundamentally has too many logic levels, even after cell sizing, LVT swapping, and placement optimisation, the only remaining fix is to add a pipeline register that splits it into two stages. If the register is placed near the midpoint, each stage has roughly half the original logic depth.
The cost is one additional clock cycle of latency on that computation. This must be architecturally acceptable: downstream logic must be updated to account for the extra cycle. Retiming in synthesis can then move the new register to the position that best balances the two stages. It cannot add the extra cycle of latency by itself, so that change has to be made in the RTL.
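Where the register goes matters, because the achievable clock period is set by the slower of the two stages. The sketch below assumes a path described as a list of per-gate delays plus a fixed register overhead (clock-to-Q plus setup) charged to each stage. All values are hypothetical. It tries every cut point and keeps the one that minimises the worst stage, which is the kind of balancing retiming performs.

```python
def best_cut(gate_delays_ns: list[float], reg_overhead_ns: float) -> tuple[int, float]:
    """Return (cut_index, worst_stage_ns); the new register goes before gate[cut_index]."""
    best = (0, float("inf"))
    for cut in range(1, len(gate_delays_ns)):
        stage1 = sum(gate_delays_ns[:cut]) + reg_overhead_ns
        stage2 = sum(gate_delays_ns[cut:]) + reg_overhead_ns
        worst = max(stage1, stage2)
        if worst < best[1]:
            best = (cut, worst)
    return best


gates = [0.15, 0.20, 0.10, 0.25, 0.15, 0.20, 0.20]  # hypothetical, 1.25 ns in total
cut, worst = best_cut(gates, reg_overhead_ns=0.10)
print(f"register before gate {cut}, worst stage {worst:.3f} ns")
```

With these numbers the best cut leaves a 0.80 ns worst stage. Under this model that fits a 1.0 ns period, while the unsplit 1.25 ns path does not.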
```verilog
// Long path candidate for pipelining (setup time and skew ignored):
//   clock period 1.0 ns, path delay 1.250 ns -> slack -0.250 ns
//   -> split into 2 stages of ~0.625 ns each (plus the new register's clk-to-Q and setup)
// Before: result = A + B + C + D (long combinational chain)
// After: both partial sums are registered in the same cycle so the operands stay
// aligned; result arrives one cycle later than before (declarations omitted)
always @(posedge clk) begin
  sum_ab <= A + B;            // Stage 1
  sum_cd <= C + D;            // Stage 1
  result <= sum_ab + sum_cd;  // Stage 2
end
```

```tcl
## In synthesis (post-RTL fix): use retiming
set_optimize_registers true -design [get_designs ADDER]
compile_ultra -retime -sequential_area_recovery

## Verify after retiming
report_timing -max_paths 5 -group adder_path
report_register -level_sensitive ; # ensure no latches created
```
Useful skew intentionally introduces a clock arrival time difference between launch and capture flip-flops to benefit timing. If the capture FF’s clock arrives slightly later than the launch FF’s clock, the required time effectively increases, giving data more margin.
Useful skew is applied during CTS by adjusting clock buffer delays to individual clock sinks. The maximum useful skew is limited by the hold requirement on the same path: delaying the capture clock raises the hold requirement by the same amount, so too much skew creates a hold violation. It also reduces setup slack on the paths launched by the delayed flip-flop.
```tcl
## Pre-CTS (ideal clocks): model useful skew as extra latency on the capture FF clock pin
## Delay the capture clock by 100 ps to gain ~100 ps of setup margin.
## Pin latency overrides the clock's latency, so give the absolute value:
## nominal insertion delay 0.550 ns (other FFs in this domain) + 0.100 ns = 0.650 ns
set_clock_latency 0.650 [get_pins u_core/ff_capture/CK]

## In ICC2 CTS: give the capture sink a balance-point delay so the tool builds the skew
## (a negative -delay makes the clock arrive later at this sink; confirm the
##  sign convention in your tool's documentation)
set_clock_balance_points -balance_points [get_pins u_core/ff_capture/CK] -delay -0.100

## After CTS: verify skew and check hold on this path
report_clock_timing -type skew -clock clk_core -nworst 10
report_timing -delay_type min -to [get_pins u_core/ff_capture/D] -max_paths 10
```
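The limits on useful skew can be written as a feasibility window. As a first-order sketch, ignore OCV derates and let Δ be the extra delay added to the capture flip-flop's clock. On the path into that flip-flop, setup slack grows by Δ and hold slack shrinks by Δ. On the path it launches, setup slack shrinks by Δ. The slack values below are hypothetical.

```python
def useful_skew_window(in_setup_ns: float, in_hold_ns: float,
                       out_setup_ns: float) -> tuple[float, float]:
    """Return (needed, limit): the usable range of extra capture-clock delay, in ns.

    needed: delay required to clear the incoming setup violation (0 if none).
    limit:  largest delay before the incoming hold check or the outgoing
            setup check goes negative.
    """
    needed = max(0.0, -in_setup_ns)
    limit = min(in_hold_ns, out_setup_ns)
    return needed, limit


# Hypothetical slacks before skewing: the incoming setup check fails by 60 ps.
needed, limit = useful_skew_window(in_setup_ns=-0.060, in_hold_ns=0.120, out_setup_ns=0.090)
if needed <= limit:
    print(f"feasible: delay the capture clock by {needed:.3f} to {limit:.3f} ns")
else:
    print("infeasible: useful skew alone cannot close this path")
```

When `needed` exceeds `limit`, skew alone cannot close the path, and one of the other techniques in this section has to supply the rest.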
Every technique that reduces the data arrival time (faster data) or increases the required time (later capture clock) also reduces hold margin on the same path. After any setup fix, always run report_timing -delay_type min on the affected paths. Cell upsizing, LVT swap, and useful skew are the three most common inadvertent creators of hold violations during setup closure.
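A minimal single-path model makes this coupling concrete. It assumes no OCV derates, and a hypothetical fix that upsizes a cell lying on both the longest and the shortest data path and adds 30 ps of useful skew. Setup is fixed, but hold slack falls from 0.110 ns to 0.040 ns.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PathTiming:
    period_ns: float
    clk_to_q_ns: float
    d_max_ns: float       # longest data delay (used by the setup check)
    d_min_ns: float       # shortest data delay (used by the hold check)
    setup_ns: float
    hold_ns: float
    skew_ns: float = 0.0  # capture clock arrival minus launch clock arrival

    def setup_slack(self) -> float:
        required = self.period_ns + self.skew_ns - self.setup_ns
        return required - (self.clk_to_q_ns + self.d_max_ns)

    def hold_slack(self) -> float:
        required = self.skew_ns + self.hold_ns
        return (self.clk_to_q_ns + self.d_min_ns) - required


before = PathTiming(period_ns=1.0, clk_to_q_ns=0.08, d_max_ns=0.95, d_min_ns=0.06,
                    setup_ns=0.04, hold_ns=0.03)
# Hypothetical fix: upsizing removes 50 ps from the max path and 40 ps from the
# min path, plus 30 ps of useful skew on the capture clock.
after = replace(before, d_max_ns=before.d_max_ns - 0.05,
                d_min_ns=before.d_min_ns - 0.04, skew_ns=0.03)

for label, p in (("before", before), ("after", after)):
    print(f"{label}: setup {p.setup_slack():+.3f} ns, hold {p.hold_slack():+.3f} ns")
```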
After gate-level fixes, placement and routing changes can also help, particularly when wire delay dominates the critical path.
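Placement helps with wire-dominated paths because an unbuffered wire's own RC delay grows with the square of its length. The Elmore delay of a distributed RC line is 0.5·r·c·L², where r and c are the resistance and capacitance per unit length. The sketch below uses hypothetical per-micron values.

```python
R_PER_UM_OHM = 0.5   # hypothetical wire resistance per micron
C_PER_UM_FF = 0.2    # hypothetical wire capacitance per micron


def wire_delay_ps(length_um: float) -> float:
    """Elmore delay of a distributed RC wire: 0.5 * r * c * L^2.

    ohm * fF = 1e-15 s = 1e-3 ps, hence the final scaling factor.
    """
    return 0.5 * R_PER_UM_OHM * C_PER_UM_FF * length_um ** 2 * 1e-3


for length in (200, 400, 800):
    print(f"{length} um: {wire_delay_ps(length):.1f} ps")
```

Under this model, halving a wire's length cuts its intrinsic delay by 4×. The driver's resistance adds a further term that is only linear in length. Pulling the cells of a critical path closer together therefore pays off quickly when wires dominate.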
The most effective immediate step is cell upsizing on the critical path. A stronger driver reduces gate delay and output slew, and the sharper slew also speeds up the downstream cells it drives. It doesn't change logic, doesn't affect functionality, and is fast to implement in an ECO flow. Follow it by checking hold on the same path, because upsizing speeds up the cell and can reduce hold margin.
LVT (Low Threshold Voltage) cells switch faster than SVT cells because transistors turn on at a lower gate voltage. Swapping a critical path cell from SVT to LVT can reduce its delay by 15–30%. The trade-off is 5–10× higher leakage current on that cell, increasing static power. LVT should be used only on the most critical cells where other techniques have been exhausted.
Pipeline insertion is used when a path is fundamentally too long: even after upsizing, LVT swapping, and placement optimisation, the path delay exceeds one clock period. Inserting a flip-flop in the middle of the path splits it into two shorter stages. Each stage must still fit within a full clock period, but now carries only about half the logic. The cost is one additional clock cycle of latency, which must be architecturally acceptable.
Useful skew delays the capture flip-flop’s clock arrival relative to the launch flip-flop, effectively increasing the required time for that path. By arriving later, the capture window opens later, giving the data more time to propagate. The CTS tool can implement this by adjusting local buffer delays. The maximum amount of useful skew is bounded by the hold margin on the same path.
A multicycle path exception is appropriate when the path is architecturally guaranteed to have multiple clock cycles before data is used. The justification must be explicit: which handshake, enable signal, or protocol ensures the multi-cycle property? Never apply exceptions to avoid fixing real single-cycle violations — that causes silicon failures. Document every exception with the architectural proof of why it is safe.
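The mechanics of a multicycle exception are a common source of mistakes. With a setup multiplier N, the setup check moves from the first capture edge to the N-th. By default the hold check moves with it, to one period before the new setup edge, which is usually not intended. A hold multiplier of N-1 is therefore normally added to bring the hold check back to the launch edge; in SDC this is `set_multicycle_path N -setup` followed by `set_multicycle_path N-1 -hold`. The sketch below computes the resulting check edges for a single clock, assuming zero skew.

```python
def check_edges_ns(period_ns: float, setup_mult: int, hold_mult: int = 0) -> tuple[float, float]:
    """Capture-edge times (relative to the launch edge at 0) for the setup and hold checks.

    Single-clock SDC convention: the setup edge is setup_mult * T; the default
    hold edge is one period before the setup edge, and a hold multiplier moves
    it earlier by hold_mult periods.
    """
    setup_edge = setup_mult * period_ns
    hold_edge = (setup_mult - 1 - hold_mult) * period_ns
    return setup_edge, hold_edge


T = 1.0
print(check_edges_ns(T, 1))     # single-cycle default
print(check_edges_ns(T, 3))     # -setup 3 only: hold check drifts to 2T
print(check_edges_ns(T, 3, 2))  # -setup 3 plus -hold 2: hold back at the launch edge
```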