STA — Practical

Timing Closure —
Fix Setup & Hold Violations

EcrioniX · STA Series· ~22 min read· Real-world techniques
[Figure: Setup violation, data arrives too late. Timing diagram of CLK (launch and capture edges) and DATA, showing data arriving after the required time, with slack = −30ps.]

What Is Timing Closure?

Timing closure is the iterative process of eliminating all setup and hold violations in a chip design until every path meets its timing constraint at every signoff corner (process, voltage, temperature). It sits at the intersection of RTL, synthesis, and physical design — and is one of the most time-consuming phases of a real chip tape-out.

The STA tool (PrimeTime, Tempus, etc.) reports slack for every path: zero or positive slack means the path meets timing, and negative slack means a violation.

Understanding the Slack Equations

Timing Equations
== SETUP CHECK ==
Data Arrival Time    = launch_clk_edge + Tclk2q + Tcomb + Tnet
Data Required Time   = capture_clk_edge + Tclk_skew - Tsetup

Setup Slack = Required − Arrival     (must be ≥ 0)

== HOLD CHECK ==
Data Arrival Time    = launch_clk_edge + Tclk2q_min + Tcomb_min + Tnet_min
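Data Required Time   = capture_clk_edge + Tclk_skew + Thold
                       (capture_clk_edge is the same edge that launched the data;
                        hold is worst when the capture clock arrives late, so use the largest skew)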

Hold Slack = Arrival − Required      (must be ≥ 0)

== EXAMPLE (setup violation) ==
Launch edge:   0ns
Tclk2q:        0.15ns
Tcomb:         1.80ns   ← too long — this is the bottleneck
Tnet:          0.12ns
Arrival:       2.07ns

Capture edge:  2.00ns   (500 MHz = 2ns period)
Tsetup:        0.05ns
Required:      1.95ns

Setup Slack = 1.95 − 2.07 = −0.12ns (120ps violation)
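
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not tool output: the function names `setup_slack_ps` and `hold_slack_ps` are made up for illustration, all values are integer picoseconds to avoid floating-point noise, and the hold numbers in the last call are assumed rather than taken from the example.

```python
def setup_slack_ps(launch_edge, t_clk2q, t_comb, t_net,
                   capture_edge, t_skew, t_setup):
    """Setup slack = required - arrival, all values in picoseconds."""
    arrival = launch_edge + t_clk2q + t_comb + t_net
    required = capture_edge + t_skew - t_setup
    return required - arrival


def hold_slack_ps(launch_edge, t_clk2q_min, t_comb_min, t_net_min,
                  capture_edge, t_skew, t_hold):
    """Hold slack = arrival - required, using minimum data delays."""
    arrival = launch_edge + t_clk2q_min + t_comb_min + t_net_min
    required = capture_edge + t_skew + t_hold
    return arrival - required


# The worked example: 500 MHz clock (2000 ps period), zero skew.
print(setup_slack_ps(0, 150, 1800, 120,
                     capture_edge=2000, t_skew=0, t_setup=50))   # -120

# Hypothetical short path: hold is checked against the edge that launched the data.
print(hold_slack_ps(0, 100, 20, 5,
                    capture_edge=0, t_skew=60, t_hold=30))       # 35
```

The first call reproduces the 120ps setup violation. The second shows a short path that still passes hold with 60ps of skew toward the capture flop.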

Fixing Setup Violations — Techniques in Order

① Logic Restructuring (Best — Zero Area Cost)

Reduce the number of logic levels in the critical path. Restructure Boolean equations, share subexpressions, or pipeline the path. This is the most efficient fix but requires RTL/synthesis changes.

Verilog — Reduce logic levels
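// BEFORE: one nested expression, up to 4 levels if mapped to 2-input gates
assign out = ((a & b & c) | (d & e & f)) & (g | h);

// AFTER: same function with the three independent terms named, so each
// maps to one cell and all three evaluate in parallel
// (3 levels with 3-input cells: term, OR, AND)
wire abc = a & b & c;
wire def = d & e & f;
wire gh  = g | h;
// Synthesis tools often find this structure on their own
assign out = (abc | def) & gh;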
② Gate Sizing — Upsize Cells on Critical Path

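Replace a cell on the critical path with a higher-drive-strength version. A stronger driver charges its load capacitance faster, which reduces the cell's propagation delay (Tpd), at the cost of more area, more power, and a larger input capacitance for the previous stage to drive. Synthesis and place-and-route tools do this automatically, but it can also be applied manually via ECO.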
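
The trade-off behind upsizing can be sketched with a simple linear RC model. Everything here is an assumption for illustration: the function name `path_delay_ps`, the two-stage model, and the resistance and capacitance values are hypothetical and do not come from any cell library.

```python
def path_delay_ps(size, r_prev_kohm=2.0, cin_unit_ff=2.0,
                  r_unit_kohm=4.0, c_load_ff=40.0):
    """Delay of a two-stage path where the second cell is scaled by `size`.

    kOhm * fF = ps. Upsizing divides the cell's drive resistance by `size`
    but multiplies its input capacitance (the previous stage's load) by `size`.
    """
    prev_stage = r_prev_kohm * (cin_unit_ff * size)   # previous cell driving this one
    this_stage = (r_unit_kohm / size) * c_load_ff     # this cell driving the net
    return prev_stage + this_stage


for size in (1, 2, 4, 8, 16):
    print(f"X{size}: {path_delay_ps(size):.0f} ps")
# X1: 164 ps, X2: 88 ps, X4: 56 ps, X8: 52 ps, X16: 74 ps
```

Under these assumed numbers the gain flattens out around X8 and reverses at X16, because the extra input capacitance slows the previous stage more than the stronger drive helps.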

③ Buffer / Inverter Insertion

Long nets have high capacitance that slows all gates driving them. Insert repeater buffers at midpoints to break the RC delay. On clock paths this is done by CTS; on data paths it's done during routing optimization.
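
Why repeaters help can be seen from a rough Elmore-style estimate, where the delay of a distributed RC wire grows with the square of its length. The sketch below is an assumption-laden illustration: `wire_delay_ps` is a made-up name, the per-micron resistance and capacitance and the buffer delay are hypothetical, and buffer input capacitance and drive resistance are ignored.

```python
def wire_delay_ps(length_um, segments=1, r_ohm_per_um=2.0,
                  c_ff_per_um=0.2, t_buf_ps=30.0):
    """Elmore-style delay of a wire split into equal segments by repeaters.

    Each segment contributes 0.5 * r * c * l^2 (ohm * fF = fs, hence / 1000).
    Each of the (segments - 1) repeaters adds a fixed t_buf_ps.
    """
    seg_len = length_um / segments
    seg_delay_ps = 0.5 * r_ohm_per_um * c_ff_per_um * seg_len ** 2 / 1000.0
    return segments * seg_delay_ps + (segments - 1) * t_buf_ps


for n in (1, 2, 4, 8):
    print(f"{n} segment(s): {wire_delay_ps(2000, n):.0f} ps")
# 1 segment(s): 800 ps, 2: 430 ps, 4: 290 ps, 8: 310 ps
```

With these assumed values, a 2000 µm net drops from 800ps to 290ps with three repeaters, and adding more starts to cost delay again because each buffer has its own delay.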

④ Retiming — Move Registers Across Logic

Push registers forward (toward the output) or backward (toward the input) across combinational logic to balance path delays between pipeline stages. Retiming preserves the functional behavior while redistributing delay.

Verilog — Pipeline to fix setup
// BEFORE: 8-level combinational path — critical
always @(posedge clk)
  result <= complex_8_level_logic(a, b, c);

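// AFTER: Split into 2 pipeline stages, each 4 levels
// (this adds a register stage; pure retiming only moves existing registers)
reg [7:0] mid_stage;   // widths are illustrative
reg       c_d;         // c delayed one cycle so it stays aligned with mid_stage
always @(posedge clk) begin
  mid_stage <= first_4_levels(a, b);
  c_d       <= c;
end
always @(posedge clk) result <= last_4_levels(mid_stage, c_d);
// Latency increases by 1 cycle, throughput unchanged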
⑤ Multi-Cycle Path (MCP) Exception — If Data Rate is Slower

If the data only needs to be valid every N clock cycles (e.g., a divide-by-2 path), tell the STA tool to relax the constraint. This is not a physical fix — it's a specification correction.

SDC — Multi-cycle path
# Data valid every 2 clocks — relax setup by 1 extra period
set_multicycle_path -setup 2 -from [get_cells u_div/q_reg] \
                              -to   [get_cells u_proc/data_reg]

# Also relax hold to avoid over-insertion of hold buffers
set_multicycle_path -hold 1  -from [get_cells u_div/q_reg] \
                              -to   [get_cells u_proc/data_reg]
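
The effect of the two commands on which clock edges are checked can be sketched as follows. This is an illustration under stated assumptions: launch and capture flops share one clock, the launch edge is at time 0, and `mcp_check_edges_ps` is a hypothetical helper, not an STA tool command.

```python
def mcp_check_edges_ps(period_ps, setup_mult=1, hold_mult=0):
    """Return (setup_capture_edge, hold_check_edge) for a launch at time 0.

    The setup check uses the capture edge setup_mult periods after launch.
    The hold check defaults to one edge before that capture edge and is
    moved back by hold_mult periods.
    """
    setup_edge = setup_mult * period_ps
    hold_edge = (setup_mult - 1 - hold_mult) * period_ps
    return setup_edge, hold_edge


print(mcp_check_edges_ps(2000))         # (2000, 0)     default single-cycle path
print(mcp_check_edges_ps(2000, 2, 0))   # (4000, 2000)  -setup 2 only
print(mcp_check_edges_ps(2000, 2, 1))   # (4000, 0)     -setup 2 with -hold 1
```

With `-setup 2` alone, the hold check moves to the edge one full period after launch, so the data would have to be held for an extra cycle and the tool would pad the path with buffers. Adding `-hold 1` returns the hold check to the launch edge.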

Fixing Hold Violations

⚠️
Critical insight: Hold violations are fixed only by adding delay. You NEVER fix hold by making a path faster (that makes it worse); you add delay on the short path. Hold checks also do not depend on clock frequency, so slowing the clock cannot rescue a part that fails hold. They are usually worst at the fastest process/voltage corner and must pass at every signoff corner.
① Buffer / Delay Cell Insertion on Short Path

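Insert delay buffers (or dedicated delay cells; X_DELAY here stands for whatever name the library uses) on the short data path so that new data reaches the capture flop no earlier than Thold after the capture clock edge. Placing them close to the capture endpoint limits the impact on setup-critical paths that share the same launch flop. P&R tools do this automatically in hold-fixing mode.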
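
How many buffers a path needs, and what they cost in setup margin, is simple arithmetic. The sketch below assumes a single buffer type whose minimum (fast-corner) delay counts toward hold and whose maximum (slow-corner) delay counts against setup. The function names and all delay values are hypothetical.

```python
import math


def hold_buffers_needed(hold_slack_ps, buf_min_delay_ps):
    """Smallest number of delay buffers that brings hold slack to >= 0."""
    if hold_slack_ps >= 0:
        return 0
    return math.ceil(-hold_slack_ps / buf_min_delay_ps)


def fix_hold(hold_slack_ps, setup_slack_ps, buf_min_delay_ps, buf_max_delay_ps):
    """Return (buffers, new_hold_slack, new_setup_slack) in picoseconds."""
    n = hold_buffers_needed(hold_slack_ps, buf_min_delay_ps)
    new_hold = hold_slack_ps + n * buf_min_delay_ps     # fast corner: buffers help
    new_setup = setup_slack_ps - n * buf_max_delay_ps   # slow corner: buffers hurt
    return n, new_hold, new_setup


# Hypothetical path: 45 ps hold violation, 300 ps of setup margin,
# buffer delay 20 ps at the fast corner and 55 ps at the slow corner.
print(fix_hold(-45, 300, 20, 55))   # (3, 15, 135)
```

The same three buffers that add 60ps at the fast corner take 165ps out of the setup margin at the slow corner, which is why setup must be re-checked after every hold fix.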

② Avoid Logic that Creates Extremely Short Paths

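In RTL, be wary of direct register-to-register connections with no combinational logic between them on the same clock edge (shift registers are the classic case). Only Tclk2q separates launch from capture on these short paths, so modest clock skew toward the capture flop is enough to violate hold, and the tool will have to pad them with hold buffers.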

③ Use set_false_path or set_multicycle_path for CDC Paths

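Asynchronous clock domain crossing paths should be marked as false paths in SDC once a synchronizer or handshake handles the crossing. The STA tool cannot perform meaningful setup or hold analysis between unrelated clocks, and leaving these paths constrained forces unnecessary hold buffer insertion.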

SDC — Hold fixes
# Mark CDC path as false — no hold check across async domains
set_false_path -from [get_clocks clk_a] -to [get_clocks clk_b]

# Set min delay to avoid hold violation (add 0.1ns minimum delay)
set_min_delay 0.1 -from [get_cells ff_launch] -to [get_cells ff_capture]

# Check hold at fast corner (best-case timing = worst hold)
read_sdc design.sdc
set_operating_conditions -min ff_0p95v_-40c -max ss_0p85v_125c

OCV and CPPR

Concept | What It Does | Effect on Slack
OCV (On-Chip Variation) | Applies derate: launch path slower, capture path faster (setup) or vice versa (hold) | Reduces slack by adding pessimism
AOCV (Advanced OCV) | Distance- and depth-aware derate: cells far apart have more variation | More accurate than flat OCV
POCV (Parametric OCV) | Statistical approach using Gaussian delay distributions | Reduces unnecessary pessimism
CPPR (Clock Path Pessimism Removal) | Removes double-derate on shared clock buffers (launch and capture share the same buffer tree up to the fork point) | Recovers pessimistic slack
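
A small numeric sketch shows how flat OCV derating removes slack and how CPPR gives part of it back. The model is deliberately simplified and every number is assumed: one common clock segment, one launch branch, one capture branch, and flat late/early derates of 1.05 and 0.95. The function name `setup_slack_ocv_ps` is hypothetical.

```python
def setup_slack_ocv_ps(period, common_clk, launch_branch, capture_branch,
                       data_delay, t_setup, late=1.05, early=0.95, cppr=True):
    """Setup slack with flat OCV derates; all delays in picoseconds.

    Setup analysis takes the launch clock and data path late and the
    capture clock early. The common clock segment is physically the same
    cells for both, so CPPR credits back the difference applied to it.
    """
    arrival = (common_clk + launch_branch + data_delay) * late
    required = period + (common_clk + capture_branch) * early - t_setup
    slack = required - arrival
    if cppr:
        slack += common_clk * (late - early)   # pessimism on the shared segment
    return slack


args = dict(period=2000, common_clk=400, launch_branch=100,
            capture_branch=100, data_delay=1830, t_setup=50)
print(f"{setup_slack_ocv_ps(**args, cppr=False):.1f}")   # -21.5
print(f"{setup_slack_ocv_ps(**args, cppr=True):.1f}")    # 18.5
```

In this example the path fails by 21.5ps under flat OCV alone and passes by 18.5ps once the 40ps of pessimism on the shared 400ps clock segment is removed.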

Timing Closure Workflow

Timing Closure — Iteration Flow
1. Run STA at worst-case corner (ss_0p85v_125c)
   report_timing -slack_lesser_than 0 -max_paths 100

2. Sort violations by worst negative slack (WNS) and total negative slack (TNS)
   → WNS = worst single path slack
   → TNS = sum of all negative slacks (indicates volume of work)

3. Group violations by clock domain and module
   → Most violations in one module? → RTL restructure
   → Spread across chip? → Placement/routing issue

4. Apply fixes (priority order):
   a. SDC corrections (false paths, MCPs wrongly analyzed)
   b. RTL restructuring (pipelining, logic rebalancing)
   c. Synthesis constraint tightening (over-constrain the clock by about 0.1ns of margin)
   d. Physical: re-floorplan, re-place critical cells
   e. ECO: manual gate sizing, buffer insertion

5. Re-run STA after each fix — check for introduced hold violations

6. Sign off at all required corners:
   Setup: ss_0p85v_125c  (slow-slow, hot, low voltage)
   Hold:  ff_0p95v_-40c  (fast-fast, cold, high voltage)
   Leakage: tt_0p9v_25c  (typical)
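
Steps 2 and 3 of the flow above can be sketched in a few lines. The input format is an assumption: a list of (endpoint, slack in ps) pairs, as might be parsed from a timing report, with the top-level instance name taken as the module. The endpoint names and slack values are invented for illustration.

```python
from collections import defaultdict


def summarize(paths):
    """paths: iterable of (endpoint_name, slack_ps).

    Returns (wns, tns, violations_per_module). WNS is the most negative
    slack (0 if nothing violates); TNS is the sum of all negative slacks.
    """
    violations = [(ep, s) for ep, s in paths if s < 0]
    wns = min((s for _, s in violations), default=0)
    tns = sum(s for _, s in violations)
    per_module = defaultdict(int)
    for ep, _ in violations:
        per_module[ep.split("/")[0]] += 1   # group by top-level instance
    return wns, tns, dict(per_module)


paths = [
    ("u_alu/res_reg[3]", -120),
    ("u_alu/res_reg[7]", -85),
    ("u_dec/op_reg", 40),
    ("u_lsu/addr_reg[1]", -10),
]
print(summarize(paths))   # (-120, -215, {'u_alu': 2, 'u_lsu': 1})
```

Here most of the negative slack sits in one instance, which by the rule in step 3 points toward an RTL restructure of that module rather than a chip-wide placement problem.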

Common Timing Closure Mistakes

Mistake | Consequence | Fix
Fixing setup by adding buffers to the clock | Moves the clock edge: helps setup but creates hold violations elsewhere | Size the data path instead
Setting false_path on a real path | Masks a real violation, so silicon will fail | Verify the path is truly asynchronous before marking it false
Missing hold fix at fast corner | Design fails at -40°C or with fast process lots | Always run hold analysis at ff_0p95v_-40c
Applying OCV without CPPR | Over-pessimistic slack and unnecessary over-engineering | Enable CPPR in PrimeTime: set_app_var timing_remove_clock_reconvergence_pessimism true
Ignoring clock domain crossings in SDC | STA tries to analyze unconstrained CDC paths and reports false violations | Explicitly set_false_path on all async CDC paths