STA — Path Analysis

Timing Paths &
Critical Path Analysis

Every timing violation in VLSI is a story about a specific path — a chain of logic gates between two registers that takes too long. Understanding how STA classifies, traces, and reports these paths is essential for reading timing reports, identifying bottlenecks, and applying the right fix to the right place.


The Four Types of Timing Paths

STA classifies every timing path by its start point and end point. Every path in the design falls into one of four categories. Understanding which type you are looking at tells you immediately what constraints apply and what the fix space is.

Register → Register

From a flip-flop's clock pin (launch) through its Q output, through combinational logic, to the next flip-flop's D input (capture). The most common path type — all internal datapath timing is R2R. Constrained by create_clock period.

Input → Register

From a primary input port through combinational logic to a flip-flop's D input. Constrained by set_input_delay. The external logic that drives the input port consumes part of the clock period before the signal reaches the chip.

Register → Output

From a flip-flop's Q output through combinational logic to a primary output port. Constrained by set_output_delay. The downstream chip receiving the output needs the data to arrive before its own setup deadline.

Input → Output

A purely combinational path with no sequential element — from input port directly to output port. Constrained by both set_input_delay and set_output_delay together. The available budget is the clock period minus both delays.

Path type   Launch point   Capture point   Constraint         Limits frequency?
R2R         FF clock pin   FF D pin        create_clock       Yes — limits fmax
I2R         Input port     FF D pin        set_input_delay    Yes
R2O         FF Q pin       Output port     set_output_delay   Yes
I2O         Input port     Output port     Both I/O delays    Yes
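As an illustrative sketch (not tool code), the four categories above reduce to a lookup on the kind of start point and end point; the function name and labels here are hypothetical:

```python
def classify_path(startpoint: str, endpoint: str) -> str:
    """Classify a timing path by its start- and end-point kinds.

    startpoint: "ff_clk" (FF clock pin) or "input_port"
    endpoint:   "ff_d"   (FF D pin)     or "output_port"
    """
    table = {
        ("ff_clk", "ff_d"): "R2R",             # constrained by create_clock
        ("input_port", "ff_d"): "I2R",         # constrained by set_input_delay
        ("ff_clk", "output_port"): "R2O",      # constrained by set_output_delay
        ("input_port", "output_port"): "I2O",  # constrained by both I/O delays
    }
    return table[(startpoint, endpoint)]

print(classify_path("ff_clk", "ff_d"))  # → R2R
```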

How STA Builds the Timing Graph

STA works on a timing graph — a directed acyclic graph (DAG) derived from the gate-level netlist. Every logic gate becomes a node, every net becomes a directed edge with propagation delay, and every flip-flop acts as both a sink (at its D input) and a source (at its Q output), cutting the graph into register-bounded segments.

  Timing Graph Construction:

  1. Parse gate-level netlist (after synthesis or P&R)
     └── Every cell → node (with delay from .lib timing arcs)
     └── Every net → edge (wire delay from parasitic extraction)

  2. Annotate delays
     └── Cell delays: from timing library at the specified PVT corner
     └── Wire delays: from SPEF/SPICE parasitic extraction (post-layout)
     └── Pre-layout: wire load models (estimated)

  3. Identify start/end points
     └── Start points: FF clock pins, primary input ports
     └── End points:   FF D pins, primary output ports

  4. Enumerate all paths (conceptually)
     └── In practice: forward/backward traversal to compute
         arrival time and required time at each node

  5. Compute slack at every end point
     └── Setup slack = required time − arrival time
     └── Report worst slack paths (top-N violators)

Why STA is fast: STA does not simulate — it does not apply test vectors or propagate logic values. It computes timing mathematically on the graph using static delay values from the library. This means it analyzes every possible path simultaneously in minutes, while simulation would take weeks for the same coverage.

Arrival Time, Required Time, and Slack

Every node in the timing graph has two key numbers: arrival time (when does the signal actually get here?) and required time (when does it need to be here?). The difference between them is slack.

Arrival time at node N = max(arrival at all inputs) + cell delay(N)
Required time at node N = required time at output − cell delay(N)
Slack at end point = Required time − Arrival time
Critical path = path with the smallest (most negative) slack

Forward propagation (arrival time)

STA propagates arrival times forward from all start points through the graph. At each gate, the arrival time = max(arrivals at all inputs) + gate delay. This max selects the latest-arriving input, which is the constraining input for that gate.

Backward propagation (required time)

STA propagates required times backward from all end points. At each gate, required time at input = required time at output − gate delay. This identifies how late a signal can be at any point and still meet the downstream deadline.

Setup slack (full equation):
 Slack = (T_capture_clk + T_period − T_su) − (T_launch_clk + T_cq + T_comb)
fmax = 1 / (T_cq + T_comb_critical + T_su − T_skew)

Critical path ≠ longest wire. The critical path is the path with the worst (least) timing slack, not necessarily the physically longest wire. A short wire through many slow high-fanout gates can be more critical than a long wire through fast buffers. Logic depth and gate drive strength matter more than wire length alone.
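The forward/backward propagation described above can be sketched on a toy graph. All gate names and delays here are hypothetical, clocks are ideal, and wire delay is folded into each receiving gate's delay for simplicity:

```python
from collections import defaultdict

# gate -> its delay in ns; "SRC" models the launch FF's Q pin (clock-to-Q),
# "END" models the wire into the capture FF's D pin
delays = {"SRC": 0.18, "and0": 0.11, "xor1": 0.13, "add": 0.37, "END": 0.03}
fanin  = {"and0": ["SRC"], "xor1": ["and0"], "add": ["xor1"], "END": ["add"]}
order  = ["SRC", "and0", "xor1", "add", "END"]   # topological order

fanout = defaultdict(list)
for n, fins in fanin.items():
    for f in fins:
        fanout[f].append(n)

# Forward pass: arrival = max over fanin arrivals + own delay
arrival = {}
for n in order:
    arrival[n] = max((arrival[f] for f in fanin.get(n, [])), default=0.0) + delays[n]

# Backward pass: required at an end point = clock period − library setup time;
# elsewhere, required = min over fanouts of (fanout's required − fanout's delay)
T_period, T_setup = 1.0, 0.045
required = {}
for n in reversed(order):
    if not fanout[n]:
        required[n] = T_period - T_setup
    else:
        required[n] = min(required[m] - delays[m] for m in fanout[n])

slack = {n: required[n] - arrival[n] for n in order}
print({n: round(slack[n], 3) for n in order})  # slack is a uniform 0.135 ns along this single path
```

Because this toy graph is a single chain, every node carries the same slack; with reconvergent fanout, the min/max operators make slack differ per node, and the chain of worst-slack nodes is the critical path.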

Reading a PrimeTime Timing Report

Every STA engineer spends significant time reading timing path reports from tools like Synopsys PrimeTime or Cadence Tempus. The report structure is standardized and once you know how to read it, you can instantly locate the bottleneck cell.

  ============================================================
  Timing Path Report: Setup Check
  Path Group: CLK
  Path Type:  max (setup)
  ============================================================

  Point                        Incr    Path
  ─────────────────────────────────────────────────────────
  clock CLK (rise edge)        0.000   0.000
  clock network delay (ideal)  0.000   0.000
  u_pipe_A/clk (DFF_X1)        0.000   0.000 r   ← launch FF clock pin

  u_pipe_A/Q (DFF_X1)          0.180   0.180 f   ← clock-to-Q delay
  u_and0/A (AND2_X1)           0.042   0.222 f   ← wire + gate delay
  u_and0/Z (AND2_X1)           0.065   0.287 r
  u_xor1/A (XOR2_X2)           0.038   0.325 r
  u_xor1/Z (XOR2_X2)           0.092   0.417 f   ← gate delay
  u_add/A[3] (ADDER_X1)        0.031   0.448 f
  u_add/SUM[3] (ADDER_X1)      0.340   0.788 r   ← adder is slow!
  u_reg_B/D (DFF_X1)           0.025   0.813 r   ← wire to capture FF

  data arrival time                     0.813     ← total path delay

  ─────────────────────────────────────────────────────────
  clock CLK (rise edge)        1.000   1.000
  clock network delay (ideal)  0.000   1.000
  u_reg_B/clk (DFF_X1)        0.000   1.000 r   ← capture FF clock
  library setup time          -0.045   0.955     ← setup time subtracted

  data required time                    0.955

  ─────────────────────────────────────────────────────────
  data required time                    0.955
  data arrival time                    -0.813
  ─────────────────────────────────────────────────────────
  slack (MET)                          +0.142    ← positive = pass

Reading the "Incr" column

The "Incr" column shows the incremental delay at each step — wire delay + cell delay. The largest single increment is the bottleneck gate. In the example above, u_add/SUM[3] adds 0.340 ns — the adder is the critical gate on this path.

r / f annotations

"r" = rising transition, "f" = falling transition. Cell delays are different for rising and falling edges (asymmetric PMOS/NMOS drive). STA reports the worst-case transition. XOR and adder paths often have long chains of inversion that create alternating r/f transitions.

Locating the bottleneck: Sort the "Incr" column mentally. The cell with the largest single increment is the bottleneck. Typical suspects: wide adders, multipliers, long mux chains, high-fanout nets with heavy load, and cells driving large wire capacitances. Fix that cell first — upsizing its drive strength or breaking the path with pipelining has the biggest impact.
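The same "sort the Incr column" idea can be sketched in code, using (point, increment) pairs transcribed from the example report above; the report parsing itself is assumed already done:

```python
# (point, incr_ns) pairs from the example timing report's data path
path = [
    ("u_pipe_A/Q",   0.180),  # clock-to-Q
    ("u_and0/A",     0.042),
    ("u_and0/Z",     0.065),
    ("u_xor1/A",     0.038),
    ("u_xor1/Z",     0.092),
    ("u_add/A[3]",   0.031),
    ("u_add/SUM[3]", 0.340),  # largest single increment: the bottleneck
    ("u_reg_B/D",    0.025),
]

bottleneck = max(path, key=lambda p: p[1])      # cell with the largest Incr
total = sum(incr for _, incr in path)           # should match "data arrival time"
print(bottleneck[0], round(total, 3))           # → u_add/SUM[3] 0.813
```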

Logic Depth and Gate-Level Analysis

Logic depth is the number of logic gate levels on the combinational path between two registers. Each gate level adds propagation delay. A path with 20 gate levels at 50 ps average delay contributes 1 ns of combinational delay — if your clock period is 1 ns, there is zero budget left for setup time and clock-to-Q.

Logic function             Typical gate levels   Typical delay (28nm)   Optimization strategy
2:1 Mux                    1–2                   80–120 ps              Synthesis restructuring
8-bit adder (RCA)          16–18                 800–1000 ps            Replace with CLA/carry-select
8-bit adder (CLA)          6–8                   300–400 ps             Gate sizing, VT swap
16-bit comparator          8–10                  400–500 ps             Tree structure
32-bit multiplier          20–30                 1.5–3 ns               Pipeline, Booth encoding
Priority encoder (16-bit)  4–6                   200–300 ps             Restructure OR tree

# PrimeTime: report the ten worst setup paths, sorted by slack
report_timing -max_paths 10 -nworst 1 \
              -path_type full_clock \
              -delay_type max \
              -sort_by slack

# Show every pin on the path (counting cell output pins gives the logic depth)
report_timing -max_paths 5 -input_pins -nets -transition_time \
              -capacitance -crosstalk_delta

# Sweep near-critical paths too (slack below 0.5 ns), not just the single worst
report_timing -max_paths 20 -group_count 5 \
              -slack_lesser_than 0.5

The number of logic levels is reported as "data path / logic levels" in PrimeTime and Tempus. A path with 25+ logic levels at advanced nodes almost always needs pipelining — no amount of gate sizing will fix a 25-level path if the target clock period is 1 ns.
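The budget arithmetic above can be sketched as a quick feasibility check. The clock-to-Q and setup overheads used here (80 ps and 45 ps) are hypothetical placeholders, not library values:

```python
def fits(levels, avg_gate_delay_ps, period_ps, t_cq_ps=80, t_su_ps=45):
    """Does a path of `levels` gate levels fit in the clock period?

    The combinational budget is the period minus clock-to-Q and setup
    overhead (assumed values; ideal clocks, no skew or derates).
    """
    budget = period_ps - t_cq_ps - t_su_ps
    return levels * avg_gate_delay_ps <= budget

print(fits(20, 50, 1000))  # 1000 ps of logic vs an 875 ps budget → False
print(fits(20, 50, 1200))  # 1000 ps of logic vs a 1075 ps budget → True
```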

Critical Path Fixing — Ordered by Impact

Not all timing fixes have the same impact. These techniques are ordered from most structural (largest impact, done early) to most local (smallest impact, done late in the flow).

1. Pipelining

Insert a flip-flop in the middle of a long combinational path, splitting it into two shorter paths. Each half now has the full clock period. Increases latency by 1 cycle and nearly doubles the maximum achievable frequency for that path (each new stage re-pays T_cq and setup time, so the gain is slightly under 2×). Best applied early during RTL design.

2. Logic restructuring

Rebalance the logic tree to reduce the longest path. Example: a ripple-carry adder replaced by a carry-lookahead adder cuts gate depth by ~60%. Requires RTL or synthesis script changes. Works best on arithmetic-heavy paths.

3. Gate sizing (drive strength)

Replace a weak cell (e.g., AND2_X1) with a stronger version (AND2_X4). Larger drive strength charges downstream capacitance faster, reducing delay. Each upsizing step reduces delay by ~15–25%. Limited by area/power budget.

4. VT swapping

Replace High-Vt cells (slower, lower leakage) with Low-Vt cells (faster, higher leakage) on the critical path. Typically 10–20% delay reduction per swap. Tools do this automatically during timing optimization with a leakage power constraint.

5. Useful skew

Delay the capture clock slightly (positive skew) to give the data path more time. Directly adds to setup slack on that path. Must be balanced against hold margin. See the Clock Tree page for details.

6. Physical optimization

Move critical cells closer together to reduce wire delay. Reroute high-fanout nets with wider wires. Add repeater buffers to break long RC chains. These are post-placement fixes performed by the P&R tool during timing-driven optimization.

Fix order matters: Always fix the most negative slack path first. After each fix, re-run timing — fixing one path sometimes reveals a new critical path that was previously hidden by a larger violation. Work through the violation list iteratively rather than trying to fix all paths simultaneously.
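As a rough sketch of why pipelining (technique 1) has the largest impact, the single-path fmax formula from earlier can be evaluated before and after splitting a path; all delay values below are illustrative:

```python
def fmax_ghz(t_cq, t_comb, t_su):
    """fmax = 1 / (T_cq + T_comb + T_su); times in ns, ideal clocks."""
    return 1.0 / (t_cq + t_comb + t_su)

before = fmax_ghz(0.18, 1.20, 0.045)  # one long 1.2 ns combinational path
after  = fmax_ghz(0.18, 0.60, 0.045)  # split in half by a pipeline register
print(round(before, 2), round(after, 2))  # roughly 0.70 GHz → 1.21 GHz
```

The speedup is less than a clean 2× because every new stage pays the T_cq + T_su overhead again, which is why very deep pipelines hit diminishing returns.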

Near-Critical Paths and Timing Margin

A design that meets timing with only 10 ps of slack on hundreds of paths is fragile. Any post-sign-off variation — ECO buffers added for functional bugs, slight floorplan changes, or different parasitic extraction — can push those paths into violation. Good timing closure targets a healthy margin above zero.

Slack range   Status               Action
< 0 ps        Failing — must fix   Apply fixes before tapeout
0–50 ps       Marginally passing   Monitor; any ECO may cause violations
50–200 ps     Healthy margin       Safe for most ECOs and sign-off
> 200 ps      Over-designed        Consider frequency increase or power reduction

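A small sketch that applies the slack buckets above to a list of end-point slacks; the thresholds mirror the table, and the example slack values are made up:

```python
def bucket(slack_ps):
    """Map an end-point slack (ps) to its sign-off health bucket."""
    if slack_ps < 0:
        return "failing"
    if slack_ps <= 50:
        return "marginal"
    if slack_ps <= 200:
        return "healthy"
    return "over-designed"

slacks = [-12, 8, 142, 315]        # hypothetical end-point slacks in ps
print([bucket(s) for s in slacks]) # → ['failing', 'marginal', 'healthy', 'over-designed']
```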
# Report top-20 near-critical paths (slack < 0.2 ns)
report_timing -max_paths 20 \
              -slack_lesser_than 0.2 \
              -delay_type max

# Check path count per slack bucket: dump the report to a file first
report_timing -max_paths 1000 -delay_type max > timing.rpt
# then, from a Unix shell:
#   grep "slack" timing.rpt | awk '{print $NF}' | sort -n | uniq -c

# Interactive: highlight critical paths in GUI
gui_highlight_timing_path [get_timing_paths -max_paths 5 -nworst 1]

Interactive Lab — Critical Path Analyzer
Build a logic chain stage by stage. Watch the path delay accumulate and see whether the path meets timing — and which gate is the bottleneck.

Frequently Asked Questions

What is the critical path, and why does it limit frequency?

The critical path is the timing path with the smallest (or most negative) setup slack in the design. It limits the maximum operating frequency because the clock period must be long enough for data to propagate through the longest combinational path: fmax = 1 / (T_cq + T_comb_max + T_su). Every other path in the design has more margin. Improving fmax means only the critical path needs to be shortened — the rest are already fine. After fixing the critical path, the next-worst path becomes the new critical path, and the process repeats.
Why does setup analysis use maximum delays while hold analysis uses minimum delays?

Setup analysis asks: can the data arrive in time for the capture FF? The worst case for this question is when the data path is as slow as possible — maximum delay. STA uses max (late) delay values from the slow (SS) corner to find the worst-case arrival time for setup checks. Hold analysis asks: does the data stay stable long enough after the clock edge? The worst case here is when the data changes as fast as possible — minimum delay. STA uses min (early) delay values from the fast (FF) corner for hold checks. Running both checks across their respective corners (best-case/worst-case analysis, typically refined with on-chip-variation derates) ensures each check is evaluated pessimistically.
What is a path group?

A path group is a collection of timing paths that share the same capture clock. By default, STA tools create one path group per clock domain. Path groups determine how the tool reports and optimizes paths — report_timing shows results per group, and synthesis/optimization tools improve each group's worst path independently. You can create custom path groups to give specific paths higher reporting priority, or to separate I/O paths from register-to-register paths for different optimization treatment.
How do I identify the bottleneck on a failing path?

In the timing report, look at the "Incr" column — this shows the incremental delay at each step (wire + cell delay combined). The single largest increment on the data path is the bottleneck gate or wire segment. Common bottlenecks: wide adders/multipliers with many gate levels, high-fanout nets driving many cells (the load capacitance slows the driver), long wires with high RC (post-layout), and cells operating at High-Vt for power savings on a timing-critical path. Fix the bottleneck first — upsizing it or replacing it with a faster implementation gives the maximum slack recovery for the minimum change.
What is the difference between WNS and TNS?

Worst Negative Slack (WNS) is the slack of the single most-failing path — the most negative number in the entire design. Total Negative Slack (TNS) is the sum of all negative slacks across all failing paths. WNS tells you how far your worst path is from closure. TNS tells you the overall volume of timing work remaining. A design with WNS = −0.5 ns and TNS = −0.5 ns has one failing path. A design with WNS = −0.5 ns and TNS = −50 ns has a hundred failing paths of similar severity. Both numbers together give a complete picture of timing health.
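The WNS/TNS distinction fits in a few lines of code; the slack values below are invented for illustration:

```python
# End-point slacks in ns (hypothetical); negative means failing
slacks = [-0.5, -0.3, -0.1, 0.2, 0.7]

neg = [s for s in slacks if s < 0]
wns = min(slacks)    # worst negative slack: the single worst end point
tns = sum(neg)       # total negative slack: volume of work remaining
print(wns, round(tns, 1))  # → -0.5 -0.9
```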
Can a chip fail in silicon even though STA passed?

Yes — several scenarios cause this. (1) Wrong SDC: false paths masking real violations, or overly loose I/O delays. (2) Missing corners: timing only closed at TT, but the SS corner fails. (3) Post-silicon variation beyond modeled limits: real silicon varies more than library models. (4) IR drop: power grid resistance causes the supply voltage to sag under load, slowing cells below library specs. (5) Crosstalk (SI): an aggressor net switching couples noise onto a victim net, changing its delay in ways STA may underestimate. Real silicon sign-off requires multi-corner analysis, SI-aware STA, and IR-drop-aware timing — not just nominal STA.

Explore Further

← SDC Constraints