What is a setup time violation?

A setup time violation means data arrives at a flip-flop's input too late relative to the clock edge — the data hasn't stabilized before the clock samples it, potentially capturing wrong data.

MCMM (Multi-Corner Multi-Mode) runs STA across all combinations of process corners (fast/slow), voltage corners (high/low), temperature corners (hot/cold), and operating modes (functional, test, low-power).

Physical Design Day 7 — Static Timing Analysis & Timing Closure

Q: What is Static Timing Analysis (STA)?

STA is an exhaustive timing verification method that checks all paths in a design against setup and hold constraints without running simulation, ensuring all flip-flops capture data reliably at the target clock frequency.

1. What is Static Timing Analysis?

Static Timing Analysis (STA) is the exhaustive, simulation-free verification that every timing path in a design meets its timing constraints. Unlike functional simulation, which exercises only a subset of input vectors, STA needs no vectors and analyzes all paths.

STA operates on a directed acyclic graph (DAG) of timing arcs:

Startpoint: A flip-flop clock pin (data launches from the Q output) or a primary input
Endpoint: A flip-flop D input or primary output
Combinational path: Logic gates and interconnect between start and end
Timing arc: Each gate pin-to-pin delay (rise/fall separately)

Timing Path Graph (STA View)

        CLK
         │
         ▼
    ┌─────────┐   D1    ┌─────┐   D2    ┌─────┐   D3    ┌─────────┐
    │   FF1   │──────►  │ AND │──────►  │ OR  │──────►  │   FF2   │
    │  (reg)  │  15ps   └─────┘  12ps   └─────┘  10ps   │  (reg)  │
    └─────────┘                                         └─────────┘
         │                                                   │
    Clock(FF1)            Wire delays:                   Clock(FF2)
    arrival: T_c1         D1→AND: 5ps                    arrival: T_c2
                          AND→OR: 4ps
                          OR→FF2: 6ps

    Skew = T_c2 - T_c1
    Total combinational delay: 15+12+10+5+4+6 = 52ps

    STA checks:
        Setup: 52 + T_c1 ≤ T_period + T_c2 - T_setup
        Hold:  52 + T_c1 ≥ T_c2 + T_hold
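The arrival-time propagation that STA performs over this graph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how a sign-off tool is implemented: the pin names (`FF1/CK`, `AND/A`, and so on) and the function `latest_arrival` are hypothetical, and each cell or wire delay from the diagram is modeled as a single arc with one delay value.

```python
from collections import defaultdict

def latest_arrival(arcs, start, end):
    """Return the latest arrival time (ps) at `end`, measured from `start`.

    arcs: iterable of (from_pin, to_pin, delay_ps) forming an acyclic graph.
    """
    succ = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for src, dst, delay in arcs:
        succ[src].append((dst, delay))
        indegree[dst] += 1
        nodes.update((src, dst))

    arrival = {node: float("-inf") for node in nodes}
    arrival[start] = 0

    # Visit pins in topological order so every fan-in is final before use.
    ready = [node for node in nodes if indegree[node] == 0]
    while ready:
        node = ready.pop()
        for nxt, delay in succ[node]:
            arrival[nxt] = max(arrival[nxt], arrival[node] + delay)
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return arrival[end]

# Hypothetical pin names; delays are the cell and wire values from the diagram.
ARCS = [
    ("FF1/CK", "D1", 15),      # first cell delay
    ("D1", "AND/A", 5),        # wire D1 -> AND
    ("AND/A", "AND/Y", 12),    # second cell delay
    ("AND/Y", "OR/A", 4),      # wire AND -> OR
    ("OR/A", "OR/Y", 10),      # third cell delay
    ("OR/Y", "FF2/D", 6),      # wire OR -> FF2
]

total_ps = latest_arrival(ARCS, "FF1/CK", "FF2/D")
print(total_ps)  # 52
```

Taking the maximum at each pin gives the latest arrival used for setup checks; replacing `max` with `min` (and `-inf` with `+inf`) would give the earliest arrival used for hold checks.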

2. Setup Time Analysis

Setup time is the minimum time data must be stable BEFORE the capturing clock edge. If data arrives too late (a setup violation), the flip-flop may capture stale data or enter a metastable state.

Setup Timing Equation:

    T_data_arrival ≤ T_clock_arrival - T_setup - T_uncertainty

    Where:
        T_data_arrival  = launch_clock_edge + logic_delay + wire_delay
        T_clock_arrival = capture_clock_edge + clock_network_delay(FF2)
        T_setup         = flip-flop's minimum setup requirement (library)
        T_uncertainty   = clock jitter + OCV margin + cross-talk guard

    Setup Slack = T_clock_arrival - T_setup - T_uncertainty - T_data_arrival

        Positive slack → PASS (timing margin available)
        Negative slack → FAIL (timing violation, fix required)

Example (1GHz clock, 1ns period):

    T_period = 1000ps
    Launch clock edge = 0ps
    Logic + wire delay = 750ps
    T_data_arrival = 0 + 750 = 750ps

    Capture edge = 1000ps
    Clock network delay(FF2) = 150ps
    T_clock_arrival = 1000 + 150 = 1150ps

    T_setup = 50ps
    T_uncertainty = 80ps

    Setup slack = 1150 - 50 - 80 - 750 = +270ps ✓ PASS (good margin)

Tight example (same path, 2GHz clock):

    T_period = 500ps
    T_data_arrival = 750ps (same path)
    T_clock_arrival = 500 + 150 = 650ps

    Setup slack = 650 - 50 - 80 - 750 = -230ps ✗ FAIL (needs fix!)
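The setup slack equation above is simple enough to check by script. The sketch below is only an illustration of that arithmetic; the function name `setup_slack` and its parameter names are hypothetical, and like the worked example it treats the launch clock edge as arriving at 0ps.

```python
def setup_slack(period_ps, path_delay_ps, capture_clk_delay_ps,
                t_setup_ps, t_uncertainty_ps, launch_edge_ps=0):
    """Setup slack in ps, following the equation in the text."""
    data_arrival = launch_edge_ps + path_delay_ps
    clock_arrival = launch_edge_ps + period_ps + capture_clk_delay_ps
    return clock_arrival - t_setup_ps - t_uncertainty_ps - data_arrival

# Same path checked at 1GHz and at 2GHz.
slack_1ghz = setup_slack(1000, 750, 150, 50, 80)
slack_2ghz = setup_slack(500, 750, 150, 50, 80)
print(slack_1ghz, slack_2ghz)  # 270 -230

# Slack shrinks one-for-one with the period, so the shortest period
# this path can meet is the current period minus the current slack.
min_period_ps = 1000 - slack_1ghz
print(min_period_ps)  # 730
```

Because slack depends linearly on the period, a failing setup path can always be made to pass by slowing the clock, which is the key difference from hold checks.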

Setup Violation Waveform

Clock at FF1 (launch):

    ─────┐     ┌─────────────────────┐     ┌────
         └─────┘                     └─────┘
               0ps                         1000ps

Data path output at FF2 D pin:

               │◄────────── 750ps propagation ──────────►│
    ─────────────────────────────┐ ────────────────
    Data = previous value        │ Transition to new
                                 └──── at 750ps

Clock at FF2 (capture):

    ─────────────────────────────────┐     ┌────────
                                     └─────┘
                                           1150ps (skewed late)

Setup window at FF2 (clock uncertainty not shown):

    Data must be stable by: 1150 - 50 = 1100ps
    Data arrives at: 750ps ← stable well before 1100ps
    ✓ No violation at 1GHz

At 2GHz (period=500ps):

    Clock at FF2 capture edge: 500 + 150 = 650ps
    Data must be stable by: 650 - 50 = 600ps
    Data arrives at: 750ps ← 150ps TOO LATE!
    (Subtracting the 80ps clock uncertainty as well gives the -230ps slack computed earlier.)
    ✗ Setup violation! Flip-flop captures wrong data

3. Hold Time Analysis

Hold time is the minimum time data must remain stable AFTER the clock edge. Hold violations are particularly dangerous: the hold check does not depend on the clock period, so they cannot be fixed by reducing the clock frequency, and a hold violation that reaches silicon is a permanent functional failure.

Hold Timing Equation:

    T_data_arrival ≥ T_clock_arrival + T_hold

    Hold Slack = T_data_arrival - T_clock_arrival - T_hold

        Positive → PASS
        Negative → FAIL (must fix; cannot use slower clock!)

Hold violation example:

    T_data_arrival = 25ps (very short combinational path)
    T_clock_arrival(FF2) = 100ps (FF2 clock arrives late relative to the launch clock)
    T_hold = 30ps

    Hold slack = 25 - 100 - 30 = -105ps ✗ FAIL!

    Data changes at 25ps, but FF2 needs data stable until 100+30=130ps
    Data is GONE before hold window ends!

Fixes for hold violation (add delay to data path):

    1. Insert delay cell (DELAYX1, DELAYX2) in path
    2. Use a smaller drive strength or higher-Vt cell (weaker drive = more delay)
    3. Increase wire length (more capacitance = more delay)

    Note: These deliberately slow data, the opposite of a setup fix!
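The hold check and the size of the fix can be sketched the same way. The 40ps delay per inserted cell below is an assumed, illustrative number (real DELAYX1/DELAYX2 delays come from the cell library), and the function names are hypothetical.

```python
import math

def hold_slack(data_arrival_ps, clock_arrival_ps, t_hold_ps):
    """Hold slack in ps, following the equation in the text."""
    return data_arrival_ps - clock_arrival_ps - t_hold_ps

def delay_cells_needed(slack_ps, cell_delay_ps):
    """Smallest number of delay cells that brings hold slack to >= 0."""
    if slack_ps >= 0:
        return 0
    return math.ceil(-slack_ps / cell_delay_ps)

slack = hold_slack(data_arrival_ps=25, clock_arrival_ps=100, t_hold_ps=30)
print(slack)  # -105

# Assumption: each inserted delay cell adds 40ps (illustrative value only).
cells = delay_cells_needed(slack, cell_delay_ps=40)
print(cells)  # 3
print(hold_slack(25 + cells * 40, 100, 30))  # 15
```

Note that the clock period never appears in `hold_slack`, which is why slowing the clock cannot repair a hold violation. Any delay added for hold must also be re-checked against the setup slack of the same path.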

4. On-Chip Variation (OCV)

Real chips have process variation across the die. Two nominally identical gates at different die locations experience different transistor thresholds (Vt), oxide thickness, and metal resistivity. OCV modeling accounts for this by derating delays: paths that hurt the check when slow are made slower, and paths that hurt it when fast are made faster.

OCV Derating — Pessimistic Timing Model

Without OCV (ideal, all gates identical; period = 500ps):

    Launch path delay: 500ps
    Capture clock delay: 200ps
    Setup slack = (200 + 500) - 50 - 80 - 500 = +70ps

With OCV (realistic derating):

    Launch path uses SLOW derating:
        Cell derate: 500ps × 1.10 = 550ps (10% slower)
        Wire derate: 500ps × 1.05 = 525ps (5% more RC)
        Launch delay used: 550ps (worst case)

    Capture clock uses FAST derating:
        Clock to FF2: 200ps × 0.95 = 190ps (5% faster → earlier clock)
        (Earlier capture edge = tighter setup check)

OCV-derated setup check:

    T_data_arrival = 550ps (launch, worst slow)
    T_clock_arrival = 190ps + 500ps = 690ps (capture, fast)
    Setup slack = 690 - 50 - 80 - 550 = +10ps (was +70ps without OCV!)

    OCV significantly reduces slack margins.

Advanced OCV Methods:

    AOCV (Advanced OCV): Derating depends on path depth and distance (shorter paths = more variation)
    POCV (Parametric OCV, Synopsys terminology): Per-cell delay sigma combined statistically along the path; less pessimistic than AOCV
    SOCV (Statistical OCV, Cadence terminology): The comparable sigma-based statistical approach
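As a rough sketch of flat OCV derating, the function below applies a late derate to the launch path and an early derate to the capture clock before computing setup slack. The function and parameter names are hypothetical, the 500ps period is taken from the arithmetic in the example, and a single late derate is applied to the whole launch delay, which is a simplification of how tools derate cells and wires separately.

```python
def ocv_setup_slack(launch_delay_ps, capture_clk_delay_ps, period_ps,
                    t_setup_ps, t_uncertainty_ps,
                    late_derate=1.10, early_derate=0.95):
    """Setup slack (ps) with flat OCV derates.

    late_derate slows the data (launch) path; early_derate speeds up the
    capture clock path. Both move the check in the pessimistic direction.
    """
    data_arrival = launch_delay_ps * late_derate
    clock_arrival = period_ps + capture_clk_delay_ps * early_derate
    return clock_arrival - t_setup_ps - t_uncertainty_ps - data_arrival

derated = ocv_setup_slack(500, 200, 500, 50, 80)
print(f"derated setup slack: {derated:.0f} ps")

# Setting both derates to 1.0 gives the slack of the same path with no OCV.
nominal = ocv_setup_slack(500, 200, 500, 50, 80,
                          late_derate=1.0, early_derate=1.0)
print(f"slack lost to derating: {nominal - derated:.0f} ps")
```

For a hold check the roles swap: the data path takes the early derate and the capture clock takes the late derate.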

5. Multi-Corner Multi-Mode (MCMM)

MCMM verifies timing across ALL required combinations of operating conditions in a single analysis. A chip that passes at only one corner can still fail in silicon at a different temperature, voltage, or process lot.

Dimension	Corners	Setup Check Corner	Hold Check Corner
Process	SS / TT / FF (slow/typical/fast)	Slow (SS)	Fast (FF)
Voltage	0.72V / 0.8V / 0.88V (±10%)	Low V (0.72V)	High V (0.88V)
Temperature	-40°C / 25°C / 125°C	Depends on node	Depends on node
Operating Mode	Functional / Scan / Low-Power	All modes	All modes

MCMM Corner Set (Typical Production Sign-Off):

    Setup corners (worst for setup: slow data, tight clock window):
        SLOW_0.72V_125C (slow process, low voltage, hot)
        SLOW_0.72V_-40C (slow process, low voltage, cold; covers temperature inversion)

    Hold corners (worst for hold: fast data, late capture clock):
        FAST_0.88V_-40C (fast process, high voltage, cold)
        FAST_0.88V_25C (fast process, high voltage, room temp)

    Total corners in sign-off: typically 8–16 corner/mode combos
    Run time per corner: 2–8 hours (distributed compute)

    Total MCMM run: 16 corners × 6 hours = 96 CPU-hours
    Distributed across 32-core cluster:
        Wall time: ~3 hours per iteration (assuming ideal parallel scaling)
        Typical iterations to clean sign-off: 5–10
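A short sketch of how the scenario list and the run-time estimate come together. The corner names are the ones listed above and the modes come from the table; the `mode@corner` naming scheme and both function names are hypothetical, and the wall-time estimate assumes perfectly parallel scaling, as the arithmetic in the text does.

```python
from itertools import product

SETUP_CORNERS = ["SLOW_0.72V_125C", "SLOW_0.72V_-40C"]
HOLD_CORNERS = ["FAST_0.88V_-40C", "FAST_0.88V_25C"]
MODES = ["functional", "scan", "low_power"]

def scenarios(corners, modes):
    """Every corner/mode combination, named mode@corner (hypothetical scheme)."""
    return [f"{mode}@{corner}" for corner, mode in product(corners, modes)]

def wall_time_hours(n_scenarios, hours_each, parallel_slots):
    """Ideal wall-clock time if the work spreads evenly over all slots."""
    return n_scenarios * hours_each / parallel_slots

all_scenarios = scenarios(SETUP_CORNERS + HOLD_CORNERS, MODES)
print(len(all_scenarios))          # 12
print(all_scenarios[0])            # functional@SLOW_0.72V_125C

# The estimate from the text: 16 scenarios, 6 hours each, 32 slots.
print(wall_time_hours(16, 6, 32))  # 3.0
```

The scenario count is multiplicative, which is why adding one more mode or corner can noticeably lengthen every timing-closure iteration.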

6. Clock Uncertainty

Clock signals are not perfect. Three distinct sources of uncertainty eat into your timing budget at every cycle:

Clock Uncertainty Sources

                    ┌─────────────────────────────────────┐
    Ideal Clock     │ Clock Period = 1000ps               │
    ────────────────┤◄──────────────────────────────────► │
                    └─────────────────────────────────────┘

1. Jitter (cycle-to-cycle period variation):

    Ideal: │    │    │    │    │    │   ← equally spaced edges
    Real:  │   │     │    │   │     │   ← edges shift ±30ps
    ±30ps jitter

2. Skew (spatial variation across die):

    FF1 clock arrival: 150ps
    FF2 clock arrival: 185ps
    Skew = 35ps

3. OCV on clock tree itself:

    Derating adds ±20ps uncertainty to buffered clock paths

Total uncertainty added to timing check:

    Jitter: 30ps
    Skew:   35ps
    OCV:    20ps
    Guard:  15ps (margin)
    Total: 100ps subtracted from clock period for setup

    Effective timing budget = 1000 - 100 = 900ps available for logic
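The budget arithmetic above can be kept in one place so that a change to any component updates the logic budget. This is a minimal sketch using the example's numbers; the dictionary and function names are hypothetical.

```python
# Uncertainty components from the example, in picoseconds.
UNCERTAINTY_PS = {"jitter": 30, "skew": 35, "ocv": 20, "guard": 15}

def logic_budget_ps(period_ps, components=UNCERTAINTY_PS):
    """Clock period left for logic after subtracting total uncertainty."""
    return period_ps - sum(components.values())

total = sum(UNCERTAINTY_PS.values())
print(total)                  # 100
print(logic_budget_ps(1000))  # 900

# The same 100ps costs a larger share of a shorter period.
print(logic_budget_ps(500))   # 400
```

At a 1000ps period the uncertainty consumes 10% of the cycle; at 500ps the same 100ps consumes 20%, which is why uncertainty reduction matters more as frequency rises.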

7. Critical Path Fixing Techniques

When a path fails timing, the fix must address the specific bottleneck. STA reports show which gate or wire contributes the most delay — target those first.

7.1 Cell Upsizing

Cell Upsizing Effect on Critical Path

Original path (timing FAIL, slack = -10ps):

    FF1_Q ─[BUF_X1]─ 30ps ─[AND_X1]─ 40ps ─[OR_X1]─ 35ps ─ FF2_D
           (drive X1)      (drive X1)      (drive X1)

    These three cells: 30 + 40 + 35 = 105ps of a 750ps total path
    Bottleneck: AND_X1 has 40ps delay due to high fan-out (10 loads)

Fix: Upsize AND_X1 → AND_X4 (4× drive strength):

    AND_X4 drives same 10 loads at: 15ps (instead of 40ps)
    Cell savings: 40 - 15 = 25ps (these three cells: 105ps → 80ps)
    Wire delay also decreases because the stronger driver produces faster slews
    Effective savings: 30ps total

New path:

    FF1_Q ─[BUF_X1]─[AND_X4]─[OR_X1]─ FF2_D = 720ps total
    New slack: -10 + 30 = +20ps ✓ PASS

7.2 Buffer Insertion (Logical Effort)

Long wires with many fanout loads benefit from intermediate buffering to reduce delay.

Buffer Insertion Formula (Logical Effort):

    Optimal number of buffer stages for a large load (ignoring parasitic delay):
        n = ln(h) = log(h) / log(e)
        h = electrical effort (output capacitance / input capacitance)
        e = Euler's number (~2.718)

    For h = 64 (64× more capacitive load than driving gate):
        n = log(64) / log(2.718) ≈ 4.16 → insert 4 buffers

    Each buffer stage has effort: 64^(1/4) = 2.83×
    Vs. direct drive effort: 64× (much worse)

    Delay with 4 buffers: proportional to 4 × 2.83 = 11.3 effort units
    Delay direct: 64 effort units
    Improvement: 64/11.3 = 5.7× faster with optimal buffering!

    Note: With parasitic delay included, the best stage effort is closer to 4, which favors slightly fewer stages.

Practical:
    EDA tools handle this automatically via "buffer tree synthesis"
    Manual intervention needed when auto-fix breaks constraints
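The stage-count calculation above can be reproduced with a few lines of Python. This is a sketch of the simplified model in the text (delay proportional to stages × stage effort, parasitic delay ignored); the function name `buffer_plan` is hypothetical.

```python
import math

def buffer_plan(h):
    """Stage count and delay estimate for electrical effort h.

    Uses the simplified model from the text: n = ln(h) rounded to a whole
    number of stages, equal effort per stage, parasitic delay ignored.
    """
    n = max(1, round(math.log(h)))
    stage_effort = h ** (1 / n)
    buffered_delay = n * stage_effort
    speedup = h / buffered_delay
    return n, stage_effort, buffered_delay, speedup

n, effort, delay, speedup = buffer_plan(64)
print(n)                  # 4
print(f"{effort:.2f}")    # 2.83
print(f"{delay:.1f}")     # 11.3
print(f"{speedup:.1f}")   # 5.7
```

The delay curve is flat near the optimum, so using one stage more or fewer than `n` costs little; that is why tools can trade stage count against area and power.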

7.3 Gate Restructuring

Rebalancing logic depth, for example converting a serial chain of gates into a balanced binary tree, can cut delay by 40–60% on deep combinational paths. The saving grows with the length of the original chain.

Logic Restructuring — Reduce Depth

Before (sequential chain, 4 levels deep):

    A,B ─[AND]─ AB ─[AND]─ ABC ─[AND]─ ABCD ─[AND]─ ABCDE
          5ps        5ps         5ps          5ps

    Total: 20ps delay (4 gate levels)

After (balanced binary tree, 3 levels deep):

    A ─┐
       [AND]─ AB ─┐
    B ─┘          [AND]─ ABCD ─┐
    C ─┐          │            [AND]─ ABCDE
       [AND]─ CD ─┘            │
    D ─┘                       │
    E ─────────────────────────┘

    Total: 5+5+5 = 15ps delay (3 gate levels) ← 25% less delay!

Tools:
    Logic synthesis restructuring (Synopsys Design Compiler)
    Post-route ECO gate replacement (Cadence Innovus)
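The depth saving from restructuring follows directly from the input count. The sketch below assumes 2-input gates with the same 5ps delay at every level, as in the example; the function names are hypothetical.

```python
def chain_levels(n_inputs):
    """Gate levels when n inputs are combined one after another."""
    return n_inputs - 1

def balanced_levels(n_inputs):
    """Gate levels for a balanced tree of 2-input gates: ceil(log2(n))."""
    return (n_inputs - 1).bit_length()

def delay_reduction(n_inputs):
    """Fraction of delay removed, assuming equal delay per gate level."""
    before = chain_levels(n_inputs)
    after = balanced_levels(n_inputs)
    return (before - after) / before

GATE_DELAY_PS = 5  # per-gate delay used in the example

print(chain_levels(5) * GATE_DELAY_PS)     # 20
print(balanced_levels(5) * GATE_DELAY_PS)  # 15
print(delay_reduction(5))                  # 0.25

# A longer chain gains much more: 16 inputs go from 15 levels to 4.
print(chain_levels(16), balanced_levels(16))  # 15 4
```

The 5-input example saves only 25% because the chain is short; chain depth grows linearly with the input count while tree depth grows logarithmically.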

8. Timing Sign-Off — TSMC N5 Example

Apple M2 Chip (TSMC N5) (clock frequencies are public; the sign-off figures below are illustrative)

Clock frequency: 3.49 GHz (performance cores), 2.42 GHz (efficiency cores)
Critical path target: ~286ps at 3.49 GHz
Clock uncertainty used: 120ps (jitter + skew + OCV)
Effective logic budget: 286 - 120 = 166ps for combinational logic
MCMM corners: 12 corner/mode combinations signed off
Timing iterations to clean: ~15 (over 8 weeks)
Worst slack at sign-off: +2ps (razor-thin! A-grade closure)
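The period and logic-budget figures in the list above follow from the clock frequency. A small sketch of that conversion, with hypothetical function names; the 120ps uncertainty is the value quoted in the list.

```python
def clock_period_ps(freq_ghz):
    """Clock period in picoseconds for a frequency in GHz."""
    return 1000.0 / freq_ghz

def logic_budget_ps(freq_ghz, uncertainty_ps):
    """Period remaining for logic after subtracting clock uncertainty."""
    return clock_period_ps(freq_ghz) - uncertainty_ps

perf_period = clock_period_ps(3.49)
eff_period = clock_period_ps(2.42)
print(f"{perf_period:.1f} ps")  # 286.5 ps
print(f"{eff_period:.1f} ps")   # 413.2 ps

print(f"{logic_budget_ps(3.49, 120):.1f} ps")  # 166.5 ps
```

At these frequencies the quoted uncertainty takes roughly 42% of the performance-core period, which shows how little of the cycle is left for combinational logic.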

Timing Slack Distribution at Sign-Off (Apple M2)

[Histogram: number of paths (millions) versus slack (ps); x-axis ticks at -50, 0, +20, +50, +100, +200, +500. All paths must move to the right of 0.]

Percentage of violating paths over the course of timing closure:

    Week 0: 8.3% (millions of violations)
    Week 2: 2.1%
    Week 4: 0.3%
    Week 6: 0.02%
    Week 8: 0% ← CLEAN SIGN-OFF

Tools used:

    Synopsys PrimeTime (STA engine)
    Cadence Tempus (STA + ECO integration)
    Custom timing scripts (internal, proprietary)

9. STA Tools and ECO Flows

Tool	Vendor	Strength	Key Feature
PrimeTime	Synopsys	Industry reference for STA	POCV, AOCV, graph-based analysis
Tempus	Cadence	ECO-integrated STA	Auto ECO with placement awareness
GoldTime	Siemens	Fast signoff verification	Correlation with Calibre extraction
StarRC	Synopsys	Parasitic extraction	Best accuracy for parasitic delays
Quantus	Cadence	RC extraction	Fast extraction with Innovus integration

10. Timing Closure Checklist

Production Timing Sign-Off Checklist

✅ MCMM setup defined: All process/voltage/temperature/mode corners configured
✅ Constraints verified (SDC): Clock definitions, false paths, multicycle paths correct
✅ Clock uncertainty budgeted: Jitter, skew, OCV all accounted for
✅ Post-route parasitic extraction (PEX): RC parasitics from final routed layout included
✅ Zero setup violations: All paths pass setup at all MCMM corners
✅ Zero hold violations: All paths pass hold at all MCMM corners
✅ OCV/AOCV/POCV mode: Appropriate derating applied for technology node
✅ Critical path depth reviewed: No path deeper than target gate count
✅ Clock path checked separately: CTS meets skew and latency targets
✅ Cross-clock path analyzed: CDC paths either constrained or properly synchronized
✅ I/O timing met: All input/output ports meet interface constraints
✅ Final sign-off report generated: Foundry-accepted format, all corners green

Next — Day 8: Power optimization and thermal management — voltage scaling, clock gating, power domains, and thermal budgeting for production VLSI chips.