1. What is Static Timing Analysis?
Static Timing Analysis (STA) is the exhaustive, simulation-free verification that every timing path in a design (register to register, input to register, register to output) meets its timing constraints. Unlike functional simulation, which checks only the input vectors it is given, STA analyzes all paths without any test vectors.
STA operates on a directed acyclic graph (DAG) of timing arcs:
- Startpoint: A flip-flop Q output or primary input
- Endpoint: A flip-flop D input or primary output
- Combinational path: Logic gates and interconnect between start and end
- Timing arc: A pin-to-pin delay through a cell or along a net (rise and fall modeled separately)
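A minimal Python sketch of how an STA engine walks this DAG: visit pins in topological order and keep, at each pin, the latest arrival (used for setup) and the earliest arrival (used for hold) over all incoming arcs. The pin names and delays are hypothetical, and rise/fall arcs are collapsed into a single delay to keep it short.

```python
from graphlib import TopologicalSorter

# Timing arcs: (from_pin, to_pin) -> delay in ps. Hypothetical values;
# cell arcs (A->Y) and net arcs (Y->next pin) are both just edges here.
ARCS = {
    ("FF1/Q", "U1/A"): 5,
    ("FF1/Q", "U2/A"): 8,
    ("U1/A", "U1/Y"): 20,  # slow gate
    ("U2/A", "U2/Y"): 10,  # fast gate
    ("U1/Y", "U3/A"): 4,
    ("U2/Y", "U3/B"): 3,
    ("U3/A", "U3/Y"): 12,  # reconvergence: two input pins of U3
    ("U3/B", "U3/Y"): 12,
    ("U3/Y", "FF2/D"): 6,
}

def propagate(arcs, start, launch_time=0):
    """Return (latest, earliest) arrival times for pins reachable from start."""
    preds = {}
    for src, dst in arcs:
        preds.setdefault(src, set())
        preds.setdefault(dst, set()).add(src)
    latest = {start: launch_time}
    earliest = {start: launch_time}
    # static_order() yields every pin after all of its predecessors.
    for pin in TopologicalSorter(preds).static_order():
        ins = [p for p in preds[pin] if p in latest]
        if not ins:
            continue  # the startpoint itself, or a pin it cannot reach
        latest[pin] = max(latest[p] + arcs[(p, pin)] for p in ins)
        earliest[pin] = min(earliest[p] + arcs[(p, pin)] for p in ins)
    return latest, earliest

latest, earliest = propagate(ARCS, "FF1/Q")
print(latest["FF2/D"], earliest["FF2/D"])  # 47 39
```

The reconvergent fan-in at U3 is where the two answers split: the slow branch sets the 47ps latest arrival at FF2/D and the fast branch sets the 39ps earliest arrival. Production tools additionally propagate slews and keep rise and fall values separate.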
Timing Path Graph (STA View)
CLK → FF1 (reg) ──D1 (15ps)──► AND ──D2 (12ps)──► OR ──D3 (10ps)──► FF2 (reg) ← CLK
Clock arrival at FF1: T_c1; clock arrival at FF2: T_c2
Wire delays: D1→AND: 5ps, AND→OR: 4ps, OR→FF2: 6ps
Skew = T_c2 - T_c1
Total combinational delay: 15 + 12 + 10 + 5 + 4 + 6 = 52ps
STA checks:
Setup: 52 + T_c1 ≤ T_period + T_c2 - T_setup
Hold: 52 + T_c1 ≥ T_c2 + T_hold
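The two checks can be written out directly. In this sketch the period, the clock arrivals T_c1/T_c2, and the library setup/hold values are hypothetical inputs, since the diagram leaves them symbolic; only the six delays come from the figure.

```python
from dataclasses import dataclass

@dataclass
class PathCheck:
    path_delay: float  # total data-path delay, ps
    t_c1: float        # clock arrival at launching FF1, ps
    t_c2: float        # clock arrival at capturing FF2, ps
    period: float      # clock period, ps
    t_setup: float     # FF2 library setup time, ps
    t_hold: float      # FF2 library hold time, ps

    def setup_slack(self) -> float:
        # Setup requires: path_delay + T_c1 <= T_period + T_c2 - T_setup
        return (self.period + self.t_c2 - self.t_setup) - (self.path_delay + self.t_c1)

    def hold_slack(self) -> float:
        # Hold requires: path_delay + T_c1 >= T_c2 + T_hold
        return (self.path_delay + self.t_c1) - (self.t_c2 + self.t_hold)

# Stage and wire delays from the figure (ps); the other inputs are hypothetical.
delays = [15, 12, 10, 5, 4, 6]
check = PathCheck(sum(delays), t_c1=20, t_c2=30, period=100, t_setup=8, t_hold=5)
print(check.setup_slack(), check.hold_slack())  # 50 37
```

A positive result means the inequality holds with that much margin; a negative result is the amount by which it is violated.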
2. Setup Time Analysis
Setup time is the minimum time data must be stable BEFORE the capturing clock edge. If data arrives too late (a setup violation), the flip-flop may capture the old value or go metastable and settle unpredictably.
Setup Timing Equation:
T_data_arrival ≤ T_clock_arrival - T_setup - T_uncertainty
Where:
T_data_arrival = launch_clock_edge + logic_delay + wire_delay (simplified: FF1's clock-network delay and clock-to-Q delay are taken as zero)
T_clock_arrival = capture_clock_edge + clock_network_delay(FF2)
T_setup = flip-flop's minimum setup requirement (library)
T_uncertainty = clock jitter + OCV margin + cross-talk guard
Setup Slack = T_clock_arrival - T_setup - T_uncertainty - T_data_arrival
Positive slack → PASS (timing margin available)
Negative slack → FAIL (timing violation, fix required)
Example (1GHz clock, 1ns period):
T_period = 1000ps
Launch clock edge = 0ps
Logic + wire delay = 750ps
T_data_arrival = 0 + 750 = 750ps
Capture edge = 1000ps
Clock network delay(FF2) = 150ps
T_clock_arrival = 1000 + 150 = 1150ps
T_setup = 50ps
T_uncertainty = 80ps
Setup slack = 1150 - 50 - 80 - 750 = +270ps ✓ PASS (good margin)
Tight example (same path, 2GHz clock):
T_period = 500ps
T_data_arrival = 750ps (same path)
T_clock_arrival = 500 + 150 = 650ps
Setup slack = 650 - 50 - 80 - 750 = -230ps ✗ FAIL (needs fix!)
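Both examples reduce to the same arithmetic. A minimal sketch that reproduces them using the simplified equation above (launch edge at 0ps, no launch-clock latency); min_period is a helper added here for illustration that solves slack = 0 for the period.

```python
def setup_slack(period, data_delay, capture_latency, t_setup, t_uncertainty,
                launch_edge=0):
    """Setup slack in ps, per the simplified setup equation above."""
    t_data_arrival = launch_edge + data_delay
    # The capture edge is one period after the launch edge, delayed by the
    # clock network to FF2.
    t_clock_arrival = launch_edge + period + capture_latency
    return t_clock_arrival - t_setup - t_uncertainty - t_data_arrival

def min_period(data_delay, capture_latency, t_setup, t_uncertainty):
    """Smallest period with non-negative setup slack (solve slack == 0)."""
    return data_delay + t_setup + t_uncertainty - capture_latency

for freq_ghz in (1.0, 2.0):
    period_ps = 1000 / freq_ghz
    slack = setup_slack(period_ps, data_delay=750, capture_latency=150,
                        t_setup=50, t_uncertainty=80)
    print(f"{freq_ghz:.0f} GHz: slack = {slack:+.0f} ps",
          "PASS" if slack >= 0 else "FAIL")
```

For this path the minimum period is 750 + 50 + 80 - 150 = 730ps, so it passes up to roughly 1.37 GHz.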
Setup Violation Waveform (1GHz case)
Clock at FF1 (launch): edges at 0ps and 1000ps
Data path output at FF2 D pin: holds the previous value, then transitions to the new value at 750ps (750ps propagation after the 0ps launch edge)
Clock at FF2 (capture): edge at 1150ps (the 1000ps edge, skewed 150ps late)
Setup window at FF2:
Data must be stable by: 1150 - 50 = 1100ps (1020ps once the 80ps uncertainty is also subtracted)
Data arrives at: 750ps ← stable well before the deadline
✓ No violation at 1GHz
At 2GHz (period=500ps):
Clock at FF2 capture edge: 500 + 150 = 650ps
Data must be stable by: 650 - 50 = 600ps (520ps with uncertainty)
Data arrives at: 750ps ← 150ps TOO LATE (230ps with uncertainty, matching the -230ps slack)
✗ Setup violation! FF2 may capture stale data or go metastable
3. Hold Time Analysis
Hold time is the minimum time data must remain stable AFTER the clock edge. Hold violations are particularly dangerous: the hold check compares launch and capture on the same clock edge, so the clock period drops out of the equation and reducing the clock frequency cannot fix them. A chip with a hold violation fails at every frequency.
Hold Timing Equation:
T_data_arrival ≥ T_clock_arrival + T_hold
Hold Slack = T_data_arrival - T_clock_arrival - T_hold
Positive → PASS
Negative → FAIL (must fix — cannot use slower clock!)
Hold violation example:
T_data_arrival = 25ps (very short combinational path)
T_clock_arrival(FF2) = 100ps (FF2's clock arrives 100ps late relative to FF1's)
T_hold = 30ps
Hold slack = 25 - 100 - 30 = -105ps ✗ FAIL!
Data changes at 25ps, but FF2 needs data stable until 100+30=130ps
Data is GONE before hold window ends!
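The hold check in code, using the numbers from the example above. The function names are illustrative; note that neither takes a clock period.

```python
def hold_slack(t_data_arrival, t_clock_arrival, t_hold):
    """Hold slack in ps (same-edge check: no clock period term)."""
    return t_data_arrival - t_clock_arrival - t_hold

def required_stable_until(t_clock_arrival, t_hold):
    """Time until which the old data must stay stable at FF2's D pin."""
    return t_clock_arrival + t_hold

# Numbers from the hold violation example above.
slack = hold_slack(t_data_arrival=25, t_clock_arrival=100, t_hold=30)
deadline = required_stable_until(t_clock_arrival=100, t_hold=30)
print(f"data changes at 25ps, must hold until {deadline}ps, slack {slack}ps")
```

Because there is no period argument, no choice of clock frequency changes the result; only adding data-path delay (or reducing how late the capture clock arrives) does.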
Fixes for hold violation (add delay to data path):
1. Insert delay cells (DELAYX1, DELAYX2) in the data path
2. Downsize cells or swap to higher-Vt (slower) variants on the data path
3. Detour the route (longer wire = more RC delay)
Note: These deliberately slow data — opposite of setup fix!
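Because hold padding also lengthens the max path, it eats into setup slack on the same path. A sketch of how a fix-up step might size the padding, assuming hypothetical delays of 20ps for DELAYX1 and 40ps for DELAYX2 (the cell names come from the list above; the delays are invented) and a hypothetical 150ps of setup slack on the path.

```python
# Hypothetical delay-cell library: cell name -> delay in ps (values invented).
DELAY_CELLS = {"DELAYX1": 20, "DELAYX2": 40}

def plan_hold_fix(hold_slack_ps, setup_slack_ps, cells=DELAY_CELLS):
    """Greedily choose delay cells that make hold slack >= 0.

    Returns (cell_names, added_delay_ps). Raises ValueError if the padding
    would consume more than the path's setup slack. Simplification: each
    cell adds the same delay to the min (hold) and max (setup) paths.
    """
    if hold_slack_ps >= 0:
        return [], 0
    needed = -hold_slack_ps
    chosen, added = [], 0
    # Biggest cells first, without overshooting.
    for name, delay in sorted(cells.items(), key=lambda kv: -kv[1]):
        while needed - added >= delay:
            chosen.append(name)
            added += delay
    # Close any remaining gap with the smallest cell (may overshoot slightly).
    if added < needed:
        smallest = min(cells, key=cells.get)
        chosen.append(smallest)
        added += cells[smallest]
    if added > setup_slack_ps:
        raise ValueError(f"{added}ps of padding would break setup "
                         f"(only {setup_slack_ps}ps of setup slack)")
    return chosen, added

cells, added = plan_hold_fix(hold_slack_ps=-105, setup_slack_ps=150)
print(cells, added)
```

For the -105ps example this adds 120ps, leaving +15ps of hold slack and 30ps of setup slack; a path with less than 120ps of setup slack could not be fixed by padding alone.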
4. On-Chip Variation (OCV)
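Real chips have process variation across the die. Two nominally identical gates at different die locations experience different transistor threshold voltages (Vt), oxide thickness, and metal resistivity. OCV modeling accounts for this with derating: it slows one side of each timing check and speeds up the other, so the check covers the worst plausible mismatch.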
OCV Derating — Pessimistic Timing Model
Without OCV (ideal, all gates identical):
Launch path delay: 500ps
Capture clock delay: 200ps
With OCV (realistic derating):
Launch path uses SLOW derating:
Cells: × 1.10 (10% slower)
Wires: × 1.05 (5% more RC)
Total launch: 500ps × 1.10 = 550ps (worst case: the whole path taken at the cell derate)
Capture clock uses FAST derating:
Clock to FF2: 200ps × 0.95 = 190ps (5% faster → earlier clock)
(Earlier capture edge = tighter setup check)
OCV-derated setup check:
T_data_arrival = 550ps (launch, worst slow)
T_clock_arrival = 190ps + 500ps (period) = 690ps (capture, fast)
Setup slack = 690 - 50 - 80 - 550 = +10ps (vs. 700 - 50 - 80 - 500 = +70ps without OCV)
OCV cuts this path's slack from +70ps to +10ps.
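The derated check as a small function. This sketch assumes, as the example does, a 500ps period (implied by the 190ps + 500ps capture arrival), zero launch-clock latency, and a single flat derate per side of the check.

```python
def derated_setup_slack(data_delay, capture_clock_delay, period, t_setup,
                        t_uncertainty, late_derate=1.0, early_derate=1.0):
    """Setup slack (ps) with flat OCV derates.

    The data path is scaled by the late (slow) derate and the capture clock
    path by the early (fast) derate, the pessimistic pairing for setup.
    Launch clock latency is taken as zero, as in the example above.
    """
    t_data_arrival = data_delay * late_derate
    t_clock_arrival = capture_clock_delay * early_derate + period
    return t_clock_arrival - t_setup - t_uncertainty - t_data_arrival

args = dict(data_delay=500, capture_clock_delay=200, period=500,
            t_setup=50, t_uncertainty=80)
no_ocv = derated_setup_slack(**args)
with_ocv = derated_setup_slack(**args, late_derate=1.10, early_derate=0.95)
print(round(no_ocv, 1), round(with_ocv, 1))  # 70.0 10.0
```

For a hold check the pairing flips: the data path takes the early derate and the capture clock the late one.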
Advanced OCV Methods:
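AOCV (Advanced OCV): Derates depend on path depth and distance; shallow paths get larger derates because random variation averages out over fewer stages
SOCV (Statistical OCV): Cadence's name for sigma-based statistical derating; computed analytically rather than by Monte Carlo, and less pessimistic than AOCV
POCV (Parametric OCV): Synopsys's name for the same approach; each cell delay carries a mean and a sigma (typically from LVF library data), and sigmas combine statistically along the path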
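Why statistical methods recover margin: compare stacking every stage at its 3-sigma delay with a POCV-style root-sum-square combination. This sketch assumes identical stages with a hypothetical 50ps mean and 3ps sigma each, and fully independent (uncorrelated) random variation; real flows also model correlated components.

```python
import math

def worst_case_stack(means, sigmas, n_sigma=3.0):
    """Every stage at its +n-sigma delay at once (flat, per-stage margin)."""
    return sum(m + n_sigma * s for m, s in zip(means, sigmas))

def statistical_path(means, sigmas, n_sigma=3.0):
    """Means add; independent random sigmas add in quadrature (root-sum-square)."""
    return sum(means) + n_sigma * math.sqrt(sum(s * s for s in sigmas))

means, sigmas = [50.0] * 10, [3.0] * 10
print(worst_case_stack(means, sigmas))  # 590.0
print(round(statistical_path(means, sigmas), 1))

# Effective path derate shrinks with depth, the effect AOCV tables approximate.
for depth in (1, 4, 16):
    m, s = [50.0] * depth, [3.0] * depth
    print(depth, round(statistical_path(m, s) / sum(m), 3))
```

The statistical margin grows as the square root of depth while the stacked margin grows linearly, so the effective derate falls from 1.18 at one stage to about 1.045 at sixteen.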
5. Multi-Corner Multi-Mode (MCMM)
MCMM verifies timing across every sign-off combination of operating corner (process, voltage, temperature) and operating mode in a single analysis setup. A chip that passes only one corner can fail in silicon at other temperatures, voltages, or process lots.
| Dimension | Corners | Setup Check Corner | Hold Check Corner |
|---|---|---|---|
| Process | SS / TT / FF (slow/typical/fast) | Slow (SS) | Fast (FF) |
| Voltage | 0.72V / 0.8V / 0.88V (±10%) | Low V (0.72V) | High V (0.88V) |
| Temperature | -40°C / 25°C / 125°C | Depends on node | Depends on node |
| Operating Mode | Functional / Scan / Low-Power | All modes | All modes |
MCMM Corner Set (Typical Production Sign-Off):
Setup corners (worst for setup — slow data, tight clock window):
SLOW_0.72V_125C (slow process, low voltage, hot)
SLOW_0.72V_-40C (slow process, low voltage, cold: at advanced nodes, temperature inversion makes low temperature the slow case)
Hold corners (worst for hold: fast data, late capture clock):
FAST_0.88V_-40C (fast process, high voltage, cold)
FAST_0.88V_25C (fast process, high voltage, room temp)
Total corners in sign-off: typically 8–16 corner/mode combos
Run time per corner: 2–8 CPU-hours (distributed compute)
Total MCMM run: 16 corners × 6 CPU-hours = 96 CPU-hours
Distributed across 32-core cluster:
Wall time: ~96 / 32 = 3 hours per iteration (ideal, assuming perfect parallel scaling)
Typical iterations to clean sign-off: 5–10
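A sketch of the bookkeeping MCMM implies: enumerate corner/mode scenarios, keep each path's worst slack across them, and estimate runtime. The corner names come from the list above; the two modes and the per-scenario slack values are hypothetical, and wall_hours assumes ideal parallel scaling.

```python
from itertools import product

# Corner names from the sign-off set above; modes and slacks are hypothetical.
SETUP_CORNERS = ["SLOW_0.72V_125C", "SLOW_0.72V_-40C"]
HOLD_CORNERS = ["FAST_0.88V_-40C", "FAST_0.88V_25C"]
MODES = ["func", "scan"]

def worst_slack(slack_by_scenario):
    """Return the (scenario, slack) pair with the smallest slack for one path."""
    return min(slack_by_scenario.items(), key=lambda kv: kv[1])

def wall_hours(n_scenarios, cpu_hours_each, cores):
    """Ideal wall time: total CPU-hours spread evenly over all cores."""
    return n_scenarios * cpu_hours_each / cores

all_scenarios = list(product(SETUP_CORNERS + HOLD_CORNERS, MODES))
print(len(all_scenarios))  # 8 corner/mode combinations

# One path's setup slack (ps) in each setup scenario (invented numbers).
setup_scenarios = list(product(SETUP_CORNERS, MODES))
path_setup = dict(zip(setup_scenarios, [12.0, -3.0, 25.0, 8.0]))
print(worst_slack(path_setup))  # the path is only as good as its worst scenario
print(wall_hours(16, 6, 32))  # 3.0
```

Sign-off tools report exactly this per-path minimum (worst negative slack) across all active scenarios; a path that passes in three scenarios but fails in one still fails.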
6. Clock Uncertainty
Clock signals are not perfect. Three distinct sources of uncertainty eat into your timing budget at every cycle:
Clock Uncertainty Sources
Ideal clock: edges exactly one period (1000ps) apart
1. Jitter (cycle-to-cycle period variation):
Ideal: edges equally spaced
Real: each edge shifts by up to ±30ps
2. Skew (spatial variation across die):
FF1 clock arrival: 150ps
FF2 clock arrival: 185ps
Skew = 35ps
3. OCV on clock tree itself:
Derating adds ±20ps uncertainty to buffered clock paths
Total uncertainty added to timing check:
Jitter: 30ps
Skew: 35ps
OCV: 20ps
Guard: 15ps (margin)
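Total: 100ps subtracted from clock period for setup (a pre-CTS budget; after clock tree synthesis, sign-off STA uses the actual propagated skew rather than an estimate)
Effective timing budget = 1000 - 100 = 900ps, shared by clock-to-Q, logic, wires, and the capturing flop's setup time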
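The budget arithmetic as code. This sketch sums the components linearly, as the text does; the optional 50ps setup time is borrowed from the setup-time example earlier. In an SDC constraint file, this kind of budget is typically attached to a clock with set_clock_uncertainty.

```python
# Uncertainty components from the example above, in ps.
UNCERTAINTY_PS = {"jitter": 30, "skew": 35, "clock_ocv": 20, "guard": 15}

def total_uncertainty(components):
    """Linear (worst-case) sum, as in the text."""
    return sum(components.values())

def logic_budget(period_ps, components, t_setup_ps=0):
    """Time left for clock-to-Q, logic, and wires after uncertainty and setup."""
    return period_ps - total_uncertainty(components) - t_setup_ps

print(total_uncertainty(UNCERTAINTY_PS))                  # 100
print(logic_budget(1000, UNCERTAINTY_PS))                 # 900
print(logic_budget(1000, UNCERTAINTY_PS, t_setup_ps=50))  # 850
```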
7. Critical Path Fixing Techniques
When a path fails timing, the fix must address the specific bottleneck. STA reports show which gate or wire contributes the most delay — target those first.
7.1 Cell Upsizing
Cell Upsizing Effect on Critical Path
Original path (timing FAIL, slack = -10ps, total path delay 750ps):
FF1_Q ─[BUF_X1]─ 30ps ─[AND_X1]─ 40ps ─[OR_X1]─ 35ps ─ FF2_D
(drive X1) (drive X1) (drive X1)
These three cells contribute 105ps; the remaining 645ps comes from other logic and wires on the path.
Bottleneck: AND_X1 has 40ps delay due to high fan-out (10 loads)
Fix: Upsize AND_X1 → AND_X4 (4× drive strength):
AND_X4 drives same 10 loads at: 15ps (instead of 40ps)
Three-cell total: 105 - (40 - 15) = 80ps (saved 25ps)
The stronger driver also gives a sharper output slew, which speeds up the wire and the downstream OR_X1 (~5ps more)
Effective savings: 30ps total
New path: FF1_Q ─[BUF_X1]─[AND_X4]─[OR_X1]─ FF2_D = 750 - 30 = 720ps total
New slack: -10 + 30 = +20ps ✓ PASS
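A first-order way to see why upsizing helps: model a cell's delay as a fixed intrinsic part plus a load-dependent part that shrinks in proportion to drive strength. The 7ps / 33ps split below is a hypothetical fit chosen so the X1 case gives the 40ps quoted above; it is not library data.

```python
def cell_delay(intrinsic_ps, load_term_x1_ps, drive):
    """First-order model: fixed intrinsic delay + load-dependent delay / drive."""
    return intrinsic_ps + load_term_x1_ps / drive

# Hypothetical split of AND_X1's 40ps: 7ps intrinsic + 33ps from driving 10 loads.
and_x1 = cell_delay(7, 33, drive=1)
and_x4 = cell_delay(7, 33, drive=4)
print(and_x1, and_x4, and_x1 - and_x4)  # 40.0 15.25 24.75
```

The model lands close to the 15ps and 25ps figures above. What it leaves out: AND_X4's input capacitance is roughly 4× larger, which slows BUF_X1, so real gains are somewhat smaller and the previous stage must be re-timed too.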
7.2 Buffer Insertion (Logical Effort)
Long wires with many fanout loads benefit from intermediate buffering to reduce delay.
Buffer Insertion Formula (Logical Effort):
Optimal number of stages to drive a large load: N = ln(h) = log(h) / log(e)
h = electrical effort (output capacitance / input capacitance)
e = Euler's number (≈ 2.718), the optimal stage effort when parasitic delay is ignored
For h = 64 (load capacitance 64× the input capacitance of the driving gate):
N = ln(64) ≈ 4.16 → use 4 stages of buffering
Each stage has effort: 64^(1/4) ≈ 2.83×
vs. direct drive effort: 64× (much worse)
Delay with 4 stages: proportional to 4 × 2.83 = 11.3 effort units
Delay direct: 64 effort units
Improvement: 64 / 11.3 ≈ 5.7× faster with optimal buffering (ignoring parasitic delay; with parasitics included, the optimal stage effort is closer to 4, giving N = log4(64) = 3 stages)
Practical: EDA tools handle this automatically via "buffer tree synthesis"
Manual intervention needed when auto-fix breaks constraints
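The stage-count arithmetic in Python, following the logical-effort model above (parasitic delay ignored unless you pass rho=4). optimal_stages and effort_delay are illustrative helper names, not a tool API.

```python
import math

def optimal_stages(h, rho=math.e):
    """Stage count minimizing effort delay for electrical effort h.

    rho is the target stage effort: e when parasitic delay is ignored,
    about 4 once inverter parasitics are included.
    """
    return max(1, round(math.log(h) / math.log(rho)))

def effort_delay(h, n):
    """n equal stages, each with effort h**(1/n); parasitic delay ignored."""
    return n * h ** (1 / n)

h = 64
n = optimal_stages(h)  # ln(64) ≈ 4.16 -> 4 stages
print(n, round(h ** (1 / n), 2), round(effort_delay(h, n), 1),
      round(effort_delay(h, 1) / effort_delay(h, n), 1))  # 4 2.83 11.3 5.7
```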
7.3 Gate Restructuring
Rebalancing logic depth, by converting a long chain of gates into a balanced tree, shortens deep combinational paths: a chain of N-1 two-input gates becomes ⌈log2 N⌉ levels (4 → 3 levels for 5 inputs, 15 → 4 for 16 inputs).
Logic Restructuring — Reduce Depth
Before (linear chain, 4 levels deep):
A,B ─[AND]─ AB ─[AND]─ ABC ─[AND]─ ABCD ─[AND]─ ABCDE
(inputs C, D, E join at the 2nd, 3rd, and 4th AND; 5ps per gate)
Total: 4 × 5 = 20ps delay (4 gate levels)
After (balanced binary tree, 3 levels deep):
Level 1: A·B → AB and C·D → CD (in parallel, 5ps)
Level 2: AB·CD → ABCD (5ps)
Level 3: ABCD·E → ABCDE (5ps)
Total: 5 + 5 + 5 = 15ps delay (3 gate levels) ← 25% faster!
Tools: Logic synthesis restructuring (Synopsys Design Compiler)
Post-route ECO gate replacement (Cadence Innovus)
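The depth arithmetic, plus a sketch that builds the balanced tree by pairing signals level by level. The 5ps gate delay is the figure used above; balanced_and is an illustrative helper, not how a synthesis tool is invoked.

```python
import math

def chain_depth(n_inputs):
    """Levels in a linear chain of 2-input gates."""
    return n_inputs - 1

def tree_depth(n_inputs):
    """Levels in a balanced tree of 2-input gates."""
    return math.ceil(math.log2(n_inputs))

def balanced_and(signals):
    """Pair signals level by level; returns (expression, depth)."""
    level = [(s, 0) for s in signals]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            (a, da), (b, db) = level[i], level[i + 1]
            nxt.append((f"({a}&{b})", max(da, db) + 1))
        if len(level) % 2:
            nxt.append(level[-1])  # odd one out waits for the next level
        level = nxt
    return level[0]

print(balanced_and("ABCDE"))  # ('(((A&B)&(C&D))&E)', 3)
for n in (5, 16):
    print(n, chain_depth(n) * 5, "ps ->", tree_depth(n) * 5, "ps")  # 5ps per gate
```

Timing-driven restructuring also accounts for arrival times: a late-arriving signal should enter the tree close to the output, as E does here, rather than be balanced blindly.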
8. Timing Sign-Off — TSMC N5 Example
Apple M2 Chip (TSMC N5): the clock frequencies are Apple's published figures; the other sign-off numbers are an illustrative budget, since Apple does not publish its internal timing data
- Clock frequency: 3.49 GHz (performance cores), 2.42 GHz (efficiency cores)
- Critical path target: ~286ps at 3.49 GHz (one clock period, 1 / 3.49 GHz)
- Clock uncertainty used: 120ps (jitter + skew + OCV)
- Effective logic budget: 286 - 120 = 166ps for combinational logic
- MCMM corners: 12 corner/mode combinations signed off
- Timing iterations to clean: ~15 (over ~8 weeks)
- Worst slack at sign-off: +2ps (razor-thin! A-grade closure)
[Figure: Timing Slack Distribution at Sign-Off (Apple M2). Histogram of number of paths (millions) vs. slack (ps), axis from -50ps to +500ps. All paths must move to the right of 0.]
Percentage of violating paths over the course of timing closure:
Week 0: 8.3% (millions of violations)
Week 2: 2.1%
Week 4: 0.3%
Week 6: 0.02%
Week 8: 0% ← CLEAN SIGN-OFF
Tools used:
Synopsys PrimeTime (STA engine)
Cadence Tempus (STA + ECO integration)
Custom timing scripts (internal, proprietary)
9. STA Tools and ECO Flows
| Tool | Vendor | Strength | Key Feature |
|---|---|---|---|
| PrimeTime | Synopsys | Industry reference for STA | POCV, AOCV, graph-based analysis |
| Tempus | Cadence | ECO-integrated STA | Auto ECO with placement awareness |
| GoldTime | Siemens | Fast signoff verification | Correlation with Calibre extraction |
| StarRC | Synopsys | Parasitic extraction | Best accuracy for parasitic delays |
| Quantus | Cadence | RC extraction | Fast extraction with Innovus integration |
10. Timing Closure Checklist
Production Timing Sign-Off Checklist
- ✅ MCMM setup defined: All process/voltage/temperature/mode corners configured
- ✅ Constraints verified (SDC): Clock definitions, false paths, multicycle paths correct
- ✅ Clock uncertainty budgeted: Jitter, skew, OCV all accounted for
- ✅ Post-route parasitic extraction (PEX): RC parasitics from the final routed layout included
- ✅ Zero setup violations: All paths pass setup at all MCMM corners
- ✅ Zero hold violations: All paths pass hold at all MCMM corners
- ✅ OCV/AOCV/POCV mode: Appropriate derating applied for technology node
- ✅ Critical path depth reviewed: No path deeper than target gate count
- ✅ Clock path checked separately: CTS meets skew and latency targets
- ✅ Cross-clock path analyzed: CDC paths either constrained or properly synchronized
- ✅ I/O timing met: All input/output ports meet interface constraints
- ✅ Final sign-off report generated: Foundry-accepted format, all corners green
Next — Day 8: Power optimization and thermal management — voltage scaling, clock gating, power domains, and thermal budgeting for production VLSI chips.