1. What is Clock Tree Synthesis?
In any synchronous digital design, every flip-flop must be clocked. A single clock source drives thousands (or millions) of flip-flop clock pins. If the clock edge arrives at different times at different flip-flops, the difference (skew) eats into setup or hold margin and can cause timing violations. Clock Tree Synthesis (CTS) builds a physical network of buffers and inverters that delivers the clock to all sinks at nearly the same time and with controlled slew.
Output: Updated netlist with clock buffers/inverters inserted, DEF with their placement
Done before: Signal routing (clock nets need special treatment — NDR rules, shielding)
2. Key CTS Metrics
Clock Skew
| Skew Type | Definition | Effect on Setup | Effect on Hold |
|---|---|---|---|
| Zero skew | Both FFs receive clock simultaneously | Neutral | Neutral |
| Positive skew | Capturing FF gets clock later | Relaxed (more time for data) | Tightened (less hold margin) |
| Negative skew | Capturing FF gets clock earlier | Tightened (less time for data) | Relaxed (more hold margin) |
Setup Timing with Skew
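The setup check can be sketched as Setup slack = T_clk + Skew − T_data_path − T_setup, assuming skew is defined as capture clock arrival minus launch clock arrival (positive when the capturing FF gets its clock later, as in the table above). The function name, parameters, and numbers below are illustrative assumptions, not output from any tool.

```python
def setup_slack_ps(period_ps, data_path_ps, setup_ps, skew_ps=0):
    """Setup slack = T_clk + Skew - T_data_path - T_setup (all in ps).

    skew_ps = capture clock arrival - launch clock arrival.
    data_path_ps covers clock-to-Q plus the longest combinational delay.
    """
    return period_ps + skew_ps - data_path_ps - setup_ps


# Hypothetical 1 GHz path: 1000 ps period, 950 ps data path, 80 ps setup time.
zero_skew = setup_slack_ps(1000, 950, 80)               # -30 ps: violation
positive = setup_slack_ps(1000, 950, 80, skew_ps=50)    # +20 ps: relaxed
negative = setup_slack_ps(1000, 950, 80, skew_ps=-50)   # -80 ps: tightened
print(zero_skew, positive, negative)
```

Positive skew adds directly to the time available for data, which is why it relaxes setup.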
Hold Timing with Skew
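The hold check can be sketched as Hold slack = T_data_path − T_hold − Skew, using the shortest data path and the same sign convention for skew (capture clock arrival minus launch clock arrival). The helper name and the numbers are hypothetical and chosen only to show the direction of the effect.

```python
def hold_slack_ps(min_data_path_ps, hold_ps, skew_ps=0):
    """Hold slack = T_data_path(min) - T_hold - Skew (all in ps).

    The clock period does not appear: hold is checked on the same clock edge.
    """
    return min_data_path_ps - hold_ps - skew_ps


# Hypothetical short path: 120 ps minimum data delay, 40 ps hold time.
zero_skew = hold_slack_ps(120, 40)                 # +80 ps
positive = hold_slack_ps(120, 40, skew_ps=100)     # -20 ps: violation
negative = hold_slack_ps(120, 40, skew_ps=-100)    # +180 ps: relaxed
print(zero_skew, positive, negative)
```

Because the period cancels out, a hold violation cannot be fixed by slowing the clock down.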
Insertion Delay (Latency)
Insertion delay is the time from the clock source pin to a flip-flop's clock pin. For a balanced tree, all FFs should have approximately equal insertion delay.
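As a small illustration, global skew can be derived from per-sink insertion delays as the spread between the slowest and fastest sink. The sink names and latency values below are made up for the sketch; a real flow would read them from a clock timing report.

```python
def clock_tree_summary(latency_ps):
    """Return (min latency, max latency, global skew) for a dict of sink latencies."""
    values = list(latency_ps.values())
    lo, hi = min(values), max(values)
    return lo, hi, hi - lo


# Hypothetical insertion delays per flip-flop clock pin, in ps.
sinks = {"ff_a/CK": 412, "ff_b/CK": 430, "ff_c/CK": 398, "ff_d/CK": 421}
shortest, longest, global_skew = clock_tree_summary(sinks)
print(f"latency {shortest}-{longest} ps, global skew {global_skew} ps")
```

Global skew is a pessimistic summary: only the skew between FF pairs that actually share a timing path (local skew) affects setup and hold.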
| Metric | Typical Range | Impact |
|---|---|---|
| Clock skew | 0 – 200ps (target <100ps) | Directly reduces setup/hold margin |
| Insertion delay | 200ps – 1ns | Not directly a timing violation, but affects power and jitter |
| Clock slew | 50 – 200ps | Slow slew → excess power, jitter, noise coupling |
| Clock jitter | 10 – 100ps | Reduces effective setup margin → must be budgeted in STA |
3. Clock Tree Topologies
| Topology | Skew | Power | Area | Best For |
|---|---|---|---|---|
| H-Tree | Very low (~0) | Low–Medium | Medium | Regular, symmetric floorplans (GPU shader cores, ARM Cortex) |
| Fishbone/Spine | Low–Medium | Low | Low | Linear cell arrays, less dense designs |
| Clock Mesh | Near zero | High | High | High-performance CPUs needing <50ps skew (Intel/AMD) |
| Hybrid | Low | Medium | Medium | Irregular floorplans — tree to sink clusters, mesh within cluster |
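To see why an H-tree gives near-zero skew by construction, the sketch below builds an idealized H-tree recursively and checks that every leaf sits at the same wire length from the root. It assumes a square region, Manhattan routing, and identical branches; the function name and dimensions are hypothetical, and real trees add buffers and deviate from this geometry.

```python
def h_tree_leaves(x, y, half_span, levels, path_len=0.0):
    """Return [(leaf_x, leaf_y, wire_length_from_root)] for an idealized H-tree.

    Each level draws one 'H': a horizontal run from the center to both sides,
    then a vertical run up and down from each end. The span halves per level.
    """
    if levels == 0:
        return [(x, y, path_len)]
    leaves = []
    for dx in (-half_span, half_span):
        for dy in (-half_span, half_span):
            # Manhattan wire length from the center to this corner of the H.
            length = path_len + abs(dx) + abs(dy)
            leaves += h_tree_leaves(x + dx, y + dy, half_span / 2, levels - 1, length)
    return leaves


# Three levels over a hypothetical region, dimensions in um.
leaves = h_tree_leaves(0.0, 0.0, half_span=400.0, levels=3)
lengths = {round(length, 6) for _, _, length in leaves}
print(len(leaves), lengths)
```

Each level multiplies the number of tap points by four, and the set of path lengths collapses to a single value, which is what makes the topology attractive for regular, symmetric floorplans.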
4. CTS Implementation Flow
CTS follows a structured flow in commercial tools. Each step refines clock distribution quality.
| Step | Action | Goal |
|---|---|---|
| 1. Pre-CTS setup | Define clock sources, set CTS spec (target skew, max insertion delay, buffer lib) | Constrain the builder |
| 2. Clock tree build | Tool inserts clock buffers/inverters, builds balanced tree from source to sinks | Achieve balanced latency |
| 3. Clock tree optimization | Resize buffers, adjust tap points, insert local inverters to fix skew imbalances | Meet skew target |
| 4. Post-CTS timing | STA with actual clock latencies on both launch and capture paths | Verify setup + hold with skew |
| 5. Hold fixing | Insert delay buffers (hold buffers) on paths with negative hold slack | Fix hold violations introduced by skew |
| 6. Clock routing | Route clock nets with NDR rules (wider wire, more spacing, shielding) | Protect from noise, SI |
5. NDR Rules, Shielding & Clock Net Special Treatment
Clock nets carry the most sensitive signals in the chip. They need special routing rules to prevent noise coupling and ensure clean signal integrity.
Non-Default Routing Rules (NDR)
| Rule | Typical Value | Reason |
|---|---|---|
| Wire width | 2× minimum width | Lower resistance → less RC delay and slew degradation |
| Wire spacing | 2× minimum spacing | Reduce capacitive coupling from adjacent signal nets |
| Via doubling | 2 vias per layer change | Redundancy — single-via open causes clock failure = chip dead |
| Layer preference | Top metal layers (M5+) | Lower resistance per unit length, less congestion |
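The effect of the 2× width rule on resistance can be checked with the usual sheet-resistance approximation, R = R_sheet × length / width. The sheet resistance, segment length, and widths below are placeholder values, not data for any particular process.

```python
def wire_resistance_ohm(sheet_res_ohm_per_sq, length_um, width_um):
    """R = R_sheet * (length / width): resistance scales with the number of squares."""
    return sheet_res_ohm_per_sq * length_um / width_um


# Hypothetical values: 0.2 ohm/sq metal layer, 500 um clock trunk segment.
r_default = wire_resistance_ohm(0.2, 500, 0.1)   # minimum-width wire
r_ndr = wire_resistance_ohm(0.2, 500, 0.2)       # 2x-width NDR wire
print(f"default: {r_default:.0f} ohm, NDR: {r_ndr:.0f} ohm")
```

Doubling the width halves the resistance, but the wider wire also has more capacitance to the layers above and below, so the RC delay improves by less than a factor of two.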
Clock Shielding
Shielding surrounds the clock wire with VDD or VSS wires on both sides. This prevents switching noise from adjacent data signals from coupling onto the clock net (which would cause jitter).
6. Useful Skew (Intentional Skew)
Instead of fighting all skew, useful skew deliberately introduces skew to fix timing violations. The key insight: if a launch→capture path has a setup violation, delaying the capturing FF's clock arrival gives the data more time. The extra time is borrowed from the next stage: paths launched from that FF lose the same amount of setup slack, so the technique only works when the downstream path has slack to spare.
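A minimal numeric sketch of useful skew on a three-flop pipeline (FF1 → FF2 → FF3) follows. The period, setup time, and path delays are invented for illustration; the point is only that delaying FF2's clock moves slack from the second stage to the first.

```python
def setup_slack_ps(period_ps, data_path_ps, setup_ps, launch_clk_ps, capture_clk_ps):
    """Setup slack with explicit clock arrival times at launch and capture FFs."""
    skew = capture_clk_ps - launch_clk_ps
    return period_ps + skew - data_path_ps - setup_ps


PERIOD_PS, SETUP_PS = 1000, 50
# Hypothetical maximum data-path delays: FF1->FF2 is critical, FF2->FF3 is not.
D12_PS, D23_PS = 1000, 800


def stage_slacks(ff2_clk_offset_ps):
    """Setup slack of both stages when only FF2's clock is delayed by the offset."""
    s12 = setup_slack_ps(PERIOD_PS, D12_PS, SETUP_PS, 0, ff2_clk_offset_ps)
    s23 = setup_slack_ps(PERIOD_PS, D23_PS, SETUP_PS, ff2_clk_offset_ps, 0)
    return s12, s23


before = stage_slacks(0)     # (-50, 150): first stage violates
after = stage_slacks(100)    # (50, 50): slack moved from stage 2 to stage 1
print(before, after)
```

The total slack across the two stages is unchanged (100 ps in both cases); useful skew only redistributes it, and the hold check into FF2 gets 100 ps tighter at the same time.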
7. CTS Commands — Innovus & ICC2
| Tool | Command | Purpose |
|---|---|---|
| Innovus | clock_design | Full CTS: build + optimize clock tree |
| Innovus | set_ccopt_property target_skew 100 | Set target skew to 100 (read in the design's time unit: 100ps if the unit is ps) |
| Innovus | set_ccopt_property max_fanout 20 | Limit buffer fanout in clock tree |
| Innovus | set_ccopt_property insertion_delay 500 | Target insertion delay 500ps |
| Innovus | report_clock_timing -type skew | Report clock skew across all paths |
| Innovus | report_clock_timing -type latency | Report insertion delay per clock endpoint |
| ICC2 | synthesize_clock_trees | Build the clock tree |
| ICC2 | set_clock_tree_options -target_skew 0.1 | Set skew target (in ns) |
| ICC2 | clock_opt -from build_clock -to build_clock | CTS build phase only |
| ICC2 | report_clock_qor | Clock tree quality report |
| OpenROAD | clock_tree_synthesis | CTS via TritonCTS |
Pre-CTS SDC Clock Definitions
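```tcl
# Pre-CTS the clock is ideal, so uncertainty stands in for expected skew + jitter (ns assumed)
set_clock_uncertainty -setup 0.05 [get_clocks CLK]
set_clock_uncertainty -hold 0.02 [get_clocks CLK]
# Assumed slew at clock pins until the real tree exists
set_clock_transition 0.1 [get_clocks CLK]
```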
CTS Best Practices
- Fix all DRVs (max transition, max capacitance) before CTS — bad slew propagates through clock tree
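- Gate clocks only with library ICG (integrated clock gating) cells, never discrete AND/OR gates, and keep them as close to the tree root as enable timing allows so that a gated clock shuts off more of the tree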
- Use only CTS-approved buffer/inverter cells (matched drive strength pairs)
- Apply NDR rules before CTS so the router knows which nets need special treatment
- Check post-CTS hold violations immediately — they are cheaper to fix before routing
- Shield all clock nets from trunk to leaf — partial shielding still couples at unshielded segments
- Target insertion delay variation < 20ps across all domains for multi-domain chips
8. Interview Questions & Answers
CTS inserts a tree of clock buffers and inverters, sized and placed so that every path from the clock source to an FF clock pin has approximately equal delay (buffer delay plus wire RC). This minimizes skew, keeps clock slew within limits, and lets post-CTS STA use actual clock latencies instead of ideal-clock estimates.
Insertion delay (latency) = absolute time from the clock source to a single FF's clock pin. Skew = the difference in insertion delay between two FFs. A large insertion delay means the entire clock domain is delayed. This is fine for intra-domain timing, because latency common to the launch and capture clocks cancels in setup/hold checks, but it matters for cross-domain paths and I/O timing, and a longer tree also costs more power and accumulates more jitter. Typical: 300–800ps for 1GHz designs.
Key point: Low skew is critical for timing. Insertion delay matters mainly at chip boundaries (I/O, cross-domain).
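The cancellation of common latency can be checked with a short sketch. It models a single-cycle setup check with explicit launch and capture latencies and ignores on-chip variation, which in practice makes long trees somewhat worse; all names and numbers are illustrative assumptions.

```python
def setup_slack_ps(period_ps, data_path_ps, setup_ps,
                   launch_latency_ps, capture_latency_ps):
    """Setup slack from absolute clock arrival times (ideal, no OCV derating)."""
    data_arrival = launch_latency_ps + data_path_ps
    data_required = period_ps + capture_latency_ps - setup_ps
    return data_required - data_arrival


# Same 20 ps skew, very different insertion delays (hypothetical values).
short_tree = setup_slack_ps(1000, 900, 50, launch_latency_ps=300, capture_latency_ps=320)
long_tree = setup_slack_ps(1000, 900, 50, launch_latency_ps=800, capture_latency_ps=820)
print(short_tree, long_tree)
```

Both cases give the same slack because only the difference between the two latencies enters the check.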
- 2× width: Reduces wire resistance → lower RC delay → better slew preservation through the tree
- 2× spacing: Increases distance to neighboring wires → reduces capacitive coupling (SI noise) that would inject jitter
- Double vias: Two vias at every layer transition → redundancy. A single-via open on a clock net = chip fails; double via dramatically reduces this risk
- Layer preference: Top metal layers for clock trunk (lower resistance per unit length)
Without NDR, signal integrity problems cause clock jitter, which directly reduces effective timing margin.
When to apply: Post-CTS setup violations that routing optimization cannot close. Typically last resort before ECO.
Trade-off: Positive skew (delaying capture) fixes setup but tightens hold on the same path. Must always check hold after applying useful skew. Also, skewing one FF affects all paths through it, potentially creating new violations on other paths sharing that FF.
Hold slack = T_data_path − T_hold − Skew
Positive skew (capture clock later than launch clock) reduces hold slack. A path that had +200ps hold margin pre-CTS might have a −50ps hold violation post-CTS if the skew is 250ps.
Fix: Insert delay buffers (hold buffers, also called delay cells or DLY cells) on the data path to lengthen it. These are sized buffers whose purpose is to add propagation delay. Fix hold violations before routing — post-route hold fixing requires ECO routes and is much harder.
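A rough way to size a hold fix is to divide the shortfall by the delay of one delay cell and round up. The sketch below reuses the 200ps margin and 250ps skew from the example above; the 30ps cell delay and the optional margin are assumed values, and real tools also account for slew, load, and the setup slack consumed on the same path.

```python
import math


def hold_buffers_needed(hold_slack_ps, buffer_delay_ps, margin_ps=0):
    """Number of delay cells to add on the data path so that slack >= margin."""
    shortfall = margin_ps - hold_slack_ps
    if shortfall <= 0:
        return 0
    return math.ceil(shortfall / buffer_delay_ps)


# +200 ps pre-CTS hold margin, 250 ps skew after CTS.
post_cts_slack = 200 - 250                                   # -50 ps
cells = hold_buffers_needed(post_cts_slack, buffer_delay_ps=30)
slack_after_fix = post_cts_slack + cells * 30
print(cells, slack_after_fix)
```

Every picosecond added to the data path for hold is also added to the path's maximum delay, so setup slack on the same path should be rechecked after the fix.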
When to choose mesh: (1) Very high frequency (>2GHz), (2) Irregular FF distribution that makes H-tree unbalanced, (3) Design has many clock domains at same frequency (share mesh), (4) Robustness requirement — process variation has less impact on mesh than tree.
Trade-off: Mesh consumes 2–5× more clock power than H-tree (more buffers, more wire switching), uses significant routing resources on top metal layers. Only justified when skew target is extremely tight.