Why the Clock Needs a Tree
In RTL simulation, a clock signal arrives simultaneously at every flip-flop. In silicon, it does not. A single wire from the PLL output to a million flip-flops would have an enormous fanout — the driving cell would need to charge the combined input capacitance of every FF at once. The required drive current would be impractical, the wire resistance would add huge RC delay, and the signal would arrive far too late with a badly degraded slew rate.
Clock Tree Synthesis solves this by inserting a balanced tree of buffers and inverters between the clock source and the flip-flops. Each buffer drives only a small number of downstream loads, the tree branches progressively, and the total delay from source to every leaf is made as equal as possible.
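Because each buffer drives only a bounded number of loads, the depth of the tree grows only logarithmically with the number of sinks. A back-of-the-envelope sketch (hypothetical helper, not a tool API):

```python
def tree_levels(num_sinks: int, max_fanout: int) -> int:
    """Minimum number of buffer levels needed so that no stage
    drives more than max_fanout downstream loads."""
    levels, reach = 0, 1
    while reach < num_sinks:
        reach *= max_fanout   # each level multiplies the reachable endpoints
        levels += 1
    return levels
```

With a fanout limit of 16, a million flip-flops need only 5 buffer levels; tightening the limit to 8 raises that to 7 levels, trading insertion delay against per-buffer load.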
What CTS controls
Insertion delay (total latency from source to FF clock pin), clock skew (arrival time difference between FFs), slew rate (transition time at each leaf), and fanout per buffer stage. All four are specified as CTS constraints before the tool runs.
What CTS cannot fix
Floorplan problems that force clock wires to travel long distances, extremely high clock fanout from ungated always-on clocks, and fundamental setup violations caused by too-long combinational paths — these must be resolved before CTS.
Clock Tree Topologies
Different designs use different physical structures for the clock tree, each with trade-offs in skew, power, and routability.
H-Tree
Routes the clock through wire segments shaped like nested letter Hs. Each branch has the same wire length from the center, giving inherent geometric symmetry that minimizes skew before any buffers are inserted. Used in regular, high-performance blocks such as SRAMs, CPU cores, and FPGA fabric.
Balanced Binary Tree
A standard buffer tree where each node drives exactly two children. CTS tools equalize delay by sizing buffers and adjusting wire lengths. Most common in ASIC design — tools like Innovus and ICC2 build this automatically from the placed netlist.
Mesh (Grid)
A global metal grid is driven from multiple points, creating a low-impedance distributed clock network. Extremely low skew (near zero) because every point on the mesh is connected. Very high power due to large capacitance. Used in ultra-high-performance designs like server CPUs (Intel, AMD).
| Topology | Skew | Power | Routability | Typical use |
|---|---|---|---|---|
| H-Tree | Very Low | Low | Requires regular layout | Memory, FPGA fabric, custom blocks |
| Balanced Binary | Low | Medium | Handles irregular floorplan | Standard ASIC, SoC design |
| Mesh | Near Zero | Very High | Needs dedicated metal layers | High-performance server CPUs |
| Hybrid | Low | Medium | Moderate | Mixed-block SoCs with multiple clocks |
Clock Skew — Definition and Types
Clock skew is the difference in clock arrival time between the launch flip-flop and the capture flip-flop on a timing path. It arises from differences in buffer delays, wire lengths, and RC parasitics through the clock tree branches connecting source to each leaf.
Local Skew
The skew between two flip-flops that are directly connected by a timing path (launch FF → capture FF). This is what STA tools analyze per-path. Local skew directly appears in setup and hold equations. Modern CTS targets local skew < 50–100 ps at 7nm.
Global Skew
The maximum skew across the entire clock domain — the difference between the latest and earliest clock arrival among all flip-flops. A useful health metric for the clock tree, but not directly used in per-path STA. Global skew is always at least as large as any local skew.
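The two definitions differ only in which arrival times are compared. A minimal sketch, with hypothetical post-CTS arrival times for illustration:

```python
# Hypothetical clock arrival times at each FF clock pin after CTS (ns)
arrival = {"FF_A": 0.38, "FF_B": 0.46, "FF_C": 0.41, "FF_D": 0.52}

def global_skew(arrivals: dict) -> float:
    # Latest minus earliest arrival across the whole clock domain
    return max(arrivals.values()) - min(arrivals.values())

def local_skew(arrivals: dict, launch: str, capture: str) -> float:
    # Per-path skew between one launch/capture FF pair (what STA uses)
    return arrivals[capture] - arrivals[launch]
```

Here the global skew is 0.14 ns, while the FF_A → FF_B path sees a local skew of only +0.08 ns — the tree can look worse globally than any single timing path experiences.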
| Skew direction | Effect on setup slack | Effect on hold slack | Intuition |
|---|---|---|---|
| Positive (+) — capture clock later | Improves | Tightens | Data has more time to arrive before the capture clock edge |
| Negative (−) — capture clock earlier | Tightens | Improves | Capture edge moves left — data must arrive sooner |
| Zero | Neutral | Neutral | Ideal CTS target — balanced tree |
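The table's two directions follow directly from the setup and hold inequalities. A minimal sketch (variable names are illustrative, times in ns):

```python
def setup_slack(t_clk, skew, t_cq, t_comb_max, t_setup):
    # Capture edge arrives at t_clk + skew; data needs t_cq + t_comb_max,
    # plus the setup window before that edge
    return (t_clk + skew) - (t_cq + t_comb_max + t_setup)

def hold_slack(skew, t_cq, t_comb_min, t_hold):
    # The same skew that helped setup is subtracted from hold margin
    return (t_cq + t_comb_min) - (t_hold + skew)
```

With a 1.0 ns period, +0.08 ns skew, 0.18 ns clock-to-Q, 0.62 ns of logic, and a 0.05 ns setup time, the path has +0.23 ns setup slack — 0.08 ns more than it would at zero skew, and exactly 0.08 ns taken from its hold margin.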
Clock Insertion Delay (Clock Latency)
Clock insertion delay is the total propagation time from the clock source to a flip-flop's clock pin, measured through the PLL, clock network, and buffer tree. It has two components that STA treats separately.
Source Latency
The delay from the PLL or oscillator output to the clock definition point in the design (usually the top-level clock port). This includes off-chip PCB traces, package inductance, and on-chip wiring from the pad to the first clock buffer. Specified in SDC using set_clock_latency -source.
Network Latency
The delay from the clock definition port through all clock buffers and inverters in the clock tree to the flip-flop's clock pin. This is what CTS physically builds. After CTS, the tool annotates actual network latency from extraction. Before CTS, an ideal clock model is used with zero or estimated latency.
Before CTS, STA uses the set_clock_latency estimates. After CTS, the actual annotated latency is used. If estimated and actual latencies differ significantly, timing that passed pre-CTS may fail post-CTS. Always re-run STA with propagated clocks after CTS.

```tcl
# SDC: specify clock latency before CTS
set_clock_latency -source 0.5 [get_clocks CLK]   ;# source: PLL to port
set_clock_latency 1.2 [get_clocks CLK]           ;# network: estimated tree delay

# After CTS: use propagated clocks (actual network delay from extraction)
set_propagated_clock [all_clocks]
```
CTS Goals, Constraints, and Flow
Before running CTS, the designer specifies a set of targets that the tool must meet. These are set as CTS constraints in the tool's configuration or in the SDC.
| Constraint | Typical target | What happens if violated |
|---|---|---|
| Max clock skew | 50–150 ps (7nm–28nm) | Setup/hold timing margin is reduced; failing paths may appear |
| Max insertion delay | 500 ps – 2 ns | Data path timing uses higher latency in STA, may cause setup failures |
| Max slew (transition time) | 100–300 ps | Slow slew increases noise susceptibility and clock-to-Q delay |
| Max fanout per buffer | 8–20 FFs | Too high fanout degrades slew; too low wastes buffer area and power |
| Max capacitance per node | Per-cell library limit | Excessive load slows the buffer, degrading slew and delay |
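A CTS tool enforces limits like these per clock-tree node; the check itself is simple. A toy version, using hypothetical limit values taken from the table above:

```python
# Illustrative limits (a real flow reads these from the CTS spec / library)
CTS_LIMITS = {"max_fanout": 16, "max_slew_ps": 150}

def check_clock_node(name, fanout, slew_ps, limits=CTS_LIMITS):
    """Return human-readable constraint violations for one clock buffer."""
    violations = []
    if fanout > limits["max_fanout"]:
        violations.append(f"{name}: fanout {fanout} exceeds {limits['max_fanout']}")
    if slew_ps > limits["max_slew_ps"]:
        violations.append(f"{name}: slew {slew_ps} ps exceeds {limits['max_slew_ps']} ps")
    return violations
```

A buffer driving 8 loads at 90 ps slew passes clean, while one driving 24 loads at 180 ps reports both a fanout and a slew violation — the kind of node CTO splits into two stages.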
The CTS flow runs in these steps:
```text
1. Clock tree specification
   └── Define: clock roots, exceptions (don't-touch cells, clock gating),
       skew targets, slew targets, buffer/inverter cell list
2. Virtual tree construction
   └── Tool builds a virtual balanced tree ignoring physical layout
3. Physical tree construction
   └── Cells are placed, wires are routed (in-CTS routing)
   └── Buffer sizes are chosen to match delays across branches
4. Clock tree optimization (CTO)
   └── Iterative fixing of skew hotspots, slew violations,
       max-fanout violations
5. Post-CTS timing analysis
   └── STA with propagated clocks (actual annotated delays)
   └── Fix remaining setup/hold violations from skew imbalance
6. Incremental CTS for ECO
   └── Fix specific skew issues without re-running full CTS
```
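Step 2, the virtual tree, amounts to recursively clustering sinks until every node respects the fanout limit. A placement-unaware sketch (hypothetical helpers, not a tool algorithm):

```python
def build_virtual_tree(sinks, max_fanout):
    """Cluster sinks into a balanced buffer tree, ignoring physical
    placement (as the virtual-tree step does). Leaves are sink names;
    internal nodes are ("buf", children) tuples."""
    if len(sinks) <= max_fanout:
        return ("buf", list(sinks))
    group_size = -(-len(sinks) // max_fanout)   # ceiling division
    groups = [sinks[i:i + group_size] for i in range(0, len(sinks), group_size)]
    return ("buf", [build_virtual_tree(g, max_fanout) for g in groups])

def count_leaves(node):
    _, children = node
    return sum(count_leaves(c) if isinstance(c, tuple) else 1 for c in children)
```

Every sink ends up under exactly one leaf cluster and no node exceeds the fanout limit; the physical step then re-clusters by location and sizes each buffer to equalize branch delays.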
Clock Gating Cells in the Clock Tree
Clock gating is the most effective technique for reducing dynamic power in VLSI chips — by stopping the clock to idle registers, switching activity drops to near zero for those flip-flops. However, inserting clock gates into the clock tree has direct implications for CTS and timing.
Integrated Clock Gating (ICG) Cell
An ICG cell is a latch-based AND gate designed specifically for clock gating. The latch samples the enable signal on the clock's low phase, and the AND gate combines the latched enable with the clock. The latch eliminates glitches that a plain AND gate would produce when the enable changes while the clock is high.
ICG placement in the tree
ICG cells are treated as part of the clock tree. CTS must balance clock arrival time through ICG cells just as it balances through regular buffers. An ICG cell adds its own insertion delay (typically 50–150 ps), which must be accounted for in setup and hold analysis for all flip-flops downstream of the gate.
```systemverilog
// RTL clock gating — synthesized into an ICG cell
always_ff @(posedge clk) begin
    if (en) data_reg <= data_in;   // synthesis tool infers ICG from the enable
end

// Or explicit ICG instantiation in RTL (cell name is library-specific)
ICGX1 u_icg (.CLK(clk), .EN(enable), .SE(scan_en), .GCLK(gated_clk));
// Downstream FFs use gated_clk — they stop switching when enable = 0
```
ICG cells must be declared to the tools — via set_clock_gating_check in SDC or as special cells in the CTS spec. If not, the tool may balance through them incorrectly, causing unexpected skew on downstream flip-flop clusters.

During scan test mode, the SE (Scan Enable) pin of the ICG forces the gate open so the scan clock can propagate to all FFs — critical for structural DFT test coverage.
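The latch's glitch-killing role can be seen in a tiny behavioral model: sampling the enable only while the clock is low means the gated clock can never produce a partial pulse, while a plain AND can. (Illustrative Python model; waveforms are sampled two points per clock phase.)

```python
def icg_gate(clk_wave, en_wave):
    """Latch-based ICG: enable is captured during the clock-low phase, so
    the gated clock only changes at a clean rising edge of clk."""
    latched, out = 0, []
    for clk, en in zip(clk_wave, en_wave):
        if clk == 0:
            latched = en            # latch is transparent while clk is low
        out.append(clk & latched)   # AND of clk with the held enable
    return out

def plain_and_gate(clk_wave, en_wave):
    # Naive gating: enable changes pass straight through to the clock
    return [c & e for c, e in zip(clk_wave, en_wave)]

clk = [0, 0, 1, 1, 0, 0, 1, 1]      # two clock cycles, two samples per phase
en  = [0, 0, 0, 1, 1, 1, 1, 1]      # enable rises in the middle of a high phase
```

The plain AND outputs [0, 0, 0, 1, 0, 0, 1, 1] — a runt pulse starting mid-phase, which downstream FFs may see as a spurious clock edge. The ICG outputs [0, 0, 0, 0, 0, 0, 1, 1]: the enable takes effect one cycle later, as a full clean pulse.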
Useful Skew — Intentional Timing Slack
Useful skew is the deliberate introduction of unequal clock arrival times to improve timing margins on specific critical paths. Instead of targeting zero skew everywhere, the CTS tool or timing engineer intentionally delays the capture clock on a setup-critical path — effectively giving data more time to travel through the combinational logic.
When to use useful skew
When a timing path has negative setup slack that cannot be fixed by gate sizing or logic restructuring — typically late in the design cycle when netlist changes are risky. Also used at the block level to trade slack from paths with positive margin to paths that are failing.
Useful skew risks
Every ps of positive skew added for setup removes 1 ps from hold margin for the same path pair. Aggressive useful skew can cause hold violations in the FF/FF path, requiring hold buffer insertion — which increases area, power, and routing congestion.
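That one-for-one trade can be made explicit with a trivial helper (illustrative, times in ns):

```python
def apply_useful_skew(setup_slack_ns, hold_slack_ns, added_skew_ns):
    """Delaying the capture clock by added_skew_ns buys exactly that much
    setup slack and spends the same amount of hold slack on the path pair."""
    return setup_slack_ns + added_skew_ns, hold_slack_ns - added_skew_ns
```

A path at −0.05 ns setup / +0.12 ns hold becomes +0.03 / +0.04 after 0.08 ns of useful skew — both positive. Start from only +0.06 ns of hold margin, and the same move leaves −0.02 ns of hold slack: a new violation that must be repaired with hold buffers.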
Useful skew is applied through commands such as set_clock_skew or through CTS-aware timing optimization (CCOPT in Innovus). The tool automatically balances setup gain against hold risk, inserting hold buffers as needed.

How Skew Appears in STA Reports
After CTS, every timing path report from tools like PrimeTime includes the clock arrival times for both the launch and capture flip-flops. The skew is visible as the difference between these two numbers.
```text
Timing Path Report — Setup Check
─────────────────────────────────────────────────────────
Path: FF_A/Q → [comb logic] → FF_B/D

Launch clock path:
  Clock source (PLL out)        0.00 ns
  Buffer BUF1                   0.15 ns
  Buffer BUF2 (launch, FF_A)    0.38 ns  ← launch latency

Data path:
  FF_A clock-to-Q               0.18 ns
  Combinational delay           0.62 ns
  ──────────────────────────────────────
  Data arrival                  1.18 ns  (0.38 + 0.18 + 0.62)

Capture clock path:
  Buffer BUF3 (capture, FF_B)   0.46 ns  ← capture latency
  Clock period                  1.00 ns
  Capture edge                  1.46 ns  (1.00 + 0.46)
  FF_B setup time              −0.05 ns
  Data required arrival         1.41 ns  (1.46 − 0.05)

Skew = 0.46 − 0.38 = +0.08 ns  (positive — helps setup)
Setup slack = 1.41 − 1.18 = +0.23 ns  ✓ PASS
```
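The slack arithmetic can be replayed directly from the clock and data path numbers — note that data arrival must include the launch clock latency, a common source of confusion when reading reports:

```python
# Values from the example path (ns)
launch_latency, capture_latency = 0.38, 0.46
t_cq, t_comb, t_setup, t_clk = 0.18, 0.62, 0.05, 1.00

data_arrival  = launch_latency + t_cq + t_comb    # when data reaches FF_B/D
capture_edge  = t_clk + capture_latency           # next capture clock edge
data_required = capture_edge - t_setup            # latest allowed arrival
skew          = capture_latency - launch_latency  # positive skew helps setup
setup_slack   = data_required - data_arrival      # positive slack → path passes
```

Dropping the launch latency from data_arrival would overstate the slack by 0.38 ns — the report only balances because both the launch and capture branches of the clock tree are counted.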
Common CTS Problems and Fixes
| Problem | Root cause | Fix |
|---|---|---|
| High local skew on critical path | Unequal buffer stages or wire length to launch vs capture FF | Add/resize buffers on the shorter branch to equalize delay; use useful skew carefully |
| Clock slew violation | Too much capacitance on a clock node (high fanout or long wire) | Insert additional buffer stage to reduce per-buffer load; upsize buffer drive strength |
| Hold violations post-CTS | Positive skew from CTS or useful skew tightened hold margin below zero | Insert delay buffers (filler buffers) on the data path of the violating paths |
| Clock tree power too high | Overly deep buffer tree or insufficient clock gating | Add clock gating (ICG) on idle register banks; target minimum insertion delay in CTS spec |
| Skew mismatch between corners | Different RC extraction or cell delay at SS vs FF corner | Run CTS with worst-case extraction; use AOCV/POCV derating on clock tree cells |
| Post-route skew degradation | Clock wires re-routed during detailed routing, changing RC | Use clock net shielding and NDR (Non-Default Rules) for clock wires; re-run CTO post-route |
Explore Further
Clock Gating (ICG)
Study the Integrated Clock Gating cell that CTS must handle specially — how latch-based AND gates prevent glitches and how the SE pin opens the gate during scan test.
Glitch-Free Clock Mux
Learn how to safely switch between two clock sources without glitches — the exact topology CTS uses for muxed clock trees and how SDC constrains each selection.
Clock Domain Crossing
Clock skew between independent clock domains is unbounded — see how 2-FF synchronizers, handshake protocols, and async FIFOs safely cross asynchronous clock boundaries.