Every flip-flop in a chip waits for the same clock edge. Clock tree synthesis (CTS) is the art and engineering of making that "same" a reality: distributing a clock signal across a die with picosecond-level skew control.
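Clock skew is the difference in clock arrival time between any two flip-flops in the design; it is counted as positive when the capturing flip-flop receives the clock later than the launching one. If FF-A receives the clock at time 100 ps and FF-B at time 140 ps, the skew between them is 40 ps.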
This 40 ps has opposite effects on setup and hold timing between these two FFs:
| Violation type | Path direction | Effect of skew | Formula adjustment |
|---|---|---|---|
| Setup (max delay) | FF-A launches → FF-B captures | Positive skew helps setup: capture edge arrives later | Slack = T_clk + skew − T_data − T_setup |
| Hold (min delay) | FF-A launches → FF-B captures | Positive skew hurts hold: capture edge arrives later, data must hold longer | Slack = T_data − T_hold − skew |
| Setup | FF-B launches → FF-A captures | Positive skew hurts setup: capture edge arrives earlier | Slack = T_clk − skew − T_data − T_setup |
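The formulas in the table can be checked numerically. The sketch below is a minimal Python model, not tool output: the 100 ps and 140 ps clock arrivals come from the text, while the clock period, data delays, and setup/hold times are assumed values chosen only for illustration. Skew is signed (capture arrival minus launch arrival), so the FF-B → FF-A direction falls out of the same formula with a negative skew.

```python
def setup_slack(t_clk, skew, t_data_max, t_setup):
    """Setup slack in ps; skew = capture clock arrival - launch clock arrival."""
    return t_clk + skew - t_data_max - t_setup


def hold_slack(skew, t_data_min, t_hold):
    """Hold slack in ps, using the same signed skew convention."""
    return t_data_min - t_hold - skew


# Assumed illustrative values (ps); only the clock arrivals come from the text.
T_CLK, T_SETUP, T_HOLD = 1000, 50, 30
T_DATA_MAX, T_DATA_MIN = 900, 60
clock_arrival = {"FF-A": 100, "FF-B": 140}


def check_path(launch, capture):
    """Return (skew, setup slack, hold slack) for one launch/capture pair."""
    skew = clock_arrival[capture] - clock_arrival[launch]
    return (
        skew,
        setup_slack(T_CLK, skew, T_DATA_MAX, T_SETUP),
        hold_slack(skew, T_DATA_MIN, T_HOLD),
    )


a_to_b = check_path("FF-A", "FF-B")
b_to_a = check_path("FF-B", "FF-A")
print("FF-A -> FF-B (skew, setup, hold):", a_to_b)
print("FF-B -> FF-A (skew, setup, hold):", b_to_a)
```

With these assumed numbers, the FF-A → FF-B path gains setup slack but fails hold by 10 ps, while the reverse path passes hold comfortably and is left with only 10 ps of setup slack.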
Insertion delay is the time from the clock source pin to the clock pin of a given flip-flop, essentially how long the clock tree is on the way to that sink. Global skew is the spread in insertion delay across all sinks: the largest value minus the smallest.
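As a small illustration of how the two metrics relate, the Python sketch below derives both from a table of per-sink arrival times. FF-A and FF-B use the arrival times from the earlier example; FF-C and FF-D are hypothetical sinks added only to make the spread visible.

```python
# Clock arrival time at each sink, in ps, measured from the clock source.
# FF-C and FF-D are hypothetical sinks added for illustration.
insertion_delay_ps = {"FF-A": 100, "FF-B": 140, "FF-C": 118, "FF-D": 125}

max_insertion = max(insertion_delay_ps.values())
min_insertion = min(insertion_delay_ps.values())
global_skew = max_insertion - min_insertion

# The pair of sinks that defines the global skew.
latest = max(insertion_delay_ps, key=insertion_delay_ps.get)
earliest = min(insertion_delay_ps, key=insertion_delay_ps.get)

print(f"max insertion delay: {max_insertion} ps ({latest})")
print(f"global skew: {global_skew} ps ({earliest} vs {latest})")
```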
Both matter, but for different reasons:
The H-tree is the classic clock distribution structure. It gets its name from the H-shaped routing pattern visible at each level of the tree. The key property of a symmetric H-tree is that the wire length from the root to every leaf is identical, which means equal insertion delay at every sink (zero skew in an ideal wire model).
An H-tree works by halving the problem recursively:
At each level, the buffer is sized to drive the wire segment to the next level. A 4-level H-tree reaches 16 sub-trees; a 5-level tree reaches 32. Each sub-tree then distributes to its local cluster of FFs via a "last-mile" local clock tree.
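The equal-wire-length property can be demonstrated with a short recursive sketch in Python. It models the binary form of the H-tree described here, where each level halves the region and the split direction alternates between horizontal and vertical. The 1000 × 1000 die size and the function name `h_tree_leaves` are assumptions for illustration, and buffers are ignored: only wire length is tracked.

```python
def h_tree_leaves(cx, cy, width, height, levels, horizontal=True, path_len=0.0):
    """Return [(x, y, root_to_leaf_wire_length)] for a binary-split H-tree.

    Each level halves the current region and routes a wire from its centre
    to the centre of each half, alternating split direction per level.
    """
    if levels == 0:
        return [(cx, cy, path_len)]
    leaves = []
    for sign in (-1, 1):
        if horizontal:
            step = width / 4  # distance from region centre to half-region centre
            leaves += h_tree_leaves(cx + sign * step, cy, width / 2, height,
                                    levels - 1, False, path_len + step)
        else:
            step = height / 4
            leaves += h_tree_leaves(cx, cy + sign * step, width, height / 2,
                                    levels - 1, True, path_len + step)
    return leaves


# Assumed die size of 1000 x 1000 units, root at the centre.
leaves4 = h_tree_leaves(0.0, 0.0, 1000.0, 1000.0, levels=4)
leaves5 = h_tree_leaves(0.0, 0.0, 1000.0, 1000.0, levels=5)

lengths4 = {length for _, _, length in leaves4}
print(len(leaves4), "leaves at 4 levels;", len(leaves5), "leaves at 5 levels")
print("distinct root-to-leaf wire lengths at 4 levels:", lengths4)
```

Every leaf ends up with the same root-to-leaf length because each recursion step adds the same wire segment to both branches.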
When fanout exceeds ~5000 FFs (large designs), or when very tight skew (< 20 ps) is required, H-trees give way to clock mesh topology:
The mesh's low-impedance property means local variations (from manufacturing, temperature gradients) are averaged out. This reduces skew significantly — mesh designs routinely achieve skew < 15 ps across large domains. The trade-off is power: clock mesh consumes 2–4× more power than an H-tree for the same fanout, because of the large capacitance of the mesh wires.
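The power trade-off follows from the fact that a clock net charges and discharges its full capacitance every cycle, so its dynamic power scales linearly with capacitance (P = C · V² · f). The Python sketch below applies the 2–4× range from the text. The H-tree capacitance, supply voltage, and frequency are assumed round numbers, and the model assumes the extra mesh power comes entirely from added wire capacitance.

```python
def clock_dynamic_power_w(cap_farads, vdd_volts, freq_hz):
    """Dynamic power of a net that fully charges and discharges once per cycle."""
    return cap_farads * vdd_volts ** 2 * freq_hz


# Assumed illustrative values, not taken from any real design.
H_TREE_CAP_F = 100e-12   # total switched clock capacitance of the H-tree
VDD_V = 0.8
FREQ_HZ = 1e9

h_tree_w = clock_dynamic_power_w(H_TREE_CAP_F, VDD_V, FREQ_HZ)
mesh_low_w = clock_dynamic_power_w(2 * H_TREE_CAP_F, VDD_V, FREQ_HZ)
mesh_high_w = clock_dynamic_power_w(4 * H_TREE_CAP_F, VDD_V, FREQ_HZ)

print(f"H-tree: {h_tree_w * 1e3:.0f} mW")
print(f"mesh:   {mesh_low_w * 1e3:.0f} to {mesh_high_w * 1e3:.0f} mW")
```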
CTS uses a restricted set of cells called clock cells: standard cells designed and characterized for use in clock paths. Regular combinational cells are generally kept out of clock trees because their rise and fall delays are not balanced, which distorts the duty cycle as stages accumulate and makes skew harder to control across PVT corners.
| Cell type | Use | Notes |
|---|---|---|
| Clock buffer (CLKBUF) | Non-inverting clock distribution | Equal rise/fall transition; low skew across PVT |
| Clock inverter (CLKINV) | Inverting stage (two in series = buffer) | Smaller than CLKBUF, used in pairs to maintain polarity |
| Clock gate (ICG) | Enable-controlled clock gating | Contains latch + AND gate; must be in CTS clock tree |
| Local clock buffer (LCBUF) | Last-mile distribution to FF clusters | Smaller, lower power than root buffers |

    # Define CTS spec: which cells to use, skew/insertion targets
    create_clock_tree_spec \
        -name main_clk_tree \
        -clock clk \
        -buf_list {CLKBUF1 CLKBUF2 CLKBUF4 CLKBUF8} \
        -inv_list {CLKINV1 CLKINV2 CLKINV4} \
        -max_skew 0.050 \
        -max_insertion_delay 0.500 \
        -max_fanout 20 \
        -max_transition 0.150

    # Run CTS
    cts -spec main_clk_tree

    # Report clock tree quality
    report_clock_tree \
        -summary \
        -skew \
        -insertion_delay \
        -power

Useful skew (also called intentional skew) is the deliberate introduction of clock skew to improve timing. Instead of targeting zero skew everywhere, the CTS engine intentionally delivers the clock later to capturing FFs on critical paths, or earlier to their launching FFs. The critical path gains time, and that time is "borrowed" from the neighboring paths that start at the delayed FF or end at the advanced one, so those paths need slack to spare.
If a path from FF-A to FF-B has a setup violation of −80 ps, delivering the clock to FF-B 100 ps later (positive skew on the capture side) gives that path an extra 100 ps, curing the violation. This is equivalent to having a longer clock period for that specific path.
The same 100 ps of positive skew on FF-B now creates a hold risk on any path launching from FF-A to FF-B — because the capture edge is 100 ps later, the data must hold for 100 ps longer. Useful skew must always be checked for hold violations on all affected paths.
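The bookkeeping behind this trade can be sketched in a few lines of Python. The −80 ps setup violation and the 100 ps shift come from the example above. The remaining starting slacks, and the downstream flop FF-C, are hypothetical values added to show where the borrowed time comes from.

```python
def delay_capture_clock(slacks, delta_ps):
    """Return new slacks after delivering FF-B's clock delta_ps later.

    FF-B captures the FF-A -> FF-B path and launches the FF-B -> FF-C path,
    so the shift moves slack between the two stages.
    """
    return {
        "setup_A_to_B": slacks["setup_A_to_B"] + delta_ps,  # capture edge later
        "hold_A_to_B": slacks["hold_A_to_B"] - delta_ps,    # data must hold longer
        "setup_B_to_C": slacks["setup_B_to_C"] - delta_ps,  # launch edge later
        "hold_B_to_C": slacks["hold_B_to_C"] + delta_ps,
    }


# Slacks in ps with zero skew. Only setup_A_to_B comes from the text;
# the other three values are assumed for illustration.
before = {"setup_A_to_B": -80, "hold_A_to_B": 60,
          "setup_B_to_C": 150, "hold_B_to_C": 40}

after = delay_capture_clock(before, delta_ps=100)
violations = sorted(name for name, slack in after.items() if slack < 0)

print("after useful skew:", after)
print("violations:", violations)
```

With these numbers the setup violation is cured with 20 ps to spare, the next stage keeps 50 ps of setup slack, and the FF-A → FF-B hold check now fails by 40 ps and needs a hold fix.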
After CTS, the tool runs STA with the actual (not ideal) clock tree. Hold violations are extremely common post-CTS because:
Hold violations are fixed by inserting delay buffers (DELBUFs) on the offending data paths. These buffers add delay to the data path without affecting the clock. Each hold fix adds area and power — a design with many hold violations may require 5–10% area overhead from hold fix buffers alone.

    # After CTS, check and fix hold violations
    report_timing -delay_type min -max_paths 50
    ## Look for negative hold slack (min path too short)

    # Auto fix hold violations with delay cell insertion
    opt_hold \
        -effort high \
        -hold_slack_limit 0.020 ;# leaves 20 ps hold margin above requirement

    # Verify after fixing
    report_timing -delay_type min
    report_power ;# hold buffers increase dynamic power ~3-8%

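A rough way to see why hold fixing costs area is to estimate how many delay cells each violating path needs. The Python sketch below is a simplified model, not how any particular tool sizes its fixes. The 20 ps margin mirrors the `-hold_slack_limit` value above, while the 25 ps per-buffer delay and the path slacks are assumed.

```python
import math


def delay_buffers_needed(hold_slack_ps, margin_ps, buf_delay_ps):
    """Number of identical delay buffers needed to lift hold slack to the margin."""
    deficit = margin_ps - hold_slack_ps
    if deficit <= 0:
        return 0  # path already meets the margin
    return math.ceil(deficit / buf_delay_ps)


MARGIN_PS = 20      # target hold margin, matching the 0.020 ns limit above
BUF_DELAY_PS = 25   # assumed delay of one DELBUF cell

# Hypothetical post-CTS hold slacks (ps) for four data paths.
hold_slacks = {"path0": -40, "path1": -5, "path2": 20, "path3": 30}

buffers = {name: delay_buffers_needed(slack, MARGIN_PS, BUF_DELAY_PS)
           for name, slack in hold_slacks.items()}
total_buffers = sum(buffers.values())

print(buffers, "total:", total_buffers)
```

Each inserted buffer also lengthens the max-delay path through the same logic, so a real flow rechecks setup slack after hold fixing.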
Clock gating is the most powerful dynamic power reduction technique: when a block is idle, its clock is stopped, eliminating all switching activity (and dynamic power) in that block. An integrated clock gate (ICG) is a specific standard cell that combines an enable latch and an AND gate in a glitch-free configuration.
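The glitch-free behaviour comes from the latch, which is transparent only while the clock is low, so the enable seen by the AND gate cannot change during the high phase. The Python sketch below is a sampled behavioural model that compares a bare AND gate with the latch-plus-AND structure. The waveforms and function names are made up for illustration, and real cell timing is not modelled.

```python
def simulate_gating(clk, en):
    """Return (naive, icg) gated-clock waveforms for sampled clk/en lists."""
    naive, icg = [], []
    latched_en = 0
    for c, e in zip(clk, en):
        if c == 0:               # latch is transparent while the clock is low
            latched_en = e
        naive.append(c & e)      # bare AND gate: enable changes reach the output
        icg.append(c & latched_en)
    return naive, icg


def pulse_widths(wave):
    """Widths, in samples, of each run of 1s in the waveform."""
    widths, run = [], 0
    for v in wave:
        if v:
            run += 1
        elif run:
            widths.append(run)
            run = 0
    if run:
        widths.append(run)
    return widths


# Two samples per clock phase; enable rises and falls while the clock is high.
clk = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
en  = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0]

naive, icg = simulate_gating(clk, en)
print("naive AND pulse widths:", pulse_widths(naive))
print("ICG pulse widths:      ", pulse_widths(icg))
```

The bare AND gate emits two half-width pulses (glitches), while the latch-based gate passes exactly one full-width clock pulse.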
ICG cells are inserted into the clock tree by the CTS engine. They must be placed close to the FFs they gate (low insertion delay from gate to FFs) and their enable signals must be properly timed. A typical SoC might have 500–2000 clock gates, reducing dynamic power by 20–40%.