Chapter 5 of 10

Clock Tree Synthesis

Every flip-flop in a chip waits for the same clock edge. CTS is the art and engineering of making that "same" a reality — distributing a clock signal across a die with picosecond-level skew control.

In this chapter
  1. Why Clock Skew Matters
  2. Insertion Delay vs Skew
  3. H-Tree Topology
  4. Fishbone & Mesh Topology
  5. Clock Buffer Selection
  6. Useful Skew
  7. Interactive: Clock Tree Builder
  8. Post-CTS Hold Fixing
  9. Clock Gating in the Clock Tree
  10. Key Takeaways

1. Why Clock Skew Matters

Clock skew is the difference in clock arrival time between any two flip-flops in the design. If FF-A receives the clock at time 100 ps and FF-B at time 140 ps, the skew between them is 40 ps.

Taking skew as FF-B's arrival time minus FF-A's (so +40 ps here), it has opposite effects on setup and hold timing between these two FFs:

| Violation type | Path direction | Effect of skew | Formula adjustment |
|---|---|---|---|
| Setup (max delay) | FF-A launches → FF-B captures | Positive skew helps setup: capture edge arrives later | Slack = T_clk + skew − T_data − T_setup |
| Hold (min delay) | FF-A launches → FF-B captures | Positive skew hurts hold: capture edge arrives later, data must hold longer | Slack = T_data − T_hold − skew |
| Setup | FF-B launches → FF-A captures | Positive skew hurts setup: capture edge arrives earlier | Slack = T_clk − skew − T_data − T_setup |

Skew budget rule of thumb: Total clock skew should be < 10% of the clock period. For a 1 GHz design (1000 ps period), keep skew < 100 ps. CTS targets for advanced nodes: skew < 50 ps.
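
The slack formulas for the two FFs can be checked numerically. The following is a minimal sketch, assuming skew is defined as capture-clock arrival minus launch-clock arrival. The function names and the clock period, path delay, setup and hold values are hypothetical, chosen only for illustration.

```python
def setup_slack(t_clk, t_data, t_setup, launch_arrival, capture_arrival):
    """Setup slack in ps: T_clk + skew - T_data - T_setup."""
    skew = capture_arrival - launch_arrival
    return t_clk + skew - t_data - t_setup


def hold_slack(t_data, t_hold, launch_arrival, capture_arrival):
    """Hold slack in ps: T_data - T_hold - skew."""
    skew = capture_arrival - launch_arrival
    return t_data - t_hold - skew


# Clock arrivals from the text: FF-A at 100 ps, FF-B at 140 ps.
T_A, T_B = 100, 140
# Hypothetical values for illustration only (all in ps).
T_CLK, T_SETUP, T_HOLD = 1000, 30, 20

# A -> B, slow path: the later capture edge adds 40 ps of setup slack.
setup_a_to_b = setup_slack(T_CLK, 950, T_SETUP, T_A, T_B)
# A -> B, fast path: the later capture edge removes 40 ps of hold slack.
hold_a_to_b = hold_slack(50, T_HOLD, T_A, T_B)
# B -> A, slow path: the capture edge is now 40 ps earlier than the launch edge.
setup_b_to_a = setup_slack(T_CLK, 950, T_SETUP, T_B, T_A)

print(setup_a_to_b, hold_a_to_b, setup_b_to_a)  # 60 -10 -20
```

The same 40 ps shows up with opposite sign in each direction, which is why skew cannot be tuned for one path without checking the paths around it.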

2. Insertion Delay vs Skew

Insertion delay (also called clock latency) is the time from the clock source pin to the clock pin of a flip-flop; it reflects how deep and how long the clock tree is. Skew is the spread in insertion delay across the sinks: for the whole tree, the latest arrival minus the earliest.

Both matter, but for different reasons:

CTS targets (typical 7nm design): Insertion delay < 400–600 ps, skew < 30–50 ps. Ultra-high-performance designs may target skew < 20 ps, requiring mesh topologies and active de-skew techniques.
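
Both metrics fall out of the same list of sink arrival times. The sketch below assumes arrivals are measured from the clock source pin; the arrival values and the helper name `clock_tree_metrics` are hypothetical, and the limits are the upper ends of the target ranges quoted above.

```python
def clock_tree_metrics(arrivals_ps):
    """Return (max insertion delay, global skew) for sink arrivals in ps."""
    latest = max(arrivals_ps)
    earliest = min(arrivals_ps)
    return latest, latest - earliest


# Hypothetical sink arrival times (ps), measured from the clock source pin.
arrivals = [452, 468, 471, 460, 455, 480]
insertion, skew = clock_tree_metrics(arrivals)
print(insertion, skew)  # 480 28

# Upper ends of the target ranges quoted in the text.
MAX_INSERTION_PS = 600
MAX_SKEW_PS = 50
meets_targets = insertion < MAX_INSERTION_PS and skew < MAX_SKEW_PS
print(meets_targets)  # True
```

Note that a tree can have a large insertion delay and still have small skew, as here: every sink is late by roughly the same amount.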

3. H-Tree Topology

The H-tree is the most common clock distribution structure. It gets its name from the H-shaped routing pattern visible at each level of the tree. The key property of a symmetric H-tree is that the wire length from the root to every leaf is identical — which means equal insertion delay at every sink (zero skew in an ideal wire model).

An H-tree works by halving the problem recursively:

  1. Place a buffer at the clock root (center of the die or clock domain)
  2. Drive two branches (left and right) of equal length to sub-roots
  3. At each sub-root, insert a buffer and drive two more equal-length branches
  4. Repeat for 4–6 levels until branches reach individual FFs or small clusters

At each level, the buffer is sized to drive the wire segment to the next level. A 4-level H-tree reaches 16 sub-trees; a 5-level tree reaches 32. Each sub-tree then distributes to its local cluster of FFs via a "last-mile" local clock tree.
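
The recursive construction can be sketched in a few lines. This is an idealized geometric model, not a CTS algorithm: it assumes each level drives exactly two branches, that branches alternate between horizontal and vertical, and that the branch length halves after each horizontal/vertical pair. The function name and the 250 µm starting half-span are hypothetical.

```python
def build_h_tree(x, y, half_span, levels, horizontal=True, path_len=0.0):
    """Return a list of (x, y, wire_length_from_root) tuples, one per leaf.

    Each level drives two equal-length branches. Directions alternate
    between horizontal and vertical, and the branch length halves after
    every horizontal/vertical pair, which produces the H shape.
    """
    if levels == 0:
        return [(x, y, path_len)]
    if horizontal:
        ends = [(x - half_span, y), (x + half_span, y)]
        next_span = half_span          # the vertical stroke reuses this length
    else:
        ends = [(x, y - half_span), (x, y + half_span)]
        next_span = half_span / 2      # next H is half the size
    leaves = []
    for end_x, end_y in ends:
        leaves += build_h_tree(end_x, end_y, next_span, levels - 1,
                               not horizontal, path_len + half_span)
    return leaves


leaves = build_h_tree(0.0, 0.0, 250.0, levels=4)
lengths = {length for _, _, length in leaves}
print(len(leaves), lengths)  # 16 {750.0}
```

Every leaf sees the same root-to-leaf wire length, which is the zero-skew property of the ideal symmetric tree. Real trees lose this symmetry to placement blockages, uneven sink loads and on-chip variation.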

Why not wire directly? A single long wire from the clock root to every FF would have microseconds of RC delay and massive skew from resistance variation. Buffers regenerate the clock signal at each level, keeping drive strength matched to load and delay controlled.
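
The benefit of buffering comes from the quadratic growth of wire delay with length. The sketch below uses the Elmore delay of a distributed RC line (half of total resistance times total capacitance) and deliberately ignores buffer input capacitance and output resistance. The per-micron resistance and capacitance and the 20 ps buffer delay are hypothetical; real values depend heavily on the metal layer and node.

```python
def unbuffered_delay_ps(length_um, r_per_um, c_per_um):
    """Elmore delay of a distributed RC wire: 0.5 * R_total * C_total."""
    r_total = r_per_um * length_um          # ohms
    c_total = c_per_um * length_um          # farads
    return 0.5 * r_total * c_total * 1e12   # seconds -> ps


def buffered_delay_ps(length_um, r_per_um, c_per_um, segments, buf_delay_ps):
    """Split the wire into equal segments, each driven by one buffer."""
    seg = unbuffered_delay_ps(length_um / segments, r_per_um, c_per_um)
    return segments * (seg + buf_delay_ps)


# Hypothetical wire: 1 ohm/um and 0.2 fF/um over 5 mm.
R_PER_UM, C_PER_UM, LENGTH_UM = 1.0, 0.2e-15, 5000

bare = unbuffered_delay_ps(LENGTH_UM, R_PER_UM, C_PER_UM)
split = buffered_delay_ps(LENGTH_UM, R_PER_UM, C_PER_UM,
                          segments=10, buf_delay_ps=20)
print(round(bare), round(split))  # 2500 450
```

Doubling the unbuffered length quadruples its delay, while the buffered delay grows roughly linearly with length once the segment size is fixed.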

4. Fishbone / Mesh Topology for High-Fanout

When fanout exceeds ~5000 FFs (large designs), or when very tight skew (< 20 ps) is required, H-trees give way to clock mesh topology:

The mesh's low-impedance property means local variations (from manufacturing, temperature gradients) are averaged out. This reduces skew significantly — mesh designs routinely achieve skew < 15 ps across large domains. The trade-off is power: clock mesh consumes 2–4× more power than an H-tree for the same fanout, because of the large capacitance of the mesh wires.
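
The power penalty follows directly from the dynamic power relation P = C · V² · f, since a clock net makes a full swing every cycle. The sketch below assumes a hypothetical 200 pF H-tree and a mesh with three times that capacitance (inside the 2–4× range above); the supply voltage and frequency are also illustrative.

```python
def clock_dynamic_power_w(c_total_f, vdd_v, freq_hz):
    """Dynamic power of a net that charges and discharges once per cycle."""
    return c_total_f * vdd_v ** 2 * freq_hz


VDD_V, FREQ_HZ = 0.75, 1e9          # hypothetical supply and clock frequency
C_HTREE_F = 200e-12                 # hypothetical H-tree capacitance
C_MESH_F = 3 * C_HTREE_F            # mesh assumed to carry 3x the capacitance

p_htree = clock_dynamic_power_w(C_HTREE_F, VDD_V, FREQ_HZ)
p_mesh = clock_dynamic_power_w(C_MESH_F, VDD_V, FREQ_HZ)
print(round(p_htree, 4), round(p_mesh, 4))  # 0.1125 0.3375
```

Because voltage and frequency are the same for both structures, the power ratio is simply the capacitance ratio.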

5. Clock Buffer Selection

CTS uses a restricted set of cells called clock cells: standard cells designed and characterized for clock paths, with balanced rise and fall delays. Regular combinational cells are avoided in clock trees because their unequal rise and fall delays distort the clock duty cycle and make skew harder to control across PVT corners.

| Cell type | Use | Notes |
|---|---|---|
| Clock buffer (CLKBUF) | Non-inverting clock distribution | Equal rise/fall transition; low skew across PVT |
| Clock inverter (CLKINV) | Inverting stage (two in series = buffer) | Smaller than CLKBUF, used in pairs to maintain polarity |
| Clock gate (ICG) | Enable-controlled clock gating | Contains latch + AND gate; must be in CTS clock tree |
| Local clock buffer (LCBUF) | Last-mile distribution to FF clusters | Smaller, lower power than root buffers |

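# Define CTS spec: which cells to use, skew/insertion targets
# Times assumed to be in ns: 0.050 = 50 ps skew, 0.500 = 500 ps insertion delay, 0.150 = 150 ps transition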
create_clock_tree_spec \
  -name main_clk_tree \
  -clock clk \
  -buf_list {CLKBUF1 CLKBUF2 CLKBUF4 CLKBUF8} \
  -inv_list {CLKINV1 CLKINV2 CLKINV4} \
  -max_skew  0.050 \
  -max_insertion_delay 0.500 \
  -max_fanout 20 \
  -max_transition 0.150

# Run CTS
cts -spec main_clk_tree

# Report clock tree quality
report_clock_tree \
  -summary \
  -skew \
  -insertion_delay \
  -power

6. Useful Skew

Useful skew (also called intentional skew) is the deliberate introduction of clock skew to improve timing. Instead of targeting zero skew everywhere, the CTS engine intentionally delivers the clock later to capturing FFs on critical paths, and earlier to launching FFs — effectively "borrowing" time from the clock period.

Setup improvement via useful skew

If a path from FF-A to FF-B has a setup violation of −80 ps, delivering the clock to FF-B 100 ps later (positive skew on the capture side) gives that path an extra 100 ps, turning the −80 ps slack into +20 ps. For that specific path, this is equivalent to a 100 ps longer clock period. The time is borrowed, not free: any path launched from FF-B now has 100 ps less setup margin.

Hold risk from useful skew

The same 100 ps of positive skew on FF-B now creates a hold risk on any path launching from FF-A to FF-B — because the capture edge is 100 ps later, the data must hold for 100 ps longer. Useful skew must always be checked for hold violations on all affected paths.
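
The trade can be written as a one-line bookkeeping rule: whatever setup slack the capture-side delay adds, it removes from hold slack on the same launch/capture pair. In the sketch below the −80 ps and 100 ps figures come from the text, while the starting hold slack of 60 ps and the function name are hypothetical.

```python
def apply_useful_skew(setup_slack_ps, hold_slack_ps, capture_delay_ps):
    """Delay the capture clock: setup slack gains what hold slack loses."""
    return (setup_slack_ps + capture_delay_ps,
            hold_slack_ps - capture_delay_ps)


# -80 ps setup violation and 100 ps capture delay are from the text;
# the 60 ps starting hold slack is a hypothetical value.
setup_after, hold_after = apply_useful_skew(-80, 60, 100)
print(setup_after, hold_after)  # 20 -40
```

Here the setup violation is cured but a path that had comfortable hold margin now fails by 40 ps, so it would need delay cells or a smaller clock shift.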

Useful skew limits: A common guideline is to cap useful skew at about ±20% of the clock period. Beyond this, hold fixing becomes expensive (many delay buffers required) and the skewed tree becomes very hard to build physically.

7. Interactive: Clock Tree Builder

[Interactive widget: build an H-tree level by level and watch insertion delay accumulate at each level; a Balanced vs Skewed toggle visualizes useful skew. Wire color encodes insertion delay: early (<200 ps), mid (200–400 ps), late (>400 ps). Readouts: leaf arrival times, levels built, clock skew (ps), max insertion delay (ps).]

8. Post-CTS Hold Fixing

After CTS, the tool runs STA with the actual (not ideal) clock tree. Hold violations are extremely common post-CTS because:

Hold violations are fixed by inserting delay buffers (DELBUFs) on the offending data paths. These buffers add delay to the data path without affecting the clock. Each hold fix adds area and power — a design with many hold violations may require 5–10% area overhead from hold fix buffers alone.

# After CTS, check and fix hold violations
report_timing -delay_type min -max_paths 50
## Look for negative hold slack (min path too short)

# Auto fix hold violations with delay cell insertion
opt_hold \
  -effort high \
  -hold_slack_limit 0.020
# Leaves 20ps hold margin above requirement

# Verify after fixing
report_timing -delay_type min
report_power ;# hold buffers increase dynamic power ~3-8%
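
The cost of a hold fix can be estimated before running the tool. The sketch below is a rough first-order model, assuming every delay cell adds the same fixed delay; the 25 ps per-cell delay and the function name are hypothetical, and the 20 ps margin matches the hold margin used in the script above.

```python
import math


def delay_buffers_needed(hold_slack_ps, margin_ps, buf_delay_ps):
    """Number of delay cells needed to lift hold slack to at least margin_ps."""
    shortfall = margin_ps - hold_slack_ps
    if shortfall <= 0:
        return 0  # path already meets the margin
    return math.ceil(shortfall / buf_delay_ps)


# A path at -45 ps hold slack must reach +20 ps: 65 ps of delay to add.
print(delay_buffers_needed(-45, 20, 25))  # 3
# A path already at +30 ps needs nothing.
print(delay_buffers_needed(30, 20, 25))   # 0
```

A real optimizer also has to confirm that the added data-path delay does not create a setup violation on the same path, which this estimate ignores.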

9. Clock Gating Cells in the Clock Tree

Clock gating is one of the most effective dynamic power reduction techniques: when a block is idle, its clock is stopped, eliminating clock-driven switching activity (and the associated dynamic power) in that block. An integrated clock gate (ICG) is a dedicated standard cell that combines an enable latch and an AND gate in a glitch-free configuration.

ICG cells are usually inserted during logic synthesis and then become part of the clock tree that the CTS engine builds and balances. They should be placed close to the FFs they gate (low insertion delay from gate to FFs), and their enable signals must meet setup and hold at the ICG's latch. A typical SoC might have 500–2000 clock gates, reducing dynamic power by 20–40%.

Why a latch in the ICG? If the enable signal changes while the clock is high, a simple AND gate would pass a glitch (a partial clock pulse) into the gated domain, where it can cause timing failures. The latch is transparent only while the clock is low and holds its value while the clock is high, so the enable seen by the AND gate can change only during the low phase. The gated clock is therefore glitch-free regardless of when the enable changes.
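
The difference is easy to see in a sampled-waveform model. The sketch below is a behavioral illustration, not a cell model: it assumes a latch that is transparent while the clock is low and powers up at 0, with waveforms given as lists of 0/1 samples. The function names and the test waveforms are hypothetical.

```python
def and_gate_only(clk, en):
    """Gate the clock with a bare AND: enable changes pass straight through."""
    return [c & e for c, e in zip(clk, en)]


def latch_based_icg(clk, en):
    """Gate the clock with a low-transparent latch followed by an AND."""
    out = []
    q = 0  # latch state, assumed to power up low
    for c, e in zip(clk, en):
        if c == 0:
            q = e          # transparent while the clock is low
        out.append(c & q)  # while the clock is high, q is frozen
    return out


def high_pulse_widths(wave):
    """Widths (in samples) of each run of 1s in a waveform."""
    widths, run = [], 0
    for v in wave:
        if v:
            run += 1
        elif run:
            widths.append(run)
            run = 0
    if run:
        widths.append(run)
    return widths


# Clock with 2 samples low, 2 samples high; enable rises mid-way through a high phase.
clk = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
en = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

print(high_pulse_widths(and_gate_only(clk, en)))    # [1, 2]
print(high_pulse_widths(latch_based_icg(clk, en)))  # [2]
```

The bare AND gate emits a one-sample runt pulse when the enable rises during the high phase; the latch-based gate waits for the next low phase and then passes only full-width pulses.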

✅ Chapter 5 Key Takeaways
