STA — Clock Distribution

Clock Tree Synthesis & Clock Skew

A chip with a billion transistors shares one clock source. Getting that signal to every flip-flop with controlled timing is Clock Tree Synthesis — one of the most critical steps in physical design. Skew, insertion delay, useful skew, and CTS topology choices directly determine whether your design meets setup and hold timing at every corner.


Why the Clock Needs a Tree

In RTL simulation, the clock arrives at every flip-flop at the same instant. In silicon, it does not. A single wire from the PLL output to a million flip-flops would present an enormous fanout: one driving cell would have to charge the combined input capacitance of every FF at once. The required drive current would be impractical, the wire resistance would add a large RC delay, and the signal would arrive late with a badly degraded slew.

Clock Tree Synthesis solves this by inserting a balanced tree of buffers and inverters between the clock source and the flip-flops. Each buffer drives only a small number of downstream loads, the tree branches progressively, and the total delay from source to every leaf is made as equal as possible.
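
To make the "tree branches progressively" idea concrete, here is a minimal Python sketch that estimates how many buffer levels a tree needs for a given sink count and per-buffer fanout limit. The function names and the fanout of 16 are illustrative assumptions, and a real CTS tool also weighs wire load, placement, and slew rather than fanout alone.

```python
def tree_levels(num_sinks: int, max_fanout: int) -> int:
    """Smallest number of buffer levels so no buffer drives more than max_fanout loads."""
    if num_sinks < 1 or max_fanout < 2:
        raise ValueError("need at least one sink and a fanout limit of 2 or more")
    levels = 1
    capacity = max_fanout  # sinks reachable with the current number of levels
    while capacity < num_sinks:
        capacity *= max_fanout
        levels += 1
    return levels


def buffers_per_level(num_sinks: int, max_fanout: int) -> list:
    """Buffer count at each level, listed from the leaf level up to the root."""
    counts = []
    loads = num_sinks
    while True:
        bufs = -(-loads // max_fanout)  # ceiling division
        counts.append(bufs)
        if bufs == 1:
            break
        loads = bufs  # the buffers just added become loads for the next level up
    return counts


if __name__ == "__main__":
    print(tree_levels(1_000_000, 16))
    print(buffers_per_level(1_000_000, 16))
```

Under these assumptions a million flip-flops need only a handful of levels, which is why a tree keeps both per-stage load and total insertion delay manageable.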

What CTS controls

Insertion delay (total latency from source to FF clock pin), clock skew (arrival time difference between FFs), slew rate (transition time at each leaf), and fanout per buffer stage. All four are specified as CTS constraints before the tool runs.

What CTS cannot fix

Floorplan problems that force clock wires to travel long distances, extremely high clock fanout from ungated always-on clocks, and fundamental setup violations caused by too-long combinational paths — these must be resolved before CTS.

Key insight: CTS is run after placement and before routing in the physical design flow. At this point cell positions are fixed, so the tool knows exactly how far each FF is from the clock source and can size buffer stages to equalize arrival times.

Clock Tree Topologies

Different designs use different physical structures for the clock tree, each with trade-offs in skew, power, and routability.

H-Tree

Routes the clock through wires shaped like nested letter H's. Each branch has the same wire length from the center, giving inherent geometric symmetry that minimizes skew before any buffers are inserted. Used in high-performance regular blocks such as SRAM, CPU cores, and FPGAs.
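
The geometric symmetry can be checked with a short Python sketch. It builds an idealized H-tree recursively and records the wire length to each leaf. The function name, coordinates, and dimensions are hypothetical, and the model ignores buffers, vias, and layer-dependent RC.

```python
def h_tree_leaves(cx, cy, half, levels, length=0.0):
    """Return (x, y, wire_length) for every leaf of an idealized H-tree.

    Each level draws one H: a horizontal bar of half-width `half` through the
    center, then vertical strokes of half-height `half` at both ends. The four
    tips of the H become the centers of the next, half-sized H.
    """
    if levels == 0:
        return [(cx, cy, length)]
    leaves = []
    for dx in (-half, half):
        for dy in (-half, half):
            # Center to tip: `half` along the bar plus `half` along the stroke.
            leaves += h_tree_leaves(cx + dx, cy + dy, half / 2,
                                    levels - 1, length + 2 * half)
    return leaves


if __name__ == "__main__":
    leaves = h_tree_leaves(0.0, 0.0, 4.0, 3)
    lengths = {wire for _, _, wire in leaves}
    print(len(leaves), lengths)
```

Every leaf ends up with the same source-to-leaf wire length, so in this idealized model the wire delay is matched by construction.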

Balanced Binary Tree

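A standard buffer tree in which each node nominally drives two children. In practice the tools relax strict two-way branching and let each buffer drive as many loads as the fanout and capacitance limits allow. CTS tools equalize delay by sizing buffers and adjusting wire lengths. This is the most common topology in ASIC design; tools like Innovus and ICC2 build it automatically from the placed netlist.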

Mesh (Grid)

A global metal grid is driven from multiple points, creating a low-impedance distributed clock network. Skew is extremely low (near zero) because the grid shorts the outputs of all drivers together, which averages out arrival-time differences between neighboring points. Power is very high because of the large grid capacitance. Used in ultra-high-performance designs such as server CPUs (Intel, AMD).

Topology | Skew | Power | Routability | Typical use
H-Tree | Very low | Low | Requires regular layout | Memory, FPGA fabric, custom blocks
Balanced Binary | Low | Medium | Handles irregular floorplan | Standard ASIC, SoC design
Mesh | Near zero | Very high | Needs dedicated metal layers | High-performance server CPUs
Hybrid | Low | Medium | Moderate | Mixed-block SoCs with multiple clocks

Clock Skew — Definition and Types

Clock skew is the difference in clock arrival time between the launch flip-flop and the capture flip-flop on a timing path. It arises from differences in buffer delays, wire lengths, and RC parasitics through the clock tree branches connecting source to each leaf.

Skew = T_clk_capture − T_clk_launch
Positive skew: capture clock arrives LATER than launch clock
Negative skew: capture clock arrives EARLIER than launch clock

Local Skew

The skew between two flip-flops that are directly connected by a timing path (launch FF → capture FF). This is what STA tools analyze per path, and it appears directly in the setup and hold equations. Modern CTS flows typically target local skew of roughly 50–100 ps or less at 7nm.

Global Skew

The maximum skew across the entire clock domain: the difference between the latest and earliest clock arrival among all flip-flops. It is a useful health metric for the clock tree but is not directly used in per-path STA. Global skew is always greater than or equal to the largest local skew, because every launch/capture pair is drawn from the same set of flip-flops.
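
The two definitions can be compared with a minimal Python sketch. The flip-flop names, arrival times, and path list below are made up for illustration; in a real flow they would come from a post-CTS timing report.

```python
# Hypothetical clock arrival times at each flip-flop's clock pin, in ps.
arrival_ps = {"FF_A": 380, "FF_B": 460, "FF_C": 410, "FF_D": 350}

# Hypothetical timing paths as (launch FF, capture FF) pairs.
paths = [("FF_A", "FF_B"), ("FF_B", "FF_C"), ("FF_C", "FF_A")]


def local_skew(arrival, launch, capture):
    """Skew seen by one path: capture arrival minus launch arrival."""
    return arrival[capture] - arrival[launch]


def global_skew(arrival):
    """Latest minus earliest arrival over the whole clock domain."""
    return max(arrival.values()) - min(arrival.values())


if __name__ == "__main__":
    for launch, capture in paths:
        print(launch, "->", capture, local_skew(arrival_ps, launch, capture), "ps")
    print("global skew:", global_skew(arrival_ps), "ps")
```

Here FF_D has the earliest arrival but shares no path with FF_B, so the global skew exceeds every local skew even though no single path sees it.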

Skew direction | Effect on setup slack | Effect on hold slack | Intuition
Positive (+), capture clock later | Improves | Tightens | Data has more time to arrive before the capture clock edge
Negative (−), capture clock earlier | Tightens | Improves | Capture edge moves earlier, so data must arrive sooner
Zero | Neutral | Neutral | Ideal CTS target: balanced tree

The skew trade-off: Any skew that helps setup simultaneously hurts hold by the same amount. You cannot improve both simultaneously through skew alone. Hold violations introduced by positive (useful) skew must be fixed by inserting delay buffers on the data path.

Clock Insertion Delay (Clock Latency)

Clock insertion delay is the total propagation time from the clock source to a flip-flop's clock pin, measured through the PLL, clock network, and buffer tree. It has two components that STA treats separately.

Source Latency

The delay from the PLL or oscillator output to the clock definition point in the design (usually the top-level clock port). This includes off-chip PCB traces, package inductance, and on-chip wiring from the pad to the first clock buffer. Specified in SDC using set_clock_latency -source.

Network Latency

The delay from the clock definition port through all clock buffers and inverters in the clock tree to the flip-flop's clock pin. This is what CTS physically builds. After CTS, the tool annotates actual network latency from extraction. Before CTS, an ideal clock model is used with zero or estimated latency.

Total Clock Latency = Source Latency + Network Latency
Setup check (simplified):
  T_launch_clk + T_cq + T_comb + T_su ≤ T_capture_clk + T_period
  where T_launch_clk and T_capture_clk include full clock latency to each FF
Why latency matters for sign-off: Before CTS, SDC uses set_clock_latency estimates. After CTS, the actual annotated latency is used. If estimated and actual latencies differ significantly, timing that passed pre-CTS may fail post-CTS. Always re-run STA with propagated clocks after CTS.
# SDC: specify clock latency before CTS
set_clock_latency -source 0.5 [get_clocks CLK]   ;# source: PLL to port
set_clock_latency 1.2 [get_clocks CLK]            ;# network: estimated tree delay

# After CTS: use propagated clocks (actual network delay from extraction)
set_propagated_clock [all_clocks]
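
The pre-CTS versus post-CTS difference can be sketched in Python using the simplified setup check above. The source latency of 500 ps and the network estimate of 1200 ps mirror the SDC snippet; the period, cell delays, and post-CTS latencies are hypothetical values chosen only to show a path that passes with estimates and fails with propagated clocks.

```python
def total_latency(source_ps, network_ps):
    """Total clock latency = source latency + network latency."""
    return source_ps + network_ps


def setup_slack(period, launch_lat, capture_lat, t_cq, t_comb, t_su):
    """Simplified setup check: data required time minus data arrival time."""
    arrival = launch_lat + t_cq + t_comb
    required = period + capture_lat - t_su
    return required - arrival


# All values in ps. Cell delays and period are illustrative assumptions.
PERIOD, T_CQ, T_COMB, T_SU = 1000, 180, 700, 50
SOURCE = 500

# Pre-CTS: both flip-flops share the same estimated network latency.
est = total_latency(SOURCE, 1200)
pre_cts = setup_slack(PERIOD, est, est, T_CQ, T_COMB, T_SU)

# Post-CTS: the built tree is slower to the launch FF than to the capture FF.
launch = total_latency(SOURCE, 1300)
capture = total_latency(SOURCE, 1180)
post_cts = setup_slack(PERIOD, launch, capture, T_CQ, T_COMB, T_SU)

if __name__ == "__main__":
    print("pre-CTS slack (ps):", pre_cts)
    print("post-CTS slack (ps):", post_cts)
```

With identical estimated latencies the clock terms cancel and only the data path matters. Once the real tree delivers unequal latencies, the same path can lose its margin.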

CTS Goals, Constraints, and Flow

Before running CTS, the designer specifies a set of targets that the tool must meet. These are set as CTS constraints in the tool's configuration or in the SDC.

Constraint | Typical target | What happens if violated
Max clock skew | 50–150 ps (7nm–28nm) | Setup/hold timing margin is reduced; failing paths may appear
Max insertion delay | 500 ps – 2 ns | Data path timing uses higher latency in STA, may cause setup failures
Max slew (transition time) | 100–300 ps | Slow slew increases noise susceptibility and clock-to-Q delay
Max fanout per buffer | 8–20 FFs | Too high fanout degrades slew; too low wastes buffer area and power
Max capacitance per node | Per-cell library limit | Excessive load slows the buffer, degrading slew and delay
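
As a rough illustration of how such targets are applied, the Python sketch below compares measured clock-tree metrics against maximum limits and reports the ones that are exceeded. The metric names, limits, and measured values are hypothetical; real tools report these checks in their own formats.

```python
def cts_violations(measured, limits):
    """Return (metric, measured value, limit) for every metric above its maximum."""
    violations = []
    for name, limit in limits.items():
        value = measured.get(name)
        if value is not None and value > limit:
            violations.append((name, value, limit))
    return violations


# Hypothetical targets picked from within the typical ranges in the table.
limits = {"skew_ps": 100, "insertion_delay_ps": 1500, "slew_ps": 200, "fanout": 16}

# Hypothetical post-CTS measurements.
measured = {"skew_ps": 85, "insertion_delay_ps": 1320, "slew_ps": 240, "fanout": 12}

if __name__ == "__main__":
    for name, value, limit in cts_violations(measured, limits):
        print(f"{name}: {value} exceeds limit {limit}")
```

In this example only the slew target is missed, which in practice would point to an overloaded buffer or a long clock wire.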

The CTS flow runs in these steps:

1. Clock tree specification
   └── Define: clock roots, exceptions (don't-touch cells, clock gating),
               skew targets, slew targets, buffer/inverter cell list

2. Virtual tree construction
   └── Tool builds a virtual balanced tree ignoring physical layout

3. Physical tree construction
   └── Cells are placed, wires are routed (in-CTS routing)
   └── Buffer sizes are chosen to match delays across branches

4. Clock tree optimization (CTO)
   └── Iterative fixing of skew hotspots, slew violations,
       max-fanout violations

5. Post-CTS timing analysis
   └── STA with propagated clocks (actual annotated delays)
   └── Fix remaining setup/hold violations from skew imbalance

6. Incremental CTS for ECO
   └── Fix specific skew issues without re-running full CTS

Clock Gating Cells in the Clock Tree

Clock gating is one of the most effective techniques for reducing dynamic power in VLSI chips: stopping the clock to idle registers drops the switching activity of those flip-flops to near zero. However, inserting clock gates into the clock tree has direct implications for CTS and timing.

Integrated Clock Gating (ICG) Cell

An ICG cell is a latch-based AND gate designed specifically for clock gating. The latch samples the enable signal on the clock's low phase, and the AND gate combines the latched enable with the clock. The latch eliminates glitches that a plain AND gate would produce when the enable changes while the clock is high.
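
The glitch-suppression behavior can be shown with a small sample-based Python model. It is a behavioral sketch only: the waveforms are invented, each list element is one time step, and real ICG cells also have setup and hold requirements on the enable that this model ignores.

```python
def plain_and_gate(clk, en):
    """Gated clock from a bare AND gate: follows the enable immediately."""
    return [c & e for c, e in zip(clk, en)]


def icg(clk, en):
    """Latch-based ICG: the latch is transparent only while clk is low."""
    q = 0  # latched enable
    gclk = []
    for c, e in zip(clk, en):
        if c == 0:
            q = e  # latch follows the enable during the low phase
        gclk.append(c & q)  # during the high phase q is frozen
    return gclk


# Each clock phase lasts two samples. The enable rises at sample 3 and falls
# at sample 7, both while the clock is high.
clk = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
en  = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0]

if __name__ == "__main__":
    print("AND:", plain_and_gate(clk, en))
    print("ICG:", icg(clk, en))
```

The bare AND gate emits two half-width pulses, one when the enable rises mid-phase and one when it falls mid-phase. The latch-based model emits a single full-width pulse, because the enable can only change the gate's state while the clock is low.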

ICG placement in the tree

ICG cells are treated as part of the clock tree. CTS must balance clock arrival time through ICG cells just as it balances through regular buffers. An ICG cell adds its own insertion delay (typically 50–150 ps), which must be accounted for in setup and hold analysis for all flip-flops downstream of the gate.

// RTL clock gating — synthesized into ICG cell
always_ff @(posedge clk) begin
  if (en) data_reg <= data_in;   // synthesis tool infers ICG
end

// Or explicit ICG instantiation in RTL
ICGX1 u_icg (.CLK(clk), .EN(enable), .SE(scan_en), .GCLK(gated_clk));

// Downstream FFs use gated_clk — they stop switching when enable = 0
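CTS exception for ICG: Clock gates must be recognized as clock-gating cells by both STA and CTS. In SDC, the timing checks on the enable path are constrained with set_clock_gating_check; in the CTS spec, ICG cells should be treated as pass-through cells rather than as leaf sinks. If not, the tool may balance to the gate instead of through it, causing unexpected skew on downstream flip-flop clusters.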

During scan test mode, the SE (Scan Enable) pin of the ICG forces the gate open so the scan clock can propagate to all FFs — critical for structural DFT test coverage.

Useful Skew — Intentional Timing Slack

Useful skew is the deliberate introduction of unequal clock arrival times to improve timing margins on specific critical paths. Instead of targeting zero skew everywhere, the CTS tool or timing engineer intentionally delays the capture clock on a setup-critical path — effectively giving data more time to travel through the combinational logic.

Setup slack = T_period + T_skew − T_cq − T_comb − T_su
Adding positive T_skew directly adds to setup slack
Hold slack = T_cq + T_comb_min − T_h − T_skew
The same positive T_skew subtracts from hold slack
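
A short Python sketch of the two slack equations shows how much useful skew a path can absorb. The delay values are hypothetical, all in ps, and the helper names are introduced here for illustration.

```python
def setup_slack(period, skew, t_cq, t_comb_max, t_su):
    """Setup slack = T_period + T_skew - T_cq - T_comb - T_su."""
    return period + skew - t_cq - t_comb_max - t_su


def hold_slack(skew, t_cq, t_comb_min, t_h):
    """Hold slack = T_cq + T_comb_min - T_h - T_skew."""
    return t_cq + t_comb_min - t_h - skew


def skew_window(period, t_cq, t_comb_max, t_comb_min, t_su, t_h):
    """Range of skew values for which both setup and hold slack are non-negative."""
    min_skew = t_cq + t_comb_max + t_su - period  # below this, setup fails
    max_skew = t_cq + t_comb_min - t_h            # above this, hold fails
    return min_skew, max_skew


# Hypothetical path: slow worst-case logic, fast best-case logic.
PERIOD, T_CQ, T_SU, T_H = 1000, 180, 50, 30
T_COMB_MAX, T_COMB_MIN = 820, 120

if __name__ == "__main__":
    for skew in (0, 80, 300):
        s = setup_slack(PERIOD, skew, T_CQ, T_COMB_MAX, T_SU)
        h = hold_slack(skew, T_CQ, T_COMB_MIN, T_H)
        print(f"skew {skew:4d} ps: setup {s:5d} ps, hold {h:5d} ps")
    print("usable skew window:", skew_window(PERIOD, T_CQ, T_COMB_MAX, T_COMB_MIN, T_SU, T_H))
```

For any skew value, the sum of setup and hold slack stays constant. Skew only moves margin from one check to the other, so the usable window closes entirely when that sum is negative.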

When to use useful skew

When a timing path has negative setup slack that cannot be fixed by gate sizing or logic restructuring — typically late in the design cycle when netlist changes are risky. Also used at the block level to trade slack from paths with positive margin to paths that are failing.

Useful skew risks

Every ps of positive skew added for setup removes 1 ps from hold margin for the same path pair. Aggressive useful skew can cause hold violations in the FF/FF path, requiring hold buffer insertion — which increases area, power, and routing congestion.

Useful skew in tools: Useful skew is applied by implementation tools through concurrent clock and data optimization (for example CCOpt in Innovus or CCD in ICC2). The tool balances setup gain against hold risk, inserting hold buffers as needed. Signoff STA engines such as PrimeTime and Tempus then verify the resulting clock tree with propagated clocks.

How Skew Appears in STA Reports

After CTS, every timing path report from tools like PrimeTime includes the clock arrival times for both the launch and capture flip-flops. The skew is visible as the difference between these two numbers.

  Timing Path Report — Setup Check
  ─────────────────────────────────────────────────────────
  Path:  FF_A/Q  →  [comb logic]  →  FF_B/D

  Data path:
    Launch clock latency (FF_A)   0.38 ns
    FF_A clock-to-Q               0.18 ns
    Combinational delay           0.62 ns
    ──────────────────────────────────────
    Data arrival at FF_B/D        1.18 ns

  Clock path (cumulative arrival):
    Clock source (PLL out)        0.00 ns
    Buffer BUF1                   0.15 ns
    Buffer BUF2 (launch, FF_A)    0.38 ns   ← launch latency
    Buffer BUF3 (capture, FF_B)   0.46 ns   ← capture latency

    Clock period                  1.00 ns
    Capture edge                  1.46 ns   (1.00 + 0.46)
    FF_B setup time               0.05 ns
    Data required time            1.41 ns   (1.46 − 0.05)

    Skew = 0.46 − 0.38 = +0.08 ns (positive, helps setup)

  Setup slack = 1.41 − 1.18 = +0.23 ns  ✓ PASS
Reading skew from STA: Launch clock latency is added to the data arrival time; capture clock latency is added to the required time. A positive skew (capture latency > launch latency) widens the window available for data, improving setup slack. Always check the hold path after seeing positive skew, because it costs the same amount on hold.

Common CTS Problems and Fixes

Problem | Root cause | Fix
High local skew on critical path | Unequal buffer stages or wire length to launch vs capture FF | Add/resize buffers on the shorter branch to equalize delay; use useful skew carefully
Clock slew violation | Too much capacitance on a clock node (high fanout or long wire) | Insert additional buffer stage to reduce per-buffer load; upsize buffer drive strength
Hold violations post-CTS | Positive skew from CTS or useful skew tightened hold margin below zero | Insert delay buffers (hold buffers) on the data path of the violating paths
Clock tree power too high | Overly deep buffer tree or insufficient clock gating | Add clock gating (ICG) on idle register banks; target minimum insertion delay in CTS spec
Skew mismatch between corners | Different RC extraction or cell delay at SS vs FF corner | Run CTS with worst-case extraction; use AOCV/POCV derating on clock tree cells
Post-route skew degradation | Clock wires re-routed during detailed routing, changing RC | Use clock net shielding and NDR (Non-Default Rules) for clock wires; re-run CTO post-route

Frequently Asked Questions

What is the difference between clock skew and clock jitter?
Clock skew is a spatial effect: the fixed difference in clock arrival time between two physical flip-flops caused by unequal clock tree delays. It is deterministic and reproducible for a given design. Clock jitter is a temporal effect: the cycle-to-cycle variation in the clock period caused by PLL noise, supply noise, and thermal variation. Jitter changes every clock cycle and is modeled statistically. STA accounts for both: skew appears in the path timing equations, while jitter is subtracted from the available clock period as an uncertainty budget.

Why does positive skew help setup but hurt hold?
Setup checks that data arrives before the capture clock edge. If the capture clock is delayed (positive skew), the deadline moves later, so data has more time and setup slack increases. Hold checks that data does not change too soon after the capture clock edge. The same delayed capture edge means the hold window starts later, and the data path (which launches from the fixed launch FF) may already be changing when the window opens, which reduces hold slack. Both effects come from the same skew term appearing with opposite signs in the setup and hold equations.

What is an H-tree, and where is it used?
An H-tree is a clock routing structure where the clock travels from a center point through wires forming a letter H, then each endpoint fans out through another H, recursively. At every level, the left and right branches have identical wire lengths. This geometric symmetry means the RC delay from the center to every leaf point is identical before any buffers are inserted. H-trees are used in memory arrays and custom datapath blocks where the regular grid layout makes them practical. Irregular ASIC floorplans lack the geometric regularity an H-tree needs.

What is the difference between pre-CTS and post-CTS timing?
Pre-CTS timing uses an ideal clock model: no clock tree exists yet, so insertion delay is zero or an SDC estimate and skew is zero. This is how RTL synthesis and early physical implementation timing are analyzed. Post-CTS timing uses propagated clocks: actual cell delays and wire RC from extraction are annotated onto the clock tree, so real insertion delay and real skew appear in every path's timing. A design that meets timing pre-CTS may fail post-CTS if clock latency is higher than the estimated SDC value, or if skew is larger than assumed. Signoff timing with propagated clocks is the authoritative result.

How does CTS interact with scan test?
During scan test mode, the Scan Enable (SE) signal forces all ICG (clock gating) cells open, allowing the test clock to reach every flip-flop regardless of the functional enable signals. CTS must ensure the scan clock also meets slew and fanout requirements through the ICG cells. At-speed test requires the launch and capture pulses to propagate with functional-clock timing, even though scan shift itself usually runs at a lower frequency, so the clock tree must work correctly in both functional and scan modes. CTS constraints usually include a scan_mode scenario to verify this.

What are Non-Default Rules (NDR), and why are they used on clock nets?
Non-Default Rules (NDR) are special routing rules applied to clock nets to make them more robust than data nets. Typical NDRs for clocks use double-width wires (reducing resistance) and double spacing between clock wires and neighboring wires (reducing capacitive coupling noise). Clock nets may also use shielding, with grounded wires on both sides of the clock wire, to prevent switching noise on data nets from coupling into the clock signal and causing jitter or glitches. NDRs increase clock wire area but are essential for clock integrity at advanced nodes.
