H-tree vs fishbone vs mesh topologies, clock skew budgets, insertion delay, useful skew, NDR rules, hold violation fixing, power optimization, and the complete Innovus/ICC2 CTS flow — with real numbers from billion-transistor chips.
Every flip-flop in the design must receive its clock edge at nearly the same instant. In a modern SoC running at 3 GHz, one clock period is 333 ps. Setup timing requires data to be stable at the capture FF at least one setup time (for example, 80 ps) before the capturing clock edge, so clock-to-Q delay, logic delay, and setup time must all fit inside that 333 ps. Hold timing is checked against the same clock edge that launched the data, not the next one: the new value must not reach the capture FF until the hold window after that edge has closed. This leaves a razor-thin margin, and clock skew (the difference in clock arrival times between flip-flops) eats directly into both checks.
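CTS is the step that builds the physical clock distribution network: a tree of buffers and inverters from the clock source to every sink (flip-flop clock pin). It must simultaneously keep skew within target, limit insertion delay, meet clock slew and fanout limits, and hold down clock power.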
The H-tree is the classic symmetric topology where the clock is routed in an H-pattern that recursively divides the chip area. At each H junction, the wire splits into equal-length branches, guaranteeing equal path length — and therefore equal insertion delay — to all leaf sinks. This is the closest thing to a mathematically perfect balanced tree.
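The equal-path-length property can be checked with a short sketch. The function below is a hypothetical illustration, not a tool algorithm: it assumes an ideal square block, dimensions in microns, and Manhattan wire length, and it ignores buffers and blockages.

```python
def h_tree_sinks(cx, cy, half_w, half_h, levels):
    """Return (x, y, wire_length_from_root) for every leaf of an ideal H-tree."""
    def recurse(x, y, hw, hh, depth, length):
        if depth == 0:
            return [(x, y, length)]
        sinks = []
        for dx in (-hw, hw):        # left/right along the horizontal bar of the H
            for dy in (-hh, hh):    # up/down along the vertical strokes of the H
                sinks += recurse(x + dx, y + dy, hw / 2, hh / 2,
                                 depth - 1, length + abs(dx) + abs(dy))
        return sinks
    return recurse(cx, cy, half_w, half_h, levels, 0.0)

# Assumed example: three levels of recursion, first H spans +/-250 um.
sinks = h_tree_sinks(0.0, 0.0, 250.0, 250.0, levels=3)
lengths = {length for _, _, length in sinks}
print(len(sinks), lengths)   # 64 {875.0}
```

Every one of the 64 leaves sees the same 875 um of wire from the root, which is why the ideal H-tree has zero nominal skew. Any difference in real silicon comes from variation, loading, and floorplan irregularity.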
H-tree limitations: Real designs are not square, not uniform in FF density, and have macros that cannot be crossed. The textbook H-tree works for regular arrays (DRAM, SRAM sense amps) but breaks down for irregular SoC floorplans. Tools use modified balanced tree algorithms instead, inserting buffered branches of different length to compensate for irregular sink distributions.
The fishbone places a single horizontal "spine" wire across the chip (or block), with shorter "ribs" branching off vertically. It trades the perfect symmetry of the H-tree for better routability in elongated blocks. The spine is a very wide, shielded wire on a top metal layer with strong buffers driving it; the ribs are shorter and thinner. Fishbone is common in memory compilers and CPU pipeline lanes.
A clock mesh is a continuous grid of wires covering the block, driven from multiple injection points. Every FF is very close to a mesh wire, so RC variation is minimised — this gives the lowest possible skew (often <10 ps across a block). The cost: clock mesh consumes 5–15% of routing resources and has enormous capacitance → high dynamic power. Used in high-end CPUs (Intel, AMD) where skew control is paramount and power budget allows it.
| Topology | Skew | Power | Routing Area | Best Use |
|---|---|---|---|---|
| H-tree | 50–200 ps | Low | Low | Regular arrays, DRAM, small blocks |
| Balanced tree | 50–150 ps | Medium | Medium | Standard SoC CTS (most common) |
| Fishbone/spine | 100–300 ps | Low–medium | Medium | Long pipeline blocks, memory columns |
| Clock mesh | <10–30 ps | Very high (2–5× tree) | Very high | High-end CPUs, GPU shader arrays |
| Hybrid (tree + local mesh) | 20–80 ps | High | High | Modern Apple/AMD CPUs |
Skew is a signed quantity relative to the data path direction. The same skew that adds 100 ps to your setup slack on one path steals 100 ps from the hold slack on the same path. Aggressive useful skew therefore tends to create hold violations wherever the existing hold margin is smaller than the skew applied, and these must be repaired with delay buffers: the "hold tax" of useful skew.
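The trade-off can be made concrete with the standard single-cycle setup and hold equations. This is a minimal sketch in picoseconds: the 333 ps period and 80 ps setup time come from the text, while the clock-to-Q, data delay, and hold values are assumed for illustration. Skew is defined here as capture clock arrival minus launch clock arrival.

```python
def setup_slack(period, skew, t_cq, t_data_max, t_setup):
    """Setup: data launched at edge N must arrive before edge N+1 at the capture FF."""
    return period + skew - (t_cq + t_data_max + t_setup)

def hold_slack(skew, t_cq, t_data_min, t_hold):
    """Hold: data launched at edge N must not overwrite what is captured at edge N."""
    return (t_cq + t_data_min) - (skew + t_hold)

PERIOD, T_SETUP = 333, 80                 # from the text
T_CQ, T_HOLD = 40, 20                     # assumed cell timing
T_DATA_MAX, T_DATA_MIN = 230, 60          # assumed slow/fast path delays

for skew in (0, 100):
    print(skew,
          setup_slack(PERIOD, skew, T_CQ, T_DATA_MAX, T_SETUP),
          hold_slack(skew, T_CQ, T_DATA_MIN, T_HOLD))
# 0 -17 80
# 100 83 -20
```

With these assumed numbers, 100 ps of positive skew turns a 17 ps setup violation into 83 ps of slack, but it also turns 80 ps of hold margin into a 20 ps hold violation.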
| Skew Type | Measured Between | Typical Target | Impact |
|---|---|---|---|
| Global skew | Any two FFs on the chip | 200–500 ps | Determines worst-case timing margin across chip |
| Local skew | Launch and capture FF of same data path | 30–100 ps | Directly sets setup/hold margin for that path |
| Inter-domain skew | FFs in different clock domains | N/A (CDC handled separately) | Must be managed with synchronizers, not minimized |
For a 3 GHz design (T = 333 ps), a typical skew budget looks like:
Clock nets are routed with non-default rules (NDR) that are stricter than signal nets to ensure signal integrity and consistent RC delay. The main NDR characteristics for clock wires:
| NDR Parameter | Default Signal Wire | Clock NDR Rule | Reason |
|---|---|---|---|
| Wire width | 1× minimum width | 2× minimum width | Lower resistance → controlled latency; less EM risk on high-frequency toggling wire |
| Wire spacing | 1× minimum spacing | 2× minimum spacing | Shielding gap prevents crosstalk from aggressor nets causing glitch on clock |
| Shielding | None | VDD/VSS shield on both sides | Eliminates capacitive coupling from switching signals onto clock net |
| Via type | Single via | Via array (2×2 or 3×3) | Reliability; clock vias switch 100M–3G times/second; single-via EM risk is high |
| Metal layer | M1–M5 | M4–M8 (higher preferred) | Higher metals have lower RC per unit length → less skew from RC variation |
A 2W/2S clock wire with shields consumes 6–8× more routing tracks than a standard signal wire. In congested designs, clock shielding can cause detour routing of other signals, which itself creates timing issues. Tools have routing cost functions that balance clock quality against congestion. Reducing shielding to critical clock segments only (near the root) is a common congestion relief technique.
# 1. Define the non-default routing rule for clock nets
add_ndr \
-name CLK_NDR_2W2S \
-spacing {M3:M7 0.14} \
-width {M3:M7 0.14} \
-via {V3:V6 VIAX2_2x2}
# Wire widths are 2× min; spacing is 2× min at each layer
# 2. Create clock specification for CTS
create_ccopt_clock_tree_spec \
-file cts_spec.tcl \
-immediate
# 3. Core CTS settings
set_ccopt_property buffer_cells \
{CLKBUF1 CLKBUF2 CLKBUF4 CLKBUF8 CLKBUF16}
set_ccopt_property inverter_cells \
{CLKINV1 CLKINV2 CLKINV4 CLKINV8}
set_ccopt_property target_max_trans 0.10 ;# 100ps max slew on clock nets
set_ccopt_property target_skew 0.050 ;# 50ps local skew target
set_ccopt_property max_fanout 16 ;# max FF per leaf buffer
# 4. Run CTS
ccopt_design
# 5. Verify results
report_ccopt_clock_trees -summary
report_clock_timing -type skew -clocks [all_clocks]
report_clock_timing -type latency -clocks [all_clocks]

Useful skew (sometimes called intentional skew or clock skew scheduling) deliberately sets different clock arrival times at different flip-flops to fix timing violations. Instead of treating all skew as harmful, useful skew treats it as a design knob.
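As a rough illustration of useful skew, consider a hypothetical two-stage pipeline FF1 → logic A → FF2 → logic B → FF3, where stage A is slow and stage B is fast. Delaying FF2's clock lends time from B to A. All numbers below except the 333 ps period are assumed, and the sketch looks only at setup slack.

```python
def stage_slacks(period, overhead, d_a, d_b, shift):
    """Setup slack of stage A (FF1->FF2) and stage B (FF2->FF3), in ps,
    when FF2's clock is delayed by `shift` ps."""
    slack_a = period + shift - overhead - d_a   # FF2 captures later: A gains time
    slack_b = period - shift - overhead - d_b   # FF2 launches later: B loses time
    return slack_a, slack_b

PERIOD, OVERHEAD = 333, 120   # ps; overhead = assumed clk-to-Q + setup time
D_A, D_B = 250, 150           # ps; assumed logic delays of the two stages

before = stage_slacks(PERIOD, OVERHEAD, D_A, D_B, shift=0)
best_shift = (D_A - D_B) // 2                   # equalises the two slacks
after = stage_slacks(PERIOD, OVERHEAD, D_A, D_B, shift=best_shift)
print(before, best_shift, after)   # (-37, 63) 50 (13, 13)
```

A 50 ps delay on FF2's clock clears the violation in stage A without any logic change. The hold check on stage A tightens by the same 50 ps, so a real flow would re-verify hold after applying the shift.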
After CTS, hold violations are almost guaranteed. CTS changes the absolute clock latency of every flip-flop, and these latency changes are not uniform — they create new skew relationships that pre-CTS STA never modeled. Hold fixing is therefore always a post-CTS task, not a pre-CTS task.
Pre-CTS, the tool uses an ideal clock model (zero latency everywhere). Post-CTS, a particular FF may get a clock that arrives 800 ps later than another, due to its position in the tree. If a data path exists between these two FFs and the data delay is short (e.g. 150 ps), then:
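The arithmetic for this scenario can be sketched as follows. The 800 ps latency difference and 150 ps data delay come from the text; the 20 ps hold time and the 40 ps delay per buffer are assumed values, and the 150 ps is taken to include clock-to-Q delay.

```python
import math

def hold_slack_ps(launch_latency, capture_latency, data_delay, t_hold):
    """Hold slack in ps: new data must arrive after the capture FF's hold window closes."""
    return (launch_latency + data_delay) - (capture_latency + t_hold)

LAUNCH_LATENCY = 0        # ps, reference FF
CAPTURE_LATENCY = 800     # ps, clock arrives 800 ps later at the capture FF (from the text)
DATA_DELAY = 150          # ps, short data path (from the text)
T_HOLD = 20               # ps, assumed library hold time
BUF_DELAY = 40            # ps, assumed delay added per hold buffer

slack = hold_slack_ps(LAUNCH_LATENCY, CAPTURE_LATENCY, DATA_DELAY, T_HOLD)
buffers_needed = math.ceil(-slack / BUF_DELAY) if slack < 0 else 0
print(slack, buffers_needed)   # -670 17
```

A 670 ps hold violation would need about 17 delay buffers on one path under these assumptions. Skew of that size between connected flip-flops is therefore normally reduced in the clock tree first, and buffers are used only for the remainder.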
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Insert delay buffers | Add BUFX1/BUFX2 cells in data path to increase t_data | Simple, effective, precise | Increases area; can worsen setup on same path if setup already tight |
| Resize buffers | Swap data-path cells to slower variants (e.g. LVT → HVT, or a weaker drive strength) to increase delay | Little or no area overhead | Limited delay range; changes power profile |
| Reroute path | Detour data path wire to increase RC delay | No cell changes | Routing congestion; hard to control precisely |
| Adjust useful skew | Reduce beneficial skew to lessen hold exposure | Systemic fix | May re-introduce setup violations that were fixed by skew |
# After CTS, run timing analysis to reveal hold violations
setAnalysisMode -analysisType onChipVariation -cppr both
timeDesign -postCTS -hold -expandedViews -outDir timing_reports/postCTS_hold
# Fix hold violations by inserting delay cells in data paths
setOptMode \
    -fixHoldAllowSetupTnsDegrade false \
    -holdTargetSlack 0.0 \
    -holdFixingCells {BUFX1 BUFX2 BUFX4 DLYG1 DLYG2}
optDesign -postCTS -hold
# Verify all hold violations fixed
# Innovus report_timing selects hold (min-delay) paths with -early
report_timing -early \
    -path_type full_clock \
    -max_paths 20 \
    -max_slack 0

The clock network is almost always the largest single consumer of dynamic power in a synchronous digital design, typically 20–40% of total chip power. At 3 GHz, every clock net toggles 3 billion times per second, charging and discharging its capacitance on each edge.
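A back-of-the-envelope estimate uses the usual dynamic power relation P = α·C·V²·f, with α = 1 for an ungated clock net because it completes one full charge/discharge cycle every period. The capacitance, voltage, and gating fractions below are assumed purely for illustration and do not describe any specific chip.

```python
def clock_power_w(cap_farads, vdd, freq_hz, activity=1.0):
    """Dynamic power in watts: activity * C * V^2 * f."""
    return activity * cap_farads * vdd ** 2 * freq_hz

C_CLK = 2e-9      # F, assumed total switched clock capacitance (wires + pins + buffers)
VDD = 0.75        # V, assumed supply
FREQ = 3e9        # Hz, from the text

ungated = clock_power_w(C_CLK, VDD, FREQ)

# Assumed gating profile: 70% of the capacitance sits behind clock gates
# that are enabled 30% of the time; the remaining 30% always toggles.
GATED_FRACTION, ENABLE_DUTY = 0.7, 0.3
effective_activity = (1 - GATED_FRACTION) + GATED_FRACTION * ENABLE_DUTY
gated = clock_power_w(C_CLK, VDD, FREQ, effective_activity)

print(round(ungated, 3), round(gated, 3), round(1 - gated / ungated, 2))
# 3.375 1.721 0.49
```

Under these assumptions the clock burns about 3.4 W ungated, and gating cuts that roughly in half. The saving scales with how much capacitance sits below the gates, which is why gates placed high in the tree save more than leaf-level gates.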
// RTL style that synthesis infers as ICG
always_ff @(posedge clk) begin
if (enable)
q <= d; // synthesis tool maps this to ICG + regular FF
end
// Explicit ICG cell instantiation (post-synthesis or in custom blocks)
// Library cell: ICGX1 has: CLK, EN, SE (scan enable) -> GCLK
ICGX1 u_clk_gate (
.CLK (clk_in), // ungated clock input
.EN (reg_enable), // functional enable
.SE (scan_en), // scan enable (bypasses gate for ATPG)
.GCLK (clk_gated) // gated clock to downstream FFs
);
// Synthesis directive to preserve ICG (prevent optimization removing it)
// set_dont_touch [get_cells u_clk_gate]

| Step | Command / Action | What to Check |
|---|---|---|
| 1. Pre-CTS timing | timeDesign -preCTS | Setup WNS/TNS with ideal clocks. Setup violations should be fixed, or be small enough to accept, before CTS. |
| 2. Load CTS spec | create_ccopt_clock_tree_spec | All defined clock roots present; buffer/inverter list loaded from tech lib. |
| 3. Set NDR rules | add_ndr; set_ccopt_property route_type NDR_CLK | NDR rule covers all metal layers used by clock; double-check via rule. |
| 4. Run CTS | ccopt_design | Skew < target; max slew < 100 ps; max fanout respected. |
| 5. Post-CTS STA (setup) | timeDesign -postCTS | Setup WNS ≥ 0 (or within ECO budget); TNS near zero. |
| 6. Post-CTS hold fix | optDesign -postCTS -hold | Hold WNS ≥ 0 across all corners; no new setup violations created. |
| 7. Verify clock reports | report_ccopt_clock_trees | Latency, skew, buffer count, power estimate per clock domain. |
| 8. Clock mesh (if used) | Separate mesh synthesis flow | Mesh injection point distribution; droop across mesh <30 ps differential. |
# Post-CTS: check clock tree quality
report_ccopt_clock_trees \
-summary \
-file cts_summary.rpt
# Skew report per clock domain
report_clock_timing \
-type skew \
-clocks [all_clocks] \
-verbose \
-file cts_skew.rpt
# Latency (insertion delay) report
report_clock_timing \
-type latency \
-clocks [all_clocks] \
-file cts_latency.rpt
# Buffer count and power estimate
report_clock_timing \
-type summary
# Check for slew violations on clock nets
report_constraint -check_ports -all_violators \
-drv_types max_transition \
-file clock_slew_violations.rpt
# Run MCMM check across all corners/modes
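# MCMM check: activate the setup and hold analysis views, then time them together.
# List every signoff view here; timeDesign analyzes all active views in one run.
set_analysis_view \
    -setup [list func_tt_setup] \
    -hold [list func_ss_hold]
timeDesign -postCTS -hold

| Chip | Process | Clock Freq | Local Skew | Insertion Delay | Clock Power | Topology |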
|---|---|---|---|---|---|---|
| Apple A17 Pro | TSMC 3nm | 3.78 GHz (P-core) | <30 ps | ~0.6 ns | ~15% of chip power | Hybrid tree + local mesh |
| AMD Ryzen 9 (Zen 4) | TSMC 5nm | 5.7 GHz (boost) | <25 ps | ~0.5 ns | ~18% of chip power | Optimised balanced tree |
| Intel Core i9 (13th gen) | Intel 7 | 5.8 GHz (boost) | <20 ps | ~0.45 ns | ~20% of chip power | Clock mesh on P-cores |
| Qualcomm Snapdragon 8 Gen 3 | TSMC 4nm | 3.3 GHz | <40 ps | ~0.7 ns | ~12% of chip power | Balanced tree + ICG heavy |
| AMD EPYC Genoa | TSMC 5nm | 3.7 GHz (base) | <35 ps | ~0.9 ns (larger die) | ~22% of chip power | Per-chiplet tree, mesh within core |
One reason AMD's Zen 4 achieves such high boost frequencies is clock discipline: very tight local skew (<25 ps), aggressive ICG (nearly all idle logic clock-gated), and per-core clock domains that can individually boost while others idle. The CTS team essentially treats each core as an independent sub-chip with its own balanced tree, then connects them through a low-skew inter-core distribution layer.
| # | Question | Key Points in the Answer |
|---|---|---|
| 1 | What is the difference between clock skew and clock jitter? | Skew is a static, spatial difference between two FF clock arrivals: deterministic, caused by unequal buffer/wire delays. Jitter is a temporal variation of the clock edge arrival at a single FF from cycle to cycle, caused by supply noise, thermal noise, and PLL behavior. Before CTS, expected skew is budgeted in STA as clock uncertainty; after CTS it comes from the propagated clock latencies. Jitter is taken from the PLL spec and remains in clock uncertainty (plus any derating) throughout. |
| 2 | Why do hold violations increase after CTS? | Before CTS, ideal clock model assumes zero latency everywhere. After CTS, real buffers create unequal latency. FFs that are close in the data domain may be far apart in the clock tree, creating large skew. Large positive skew → large hold risk. Short data paths between these FFs violate hold timing that didn't exist pre-CTS. |
| 3 | What is an NDR rule and why is it needed for clocks? | Non-default routing rule: wider wires (2× min width → lower resistance), wider spacing (2× min → shielding gap), and via arrays. Needed to: (1) control RC-based latency variation, (2) prevent crosstalk glitches on the clock from switching signal aggressors, (3) reduce EM risk at GHz switching frequency. |
| 4 | What is useful skew and what is the downside? | Useful skew = intentionally delaying the capture FF's clock to give a tight path more effective cycle time. Fixes setup. The downside: the same delay worsens hold on that same path and takes time away from the next stage. As a rule of thumb, useful skew beyond roughly 15–20% of the clock period makes the resulting hold violations impractical to fix. |
| 5 | How does clock gating reduce power? | ICG cells gate the clock to idle register groups. If a block is idle 80% of the time, its clock capacitance stops switching 80% of the time, reducing dynamic power by 80% for that branch. Chip-level clock gating with 30% average activity → 50–70% reduction in total clock dynamic power. |
| 6 | What is clock mesh and when do you use it? | Clock mesh is a full grid of driven wires, giving sub-10 ps skew. Used in high-performance CPU cores where skew is the binding constraint and power budget allows the 2–5× overhead vs tree topology. Not practical for full-chip; used per block (e.g., per CPU core cluster). |