
Hierarchical Design & Multi-Die Integration

Top-down partitioning, black-box methodology, 2.5D/3D chiplet integration, UCIe standard, inter-die timing closure, and production chiplet validation strategies.

By EcrioniX Engineering Team · Published June 14, 2026 · ~4,800 words · 15 min read

1. Why Hierarchical Design?

Modern chips contain 10–100+ billion transistors. No EDA tool — and no team — can successfully place and route a billion-gate flat design in a reasonable timeframe. Hierarchical design solves this by breaking the chip into independently implementable blocks.

Benefits of hierarchical design:

Hierarchical Design Partitioning (Mobile SoC)
TOP LEVEL (SoC Assembly)
┌──────────────────────────────────────────────────────────────────────┐
│                                                                      │
│ ┌────────────────────┐ ┌────────────────────┐ ┌────────────────────┐ │
│ │    CPU CLUSTER     │ │    GPU COMPLEX     │ │  MEMORY SUBSYSTEM  │ │
│ │                    │ │                    │ │                    │ │
│ │ ┌──────┐ ┌──────┐  │ │ ┌──────┐ ┌──────┐  │ │  ┌──────────────┐  │ │
│ │ │P-Core│ │E-Core│  │ │ │Shader│ │Shader│  │ │  │   L3 Cache   │  │ │
│ │ │  ×4  │ │  ×4  │  │ │ │Array │ │Array │  │ │  │     16MB     │  │ │
│ │ └──────┘ └──────┘  │ │ └──────┘ └──────┘  │ │  └──────────────┘  │ │
│ │ ┌────────────────┐ │ │ ┌────────────────┐ │ │  ┌──────────────┐  │ │
│ │ │   Shared L2    │ │ │ │ Texture Cache  │ │ │  │  DRAM Ctrl   │  │ │
│ │ └────────────────┘ │ │ └────────────────┘ │ │  └──────────────┘  │ │
│ └────────────────────┘ └────────────────────┘ └────────────────────┘ │
│                                                                      │
│ ┌────────────────────┐ ┌────────────────────┐ ┌────────────────────┐ │
│ │    MEDIA ENGINE    │ │   NEURAL ENGINE    │ │     I/O & PHY      │ │
│ │   (Video encode)   │ │   (ML inference)   │ │ (USB, PCIe, WiFi)  │ │
│ └────────────────────┘ └────────────────────┘ └────────────────────┘ │
│                                                                      │
│ ═════════════════ On-chip Network (NoC / Ring Bus) ═════════════════ │
└──────────────────────────────────────────────────────────────────────┘

Design teams: ~15 engineers per block, 100+ physical design engineers in total
Timeline: 18–24 months from spec to tape-out

2. Top-Down Partitioning Flow

2.1 Partition Criteria

Good partitioning minimizes interface complexity while enabling maximum parallel work. Key criteria:

2.2 Black Box Methodology

Black Box Interface Contract
Block Interface Definition (CPU Core, fixed for entire project):
┌────────────────────────────────────────────────────────────┐
│ CPU_CORE_v1                                                │
│                                                            │
│ INPUTS:                                                    │
│   clk_cpu[1]      - Clock input (3.49GHz max)              │
│   rst_n[1]        - Active-low reset                       │
│   instr_bus[128]  - Instruction fetch data                 │
│   data_bus[512]   - L2 cache data bus                      │
│   irq[8]          - Interrupt requests                     │
│   pwr_en[1]       - Power domain enable                    │
│                                                            │
│ OUTPUTS:                                                   │
│   instr_addr[48]  - Instruction fetch address              │
│   data_addr[48]   - Load/store address                     │
│   data_wr[512]    - Write data to L2                       │
│   data_wr_en[1]   - Write enable                           │
│   perf_cnt[32]    - Performance counters                   │
│                                                            │
│ TIMING:           - All I/O at clk_cpu edge                │
│ AREA:             - 4.5mm² at 75% utilization              │
│ POWER:            - 1.8W at 3.49GHz, 1.05V                 │
│ PIN LOCATIONS:    - Fixed at block boundary (±5µm)         │
└────────────────────────────────────────────────────────────┘

This contract NEVER changes after project kickoff.
Top-level timing closure depends on this stability.
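One way to make a frozen contract checkable is to keep the port list as data and fingerprint it. The sketch below is only an illustration: `Port`, `pin_count`, and `contract_fingerprint` are hypothetical names, not part of any EDA tool, and the port list is copied from the CPU_CORE_v1 contract above. Any change to a port name, direction, or width changes the hash, which a check-in script could flag.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    name: str
    direction: str  # "in" or "out"
    width: int      # number of bits


# Ports copied from the CPU_CORE_v1 contract above.
CPU_CORE_V1 = (
    Port("clk_cpu", "in", 1),
    Port("rst_n", "in", 1),
    Port("instr_bus", "in", 128),
    Port("data_bus", "in", 512),
    Port("irq", "in", 8),
    Port("pwr_en", "in", 1),
    Port("instr_addr", "out", 48),
    Port("data_addr", "out", 48),
    Port("data_wr", "out", 512),
    Port("data_wr_en", "out", 1),
    Port("perf_cnt", "out", 32),
)


def pin_count(ports, direction):
    """Total number of signal pins in the given direction."""
    return sum(p.width for p in ports if p.direction == direction)


def contract_fingerprint(ports):
    """SHA-256 over the name-sorted port list; independent of declaration order."""
    ordered = sorted(ports, key=lambda p: p.name)
    text = "\n".join(f"{p.name}:{p.direction}:{p.width}" for p in ordered)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print("input pins: ", pin_count(CPU_CORE_V1, "in"))
    print("output pins:", pin_count(CPU_CORE_V1, "out"))
    print("fingerprint:", contract_fingerprint(CPU_CORE_V1)[:16])
```

The pin totals also give the top-level team an early estimate of how many wires must cross the block boundary. A real flow would presumably fingerprint pin coordinates and timing constraints as well.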

2.3 Interface Timing Budgeting

Hierarchical Timing Budget Allocation:

Total timing budget (3.49GHz) = 1 / 3.49GHz = 286ps

Allocation for a path from CPU_CORE → L2_CACHE:
  CPU_CORE internal logic:       100ps  (internal design budget)
  Output pad + driver:            15ps  (cell delays at block boundary)
  Interconnect (on-chip wire):    40ps  (routing from CPU to L2)
  L2_CACHE input logic:           80ps  (internal design budget)
  Setup time (FF):                10ps
  Clock uncertainty:              41ps
  ──────────────────────────────────────────────────────────
  Total:                         286ps  ← uses full budget exactly!

Implementation rule:
  Block teams own their internal budget (100ps / 80ps above)
  Top-level team owns the routing wire budget (40ps)
  Any budget overrun requires negotiation and re-budgeting
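The allocation above can be kept as a small table and checked automatically. This is a minimal sketch, assuming the six budget entries listed above; the dictionary keys and function names are made up for illustration and do not correspond to any STA tool.

```python
CLOCK_GHZ = 3.49

# Budget entries in picoseconds, taken from the allocation above.
BUDGET_PS = {
    "cpu_core_internal": 100,
    "output_driver": 15,
    "interconnect": 40,
    "l2_input_logic": 80,
    "ff_setup": 10,
    "clock_uncertainty": 41,
}


def period_ps(freq_ghz):
    """Clock period in picoseconds for a frequency in GHz."""
    return 1000.0 / freq_ghz


def budget_slack_ps(budget, freq_ghz):
    """Period minus the sum of all budget entries; negative means over-allocated."""
    return period_ps(freq_ghz) - sum(budget.values())


def check_overrun(budget, actual):
    """Return {segment: overrun_ps} for segments whose actual delay exceeds its budget."""
    return {
        seg: actual[seg] - budget[seg]
        for seg in budget
        if seg in actual and actual[seg] > budget[seg]
    }


if __name__ == "__main__":
    print(f"period = {period_ps(CLOCK_GHZ):.1f} ps")
    print(f"allocated = {sum(BUDGET_PS.values())} ps")
    print(f"unallocated = {budget_slack_ps(BUDGET_PS, CLOCK_GHZ):.2f} ps")
    # Hypothetical measurement: top-level routing came in 8 ps over budget.
    print(check_overrun(BUDGET_PS, {"interconnect": 48}))
```

With under 1 ps unallocated, any segment that overruns has to be paid for by another owner, which is why overruns trigger re-budgeting rather than a local fix.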

3. Chiplet Architecture — The New Paradigm

Chiplets replace monolithic single-die designs with assemblies of smaller specialized dies. Instead of cramming CPU, GPU, and memory controller onto one die, each function becomes a separate chip manufactured on the optimal process node.

Why Chiplets Dominate the Industry

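A 500mm² monolithic die at 5nm yields roughly 10%, because most dies contain at least one killer defect. Split the same logic into five 100mm² chiplets and each yields about 60%. If untested chiplets were assembled blindly, system yield would be 0.6^5 ≈ 7.8%, no better than the monolithic die. The gain comes from testing each chiplet before assembly (known-good die): only the defective chiplets are discarded, not the entire assembly, so about 60% of the fabricated silicon is usable instead of about 10%. In addition, the CPU can use an advanced node such as TSMC 5nm while analog and I/O stay on a cheaper node such as 28nm.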
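The yield figures above are consistent with a simple Poisson defect model, Y = exp(-A × D0). The sketch below assumes that model (the source does not name one), calibrates D0 from the 10% monolithic yield, and compares silicon consumed per good product. It also assumes a perfect known-good-die test and ignores assembly yield, so treat the numbers as illustrative only.

```python
import math


def poisson_yield(area_mm2, d0_per_mm2):
    """Poisson yield model: probability that a die of this area has zero defects."""
    return math.exp(-area_mm2 * d0_per_mm2)


def defect_density_from_yield(area_mm2, yield_fraction):
    """Invert the Poisson model to recover defect density D0 (defects per mm²)."""
    return -math.log(yield_fraction) / area_mm2


def silicon_per_good_unit(area_mm2, yield_fraction):
    """Average silicon area fabricated per good die."""
    return area_mm2 / yield_fraction


if __name__ == "__main__":
    d0 = defect_density_from_yield(500, 0.10)   # calibrated to the 10% figure
    chiplet_yield = poisson_yield(100, d0)
    mono_area = silicon_per_good_unit(500, 0.10)
    kgd_area = 5 * silicon_per_good_unit(100, chiplet_yield)

    print(f"D0 = {d0 * 100:.2f} defects/cm²")
    print(f"100mm² chiplet yield = {chiplet_yield:.1%}")
    print(f"blind 5-chiplet assembly yield = {chiplet_yield ** 5:.1%}")
    print(f"silicon per good product: monolithic {mono_area:.0f} mm², "
          f"chiplets with KGD test {kgd_area:.0f} mm²")
```

Under this model a 100mm² chiplet yields about 63%, close to the ~60% quoted above, and blind assembly of five chiplets yields exactly what the monolithic die does. The saving appears only once chiplets are tested before assembly.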

Chiplet Package Architecture (AMD EPYC Genoa Equivalent)
Package Top-Down View
┌─────────────────────────────────────────────────────────────────┐
│                        Organic Substrate                        │
│                                                                 │
│    ┌─────────────┐     ┌─────────────┐     ┌─────────────┐      │
│    │   CCD #0    │     │   CCD #1    │     │   CCD #2    │      │
│    │   CPU 3nm   │     │   CPU 3nm   │     │   CPU 3nm   │      │
│    │  12 cores   │     │  12 cores   │     │  12 cores   │      │
│    └──────┬──────┘     └──────┬──────┘     └──────┬──────┘      │
│           │                   │                   │             │
│    ┌──────┴───────────────────┴───────────────────┴──────┐      │
│    │                 IOD (I/O Die, 6nm)                  │      │
│    │      PCIe 5.0 × 128    │   DDR5 Memory Ctrl ×8      │      │
│    │      USB 4.0 ×4        │   Infinity Fabric          │      │
│    └─────────────────────────────────────────────────────┘      │
│                                                                 │
│    ┌─────────────┐     ┌─────────────┐     ┌─────────────┐      │
│    │   CCD #3    │     │   CCD #4    │     │   CCD #5    │      │
│    │   CPU 3nm   │     │   CPU 3nm   │     │   CPU 3nm   │      │
│    └─────────────┘     └─────────────┘     └─────────────┘      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Die-to-die connections: ~1,000 µbumps per CCD↔IOD interface
Bump pitch: 55µm (organic substrate) or 10µm (silicon interposer)
Bandwidth: 2.4GHz × 512 bits ≈ 1.2 Tb/s (≈154 GB/s) per CCD↔IOD link (Infinity Fabric)

4. 2.5D and 3D Integration

4.1 2.5D — Silicon Interposer

A passive silicon interposer acts as a routing substrate between chiplets, enabling much higher density connections than an organic package substrate.

Technology              | Bump Pitch | Bandwidth/mm | Example
Organic substrate       | 130–400µm  | ~1 GB/s/mm   | AMD EPYC (early)
2.5D silicon interposer | 40–100µm   | ~10 GB/s/mm  | Xilinx/AMD FPGA SLR
CoWoS (TSMC)            | 10–45µm    | ~50 GB/s/mm  | NVIDIA H100, AMD MI300X
3D SoIC (TSMC)          | 0.5–9µm    | ~1 TB/s/mm   | AMD 3D V-Cache (SoIC stacking)

4.2 3D Stacking — HBM and Logic-Memory

3D HBM Stack — NVIDIA H100 GPU Equivalent
Side-view cross-section:

HBM Stack (SK Hynix / Micron):
┌───────────────────────────┐ ← DRAM die #8 (top)
├───────────────────────────┤ ← DRAM die #7
├───────────────────────────┤ ← DRAM die #6
├───────────────────────────┤ ← DRAM die #5
├───────────────────────────┤ ← DRAM die #4
├───────────────────────────┤ ← DRAM die #3
├───────────────────────────┤ ← DRAM die #2
├───────────────────────────┤ ← DRAM die #1
├───────────────────────────┤ ← Base die (logic/PHY)
│  ↕ TSV (Through-Silicon)  │ ← 1000s of TSVs connecting all layers
└───────────────────────────┘
              │ µbumps
┌───────────────────────────┐
│    Silicon Interposer     │ ← Routes HBM ↔ GPU compute die
└───────────────────────────┘
              │ µbumps
┌───────────────────────────┐
│   GPU Compute Die (5nm)   │
└───────────────────────────┘

H100 specs: 5× HBM3 stacks, 8 dies each
Total DRAM: 80GB
Interface width per stack: 1,024 bits
Memory bandwidth: 3.35 TB/s (vs 900 GB/s for DDR5 @ same area)

5. Die-to-Die Interconnect Standards

UCIe — Universal Chiplet Interconnect Express

UCIe is an open standard for die-to-die interfaces, analogous to PCIe for board-level interconnect. Its goal is to let chiplets from different vendors, such as Intel, AMD, Qualcomm, and Arm, interoperate within a single package.

UCIe Specification Summary:

Physical layer variants:
  Standard package (organic substrate): 16 Gbps/lane, 100–130µm bump pitch
  Advanced package (silicon interposer or bridge, e.g. CoWoS): 32 Gbps/lane, 25–55µm bump pitch

Bandwidth calculation (raw, before protocol overhead):
  Standard: 16 Gbps × 64 lanes = 1,024 Gbps ≈ 1 Tbps per direction
  Advanced: 32 Gbps × 256 lanes = 8,192 Gbps ≈ 8 Tbps per direction

Latency:
  Die-to-die: ~2ns (vs 10–100ns for chiplet via PCIe)
  Comparable to on-chip NoC latency!

Protocol stack (top to bottom):
  Protocol Layer: PCIe, CXL, or streaming (raw) traffic
  Die-to-Die Adapter: link state management, CRC and retry, parameter negotiation
  Physical Layer: link training, forwarded-clock parallel lanes, bumps and electrical signaling

UCIe adopters: Intel, AMD, Qualcomm, Arm, TSMC, Samsung, ASE (packaging)
Adoption: Required for DARPA CHIPS Act funded projects
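The bandwidth lines in the summary are simply lane rate times lane count. The sketch below reproduces them and adds a rough transfer-time estimate for a 64-byte cache line using the ~2ns link latency quoted above. The function names are illustrative, and the calculation ignores protocol overhead (headers, CRC, flow control), so real throughput would be lower.

```python
def raw_bandwidth_gbps(rate_gbps_per_lane, lanes):
    """Raw one-direction bandwidth in Gbps (1 Gbps equals 1 bit per nanosecond)."""
    return rate_gbps_per_lane * lanes


def transfer_time_ns(payload_bytes, rate_gbps_per_lane, lanes, link_latency_ns=2.0):
    """Link latency plus serialization time for one payload, ignoring protocol overhead."""
    bits = payload_bytes * 8
    return link_latency_ns + bits / raw_bandwidth_gbps(rate_gbps_per_lane, lanes)


if __name__ == "__main__":
    configs = {"standard": (16, 64), "advanced": (32, 256)}
    for name, (rate, lanes) in configs.items():
        bw = raw_bandwidth_gbps(rate, lanes)
        t = transfer_time_ns(64, rate, lanes)
        print(f"{name}: {bw} Gbps raw ({bw / 8:.0f} GB/s), "
              f"64-byte line in {t:.4f} ns")
```

For a single cache line the fixed link latency dominates the serialization time in both configurations, so small transfers are latency-bound rather than bandwidth-bound.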

6. Inter-Die Timing Closure

Timing paths that cross die boundaries require special treatment. The die-to-die link adds latency and uncertainty that must be budgeted into the system timing model.

Inter-Die Timing Path (CPU Die → I/O Die)
CPU Die (3nm):                                   I/O Die (5nm):
┌───────────────────┐                            ┌────────────────────┐
│                   │                            │                    │
│ FF_launch ───────►│                            │►────────── FF_cap  │
│ data path         │                            │ input buf          │
│ logic: 100ps      │     Die-to-die link:       │ logic: 30ps        │
│                   │                            │                    │
│ TX PHY: 50ps      │───────────────────────────►│ RX PHY: 50ps       │
│ bump → interposer │  100µm trace + µbump × 2   │                    │
│                   │  wire delay: 30ps          │                    │
└───────────────────┘                            └────────────────────┘

Total path delay: 100 + 50 + 30 + 50 + 30 = 260ps

Die-to-die timing uncertainty (added to clock budget):
  TX PHY jitter:        ±15ps
  Interconnect jitter:  ±5ps
  RX PHY jitter:        ±15ps
  CDR recovery:         ±10ps
  Total uncertainty:    45ps (vs 10ps for on-chip path!)

For 1GHz die-to-die clock (1000ps period):
  Effective budget = 1000 - 45(uncertainty) - 50(setup) = 905ps
  Used: 260ps → slack = 645ps ✓ (comfortable)

For 4GHz die-to-die (250ps period):
  Effective budget = 250 - 45 - 50 = 155ps
  Used: 260ps → slack = -105ps ✗ (timing violation! need retimer)
  Fix: Insert retimer FF at die boundary → 2-cycle latency penalty
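The slack arithmetic above generalizes to any link frequency. The following sketch uses the delay and uncertainty figures from the example and, like the example, sums the jitter terms linearly as a worst case. The names are illustrative; a production flow would take these numbers from sign-off STA rather than a hand-built table.

```python
# Path segments in picoseconds, from the CPU die → I/O die example above.
PATH_PS = {
    "launch_logic": 100,
    "tx_phy": 50,
    "wire": 30,
    "rx_phy": 50,
    "capture_logic": 30,
}

# Uncertainty terms in picoseconds, summed linearly (worst case) as in the example.
UNCERTAINTY_PS = {
    "tx_jitter": 15,
    "interconnect_jitter": 5,
    "rx_jitter": 15,
    "cdr": 10,
}

SETUP_PS = 50


def slack_ps(freq_ghz):
    """Setup slack for a single-cycle die-to-die path at the given clock frequency."""
    period = 1000.0 / freq_ghz
    budget = period - sum(UNCERTAINTY_PS.values()) - SETUP_PS
    return budget - sum(PATH_PS.values())


def max_freq_ghz():
    """Highest frequency at which the unpipelined path still meets setup."""
    min_period = sum(PATH_PS.values()) + sum(UNCERTAINTY_PS.values()) + SETUP_PS
    return 1000.0 / min_period


if __name__ == "__main__":
    for f in (1.0, 2.0, 4.0):
        print(f"{f:.1f} GHz: slack = {slack_ps(f):+.0f} ps")
    print(f"max single-cycle frequency ≈ {max_freq_ghz():.2f} GHz")
```

With these numbers the unpipelined path closes up to roughly 2.8GHz; beyond that the path has to be split with a register at the die boundary, as the fix above describes.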

7. Real-World Chiplet Examples

AMD MI300X (AI Accelerator)

Intel Meteor Lake

8. Hierarchical Design Challenges

Challenge | Problem | Solution
Interface timing | Budget cuts from top may exceed block capability | Early co-design of timing budget, interface spec frozen early
Congestion at boundary | All wires crossing block edge at single strip | Spread interface pins, use multiple metal layers for feed-throughs
Power domain boundary | Level shifters needed at every VDD crossing | Minimize cross-domain signals, batch into buses
Block mismatch | Block A and Block B interface mismatch at top | Strict ECO procedure; only interface-preserving changes allowed
Die-to-die latency | 2–5 extra clock cycles for cross-die paths | Pipeline interfaces, use burst protocols to amortize latency
Thermal cross-die | Hot compute die heats memory die (3D) | Thermal modeling at package level, active cooling solutions
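To see why burst protocols help with die-to-die latency, consider a deliberately simple model: every burst pays the crossing latency once and then transfers one beat per cycle. This is an assumption for illustration only; it ignores overlapping outstanding transactions, which a pipelined interface would use to hide latency further. The function names are hypothetical.

```python
def link_efficiency(burst_beats, latency_cycles):
    """Fraction of cycles that carry data when each burst pays the latency once."""
    return burst_beats / (burst_beats + latency_cycles)


def min_burst_for_efficiency(latency_cycles, target):
    """Smallest burst length (in beats) that reaches the target efficiency."""
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    burst = 1
    while link_efficiency(burst, latency_cycles) < target:
        burst += 1
    return burst


if __name__ == "__main__":
    latency = 5  # upper end of the 2–5 extra cycles listed in the table
    for beats in (1, 4, 16, 64):
        print(f"burst of {beats:2d} beats: "
              f"{link_efficiency(beats, latency):.0%} efficient")
```

Under this model a single-beat transfer across a 5-cycle link uses about 17% of the available cycles, while a 16-beat burst uses about 76%.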

9. Chiplet Design Checklist

Hierarchical / Chiplet Production Checklist

  • Partition complete: All blocks defined with fixed interface specs
  • Timing budgets allocated: Internal vs boundary vs interconnect
  • Pin locations frozen: Interface pin coordinates locked per block
  • Black-box models ready: Abstract timing models (ETMs) for all blocks
  • Level shifters accounted: All cross-domain pins have LS cells budgeted
  • Die-to-die PHY characterized: Latency, jitter, bandwidth verified
  • UCIe/PHY compliant: Interface matches standard (or proprietary) spec
  • Interposer design complete: 2.5D routing verified with SI/EM analysis
  • Thermal model assembled: Full-package thermal verified at max workload
  • System-level timing closed: End-to-end paths across all dies close
  • Test strategy defined: KGD (Known Good Die) test before packaging
  • Yield model signed off: Die yield × assembly yield meets production target

Next — Day 11: Engineering Change Orders (ECO) — late-stage netlist changes, metal-only ECO, functional ECO, and post-silicon bug fixing strategies.