What is hierarchical design in physical design?

Hierarchical design divides a large chip into independently implementable blocks (partitions), each with defined interfaces. Each block is implemented in parallel by different teams, then assembled at the top level — enabling billion-transistor designs that would be impossible to route as a flat design.

What are chiplets in semiconductor design?

Chiplets are separately manufactured silicon dies that are co-packaged together using advanced packaging (2.5D interposers or 3D stacking). Each chiplet can use a different process node optimized for its function — e.g., CPU core at 3nm, DRAM at 1x-nm, analog at 28nm — all connected via high-density die-to-die interconnect.

What is UCIe?

UCIe (Universal Chiplet Interconnect Express) is an open industry standard for die-to-die interfaces in chiplet designs. It defines the physical layer, die-to-die adapter protocol, and software stack, enabling chiplets from different vendors to interoperate, similar to how PCIe enables board-level interoperability.

Physical Design Day 10 — Hierarchical Design & Multi-Die Integration

1. Why Hierarchical Design?

Modern chips contain 10–100+ billion transistors. No EDA tool — and no team — can successfully place and route a billion-gate flat design in a reasonable timeframe. Hierarchical design solves this by breaking the chip into independently implementable blocks.

Benefits of hierarchical design:

Parallel implementation: 20 teams implement 20 blocks simultaneously
EDA tool scalability: Each block fits in tool memory constraints
Reuse: A designed block (e.g., CPU core) can be replicated 8× in top-level
Interface stability: Teams work independently with fixed pin locations
QA isolation: Bug fixes in one block don't invalidate other blocks

Hierarchical Design Partitioning (Mobile SoC)

TOP LEVEL (SoC Assembly)
┌──────────────────────────────────────────────────────────────────────┐
│                                                                      │
│ ┌────────────────────┐ ┌────────────────────┐ ┌────────────────────┐ │
│ │    CPU CLUSTER     │ │    GPU COMPLEX     │ │  MEMORY SUBSYSTEM  │ │
│ │                    │ │                    │ │                    │ │
│ │ ┌──────┐  ┌──────┐ │ │ ┌──────┐  ┌──────┐ │ │ ┌────────────────┐ │ │
│ │ │P-Core│  │E-Core│ │ │ │Shader│  │Shader│ │ │ │    L3 Cache    │ │ │
│ │ │  ×4  │  │  ×4  │ │ │ │Array │  │Array │ │ │ │     16MB       │ │ │
│ │ └──────┘  └──────┘ │ │ └──────┘  └──────┘ │ │ └────────────────┘ │ │
│ │ ┌────────────────┐ │ │ ┌────────────────┐ │ │ ┌────────────────┐ │ │
│ │ │   Shared L2    │ │ │ │ Texture Cache  │ │ │ │   DRAM Ctrl    │ │ │
│ │ └────────────────┘ │ │ └────────────────┘ │ │ └────────────────┘ │ │
│ └────────────────────┘ └────────────────────┘ └────────────────────┘ │
│                                                                      │
│ ┌────────────────────┐ ┌────────────────────┐ ┌────────────────────┐ │
│ │    MEDIA ENGINE    │ │   NEURAL ENGINE    │ │     I/O & PHY      │ │
│ │   (Video encode)   │ │   (ML inference)   │ │ (USB, PCIe, WiFi)  │ │
│ └────────────────────┘ └────────────────────┘ └────────────────────┘ │
│                                                                      │
│ ═════════════════ On-chip Network (NoC / Ring Bus) ═════════════════ │
└──────────────────────────────────────────────────────────────────────┘

Design teams: ~15 engineers per block, 100+ physical design engineers in total
Timeline: 18–24 months from spec to tape-out

2. Top-Down Partitioning Flow

2.1 Partition Criteria

Good partitioning minimizes interface complexity while enabling maximum parallel work. Key criteria:

Interface bandwidth: Fewer, wider buses preferred over many narrow signals
Clock domains: Keep same-domain logic in same block where possible
Power domains: Group by voltage level to minimize level shifters at boundary
Functional cohesion: Blocks should be self-contained (a CPU core, a memory controller)
Area balance: ~10–30% of die area per partition (avoid one huge block)
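
The area-balance criterion is easy to check mechanically once a floorplan estimate exists. The sketch below is illustrative only: the `Partition` class, the block areas, and the choice to apply the 10–30% window to the summed block area (rather than true die area) are assumptions, not figures from a real SoC.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    area_mm2: float  # estimated block area (hypothetical values below)


def area_fractions(partitions):
    """Map each block name to its share of the summed block area."""
    total = sum(p.area_mm2 for p in partitions)
    return {p.name: p.area_mm2 / total for p in partitions}


def out_of_balance(partitions, lo=0.10, hi=0.30):
    """Return the names of blocks whose area share falls outside [lo, hi]."""
    fractions = area_fractions(partitions)
    return sorted(name for name, f in fractions.items() if f < lo or f > hi)


# Hypothetical floorplan estimate for the six blocks of the mobile SoC example.
blocks = [
    Partition("cpu_cluster", 18.0),
    Partition("gpu_complex", 24.0),
    Partition("memory_subsystem", 20.0),
    Partition("media_engine", 6.0),
    Partition("neural_engine", 14.0),
    Partition("io_phy", 18.0),
]

print(out_of_balance(blocks))  # blocks that may need merging or splitting
```

A block flagged as too small is a candidate for merging into a neighbor; one flagged as too large is a candidate for a further split.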

2.2 Black Box Methodology

Black Box Interface Contract

Block Interface Definition (CPU Core, fixed for entire project):

CPU_CORE_v1

  INPUTS:
    clk_cpu[1]       - Clock input (3.49GHz max)
    rst_n[1]         - Active-low reset
    instr_bus[128]   - Instruction fetch data
    data_bus[512]    - L2 cache data bus
    irq[8]           - Interrupt requests
    pwr_en[1]        - Power domain enable

  OUTPUTS:
    instr_addr[48]   - Instruction fetch address
    data_addr[48]    - Load/store address
    data_wr[512]     - Write data to L2
    data_wr_en[1]    - Write enable
    perf_cnt[32]     - Performance counters

  TIMING:         All I/O at clk_cpu edge
  AREA:           4.5mm² at 75% utilization
  POWER:          1.8W at 3.49GHz, 1.05V
  PIN LOCATIONS:  Fixed at block boundary (±5µm)

This contract NEVER changes after project kickoff. Top-level timing closure depends on this stability.
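
One way to enforce a frozen contract is to diff every proposed block revision against it before the revision is accepted. The following is a minimal sketch, assuming the port list above is captured as a plain dictionary; `interface_diff` and the `proposed` revision are hypothetical names for illustration, not part of any EDA tool.

```python
import copy

# Port name -> bit width, transcribed from the CPU_CORE_v1 contract above.
CPU_CORE_V1 = {
    "inputs": {
        "clk_cpu": 1, "rst_n": 1, "instr_bus": 128,
        "data_bus": 512, "irq": 8, "pwr_en": 1,
    },
    "outputs": {
        "instr_addr": 48, "data_addr": 48, "data_wr": 512,
        "data_wr_en": 1, "perf_cnt": 32,
    },
}


def interface_diff(frozen, proposed):
    """List every port that was added, removed, or resized."""
    diffs = []
    for direction in ("inputs", "outputs"):
        old, new = frozen[direction], proposed[direction]
        for port in sorted(set(old) | set(new)):
            if port not in new:
                diffs.append(f"{direction}: {port} removed")
            elif port not in old:
                diffs.append(f"{direction}: {port} added")
            elif old[port] != new[port]:
                diffs.append(f"{direction}: {port} width {old[port]} -> {new[port]}")
    return diffs


# Hypothetical revision that widens the interrupt bus: this breaks the contract.
proposed = copy.deepcopy(CPU_CORE_V1)
proposed["inputs"]["irq"] = 16

changes = interface_diff(CPU_CORE_V1, proposed)
print(changes or "interface unchanged")
```

An empty list means the revision preserves the interface; anything else would go through re-negotiation with the top-level team.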

2.3 Interface Timing Budgeting

Hierarchical Timing Budget Allocation:

Total timing budget (3.49GHz) = 1 / 3.49GHz ≈ 286ps

Allocation for a path from CPU_CORE → L2_CACHE:
  CPU_CORE internal logic:       100ps  (internal design budget)
  Output pin + driver:            15ps  (cell delays at block boundary)
  Interconnect (on-chip wire):    40ps  (routing from CPU to L2)
  L2_CACHE input logic:           80ps  (internal design budget)
  Setup time (FF):                10ps
  Clock uncertainty:              41ps
  ─────────────────────────────────────
  Total:                         286ps  ← uses full budget exactly!

Implementation rule:
  Block teams own their internal budget (100ps / 80ps above)
  Top-level team owns the routing wire budget (40ps)
  Any budget overrun requires negotiation and re-budgeting
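
The budget arithmetic above can be re-run whenever an allocation changes. This is a small sketch using the numbers from the allocation; the `check_budget` helper and the dictionary keys are illustrative names, not a tool API.

```python
def check_budget(freq_ghz, allocations_ps):
    """Return (clock period, total allocated, remaining margin), all in ps."""
    period_ps = 1000.0 / freq_ghz
    allocated_ps = sum(allocations_ps.values())
    return period_ps, allocated_ps, period_ps - allocated_ps


# CPU_CORE -> L2_CACHE path allocation from the budget above.
cpu_to_l2 = {
    "cpu_core_internal": 100,
    "output_driver": 15,
    "top_level_wire": 40,
    "l2_cache_input": 80,
    "ff_setup": 10,
    "clock_uncertainty": 41,
}

period, allocated, margin = check_budget(3.49, cpu_to_l2)
print(f"period={period:.1f}ps allocated={allocated}ps margin={margin:.1f}ps")
# prints: period=286.5ps allocated=286ps margin=0.5ps
```

With well under 1ps of margin, any overrun in one entry has to be taken out of another, which is why re-budgeting is a negotiation.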

3. Chiplet Architecture — The New Paradigm

Chiplets replace monolithic single-die designs with assemblies of smaller specialized dies. Instead of cramming CPU, GPU, and memory controller onto one die, each function becomes a separate chip manufactured on the optimal process node.

Why Chiplets Dominate the Industry

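A 500mm² monolithic die at 5nm might yield only ~10%, because most dies contain at least one defect. Split the same logic into five 100mm² chiplets and each one yields ~60%. Assembling five untested chiplets would be no better (0.60^5 ≈ 7.8%), but chiplets are tested before packaging, so only the defective ones are discarded and roughly 60% of the silicon ends up in working products instead of 10%. In addition, CPU dies can use a leading-edge node such as TSMC 5nm while analog and I/O stay on a cheaper node such as 28nm.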
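
The yield figures can be reproduced with a simple Poisson defect model, Y = exp(-A × D0). The model choice is an assumption made here for illustration (foundries use more elaborate models), and the defect density is back-calculated from the 10% figure rather than taken from any published data.

```python
import math


def defect_density_per_mm2(observed_yield, area_mm2):
    """Invert the Poisson model Y = exp(-A * D0) to recover D0."""
    return -math.log(observed_yield) / area_mm2


def poisson_yield(area_mm2, d0_per_mm2):
    """Fraction of dies with zero defects under the Poisson model."""
    return math.exp(-area_mm2 * d0_per_mm2)


# Start from the 500 mm^2 / 10% example in the text.
d0 = defect_density_per_mm2(0.10, 500.0)
chiplet_yield = poisson_yield(100.0, d0)

# Five chiplets assembled without pre-test: all five must be good.
untested_assembly = chiplet_yield ** 5

print(f"D0 = {d0 * 100:.2f} defects/cm^2")
print(f"100 mm^2 chiplet yield = {chiplet_yield:.1%}")
print(f"five untested chiplets = {untested_assembly:.1%}")
```

Under this model the chiplet yield comes out near 63%, which the text rounds to ~60%, and assembling untested chiplets gives exactly the monolithic yield back. The gain comes entirely from testing each die before packaging.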

Chiplet Package Architecture (AMD EPYC Genoa Equivalent)

Package Top-Down View
┌───────────────────────────────────────────────────────────────┐
│                       Organic Substrate                       │
│                                                               │
│      ┌─────────────┐   ┌─────────────┐   ┌─────────────┐      │
│      │   CCD #0    │   │   CCD #1    │   │   CCD #2    │      │
│      │   CPU 3nm   │   │   CPU 3nm   │   │   CPU 3nm   │      │
│      │  12 cores   │   │  12 cores   │   │  12 cores   │      │
│      └──────┬──────┘   └──────┬──────┘   └──────┬──────┘      │
│             │                 │                 │             │
│      ┌──────┴─────────────────┴─────────────────┴──────┐      │
│      │               IOD (I/O Die, 6nm)                │      │
│      │ PCIe 5.0 ×128        │  DDR5 Memory Ctrl ×8     │      │
│      │ USB 4.0 ×4           │  Infinity Fabric         │      │
│      └─────────────────────────────────────────────────┘      │
│                                                               │
│      ┌─────────────┐   ┌─────────────┐   ┌─────────────┐      │
│      │   CCD #3    │   │   CCD #4    │   │   CCD #5    │      │
│      │   CPU 3nm   │   │   CPU 3nm   │   │   CPU 3nm   │      │
│      └─────────────┘   └─────────────┘   └─────────────┘      │
│                                                               │
└───────────────────────────────────────────────────────────────┘

Die-to-die connections: ~1,000 bumps per CCD↔IOD interface
Bump pitch: 130µm or more (organic substrate) or 40–55µm (silicon interposer)
Bandwidth: ~154 GB/s per die-to-die link (Infinity Fabric at 2.4GHz × 512 bits = 1.23 Tb/s)

4. 2.5D and 3D Integration

4.1 2.5D — Silicon Interposer

A passive silicon interposer acts as a routing substrate between chiplets, enabling much higher density connections than an organic package substrate.

Technology	Bump Pitch	Bandwidth/mm	Example
Organic substrate	130–400µm	~1 GB/s/mm	AMD EPYC (early)
2.5D Silicon interposer	40–100µm	~10 GB/s/mm	Xilinx/AMD FPGA SLR
CoWoS (TSMC)	10–45µm	~50 GB/s/mm	NVIDIA H100, AMD MI300X
3D SoIC (TSMC)	0.5–9µm	~1 TB/s/mm	AMD 3D V-Cache (SoIC stacking)
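
Bump pitch sets connection density roughly as the inverse square of the pitch. The sketch below assumes a regular square bump grid, which is a simplification (real bump maps are often staggered and reserve many sites for power and ground); the three pitches are sample points picked from the ranges in the table.

```python
def bumps_per_mm2(pitch_um):
    """Bump sites per square millimetre on a square grid of the given pitch."""
    per_mm = 1000.0 / pitch_um
    return per_mm ** 2


samples = [
    ("organic substrate, 130µm", 130),
    ("silicon interposer, 40µm", 40),
    ("hybrid bonding, 9µm", 9),
]

for label, pitch in samples:
    print(f"{label}: {bumps_per_mm2(pitch):,.0f} bumps/mm²")
```

Going from 130µm to 9µm pitch raises the available connection count per unit area by roughly two hundred times, which is where the bandwidth-density gap in the table comes from.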

4.2 3D Stacking — HBM and Logic-Memory

3D HBM Stack — NVIDIA H100 GPU Equivalent

Side-view cross-section (the HBM stack and the GPU die sit side by side on the interposer):

                               HBM Stack (SK Hynix / Micron)
                               ┌───────────────┐ ← DRAM die #8 (top)
                               ├───────────────┤ ← DRAM die #7
                               ├───────────────┤ ← DRAM die #6
                               ├───────────────┤ ← DRAM die #5
                               ├───────────────┤ ← DRAM die #4
                               ├───────────────┤ ← DRAM die #3
                               ├───────────────┤ ← DRAM die #2
                               ├───────────────┤ ← DRAM die #1
┌───────────────────────┐      ├───────────────┤ ← Base die (logic/PHY)
│ GPU Compute Die (5nm) │      │ ↕ TSVs        │ ← 1000s of TSVs connecting all layers
└───────────────────────┘      └───────────────┘
            │ µbumps                   │ µbumps
┌──────────────────────────────────────────────┐
│              Silicon Interposer              │ ← Routes HBM ↔ GPU compute die
└──────────────────────────────────────────────┘

H100 specs: 5× HBM3 stacks, 8 dies each
Total DRAM: 80GB
Interface width per stack: 1,024 bits
Memory bandwidth: 3.35 TB/s (for comparison, a single 64-bit DDR5-4800 channel delivers 38.4 GB/s)
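
HBM gets its bandwidth from width rather than speed: each stack exposes a 1,024-bit data interface at a modest per-pin rate. A minimal sketch of the per-stack calculation follows; the function name is illustrative, and 6.4 Gbps is the top per-pin rate defined for HBM3, which shipping products do not necessarily run at.

```python
def stack_bandwidth_gbs(pin_rate_gbps, width_bits=1024):
    """Peak bandwidth of one memory interface in GB/s (bits -> bytes)."""
    return pin_rate_gbps * width_bits / 8.0


hbm3_stack = stack_bandwidth_gbs(6.4)                 # 1,024-bit HBM3 stack
ddr5_channel = stack_bandwidth_gbs(4.8, width_bits=64)  # 64-bit DDR5-4800 channel

print(f"HBM3 stack:   {hbm3_stack:.1f} GB/s")
print(f"DDR5 channel: {ddr5_channel:.1f} GB/s")
print(f"ratio:        {hbm3_stack / ddr5_channel:.1f}x")
```

The per-pin rates are close to each other; the sixteen-fold difference in interface width accounts for most of the gap, and that width is only practical with interposer-class bump pitch.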

5. Die-to-Die Interconnect Standards

UCIe — Universal Chiplet Interconnect Express

UCIe is the open standard for die-to-die interfaces, analogous to PCIe for board-level interconnect. The goal is for chiplets from different vendors, such as Intel, AMD, Qualcomm, and Arm, to interoperate over a common interface.

UCIe Specification Summary (UCIe 1.0):

Physical layer variants:
  Standard package (organic substrate):          up to 32 GT/s per lane, 100–130µm bump pitch, 16 lanes per module
  Advanced package (silicon interposer/bridge):  up to 32 GT/s per lane, 25–55µm bump pitch, 64 lanes per module

Bandwidth calculation (per module, per direction):
  Standard: 32 Gbps × 16 lanes = 512 Gbps
  Advanced: 32 Gbps × 64 lanes = 2,048 Gbps ≈ 2 Tbps

Latency:
  Die-to-die: ~2ns (vs 10–100ns for chiplet via PCIe)
  Comparable to on-chip NoC latency!

Protocol stack:
  Protocol Layer:      PCIe, CXL, or streaming (raw) protocols
  Die-to-Die Adapter:  link management, CRC/retry, parameter negotiation
  Physical Layer:      bumps, electrical signaling, clocking, link training

UCIe consortium members: Intel, AMD, Qualcomm, Arm, TSMC, Samsung, ASE (packaging)
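
Raw link bandwidth and link latency combine into the time needed to move one payload across the die boundary. The sketch below is a back-of-the-envelope model, not part of the UCIe specification: it ignores protocol overhead, and the 64-lane, 16 Gbps link and 2ns latency are illustrative inputs.

```python
def raw_bandwidth_gbps(lane_rate_gbps, lanes):
    """Raw one-direction bandwidth of a parallel die-to-die link."""
    return lane_rate_gbps * lanes


def transfer_time_ns(payload_bytes, lane_rate_gbps, lanes, link_latency_ns):
    """Link latency plus serialization time (1 Gbps equals 1 bit per ns)."""
    serialization_ns = payload_bytes * 8 / raw_bandwidth_gbps(lane_rate_gbps, lanes)
    return link_latency_ns + serialization_ns


# Illustrative link: 64 lanes at 16 Gbps, 2 ns latency, one 64-byte cache line.
bandwidth = raw_bandwidth_gbps(16, 64)
cache_line_ns = transfer_time_ns(64, 16, 64, 2.0)

print(f"raw bandwidth: {bandwidth} Gbps")
print(f"64-byte cache line: {cache_line_ns} ns")
```

For small payloads the fixed latency dominates the serialization time, so adding lanes helps throughput far more than it helps single-transfer latency.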

6. Inter-Die Timing Closure

Timing paths that cross die boundaries require special treatment. The die-to-die link adds latency and uncertainty that must be budgeted into the system timing model.

Inter-Die Timing Path (CPU Die → I/O Die)

CPU Die (3nm)                   Die-to-die link        I/O Die (5nm)
┌───────────────────────┐                              ┌───────────────────────┐
│ FF_launch             │                              │            FF_capture │
│ data path logic 100ps │   bump → interposer → bump   │ input buf logic  30ps │
│ TX PHY           50ps │ ───────────────────────────► │ RX PHY           50ps │
│                       │    wire delay:       30ps    │                       │
│                       │   100µm trace + µbump × 2    │                       │
└───────────────────────┘                              └───────────────────────┘

Total path delay: 100 + 50 + 30 + 50 + 30 = 260ps

Die-to-die timing uncertainty (added to clock budget):
  TX PHY jitter:        ±15ps
  Interconnect jitter:  ±5ps
  RX PHY jitter:        ±15ps
  CDR recovery:         ±10ps
  Total uncertainty:    45ps (vs 10ps for on-chip path!)

For 1GHz die-to-die clock (1000ps period):
  Effective budget = 1000 - 45 (uncertainty) - 50 (setup) = 905ps
  Used: 260ps → slack = 645ps ✓ (comfortable)

For 4GHz die-to-die (250ps period):
  Effective budget = 250 - 45 - 50 = 155ps
  Used: 260ps → slack = -105ps ✗ (timing violation! need retimer)

Fix: Insert retimer FF at die boundary → 2-cycle latency penalty
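
The two slack calculations generalize to any die-to-die clock rate. This sketch reuses the delays and uncertainties from the example path; the helper names are illustrative, and the uncertainty terms are summed linearly as in the text.

```python
def slack_ps(freq_ghz, path_delay_ps, uncertainty_ps, setup_ps):
    """Setup slack for a single-cycle path at the given clock frequency."""
    period_ps = 1000.0 / freq_ghz
    return period_ps - uncertainty_ps - setup_ps - path_delay_ps


def max_single_cycle_freq_ghz(path_delay_ps, uncertainty_ps, setup_ps):
    """Highest clock rate at which the path still closes in one cycle."""
    return 1000.0 / (path_delay_ps + uncertainty_ps + setup_ps)


PATH_PS = 100 + 50 + 30 + 50 + 30   # logic + TX PHY + wire + RX PHY + input logic
UNCERTAINTY_PS = 15 + 5 + 15 + 10   # TX, interconnect, RX, CDR
SETUP_PS = 50

for freq in (1.0, 4.0):
    print(f"{freq} GHz: slack = {slack_ps(freq, PATH_PS, UNCERTAINTY_PS, SETUP_PS):.0f}ps")

limit = max_single_cycle_freq_ghz(PATH_PS, UNCERTAINTY_PS, SETUP_PS)
print(f"single-cycle limit: {limit:.2f} GHz")
```

With these numbers the path stops closing in a single cycle at about 2.8GHz, so any faster die-to-die clock needs the path split across pipeline stages.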

7. Real-World Chiplet Examples

AMD MI300X (AI Accelerator)

Architecture: 12 logic dies (8 GPU compute + 4 I/O) plus 8 HBM3 stacks
GPU compute dies: 8× XCD (CDNA 3, TSMC 5nm)
I/O dies: 4× IOD (TSMC 6nm, Infinity Fabric, PCIe Gen5)
Memory: 8× HBM3 stacks (192GB total)
Memory bandwidth: 5.3 TB/s
Die-to-die interconnect: compute dies hybrid-bonded onto the I/O dies (TSMC SoIC), with I/O dies and HBM connected through a CoWoS silicon interposer
Total transistors: ~153 billion

Intel Meteor Lake

Compute tile: Intel 4 (EUV), with CPU P-cores and E-cores
GPU tile: TSMC N5, with Arc graphics
SoC tile: TSMC N6, with memory controller, media and display engines, NPU, and low-power E-cores
I/O tile: TSMC N6, with PCIe and Thunderbolt
Die-to-die: Intel Foveros (36µm microbump pitch) on a passive base tile
Unique: First Intel client processor built from tiles, several of them manufactured by an external foundry

8. Hierarchical Design Challenges

Challenge	Problem	Solution
Interface timing	Budget cuts from top may exceed block capability	Early co-design of timing budget, interface spec frozen early
Congestion at boundary	All wires crossing block edge at single strip	Spread interface pins, use multiple metal layers for feed-throughs
Power domain boundary	Level shifters needed at every VDD crossing	Minimize cross-domain signals, batch into buses
Block mismatch	Block A and Block B interface mismatch at top	Strict ECO procedure; only interface-preserving changes allowed
Die-to-die latency	2–5 extra clock cycles for cross-die paths	Pipeline interfaces, use burst protocols to amortize latency
Thermal cross-die	Hot compute die heats memory die (3D)	Thermal modeling at package level, active cooling solutions
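
The burst-protocol remedy in the die-to-die latency row can be quantified with a simple model. The sketch below assumes one outstanding transaction at a time, so every burst pays the cross-die latency once before its data beats flow back to back; the function name and the 4-cycle latency are illustrative.

```python
def link_efficiency(burst_beats, latency_cycles):
    """Fraction of cycles carrying data when each burst pays the latency once."""
    return burst_beats / (burst_beats + latency_cycles)


LATENCY_CYCLES = 4  # within the 2-5 cycle range quoted in the table

for beats in (1, 4, 16, 64):
    eff = link_efficiency(beats, LATENCY_CYCLES)
    print(f"burst of {beats:>2} beats: {eff:.0%} of peak bandwidth")
```

Single-beat transfers waste most of the link in this model, while 16-beat bursts recover 80% of peak. Allowing several outstanding transactions would hide the latency further and is not modeled here.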

9. Chiplet Design Checklist

Hierarchical / Chiplet Production Checklist

✅ Partition complete: All blocks defined with fixed interface specs
✅ Timing budgets allocated: Internal vs boundary vs interconnect
✅ Pin locations frozen: Interface pin coordinates locked per block
✅ Black-box models ready: Abstract timing models (ETMs) for all blocks
✅ Level shifters accounted: All cross-domain pins have LS cells budgeted
✅ Die-to-die PHY characterized: Latency, jitter, bandwidth verified
✅ UCIe/PHY compliant: Interface matches standard (or proprietary) spec
✅ Interposer design complete: 2.5D routing verified with SI/EM analysis
✅ Thermal model assembled: Full-package thermal verified at max workload
✅ System-level timing closed: End-to-end paths across all dies close
✅ Test strategy defined: KGD (Known Good Die) test before packaging
✅ Yield model signed off: Die yield × assembly yield meets production target

Next — Day 11: Engineering Change Orders (ECO) — late-stage netlist changes, metal-only ECO, functional ECO, and post-silicon bug fixing strategies.