Modern chips contain 10–100+ billion transistors. No EDA tool — and no team — can successfully place and route a billion-gate flat design in a reasonable timeframe. Hierarchical design solves this by breaking the chip into independently implementable blocks.
Benefits of hierarchical design:
Parallel implementation: 20 teams implement 20 blocks simultaneously
EDA tool scalability: Each block fits in tool memory constraints
Reuse: A completed block (e.g., a CPU core) can be instantiated 8× at the top level
Interface stability: Teams work independently with fixed pin locations
QA isolation: Bug fixes in one block don't invalidate other blocks
Good partitioning minimizes interface complexity while enabling maximum parallel work. Key criteria:
Interface bandwidth: Fewer, wider buses preferred over many narrow signals
Clock domains: Keep same-domain logic in same block where possible
Power domains: Group by voltage level to minimize level shifters at boundary
Functional cohesion: Blocks should be self-contained (a CPU core, a memory controller)
Area balance: ~10–30% of die area per partition (avoid one huge block)
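The area-balance criterion is easy to check mechanically. The sketch below is only an illustration: the block names and areas are hypothetical, and the 10% and 30% thresholds are simply the rule of thumb quoted above.

```python
def area_balance_report(block_areas_mm2, die_area_mm2, lo=0.10, hi=0.30):
    """Classify each partition by its share of the die area.

    Returns {block_name: (fraction_of_die, status)} where status is
    "small" (below lo), "large" (above hi), or "ok".
    """
    report = {}
    for name, area in block_areas_mm2.items():
        frac = area / die_area_mm2
        if frac < lo:
            status = "small"
        elif frac > hi:
            status = "large"
        else:
            status = "ok"
        report[name] = (round(frac, 3), status)
    return report


# Hypothetical floorplan for a 100 mm² die (illustrative numbers only).
example_blocks = {"cpu_cluster": 28.0, "gpu": 35.0, "mem_ctrl": 12.0, "io": 6.0}
example_report = area_balance_report(example_blocks, die_area_mm2=100.0)
```

In this made-up floorplan the `gpu` block is flagged as too large and `io` as too small, which would suggest splitting the first and merging the second into a neighbor.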
2.2 Black Box Methodology
Black Box Interface Contract
Block Interface Definition (CPU Core, fixed for entire project):
┌─────────────────────────────────────────────────────────────┐
│ CPU_CORE_v1 │
│ │
│ INPUTS: │
│ clk_cpu[1] - Clock input (3.49GHz max) │
│ rst_n[1] - Active-low reset │
│ instr_bus[128] - Instruction fetch data │
│ data_bus[512] - L2 cache data bus │
│ irq[8] - Interrupt requests │
│ pwr_en[1] - Power domain enable │
│ │
│ OUTPUTS: │
│ instr_addr[48] - Instruction fetch address │
│ data_addr[48] - Load/store address │
│ data_wr[512] - Write data to L2 │
│ data_wr_en[1] - Write enable │
│ perf_cnt[32] - Performance counters │
│ │
│ TIMING: - All I/O at clk_cpu edge │
│ AREA: - 4.5mm² at 75% utilization │
│ POWER: - 1.8W at 3.49GHz, 1.05V │
│ PIN LOCATIONS: - Fixed at block boundary (±5µm) │
└─────────────────────────────────────────────────────────────┘
This contract NEVER changes after project kickoff.
Top-level timing closure depends on this stability.
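One way to enforce a frozen contract is to keep the port list as data and compare every block release against it. The following is a minimal sketch: the port names and widths are transcribed from the CPU_CORE_v1 box above, while the `Port` class and the `contract_violations` and `total_bits` helpers are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    name: str
    width: int
    direction: str  # "in" or "out"


# Port list transcribed from the CPU_CORE_v1 contract.
CPU_CORE_V1 = (
    Port("clk_cpu", 1, "in"),
    Port("rst_n", 1, "in"),
    Port("instr_bus", 128, "in"),
    Port("data_bus", 512, "in"),
    Port("irq", 8, "in"),
    Port("pwr_en", 1, "in"),
    Port("instr_addr", 48, "out"),
    Port("data_addr", 48, "out"),
    Port("data_wr", 512, "out"),
    Port("data_wr_en", 1, "out"),
    Port("perf_cnt", 32, "out"),
)


def contract_violations(frozen, candidate):
    """Return a list of differences between the frozen and candidate ports."""
    frozen_by_name = {p.name: p for p in frozen}
    cand_by_name = {p.name: p for p in candidate}
    problems = []
    for name, port in frozen_by_name.items():
        other = cand_by_name.get(name)
        if other is None:
            problems.append(f"missing port: {name}")
        elif other != port:
            problems.append(f"changed port: {name}")
    for name in cand_by_name:
        if name not in frozen_by_name:
            problems.append(f"unexpected port: {name}")
    return problems


def total_bits(ports, direction):
    """Sum the widths of all ports with the given direction."""
    return sum(p.width for p in ports if p.direction == direction)


# A candidate release that widens irq from 8 to 16 bits breaks the contract.
widened = tuple(Port("irq", 16, "in") if p.name == "irq" else p for p in CPU_CORE_V1)
violations = contract_violations(CPU_CORE_V1, widened)
```

A check like this could run on every block handoff so that an accidental interface change is caught before it reaches top-level integration.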
2.3 Interface Timing Budgeting
Hierarchical Timing Budget Allocation:
Clock period at 3.49GHz = 1 / 3.49GHz ≈ 286ps (the total timing budget)
Allocation for a path from CPU_CORE → L2_CACHE:
  CPU_CORE internal logic:      100ps  (internal design budget)
  Output port + driver:          15ps  (cell delays at block boundary)
  Interconnect (on-chip wire):   40ps  (routing from CPU to L2)
  L2_CACHE input logic:          80ps  (internal design budget)
  Setup time (FF):               10ps
  Clock uncertainty:             41ps
──────────────────────────────────────────────────────────
  Total:                        286ps  ← consumes the full period, leaving zero slack
Implementation rule:
Block teams own their internal budget (100ps / 80ps above)
Top-level team owns the routing wire budget (40ps)
Any budget overrun requires negotiation and re-budgeting
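The budget arithmetic above can be kept in a small script so that any proposed change is checked against the clock period. This is a sketch only: the dictionary keys are hypothetical labels for the budget rows, and the values are the ones listed in the allocation.

```python
# Budget rows from the CPU_CORE -> L2_CACHE allocation, in picoseconds.
BUDGET_PS = {
    "cpu_core_internal": 100,
    "output_driver": 15,
    "top_level_wire": 40,
    "l2_cache_input": 80,
    "ff_setup": 10,
    "clock_uncertainty": 41,
}


def clock_period_ps(freq_ghz):
    """Clock period in picoseconds for a frequency in GHz."""
    return 1000.0 / freq_ghz


def slack_ps(budget, freq_ghz):
    """Remaining slack: clock period minus the sum of all budget items."""
    return clock_period_ps(freq_ghz) - sum(budget.values())


baseline_slack = slack_ps(BUDGET_PS, 3.49)

# A 10 ps overrun on the top-level wire pushes the path negative,
# which is the situation that triggers re-budgeting.
overrun = dict(BUDGET_PS, top_level_wire=50)
overrun_slack = slack_ps(overrun, 3.49)
```

The baseline leaves about half a picosecond of slack (286.5ps period against a 286ps budget), so any overrun in one row has to be paid back from another.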
3. Chiplet Architecture — The New Paradigm
Chiplets replace monolithic single-die designs with assemblies of smaller specialized dies. Instead of cramming CPU, GPU, and memory controller onto one die, each function becomes a separate chip manufactured on the optimal process node.
Why Chiplets Dominate the Industry
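A 500mm² monolithic die at 5nm yields only ~10%, because most dies contain at least one defect. Split the same logic into five 100mm² chiplets and each chiplet yields ~60%. If untested chiplets were assembled blindly, system yield would be 0.6^5 ≈ 7.8%, no better than the monolithic die. The gain comes from testing each chiplet before assembly (known-good die): only the defective chiplets are discarded, not the entire assembly, so ~60% of the fabricated silicon is usable instead of ~10%. In addition, each function can use the most suitable process, for example TSMC 5nm for CPU cores and cheaper 28nm for analog.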
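The yield argument can be reproduced with a simple Poisson defect model, Y = exp(-A·D0). Treat this as a sketch under stated assumptions: the Poisson model is one common approximation rather than something this text specifies, the defect density is back-calculated from the ~60% figure for a 100mm² die, and assembly yield and packaging cost are ignored.

```python
import math


def poisson_yield(area_mm2, d0_per_mm2):
    """Die yield under a Poisson defect model (assumed model)."""
    return math.exp(-area_mm2 * d0_per_mm2)


def defect_density_from_yield(area_mm2, observed_yield):
    """Invert the Poisson model to get defects per mm²."""
    return -math.log(observed_yield) / area_mm2


def silicon_per_good_unit(area_mm2, die_yield, dies_per_unit=1):
    """Wafer area (mm²) fabricated per good unit, assuming dies are
    tested individually and only bad dies are discarded."""
    return dies_per_unit * area_mm2 / die_yield


# Calibrate on the chiplet figure: 100 mm² die at ~60% yield.
d0 = defect_density_from_yield(100.0, 0.60)

mono_yield = poisson_yield(500.0, d0)  # same as 0.6 ** 5
mono_silicon = silicon_per_good_unit(500.0, mono_yield)
chiplet_silicon = silicon_per_good_unit(100.0, 0.60, dies_per_unit=5)
```

Under these assumptions the monolithic die needs several times more fabricated silicon per good system than the five-chiplet version, which is the core of the economic case.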
A passive silicon interposer acts as a routing substrate between chiplets, enabling much higher density connections than an organic package substrate.
Technology                Bump Pitch    Bandwidth/mm    Example
──────────────────────────────────────────────────────────────────────────
Organic substrate         130–400µm     ~1 GB/s/mm      AMD EPYC (early)
2.5D Silicon interposer   40–100µm      ~10 GB/s/mm     Xilinx/AMD FPGA SLR
CoWoS (TSMC)              10–45µm       ~50 GB/s/mm     NVIDIA H100, AMD MI300X
3D SoIC (TSMC)            0.5–9µm       ~1 TB/s/mm      AMD 3D V-Cache (SoIC stacking)
4.2 3D Stacking — HBM and Logic-Memory
3D HBM Stack — NVIDIA H100 GPU Equivalent
Side-view cross-section:
HBM Stack (SK Hynix / Micron):
┌───────────────────────────┐ ← DRAM die #8 (top)
├───────────────────────────┤ ← DRAM die #7
├───────────────────────────┤ ← DRAM die #6
├───────────────────────────┤ ← DRAM die #5
├───────────────────────────┤ ← DRAM die #4
├───────────────────────────┤ ← DRAM die #3
├───────────────────────────┤ ← DRAM die #2
├───────────────────────────┤ ← DRAM die #1
├───────────────────────────┤ ← Base die (logic/PHY)
│ ↕ TSV (Through-Silicon) │ ← 1000s of TSVs connecting all layers
└───────────────────────────┘
│ µbumps
┌───────────────────────────┐
│ Silicon Interposer │ ← Routes HBM ↔ GPU compute die
└───────────────────────────┘
│ µbumps
┌───────────────────────────┐
│ GPU Compute Die (5nm) │ ← Shown below for connectivity only; physically it sits beside the HBM stack on the interposer
└───────────────────────────┘
H100 specs:
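5 active HBM3 stacks, 8 DRAM dies each
Total DRAM: 80GB (16GB per stack)
Interface width per stack: 1,024 bits (carried over thousands of TSVs)
Memory bandwidth: 3.35 TB/s (for comparison, a single DDR5-4800 channel peaks at 38.4 GB/s)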
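The bandwidth figure follows from stack count, interface width, and per-pin data rate. The sketch below takes all three as parameters. The example call uses five stacks with a 1,024-bit interface each, which is an assumption based on NVIDIA's published H100 SXM configuration rather than something derived here, and the function names are hypothetical.

```python
def aggregate_bandwidth_gb_s(stacks, bits_per_stack, pin_rate_gbps):
    """Total memory bandwidth in GB/s: stacks * bus width * per-pin rate."""
    return stacks * bits_per_stack * pin_rate_gbps / 8.0


def required_pin_rate_gbps(total_gb_s, stacks, bits_per_stack):
    """Per-pin data rate (Gbps) needed to reach a target bandwidth."""
    return total_gb_s * 8.0 / (stacks * bits_per_stack)


# Assumed configuration: 5 stacks x 1,024 bits, target 3.35 TB/s = 3350 GB/s.
pin_rate = required_pin_rate_gbps(3350.0, stacks=5, bits_per_stack=1024)

# One DDR5-4800 channel for comparison: 64 data bits at 4.8 Gbps per pin.
ddr5_channel = aggregate_bandwidth_gb_s(1, 64, 4.8)
```

With these inputs each pin runs at a little over 5 Gbps. The bandwidth advantage comes from the very wide interface that TSVs and the interposer make practical, not from unusually fast signaling.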
5. Die-to-Die Interconnect Standards
UCIe — Universal Chiplet Interconnect Express
UCIe is an open standard for die-to-die interfaces, analogous to PCIe for board-level interconnect. Its goal is that compliant chiplets from different vendors, such as Intel, AMD, Qualcomm, and Arm-based designs, can interoperate inside one package.
UCIe Specification Summary:
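Physical layer variants (UCIe 1.0):
Standard package (organic substrate): up to 32 GT/s per lane, 100–130µm bump pitch, 16 lanes per module
Advanced package (silicon interposer or bridge, e.g., CoWoS): up to 32 GT/s per lane, 25–55µm bump pitch, 64 lanes per module
Bandwidth calculation (raw, per direction; lane counts are illustrative multi-module configurations):
Standard at 16 Gbps/lane: 16 Gbps × 64 lanes = 1,024 Gbps ≈ 1 Tbps
Advanced at 32 Gbps/lane: 32 Gbps × 256 lanes = 8,192 Gbps ≈ 8 Tbps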
Latency:
Die-to-die: ~2ns (vs 10–100ns for a chip-to-chip PCIe link)
Comparable to on-chip NoC latency!
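Protocol stack (three layers, top to bottom):
Protocol Layer: PCIe, CXL, or a streaming (raw) protocol
Die-to-Die Adapter: CRC and retry, link state management, parameter negotiation
Physical Layer: bumps, electrical signaling, forwarded clock, link training, sideband (parallel clock-forwarded lanes rather than a SerDes with retimers)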
UCIe consortium members: Intel, AMD, Qualcomm, Arm,
TSMC, Samsung, ASE (packaging)
Adoption: UCIe 1.0 was published in March 2022; the consortium's founding promoters also include Google Cloud, Meta, and Microsoft
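The bandwidth lines in the summary are a straight multiplication of per-lane rate by lane count. A minimal sketch follows, assuming raw signaling rate only: protocol and encoding overhead are ignored, and the function names are hypothetical.

```python
def raw_bandwidth_gbps(rate_gbps_per_lane, lanes):
    """Raw per-direction link bandwidth in Gbps."""
    return rate_gbps_per_lane * lanes


def gbps_to_gbytes_per_s(gbps):
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8.0


standard_gbps = raw_bandwidth_gbps(16, 64)    # 1,024 Gbps, about 1 Tbps
advanced_gbps = raw_bandwidth_gbps(32, 256)   # 8,192 Gbps, about 8 Tbps

standard_gbytes = gbps_to_gbytes_per_s(standard_gbps)
advanced_gbytes = gbps_to_gbytes_per_s(advanced_gbps)
```

Expressed in bytes, the two configurations come to 128 GB/s and 1,024 GB/s per direction, which is easier to compare against memory bandwidth figures.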
6. Inter-Die Timing Closure
Timing paths that cross die boundaries require special treatment. The die-to-die link adds latency and uncertainty that must be budgeted into the system timing model.
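A first step in budgeting a cross-die path is to convert the link latency into clock cycles, since a ~2ns die-to-die hop cannot fit in a single 286ps cycle. The sketch below assumes the ~2ns latency and 3.49GHz clock quoted earlier in this document. Reusing the 41ps clock uncertainty as extra margin is an assumption for illustration, and the function names are hypothetical.

```python
import math


def clock_period_ps(freq_ghz):
    """Clock period in picoseconds for a frequency in GHz."""
    return 1000.0 / freq_ghz


def link_latency_cycles(latency_ps, freq_ghz, uncertainty_ps=0.0):
    """Whole clock cycles that must be budgeted for a die-to-die hop."""
    return math.ceil((latency_ps + uncertainty_ps) / clock_period_ps(freq_ghz))


# ~2 ns link at 3.49 GHz, without and with an added uncertainty margin.
nominal_cycles = link_latency_cycles(2000.0, 3.49)
guarded_cycles = link_latency_cycles(2000.0, 3.49, uncertainty_ps=41.0)
```

The nominal case rounds up to 7 cycles, and adding the margin tips it to 8. The system timing model would therefore treat the crossing as a pipelined, multi-cycle transfer rather than a single-cycle path.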