What are the main sources of power consumption in VLSI chips?

VLSI power has three main components: dynamic switching power (C×V²×f×α, from charging/discharging capacitances), short-circuit power (from brief PMOS+NMOS conduction during transitions), and static leakage power (subthreshold and gate leakage even when idle).

What is clock gating in chip design?

Clock gating disables the clock to idle register groups using an AND gate or ICG cell. When the enable signal is low, the clock stops toggling the downstream flip-flops, eliminating their switching power — typically saving 20–40% of total chip power.

What is DVFS in mobile chips?

DVFS (Dynamic Voltage and Frequency Scaling) lowers both supply voltage and clock frequency when full performance is not needed. Since dynamic power scales with V² × f, dropping the voltage from 1.0V to 0.8V cuts dynamic power by 36% from the voltage term alone, and the accompanying frequency reduction lowers it further.

Physical Design Day 8 — Power Optimization & Thermal Management

1. Power Consumption Sources in VLSI

Every transistor switching, leaking, or conducting contributes to chip power. Modern VLSI designers spend enormous effort on power optimization because:

Data center chips drive electricity costs of millions of dollars/year
Mobile chips must achieve days of battery life from a small cell
IoT devices may run years from coin cells
Thermal limits cap performance if power density is too high

Total Power = Dynamic Power + Short-Circuit Power + Static Leakage

1. Dynamic (Switching) Power:
   P_dynamic = α × C × V_DD² × f
     α    = activity factor (fraction of clock cycles in which a node switches)
     C    = total switched capacitance (load cap + wire cap)
     V_DD = supply voltage
     f    = clock frequency

   Example: 5nm chip core
     α    = 0.15 (15% average switching activity)
     C    = 2nF total (all gates + wires)
     V_DD = 0.8V
     f    = 3GHz
     P    = 0.15 × 2e-9 × (0.8)² × 3e9 = 576mW

2. Short-Circuit Power:
   P_sc = I_sc × V_DD
     I_sc = current that flows while both PMOS and NMOS briefly conduct during a transition
     Typical: 5–15% of dynamic power (can be minimized with sharp input slews)

3. Static (Leakage) Power:
   P_leak = I_leak × V_DD
     I_leak = sum of all transistor leakage currents (subthreshold and gate leakage, even when OFF)
     In 5nm: leakage can be 30–50% of total power at idle!
     Dominant in advanced nodes due to thin gate oxide and low Vt
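The three components can be checked numerically. The sketch below is a minimal illustration, not a sign-off calculation. The function names, the 10% short-circuit fraction, and the 200mA leakage current are assumptions chosen for illustration. It uses a total switched capacitance of 2nF, which is the value that yields 576mW with the other example parameters.

```python
def dynamic_power(alpha, cap_farads, vdd, freq_hz):
    """Switching power in watts: alpha * C * VDD^2 * f."""
    return alpha * cap_farads * vdd ** 2 * freq_hz


def power_breakdown(alpha, cap_farads, vdd, freq_hz, sc_fraction, i_leak_amps):
    """Return the three power components (in watts) and their total."""
    p_dyn = dynamic_power(alpha, cap_farads, vdd, freq_hz)
    p_sc = sc_fraction * p_dyn      # short-circuit power modelled as a fraction of dynamic
    p_leak = i_leak_amps * vdd      # static leakage: I_leak * VDD
    return {
        "dynamic": p_dyn,
        "short_circuit": p_sc,
        "leakage": p_leak,
        "total": p_dyn + p_sc + p_leak,
    }


# Example parameters from the text; sc_fraction and i_leak_amps are assumed values.
breakdown = power_breakdown(
    alpha=0.15, cap_farads=2e-9, vdd=0.8, freq_hz=3e9,
    sc_fraction=0.10, i_leak_amps=0.2,
)
for name, watts in breakdown.items():
    print(f"{name:>13}: {watts * 1e3:7.1f} mW")
```

Changing `cap_farads` to 2e-12 (2pF) drops the dynamic term to about 0.58mW, which shows how sensitive the estimate is to the capacitance unit.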

2. Dynamic Power Reduction Techniques

2.1 Clock Gating

Often the single highest-impact power optimization in a design. When a group of flip-flops doesn't need to update, gating their clock removes the switching power of the clock pins and the flip-flop internals, and because the flip-flop outputs then hold steady, the combinational logic they drive stops toggling as well.

Clock Gating — Integrated Clock Gate (ICG) Cell

WITHOUT Clock Gating (always toggling):

  CLK ─────┬──────────────────────────────────► FF1 (always switches)
           │
           └──────────────────────────────────► FF2 (always switches)

  Every cycle: both FFs toggle → wasted power if data is unchanged

WITH Clock Gating (ICG cell):

  CLK ─────────────────┐
                      AND ───────────┬────────► FF1 (only when EN=1)
  EN ─────[LATCH]──────┘             └────────► FF2 (only when EN=1)
                       ←─ ICG cell

  ICG = Integrated Clock Gating cell (latch + AND, glitch-free)

Power savings calculation:
  Clock gating ratio = fraction of cycles when EN=0
  If 70% of cycles gate OFF:
    Power saved = 0.70 × P_dynamic(FFs)
  For a 500mW design where FFs = 30%:
    FF power = 150mW
    Gated savings = 70% × 150mW = 105mW ← significant!

Hierarchical gating (coarse + fine):
  Block-level gate: turns off entire CPU core when idle
  Register-level gate: turns off specific register banks
  Cell-level gate: finest granularity, max savings but area overhead
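The savings arithmetic above can be sketched in a few lines. This is a first-order estimate only: the function name is hypothetical, and it ignores the power of the ICG cells themselves.

```python
def clock_gating_savings(total_mw, ff_fraction, gated_ratio):
    """First-order estimate of power saved by gating flip-flop clocks.

    total_mw    : total design power in mW
    ff_fraction : share of total power spent in flip-flops (0..1)
    gated_ratio : fraction of cycles in which EN=0 (0..1)
    """
    ff_mw = total_mw * ff_fraction
    saved_mw = ff_mw * gated_ratio
    return ff_mw, saved_mw, total_mw - saved_mw


ff_mw, saved_mw, remaining_mw = clock_gating_savings(500.0, 0.30, 0.70)
print(f"FF power:  {ff_mw:.0f} mW")
print(f"Saved:     {saved_mw:.0f} mW")
print(f"Remaining: {remaining_mw:.0f} mW")
```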

2.2 Operand Isolation

When a functional block output is ignored (e.g., multiplier result not used this cycle), its input operands can be frozen to prevent unnecessary switching through the entire datapath.

Operand Isolation — Multiplier Example

Without isolation (WASTED power every cycle):

  A[31:0] ─────────────────────────────────► MULT (64-bit multiply)
  B[31:0] ─────────────────────────────────► MULT
                                              ↑ switches even when result unused

With operand isolation (save ~40% mult power):

  A[31:0] ─────[AND_32bit]─────────────────► MULT
  B[31:0] ─────[AND_32bit]─────────────────► MULT
                    │
  EN ───────────────┘ (EN=1 only when result needed)

  When EN=0: all operand inputs clamped to 0
  Multiplier internal nodes don't toggle
  Power saved: ~40% of multiplier's dynamic power
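A toy simulation can illustrate why clamping the operands helps. The sketch below counts bit toggles on one 32-bit multiplier input, with and without the AND-based isolation. It is a proxy only: it does not model the multiplier internals. The random operand stream, the 30% enable duty cycle, and all names are assumptions for illustration.

```python
import random


def count_toggles(values):
    """Total number of bit flips between consecutive values."""
    return sum(bin(a ^ b).count("1") for a, b in zip(values, values[1:]))


def simulate(cycles=1000, en_duty=0.3, width=32, seed=0):
    """Return (toggles without isolation, toggles with isolation)."""
    rng = random.Random(seed)
    raw, isolated = [], []
    for _ in range(cycles):
        operand = rng.getrandbits(width)       # operand changes every cycle (worst case)
        en = rng.random() < en_duty            # result needed only part of the time
        raw.append(operand)
        isolated.append(operand if en else 0)  # AND gate: clamp to 0 when EN=0
    return count_toggles(raw), count_toggles(isolated)


raw_toggles, iso_toggles = simulate()
print(f"input toggles without isolation: {raw_toggles}")
print(f"input toggles with isolation:    {iso_toggles}")
print(f"ratio: {iso_toggles / raw_toggles:.2f}")
```

In this model the input still toggles whenever EN changes state, so the reduction in input activity is smaller than the fraction of idle cycles.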

3. Voltage Scaling — DVFS

Dynamic Voltage and Frequency Scaling (DVFS) simultaneously reduces voltage and frequency when maximum performance is not needed. Because voltage enters the dynamic power equation squared, it is one of the most powerful techniques for power reduction:

Power Scaling With DVFS:

  P_dynamic ∝ V_DD² × f

  If V drops from 1.0V to 0.7V and f drops proportionally:
    Power ratio = (0.7/1.0)² × (0.7/1.0) = 0.49 × 0.7 = 0.343
    Power savings = 65.7% (enormous!)

Example DVFS operating points (Apple A17-class efficiency core, illustrative values):
  High performance: 2.4GHz @ 1.05V → P = 1.0 (normalized)
  Active balanced:  1.8GHz @ 0.90V → P = 0.56
  Low power:        1.2GHz @ 0.80V → P = 0.30
  Ultra low:        600MHz @ 0.70V → P = 0.13 (7.7× savings!)

DVFS implementation requires:
  1. Voltage regulator (on-chip, or a PMIC on a separate die or co-packaged)
  2. OS/firmware voltage table (each entry is an OPP, Operating Performance Point)
  3. Response time: voltage change takes 5–20µs (fast enough to track workload changes)
  4. Guardband: extra voltage margin to cover transition uncertainty
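The V² × f scaling can be applied to each operating point directly. The sketch below is a minimal illustration with a hypothetical helper name; it models dynamic power only. Under that assumption the lower points come out slightly below the normalized figures listed above (about 0.55, 0.29, and 0.11), which suggests the listed values include terms that do not scale as V² × f, such as leakage.

```python
def relative_dynamic_power(vdd, freq_ghz, vdd_ref, freq_ref_ghz):
    """Dynamic power relative to a reference point, using P ~ VDD^2 * f."""
    return (vdd / vdd_ref) ** 2 * (freq_ghz / freq_ref_ghz)


# (name, frequency in GHz, voltage in V), taken from the operating points above.
opps = [
    ("High performance", 2.4, 1.05),
    ("Active balanced", 1.8, 0.90),
    ("Low power", 1.2, 0.80),
    ("Ultra low", 0.6, 0.70),
]

_, f_ref, v_ref = opps[0]  # normalize against the highest operating point
for name, f_ghz, vdd in opps:
    p_rel = relative_dynamic_power(vdd, f_ghz, v_ref, f_ref)
    print(f"{name:<17} {f_ghz:.1f} GHz @ {vdd:.2f} V -> P = {p_rel:.2f}")
```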

DVFS Voltage & Frequency Over Time (Mobile Workload)

[Figure: three stacked plots against time (0–15 s) for a mobile workload: clock frequency (0–3.0 GHz), supply voltage (0.6–1.1 V), and power (0–8 W).]

Scenario: Mobile app launch → active use → background → idle
Peak: 8W (launch burst), Idle: 0.8W
Average power: ~2.5W → 7+ hours on 20Wh battery

4. Multi-Voltage Power Domains

Different chip blocks need different voltages. Running everything at maximum voltage wastes power. Multi-voltage design assigns the minimum necessary voltage to each block.

Power Domain	Block Type	Voltage (typical)	Frequency	Power Savings vs Max V
VDD_CPU_HIGH	Performance CPU cores	1.0–1.1V	3.0–4.0GHz	Baseline
VDD_CPU_LOW	Efficiency CPU cores	0.8–0.9V	1.5–2.5GHz	35–50%
VDD_GPU	GPU shaders	0.85–0.95V	1.5–2.0GHz	25–40%
VDD_MEM	L2/L3 cache SRAM	0.75–0.85V	2.0GHz	40–55%
VDD_IO	I/O pads, PHY blocks	1.8V	N/A	Required for interface
VDD_ALWAYS_ON	Power controller, RTC	0.7–0.8V	32kHz–100MHz	70%+ (lowest V)

Level Shifter — Required at Voltage Domain Boundary

        VDD_HIGH (1.1V)           │          VDD_LOW (0.8V)
                                  │
  Signal A ───────────────────────│────────────────────► FF_B
  (Logic 1 = 1.1V)                │          (Needs logic 1 = 0.8V)
                                  │

  Without level shifter: FAIL!
  1.1V driving 0.8V input might overstress gate oxide or violate I/O spec

WITH level shifter:

  Signal A ─────────────────────[LS]─────────────────────► FF_B
  (1.1V swing)                    │                  (0.8V swing)
                     Level Shifter cell from cell library

Level shifter types:
  Low-to-High (LH): 0.8V → 1.1V (signal crossing to higher domain)
  High-to-Low (HL): 1.1V → 0.8V (signal crossing to lower domain)

Required at EVERY net crossing domain boundary!
  Area overhead: ~20–30 µm² per level shifter instance
  Timing overhead: 20–50ps (adds to path delay at domain crossing)
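The rule that every domain-crossing net needs a level shifter of the right direction can be expressed as a simple check. The sketch below is illustrative only: the `Crossing` type, the net names, and the 10mV tolerance are hypothetical, and a real flow would derive this from the power-intent description rather than a hand-written list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Crossing:
    net: str
    driver_vdd: float  # supply of the driving domain, in volts
    load_vdd: float    # supply of the receiving domain, in volts


def shifter_type(crossing: Crossing, tolerance: float = 0.01) -> Optional[str]:
    """Return 'LH', 'HL', or None when both ends share the same voltage."""
    delta = crossing.load_vdd - crossing.driver_vdd
    if abs(delta) <= tolerance:
        return None                      # same domain voltage: no shifter needed
    return "LH" if delta > 0 else "HL"   # low-to-high or high-to-low


# Hypothetical nets for illustration.
crossings = [
    Crossing("cpu_to_mem_req", 1.1, 0.8),
    Crossing("mem_to_cpu_ack", 0.8, 1.1),
    Crossing("cpu_internal", 1.1, 1.1),
]
for c in crossings:
    kind = shifter_type(c)
    print(f"{c.net:<16} {c.driver_vdd:.1f}V -> {c.load_vdd:.1f}V : {kind or 'no shifter'}")
```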

5. Power Gating — Retention Registers

Power gating cuts VDD entirely to idle blocks. It removes nearly all leakage power in the gated block, which is a dominant concern in advanced nodes. But state must be preserved and restored correctly.

Power Gating Implementation:

Standard block with power gating:
  VDD_MAIN ─── HEADER_SWITCH ─── VDD_VIRTUAL ─── Logic

  HEADER_SWITCH: Large PMOS transistor controlled by power controller
    ON (normal): PMOS conducting, VDD_VIRTUAL ≈ VDD_MAIN - V_drop
    OFF (sleep): PMOS cut off, entire block loses power

State retention options:

  Option 1: Save state to memory before power down, restore on wake
    Wake latency: 1–10µs (read all registers back from SRAM)
    Use case: deep sleep (seconds–minutes off)

  Option 2: Retention flip-flops (shadow latch)
    Retention FF = normal FF + always-on shadow latch
    Sleep: data saved to shadow latch (tiny always-on power)
    Wake: shadow latch restores main FF in 1 clock cycle
    Wake latency: ~1ns ← fast!
    Use case: fine-grained power gating (microseconds off)

Leakage savings from power gating (7nm):
  Active: 500µW leakage per 100K gates
  Gated: 5µW (header + retention overhead only)
  Savings: 99% leakage reduction on gated block!
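The per-100K-gate leakage figures above scale linearly with block size, which makes the savings easy to estimate. The sketch below assumes that linear scaling. The function names and the 2M-gate block are hypothetical.

```python
def block_leakage_uw(gate_count, uw_per_100k_gates):
    """Leakage of a block in µW, scaled linearly from a per-100K-gate figure."""
    return gate_count / 100_000 * uw_per_100k_gates


def gating_savings(gate_count, active_uw_per_100k=500.0, gated_uw_per_100k=5.0):
    """Return (active leakage µW, gated leakage µW, percent saved)."""
    active = block_leakage_uw(gate_count, active_uw_per_100k)
    gated = block_leakage_uw(gate_count, gated_uw_per_100k)
    saved_pct = 100.0 * (active - gated) / active
    return active, gated, saved_pct


# Hypothetical 2M-gate block using the 7nm figures quoted above.
active_uw, gated_uw, pct = gating_savings(2_000_000)
print(f"active: {active_uw / 1000:.1f} mW, gated: {gated_uw / 1000:.2f} mW, saved: {pct:.0f}%")
```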

6. Thermal Management

All power consumed by the chip is dissipated as heat. Heat causes timing slowdown, reliability failures, and in extreme cases thermal runaway. Physical design must ensure no thermal hotspot exceeds the junction temperature limit.

Thermal Heatmap (Die Top-Down View)

Temperature distribution (°C) at 5W total power, 25°C ambient:

                 CPU Core 0 CPU Core 1    GPU Shader Array
                ┌─────────┐┌─────────┐┌───────────────────────┐
Top of die:     │  105°C  ││  103°C  ││         108°C         │ ← HOT!
                │ Core 0  ││ Core 1  ││    (high activity)    │
                └─────────┘└─────────┘└───────────────────────┘
                     ▲          ▲                 ▲
                     │ Heat flow through Si → package → ambient
                ┌─────────────────────────────────────────────┐
Memory (L3):    │                   85–90°C                   │
                └─────────────────────────────────────────────┘
                ┌──────────────┐ ┌────────────────────────────┐
I/O blocks:     │     75°C     │ │            70°C            │
                └──────────────┘ └────────────────────────────┘

Thermal limits:
  Junction temperature limit: 125°C (commercial) / 150°C (automotive)
  Thermal resistance die→ambient (θ_JA): ~3–15 °C/W depending on package
  Thermal budget: T_junction = T_ambient + P_chip × θ_JA
  Example: 85°C + 5W × 8°C/W = 125°C ← right at limit!

Hotspot mitigation strategies:
  1. Spread high-activity blocks across die (don't cluster)
  2. Reduce utilization near hot blocks (more whitespace lowers local power density)
  3. Use thermal vias (metal-filled vias conduct heat vertically)
  4. Co-design with package (heatspreader directly over hotspot)
  5. Dynamic thermal management: throttle core if T > 110°C
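The thermal budget equation and the throttling rule can be combined into a small check. This is a sketch under the lumped θ_JA model above, which gives an average junction temperature and does not capture local hotspots. The function names are hypothetical.

```python
def junction_temp_c(t_ambient_c, power_w, theta_ja_c_per_w):
    """Lumped estimate: T_junction = T_ambient + P_chip * theta_JA."""
    return t_ambient_c + power_w * theta_ja_c_per_w


def max_power_w(t_limit_c, t_ambient_c, theta_ja_c_per_w):
    """Largest chip power that keeps the junction at or below the limit."""
    return (t_limit_c - t_ambient_c) / theta_ja_c_per_w


def should_throttle(t_junction_c, threshold_c=110.0):
    """Dynamic thermal management rule: throttle above the threshold."""
    return t_junction_c > threshold_c


# Example from the text: 85°C ambient, 5W, 8°C/W.
tj = junction_temp_c(85.0, 5.0, 8.0)
print(f"T_junction = {tj:.0f}°C, throttle: {should_throttle(tj)}")
print(f"Power budget at 125°C limit: {max_power_w(125.0, 85.0, 8.0):.1f} W")
```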

Thermal-Aware Placement

EDA tools can perform thermal-aware placement by modeling power density per tile and adjusting cell placement to distribute heat more evenly.

Block	Power Density	Placement Strategy	Thermal Impact
CPU cores (active)	0.5 W/mm²	Spread across die quadrants	+20°C over baseline
GPU shaders	0.4 W/mm²	Spread in array, not clustered	+18°C
L2 cache	0.15 W/mm²	Wrap around CPU cores	+6°C
L3 cache	0.08 W/mm²	Die periphery (cooler zone)	+3°C
I/O PHYs	0.05 W/mm²	Die edge (near thermal path)	+2°C

7. Power Analysis Tools and Flow

Power Sign-Off Flow:

1. Activity generation:
   Run RTL simulation with representative workloads
   Capture switching activity (VCD file, i.e. value change dump)

2. Gate-level power analysis:
   Apply VCD to gate netlist
   Power tools: Synopsys PrimePower, Cadence Joules, Mentor PowerPro

3. Components analyzed:
   a) Cell internal power (from liberty models: rise/fall energy)
   b) Net switching power (C × V² × f × α per net)
   c) Leakage power (Iddq from liberty models, temperature-dependent)

4. Power report example:
   ──────────────────────────────────────────────────────
   Block          Dynamic(mW)   Leakage(mW)   %Total
   ──────────────────────────────────────────────────────
   CPU_P-core        1,200          280        39.5%
   CPU_E-core          400           80        12.8%
   GPU                 800          200        26.7%
   Memory Ctrl         150           40         5.1%
   L2 Cache             80           30         2.9%
   L3 Cache            120           50         4.5%
   I/O & PHY           200           20         5.9%
   Misc                 80           20         2.7%
   ──────────────────────────────────────────────────────
   TOTAL             3,030          720        100%

   Peak = 3,750mW
   Average (mix) = 2,100mW
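A report like this is easy to sanity-check by recomputing the totals and each block's share from the dynamic and leakage columns. The sketch below does that. The dictionary layout and function name are assumptions, not the output format of any particular power tool.

```python
# (dynamic mW, leakage mW) per block, taken from the example report.
REPORT_MW = {
    "CPU_P-core": (1200, 280),
    "CPU_E-core": (400, 80),
    "GPU": (800, 200),
    "Memory Ctrl": (150, 40),
    "L2 Cache": (80, 30),
    "L3 Cache": (120, 50),
    "I/O & PHY": (200, 20),
    "Misc": (80, 20),
}


def summarize(report):
    """Return (total dynamic, total leakage, {block: percent of total})."""
    total_dyn = sum(dyn for dyn, _ in report.values())
    total_leak = sum(leak for _, leak in report.values())
    total = total_dyn + total_leak
    shares = {name: 100.0 * (dyn + leak) / total for name, (dyn, leak) in report.items()}
    return total_dyn, total_leak, shares


total_dyn, total_leak, shares = summarize(REPORT_MW)
for name, pct in shares.items():
    print(f"{name:<12} {pct:5.1f}%")
print(f"TOTAL dynamic = {total_dyn} mW, leakage = {total_leak} mW, "
      f"peak = {total_dyn + total_leak} mW")
```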

8. Production Power Sign-Off Checklist

Power Optimization Checklist

✅ Power budget defined: Total chip TDP and per-block budget allocated
✅ Clock gating coverage ≥ 85%: Most register groups have ICG cells
✅ Operand isolation: All major datapaths have input isolation when idle
✅ DVFS OPP table complete: All voltage/frequency operating points validated
✅ Multi-voltage verified: Level shifters at all domain crossings, isolation cells inserted
✅ Power gating implemented: Retention FFs or SRAM save/restore for all gated domains
✅ PDN IR drop analysis: Static + dynamic IR drop within 10% of supply
✅ Thermal hotspot analysis: No zone exceeds 125°C at max workload
✅ Thermal-aware placement: High-power blocks distributed across die
✅ VCD-based power analysis: Representative workload activity captured
✅ Peak power estimation: Package and PCB rated for peak current
✅ Idle/sleep power: Leakage meets always-on budget

Next — Day 9: Signal integrity and crosstalk mitigation — coupling capacitance, aggressor/victim nets, crosstalk delay, and noise-aware routing strategies.