Day 28

Power Optimization

Dynamic power (clock gating, voltage scaling), static power (leakage). Real techniques from mobile to datacenter chips.

Power Breakdown

Total Power = Dynamic Power + Static Power

Dynamic Power = α × C × V² × f
  α = activity factor (fraction of gates switching each cycle)
  C = switched capacitance (load)
  V = supply voltage
  f = clock frequency

Static Power = I_leakage × V
  Dominated by subthreshold leakage in modern tech nodes
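As a quick numeric check of the formulas above, here is a minimal Python sketch. The function names and the sample values (activity factor 0.2, 1 nF of switched capacitance, 10 mA of leakage) are illustrative assumptions, not measurements from any real chip.

```python
def dynamic_power(alpha, capacitance_f, voltage_v, freq_hz):
    """Dynamic power in watts: alpha * C * V^2 * f."""
    return alpha * capacitance_f * voltage_v ** 2 * freq_hz


def static_power(leakage_a, voltage_v):
    """Static power in watts: I_leakage * V."""
    return leakage_a * voltage_v


def total_power(alpha, capacitance_f, voltage_v, freq_hz, leakage_a):
    """Total power is the sum of the dynamic and static terms."""
    return (dynamic_power(alpha, capacitance_f, voltage_v, freq_hz)
            + static_power(leakage_a, voltage_v))


# Hypothetical operating point (assumed values, for illustration only).
p_dyn = dynamic_power(alpha=0.2, capacitance_f=1e-9, voltage_v=1.0, freq_hz=2e9)
p_stat = static_power(leakage_a=0.01, voltage_v=1.0)
print(f"dynamic = {p_dyn:.3f} W, static = {p_stat:.3f} W")

# Halving the voltage at the same frequency cuts the dynamic term to a quarter.
p_dyn_half_v = dynamic_power(alpha=0.2, capacitance_f=1e-9, voltage_v=0.5, freq_hz=2e9)
print(f"dynamic at half voltage = {p_dyn_half_v:.3f} W")
```

The V² term is why voltage is the strongest lever here: frequency and activity only scale the dynamic term linearly.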

Technique 1: Clock Gating

Don't clock logic that's not computing

Without gating:
- Every flip-flop's clock input toggles every cycle (even if its data doesn't change)
- Clock power ∝ f × number of flip-flops

With gating:
- An enable signal ("is new data incoming?") is ANDed with the clock
- If no new data is incoming, the clock to those flip-flops is held off
- Saves ~40% dynamic power in typical designs
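The effect can be sketched with a toy cycle-level model in Python. This is a behavioural illustration only, not RTL: `count_clock_events`, the 32-flop bank, and the enable trace are all hypothetical, and the model counts clock edges delivered to the flops as a stand-in for clock power.

```python
def count_clock_events(enable_trace, num_flops, gated):
    """Count clock edges delivered to a bank of flip-flops.

    enable_trace: one bool per cycle, True when new data arrives.
    gated: if True, the clock reaches the flops only on enabled cycles.
    """
    events = 0
    for enabled in enable_trace:
        if not gated or enabled:
            events += num_flops
    return events


# Hypothetical trace: new data arrives on 4 of 10 cycles.
trace = [True, False, False, True, False, True, False, False, False, True]

ungated = count_clock_events(trace, num_flops=32, gated=False)
gated = count_clock_events(trace, num_flops=32, gated=True)
saving = 1 - gated / ungated

print(f"ungated: {ungated} edges, gated: {gated} edges, saving: {saving:.0%}")
```

The saving here applies only to clock activity in this one bank. How much of a chip's total dynamic power that represents depends on how often the enable is low and on what share of the power budget the clock network takes.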

Technique 2: Voltage Scaling

Lower voltage = quadratic power reduction, but slower clock

Voltage | Power (relative) | Freq Possible | Use Case
1.0V (nominal) | 1.0× | 2.0 GHz | Peak performance
0.9V | 0.73× | 1.8 GHz | Typical
0.8V | 0.57× | 1.5 GHz | Low power
0.6V (near threshold) | 0.30× | 0.5 GHz | Battery mode
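To see where the quadratic saving comes from, the sketch below applies the dynamic-power relation P ∝ V² × f to the voltage and frequency pairs in the table. It models dynamic power only, with activity and capacitance assumed constant, so its output need not match the table's relative-power column, which may include leakage or other effects. The function name is an assumption made for illustration.

```python
NOMINAL_V = 1.0
NOMINAL_F_GHZ = 2.0


def relative_dynamic_power(voltage_v, freq_ghz):
    """Dynamic power relative to nominal, assuming P scales as V^2 * f."""
    return (voltage_v / NOMINAL_V) ** 2 * (freq_ghz / NOMINAL_F_GHZ)


# (voltage, frequency) pairs taken from the table above.
operating_points = [(1.0, 2.0), (0.9, 1.8), (0.8, 1.5), (0.6, 0.5)]

for v, f in operating_points:
    rel = relative_dynamic_power(v, f)
    print(f"{v:.1f} V @ {f:.1f} GHz -> {rel:.2f}x dynamic power")
```

Under this model the 0.9V point works out to about 0.73×, while the lower-voltage points come out below the table's figures, which is what you would expect if static power does not shrink as fast as the dynamic term.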

Apple A17 Approach

Dynamic voltage and frequency scaling (DVFS):
- Running heavy inference: 1.0V @ 2.0 GHz → 2W
- Running light task: 0.7V @ 0.5 GHz → 100 mW
- Idle (blocks power-gated): <1 mW

The controller monitors the workload and adjusts V/f about 100× per second.
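A DVFS control loop of this kind can be sketched as a simple threshold governor. This is a generic illustration, not Apple's actual controller: the operating-point table, the 0.8 and 0.3 utilization thresholds, and all names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    voltage_v: float
    freq_ghz: float


# Hypothetical V/f table, ordered from lowest to highest performance.
OPERATING_POINTS = [
    OperatingPoint(0.7, 0.5),
    OperatingPoint(0.9, 1.8),
    OperatingPoint(1.0, 2.0),
]


def select_operating_point(utilization, current_index, up=0.8, down=0.3):
    """Step up one point when busy, down one when mostly idle, else hold."""
    if utilization > up and current_index < len(OPERATING_POINTS) - 1:
        return current_index + 1
    if utilization < down and current_index > 0:
        return current_index - 1
    return current_index


def run_governor(utilization_samples, start_index=0):
    """Apply the policy once per sample; return the chosen index per step."""
    index = start_index
    history = []
    for utilization in utilization_samples:
        index = select_operating_point(utilization, index)
        history.append(index)
    return history


# One sample per control interval (e.g. every 10 ms at 100 updates/second).
samples = [0.1, 0.9, 0.95, 0.9, 0.5, 0.2, 0.1]
history = run_governor(samples)
print(history)
```

Stepping one point at a time, with a gap between the up and down thresholds, keeps the governor from oscillating when the load hovers near a single threshold.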

Technique 3: Precision Reduction

Lower-precision arithmetic uses smaller multipliers → less power

Precision | Multiplier Area | Energy/MAC | Delay
INT8 | 64 gates | 1.0 pJ | 1.0 ns
INT16 | 256 gates | 2.5 pJ | 1.2 ns
INT32 | 1024 gates | 5 pJ | 1.5 ns
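As a worked example, the per-MAC figures in the table (given in pJ) can be scaled up to a whole layer. The layer size of one million MACs and the function name are assumptions chosen for illustration.

```python
# Per-MAC energy in picojoules, taken from the table above.
ENERGY_PER_MAC_PJ = {"INT8": 1.0, "INT16": 2.5, "INT32": 5.0}


def layer_energy_uj(num_macs, precision):
    """Energy for num_macs multiply-accumulates, in microjoules."""
    picojoules = num_macs * ENERGY_PER_MAC_PJ[precision]
    return picojoules * 1e-6  # 1 pJ = 1e-6 uJ


# Hypothetical layer with one million MACs.
macs = 1_000_000
for precision in ("INT8", "INT16", "INT32"):
    print(f"{precision}: {layer_energy_uj(macs, precision):.1f} uJ")

int8_vs_int32 = layer_energy_uj(macs, "INT32") / layer_energy_uj(macs, "INT8")
print(f"INT32 costs {int8_vs_int32:.1f}x the energy of INT8")
```

This counts arithmetic energy only. Narrower operands also reduce memory traffic, which the table does not capture.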

Technique 4: Power Gating

Turn off entire blocks when not needed

  • Systolic array: Can gate unused tiles if batch size < 256
  • Mobile NPU: Entire unit off when no inference needed
  • Cost: retention registers plus slow wake-up (tens of μs)
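Because wake-up is slow and switching a block off and on costs energy, gating only pays off when the block stays idle long enough. The sketch below shows one way to reason about that break-even point. The policy, the function names, and every number in it (50 mW leakage, 5 μJ transition energy, 20 μs wake latency) are assumptions for illustration.

```python
def break_even_idle_s(leakage_w, transition_energy_j):
    """Idle time at which the leakage saved equals the energy spent gating."""
    return transition_energy_j / leakage_w


def should_power_gate(expected_idle_s, leakage_w, transition_energy_j,
                      wake_latency_s):
    """Gate only if the idle period outlasts both wake-up and break-even."""
    if expected_idle_s <= wake_latency_s:
        return False
    return expected_idle_s > break_even_idle_s(leakage_w, transition_energy_j)


# Hypothetical block: 50 mW leakage, 5 uJ to save state and wake up again,
# 20 us wake-up latency.
LEAKAGE_W = 0.05
TRANSITION_J = 5e-6
WAKE_S = 20e-6

threshold = break_even_idle_s(LEAKAGE_W, TRANSITION_J)
print(f"break-even idle time: {threshold * 1e6:.0f} us")
print(should_power_gate(1e-3, LEAKAGE_W, TRANSITION_J, WAKE_S))   # 1 ms idle
print(should_power_gate(50e-6, LEAKAGE_W, TRANSITION_J, WAKE_S))  # 50 us idle
```

With these assumed numbers, a 1 ms idle period is worth gating and a 50 μs one is not.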

Real Example: Google TPU v4 Power

TPU v4 (200W sustained):
- Compute (systolic): 100W (50%)
- Memory (HBM): 50W (25%)
- Interconnect (NoC): 30W (15%)
- Control/misc: 20W (10%)

Optimizations:
- Clock gating: Save 20W (reduce activity from 70% to 50%)
- Voltage scaling: Save 10W (lower compute voltage by 0.1V)
- Sparsity: Skip zero multiplies (save 15W on sparse workloads)

Result: 200W → 155W possible (22% reduction)
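The arithmetic behind this budget is easy to check in a few lines of Python. The figures are the ones quoted above; the dictionary keys and variable names are just labels chosen for this sketch.

```python
# Power budget in watts, as quoted in the breakdown above.
budget_w = {"compute": 100, "memory": 50, "interconnect": 30, "control_misc": 20}

# Savings in watts from each optimization, as quoted above.
savings_w = {"clock_gating": 20, "voltage_scaling": 10, "sparsity": 15}

total_w = sum(budget_w.values())
optimized_w = total_w - sum(savings_w.values())
reduction = 1 - optimized_w / total_w

for block, watts in budget_w.items():
    print(f"{block}: {watts} W ({watts / total_w:.0%})")

print(f"{total_w} W -> {optimized_w} W ({reduction:.1%} reduction)")
```

The exact reduction is 22.5%, which the summary quotes as 22%. The sparsity saving applies only to sparse workloads, so 155W is a best case rather than a guaranteed figure.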

Day 29: Area & cost reduction: how to shrink silicon and manufacturing cost.