Power Breakdown
Total Power = Dynamic Power + Static Power
Dynamic Power = α × C × V² × f (the power spent switching gate capacitance)
α = activity factor (fraction of gates switching per cycle)
C = capacitance (load)
V = voltage
f = frequency
Static Power = I_leakage × V
Dominated by subthreshold leakage in modern tech nodes
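As a rough illustration of the two formulas above, the sketch below evaluates them for one made-up operating point. The values of `ALPHA`, `C_TOTAL`, `V`, `F` and `I_LEAK` are assumptions chosen only to give round numbers, not measurements of any real chip.

```python
def dynamic_power(alpha, c_farads, v_volts, f_hz):
    """Dynamic power in watts: alpha * C * V^2 * f."""
    return alpha * c_farads * v_volts ** 2 * f_hz


def static_power(i_leak_amps, v_volts):
    """Static (leakage) power in watts: I_leakage * V."""
    return i_leak_amps * v_volts


# Hypothetical values, for illustration only.
ALPHA = 0.2        # 20% of gates switch in a given cycle
C_TOTAL = 1e-9     # 1 nF of total switched capacitance
V = 1.0            # supply voltage in volts
F = 2e9            # clock frequency in Hz (2 GHz)
I_LEAK = 0.1       # total leakage current in amps

p_dyn = dynamic_power(ALPHA, C_TOTAL, V, F)
p_stat = static_power(I_LEAK, V)
p_total = p_dyn + p_stat

print(f"dynamic: {p_dyn:.2f} W, static: {p_stat:.2f} W, total: {p_total:.2f} W")
```

Because V appears squared in the dynamic term and linearly in the static term, lowering the supply voltage helps both, which is why the techniques below lean on it so heavily.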
Technique 1: Clock Gating
Don't clock logic that's not computing
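Without gating:
- Every flip-flop's clock pin toggles every cycle (even if its data doesn't change)
- Clock power ∝ f × number of flip-flops
With gating:
- An enable signal answers "is data incoming?"; a clock-gating cell (a latch plus an AND gate) combines it with the clock
- If not, the clock to those flip-flops is held off
- Can save on the order of 40% of dynamic power in designs where many registers sit idle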
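The effect of gating can be sketched with a simple count of clock events delivered to a register bank. This is a behavioural model, not RTL: the 60% utilization, the 32-flop bank and the random enable trace are all assumptions, and it captures only clock-pin activity in the gated registers, not the whole chip's dynamic power.

```python
import random


def clock_events(enables, num_flops, gated):
    """Count clock edges delivered to a bank of flip-flops over a trace.

    enables: one bool per cycle, True when new data must be captured.
    gated:   if True, the clock reaches the bank only on enabled cycles.
    """
    if gated:
        active_cycles = sum(1 for en in enables if en)
    else:
        active_cycles = len(enables)
    return active_cycles * num_flops


random.seed(0)
# Hypothetical workload: the bank has new data on about 60% of cycles.
enables = [random.random() < 0.6 for _ in range(10_000)]

ungated = clock_events(enables, num_flops=32, gated=False)
gated = clock_events(enables, num_flops=32, gated=True)
saving = 1 - gated / ungated

print(f"clock events without gating: {ungated}")
print(f"clock events with gating:    {gated}")
print(f"fraction saved:              {saving:.1%}")
```

The saving tracks the fraction of idle cycles, so gating pays off most on blocks that are idle for long stretches.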
Technique 2: Voltage Scaling
Dynamic power falls with V², but a lower voltage also forces a slower clock, so with f scaled down as well, power falls roughly as V² × f
| Voltage | Dynamic Power (relative, ∝ V² × f) | Max Frequency | Use Case |
|---|---|---|---|
| 1.0V (nominal) | 1.0× | 2.0 GHz | Peak performance |
| 0.9V | 0.73× | 1.8 GHz | Typical |
| 0.8V | 0.48× | 1.5 GHz | Low power |
| 0.6V (near threshold) | 0.09× | 0.5 GHz | Battery mode |
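The relative dynamic power at each voltage/frequency pair can be derived directly from α × C × V² × f. The sketch below assumes α and C stay constant across operating points and ignores leakage, so it is pure V² × f scaling against the 1.0V, 2.0 GHz nominal point. Real silicon will deviate from this.

```python
NOMINAL_V = 1.0       # volts
NOMINAL_F_GHZ = 2.0   # GHz


def relative_dynamic_power(v, f_ghz):
    """Dynamic power relative to nominal, assuming alpha and C are unchanged."""
    return (v / NOMINAL_V) ** 2 * (f_ghz / NOMINAL_F_GHZ)


# (voltage, frequency) pairs taken from the table above.
operating_points = [(1.0, 2.0), (0.9, 1.8), (0.8, 1.5), (0.6, 0.5)]

for v, f in operating_points:
    p = relative_dynamic_power(v, f)
    print(f"{v:.1f} V @ {f:.1f} GHz -> {p:.2f}x dynamic power")
```

Note that energy per operation scales only with V², so running slower at the same voltage saves power but not energy per task. The voltage drop is what makes the work itself cheaper.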
Apple A17 Approach
Dynamic voltage and frequency scaling (DVFS), with illustrative operating points:
- Running heavy inference: 1.0V @ 2.0 GHz → 2W
- Running light task: 0.7V @ 0.5 GHz → 100 mW
- Idle (blocks power-gated off): <1 mW
A controller monitors the workload and adjusts V/f about 100 times per second.
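A DVFS controller of this kind can be sketched as a small governor that picks the slowest operating point still able to cover the current demand. Everything here is an assumption made for illustration: the `OPERATING_POINTS` table reuses the voltage/frequency pairs from the voltage scaling table, and the 20% headroom and the demand values are invented. It is not a description of Apple's actual controller.

```python
# (voltage in V, frequency in GHz), sorted from slowest to fastest.
# Hypothetical table, reusing the pairs from the voltage scaling section.
OPERATING_POINTS = [(0.6, 0.5), (0.8, 1.5), (0.9, 1.8), (1.0, 2.0)]


def pick_operating_point(demand_ghz, headroom=1.2):
    """Return the slowest (V, f) point whose frequency covers demand plus headroom.

    demand_ghz: estimated cycles per nanosecond the workload currently needs.
    headroom:   safety margin so short bursts do not miss deadlines.
    """
    target = demand_ghz * headroom
    for v, f in OPERATING_POINTS:
        if f >= target:
            return v, f
    # Demand exceeds every point: run flat out at the fastest one.
    return OPERATING_POINTS[-1]


# One decision per control interval (e.g. every 10 ms for ~100 updates/s).
for demand in [0.1, 0.4, 1.0, 1.6, 2.5]:
    v, f = pick_operating_point(demand)
    print(f"demand {demand:.1f} GHz -> run at {v:.1f} V, {f:.1f} GHz")
```

Choosing the slowest sufficient point, rather than racing to finish and idling, is one common policy. Which is better in practice depends on how much leakage and wake-up overhead the idle state carries.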
Technique 3: Precision Reduction
Lower-precision arithmetic uses smaller multipliers → less power
| Precision | Multiplier Area (∝ bits²) | Energy/MAC | Delay |
|---|---|---|---|
| INT8 | 64 gates | 1.0 pJ | 1.0 ns |
| INT16 | 256 gates | 2.5 pJ | 1.2 ns |
| INT32 | 1024 gates | 5.0 pJ | 1.5 ns |
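To see what the per-MAC figures mean for a whole layer, the sketch below multiplies them by a MAC count. The per-MAC energies come from the table above. The 1024 × 1024 fully connected layer is a hypothetical example, and memory traffic, which often dominates in practice, is ignored.

```python
# Energy per multiply-accumulate in picojoules, from the table above.
ENERGY_PER_MAC_PJ = {"INT8": 1.0, "INT16": 2.5, "INT32": 5.0}


def layer_energy_uj(num_macs, precision):
    """Arithmetic energy for a layer in microjoules (1 pJ = 1e-6 uJ)."""
    return num_macs * ENERGY_PER_MAC_PJ[precision] * 1e-6


# Hypothetical fully connected layer: 1024 inputs x 1024 outputs.
macs = 1024 * 1024

for precision in ("INT8", "INT16", "INT32"):
    energy = layer_energy_uj(macs, precision)
    ratio = energy / layer_energy_uj(macs, "INT8")
    print(f"{precision}: {energy:.2f} uJ ({ratio:.1f}x INT8)")
```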
Technique 4: Power Gating
Turn off entire blocks when not needed
- Systolic array: Can gate unused tiles if batch size < 256
- Mobile NPU: Entire unit off when no inference needed
- Cost: Retention registers + slow wake-up (tens of μs)
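Because gating has a cost, it only pays off when a block stays idle long enough. A minimal break-even sketch follows, assuming a simple model in which each off/on transition costs a fixed energy and the block otherwise leaks at a constant rate while idle. The function names and all numbers (50 mW leakage, 5 μJ per transition, 20 μs wake-up) are hypothetical.

```python
def break_even_idle_s(leakage_w, transition_energy_j):
    """Idle time at which leakage saved equals the energy spent gating."""
    return transition_energy_j / leakage_w


def should_power_gate(expected_idle_s, leakage_w, transition_energy_j,
                      wakeup_latency_s, latency_budget_s):
    """Gate only if wake-up is tolerable and the idle period beats break-even."""
    if wakeup_latency_s > latency_budget_s:
        return False
    return expected_idle_s > break_even_idle_s(leakage_w, transition_energy_j)


# Hypothetical block: 50 mW leakage, 5 uJ to save state, switch off and restore.
LEAKAGE_W = 0.05
TRANSITION_J = 5e-6
WAKEUP_S = 20e-6

threshold = break_even_idle_s(LEAKAGE_W, TRANSITION_J)
print(f"break-even idle time: {threshold * 1e6:.0f} us")

for idle_s in (50e-6, 1e-3):
    decision = should_power_gate(idle_s, LEAKAGE_W, TRANSITION_J,
                                 WAKEUP_S, latency_budget_s=100e-6)
    print(f"idle for {idle_s * 1e6:.0f} us -> gate: {decision}")
```

This is why power gating suits coarse blocks with long, predictable idle periods, while clock gating handles cycle-by-cycle idleness.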
Real Example: Google TPU v4 Power
TPU v4 (~200 W sustained, approximate breakdown):
- Compute (systolic): 100W (50%)
- Memory (HBM): 50W (25%)
- Interconnect (NoC): 30W (15%)
- Control/misc: 20W (10%)
Optimizations:
- Clock gating: Save 20W (reduce activity from 70% to 50%)
- Voltage scaling: Save 10W (lower compute voltage by 0.1V)
- Sparsity: Skip zero multiplies (save 15W on sparse workloads)
Result: 200W → 155W possible (22.5% reduction), assuming the three savings add up independently and the workload is sparse
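The arithmetic behind this budget can be checked in a few lines. The figures are the ones listed above. Treating the three savings as simply additive is an assumption of this sketch, since in practice they overlap (for example, voltage scaling shrinks the power that clock gating would otherwise save).

```python
# Power budget in watts, from the breakdown above.
budget_w = {
    "compute (systolic)": 100,
    "memory (HBM)": 50,
    "interconnect (NoC)": 30,
    "control/misc": 20,
}

# Claimed savings in watts, treated as independent and additive.
savings_w = {
    "clock gating": 20,
    "voltage scaling": 10,
    "sparsity": 15,
}

total_w = sum(budget_w.values())
saved_w = sum(savings_w.values())
optimized_w = total_w - saved_w
reduction = saved_w / total_w

for name, watts in budget_w.items():
    print(f"{name:20s} {watts:4d} W ({watts / total_w:.0%})")
print(f"total {total_w} W -> optimized {optimized_w} W ({reduction:.1%} reduction)")
```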
Day 29: Area & cost reduction: how to shrink silicon and manufacturing cost.