Power Optimization for Edge AI

Why FPGAs win at the edge. Dynamic vs static power, clock gating, partial reconfiguration, DVFS, precision reduction — and the techniques that bring a real CNN accelerator under 5W.

By EcrioniX Engineering Team · Published June 16, 2026 · ~4,600 words · 14 min read

1. Why Power Is the Edge AI Battle

At the edge, power is the constraint that defines everything. A drone, a security camera, a wearable, a car's perception unit — none can dissipate 400W like a datacenter GPU. They run on batteries or passive cooling, with budgets of 1–25W. This is precisely where FPGAs beat GPUs (recall Day 1: Kria KV260 at 80 FPS/W vs A100 at 13 FPS/W). But that efficiency isn't free — you design for it.

2. Dynamic vs Static Power

FPGA power splits into two fundamentally different parts. Knowing which dominates your design tells you which optimizations matter.

Total Power = Dynamic + Static

DYNAMIC POWER (from switching):
  P_dyn = α · C · V² · f
    α = activity factor (fraction of nodes toggling)
    C = switched capacitance
    V = supply voltage
    f = clock frequency
  → dominates DURING active inference
  → V² term means voltage scaling is hugely effective

STATIC POWER (leakage):
  P_static = V · I_leakage
    leakage flows even when idle; set by device, node, temperature
  → dominates for ALWAYS-ON, mostly-idle edge devices
  → only fixed by smaller device, power gating, or lower temp
Figure: Where the Power Goes (typical edge CNN accelerator). Dynamic (switching): active DSP/MAC switching 50%, BRAM 20%, clock 12%. Static (leakage): idle leakage. Annotation on the idle clock segment: idle logic wastes power if the clock keeps running.
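To see how the two terms compare, the formulas above can be evaluated directly. This is a minimal sketch: the activity factor, capacitance, and leakage current below are hypothetical values picked for illustration, not measurements from any device.

```python
def dynamic_power(alpha, capacitance_f, voltage_v, freq_hz):
    """Switching power in watts: P_dyn = alpha * C * V^2 * f."""
    return alpha * capacitance_f * voltage_v ** 2 * freq_hz


def static_power(voltage_v, leakage_a):
    """Leakage power in watts: P_static = V * I_leakage."""
    return voltage_v * leakage_a


# Hypothetical operating point, chosen only for illustration.
ALPHA = 0.15       # activity factor (fraction of nodes toggling)
CAP_F = 100e-9     # total switched capacitance, farads
VOLTAGE_V = 0.85   # core supply voltage
FREQ_HZ = 300e6    # clock frequency
LEAKAGE_A = 0.6    # leakage current, amps

p_dyn = dynamic_power(ALPHA, CAP_F, VOLTAGE_V, FREQ_HZ)
p_static = static_power(VOLTAGE_V, LEAKAGE_A)
dynamic_share = p_dyn / (p_dyn + p_static)

print(f"dynamic {p_dyn:.2f} W, static {p_static:.2f} W, "
      f"dynamic share {dynamic_share:.0%}")
```

With these assumed numbers the dynamic term dominates, which is the case where clock gating and voltage scaling pay off most. For a mostly idle device, lowering `ALPHA` toward zero shows leakage taking over.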

3. Clock Gating — Stop Switching What You Don't Use

Clock gating is often the highest-impact dynamic-power technique. Since P_dyn ∝ activity × frequency, freezing the clock to idle logic drops its switching power to near zero. On bursty edge workloads (process a frame, then wait), gating the accelerator between frames removes nearly all of its dynamic power during the wait, so the longer the idle gap, the larger the saving in average power.

Figure: Clock Gating with BUFGCE. clk and enable feed a BUFGCE, whose gated clk output drives the MAC array (frozen when enable=0). With enable=0, no clock edges reach the array → ~0 dynamic power. BUFGCE is a dedicated global-clock buffer with a clock enable, so the gating is glitch-free.
clock gating (HLS & RTL)
// --- HLS: idling is largely automatic ---
// If a DATAFLOW region or function has no valid input, HLS-generated
// logic naturally idles. Use ap_ctrl_chain so stages stall (no data
// toggling) when starved. Power drops without any manual gating, but
// the clock network itself keeps toggling unless it is gated as below.

// --- RTL: explicit global clock gating with BUFGCE ---
BUFGCE clk_gate (
    .I  (clk),           // source clock
    .CE (accel_active),  // 1 = run, 0 = freeze the whole accelerator
    .O  (clk_gated)      // drives the MAC array's clock
);
// accel_active is driven low between frames / when input FIFO empty.
//
// Fine-grained: gate per-layer engines independently so only the
// layer currently processing toggles. Idle layers cost only leakage.
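The effect of gating on a bursty workload can be estimated with a simple duty-cycle model. This is a rough sketch, not a measured result: the active power, idle clocking power, static power, and duty cycle are all hypothetical inputs, and the model assumes gating removes all dynamic power while idle.

```python
def average_power(p_dyn_active, p_dyn_idle, p_static, duty_cycle):
    """Average power in watts for an accelerator that is busy for
    `duty_cycle` of the time and idle for the rest.

    p_dyn_idle is the dynamic power burned while idle: nonzero when the
    clock keeps running, roughly zero when the clock is gated.
    """
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    idle = 1.0 - duty_cycle
    return p_static + duty_cycle * p_dyn_active + idle * p_dyn_idle


# Hypothetical numbers for illustration only.
P_DYN_ACTIVE = 3.0   # W while processing a frame
P_DYN_IDLE = 1.0     # W of clock/idle switching if the clock is not gated
P_STATIC = 0.5       # W of leakage, paid all the time
DUTY = 0.3           # busy 30% of each frame period

ungated = average_power(P_DYN_ACTIVE, P_DYN_IDLE, P_STATIC, DUTY)
gated = average_power(P_DYN_ACTIVE, 0.0, P_STATIC, DUTY)
saving = 1.0 - gated / ungated

print(f"ungated {ungated:.2f} W, gated {gated:.2f} W, saving {saving:.0%}")
```

The saving grows as the duty cycle shrinks, which is why gating matters most for devices that process a frame and then wait. Leakage is unaffected; removing it requires power gating.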

4. Voltage & Frequency Scaling (DVFS)

The V² term makes voltage scaling the most powerful lever. Because dynamic power scales with the square of voltage, even a modest voltage drop yields large savings — and lowering frequency lets you lower voltage safely.

DVFS power scaling: P_dyn ∝ V² · f

Example: drop V from 0.85V → 0.72V and f from 300 → 200 MHz
  Voltage factor:         (0.72/0.85)² = 0.72×
  Frequency factor:       200/300 = 0.67×
  Combined dynamic power: 0.72 × 0.67 = 0.48× → ~52% saving!

Trade-off: lower f → lower throughput (FPS). Use when the workload doesn't need peak speed (e.g. 15 FPS camera vs 60 FPS).

Edge strategy: match performance to demand
  Burst mode: full V/f for the active frame, then gate off
  Sustained:  drop to the lowest V/f that still meets the FPS target
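The scaling arithmetic and the "sustained" strategy can be sketched in a few lines. The operating-point table and the peak FPS figure are hypothetical, and the sketch assumes FPS scales linearly with clock frequency, which holds only for compute-bound designs.

```python
def dvfs_dynamic_scale(v_old, v_new, f_old, f_new):
    """Ratio of new to old dynamic power under P_dyn ∝ V^2 * f."""
    return (v_new / v_old) ** 2 * (f_new / f_old)


def lowest_point_meeting(target_fps, fps_at_max_freq, points):
    """Return the lowest-frequency (volts, mhz) point that still meets
    target_fps, or None if even the fastest point is too slow.

    Assumes FPS is proportional to clock frequency.
    """
    f_max = max(mhz for _, mhz in points)
    for volts, mhz in sorted(points, key=lambda p: p[1]):
        if fps_at_max_freq * mhz / f_max >= target_fps:
            return (volts, mhz)
    return None


# The example from the text: 0.85 V / 300 MHz down to 0.72 V / 200 MHz.
scale = dvfs_dynamic_scale(0.85, 0.72, 300, 200)
print(f"dynamic power scale: {scale:.2f}x")

# Hypothetical operating points as (volts, MHz) and a hypothetical peak FPS.
OPERATING_POINTS = [(0.85, 300), (0.80, 250), (0.72, 200)]
FPS_AT_300_MHZ = 45

choice = lowest_point_meeting(30, FPS_AT_300_MHZ, OPERATING_POINTS)
print(f"lowest point for 30 FPS: {choice}")
```

Picking the operating point from the FPS target, rather than always running at peak, is what turns the V² term into a practical saving.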

5. Partial Reconfiguration — Time-Share the Fabric

Partial reconfiguration (PR) reprograms a region of the FPGA at runtime while the rest keeps running. For power and area, it lets a small, low-power device run a model bigger than would otherwise fit — by swapping layer hardware in and out.

Figure: Partial Reconfiguration, Layer Time-Sharing. Static region: CPU, DMA, control (always running). Reconfigurable region (one at a time): Conv1 HW at t = 0, Conv2 HW at t = 1 (reload), FC HW at t = 2. One layer's hardware in the fabric at a time → big model on a small, low-power device.

PR Has a Latency Cost

Reconfiguring a region takes time (microseconds to milliseconds depending on size). That overhead is paid between layers, so PR suits throughput-relaxed edge apps — not ultra-low-latency ones. The win is fitting a large model on cheap silicon with a tiny power envelope, not raw speed.
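A back-of-the-envelope model makes the latency cost concrete. This is a sketch under stated assumptions: the partial bitstream sizes, per-layer compute times, and configuration throughput are hypothetical, and reconfiguration time is approximated as bitstream size divided by throughput with no overlap between loading and computing.

```python
def reconfig_time_s(bitstream_bytes, config_bytes_per_s):
    """Time to load one partial bitstream, ignoring setup overhead."""
    return bitstream_bytes / config_bytes_per_s


def frame_latency_s(layers, config_bytes_per_s):
    """Latency of one frame when every layer is swapped in before it runs.

    `layers` is a list of (bitstream_bytes, compute_seconds) tuples.
    Returns (total_seconds, seconds_spent_reconfiguring).
    """
    reconfig = sum(reconfig_time_s(size, config_bytes_per_s)
                   for size, _ in layers)
    compute = sum(t for _, t in layers)
    return reconfig + compute, reconfig


# Hypothetical values for illustration only.
CONFIG_BYTES_PER_S = 400e6          # assumed configuration throughput
LAYERS = [
    (2_000_000, 0.004),             # Conv1: 2 MB bitstream, 4 ms compute
    (2_000_000, 0.004),             # Conv2
    (2_000_000, 0.002),             # FC
]

total_s, reconfig_s = frame_latency_s(LAYERS, CONFIG_BYTES_PER_S)
max_fps = 1.0 / total_s

print(f"frame latency {total_s * 1e3:.1f} ms, "
      f"of which reconfiguration {reconfig_s * 1e3:.1f} ms, "
      f"max {max_fps:.0f} FPS")
```

With these assumed numbers, reloading costs more time than computing, which illustrates why PR fits throughput-relaxed applications rather than low-latency ones.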

6. Precision Reduction = Power Reduction

Lower precision isn't just about memory (Day 2) — it directly cuts power. Smaller multipliers switch less capacitance, and INT4/INT8 packs more MACs per DSP, so you hit the same throughput at lower frequency/voltage.

| Precision | Relative MAC Energy | DSP Packing | Accuracy Impact |
|---|---|---|---|
| FP32 | 1.0× (baseline) | 0.25 MAC/DSP | Reference |
| FP16 | ~0.4× | 0.5 MAC/DSP | <0.1% |
| INT8 | ~0.15× | 2 MAC/DSP | ~0.5–1% |
| INT4 | ~0.07× | 4 MAC/DSP | 1–3% |
| Binary/Ternary | ~0.02× | many/DSP or LUT | 5–15% |
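The table's packing and energy figures can be turned into a quick estimate of the clock frequency needed for a fixed throughput. The sketch below uses the table's approximate values as given; the DSP count and throughput target are hypothetical, and it assumes every DSP is fully utilized each cycle.

```python
# Approximate values taken from the table above.
MACS_PER_DSP = {"FP32": 0.25, "FP16": 0.5, "INT8": 2.0, "INT4": 4.0}
RELATIVE_MAC_ENERGY = {"FP32": 1.0, "FP16": 0.4, "INT8": 0.15, "INT4": 0.07}


def required_freq_hz(target_macs_per_s, num_dsps, precision):
    """Clock frequency needed to sustain target_macs_per_s, assuming
    every DSP delivers MACS_PER_DSP[precision] MACs each cycle."""
    return target_macs_per_s / (num_dsps * MACS_PER_DSP[precision])


def relative_compute_energy(precision):
    """MAC energy per inference relative to FP32 for the same MAC count."""
    return RELATIVE_MAC_ENERGY[precision] / RELATIVE_MAC_ENERGY["FP32"]


# Hypothetical design point for illustration only.
TARGET_MACS_PER_S = 100e9
NUM_DSPS = 1000

for precision in ("FP32", "FP16", "INT8", "INT4"):
    freq_mhz = required_freq_hz(TARGET_MACS_PER_S, NUM_DSPS, precision) / 1e6
    energy = relative_compute_energy(precision)
    print(f"{precision}: {freq_mhz:.0f} MHz, {energy:.2f}x MAC energy")
```

The lower required frequency is what creates headroom for voltage scaling, so precision reduction and DVFS compound rather than merely add.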

7. Putting It Together — Under 5W

Edge AI power budget worked example (MobileNetV2, Kria-class):

  Start: FP32, 300MHz, no gating ............. ~14 W   ✗ (way over)
  + INT8 quantization (Day 2) ................ ~8 W    (smaller MACs)
  + Clock gating idle engines ................ ~6 W    (less switching)
  + DVFS to 0.72V / 200MHz (meets 30 FPS) .... ~4 W    (V²·f saving)
  + Power-gate unused blocks ................. ~3.5 W  ✓ under budget

Result: 30 FPS MobileNetV2 at 3.5 W → ~8.6 FPS/W
vs the same model on a mobile GPU at ~15 W → 2 FPS/W
→ ~4× better efficiency at the edge
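The efficiency figures follow directly from the budget numbers, as this small sketch shows. The wattages are the approximate values from the worked example; the one assumption added here is that the mobile GPU also runs at 30 FPS, which is what its 2 FPS/W figure at ~15 W implies.

```python
BUDGET_W = 5.0
FPS = 30

# (optimization step, approximate power in watts) from the worked example.
STEPS = [
    ("FP32, 300 MHz, no gating", 14.0),
    ("+ INT8 quantization", 8.0),
    ("+ clock gating idle engines", 6.0),
    ("+ DVFS to 0.72 V / 200 MHz", 4.0),
    ("+ power-gate unused blocks", 3.5),
]
GPU_POWER_W = 15.0  # mobile GPU, assumed to run the same model at 30 FPS


def fps_per_watt(fps, watts):
    """Efficiency metric used throughout the series."""
    return fps / watts


def first_step_under_budget(steps, budget_w):
    """Name of the first step whose power fits the budget, else None."""
    for name, watts in steps:
        if watts <= budget_w:
            return name
    return None


for name, watts in STEPS:
    status = "ok" if watts <= BUDGET_W else "over"
    print(f"{name:32s} {watts:5.1f} W  {status}")

fpga_eff = fps_per_watt(FPS, STEPS[-1][1])
gpu_eff = fps_per_watt(FPS, GPU_POWER_W)
print(f"FPGA {fpga_eff:.1f} FPS/W vs GPU {gpu_eff:.1f} FPS/W "
      f"({fpga_eff / gpu_eff:.1f}x)")
```

Note that the design first drops under the 5 W budget at the DVFS step; power gating then adds margin rather than being strictly required.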

This Is Why FPGAs Own the Edge

Stacking these techniques (INT8, clock gating, DVFS, power gating, and optionally partial reconfiguration) is how an FPGA delivers real-time vision in a battery-powered drone or a fanless camera. A GPU can scale voltage and frequency too, but its datapath can't be reshaped to fit the model, so its power floor is largely fixed. The FPGA's reconfigurability is its power advantage.

Day 12 — Key Takeaways

Next — Day 13: Transformer Attention on FPGA — the self-attention mechanism in hardware, QKV matrix multiplies, softmax, and multi-head parallelism for BERT/ViT inference.
