What are the sources of power consumption in an FPGA?

FPGA power has two parts: dynamic power (from signal switching, proportional to capacitance × voltage² × frequency × activity) and static power (leakage current that flows even when idle, set by the device, process node, and temperature). In AI accelerators, dynamic power dominates during active inference, while static power matters most for always-on edge devices that sit idle most of the time.

How does clock gating save power on an FPGA?

Clock gating stops the clock to inactive parts of the design, typically with BUFGCE clock buffers in RTL; in HLS designs, stalled stages give a similar but weaker effect because their registers stop toggling. Since dynamic power is proportional to switching activity, freezing the clock to idle MAC arrays or unused layers eliminates their switching power. For bursty edge workloads where the accelerator is idle between frames, clock gating can cut average power by 30–60%.

What is partial reconfiguration and how does it help power?

Partial reconfiguration (PR) reprograms part of the FPGA fabric at runtime while the rest keeps running. For power, it lets a small device time-share: load layer 1's hardware, run it, then reconfigure that region for layer 2. This fits a large model on a small low-power FPGA, and unused regions can be blanked to cut both dynamic and static power.

Power Optimization for Edge AI on FPGA — Clock Gating, DVFS, Partial Reconfig

1. Why Power Is the Edge AI Battle

At the edge, power is the constraint that defines everything. A drone, a security camera, a wearable, a car's perception unit — none can dissipate 400W like a datacenter GPU. They run on batteries or passive cooling, with budgets of 1–25W. This is precisely where FPGAs beat GPUs (recall Day 1: Kria KV260 at 80 FPS/W vs A100 at 13 FPS/W). But that efficiency isn't free — you design for it.

2. Dynamic vs Static Power

FPGA power splits into two fundamentally different parts. Knowing which dominates your design tells you which optimizations matter.

Total Power = Dynamic + Static

DYNAMIC POWER (from switching):
  P_dyn = α · C · V² · f
    α = activity factor (fraction of nodes toggling)
    C = switched capacitance
    V = supply voltage
    f = clock frequency
  → dominates DURING active inference
  → V² term means voltage scaling is hugely effective

STATIC POWER (leakage):
  P_static = V · I_leakage
  leakage flows even when idle; set by device, node, temperature
  → dominates for ALWAYS-ON, mostly-idle edge devices
  → only fixed by smaller device, power gating, or lower temp
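
As a rough numerical sketch of the two formulas above, the following Python snippet evaluates P_dyn and P_static and reports which one dominates. Every input value here (activity factor, switched capacitance, leakage current) is a hypothetical placeholder chosen for illustration, not a measurement from any device.

```python
def dynamic_power(alpha, c_farads, v_volts, f_hz):
    """P_dyn = alpha * C * V^2 * f, in watts."""
    return alpha * c_farads * v_volts ** 2 * f_hz


def static_power(v_volts, i_leak_amps):
    """P_static = V * I_leakage, in watts."""
    return v_volts * i_leak_amps


def dominant_component(p_dyn, p_static):
    """Name the larger of the two contributions."""
    return "dynamic" if p_dyn >= p_static else "static"


if __name__ == "__main__":
    # Hypothetical illustrative values (assumptions, not device data).
    alpha = 0.15      # fraction of nodes toggling per cycle
    c = 20e-9         # total switched capacitance in farads
    v = 0.85          # core supply voltage in volts
    f = 300e6         # clock frequency in hertz
    i_leak = 0.6      # leakage current in amperes

    p_dyn = dynamic_power(alpha, c, v, f)
    p_stat = static_power(v, i_leak)
    print(f"P_dyn    = {p_dyn:.3f} W")
    print(f"P_static = {p_stat:.3f} W")
    print(f"dominant = {dominant_component(p_dyn, p_stat)}")
```

Setting `f` to zero (a fully gated clock) leaves only the static term, which is why leakage sets the floor for always-on devices.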

[Figure: Where the power goes (typical edge CNN accelerator)]

3. Clock Gating — Stop Switching What You Don't Use

Clock gating is usually the highest-impact dynamic-power technique for bursty workloads. Since P_dyn ∝ activity × frequency, freezing the clock to idle logic drops its switching power to near zero. On bursty edge workloads (process a frame, then wait), gating the accelerator between frames removes nearly all of its dynamic power during the idle time, leaving only leakage.

[Figure: Clock gating with BUFGCE]

clock gating (HLS & RTL)

// --- HLS: stalled stages stop toggling, but the clock is not gated ---
// HLS-generated RTL typically does not instantiate clock-gating buffers.
// With ap_ctrl_chain, a DATAFLOW stage that is starved or back-pressured
// stalls and its registers hold their values, so data-path toggling drops.
// The clock network itself keeps switching until it is gated explicitly.

// --- RTL: explicit global clock gating with BUFGCE ---
BUFGCE clk_gate (
  .I  (clk),           // source clock
  .CE (accel_active),  // 1 = run, 0 = freeze the whole accelerator
  .O  (clk_gated)      // drives the MAC array's clock
);
// accel_active is driven low between frames / when the input FIFO is empty.
// Drive it from a register in the clk domain so CE changes cleanly.
//
// Fine-grained: gate per-layer engines independently so only the
// layer currently processing toggles. Idle layers cost only leakage.
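
To get a feel for how much gating between frames is worth, the sketch below models average power as a duty-cycle-weighted mix of active and idle power. The model and all numbers are assumptions for illustration: `idle_dyn_fraction` is a hypothetical parameter for how much dynamic power an ungated accelerator still burns while idle (clock tree and residual toggling), and it is set to zero for the gated case.

```python
def average_power(p_dyn_active, p_static, duty, idle_dyn_fraction):
    """Average power in watts for a bursty workload.

    duty              -- fraction of time the accelerator is processing
    idle_dyn_fraction -- share of active dynamic power still drawn when idle
                         (0.0 models a fully gated clock)
    """
    active = p_dyn_active + p_static
    idle = idle_dyn_fraction * p_dyn_active + p_static
    return duty * active + (1.0 - duty) * idle


if __name__ == "__main__":
    # Hypothetical numbers: 4 W dynamic, 1 W static, busy 30% of the time.
    p_dyn, p_stat, duty = 4.0, 1.0, 0.3

    ungated = average_power(p_dyn, p_stat, duty, idle_dyn_fraction=0.5)
    gated = average_power(p_dyn, p_stat, duty, idle_dyn_fraction=0.0)
    saving = 1.0 - gated / ungated

    print(f"ungated: {ungated:.2f} W")
    print(f"gated:   {gated:.2f} W")
    print(f"saving:  {saving:.0%}")
```

The saving grows as the duty cycle falls, and it can never go below the static term, which gating does not touch.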

4. Voltage & Frequency Scaling (DVFS)

The V² term makes voltage scaling the most powerful lever. Because dynamic power scales with the square of voltage, even a modest voltage drop yields large savings — and lowering frequency lets you lower voltage safely.

DVFS power scaling: P_dyn ∝ V² · f

Example: drop V from 0.85V → 0.72V and f from 300 → 200 MHz
  Voltage factor:         (0.72/0.85)² = 0.72×
  Frequency factor:       200/300      = 0.67×
  Combined dynamic power: 0.72 × 0.67  = 0.48×  → ~52% saving!

Trade-off: lower f → lower throughput (FPS). Use when the workload
doesn't need peak speed (e.g. 15 FPS camera vs 60 FPS).

Edge strategy: match performance to demand
  Burst mode: full V/f for the active frame, then gate off
  Sustained:  drop to the lowest V/f that still meets the FPS target
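
The scaling arithmetic above, and the "lowest V/f that still meets the FPS target" rule, can be sketched in a few lines of Python. The two operating points come from the example; the 45 FPS figure at 300 MHz and the assumption that FPS scales linearly with clock frequency are hypothetical simplifications.

```python
def dvfs_dynamic_scale(v_old, v_new, f_old, f_new):
    """Ratio of new to old dynamic power under P_dyn ∝ V^2 * f."""
    return (v_new / v_old) ** 2 * (f_new / f_old)


def lowest_point_meeting_fps(points, fps_at_ref, f_ref_mhz, fps_target):
    """Pick the lowest-power (volts, mhz) point that still meets the target.

    Assumes FPS scales linearly with clock frequency. Returns None if no
    point is fast enough.
    """
    feasible = [
        (v, f) for v, f in points
        if fps_at_ref * f / f_ref_mhz >= fps_target
    ]
    if not feasible:
        return None
    # Rank by the V^2 * f proxy for dynamic power.
    return min(feasible, key=lambda p: p[0] ** 2 * p[1])


if __name__ == "__main__":
    scale = dvfs_dynamic_scale(0.85, 0.72, 300, 200)
    print(f"dynamic power scale: {scale:.2f}x")

    points = [(0.85, 300), (0.72, 200)]          # (volts, MHz)
    choice = lowest_point_meeting_fps(points, fps_at_ref=45,
                                      f_ref_mhz=300, fps_target=30)
    print(f"operating point for 30 FPS: {choice}")
```

In practice the usable voltage points are limited to what the device's speed and voltage grade supports, so the candidate list would be short.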

5. Partial Reconfiguration — Time-Share the Fabric

Partial reconfiguration (PR) reprograms a region of the FPGA at runtime while the rest keeps running. For power and area, it lets a small, low-power device run a model bigger than would otherwise fit — by swapping layer hardware in and out.

[Figure: Partial reconfiguration, layer time-sharing]

PR Has a Latency Cost

Reconfiguring a region takes time (microseconds to milliseconds depending on size). That overhead is paid between layers, so PR suits throughput-relaxed edge apps — not ultra-low-latency ones. The win is fitting a large model on cheap silicon with a tiny power envelope, not raw speed.
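
A back-of-the-envelope way to check whether PR fits a frame budget is to estimate reconfiguration time as partial bitstream size divided by configuration bandwidth, then add one such delay per swap. The sketch below does that; the 1 MB bitstream, the 400 MB/s bandwidth (for instance a 32-bit configuration port clocked at 100 MHz), and the per-layer compute times are all assumed values for illustration.

```python
def reconfig_time_s(bitstream_bytes, bandwidth_bytes_per_s):
    """Time to load one partial bitstream, ignoring controller overhead."""
    return bitstream_bytes / bandwidth_bytes_per_s


def frame_time_s(layer_compute_s, swaps_per_frame, t_reconfig_s):
    """Total time per frame: compute for every layer plus every swap."""
    return sum(layer_compute_s) + swaps_per_frame * t_reconfig_s


if __name__ == "__main__":
    # Hypothetical inputs.
    t_pr = reconfig_time_s(1_000_000, 400e6)   # 1 MB at 400 MB/s
    layers = [0.004, 0.006, 0.005]             # seconds per layer group

    with_pr = frame_time_s(layers, swaps_per_frame=3, t_reconfig_s=t_pr)
    without_pr = frame_time_s(layers, swaps_per_frame=0, t_reconfig_s=t_pr)

    print(f"reconfig time per swap: {t_pr * 1e3:.2f} ms")
    print(f"FPS with PR:    {1.0 / with_pr:.1f}")
    print(f"FPS without PR: {1.0 / without_pr:.1f}")
```

If the resulting frame rate still meets the application's target, the smaller device is viable; if not, fewer and larger reconfigurable regions reduce the number of swaps.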

6. Precision Reduction = Power Reduction

Lower precision isn't just about memory (Day 2) — it directly cuts power. Smaller multipliers switch less capacitance, and INT4/INT8 packs more MACs per DSP, so you hit the same throughput at lower frequency/voltage.

Precision	Relative MAC Energy (vs FP32)	DSP Packing	Accuracy Loss
FP32	1.0× (baseline)	0.25 MAC/DSP	Reference
FP16	~0.4×	0.5 MAC/DSP	<0.1%
INT8	~0.15×	2 MAC/DSP	~0.5–1%
INT4	~0.07×	4 MAC/DSP	1–3%
Binary/Ternary	~0.02×	many/DSP or LUT	5–15%
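
The table's two numeric columns can be turned into a quick estimate of energy per frame and of the clock needed to hit a frame rate. This is only a sketch: the relative energies and packing factors are copied from the table, while the 5 pJ FP32 MAC energy, the roughly 300 million MACs per frame, the 1000-DSP budget, and the assumption of perfect DSP utilization are hypothetical.

```python
# Values copied from the table above.
REL_MAC_ENERGY = {"FP32": 1.0, "FP16": 0.4, "INT8": 0.15, "INT4": 0.07}
MACS_PER_DSP = {"FP32": 0.25, "FP16": 0.5, "INT8": 2.0, "INT4": 4.0}


def mac_energy_per_frame_j(precision, macs_per_frame, fp32_mac_energy_j):
    """Energy spent on MACs for one frame at the given precision."""
    return macs_per_frame * REL_MAC_ENERGY[precision] * fp32_mac_energy_j


def required_clock_hz(precision, macs_per_frame, fps, num_dsps):
    """Clock needed to sustain `fps`, assuming every DSP is busy every cycle."""
    macs_per_cycle = num_dsps * MACS_PER_DSP[precision]
    return macs_per_frame * fps / macs_per_cycle


if __name__ == "__main__":
    macs = 300e6          # assumed MACs per frame
    e_fp32 = 5e-12        # assumed joules per FP32 MAC
    for p in ("FP32", "FP16", "INT8", "INT4"):
        energy_mj = mac_energy_per_frame_j(p, macs, e_fp32) * 1e3
        clock_mhz = required_clock_hz(p, macs, fps=30, num_dsps=1000) / 1e6
        print(f"{p:5s} {energy_mj:6.3f} mJ/frame  {clock_mhz:6.2f} MHz")
```

The required clock falls by the same factor that packing rises, which is what makes room for the lower voltage and frequency settings.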

7. Putting It Together — Under 5W

Edge AI power budget worked example (MobileNetV2, Kria-class):

  Start: FP32, 300MHz, no gating ............. ~14 W  ✗ (way over)
  + INT8 quantization (Day 2) ................ ~8 W   (smaller MACs)
  + Clock gating idle engines ................ ~6 W   (less switching)
  + DVFS to 0.72V / 200MHz (meets 30 FPS) .... ~4 W   (V²·f saving)
  + Power-gate unused blocks ................. ~3.5 W ✓ under budget

Result: 30 FPS MobileNetV2 at 3.5 W → ~8.5 FPS/W
vs the same model on a mobile GPU at ~15 W → 2 FPS/W
→ 4× better efficiency at the edge
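
The efficiency figures in this worked example follow from simple division, which the sketch below reproduces along with the saving contributed by each step. The power values are the approximate ones quoted above; the helper names are illustrative.

```python
def fps_per_watt(fps, watts):
    """Efficiency in frames per second per watt."""
    return fps / watts


def step_savings(steps):
    """Yield (label, watts, fractional saving vs the previous step)."""
    previous = None
    for label, watts in steps:
        saving = 0.0 if previous is None else 1.0 - watts / previous
        yield label, watts, saving
        previous = watts


if __name__ == "__main__":
    steps = [
        ("FP32, 300 MHz, no gating", 14.0),
        ("+ INT8 quantization", 8.0),
        ("+ clock gating idle engines", 6.0),
        ("+ DVFS to 0.72 V / 200 MHz", 4.0),
        ("+ power-gate unused blocks", 3.5),
    ]
    for label, watts, saving in step_savings(steps):
        print(f"{label:30s} {watts:5.1f} W  ({saving:.0%} vs previous)")

    fpga = fps_per_watt(30, 3.5)
    gpu = fps_per_watt(30, 15.0)
    print(f"FPGA: {fpga:.1f} FPS/W, GPU: {gpu:.1f} FPS/W, "
          f"ratio: {fpga / gpu:.1f}x")
```

The largest single step in this budget is the move to INT8, which is consistent with precision reduction being applied before the other techniques.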

This Is Why FPGAs Own the Edge

Stacking these techniques — INT8, clock gating, DVFS, power gating, optional partial reconfig — is how an FPGA delivers real-time vision in a battery-powered drone or a fanless camera. A GPU can't be reshaped this way; its power floor is fixed. The FPGA's reconfigurability is its power advantage.

Day 12 — Key Takeaways

✅ Power = dynamic (α·C·V²·f) + static (leakage) — know which dominates
✅ Clock gating (BUFGCE) freezes idle logic → biggest dynamic-power win for bursty edge
✅ DVFS: the V² term makes voltage scaling huge — ~50% saving for a modest V/f drop
✅ Partial reconfiguration time-shares the fabric → big model on small low-power device
✅ Precision reduction cuts MAC energy directly: INT8 ~0.15× of FP32
✅ Stack them: INT8 + gating + DVFS + power gating → 14W down to ~3.5W
✅ The edge belongs to FPGAs because their reconfigurability is a power lever GPUs lack

Next — Day 13: Transformer Attention on FPGA — the self-attention mechanism in hardware, QKV matrix multiplies, softmax, and multi-head parallelism for BERT/ViT inference.
