Why FPGAs win at the edge. Dynamic vs static power, clock gating, partial reconfiguration, DVFS, precision reduction — and the techniques that bring a real CNN accelerator under 5W.
At the edge, power is the constraint that defines everything. A drone, a security camera, a wearable, a car's perception unit — none can dissipate 400W like a datacenter GPU. They run on batteries or passive cooling, with budgets of 1–25W. This is precisely where FPGAs beat GPUs (recall Day 1: Kria KV260 at 80 FPS/W vs A100 at 13 FPS/W). But that efficiency isn't free — you design for it.
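FPGA power splits into two fundamentally different parts. Dynamic power is spent charging and discharging capacitance whenever signals switch: P_dyn ≈ α·C·V²·f, where α is switching activity, C is switched capacitance, V is supply voltage, and f is clock frequency. Static power is the leakage the device draws whenever it is powered, even with every clock stopped. Knowing which dominates your design tells you which optimizations matter.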
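A rough way to see which part dominates is to plug estimates into the standard CMOS switching-power relation P_dyn = α·C·V²·f and compare against leakage. The sketch below does that; every number in the example call is a hypothetical placeholder, not a measurement of any real device, and in practice the inputs would come from a vendor power estimator.

```python
def dynamic_power_w(alpha, c_eff_farads, vdd, freq_hz):
    """Switching power from the standard relation P_dyn = alpha * C * V^2 * f."""
    return alpha * c_eff_farads * vdd ** 2 * freq_hz


def power_breakdown(alpha, c_eff_farads, vdd, freq_hz, static_w):
    """Return dynamic, static, and total power plus the dynamic share."""
    p_dyn = dynamic_power_w(alpha, c_eff_farads, vdd, freq_hz)
    total = p_dyn + static_w
    return {
        "dynamic_w": p_dyn,
        "static_w": static_w,
        "total_w": total,
        "dynamic_fraction": p_dyn / total,
    }


# Hypothetical design point: 15% activity, 40 nF effective switched
# capacitance, 0.85 V core supply, 300 MHz clock, 0.6 W leakage.
example = power_breakdown(
    alpha=0.15, c_eff_farads=40e-9, vdd=0.85, freq_hz=300e6, static_w=0.6
)
print(example)
```

If the dynamic fraction comes out high, activity, frequency, and voltage reductions pay off most; if leakage dominates, only powering regions down or choosing a smaller device helps.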
Clock gating is the single highest-impact dynamic-power technique. Since P_dyn ∝ activity × frequency, freezing the clock to idle logic drops its switching power to near zero. On bursty edge workloads (process a frame, then wait), gating the accelerator between frames cuts average power roughly in proportion to the time it spends idle.
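The effect on a bursty workload can be estimated with a simple duty-cycle model. This is a minimal sketch: `idle_dyn_fraction` is a made-up parameter for how much of the active dynamic power idle logic still burns (high without gating because the clock tree and flops keep toggling, close to zero with gating), and the wattages are illustrative only.

```python
def average_power_w(active_dyn_w, static_w, duty_cycle, idle_dyn_fraction):
    """Average power of an accelerator that is busy for `duty_cycle` of the time.

    idle_dyn_fraction: share of active dynamic power still drawn while idle
    (hypothetical knob: large when ungated, near 0 when the clock is gated).
    """
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be in [0, 1]")
    if not 0.0 <= idle_dyn_fraction <= 1.0:
        raise ValueError("idle_dyn_fraction must be in [0, 1]")
    idle_dyn_w = active_dyn_w * idle_dyn_fraction
    avg_dyn_w = duty_cycle * active_dyn_w + (1.0 - duty_cycle) * idle_dyn_w
    return avg_dyn_w + static_w  # leakage is paid all the time


# Illustrative numbers: 3 W dynamic while processing, 0.5 W leakage,
# accelerator busy 30% of each frame period.
ungated = average_power_w(3.0, 0.5, duty_cycle=0.3, idle_dyn_fraction=0.4)
gated = average_power_w(3.0, 0.5, duty_cycle=0.3, idle_dyn_fraction=0.02)
print(f"ungated: {ungated:.3f} W, gated: {gated:.3f} W")
```

The lower the duty cycle, the more of the budget gating recovers; leakage is unaffected, which is why static power eventually becomes the floor.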
```verilog
// --- HLS: stalling reduces switching without manual gating ---
// If a DATAFLOW region or function has no valid input, HLS-generated
// logic naturally idles. Use ap_ctrl_chain so stages stall (datapath
// stops toggling) when starved. The clock network itself keeps
// switching unless it is gated explicitly, as in the RTL below.

// --- RTL: explicit global clock gating with BUFGCE ---
BUFGCE clk_gate (
    .I  (clk),           // source clock
    .CE (accel_active),  // 1 = run, 0 = freeze the whole accelerator
    .O  (clk_gated)      // drives the MAC array's clock
);
// accel_active is driven low between frames or when the input FIFO is empty.
//
// Fine-grained: gate per-layer engines independently so only the
// layer currently processing toggles. Idle layers cost only leakage.
```

Dynamic voltage and frequency scaling (DVFS) exploits the V² term in P_dyn ≈ α·C·V²·f, which makes voltage scaling the most powerful lever. Because dynamic power scales with the square of voltage, even a modest voltage drop yields large savings, and lowering frequency is what lets you lower voltage safely: a slower clock tolerates the longer gate delays that come with a reduced supply.
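The scaling can be made concrete with ratios. The sketch below compares two operating points under the relation P_dyn ∝ V²·f; the voltage and frequency pairs are hypothetical, and real devices only support the operating points their data sheet allows.

```python
def dynamic_power_ratio(v_old, f_old, v_new, f_new):
    """Ratio of new to old dynamic power, assuming P_dyn is proportional to V^2 * f."""
    return (v_new / v_old) ** 2 * (f_new / f_old)


def dynamic_energy_per_frame_ratio(v_old, v_new):
    """Ratio of dynamic energy per frame.

    A frame needs a fixed number of cycles, so running slower takes
    proportionally longer and the frequency term cancels: only V^2 remains.
    """
    return (v_new / v_old) ** 2


# Hypothetical operating points: 0.85 V @ 300 MHz scaled to 0.72 V @ 200 MHz.
power_ratio = dynamic_power_ratio(0.85, 300e6, 0.72, 200e6)
energy_ratio = dynamic_energy_per_frame_ratio(0.85, 0.72)
print(f"dynamic power:  {power_ratio:.2f}x of original")
print(f"energy / frame: {energy_ratio:.2f}x of original")
```

Note that lowering frequency alone reduces power but not dynamic energy per frame; the energy saving comes from the voltage reduction that the lower frequency makes possible.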
Partial reconfiguration (PR) reprograms a region of the FPGA at runtime while the rest keeps running. For power and area, it lets a small, low-power device run a model bigger than would otherwise fit — by swapping layer hardware in and out.
Reconfiguring a region takes time (microseconds to milliseconds depending on size). That overhead is paid between layers, so PR suits throughput-relaxed edge apps — not ultra-low-latency ones. The win is fitting a large model on cheap silicon with a tiny power envelope, not raw speed.
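Whether that overhead is acceptable can be checked with back-of-the-envelope arithmetic: reconfiguration time is roughly partial bitstream size divided by configuration-port bandwidth. The sketch below assumes a 400 MB/s port (about what a 32-bit configuration interface clocked at 100 MHz would deliver; check your device's data sheet) and uses hypothetical bitstream sizes and layer compute times.

```python
def reconfig_time_s(bitstream_bytes, config_bandwidth_bytes_per_s):
    """Approximate time to load one partial bitstream."""
    return bitstream_bytes / config_bandwidth_bytes_per_s


def frame_latency_s(layer_compute_s, bitstream_bytes, config_bandwidth_bytes_per_s):
    """Latency of one frame when every layer group is swapped in before it runs."""
    swap_s = reconfig_time_s(bitstream_bytes, config_bandwidth_bytes_per_s)
    return sum(swap_s + compute_s for compute_s in layer_compute_s)


ASSUMED_CONFIG_BW = 400e6  # bytes/s, assumption (32-bit port at 100 MHz)

# Hypothetical model: 4 layer groups, 12 ms compute each, 2 MB per partial bitstream.
swap = reconfig_time_s(2_000_000, ASSUMED_CONFIG_BW)
latency = frame_latency_s([0.012] * 4, 2_000_000, ASSUMED_CONFIG_BW)
print(f"swap: {swap * 1e3:.1f} ms, frame: {latency * 1e3:.1f} ms, "
      f"{1.0 / latency:.1f} FPS")
```

With these made-up numbers, swapping adds about 40% to the frame time, which is tolerable for a camera running at a relaxed frame rate but not for a tight control loop.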
Lower precision isn't just about memory (Day 2) — it directly cuts power. Smaller multipliers switch less capacitance, and INT4/INT8 packs more MACs per DSP, so you hit the same throughput at lower frequency/voltage.
| Precision | Relative MAC Energy (vs FP32) | DSP Packing | Typical Accuracy Loss |
|---|---|---|---|
| FP32 | 1.0× (baseline) | 0.25 MAC/DSP | Reference |
| FP16 | ~0.4× | 0.5 MAC/DSP | <0.1% |
| INT8 | ~0.15× | 2 MAC/DSP | ~0.5–1% |
| INT4 | ~0.07× | 4 MAC/DSP | 1–3% |
| Binary/Ternary | ~0.02× | many/DSP or LUT | 5–15% |
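The table's packing and energy columns can be turned into a quick sizing calculation: given a throughput target and a DSP count, which clock does each precision need, and how does MAC energy compare? This is a sketch using the approximate figures from the table above; the throughput target and DSP count are hypothetical, and Binary/Ternary is omitted because its packing is not a single number.

```python
# Approximate values copied from the table above.
PRECISIONS = {
    "FP32": {"rel_energy": 1.0, "macs_per_dsp": 0.25},
    "FP16": {"rel_energy": 0.4, "macs_per_dsp": 0.5},
    "INT8": {"rel_energy": 0.15, "macs_per_dsp": 2.0},
    "INT4": {"rel_energy": 0.07, "macs_per_dsp": 4.0},
}


def required_clock_mhz(target_gmacs, num_dsps, precision):
    """Clock needed to sustain `target_gmacs` (giga-MACs/s) on `num_dsps` DSPs."""
    macs_per_cycle = num_dsps * PRECISIONS[precision]["macs_per_dsp"]
    return target_gmacs * 1e9 / macs_per_cycle / 1e6


def relative_mac_energy(precision):
    """MAC energy relative to the FP32 baseline."""
    return PRECISIONS[precision]["rel_energy"]


# Hypothetical target: 500 GMAC/s on a device with 1200 DSP slices.
for name in PRECISIONS:
    mhz = required_clock_mhz(500, 1200, name)
    print(f"{name}: {mhz:8.1f} MHz, {relative_mac_energy(name):.2f}x MAC energy")
```

Under these assumptions the floating-point options would need clocks far beyond what the fabric can reach, while INT8 and INT4 land at frequencies low enough to leave room for voltage scaling.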
Stacking these techniques (INT8, clock gating, DVFS, power gating, and optionally partial reconfiguration) is how an FPGA delivers real-time vision in a battery-powered drone or a fanless camera. A GPU can scale its clocks and voltage, but its datapath can't be reshaped around the workload, so its power floor stays much higher. The FPGA's reconfigurability is its power advantage.
Next — Day 13: Transformer Attention on FPGA — the self-attention mechanism in hardware, QKV matrix multiplies, softmax, and multi-head parallelism for BERT/ViT inference.