
Benchmarking & Profiling

Numbers don't lie. Measure real TOPS, latency, throughput, and power; compute efficiency; apply the roofline model; use MLPerf methodology; and find the true bottleneck in your accelerator.

By EcrioniX Engineering Team · Published June 16, 2026 · ~4,600 words · 14 min read

1. Why Benchmarking Honestly Matters

"It's fast" is not a number. An accelerator is only as good as its measured performance under real conditions — and the field is full of misleading claims (peak TOPS that's never reached, latency without pre-processing, throughput with impossible batch sizes). This lesson teaches you to measure what's real, compare fairly, and find what's actually slowing you down.

2. The Four Core Metrics

The Metrics That Matter
⏱️ Latency (ms)
Time for ONE inference, input→output. Real-time systems live or die here.
🚀 Throughput (FPS)
Inferences per second. Batch/server workloads optimize this.
⚡ Power (W)
Wall power under load. The edge constraint (Day 12).
🎯 Efficiency (FPS/W or TOPS/W)
The true figure of merit — performance per watt.

3. Peak vs Effective TOPS

Peak TOPS is marketing; effective TOPS is reality. The gap between them — your utilization — is the most revealing number in the whole profile.

Peak TOPS (theoretical maximum):
  Peak = num_MACs × clock_freq × 2    (1 MAC = 2 ops: mult + add)
  Example: 4096 MACs @ 300 MHz = 4096 × 300e6 × 2 = 2.46 TOPS peak

Effective TOPS (what you actually achieve):
  Effective = total_ops_in_model / measured_runtime
  Example: ResNet-50 = 8.2 GOPs, runs in 5 ms = 8.2e9 / 5e-3 = 1.64 TOPS effective

Utilization = Effective / Peak = 1.64 / 2.46 = 67%
  → 33% of cycles wasted (stalls, memory waits, idle PEs)
  → THIS number tells you how much headroom remains
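The same arithmetic can be sketched in a few lines of Python. The function names are illustrative and not part of any toolkit; the inputs are the example figures from this section.

```python
def peak_tops(num_macs: int, clock_hz: float) -> float:
    """Theoretical peak in TOPS: each MAC counts as 2 ops (multiply + add)."""
    return num_macs * clock_hz * 2 / 1e12


def effective_tops(total_ops: float, runtime_s: float) -> float:
    """Achieved TOPS: operations in the model divided by measured runtime."""
    return total_ops / runtime_s / 1e12


def utilization(effective: float, peak: float) -> float:
    """Fraction of the theoretical peak that was actually achieved."""
    return effective / peak


peak = peak_tops(4096, 300e6)        # 4096 MACs at 300 MHz
eff = effective_tops(8.2e9, 5e-3)    # 8.2 GOPs model, 5 ms measured runtime
util = utilization(eff, peak)
print(f"peak {peak:.2f} TOPS, effective {eff:.2f} TOPS, utilization {util:.0%}")
```

Plugging in a different measured runtime shows how quickly utilization moves: the peak is fixed by the hardware, so every extra millisecond of stall lowers the effective figure directly.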

Beware "Peak TOPS" Claims

A vendor saying "10 TOPS!" is quoting peak — the number you'd get if every MAC fired every cycle with zero stalls, which never happens. Real workloads hit 40–80% utilization. Always ask for effective TOPS on a named model, or better, latency and FPS on a standard benchmark.

4. Measuring Latency Correctly

Latency measurement is full of traps. Where you start and stop the timer changes the answer by 2–5×. Be explicit about what's included.

Latency — What's In the Timer?
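[Figure: inference pipeline, left to right: Pre-process → Host→FPGA transfer → DPU/accelerator compute → FPGA→Host transfer → Post-process. "Compute latency" (optimistic) covers only the accelerator stage; "end-to-end latency" (what the user feels) covers all five stages. Report end-to-end.]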
benchmark.py — proper measurement
import time
import numpy as np

# Assumes run_end_to_end(image) is defined elsewhere and covers
# pre-processing + transfers + accelerator compute + post-processing.

# WARM UP: first runs include one-time setup; never measure them.
# Warm up the same path you are about to time.
for _ in range(20):
    run_end_to_end(image)

# MEASURE: many iterations; report the distribution, not just the mean
N = 1000
lat = []
for _ in range(N):
    t0 = time.perf_counter()
    run_end_to_end(image)
    lat.append((time.perf_counter() - t0) * 1000)  # ms

lat = np.array(lat)
print(f"mean : {lat.mean():.2f} ms")
print(f"p50  : {np.percentile(lat, 50):.2f} ms")
print(f"p99  : {np.percentile(lat, 99):.2f} ms")  # tail latency matters!
# 1000 / mean is only the throughput of this serial, one-at-a-time loop;
# batched or pipelined throughput must be measured separately.
print(f"throughput: {1000 / lat.mean():.0f} FPS")

# Report p99 for real-time SLAs: the mean hides the bad cases.
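To see where the end-to-end time goes, each stage can be timed separately. The sketch below is one way to do that, not the course's own tooling: `time_stages` is a hypothetical helper, and the three stages are NumPy stand-ins (a flatten, a matrix multiply, an argmax) in place of real pre-processing, accelerator calls, and post-processing.

```python
import time
from collections import defaultdict

import numpy as np


def time_stages(stages, sample, iterations=200, warmup=20):
    """Pass `sample` through each (name, fn) stage in order; return per-stage latencies in ms."""
    for _ in range(warmup):  # untimed warm-up passes
        x = sample
        for _, fn in stages:
            x = fn(x)
    timings = defaultdict(list)
    for _ in range(iterations):
        x = sample
        for name, fn in stages:
            t0 = time.perf_counter()
            x = fn(x)
            timings[name].append((time.perf_counter() - t0) * 1000)
    return {name: np.array(values) for name, values in timings.items()}


# Stand-in data and stages (assumptions for illustration only)
rng = np.random.default_rng(0)
WEIGHTS = rng.standard_normal((3072, 10)).astype(np.float32)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

stages = [
    ("pre-process", lambda img: img.reshape(-1).astype(np.float32) / 255.0),
    ("compute", lambda x: x @ WEIGHTS),
    ("post-process", lambda logits: int(np.argmax(logits))),
]

per_stage = time_stages(stages, image)
total_mean = sum(v.mean() for v in per_stage.values())
for name, v in per_stage.items():
    print(f"{name:12s} mean {v.mean():.4f} ms ({v.mean() / total_mean:.0%} of the summed stages)")
```

The per-stage means will not add up exactly to a single end-to-end timer because each timer call has its own small overhead, so treat the shares as a guide to which stage dominates rather than as exact accounting.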

5. Measuring Power

Power must be measured under sustained load, not idle or peak-burst. Use the board's onboard sensors or an inline power meter.

Method | How | Accuracy
Onboard sensors (PMBus/INA) | Read board power rails via sysfs / xbutil | Good (chip + board)
Inline DC power meter | Measure at the barrel/PCIe power input | Best (true wall power)
Xilinx Power Estimator (XPE) | Spreadsheet estimate made before implementation | Rough (design-time only)
Vivado power report | Post-implementation analysis | Good (with real activity, SAIF)
read board power (Xilinx)
# Alveo: query power while a load runs in another shell
xbutil examine --report electrical
# → reports 12V aux/pex rails, total board power in watts

# Kria / Zynq MPSoC: read INA260 sensor via sysfs
cat /sys/class/hwmon/hwmon*/power1_input   # microwatts

# Efficiency = throughput / power
# e.g. 400 FPS / 5.0 W = 80 FPS/W (your figure of merit)
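Because power should be averaged over a sustained run, a single reading is not enough. The sketch below averages several `power1_input` readings (microwatts) and computes FPS/W. The helper names and the sample values are hypothetical; on a real board the samples would be read from sysfs at a fixed interval while the benchmark runs.

```python
from statistics import mean


def average_power_watts(samples_microwatts):
    """Average hwmon power1_input readings (microwatts) and convert to watts."""
    if not samples_microwatts:
        raise ValueError("need at least one power sample")
    return mean(samples_microwatts) / 1e6


def efficiency_fps_per_watt(fps, power_w):
    """Figure of merit: throughput divided by sustained power."""
    return fps / power_w


# Hypothetical readings taken about once per second under load
samples = [4_950_000, 5_020_000, 5_000_000, 5_030_000]
power = average_power_watts(samples)
print(f"{power:.2f} W, {efficiency_fps_per_watt(400, power):.0f} FPS/W")
```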

6. The Roofline Model — Find the Bottleneck

The roofline (from Day 1 & 6) is the single best diagnostic. Plot your kernel's arithmetic intensity against the roofline: if it sits under the slanted memory line, you're memory-bound; under the flat line, compute-bound. This tells you exactly what to fix.

Roofline — Diagnosing Your Accelerator
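[Figure: roofline plot. X-axis: arithmetic intensity (OPS/byte). The flat roof is peak compute (2.46 TOPS); the slanted roof is memory bandwidth; they meet at the ridge point. "Your layer" sits at 67% of the roof, with the gap to the roof marked. Left of the ridge is memory-bound (fix data reuse / bandwidth); right of the ridge is compute-bound.]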
Diagnosis: a point below the slanted line → add bandwidth or reuse data on-chip (Day 6). A point below the flat line but right of the ridge → add MACs or raise utilization (Day 9).
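The diagnosis can be reduced to one formula: attainable performance is the smaller of peak compute and arithmetic intensity times memory bandwidth. The sketch below uses this section's 2.46 TOPS peak; the 19.2 GB/s bandwidth and the two layer intensities are placeholder assumptions, not measured board figures.

```python
def attainable_tops(ai_ops_per_byte, peak_tops, mem_bw_gbs):
    """Roofline: min(peak compute, arithmetic intensity x memory bandwidth)."""
    memory_roof_tops = ai_ops_per_byte * mem_bw_gbs * 1e9 / 1e12
    return min(peak_tops, memory_roof_tops)


def ridge_point(peak_tops, mem_bw_gbs):
    """Arithmetic intensity (OPS/byte) at which the two roofs meet."""
    return peak_tops * 1e12 / (mem_bw_gbs * 1e9)


def diagnose(ai_ops_per_byte, peak_tops, mem_bw_gbs):
    """Classify a kernel by which side of the ridge point it falls on."""
    if ai_ops_per_byte < ridge_point(peak_tops, mem_bw_gbs):
        return "memory-bound"
    return "compute-bound"


PEAK_TOPS = 2.46    # from the peak TOPS example
MEM_BW_GBS = 19.2   # assumed placeholder bandwidth in GB/s

for name, ai in [("layer A", 40.0), ("layer B", 300.0)]:  # hypothetical layers
    roof = attainable_tops(ai, PEAK_TOPS, MEM_BW_GBS)
    print(f"{name}: AI={ai:.0f} OPS/byte, roof={roof:.2f} TOPS, {diagnose(ai, PEAK_TOPS, MEM_BW_GBS)}")
```

A layer's measured effective TOPS is then compared against its own roof, not against the global peak: a memory-bound layer running at its roof is already as fast as the bandwidth allows.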

7. Common Bottlenecks & Fixes

Symptom | Likely Cause | Fix (course day)
Low utilization, DSPs idle | Memory-bound (data starvation) | Ping-pong buffers, HBM, reuse (Day 6)
II > 1 in the report | Port conflict / loop dependency | ARRAY_PARTITION, partial sums (Day 9/10)
Latency high, FPS fine | Deep pipeline fill (expected) | Acceptable for throughput workloads
FPS low, latency fine | Not enough parallelism | UNROLL more, more DPU cores (Day 9/11)
Pre/post-proc dominates | CPU bottleneck, not the FPGA | Multi-thread or offload to fabric (Day 11)
Power too high | No gating / high V/f | Clock gating, DVFS (Day 12)

8. MLPerf — Benchmarking Fairly

MLPerf Inference (from MLCommons) is the credible way to report numbers. It fixes the model, the dataset, an accuracy floor, and the measurement scenario — so results are comparable across CPU, GPU, FPGA, and ASIC.

MLPerf Scenario | Measures | Real-World Analog
Single-Stream | p90 latency, one query at a time | Phone camera, AR glasses
Multi-Stream | Streams at a latency bound | Multi-camera car / NVR
Server | QPS at a latency SLA (Poisson arrivals) | Datacenter inference API
Offline | Raw throughput, no latency limit | Batch photo tagging
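To build intuition for the Server scenario, the sketch below simulates Poisson arrivals into a single accelerator that serves one query at a time with a fixed service time, then finds the highest tested QPS whose p99 latency stays within an SLA. This is a toy queueing model under those stated assumptions; official MLPerf runs use the MLCommons LoadGen tool, and the function names and numbers here are illustrative.

```python
import numpy as np


def simulate_server(qps, service_ms, n_queries=20_000, seed=0):
    """Latencies (ms) for Poisson arrivals served one at a time, first come first served."""
    rng = np.random.default_rng(seed)
    gaps_ms = rng.exponential(1000.0 / qps, size=n_queries)  # inter-arrival times
    arrivals = np.cumsum(gaps_ms)
    latencies = np.empty(n_queries)
    free_at = 0.0  # time at which the accelerator finishes its current query
    for i, t in enumerate(arrivals):
        start = max(t, free_at)       # wait if the accelerator is still busy
        free_at = start + service_ms
        latencies[i] = free_at - t    # queueing delay + service time
    return latencies


def max_qps_under_sla(service_ms, sla_ms, candidates):
    """Highest candidate QPS whose simulated p99 latency stays within the SLA."""
    best = None
    for qps in sorted(candidates):
        p99 = np.percentile(simulate_server(qps, service_ms), 99)
        if p99 <= sla_ms:
            best = qps
    return best


SERVICE_MS = 5.0  # assumed per-query service time
for qps in (50, 100, 150, 190):
    p99 = np.percentile(simulate_server(qps, SERVICE_MS), 99)
    print(f"{qps:3d} QPS -> p99 {p99:.1f} ms")
print("max QPS within a 15 ms SLA:", max_qps_under_sla(SERVICE_MS, 15.0, (50, 100, 150, 190)))
```

The point of the exercise is that tail latency climbs steeply as the offered load approaches the accelerator's capacity (200 QPS for a 5 ms service time), which is why the Server result is usually well below the Offline throughput.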

The Three Rules of Honest Benchmarking

1. Name the model (ResNet-50 INT8, not "a CNN").
2. State the accuracy: speed at degraded accuracy is meaningless; report top-1 alongside FPS.
3. Define the boundary: end-to-end vs compute-only, single-stream vs offline.

Numbers without these three are noise.

9. A Complete Benchmark Report

EXAMPLE REPORT: ResNet-50 on Kria KV260

Model:        ResNet-50 v1.5, INT8 (PTQ)
Accuracy:     75.2% top-1 (FP32: 76.1% → −0.9%)
Scenario:     single-stream, end-to-end
Hardware:     Kria KV260, 1× B4096 DPU @ 300 MHz
Latency p50:  28.4 ms
Latency p99:  34.1 ms   ← report the tail
Throughput:   ~352 FPS (batched, multi-thread)
Power:        5.1 W (board, sustained)
Efficiency:   69 FPS/W
Peak TOPS:    2.46 | Effective: 1.64 | Utilization: 67%
Bottleneck:   image resize on ARM (18% of latency)
Next step:    offload resize to PL → est. p50 ~24 ms

Day 14 — Key Takeaways

Next — Day 15: Production Edge AI Systems — taking the accelerator from bench to field: real platforms, system integration, deployment, reliability, and the course capstone.
