How do you measure real TOPS on an FPGA accelerator?

Real (effective) TOPS = total operations executed / measured runtime, where one MAC counts as 2 operations. Peak TOPS = number of MACs × clock × 2. Effective TOPS is always lower than peak because of stalls, memory waits, and underutilized cycles; the ratio effective/peak is your hardware utilization — the single most important profiling number.

What is the difference between latency and throughput?

Latency is the time for one inference from input to output (ms), critical for real-time systems. Throughput is how many inferences complete per second (FPS or IPS), critical for batch processing. A pipelined design can have high latency (a deep pipeline) yet very high throughput (one result per initiation interval once the pipeline is full), so neither metric can be inferred from the other and both must be measured.
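The pipelined case can be made concrete with a little arithmetic. The sketch below assumes an idealized pipeline with no stalls; the function name `pipeline_metrics` and the depth, initiation interval, and clock values are illustrative assumptions, not figures from a real design.

```python
def pipeline_metrics(depth_cycles: int, ii_cycles: int, clock_hz: float):
    """Return (latency_s, throughput_per_s) for an idealized, stall-free pipeline."""
    latency_s = depth_cycles / clock_hz      # time for one input to traverse every stage
    throughput = clock_hz / ii_cycles        # results per second once the pipeline is full
    return latency_s, throughput


# Hypothetical numbers: 3000-cycle-deep pipeline, II = 1, 300 MHz clock
lat, thr = pipeline_metrics(depth_cycles=3000, ii_cycles=1, clock_hz=300e6)
print(f"latency: {lat * 1e6:.1f} us, throughput: {thr / 1e6:.0f} M results/s")
```

Doubling the depth in this model doubles latency and leaves throughput unchanged, which is why one metric says little about the other.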

What is MLPerf and does it cover FPGAs?

MLPerf is the industry-standard benchmark suite from MLCommons for measuring ML training and inference performance fairly across hardware. MLPerf Inference defines standard models, datasets, accuracy targets, and scenarios (single-stream, multi-stream, server, offline). FPGAs can and do submit results; using its methodology — fixed model, accuracy floor, defined scenario — makes your numbers credible and comparable.

Benchmarking & Profiling FPGA Neural Networks — TOPS, Latency, MLPerf

1. Why Benchmarking Honestly Matters

"It's fast" is not a number. An accelerator is only as good as its measured performance under real conditions — and the field is full of misleading claims (peak TOPS that's never reached, latency without pre-processing, throughput with impossible batch sizes). This lesson teaches you to measure what's real, compare fairly, and find what's actually slowing you down.

2. The Four Core Metrics

The Metrics That Matter

⏱️ Latency (ms)

Time for ONE inference, input→output. Real-time systems live or die here.

🚀 Throughput (FPS)

Inferences per second. Batch/server workloads optimize this.

⚡ Power (W)

Wall power under load. The edge constraint (Day 12).

🎯 Efficiency (FPS/W or TOPS/W)

The true figure of merit — performance per watt.

3. Peak vs Effective TOPS

Peak TOPS is marketing; effective TOPS is reality. The gap between them — your utilization — is the most revealing number in the whole profile.

Peak TOPS (theoretical maximum):
  Peak = num_MACs × clock_freq × 2    (1 MAC = 2 ops: multiply + add)
  Example: 4096 MACs @ 300 MHz = 4096 × 300e6 × 2 = 2.46 TOPS peak

Effective TOPS (what you actually achieve):
  Effective = total_ops_in_model / measured_runtime
  Example: ResNet-50 = 8.2 GOPs, runs in 5 ms = 8.2e9 / 5e-3 = 1.64 TOPS effective

Utilization = Effective / Peak = 1.64 / 2.46 = 67%
  → 33% of cycles wasted (stalls, memory waits, idle PEs)
  → THIS number tells you how much headroom remains
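The same arithmetic as a small Python sketch, reusing the numbers from the example above. The function names `peak_tops` and `effective_tops` are illustrative, not part of any vendor tool.

```python
def peak_tops(num_macs: int, clock_hz: float) -> float:
    """Theoretical maximum: every MAC fires every cycle; 1 MAC = 2 ops."""
    return num_macs * clock_hz * 2 / 1e12


def effective_tops(total_ops: float, runtime_s: float) -> float:
    """Achieved rate: operations in the model divided by measured runtime."""
    return total_ops / runtime_s / 1e12


peak = peak_tops(num_macs=4096, clock_hz=300e6)
eff = effective_tops(total_ops=8.2e9, runtime_s=5e-3)   # ResNet-50, 5 ms per inference
util = eff / peak
print(f"peak {peak:.2f} TOPS, effective {eff:.2f} TOPS, utilization {util:.0%}")
```

The result is only as good as `runtime_s`: it should come from a warmed-up, repeated measurement rather than a single run.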

Beware "Peak TOPS" Claims

A vendor saying "10 TOPS!" is quoting peak: the number you'd get if every MAC fired every cycle with zero stalls, which real workloads do not sustain. In practice, utilization often lands somewhere around 40–80%, depending on the model and the memory system. Always ask for effective TOPS on a named model, or better, latency and FPS on a standard benchmark.

4. Measuring Latency Correctly

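Latency measurement is full of traps. Where you start and stop the timer can change the answer by 2–5×, because pre-processing, data transfer, and post-processing may each cost as much as the compute itself. Be explicit about what's included.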

Latency — What's In the Timer?

benchmark.py — proper measurement

import time
import numpy as np

# run_end_to_end(image) is your own pipeline: pre-process + transfer + compute + post-process.

# WARM UP: first runs include one-time setup; never measure them.
# Warm up the same path you will time, so buffers, caches and clocks settle.
for _ in range(20):
    run_end_to_end(image)

# MEASURE: many iterations, report the distribution not just the mean
N = 1000
lat = []
for _ in range(N):
    t0 = time.perf_counter()
    run_end_to_end(image)          # incl. pre + transfer + compute + post
    lat.append((time.perf_counter() - t0) * 1000)  # ms

lat = np.array(lat)
print(f"mean   : {lat.mean():.2f} ms")
print(f"p50    : {np.percentile(lat, 50):.2f} ms")
print(f"p99    : {np.percentile(lat, 99):.2f} ms")   # tail latency matters!
# 1000 / mean latency is the throughput of one request at a time only;
# batched or multi-threaded throughput must be measured separately.
print(f"single-stream throughput: {1000 / lat.mean():.0f} FPS")
# Report p99 for real-time SLAs; the mean hides the bad cases.
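To make the timer boundary explicit, it can help to time each stage separately and report its share of the total. The following is a minimal sketch: `time_stages` is a hypothetical helper, and the three lambda stages are stand-ins for a real pre-process, transfer + compute, and post-process path.

```python
import time
from collections import defaultdict


def time_stages(stages, sample, iterations=100, warmup=10):
    """stages: list of (name, fn); each fn receives the previous stage's output.
    Returns the mean time per stage in milliseconds, warm-up runs excluded."""
    totals = defaultdict(float)
    for i in range(warmup + iterations):
        data = sample
        for name, fn in stages:
            t0 = time.perf_counter()
            data = fn(data)
            dt = time.perf_counter() - t0
            if i >= warmup:                  # discard warm-up iterations
                totals[name] += dt
    return {name: totals[name] / iterations * 1000 for name, _ in stages}


# Hypothetical stand-ins for the real pipeline stages
stages = [
    ("preprocess", lambda x: [v / 255.0 for v in x]),
    ("transfer+compute", lambda x: sum(x)),
    ("postprocess", lambda x: round(x, 3)),
]
breakdown = time_stages(stages, sample=list(range(1000)))
total = sum(breakdown.values())
for name, ms in breakdown.items():
    print(f"{name:18s} {ms:8.4f} ms  {ms / total:6.1%}")
```

A breakdown like this shows at a glance whether a quoted latency is compute-only or end-to-end, and which stage to attack first.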

5. Measuring Power

Power must be measured under sustained load, not idle or peak-burst. Use the board's onboard sensors or an inline power meter.

Method	How	Accuracy
Onboard sensors (PMBus/INA)	Read board power rails via sysfs / xbutil	Good — chip + board
Inline DC power meter	Measure at the barrel/PCIe power input	Best — true wall power
Xilinx Power Estimator (XPE)	Spreadsheet pre-silicon estimate	Rough — design-time only
Vivado power report	Post-implementation analysis	Good — with real activity (SAIF)

read board power (Xilinx)

# Alveo: query power while a load runs in another shell
xbutil examine --report electrical
# → reports 12V aux/pex rails, total board power in watts

# Kria / Zynq MPSoC: read INA260 sensor via sysfs
cat /sys/class/hwmon/hwmon*/power1_input    # microwatts

# Efficiency = throughput / power
# e.g. 400 FPS / 5.0 W = 80 FPS/W  (your figure of merit)
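For sustained measurements, a single `cat` is not enough; the sensor should be sampled repeatedly while the load runs. A minimal Python sketch follows, assuming a hwmon-style file that reports microwatts as in the shell example above. The helper names are hypothetical, and the exact sensor path varies by board, so it is passed in rather than guessed.

```python
import statistics
import time


def read_power_watts(path: str) -> float:
    """Read one hwmon power*_input file (value in microwatts) and return watts."""
    with open(path) as f:
        return int(f.read().strip()) / 1e6


def mean_power_watts(path: str, samples: int = 50, interval_s: float = 0.1) -> float:
    """Average several readings taken while the accelerator is under sustained load."""
    readings = []
    for _ in range(samples):
        readings.append(read_power_watts(path))
        time.sleep(interval_s)
    return statistics.mean(readings)


def fps_per_watt(fps: float, watts: float) -> float:
    """Efficiency figure of merit: throughput divided by power."""
    return fps / watts


# Usage (path is board-specific; run while the benchmark loop is active):
#   watts = mean_power_watts("/sys/class/hwmon/hwmon0/power1_input")
#   print(f"{fps_per_watt(400, watts):.0f} FPS/W")
```

Sampling for several seconds smooths out short bursts and gives the sustained figure that belongs in a report.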

6. The Roofline Model — Find the Bottleneck

The roofline (from Day 1 & 6) is the single best diagnostic. Plot your kernel's arithmetic intensity (operations per byte moved to or from external memory) against the roofline: if the point falls left of the ridge, under the slanted memory line, you're memory-bound; if it falls right of the ridge, under the flat line, you're compute-bound. This tells you what to fix first.

Roofline — Diagnosing Your Accelerator

Diagnosis: a point below the slanted line → add bandwidth or reuse data on-chip (Day 6). A point below the flat line but right of the ridge → add MACs or raise utilization (Day 9).
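The diagnosis can also be done numerically. The sketch below computes the attainable performance as min(peak, bandwidth × intensity) and classifies the kernel by comparing its intensity to the ridge point. The function name `roofline`, the 19.2 GB/s memory bandwidth, and the two intensities are assumptions chosen for illustration; substitute your board's measured values.

```python
def roofline(peak_ops_per_s: float, mem_bw_bytes_per_s: float, intensity_ops_per_byte: float):
    """Return (attainable ops/s, ridge point in ops/byte, diagnosis)."""
    ridge = peak_ops_per_s / mem_bw_bytes_per_s      # intensity where the two lines meet
    attainable = min(peak_ops_per_s, mem_bw_bytes_per_s * intensity_ops_per_byte)
    bound = "memory-bound" if intensity_ops_per_byte < ridge else "compute-bound"
    return attainable, ridge, bound


PEAK = 4096 * 300e6 * 2      # 2.46 TOPS peak, from the earlier example
BANDWIDTH = 19.2e9           # bytes/s, an assumed external-memory bandwidth

for intensity in (50, 200):  # ops per byte, hypothetical kernels
    attainable, ridge, bound = roofline(PEAK, BANDWIDTH, intensity)
    print(f"AI={intensity:4d} ops/B  ridge={ridge:.0f}  "
          f"attainable={attainable / 1e12:.2f} TOPS  {bound}")
```

If measured effective TOPS sits well below the attainable value for the kernel's intensity, the gap is not explained by the roofline alone and points to stalls or idle PEs.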

7. Common Bottlenecks & Fixes

Symptom	Likely Cause	Fix (course day)
Low utilization, DSPs idle	Memory-bound — data starvation	Ping-pong buffers, HBM, reuse (Day 6)
II > 1 in the report	Port conflict / loop dependency	ARRAY_PARTITION, partial sums (Day 9/10)
Latency high, FPS fine	Deep pipeline fill (expected)	Acceptable for throughput workloads
FPS low, latency fine	Not enough parallelism	UNROLL more, more DPU cores (Day 9/11)
Pre/post-proc dominates	CPU bottleneck, not the FPGA	Multi-thread or offload to fabric (Day 11)
Power too high	No gating / high V/f	Clock gating, DVFS (Day 12)

8. MLPerf — Benchmarking Fairly

MLPerf Inference (from MLCommons) is the credible way to report numbers. It fixes the model, the dataset, an accuracy floor, and the measurement scenario — so results are comparable across CPU, GPU, FPGA, and ASIC.

MLPerf Scenario	Measures	Real-World Analog
Single-Stream	p90 latency, one query at a time	Phone camera, AR glasses
Multi-Stream	p99 latency of multi-sample queries (early rounds: number of streams at a latency bound)	Multi-camera car / NVR
Server	QPS at a latency SLA (Poisson arrivals)	Datacenter inference API
Offline	raw throughput, no latency limit	Batch photo tagging
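To see why the Server scenario is harder than Offline, a toy simulation helps. The sketch below is not MLPerf LoadGen; it is a simplified model under stated assumptions: one accelerator, a FIFO queue, a fixed 5 ms service time, and Poisson arrivals generated with Python's `random` module. The function name `simulate_server` and all numbers are hypothetical.

```python
import random


def simulate_server(qps: float, service_s: float, n_queries: int = 20000, seed: int = 0) -> float:
    """Single accelerator, FIFO queue, Poisson arrivals. Returns p99 latency in seconds."""
    rng = random.Random(seed)
    t_arrive = 0.0
    t_free = 0.0                               # time at which the accelerator is next idle
    latencies = []
    for _ in range(n_queries):
        t_arrive += rng.expovariate(qps)       # exponential gap between arrivals
        start = max(t_arrive, t_free)          # queue if the accelerator is busy
        t_free = start + service_s
        latencies.append(t_free - t_arrive)    # queueing delay + service time
    latencies.sort()
    return latencies[int(0.99 * (len(latencies) - 1))]


SERVICE_S = 0.005                              # assumed 5 ms per inference (200 QPS ceiling)
for qps in (50, 100, 150, 190):
    p99_ms = simulate_server(qps, SERVICE_S) * 1000
    print(f"{qps:4d} QPS -> p99 {p99_ms:.1f} ms")
```

As the offered load approaches the 200 QPS ceiling, queueing delay dominates the tail, so the QPS an accelerator can sustain under a latency SLA is lower than its offline throughput.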

The Three Rules of Honest Benchmarking

1. Name the model (ResNet-50 INT8, not "a CNN"). 2. State the accuracy — speed at degraded accuracy is meaningless; report top-1 alongside FPS. 3. Define the boundary — end-to-end vs compute-only, single-stream vs offline. Numbers without these three are noise.

9. A Complete Benchmark Report

EXAMPLE REPORT: ResNet-50 on Kria KV260
Model:        ResNet-50 v1.5, INT8 (PTQ)
Accuracy:     75.2% top-1 (FP32: 76.1% → −0.9%)
Scenario:     single-stream, end-to-end
Hardware:     Kria KV260, 1× B4096 DPU @ 300 MHz
Latency p50:  28.4 ms
Latency p99:  34.1 ms   ← report the tail
Throughput:   ~352 FPS (batched, multi-thread)
Power:        5.1 W (board, sustained)
Efficiency:   69 FPS/W
Peak TOPS:    2.46 | Effective: 1.64 | Utilization: 67%
Bottleneck:   image resize on ARM (18% of latency)
Next step:    offload resize to PL → est. p50 ~24 ms

Day 14 — Key Takeaways

✅ Four metrics: latency, throughput, power, efficiency (FPS/W is the figure of merit)
✅ Peak vs effective TOPS — utilization (effective/peak) reveals wasted cycles
✅ Latency: warm up, measure end-to-end, report p99 not just the mean
✅ Power: measure sustained, via onboard sensors or inline meter
✅ Roofline diagnoses memory-bound vs compute-bound at a glance
✅ Bottleneck table maps each symptom to a fix from earlier days
✅ MLPerf scenarios (single/multi-stream, server, offline) make results comparable
✅ Honest benchmarking: name the model, state accuracy, define the boundary

Next — Day 15: Production Edge AI Systems — taking the accelerator from bench to field: real platforms, system integration, deployment, reliability, and the course capstone.
