Numbers don't lie. Measure real TOPS, latency, throughput, and power; compute efficiency; apply the roofline model; use MLPerf methodology; and find the true bottleneck in your accelerator.
"It's fast" is not a number. An accelerator is only as good as its measured performance under real conditions — and the field is full of misleading claims (peak TOPS that's never reached, latency without pre-processing, throughput with impossible batch sizes). This lesson teaches you to measure what's real, compare fairly, and find what's actually slowing you down.
Peak TOPS is marketing; effective TOPS is reality. The gap between them — your utilization — is the most revealing number in the whole profile.
A vendor saying "10 TOPS!" is quoting peak — the number you'd get if every MAC fired every cycle with zero stalls, which never happens. Real workloads hit 40–80% utilization. Always ask for effective TOPS on a named model, or better, latency and FPS on a standard benchmark.
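
To make the gap between peak and effective TOPS concrete, here is a minimal sketch. The accelerator size, clock, and frame rate are illustrative assumptions, not measurements, and the ResNet-50 figure of roughly 4.1 billion MACs per 224x224 image is an approximation. One MAC is counted as two operations.

```python
def peak_tops(num_macs: int, clock_hz: float) -> float:
    """Peak TOPS if every MAC unit fires every cycle (1 MAC = 2 ops)."""
    return 2 * num_macs * clock_hz / 1e12


def effective_tops(macs_per_inference: float, fps: float) -> float:
    """TOPS actually delivered on a named model at a measured frame rate."""
    return 2 * macs_per_inference * fps / 1e12


# Hypothetical accelerator: 16,384 INT8 MAC units at 300 MHz (assumed values).
peak = peak_tops(num_macs=16_384, clock_hz=300e6)

# ResNet-50 at 224x224 needs roughly 4.1e9 MACs per image (approximate).
# 700 FPS is a made-up measurement used only for illustration.
eff = effective_tops(macs_per_inference=4.1e9, fps=700)

utilization = eff / peak
print(f"peak        : {peak:.2f} TOPS")
print(f"effective   : {eff:.2f} TOPS")
print(f"utilization : {utilization:.0%}")
```

Swapping in your own MAC count, clock, and measured FPS gives the utilization figure to quote next to any peak number.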
Latency measurement is full of traps. Where you start and stop the timer can change the answer by 2–5×, so state explicitly what the number includes: pre-processing, host-to-device transfer, compute, and post-processing.
import time

import numpy as np

# run_end_to_end() and image come from your application.

# WARM UP: first runs include one-time setup; never measure them
for _ in range(20):
    run_end_to_end(image)  # warm the same path you are about to time

# MEASURE: many iterations, report the distribution not just the mean
N = 1000
lat = []
for _ in range(N):
    t0 = time.perf_counter()
    run_end_to_end(image)  # incl. pre + transfer + compute + post
    lat.append((time.perf_counter() - t0) * 1000)  # ms
lat = np.array(lat)
print(f"mean : {lat.mean():.2f} ms")
print(f"p50 : {np.percentile(lat, 50):.2f} ms")
print(f"p99 : {np.percentile(lat, 99):.2f} ms")  # tail latency matters!
# 1000 / mean is the frame rate only when frames run one at a time, back to back
print(f"throughput: {1000 / lat.mean():.0f} FPS")
# Report p99 for real-time SLAs: the mean hides the bad cases.

Power must be measured under sustained load, not idle or peak-burst. Use the board's onboard sensors or an inline power meter.
| Method | How | Accuracy |
|---|---|---|
| Onboard sensors (PMBus/INA) | Read board power rails via sysfs / xbutil | Good — chip + board |
| Inline DC power meter | Measure at the barrel/PCIe power input | Best — true wall power |
| Xilinx Power Estimator (XPE) | Spreadsheet pre-silicon estimate | Rough — design-time only |
| Vivado power report | Post-implementation analysis | Good — with real activity (SAIF) |
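
For the onboard-sensor route, a minimal sketch of sampling power under sustained load could look like the following. It assumes the board exposes a Linux hwmon `power*_input` file, which reports microwatts; the exact path, sample count, and interval are assumptions to adapt to your board.

```python
import statistics
import time


def read_power_watts(path: str) -> float:
    """Read one hwmon power sample; power*_input files report microwatts."""
    with open(path) as f:
        return int(f.read().strip()) / 1e6


def sustained_power_watts(path: str, samples: int = 50,
                          interval_s: float = 0.1) -> float:
    """Average several samples taken while the workload is already running."""
    readings = []
    for _ in range(samples):
        readings.append(read_power_watts(path))
        time.sleep(interval_s)
    return statistics.mean(readings)


def fps_per_watt(fps: float, watts: float) -> float:
    """Efficiency figure of merit: throughput divided by sustained power."""
    return fps / watts
```

Start the inference loop in another process first, then call `sustained_power_watts()` on the sensor file so the average reflects steady-state load rather than idle or a short burst. Pass the result to `fps_per_watt()` together with the measured frame rate.
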
# Alveo: query power while a load runs in another shell
xbutil examine --report electrical
# → reports 12V aux/pex rails, total board power in watts
# Kria / Zynq MPSoC: read INA260 sensor via sysfs
cat /sys/class/hwmon/hwmon*/power1_input # microwatts
# Efficiency = throughput / power
# e.g. 400 FPS / 5.0 W = 80 FPS/W (your figure of merit)

The roofline (from Day 1 & 6) is the single best diagnostic. Plot your kernel's arithmetic intensity against the roofline: if it sits under the slanted memory line, you're memory-bound; under the flat line, compute-bound. This tells you exactly what to fix.
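
The roofline check can be sketched in a few lines. The peak compute, memory bandwidth, and per-kernel arithmetic intensities below are made-up values for illustration; substitute the figures for your own platform and kernels.

```python
def attainable_tops(intensity_ops_per_byte: float,
                    peak_tops: float,
                    bandwidth_gb_s: float) -> float:
    """Roofline: min(compute roof, memory bandwidth * arithmetic intensity)."""
    memory_roof_tops = bandwidth_gb_s * 1e9 * intensity_ops_per_byte / 1e12
    return min(peak_tops, memory_roof_tops)


def classify(intensity_ops_per_byte: float,
             peak_tops: float,
             bandwidth_gb_s: float) -> str:
    """Compare the kernel's intensity with the ridge point of the roofline."""
    ridge = peak_tops * 1e12 / (bandwidth_gb_s * 1e9)  # ops per byte
    return "memory-bound" if intensity_ops_per_byte < ridge else "compute-bound"


# Assumed platform: 10 TOPS peak, 19.2 GB/s DRAM bandwidth (illustrative).
PEAK_TOPS, BW_GB_S = 10.0, 19.2

# Hypothetical kernels with made-up arithmetic intensities (ops per byte).
for name, ai in [("depthwise conv", 20.0), ("3x3 conv", 900.0)]:
    roof = attainable_tops(ai, PEAK_TOPS, BW_GB_S)
    verdict = classify(ai, PEAK_TOPS, BW_GB_S)
    print(f"{name:15s} AI={ai:6.1f} ops/B  roof={roof:5.2f} TOPS  {verdict}")
```

A kernel whose measured TOPS sits well below its roof has a problem other than the roofline limit, such as stalls or a pipeline II above 1.
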
| Symptom | Likely Cause | Fix (course day) |
|---|---|---|
| Low utilization, DSPs idle | Memory-bound — data starvation | Ping-pong buffers, HBM, reuse (Day 6) |
| II > 1 in the report | Port conflict / loop dependency | ARRAY_PARTITION, partial sums (Day 9/10) |
| Latency high, FPS fine | Deep pipeline fill (expected) | Acceptable for throughput workloads |
| FPS low, latency fine | Not enough parallelism | UNROLL more, more DPU cores (Day 9/11) |
| Pre/post-proc dominates | CPU bottleneck, not the FPGA | Multi-thread or offload to fabric (Day 11) |
| Power too high | No gating / high V/f | Clock gating, DVFS (Day 12) |
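
To see which row of the table applies, it helps to time each pipeline stage separately instead of only end to end. The sketch below uses stand-in stage functions with `time.sleep` as placeholders (all hypothetical); replace them with your real pre-processing, accelerator call, and post-processing.

```python
import time
from collections import defaultdict


def profile_stages(stages, payload, iterations=50):
    """Time each stage of a pipeline separately; return mean ms per stage."""
    totals = defaultdict(float)
    for _ in range(iterations):
        data = payload
        for name, fn in stages:
            t0 = time.perf_counter()
            data = fn(data)
            totals[name] += (time.perf_counter() - t0) * 1000
    return {name: total / iterations for name, total in totals.items()}


# Stand-in stages (hypothetical); swap in your real functions.
def preprocess(x):
    time.sleep(0.004)  # pretend CPU resize + normalize
    return x


def compute(x):
    time.sleep(0.001)  # pretend accelerator run
    return x


def postprocess(x):
    return x


means = profile_stages(
    [("preprocess", preprocess), ("compute", compute), ("postprocess", postprocess)],
    payload=None,
    iterations=20,
)
bottleneck = max(means, key=means.get)
total = sum(means.values())
for name, ms in means.items():
    print(f"{name:12s} {ms:6.2f} ms  ({ms / total:.0%})")
print(f"bottleneck: {bottleneck}")
```

If the largest share belongs to a CPU stage, tuning the fabric further will not move the end-to-end number.
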
MLPerf Inference (from MLCommons) is the credible way to report numbers. It fixes the model, the dataset, an accuracy floor, and the measurement scenario — so results are comparable across CPU, GPU, FPGA, and ASIC.
| MLPerf Scenario | Measures | Real-World Analog |
|---|---|---|
| Single-Stream | p90 latency, one query at a time | Phone camera, AR glasses |
| Multi-Stream | p99 latency of a multi-sample query (earlier rounds: number of streams at a latency bound) | Multi-camera car / NVR |
| Server | QPS at a latency SLA (Poisson arrivals) | Datacenter inference API |
| Offline | raw throughput, no latency limit | Batch photo tagging |
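
To build intuition for why the Server scenario is harder than Offline, here is a toy simulation of Poisson arrivals feeding a single accelerator. It is not MLPerf's LoadGen, which official submissions use; the fixed service time, the latency bound, and the single FIFO queue are all simplifying assumptions.

```python
import math
import random


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sequence (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def server_latencies_ms(qps, service_ms, n_queries=20_000, seed=0):
    """Toy Server scenario: Poisson arrivals into one FIFO accelerator."""
    rng = random.Random(seed)
    arrival = 0.0   # arrival time of the current query, ms
    free_at = 0.0   # time the accelerator finishes its queued work, ms
    latencies = []
    for _ in range(n_queries):
        arrival += rng.expovariate(qps / 1000.0)  # exponential gaps => Poisson
        start = max(arrival, free_at)             # wait if the device is busy
        free_at = start + service_ms
        latencies.append(free_at - arrival)       # queueing + service time
    return latencies


SERVICE_MS = 2.5  # assumed fixed per-query service time (400 QPS capacity)
SLA_MS = 15.0     # assumed p99 latency bound

for qps in (100, 200, 300, 380):
    p99 = percentile(server_latencies_ms(qps, SERVICE_MS), 99)
    verdict = "meets" if p99 <= SLA_MS else "misses"
    print(f"{qps:4d} QPS -> p99 {p99:7.2f} ms ({verdict} the {SLA_MS:.0f} ms SLA)")
```

The reportable Server number is the highest QPS that still meets the bound, which is generally lower than the Offline throughput of the same device because bursts of arrivals queue up.
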
1. Name the model (ResNet-50 INT8, not "a CNN").
2. State the accuracy: speed at degraded accuracy is meaningless, so report top-1 alongside FPS.
3. Define the boundary: end-to-end vs compute-only, single-stream vs offline.

Numbers without these three are noise.
Next — Day 15: Production Edge AI Systems — taking the accelerator from bench to field: real platforms, system integration, deployment, reliability, and the course capstone.