Why Measure Performance?
An accelerator that is never measured is an accelerator that is never optimised. Performance profiling answers the most important design question: "Where is the time being spent?" Is the systolic array stalled waiting for data? Is the DMA saturating the memory bus? Is the CPU overhead (tiling loop, cache flush) dominating the runtime? Without measurement, you're guessing.
Three Questions Performance Profiling Answers
1. How fast is it? Absolute throughput (GFlops, GB/s).
2. How much faster than the baseline? Speedup ratio vs CPU.
3. What's limiting it? Compute-bound, memory-bound, or control overhead?
RISC-V CSR Performance Counters
| CSR Name | Address | Counts | Usage |
|---|---|---|---|
| mcycle | 0xB00 | Clock cycles (free-running) | Wall-clock time measurement |
| minstret | 0xB02 | Retired instructions | CPI = mcycle / minstret |
| mhpmcounter3–31 | 0xB03–0xB1F | Configurable hardware events | Cache misses, branch mispredicts, stalls |
| mhpmevent3–31 | 0x323–0x33F | Event selector for mhpmcounterN | Set to select which event to count |
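| cycle (user) | 0xC00 | Read-only shadow of mcycle | Use in user-space programs (readable only when enabled via mcounteren/scounteren) |
| time | 0xC01 | Real-time counter (mirrors the CLINT mtime register) | Wall time in timer ticks; divide by the platform timebase frequency to get seconds |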
C — Read RISC-V CSR cycle counter
```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

// Provided elsewhere (signatures inferred from the calls below):
// the accelerator driver and the naive CPU reference.
uint32_t accel_matmul(int8_t *A, int8_t *B, int32_t *C, int N);
void cpu_matmul(int8_t *A, int8_t *B, int32_t *C, int N);

// Read cycle CSR using inline assembly (on RV64 one csrr returns all 64 bits).
// The "memory" clobber stops the compiler moving loads/stores across the read.
static inline uint64_t read_cycle(void) {
    uint64_t c;
    __asm__ volatile("csrr %0, cycle" : "=r"(c) : : "memory");
    return c;
}

static inline uint64_t read_instret(void) {
    uint64_t i;
    __asm__ volatile("csrr %0, instret" : "=r"(i) : : "memory");
    return i;
}

typedef struct { uint64_t cycles, instret; } perf_t;

static inline perf_t perf_start(void) {
    __asm__ volatile("" ::: "memory");   // compiler fence
    return (perf_t){ read_cycle(), read_instret() };
}

static inline perf_t perf_end(perf_t s) {
    __asm__ volatile("" ::: "memory");
    return (perf_t){ read_cycle() - s.cycles, read_instret() - s.instret };
}

// Usage: benchmark matrix multiply
void benchmark(int8_t *A, int8_t *B, int32_t *C_accel, int32_t *C_cpu, int N) {
    double gflops = 2.0 * N * N * N / 1e9;   // 2N³ ops in units of 1e9 (a work count, not a rate)

    // Benchmark: accelerator
    perf_t s = perf_start();
    accel_matmul(A, B, C_accel, N);
    perf_t accel = perf_end(s);

    // Benchmark: CPU reference
    s = perf_start();
    cpu_matmul(A, B, C_cpu, N);   // naive C triple-loop
    perf_t cpu = perf_end(s);

    double clk_mhz  = 1000.0;   // assume 1 GHz
    double accel_ms = accel.cycles / clk_mhz / 1e3;
    double cpu_ms   = cpu.cycles / clk_mhz / 1e3;
    // PRIu64 is the portable printf format for uint64_t
    printf("N=%d | Accel: %" PRIu64 " cyc (%.2f ms, %.2f GFlops) | "
           "CPU: %" PRIu64 " cyc (%.2f ms, %.2f GFlops) | Speedup: %.1fx\n",
           N, accel.cycles, accel_ms, gflops / accel_ms * 1e3,
           cpu.cycles, cpu_ms, gflops / cpu_ms * 1e3,
           (double)cpu.cycles / accel.cycles);
}
```
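A single csrr of cycle returns the full 64-bit count only on RV64. On RV32 the counter is split across cycle and cycleh, and the two halves have to be read with a high/low/high retry so that a rollover of the low half between reads is not missed. The following is a minimal sketch of that pattern, not the course's own code: the names read_split_counter, cycle_hi, cycle_lo and read_cycle64 are hypothetical, and the non-RV32 branch is only a simulated counter so the retry logic can be exercised on a host.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical callback type: returns one 32-bit half of a split counter. */
typedef uint32_t (*read32_fn)(void);

/* Combine two 32-bit halves into one consistent 64-bit value.
 * If the high half changed between the two reads, the low half rolled
 * over in between, so the whole sequence is repeated. */
static uint64_t read_split_counter(read32_fn read_hi, read32_fn read_lo)
{
    assert(read_hi != 0 && read_lo != 0);
    uint32_t hi, lo, hi_again;
    do {
        hi       = read_hi();
        lo       = read_lo();
        hi_again = read_hi();
    } while (hi != hi_again);
    return ((uint64_t)hi << 32) | lo;
}

#if defined(__riscv) && (__riscv_xlen == 32)
/* RV32: the counter is exposed as two CSRs, cycle (low) and cycleh (high). */
static uint32_t cycle_lo(void)
{
    uint32_t v;
    __asm__ volatile("csrr %0, cycle" : "=r"(v));
    return v;
}
static uint32_t cycle_hi(void)
{
    uint32_t v;
    __asm__ volatile("csrr %0, cycleh" : "=r"(v));
    return v;
}
#else
/* Host stand-in (assumption, for testing only): a simulated counter that
 * sits one tick below a low-half rollover and advances on every low read. */
static uint64_t sim_counter = 0x00000001FFFFFFFFull;
static uint32_t cycle_hi(void) { return (uint32_t)(sim_counter >> 32); }
static uint32_t cycle_lo(void) { sim_counter += 1; return (uint32_t)sim_counter; }
#endif

static uint64_t read_cycle64(void)
{
    return read_split_counter(cycle_hi, cycle_lo);
}
```

With the simulated counter, the first pass sees the high half change from 1 to 2 and retries. A naive hi-then-lo read would have combined the stale high half with the rolled-over low half.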
Hardware Event Counter in Verilog
Verilog — Accelerator performance counters (MMIO-accessible)
```verilog
module perf_counters (
    input clk, rst, clear,
    // Events from accelerator datapath
    input accel_active,   // systolic array doing useful work
    input dma_active,     // DMA transferring data
    input stall_dma,      // accel stalled waiting for DMA
    input stall_resp,     // accel stalled waiting for resp_ready
    // Readable via MMIO
    output reg [31:0] cnt_total,     // total elapsed cycles
    output reg [31:0] cnt_compute,   // cycles accel is computing
    output reg [31:0] cnt_dma,       // cycles DMA is active
    output reg [31:0] cnt_stall      // stall cycles
);
    always @(posedge clk) begin
        if (rst || clear) begin
            cnt_total   <= 0;
            cnt_compute <= 0;
            cnt_dma     <= 0;
            cnt_stall   <= 0;
        end else begin
            cnt_total <= cnt_total + 1;
            if (accel_active)            cnt_compute <= cnt_compute + 1;
            if (dma_active)              cnt_dma     <= cnt_dma + 1;
            if (stall_dma || stall_resp) cnt_stall   <= cnt_stall + 1;
        end
    end
endmodule
```
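The counters in this module are 32 bits wide, so at 1 GHz cnt_total wraps roughly every 4.29 seconds (2^32 cycles). If software takes snapshots rather than pulsing clear, plain unsigned subtraction still gives the right elapsed count as long as each counter wrapped at most once between snapshots. A small C sketch of that idea follows; perf_snapshot_t and both function names are hypothetical helpers, not part of the module or its driver.

```c
#include <assert.h>
#include <stdint.h>

/* Software copy of the four counters exposed by perf_counters. */
typedef struct {
    uint32_t total, compute, dma, stall;
} perf_snapshot_t;

/* Elapsed counts between two snapshots. Storing the difference back into
 * uint32_t reduces it modulo 2^32, so a single wrap between `start` and
 * `end` is handled without any special case. */
static perf_snapshot_t perf_snapshot_delta(perf_snapshot_t start,
                                           perf_snapshot_t end)
{
    perf_snapshot_t d;
    d.total   = end.total   - start.total;
    d.compute = end.compute - start.compute;
    d.dma     = end.dma     - start.dma;
    d.stall   = end.stall   - start.stall;
    return d;
}

/* Seconds until a 32-bit cycle counter wraps at the given clock frequency. */
static double wrap_period_s(double freq_hz)
{
    assert(freq_hz > 0.0);
    return 4294967296.0 / freq_hz;   /* 2^32 cycles */
}
```

Measurements longer than the wrap period need wider counters in the RTL or periodic sampling in software.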
Amdahl's Law & Speedup Analysis
Amdahl's Law: Speedup = 1 / ((1-P) + P/S)
P = fraction of program that is parallelisable/accelerated
S = speedup of the accelerated portion
Example: matrix multiply = 80% of total program time, accelerator = 100× faster
Speedup = 1 / (0.20 + 0.80/100) = 1 / 0.208 = 4.8×
→ Even with 100× faster compute, total speedup is only 4.8× (20% is unaccelerated)
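Key insight: optimise the bottleneck. If DMA overhead = 30% of accelerator time, only the remaining 70% (compute) gets faster when the array is sped up by S_compute:
accel_speedup = 1 / (0.30 + 0.70/S_compute)
S_compute = 10 gives 1 / 0.37 = 2.7×; S_compute → ∞ gives at most 1 / 0.30 = 3.3×
→ Improving compute beyond 10× returns diminishing benefit; DMA is the new limit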
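The arithmetic in this section can be checked with a few lines of C. This is a sketch with hypothetical function names. The DMA case is modelled the same way as the whole-program case, treating the 30% DMA share as the part that does not speed up and the 70% compute share as the part that does.

```c
#include <assert.h>

/* Amdahl's Law: overall speedup when a fraction p of the runtime
 * is made s times faster and the remaining (1 - p) is unchanged. */
static double amdahl_speedup(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0);
    assert(s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}

/* Upper bound on the overall speedup as s grows without limit. */
static double amdahl_limit(double p)
{
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

For the program-level example, amdahl_speedup(0.80, 100.0) is about 4.81 against a ceiling of amdahl_limit(0.80) = 5. For the DMA case, amdahl_speedup(0.70, 10.0) is about 2.70 against a ceiling of about 3.33, so a 10× faster array already captures most of what is available.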
Utilisation Analysis
| Metric | Formula | Healthy Range | If Outside Range → Fix |
|---|---|---|---|
| Compute utilisation | cnt_compute / cnt_total | >80% | Improve DMA prefetch, double-buffer |
| DMA utilisation | cnt_dma / cnt_total | <30% | Larger bursts, reduce transaction overhead |
| Stall fraction | cnt_stall / cnt_total | <10% | Fix backpressure, increase bandwidth |
| MAC efficiency | actual_MACs / (peak_MACs_per_cycle × cnt_total) | >85% | Reduce fill latency, increase tile size |
C — Read hardware counters and print utilisation report
```c
#include <stdint.h>
#include <stdio.h>

// ACC_BASE and MMIO_R() are assumed to come from the platform/driver header.
#define REG_CNT_TOTAL   (ACC_BASE + 0x30)
#define REG_CNT_COMPUTE (ACC_BASE + 0x34)
#define REG_CNT_DMA     (ACC_BASE + 0x38)
#define REG_CNT_STALL   (ACC_BASE + 0x3C)

void print_perf_report(int N, double freq_ghz) {
    uint32_t total   = MMIO_R(REG_CNT_TOTAL);
    uint32_t compute = MMIO_R(REG_CNT_COMPUTE);
    uint32_t dma     = MMIO_R(REG_CNT_DMA);
    uint32_t stall   = MMIO_R(REG_CNT_STALL);

    if (total == 0) {   // nothing counted yet: avoid dividing by zero
        printf("No cycles counted\n");
        return;
    }

    double time_ms  = total / (freq_ghz * 1e6);   // 1 GHz = 1e6 cycles per ms
    double gops     = 2.0 * N * N * N / 1e9;      // 2N³ ops in units of 1e9 (work, not a rate)
    double gflops   = gops / time_ms * 1000;      // 1e9 ops per second
    double peak_mac = (double)N * N;              // MACs/cycle, assuming the array is N×N like the matrix
    double mac_eff  = (gops * 1e9 / 2) / ((double)total * peak_mac) * 100;

    printf("=== Accelerator Performance Report ===\n");
    printf("Matrix: %d×%d | Time: %.3f ms | %.2f GFlops\n",
           N, N, time_ms, gflops);
    printf("Compute util: %5.1f%% | DMA util: %5.1f%% | Stall: %5.1f%%\n",
           100.0 * compute / total, 100.0 * dma / total, 100.0 * stall / total);
    printf("MAC efficiency: %.1f%% of peak\n", mac_eff);

    if (100.0 * compute / total < 70)
        printf("BOTTLENECK: Compute starved - improve DMA prefetch or double-buffer\n");
    if (100.0 * stall / total > 15)
        printf("BOTTLENECK: Stall cycles high - check resp_ready backpressure\n");
}
```
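The report above uses N for both the matrix size and the array size. When the two differ, as with a large matrix tiled onto a smaller array, MAC efficiency needs them as separate inputs. The sketch below shows one way to write that. The function name and parameters are hypothetical, and it assumes every PE can complete one MAC per cycle.

```c
#include <assert.h>
#include <stdint.h>

/* MAC efficiency in percent for an (M×K) by (K×N) matrix multiply that
 * took `cycles` cycles on an array of array_rows × array_cols PEs.
 * Assumption: each PE performs at most one MAC per cycle. */
static double mac_efficiency_pct(uint32_t M, uint32_t K, uint32_t N,
                                 uint32_t array_rows, uint32_t array_cols,
                                 uint64_t cycles)
{
    assert(cycles > 0 && array_rows > 0 && array_cols > 0);
    double actual_macs = (double)M * K * N;   /* MACs the workload needs */
    double peak_macs   = (double)array_rows * array_cols * (double)cycles;
    return 100.0 * actual_macs / peak_macs;
}
```

As an illustration with made-up numbers, a 256×256×256 multiply on a 16×16 array needs 65,536 cycles at perfect efficiency. If cnt_total reads 80,000, the function returns about 81.9%.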
Day 8 — Interview Questions
Q1. How do you measure accelerator execution time on RISC-V?
Read the cycle CSR before and after the accelerator invocation using csrr rd, cycle. The cycle counter is a free-running 64-bit counter that increments every clock cycle (it wraps after about 584 years at 1 GHz). delta_cycles = cycle_end - cycle_start gives the elapsed cycle count including all overhead: DMA, control, cache flush, and the actual accelerator compute. Divide by the clock frequency (in Hz) to get seconds. For fine-grained profiling, bracket individual phases (DMA, compute) separately. A compiler fence (asm volatile("" ::: "memory")) is needed at each read so that the compiler cannot move the measured code across the CSR read.
Q2. What is CPI and how does it relate to accelerator design?
CPI (Cycles Per Instruction) = mcycle / minstret. For a single-issue, in-order RISC-V core, the ideal CPI is 1. When the CPU issues a RoCC instruction and the accelerator asserts busy=1, the pipeline stalls: mcycle increments but minstret does not (the instruction hasn't retired). The CPI during accelerator execution is effectively (total_wall_cycles) / (few_instruction_count), which can be hundreds or thousands. This is by design, because the accelerator is doing work during these cycles. CPI is useful for measuring CPU-side overhead: if the CPU executes many instructions during the accelerator run (e.g., in the tiling loop), those should also be counted in the overhead analysis.
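A short C sketch of the CPI arithmetic described in this answer. The function names are hypothetical. The second helper rests on an explicit assumption: when not stalled, the core retires about one instruction per cycle, so the retired-instruction count approximates the cycles the CPU spent on its own work.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles per instruction over a measured interval. */
static double cpi(uint64_t cycles, uint64_t instret)
{
    assert(instret > 0);
    return (double)cycles / (double)instret;
}

/* Rough share of the interval spent executing CPU instructions.
 * ASSUMPTION: the core retires about one instruction per cycle when it
 * is not stalled, so instret approximates CPU-busy cycles. */
static double cpu_busy_fraction(uint64_t cycles, uint64_t instret)
{
    assert(cycles > 0);
    double f = (double)instret / (double)cycles;
    return f > 1.0 ? 1.0 : f;   /* clamp in case the core is superscalar */
}
```

With illustrative numbers, an interval of 12,000 cycles that retires 24 instructions has a CPI of 500, which is the expected picture when the core is mostly waiting on the accelerator. The same interval with 3,000 retired instructions has a CPI of 4 and a CPU-busy share of about 25%, which points at tiling-loop overhead.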
Q3. What is Amdahl's Law and what does it say about accelerator design?
Amdahl's Law: Speedup = 1 / ((1−P) + P/S), where P is the fraction of the program accelerated and S is the speedup of that fraction. It says that the non-accelerated portion (1−P) sets an upper bound on total speedup: even if S→∞, max speedup = 1/(1−P). For P=80% (matrix multiply is 80% of runtime), max speedup = 1/0.20 = 5×, regardless of how fast the accelerator is. Practical implication: before building a faster systolic array, first ensure the non-accelerated portion (data movement, control, pre/post-processing) is as small as possible. Often the 20% non-accelerated tail is more important to optimise than doubling the array's compute throughput.
Q4. What does "compute utilisation" mean for a systolic array, and what causes it to be low?
Compute utilisation = fraction of total cycles where the systolic array is actively performing MACs. 100% means the array never stalls. Causes of low utilisation: (1) DMA starvation — the array finishes one tile and waits for the DMA to load the next (fix: double-buffering), (2) Fill/drain latency — for small tiles (N < array_size), the 3N−1 fill+drain cycles dominate (fix: larger tiles), (3) Weight reload — changing the B matrix requires reloading all PE weights (fix: cache multiple weight sets or tile the B dimension), (4) Response backpressure — resp_ready=0 stalls the FSM. Good designs target >85% utilisation.
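The effect of double-buffering on DMA starvation can be estimated with a simple per-tile model. The sketch below is a simplification with hypothetical names. It assumes every tile has the same load and compute time, ignores the first-tile prologue and fill/drain latency, and assumes that with double-buffering the next tile's load fully overlaps the current tile's compute.

```c
#include <assert.h>

/* Utilisation when each tile is loaded, then computed, with no overlap. */
static double util_serial(double load_cycles, double compute_cycles)
{
    assert(load_cycles >= 0.0 && compute_cycles > 0.0);
    return compute_cycles / (load_cycles + compute_cycles);
}

/* Steady-state utilisation when the next tile is loaded while the current
 * one is computed: the slower of the two activities sets the tile period. */
static double util_double_buffered(double load_cycles, double compute_cycles)
{
    assert(load_cycles >= 0.0 && compute_cycles > 0.0);
    double tile_cycles = load_cycles > compute_cycles ? load_cycles
                                                      : compute_cycles;
    return compute_cycles / tile_cycles;
}
```

For example, with 400 load cycles and 600 compute cycles per tile, the serial scheme reaches 60% utilisation and the double-buffered one reaches 100%. With 900 load cycles, double-buffering only raises utilisation from 40% to about 67%, because the DMA itself has become the limit.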
Q5. What are mhpmcounters and how do you configure them?
mhpmcounter3–mhpmcounter31 are 29 configurable 64-bit hardware performance counters in the RISC-V privileged ISA (on RV32 the upper 32 bits are read through mhpmcounterNh). Each has an associated mhpmevent CSR that selects which microarchitectural event to count; the event codes are implementation-defined. For a custom accelerator, you define a set of event codes in your RTL (e.g., event 0x01 = DMA_STALL, 0x02 = RESP_STALL) and wire the event signals to the counter increment inputs. At runtime, M-mode software writes the event code to mhpmevent3 (CSR 0x323) and then reads mhpmcounter3 (CSR 0xB03). Custom accelerator events can also be exposed as MMIO registers (simpler and doesn't require M-mode privilege); the choice depends on whether the counters need to be accessible from user-space or only in privileged software.
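The write-event-then-read-counter sequence could look roughly like the following in C. This is a sketch under several assumptions. The event codes are the example values from this answer, not codes defined by any real core. The helper names are hypothetical. The RISC-V branch must run in M-mode. The other branch is only a host stand-in that keeps the values in variables so the control flow can be compiled and exercised off-target.

```c
#include <assert.h>
#include <stdint.h>

/* Example event codes from the text; real codes are implementation-defined. */
#define EVT_DMA_STALL  0x01ul
#define EVT_RESP_STALL 0x02ul

#if defined(__riscv)
/* M-mode only: accessing these CSRs from a lower privilege mode traps. */
static void hpm3_write_event(unsigned long evt)
{
    __asm__ volatile("csrw mhpmevent3, %0" : : "r"(evt));
}
static void hpm3_write_counter(unsigned long v)
{
    __asm__ volatile("csrw mhpmcounter3, %0" : : "r"(v));
}
static unsigned long hpm3_read_counter(void)
{
    unsigned long v;   /* XLEN bits: the low half only on RV32 */
    __asm__ volatile("csrr %0, mhpmcounter3" : "=r"(v));
    return v;
}
#else
/* Host stand-in (assumption, for testing only): plain variables. */
static unsigned long sim_mhpmevent3, sim_mhpmcounter3;
static void hpm3_write_event(unsigned long evt) { sim_mhpmevent3 = evt; }
static void hpm3_write_counter(unsigned long v) { sim_mhpmcounter3 = v; }
static unsigned long hpm3_read_counter(void)    { return sim_mhpmcounter3; }
#endif

/* Select the event to count and start from zero. */
static void hpm3_start(unsigned long evt)
{
    assert(evt != 0);          /* event 0 conventionally means "no event" */
    hpm3_write_event(evt);
    hpm3_write_counter(0);
}
```

A typical use would be hpm3_start(EVT_DMA_STALL) before the accelerator call and hpm3_read_counter() after it. Whether the counter also needs enabling through other CSRs depends on the core.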
Q6. How do you determine whether an accelerator is compute-bound or memory-bound?
Use the roofline model: plot measured performance (GFlops) against arithmetic intensity (ops/byte). The compute roof is the peak throughput (2 × N² ops/cycle × freq for an N×N array, counting each MAC as two ops); the memory roof is bandwidth × intensity. If measured performance ≈ compute roof → compute-bound (adding bandwidth won't help; need more PEs). If measured performance ≈ memory roof → memory-bound (adding compute won't help; need more bandwidth or better data reuse). In practice: measure compute utilisation (from perf counters). If compute utilisation < 70% and DMA is active most of the time → memory-bound. If compute utilisation > 90% but throughput is below theoretical peak → compute-bound (likely fill latency or small tiles). The fix differs entirely: memory-bound → scratchpad, double-buffering; compute-bound → larger arrays, higher clock, INT4 quantisation.
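The roofline test reduces to a min() of two roofs, which is easy to sketch in C. The function names below are hypothetical and the numbers in the usage note are illustrative, not measurements. Peak compute counts each MAC as two operations.

```c
#include <assert.h>

/* Peak compute in Gops/s for a rows×cols MAC array (1 MAC = 2 ops). */
static double peak_gops(unsigned rows, unsigned cols, double freq_ghz)
{
    assert(freq_ghz > 0.0);
    return 2.0 * rows * cols * freq_ghz;
}

/* Attainable throughput: the lower of the compute roof and the
 * memory roof (bandwidth in GB/s times intensity in ops/byte). */
static double roofline_gops(double peak, double bw_gbytes_s, double intensity)
{
    assert(peak > 0.0 && bw_gbytes_s > 0.0 && intensity > 0.0);
    double memory_roof = bw_gbytes_s * intensity;
    return memory_roof < peak ? memory_roof : peak;
}

/* 1 if the workload sits left of the ridge point (peak / bandwidth). */
static int is_memory_bound(double peak, double bw_gbytes_s, double intensity)
{
    assert(peak > 0.0 && bw_gbytes_s > 0.0 && intensity > 0.0);
    return bw_gbytes_s * intensity < peak;
}
```

Taking a 16×16 array at 1 GHz (512 Gops/s peak) and 16 GB/s of memory bandwidth as an example, the ridge point is 32 ops/byte. A kernel at 8 ops/byte is capped at 128 Gops/s and is memory-bound, while one at 64 ops/byte can reach the full 512 Gops/s and is compute-bound.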