RISC-V + Accelerator · Day 08 of 15

Performance Profiling
CSR Counters, Speedup & Bottleneck Analysis

By EcrioniX · Updated June 2026 · ~45 min read
mcycle / minstret · mhpmcounter · Speedup · Amdahl's Law · Roofline Model · Utilisation · Bottleneck

Why Measure Performance?

An accelerator that is never measured is an accelerator that is never optimised. Performance profiling answers the most important design question: "Where is the time being spent?" Is the systolic array stalled waiting for data? Is the DMA saturating the memory bus? Is the CPU overhead (tiling loop, cache flush) dominating the runtime? Without measurement, you're guessing.

Three Questions Performance Profiling Answers

1. How fast is it? Absolute throughput (GFlops, GB/s).
2. How much faster than the baseline? Speedup ratio vs CPU.
3. What's limiting it? Compute-bound, memory-bound, or control overhead?

RISC-V CSR Performance Counters

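| CSR Name | Address | Counts | Usage |
|---|---|---|---|
| mcycle | 0xB00 | Clock cycles (free-running) | Elapsed-time measurement in cycles |
| minstret | 0xB02 | Retired instructions | CPI = mcycle / minstret |
| mhpmcounter3–31 | 0xB03–0xB1F | Configurable hardware events | Cache misses, branch mispredicts, stalls |
| mhpmevent3–31 | 0x323–0x33F | Event selector for mhpmcounterN | Set to select which event to count |
| cycle (user) | 0xC00 | Read-only shadow of mcycle | User-space programs (when enabled via mcounteren) |
| instret (user) | 0xC02 | Read-only shadow of minstret | User-space programs (when enabled via mcounteren) |
| time | 0xC01 | Real-time counter (shadow of the memory-mapped mtime, typically in the CLINT) | Wall time in timer ticks; the tick frequency is platform-defined, not nanoseconds |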
C — Read RISC-V CSR cycle counter
#include <stdint.h>
#include <stdio.h>

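// Read the user-level cycle CSR using inline assembly.
// Assumes RV64: one csrr returns all 64 bits. On RV32 it returns only the
// low 32 bits (the high half is in cycleh).
// The "memory" clobber stops the compiler moving loads/stores across the read.
static inline uint64_t read_cycle(void) {
  uint64_t c;
  __asm__ volatile("csrr %0, cycle" : "=r"(c) : : "memory");
  return c;
}

// Read the user-level retired-instruction CSR (same RV64 assumption).
static inline uint64_t read_instret(void) {
  uint64_t i;
  __asm__ volatile("csrr %0, instret" : "=r"(i) : : "memory");
  return i;
}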

typedef struct { uint64_t cycles, instret; } perf_t;

static inline perf_t perf_start() {
  __asm__ volatile("" ::: "memory");  // compiler fence
  return (perf_t){ read_cycle(), read_instret() };
}

static inline perf_t perf_end(perf_t s) {
  __asm__ volatile("" ::: "memory");
  return (perf_t){ read_cycle()-s.cycles, read_instret()-s.instret };
}

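// Kernels under test. These prototypes are assumptions; match them to your own definitions.
uint32_t accel_matmul(int8_t *A, int8_t *B, int32_t *C, int N);  // returns accelerator-reported cycles
void     cpu_matmul(int8_t *A, int8_t *B, int32_t *C, int N);    // naive C triple-loop

// Usage: benchmark matrix multiply
void benchmark(int8_t *A, int8_t *B, int32_t *C_accel, int32_t *C_cpu, int N) {
  double gops = 2.0 * N * N * N / 1e9;  // total work: 2N³ ops, in units of 1e9 ops

  // Benchmark: accelerator
  perf_t s = perf_start();
  uint32_t accel_cycles = accel_matmul(A, B, C_accel, N);
  perf_t accel = perf_end(s);
  (void)accel_cycles;                   // accelerator's own count, not used in this report

  // Benchmark: CPU reference
  s = perf_start();
  cpu_matmul(A, B, C_cpu, N);
  perf_t cpu = perf_end(s);

  double clk_mhz = 1000.0;              // assume 1 GHz
  double accel_ms = accel.cycles / clk_mhz / 1e3;  // cycles / MHz = µs; / 1e3 = ms
  double cpu_ms   = cpu.cycles   / clk_mhz / 1e3;

  // Casts match %llu: uint64_t is unsigned long on RV64.
  // (1e9 ops) / ms * 1e3 = 1e9 ops per second.
  printf("N=%d | Accel: %llu cyc (%.2f ms, %.2f GFlops) | "
         "CPU: %llu cyc (%.2f ms, %.2f GFlops) | Speedup: %.1fx\n",
    N, (unsigned long long)accel.cycles, accel_ms, gops/accel_ms*1e3,
       (unsigned long long)cpu.cycles,   cpu_ms,   gops/cpu_ms*1e3,
       (double)cpu.cycles / accel.cycles);
}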
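
The read_cycle() helper above returns the full 64-bit count with a single csrr only on RV64. On RV32 the counter is split across cycle (low half) and cycleh (high half), and the usual pattern is a high-low-high read that retries if the high half changed in between. The following is a minimal sketch of that pattern, not a drop-in implementation: read_counter64, read32_fn, read_cycle64, and the fake_* stand-ins are hypothetical names introduced here so the retry logic can be exercised off-target.

```c
#include <stdint.h>

/* Hypothetical helper type: a function that reads one 32-bit half. */
typedef uint32_t (*read32_fn)(void);

/* High-low-high read: retry if the high half changed while the low half
 * was being read, which means the low half rolled over in between. */
static uint64_t read_counter64(read32_fn read_hi, read32_fn read_lo) {
  uint32_t hi, lo, hi2;
  do {
    hi  = read_hi();
    lo  = read_lo();
    hi2 = read_hi();
  } while (hi != hi2);
  return ((uint64_t)hi << 32) | lo;
}

#if defined(__riscv) && (__riscv_xlen == 32)
/* On-target readers for RV32 (cycle = low half, cycleh = high half). */
static uint32_t csr_cycle(void)  { uint32_t v; __asm__ volatile("csrr %0, cycle"  : "=r"(v)); return v; }
static uint32_t csr_cycleh(void) { uint32_t v; __asm__ volatile("csrr %0, cycleh" : "=r"(v)); return v; }
static inline uint64_t read_cycle64(void) { return read_counter64(csr_cycleh, csr_cycle); }
#endif

/* Host-side stand-in (assumption, for testing only): a fake counter whose
 * low half rolls over right after the first high-half read. */
uint64_t fake_count = 0x00000001FFFFFFFFull;
uint32_t fake_hi(void) { uint32_t v = (uint32_t)(fake_count >> 32); fake_count++; return v; }
uint32_t fake_lo(void) { return (uint32_t)fake_count; }
```

With the fake counter, a single high-then-low read would combine the stale high half (1) with the rolled-over low half (0), an error of 2^32 cycles. The retry loop detects the changed high half and reads again.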

Hardware Event Counter in Verilog

Verilog — Accelerator performance counters (MMIO-accessible)
module perf_counters (
  input        clk, rst, clear,
  // Events from accelerator datapath
  input        accel_active,    // systolic array doing useful work
  input        dma_active,      // DMA transferring data
  input        stall_dma,       // accel stalled waiting for DMA
  input        stall_resp,      // accel stalled waiting for resp_ready
  // Readable via MMIO
  output reg [31:0] cnt_total,   // total elapsed cycles
  output reg [31:0] cnt_compute, // cycles accel is computing
  output reg [31:0] cnt_dma,     // cycles DMA is active
  output reg [31:0] cnt_stall    // stall cycles
);
  always @(posedge clk) begin
    if (rst || clear) begin
      cnt_total<=0; cnt_compute<=0; cnt_dma<=0; cnt_stall<=0;
    end else begin
      cnt_total   <= cnt_total   + 1;
      if (accel_active) cnt_compute <= cnt_compute + 1;
      if (dma_active)   cnt_dma     <= cnt_dma     + 1;
      if (stall_dma || stall_resp) cnt_stall <= cnt_stall + 1;
    end
  end
endmodule
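
The counters above are 32 bits wide, so cnt_total wraps after 2^32 cycles, about 4.3 s at an assumed 1 GHz clock. If software takes before/after snapshots instead of pulsing clear, unsigned subtraction still gives the right elapsed count across one wrap. A small sketch follows; counter_delta32 and wrap_period_s are hypothetical helper names, not part of the course code.

```c
#include <stdint.h>

/* Elapsed count between two snapshots of a free-running 32-bit counter.
 * Unsigned subtraction is modulo 2^32, so a single wrap between the
 * snapshots is handled correctly; two or more wraps cannot be detected. */
static inline uint32_t counter_delta32(uint32_t start, uint32_t end) {
  return end - start;
}

/* Seconds until a 32-bit cycle counter wraps at the given clock rate. */
static inline double wrap_period_s(double clk_hz) {
  return 4294967296.0 / clk_hz;   /* 2^32 cycles */
}
```

For runs longer than the wrap period, either widen the counters in RTL or have software sample them more often than once per wrap.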

Amdahl's Law & Speedup Analysis

Amdahl's Law: Speedup = 1 / ((1-P) + P/S)
  P = fraction of program time that is parallelisable/accelerated
  S = speedup of the accelerated portion

Example: matrix multiply = 80% of total program time, accelerator = 100× faster
  Speedup = 1 / (0.20 + 0.80/100) = 1 / 0.208 ≈ 4.8×
  → Even with 100× faster compute, total speedup is only 4.8× (20% is unaccelerated)

Key insight: optimise the bottleneck. The same law applies inside the accelerator run.
If DMA overhead = 30% of accelerator time, compute is the other 70%, and speeding up compute by k× gives:
  accel_speedup = 1 / (0.30 + 0.70/k)
  k = 10  → 1 / 0.37 ≈ 2.7×
  k → ∞   → 1 / 0.30 ≈ 3.3×
  → Improving compute beyond 10× returns diminishing benefit; DMA is the new limit
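
To see how quickly the returns diminish, the formula can be evaluated over a range of accelerator speedups. This is a minimal sketch; amdahl_speedup, amdahl_limit, and amdahl_table are hypothetical helper names, and P = 0.80 is taken from the example above.

```c
#include <stdio.h>

/* Overall speedup when a fraction p of the runtime is sped up by s. */
static double amdahl_speedup(double p, double s) {
  return 1.0 / ((1.0 - p) + p / s);
}

/* Upper bound on overall speedup as s grows without limit. */
static double amdahl_limit(double p) {
  return 1.0 / (1.0 - p);
}

/* Print the overall speedup for P = 0.80 at a few accelerator speedups. */
void amdahl_table(void) {
  const double s_values[] = { 2.0, 10.0, 100.0, 1000.0 };
  for (int i = 0; i < 4; i++) {
    printf("S = %6.0fx -> overall %.2fx (limit %.2fx)\n",
           s_values[i], amdahl_speedup(0.80, s_values[i]), amdahl_limit(0.80));
  }
}
```

Calling amdahl_table() shows the overall figure approaching, but never reaching, the 1/(1-P) limit.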

Utilisation Analysis

| Metric | Formula | Healthy Range | If Out of Range → Fix |
|---|---|---|---|
| Compute utilisation | cnt_compute / cnt_total | >80% | Improve DMA prefetch, double-buffer |
| DMA utilisation | cnt_dma / cnt_total | <30% | Larger bursts, reduce transaction overhead |
| Stall fraction | cnt_stall / cnt_total | <10% | Fix backpressure, increase bandwidth |
| MAC efficiency | actual_MACs / (peak MACs per cycle × cnt_total) | >85% | Reduce fill latency, increase tile size |
C — Read hardware counters and print utilisation report
#define REG_CNT_TOTAL   (ACC_BASE + 0x30)
#define REG_CNT_COMPUTE (ACC_BASE + 0x34)
#define REG_CNT_DMA     (ACC_BASE + 0x38)
#define REG_CNT_STALL   (ACC_BASE + 0x3C)

void print_perf_report(int N, double freq_ghz) {
  uint32_t total   = MMIO_R(REG_CNT_TOTAL);
  uint32_t compute = MMIO_R(REG_CNT_COMPUTE);
  uint32_t dma     = MMIO_R(REG_CNT_DMA);
  uint32_t stall   = MMIO_R(REG_CNT_STALL);

  if (total == 0) {                             // counters cleared or never started
    printf("No cycles recorded\n");
    return;
  }

  double time_ms  = total / (freq_ghz * 1e6);   // GHz × 1e6 = cycles per ms
  double gops     = 2.0 * N * N * N / 1e9;      // total work: 2N³ ops, in units of 1e9 ops
  double gflops   = gops / time_ms * 1000;      // throughput: 1e9 ops per second
  double peak_mac = (double)N * N;              // MACs/cycle, assuming the array is N×N (same N as the matrix)
  double mac_eff  = (gops*1e9/2) / ((double)total * peak_mac) * 100;

  printf("=== Accelerator Performance Report ===\n");
  printf("Matrix: %d×%d | Time: %.3f ms | %.2f GFlops\n", N, N, time_ms, gflops);
  printf("Compute util: %5.1f%% | DMA util: %5.1f%% | Stall: %5.1f%%\n",
    100.0*compute/total, 100.0*dma/total, 100.0*stall/total);
  printf("MAC efficiency: %.1f%% of peak\n", mac_eff);

  if (100.0*compute/total < 70)
    printf("BOTTLENECK: Compute starved — improve DMA prefetch or double-buffer\n");
  if (100.0*stall/total > 15)
    printf("BOTTLENECK: Stall cycles high — check resp_ready backpressure\n");
}

Day 8 — Interview Questions

Q1. How do you measure accelerator execution time on RISC-V?
Read the cycle CSR before and after the accelerator invocation using csrr rd, cycle. The cycle counter increments every clock cycle and is a free-running 64-bit counter (wraps after 584 years at 1 GHz). Delta_cycles = cycle_end - cycle_start gives the wall-clock cycle count including all overhead: DMA, control, cache flush, and the actual accelerator compute. Divide by the clock frequency (in Hz) to get seconds. For fine-grained profiling, bracket individual phases (DMA, compute) separately. A compiler fence (asm volatile("" ::: "memory")) is needed before each read to prevent the compiler from reordering the CSR read past the measured code.
Q2. What is CPI and how does it relate to accelerator design?
CPI (Cycles Per Instruction) = mcycle / minstret. For the RISC-V CPU core, ideal CPI = 1 (in-order). When the CPU issues a RoCC instruction and the accelerator asserts busy=1, the pipeline stalls — mcycle increments but minstret does not (the instruction hasn't retired). The CPI during accelerator execution is effectively (total_wall_cycles) / (few_instruction_count), which can be hundreds or thousands. This is by design — the accelerator is doing work during these cycles. CPI is useful for measuring CPU-side overhead: if the CPU executes many instructions during the accelerator run (e.g., in the tiling loop), those should also be counted in the overhead analysis.
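
As a quick illustration of how a blocking accelerator call inflates CPI, the ratio can be computed from the cycle and instruction deltas. This is a sketch only; cpi is a hypothetical helper and the numbers in the usage note are made up for illustration.

```c
#include <stdint.h>

/* Cycles per instruction over a measured interval.
 * Returns 0.0 if no instruction retired, to avoid dividing by zero. */
static double cpi(uint64_t cycles, uint64_t instret) {
  return instret ? (double)cycles / (double)instret : 0.0;
}
```

For example, an interval of 10,000 cycles in which only 20 instructions retire gives cpi(10000, 20) = 500, while a tight CPU loop on a single-issue in-order core would sit near 1.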
Q3. What is Amdahl's Law and what does it say about accelerator design?
Amdahl's Law: Speedup = 1 / ((1−P) + P/S), where P is the fraction of the program accelerated and S is the speedup of that fraction. It says that the non-accelerated portion (1−P) sets an upper bound on total speedup: even if S→∞, max speedup = 1/(1−P). For P=80% (matrix multiply is 80% of runtime), max speedup = 1/0.20 = 5×, regardless of how fast the accelerator is. Practical implication: before building a faster systolic array, first ensure the non-accelerated portion (data movement, control, pre/post-processing) is as small as possible. Often the 20% non-accelerated tail is more important to optimise than doubling the array's compute throughput.
Q4. What does "compute utilisation" mean for a systolic array, and what causes it to be low?
Compute utilisation = fraction of total cycles where the systolic array is actively performing MACs. 100% means the array never stalls. Causes of low utilisation: (1) DMA starvation — the array finishes one tile and waits for the DMA to load the next (fix: double-buffering), (2) Fill/drain latency — for small tiles (N < array_size), the 3N−1 fill+drain cycles dominate (fix: larger tiles), (3) Weight reload — changing the B matrix requires reloading all PE weights (fix: cache multiple weight sets or tile the B dimension), (4) Response backpressure — resp_ready=0 stalls the FSM. Good designs target >85% utilisation.
Q5. What are mhpmcounters and how do you configure them?
mhpmcounter3–mhpmcounter31 are 29 configurable 64-bit hardware performance counters in the RISC-V privileged ISA; they are configured from M-mode. Each has an associated mhpmevent CSR that selects which microarchitectural event to count, and the event encodings are implementation-defined. For a custom accelerator, you define a set of event codes in your RTL (e.g., event 0x01 = DMA_STALL, 0x02 = RESP_STALL) and wire the event signals to the counter increment inputs. At runtime, write the event code to mhpmevent3 (CSR 0x323) then read mhpmcounter3 (CSR 0xB03). Custom accelerator events can also be exposed as MMIO registers (simpler, and it doesn't require M-mode privilege); the choice depends on whether the counters need to be accessible from user-space or only in privileged software.
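
The write-then-read sequence for counter 3 might look like the sketch below. It assumes M-mode code built with a GNU-style toolchain; EVT_DMA_STALL is the hypothetical event code from the answer above, and hpm3_start plus the sim_* host stand-ins are names introduced here for illustration. Some cores may also need the counter's bit cleared in mcountinhibit before it counts, which is not shown.

```c
#include <stdint.h>

#define EVT_DMA_STALL 0x01ul   /* hypothetical event code, defined by your RTL */

#if defined(__riscv)
/* M-mode only: these CSRs trap if accessed from a lower privilege level. */
static inline void write_mhpmevent3(unsigned long v)   { __asm__ volatile("csrw mhpmevent3, %0"   :: "r"(v)); }
static inline void write_mhpmcounter3(unsigned long v) { __asm__ volatile("csrw mhpmcounter3, %0" :: "r"(v)); }
static inline unsigned long read_mhpmcounter3(void) {
  unsigned long v;
  __asm__ volatile("csrr %0, mhpmcounter3" : "=r"(v));
  return v;   /* low 32 bits only on RV32; the high half is mhpmcounter3h */
}
#else
/* Host stand-in (assumption) so the control flow compiles off-target. */
static unsigned long sim_event3, sim_counter3;
static inline void write_mhpmevent3(unsigned long v)   { sim_event3 = v; }
static inline void write_mhpmcounter3(unsigned long v) { sim_counter3 = v; }
static inline unsigned long read_mhpmcounter3(void)    { return sim_counter3; }
#endif

/* Zero counter 3, then select the event it should count. */
static void hpm3_start(unsigned long event) {
  write_mhpmevent3(0);      /* event 0 means "no event": stop counting during reset */
  write_mhpmcounter3(0);
  write_mhpmevent3(event);
}
```

A typical use is hpm3_start(EVT_DMA_STALL) before the accelerator call and read_mhpmcounter3() after it.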
Q6. How do you determine whether an accelerator is compute-bound or memory-bound?
Use the roofline model: plot measured performance (GFlops) against arithmetic intensity (ops/byte). The compute roof is peak_GFlops (2 × N² MACs/cycle × freq in GHz for an N×N array, counting each MAC as two operations); the memory roof is bandwidth × intensity. If measured performance ≈ compute roof → compute-bound (adding bandwidth won't help; need more PEs). If measured performance ≈ memory roof → memory-bound (adding compute won't help; need more bandwidth or better data reuse). In practice: measure compute utilisation (from perf counters). If compute utilisation < 70% and DMA is active most of the time → memory-bound. If compute utilisation > 90% but throughput is below theoretical peak → compute-bound (likely fill latency or small tiles). The fix differs entirely: memory-bound → scratchpad, double-buffering; compute-bound → larger arrays, higher clock, INT4 quantisation.
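
The roofline test reduces to taking the lower of two numbers, which can be sketched as below. The helper names (roofline_gops, ridge_ops_per_byte, bound_kind) are hypothetical, and the figures in the usage note are illustrative assumptions rather than measurements from the course hardware.

```c
/* Attainable throughput (1e9 ops/s) under the roofline model:
 * the lower of the compute roof and bandwidth x arithmetic intensity. */
static double roofline_gops(double peak_gops, double bw_gbytes_s, double ops_per_byte) {
  double mem_roof = bw_gbytes_s * ops_per_byte;
  return mem_roof < peak_gops ? mem_roof : peak_gops;
}

/* Arithmetic intensity at which the two roofs meet (the ridge point). */
static double ridge_ops_per_byte(double peak_gops, double bw_gbytes_s) {
  return peak_gops / bw_gbytes_s;
}

/* Classify an operating point by which side of the ridge it falls on. */
static const char *bound_kind(double peak_gops, double bw_gbytes_s, double ops_per_byte) {
  return ops_per_byte < ridge_ops_per_byte(peak_gops, bw_gbytes_s)
             ? "memory-bound" : "compute-bound";
}
```

Suppose a 16×16 array at 1 GHz (peak 2 × 256 × 1 = 512 GFlops) with 8 GB/s of memory bandwidth. The ridge point is 64 ops/byte, so a kernel at 16 ops/byte is capped at 128 GFlops and is memory-bound, while one at 128 ops/byte can reach the 512 GFlops compute roof.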