The four ways to wire a custom hardware accelerator into a RISC-V processor: memory-mapped I/O (MMIO), RoCC tightly-coupled coprocessor, AXI4 loosely-coupled accelerator, and custom ISA extension, with block diagrams, latency/bandwidth tradeoffs, and real chip examples.
General-purpose processors are terrible at domain-specific compute. A RISC-V CPU running a matrix multiply spends most of its time moving data between registers and memory, managing loop counters, and executing scalar multiply-accumulate operations one at a time. A custom systolic-array accelerator running the same matrix multiply can achieve 100–1000× higher throughput by doing thousands of multiplications in parallel and reusing operands inside the array, which removes most of the memory traffic.
But a pure accelerator without a general-purpose processor is inflexible — you cannot easily change the algorithm, handle exceptions, run control logic, or communicate over USB/UART without substantial additional hardware. RISC-V + Accelerator gives you the best of both worlds: a flexible, programmable CPU for control and irregular compute, and a hardwired engine for the performance-critical innermost loop.
Accelerator runs 100–1000× faster than CPU for its target computation (matrix multiply, FFT, AES, SHA)
RISC-V CPU handles control flow, OS, communication, and anything the accelerator can't do
Hardwired datapath eliminates fetch/decode overhead — 10–50× better TOPS/W vs CPU execution
Small accelerator (3–5% of die) can handle 80%+ of runtime for targeted workloads
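How much the whole system gains from these figures depends on the fraction of runtime that is actually offloaded. The sketch below applies Amdahl's law to that question; the function name and the example numbers are illustrative assumptions, not measurements from any chip.

```c
/* Amdahl's law: overall speedup when a fraction f of the original runtime
 * is accelerated by a factor s and the remaining (1 - f) is unchanged.
 * Hypothetical helper for illustration only. */
static double amdahl_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}
```

With f = 0.80 and s = 100, the result is about 4.8×: the 20% left on the CPU dominates, which is why the offloaded fraction matters as much as the raw accelerator speedup.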
Google's first-generation TPU pairs a 256×256 systolic array of 8-bit MAC units with a host CPU: that single accelerator block delivers 92 TOPS peak while the host provides programmability. The same principle applies to RISC-V SoCs: a small, well-targeted accelerator transforms the capability of the whole system for a fraction of the die area.
There are four fundamental ways to connect a hardware accelerator to a RISC-V processor. The choice depends on your latency requirement, data bandwidth need, software complexity budget, and which RISC-V core you're targeting.
MMIO is the simplest integration model and the one that works with any RISC-V core — or any processor at all. The accelerator appears as a peripheral at a specific address range in the CPU's address space. The CPU configures the accelerator by writing to its control registers, triggers a computation by writing to a start register, and polls a status register or waits for an interrupt to know when results are ready.
```systemverilog
module mmio_accel_ctrl #(
    parameter DATA_W = 32,
    parameter ADDR_W = 8
)(
    input  logic              clk, rst_n,
    // Simple bus (expand to AXI4-Lite in Day 6); addr is a byte address
    input  logic [ADDR_W-1:0] addr,
    input  logic [DATA_W-1:0] wdata,
    input  logic              wen, ren,   // ren unused: reads have no side effects
    output logic [DATA_W-1:0] rdata,
    // To accelerator datapath
    output logic              start,
    output logic [31:0]       src_addr, dst_addr, length,
    input  logic              busy, done, error
);
    logic [DATA_W-1:0] ctrl_reg, src_reg, dst_reg, len_reg;

    // Registers are 32-bit words, so addr[4:2] is the word index:
    // 0x00 CTRL, 0x04 STATUS (read-only), 0x08 SRC, 0x0C DST, 0x10 LEN

    // Write path
    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            ctrl_reg <= '0; src_reg <= '0; dst_reg <= '0; len_reg <= '0;
        end else begin
            ctrl_reg[0] <= 1'b0; // auto-clear start bit (one-cycle pulse)
            if (wen) begin
                case (addr[4:2])
                    3'h0: ctrl_reg <= wdata; // overrides the auto-clear this cycle
                    3'h2: src_reg  <= wdata;
                    3'h3: dst_reg  <= wdata;
                    3'h4: len_reg  <= wdata;
                    default: ; // STATUS and unmapped offsets ignore writes
                endcase
            end
        end
    end

    // Read path
    always_comb begin
        case (addr[4:2])
            3'h0:    rdata = ctrl_reg;
            3'h1:    rdata = {29'b0, error, done, busy}; // STATUS
            3'h2:    rdata = src_reg;
            3'h3:    rdata = dst_reg;
            3'h4:    rdata = len_reg;
            default: rdata = '0;
        endcase
    end

    assign start    = ctrl_reg[0];
    assign src_addr = src_reg;
    assign dst_addr = dst_reg;
    assign length   = len_reg;
endmodule
```

```c
#include <stdint.h>

#define ACCEL_BASE  0x10000000UL
#define CTRL_REG    (*(volatile uint32_t*)(ACCEL_BASE + 0x00))
#define STATUS_REG  (*(volatile uint32_t*)(ACCEL_BASE + 0x04))
#define SRC_REG     (*(volatile uint32_t*)(ACCEL_BASE + 0x08))
#define DST_REG     (*(volatile uint32_t*)(ACCEL_BASE + 0x0C))
#define LEN_REG     (*(volatile uint32_t*)(ACCEL_BASE + 0x10))
#define STATUS_BUSY (1u << 0)
#define STATUS_DONE (1u << 1)
#define CTRL_START  (1u << 0)

void accel_run(uint32_t *src, uint32_t *dst, uint32_t len) {
    // The accelerator takes 32-bit addresses; casting via uintptr_t keeps
    // the conversion well-formed when pointers are 64 bits wide (RV64).
    SRC_REG  = (uint32_t)(uintptr_t)src;
    DST_REG  = (uint32_t)(uintptr_t)dst;
    LEN_REG  = len;
    CTRL_REG = CTRL_START;                 // kick off
    while (!(STATUS_REG & STATUS_DONE))
        ;                                  // poll until done
}
```

RoCC (Rocket Custom Coprocessor) is a protocol defined by the UC Berkeley Rocket chip generator for attaching custom accelerators inside the CPU pipeline. The key insight: custom RISC-V instructions (using the custom-0 to custom-3 opcode space) are decoded by the CPU and dispatched to the RoCC accelerator alongside the register file values. There is no memory access and no bus transaction, just direct register-level operand passing.
```systemverilog
// Minimal RoCC-style coprocessor: receives a custom-0 instruction, returns a result.
// Simplified for illustration: the full RoCC interface uses the funct3 field for
// the xd/xs1/xs2 flags and also has memory and interrupt channels.
module rocc_accel (
    input  logic        clk, rst_n,
    // RoCC command channel (from CPU)
    input  logic        cmd_valid,
    output logic        cmd_ready,
    input  logic [6:0]  cmd_inst_funct7,
    input  logic [2:0]  cmd_inst_funct3,
    input  logic [4:0]  cmd_inst_rd,
    input  logic [63:0] cmd_rs1, cmd_rs2,
    // RoCC response channel (to CPU)
    output logic        resp_valid,
    input  logic        resp_ready,
    output logic [4:0]  resp_rd,
    output logic [63:0] resp_data
);
    // Simple single-cycle MAC: result = rs1 * rs2 + accumulator
    logic [63:0] accumulator, mac_result;
    assign mac_result = cmd_rs1 * cmd_rs2 + accumulator;

    // Accept a command only when the response register is free or being drained,
    // so a pending response is never overwritten
    assign cmd_ready = !resp_valid || resp_ready;

    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            accumulator <= '0;
            resp_valid  <= 1'b0;
            resp_rd     <= '0;
            resp_data   <= '0;
        end else if (cmd_valid && cmd_ready) begin
            case ({cmd_inst_funct7, cmd_inst_funct3})
                10'd0: begin accumulator <= mac_result; resp_data <= mac_result; end // MAC
                10'd1: begin accumulator <= '0;         resp_data <= '0;         end // CLEAR
                default: resp_data <= accumulator; // unknown op: return current value
            endcase
            resp_rd    <= cmd_inst_rd;
            resp_valid <= 1'b1;
        end else if (resp_valid && resp_ready) begin
            resp_valid <= 1'b0;
        end
    end
endmodule
```

AXI4 is the industry-standard on-chip bus protocol from Arm, part of the AMBA specification family. An AXI4-connected accelerator typically exposes two ports: an AXI4-Lite slave port through which the CPU configures it (simple register access), and an AXI4 master port through which it DMA-reads input data from memory and DMA-writes results back. This is the most widely used model in production SoCs because it works with any CPU, uses standard tools, and separates the compute from the data movement.
AXI4-Lite: Single-beat transfers only (no burst). Used for control registers: simple, small, and synthesizes to minimal logic.
Full AXI4: Supports bursts up to 256 beats, multiple outstanding transactions, exclusive access. Used for data DMA where bandwidth matters: a 256-beat burst at 128-bit width = 4 KB per transaction at peak bus bandwidth.
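The burst arithmetic above can be sketched in C as a DMA engine's driver or firmware might compute it. The helper names are hypothetical. The only protocol facts relied on are the 256-beat limit for incrementing bursts and the AXI rule that a burst must not cross a 4 KB address boundary. The start address is assumed to be aligned to the beat size.

```c
#include <stdint.h>

#define AXI_MAX_BEATS 256u   /* AXI4 INCR burst length limit */
#define AXI_BOUNDARY  4096u  /* a burst must not cross a 4 KB boundary */

/* Bytes moved by one burst of `beats` beats on a bus `width_bits` wide. */
static uint32_t axi_burst_bytes(uint32_t beats, uint32_t width_bits) {
    return beats * (width_bits / 8u);
}

/* Number of beats for the next burst starting at `addr` (assumed
 * beat-aligned) with `bytes_left` still to transfer. The burst is cut
 * short at the 4 KB boundary and capped at 256 beats. */
static uint32_t axi_next_burst_beats(uint64_t addr, uint64_t bytes_left,
                                     uint32_t width_bits) {
    uint32_t beat_bytes  = width_bits / 8u;
    uint64_t to_boundary = AXI_BOUNDARY - (addr % AXI_BOUNDARY);
    uint64_t bytes       = bytes_left < to_boundary ? bytes_left : to_boundary;
    uint64_t beats       = (bytes + beat_bytes - 1u) / beat_bytes; /* round up */
    if (beats > AXI_MAX_BEATS)
        beats = AXI_MAX_BEATS;
    return (uint32_t)beats;
}
```

For a 128-bit bus, `axi_burst_bytes(256, 128)` gives the 4096 bytes quoted above, while a transfer that starts 256 bytes below a 4 KB boundary gets a first burst of only 16 beats.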
RISC-V explicitly reserves four major opcodes for user-defined custom instructions: custom-0 (0x0B), custom-1 (0x2B), custom-2 (0x5B), and custom-3 (0x7B). The specification commits to keeping standard extensions out of these opcodes, which makes them safe for custom use. The one caveat is that custom-2 and custom-3 are also earmarked for a possible future RV128, so they are only guaranteed free on RV32 and RV64.
Full compiler support for custom instructions means modifying GCC or LLVM to recognise the new opcodes and generate them from C intrinsics. For prototyping that is unnecessary: the GNU assembler's .insn directive emits arbitrary instruction encodings from inline assembly without any toolchain patches. Production custom ISA extensions typically add compiler intrinsics (like __builtin_riscv_my_accel(a, b)) mapped to the custom instruction encoding.
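To make the opcode values concrete, here is a minimal sketch that packs an R-type instruction word for one of the custom opcodes. The field positions follow the standard RISC-V R-type layout. The enum and function names are illustrative choices, not part of any toolchain.

```c
#include <stdint.h>

/* Major opcodes of the four custom slots (7-bit values, low bits = 11). */
enum { CUSTOM_0 = 0x0B, CUSTOM_1 = 0x2B, CUSTOM_2 = 0x5B, CUSTOM_3 = 0x7B };

/* Pack an R-type instruction:
 *   funct7[31:25] rs2[24:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0] */
static uint32_t encode_rtype(uint32_t opcode, uint32_t rd, uint32_t funct3,
                             uint32_t rs1, uint32_t rs2, uint32_t funct7) {
    return ((funct7 & 0x7Fu) << 25) |
           ((rs2    & 0x1Fu) << 20) |
           ((rs1    & 0x1Fu) << 15) |
           ((funct3 & 0x07u) << 12) |
           ((rd     & 0x1Fu) <<  7) |
            (opcode & 0x7Fu);
}
```

For example, `encode_rtype(CUSTOM_0, 10, 0, 11, 12, 0)` targets rd = a0 (x10), rs1 = a1 (x11), rs2 = a2 (x12) and yields 0x00C5850B, which differs from `add a0, a1, a2` (0x00C58533) only in the opcode field.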
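A minimal sketch of the `.insn` approach, assuming a hypothetical MAC instruction on custom-0 with funct3 = 0 and funct7 = 0 that returns the updated accumulator in rd. The wrapper name `accel_mac` and those semantics are assumptions for illustration. The non-RISC-V branch is only a software stand-in so the same code can be exercised on a host machine.

```c
/* Wrapper around a hypothetical custom-0 MAC instruction.
 * Assumed semantics: acc += a * b; rd = acc. */
static inline unsigned long accel_mac(unsigned long a, unsigned long b) {
#if defined(__riscv)
    unsigned long rd;
    /* GNU as syntax: .insn r opcode, funct3, funct7, rd, rs1, rs2 */
    __asm__ volatile (".insn r 0x0b, 0x0, 0x0, %0, %1, %2"
                      : "=r"(rd)
                      : "r"(a), "r"(b));
    return rd;
#else
    /* Host fallback: software model of the assumed instruction. */
    static unsigned long acc;
    acc += a * b;
    return acc;
#endif
}
```

Because the operands are ordinary C values with "r" constraints, the compiler still handles register allocation. Only the instruction encoding itself is supplied by hand.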
| Property | MMIO | RoCC | AXI4 | Custom ISA |
|---|---|---|---|---|
| Invoke latency | 100s of cycles (store + poll) | 1–10 cycles | 50–500 cycles | 1–5 cycles |
| Data bandwidth | Memory-limited | 2 × 64-bit/cycle (rs1, rs2) | Full AXI burst (GB/s) | 2 × 64-bit/cycle |
| CPU compatibility | Any CPU | Rocket/Chipyard only | Any CPU with AXI4 | Custom pipeline only |
| HW complexity | Low (simple bus) | Medium (RoCC protocol) | Medium–High (AXI4) | High (pipeline modification) |
| SW complexity | Low (C MMIO driver) | Medium (custom asm) | Low–Medium (driver + DMA) | High (toolchain) |
| Interrupt support | Yes (standard IRQ) | Yes (via RoCC channel) | Yes (standard IRQ) | N/A (synchronous) |
| DMA support | External DMA needed | Built-in L1 cache access | Built-in AXI4 master | No (register operands) |
| Best for | Large data blocks, any core | Low-latency scalar ops | High-BW, standard SoC | New compute primitives |
| Real example | Most FPGA SoCs | Hwacha vector (Berkeley) | Xilinx Vitis AI DPU | RISC-V V extension (vectors) |
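One way to read the invoke-latency row is as a fixed cost that must be amortized over the work offloaded. The sketch below computes the smallest job size at which offloading wins under a simple linear cost model. The model, the function name, and the cycle counts in the usage note are illustrative assumptions.

```c
#include <stdint.h>

/* Smallest element count n for which
 *     invoke_cycles + n * accel_cpe  <  n * cpu_cpe
 * where *_cpe is cycles per element. Returns 0 if offloading never wins. */
static uint64_t offload_break_even(double invoke_cycles,
                                   double cpu_cpe, double accel_cpe) {
    if (accel_cpe >= cpu_cpe)
        return 0;
    return (uint64_t)(invoke_cycles / (cpu_cpe - accel_cpe)) + 1;
}
```

Assuming 4 cycles per element on the CPU and 0.5 on the accelerator, a 200-cycle MMIO invocation breaks even at 58 elements, while a 5-cycle tightly-coupled invocation breaks even at 2. That is the reasoning behind pairing MMIO and AXI4 with large data blocks and RoCC or custom instructions with fine-grained operations.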
| System | RISC-V Core | Accelerator | Interface | Application |
|---|---|---|---|---|
| Hwacha | Rocket (Berkeley) | Vector accelerator | RoCC | HPC / scientific compute |
| Gemmini | Rocket + Chipyard | Systolic array (16×16) | RoCC + DMA | Deep learning inference (Berkeley) |
| CVA6 + NLP Accel | CVA6 (ETH Zurich) | NLP/BERT accelerator | AXI4 | Edge NLP inference |
| SiFive X280 | SiFive X280 | Custom vector extensions | Custom ISA (V-ext) | ML / DSP |
| Ara (ETH Zurich) | CVA6 | Vector coprocessor (4–16 lanes) | AXI4 + custom ISA | SIMD / vector compute |
| OpenTitan | Ibex (RV32) | AES / SHA / HMAC engines | MMIO (TL-UL bus) | Security / cryptography |
| CHIPS Alliance SoC | Rocket / BOOM | Custom DSP / ML blocks | RoCC + AXI4 | Open silicon research |
Gemmini from UC Berkeley is the closest open-source example to what we'll build in this course. It's a parametric systolic array generator that integrates via RoCC into the Rocket chip — configurable tile size, datatype (INT8/FP32), scratchpad size, and DMA bandwidth. It ships with a runtime library and can run full DNN inference. We'll use Gemmini's architecture as inspiration for our own Day 4–5 systolic array design.
| Day | Topic | What You Build |
|---|---|---|
| 1 | Architecture Overview (this page) | Understanding of all 4 models + which to use when |
| 2 | Custom ISA Extension | custom-0 instruction + GCC .insn assembler usage |
| 3 | RoCC Interface Deep Dive | RoCC MAC coprocessor in Verilog + protocol testbench |
| 4 | Systolic Array Design | 8×8 weight-stationary systolic array + testbench |
| 5 | Systolic Array via RoCC | Wire systolic array to RoCC + custom instruction test |
| 6 | AXI4-Lite MMIO Accelerator | Complete AXI4-Lite slave with CSR file |
| 7 | DMA Engine | AXI4 master DMA with burst + scatter-gather |
| 8 | INT8 NN Inference Engine | Quantized MAC array with ReLU and bias |
| 9 | Bare-Metal C Driver | Complete inference API in C for the NN engine |
| 10 | Full RISC-V SoC Integration | Top-level SoC with AXI4 crossbar + all components |
| 11 | Performance Optimization | Tiling large matrices, double-buffering |
| 12 | Verification | SVA properties on AXI4 + formal RoCC protocol check |
| 13 | FPGA Implementation | Bitstream for Arty A7 running real inference |
| 14 | Physical Design | Floorplan + power domains for the full SoC |
| 15 | Capstone | End-to-end RISC-V AI SoC with benchmarks |
| # | Question | Key Answer Points |
|---|---|---|
| 1 | What is the RoCC interface and what are its advantages? | Tightly-coupled coprocessor interface in Rocket chip. Passes CPU register values directly to accelerator via custom instructions. No memory roundtrip → 1–10 cycle latency. Disadvantage: specific to the Rocket/Chipyard ecosystem (Rocket and BOOM), not portable to other cores such as CVA6 or Ibex without reimplementation. |
| 2 | Why would you choose AXI4 over RoCC for an accelerator? | AXI4 is core-agnostic (works with any CPU), supports high-bandwidth DMA bursts for large data, is industry-standard (tools, IP, verification), and scales to complex multi-master SoCs. RoCC is better when latency of individual operations is the constraint and you're in the Rocket ecosystem. |
| 3 | What RISC-V opcodes can you use for custom instructions? | custom-0 (0x0B), custom-1 (0x2B), custom-2 (0x5B), custom-3 (0x7B): reserved for custom use and avoided by standard RISC-V extensions (custom-2 and custom-3 are additionally earmarked for a future RV128). Any of the R/I/S/U formats can be used; with R-type, funct3 + funct7 give 1024 sub-operations per opcode, or 4096 custom instruction variants in total. |
| 4 | What is a systolic array and why is it good for matrix multiply? | A 2D grid of Processing Elements (PEs) where data flows in a pipelined wave — each PE does one MAC (a×b+c) and passes results to its neighbor. For matrix multiply, inputs stream in from left (activations) and top (weights), and partial sums accumulate through the array. Every PE operates every cycle → compute efficiency near 100%, unlike a CPU which loads/stores between MACs. |
| 5 | What is Gemmini? | Open-source parametric systolic array generator from UC Berkeley that integrates with Rocket chip via RoCC. Configurable tile size, INT8/FP32, DMA, scratchpad. Used for research and production RISC-V AI SoC prototyping. Inspiration for what we build in this course. |
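The systolic-array answer in question 4 can be made concrete with a small cycle-level software model. This is a sketch of the output-stationary variant (each PE keeps its own accumulator), chosen because it is the shortest to model; it is not the weight-stationary design built later in the course. The array size `N`, the `pe_t` type, and the function name are all assumptions for illustration.

```c
#include <stdint.h>
#include <string.h>

#define N 4  /* array is N x N PEs; illustrative size */

/* One processing element: the operands passing through it this cycle
 * and its locally held partial sum. */
typedef struct { int32_t a, b, acc; } pe_t;

/* Cycle-level model: rows of A enter from the left, columns of B from
 * the top, each skewed by one cycle per row/column. PE(i,j) therefore
 * sees A[i][k] and B[k][j] together at cycle t = i + j + k. */
static void systolic_matmul(int32_t A[N][N], int32_t B[N][N], int32_t C[N][N]) {
    pe_t pe[N][N];
    memset(pe, 0, sizeof pe);

    /* The last operand pair reaches PE(N-1,N-1) at cycle 3N-3. */
    for (int t = 0; t < 3 * N - 2; t++) {
        /* Sweep from bottom-right so every PE reads the value its left
         * and upper neighbours held in the previous cycle. */
        for (int i = N - 1; i >= 0; i--) {
            for (int j = N - 1; j >= 0; j--) {
                int ka = t - i;  /* element of row i entering at the left edge */
                int kb = t - j;  /* element of column j entering at the top edge */
                int32_t a_in = (j > 0) ? pe[i][j - 1].a
                             : (ka >= 0 && ka < N) ? A[i][ka] : 0;
                int32_t b_in = (i > 0) ? pe[i - 1][j].b
                             : (kb >= 0 && kb < N) ? B[kb][j] : 0;
                pe[i][j].acc += a_in * b_in;  /* one MAC per PE per cycle */
                pe[i][j].a = a_in;            /* pass activation to the right */
                pe[i][j].b = b_in;            /* pass weight downward */
            }
        }
    }

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = pe[i][j].acc;
}
```

The model finishes an N×N multiply in 3N−2 cycles using N² MACs per cycle, against N³ sequential MACs on a scalar CPU. Operands are fetched only at the array edges and then reused as they move from PE to PE.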