What is the RoCC interface in RISC-V?

RoCC (Rocket Custom Coprocessor) is a standard tightly-coupled interface for attaching custom accelerators directly to a RISC-V pipeline. Custom instructions pass operands (CPU register values) directly to the coprocessor over the RoCC command channel, and results return directly to CPU registers — no memory access required. Latency is 1–10 cycles for simple operations. The interface uses valid/ready handshaking and is defined by the Rocket chip generator from UC Berkeley.

What is the difference between RoCC and AXI4 for accelerator integration?

RoCC is tightly coupled: the accelerator is inside the CPU pipeline, receives register operands directly via custom instructions, and writes results back to registers. Latency is very low (1–10 cycles) but the interface is complex and tied to the Rocket core ecosystem. AXI4 memory-mapped is loosely coupled: the CPU writes parameters to MMIO registers, the accelerator reads/writes from shared memory via DMA, and the CPU polls or waits for an interrupt. Latency is higher (100s of cycles for setup) but the interface is standard and works with any CPU.

Can I add custom instructions to RISC-V?

Yes. The RISC-V ISA specification sets aside 4 major opcodes for custom instructions: custom-0 (opcode 0x0B), custom-1 (0x2B), custom-2 (0x5B), and custom-3 (0x7B). custom-0 and custom-1 are reserved for custom use; custom-2 and custom-3 are reserved for possible RV128 use but are otherwise avoided by standard extensions, so on RV32 and RV64 you can encode custom operations in all four. You decode them in the CPU pipeline (or via the RoCC interface) and implement the computation in hardware. To generate these instructions from C code, use inline assembly with the assembler's .insn directive, or add compiler builtins to the GCC/LLVM toolchain.

RISC-V + Accelerator Architecture — 4 Ways to Add a Hardware Accelerator

1. Why RISC-V + Custom Accelerators?

General-purpose processors are terrible at domain-specific compute. A RISC-V CPU running a matrix multiply spends most of its time moving data between registers and memory, managing loop counters, and executing scalar multiply-accumulate operations one at a time. A custom systolic array accelerator running the same matrix multiply can achieve 100–1000× higher throughput by doing thousands of multiplications in parallel and reusing operands inside the array, so far less time is lost to memory traffic.

But a pure accelerator without a general-purpose processor is inflexible — you cannot easily change the algorithm, handle exceptions, run control logic, or communicate over USB/UART without substantial additional hardware. RISC-V + Accelerator gives you the best of both worlds: a flexible, programmable CPU for control and irregular compute, and a hardwired engine for the performance-critical innermost loop.

Performance

Accelerator runs 100–1000× faster than CPU for its target computation (matrix multiply, FFT, AES, SHA)

Flexibility

RISC-V CPU handles control flow, OS, communication, and anything the accelerator can't do

Power

Hardwired datapath eliminates fetch/decode overhead — 10–50× better TOPS/W vs CPU execution

Area

Small accelerator (3–5% of die) can handle 80%+ of runtime for targeted workloads

Real-World Impact

Google's first-generation TPU pairs a 256×256 systolic array with a standard host CPU: that single accelerator block delivers a peak of 92 TOPS (8-bit) while the host CPU provides programmability. The same principle applies to RISC-V SoCs: a small, well-targeted accelerator transforms the capability of the whole system for a fraction of the die area.

2. The 4 Integration Models — Overview

There are four fundamental ways to connect a hardware accelerator to a RISC-V processor. The choice depends on your latency requirement, data bandwidth need, software complexity budget, and which RISC-V core you're targeting.

Fig 1: The 4 RISC-V accelerator integration models. MMIO and AXI4 are loosely coupled and work with any CPU. RoCC and Custom ISA are tightly coupled, offering lower latency at the cost of CPU-specific implementation.

3. Model 1 — Memory-Mapped I/O (MMIO)

MMIO is the simplest integration model and the one that works with any RISC-V core — or any processor at all. The accelerator appears as a peripheral at a specific address range in the CPU's address space. The CPU configures the accelerator by writing to its control registers, triggers a computation by writing to a start register, and polls a status register or waits for an interrupt to know when results are ready.

How MMIO Works

Address map example:

BASE_ADDR = 0x1000_0000
BASE + 0x00 → CTRL     (write: bit[0]=start, bit[1]=reset)
BASE + 0x04 → STATUS   (read: bit[0]=busy, bit[1]=done, bit[2]=error)
BASE + 0x08 → SRC_ADDR (pointer to input data in SRAM)
BASE + 0x0C → DST_ADDR (pointer to output buffer in SRAM)
BASE + 0x10 → LENGTH   (number of elements)
BASE + 0x14 → RESULT   (scalar result if small enough to fit a register)

CPU workflow:

1. STORE src_addr → BASE+0x08
2. STORE dst_addr → BASE+0x0C
3. STORE length → BASE+0x10
4. STORE 0x1 → BASE+0x00 (start bit)
5. LOOP: LOAD BASE+0x04 until bit[1]=done (or wait for IRQ)
6. LOAD BASE+0x14 or read from dst_addr

Verilog — MMIO accelerator control register block

module mmio_accel_ctrl #(
  parameter DATA_W = 32,
  parameter ADDR_W = 8
)(
  input  logic        clk, rst_n,
  // Simple bus (expand to AXI4-Lite in Day 6)
  input  logic [ADDR_W-1:0] addr,
  input  logic [DATA_W-1:0] wdata,
  input  logic        wen, ren,
  output logic [DATA_W-1:0] rdata,
  // To accelerator datapath
  output logic        start,
  output logic [31:0] src_addr, dst_addr, length,
  input  logic        busy, done, error
);

  logic [DATA_W-1:0] ctrl_reg, src_reg, dst_reg, len_reg;

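  // Write path. Registers are word-addressed: addr[4:2] selects the
  // 32-bit register, matching byte offsets 0x00, 0x08, 0x0C, 0x10.
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      ctrl_reg <= '0; src_reg <= '0; dst_reg <= '0; len_reg <= '0;
    end else begin
      ctrl_reg[0] <= 1'b0; // start is a one-cycle pulse; auto-clears
      if (wen) begin
        case (addr[4:2])
          3'h0: ctrl_reg <= wdata; // overrides the auto-clear for this cycle
          3'h2: src_reg  <= wdata;
          3'h3: dst_reg  <= wdata;
          3'h4: len_reg  <= wdata;
          default: ;               // STATUS is read-only; other offsets ignored
        endcase
      end
    end
  end

  // Read path (same word-index decode as the write path)
  always_comb begin
    case (addr[4:2])
      3'h0: rdata = ctrl_reg;
      3'h1: rdata = {29'b0, error, done, busy}; // STATUS: bit0=busy, bit1=done, bit2=error
      3'h2: rdata = src_reg;
      3'h3: rdata = dst_reg;
      3'h4: rdata = len_reg;
      default: rdata = '0;
    endcase
  end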

  assign start    = ctrl_reg[0];
  assign src_addr = src_reg;
  assign dst_addr = dst_reg;
  assign length   = len_reg;

endmodule

C — bare-metal MMIO driver

#include <stdint.h>

#define ACCEL_BASE   0x10000000UL
#define CTRL_REG     (*(volatile uint32_t*)(ACCEL_BASE + 0x00))
#define STATUS_REG   (*(volatile uint32_t*)(ACCEL_BASE + 0x04))
#define SRC_REG      (*(volatile uint32_t*)(ACCEL_BASE + 0x08))
#define DST_REG      (*(volatile uint32_t*)(ACCEL_BASE + 0x0C))
#define LEN_REG      (*(volatile uint32_t*)(ACCEL_BASE + 0x10))

#define STATUS_BUSY  (1u << 0)
#define STATUS_DONE  (1u << 1)
#define CTRL_START   (1u << 0)

// Program the accelerator, start it, and busy-wait for completion.
// The address registers are 32 bits wide, so on a 64-bit core the
// buffers must live in the low 4 GB of the address space.
void accel_run(const uint32_t *src, uint32_t *dst, uint32_t len) {
    SRC_REG  = (uint32_t)(uintptr_t)src;
    DST_REG  = (uint32_t)(uintptr_t)dst;
    LEN_REG  = len;
    CTRL_REG = CTRL_START;           // kick off
    while (!(STATUS_REG & STATUS_DONE)) {
        // poll until done
    }
}
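The polling loop above never gives up and ignores the error bit. As a rough sketch of a more defensive variant, the same register map can be described as a struct and polled with a bound. The names accel_regs_t, accel_run_timeout, and the ACCEL_* return codes are hypothetical and introduced only for illustration; the offsets and status bits come from the address map in this section.

```c
#include <stddef.h>
#include <stdint.h>

/* Register layout from the address map above (offsets in comments). */
typedef struct {
    volatile uint32_t ctrl;     /* 0x00: bit0 = start, bit1 = reset          */
    volatile uint32_t status;   /* 0x04: bit0 = busy, bit1 = done, bit2 = error */
    volatile uint32_t src_addr; /* 0x08 */
    volatile uint32_t dst_addr; /* 0x0C */
    volatile uint32_t length;   /* 0x10 */
    volatile uint32_t result;   /* 0x14 */
} accel_regs_t;

/* Catch layout mistakes at compile time. */
_Static_assert(offsetof(accel_regs_t, result) == 0x14, "register map mismatch");

/* Hypothetical return codes for this sketch. */
enum { ACCEL_OK = 0, ACCEL_ERR = -1, ACCEL_TIMEOUT = -2 };

/* Program the block, start it, and poll STATUS at most max_polls times. */
int accel_run_timeout(accel_regs_t *regs, uint32_t src, uint32_t dst,
                      uint32_t len, uint32_t max_polls)
{
    regs->src_addr = src;
    regs->dst_addr = dst;
    regs->length   = len;
    regs->ctrl     = 1u << 0;                  /* start bit */

    for (uint32_t i = 0; i < max_polls; i++) {
        uint32_t st = regs->status;
        if (st & (1u << 2)) return ACCEL_ERR;  /* error bit */
        if (st & (1u << 1)) return ACCEL_OK;   /* done bit  */
    }
    return ACCEL_TIMEOUT;
}
```

On hardware, regs would point at the accelerator base address (0x10000000 in this example). Passing a pointer rather than hard-coding the base also lets the driver be exercised on a host against an ordinary struct in RAM.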

4. Model 2 — RoCC Tightly-Coupled Coprocessor

RoCC (Rocket Custom Coprocessor) is a protocol defined by the UC Berkeley Rocket chip generator for attaching custom accelerators inside the CPU pipeline. The key insight: custom RISC-V instructions (using the custom-0 to custom-3 opcode space) are decoded by the CPU and dispatched to the RoCC accelerator alongside the register file values — no memory access, no bus transaction, just direct register-level operand passing.

RoCC Command Channel

RoCC custom instruction encoding (R-type, custom-0 opcode 0x0B):

 31     25 24   20 19   15 14   12 11    7 6       0
[ funct7 ][ rs2  ][ rs1  ][funct3][  rd  ][ opcode ]
[ 7 bits ][5 bits][5 bits][3 bits][5 bits][  0x0B  ]

funct7   = selects the custom operation (up to 128 per opcode)
funct3   = in Rocket's RoCC format these three bits are the flags xd (bit 14), xs1 (bit 13), and xs2 (bit 12): whether rd is written and whether rs1/rs2 are read. A plain custom ISA extension is free to use them as a normal funct3 instead.
rs1, rs2 = source register indices (CPU sends their VALUES to RoCC)
rd       = destination register index (RoCC writes result here)

RoCC command channel signals:

cmd.valid     → CPU sending instruction to RoCC
cmd.ready     ← RoCC can accept instruction
cmd.bits.inst → the encoded instruction word
cmd.bits.rs1  → value of register rs1
cmd.bits.rs2  → value of register rs2

RoCC response channel:

resp.valid     ← RoCC has a result ready
resp.ready     → CPU can accept result
resp.bits.rd   ← destination register index
resp.bits.data ← result value to write to CPU register file

Fig 2: RoCC tightly-coupled interface. Custom RISC-V instructions pass register values directly to the coprocessor. Results write back to CPU registers with no memory roundtrip — 1–10 cycle latency.
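To make the field layout concrete, here is a small C sketch that splits a 32-bit instruction word into the fields a RoCC accelerator sees. It follows Rocket's RoCC instruction format, in which bits 14:12 carry the xd/xs1/xs2 flags rather than a free funct3. The type rocc_inst_t and the function rocc_decode are hypothetical names used only for this illustration.

```c
#include <stdint.h>

/* Fields of a RoCC instruction word (names chosen for this sketch). */
typedef struct {
    uint8_t funct7;  /* bits 31:25, operation select             */
    uint8_t rs2;     /* bits 24:20, source register index        */
    uint8_t rs1;     /* bits 19:15, source register index        */
    uint8_t xd;      /* bit 14, 1 = accelerator writes rd        */
    uint8_t xs1;     /* bit 13, 1 = accelerator reads rs1        */
    uint8_t xs2;     /* bit 12, 1 = accelerator reads rs2        */
    uint8_t rd;      /* bits 11:7, destination register index    */
    uint8_t opcode;  /* bits 6:0, 0x0B for custom-0              */
} rocc_inst_t;

/* Extract each field with a shift and a mask. */
rocc_inst_t rocc_decode(uint32_t inst)
{
    rocc_inst_t d;
    d.funct7 = (uint8_t)((inst >> 25) & 0x7Fu);
    d.rs2    = (uint8_t)((inst >> 20) & 0x1Fu);
    d.rs1    = (uint8_t)((inst >> 15) & 0x1Fu);
    d.xd     = (uint8_t)((inst >> 14) & 0x1u);
    d.xs1    = (uint8_t)((inst >> 13) & 0x1u);
    d.xs2    = (uint8_t)((inst >> 12) & 0x1u);
    d.rd     = (uint8_t)((inst >> 7)  & 0x1Fu);
    d.opcode = (uint8_t)(inst & 0x7Fu);
    return d;
}
```

For example, the word 0x00C5F50B decodes to opcode 0x0B, rd = 10 (a0), rs1 = 11 (a1), rs2 = 12 (a2), funct7 = 0, with all three flags set.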

Verilog — minimal RoCC accelerator skeleton

// Minimal RoCC coprocessor — receives custom-0 instruction, returns result
module rocc_accel (
  input  logic        clk, rst_n,
  // RoCC command channel (from CPU)
  input  logic        cmd_valid,
  output logic        cmd_ready,
  input  logic [6:0]  cmd_inst_funct7,
  input  logic [2:0]  cmd_inst_funct3,
  input  logic [4:0]  cmd_inst_rd,
  input  logic [63:0] cmd_rs1, cmd_rs2,
  // RoCC response channel (to CPU)
  output logic        resp_valid,
  input  logic        resp_ready,
  output logic [4:0]  resp_rd,
  output logic [63:0] resp_data
);

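  // Simple MAC: accumulator <= rs1 * rs2 + accumulator, result returned in rd.
  // In Rocket's RoCC format the funct3 bit positions carry the xd/xs1/xs2
  // flags, so this skeleton selects the operation with funct7 alone.
  logic [63:0] accumulator, acc_next;
  logic        xd;

  assign xd = cmd_inst_funct3[2];   // instruction bit 14: CPU expects a result in rd

  // Accept a new command only when no earlier response is still waiting
  assign cmd_ready = !resp_valid || resp_ready;

  // Next accumulator value, computed combinationally so the response
  // carries the updated result rather than the stale one
  always_comb begin
    case (cmd_inst_funct7)
      7'd0:    acc_next = cmd_rs1 * cmd_rs2 + accumulator; // MAC
      7'd1:    acc_next = '0;                               // CLEAR
      default: acc_next = accumulator;
    endcase
  end

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      accumulator <= '0;
      resp_valid  <= 1'b0;
      resp_rd     <= '0;
      resp_data   <= '0;
    end else if (cmd_valid && cmd_ready) begin
      accumulator <= acc_next;
      resp_rd     <= cmd_inst_rd;
      resp_data   <= acc_next;
      resp_valid  <= xd;            // respond only when rd is to be written
    end else if (resp_valid && resp_ready) begin
      resp_valid <= 1'b0;
    end
  end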

endmodule

5. Model 3 — AXI4 Loosely-Coupled Accelerator

AXI4 is the industry-standard on-chip bus protocol from Arm, defined in the AMBA 4 specification. An AXI4-connected accelerator exposes a slave port on the system bus: the CPU configures it via AXI4-Lite (simple register access), and the accelerator, acting as an AXI4 master, DMA-reads input data from memory and DMA-writes results back. This is the most widely used model in production SoCs because it works with any CPU that sits on an AXI interconnect, uses standard tools, and separates the compute from the data movement.

AXI4 Accelerator Anatomy

AXI4-Lite slave port: CPU writes to control/config registers (start, length, addresses, mode)
AXI4 master read port: Accelerator fetches input data from SRAM/DDR using burst transactions
AXI4 master write port: Accelerator writes output results to SRAM/DDR
Interrupt output: Signals the CPU when computation completes (avoids polling)

AXI4 vs AXI4-Lite — Know the Difference

AXI4-Lite: Single-beat transfers only (no burst). Used for control registers: simple, small, and synthesizes to minimal logic.
AXI4 full: Supports bursts of up to 256 beats, multiple outstanding transactions, and exclusive access. Used for data DMA where bandwidth matters: a 256-beat burst at 128-bit width moves 256 × 16 B = 4 KB per transaction at peak bus bandwidth.
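The burst arithmetic can be sketched in C. Besides the 256-beat limit for incrementing bursts, AXI also forbids a burst from crossing a 4 KB address boundary, so a DMA engine typically splits a long transfer into several bursts. The function names below are hypothetical, and the sketch assumes the address and length are multiples of the beat size.

```c
#include <stdint.h>

#define AXI4_MAX_BEATS 256u   /* maximum INCR burst length in AXI4   */
#define AXI4_BOUNDARY  4096u  /* a burst must not cross a 4 KB line  */

/* Bytes moved by one burst of `beats` beats on a bus of the given width. */
uint32_t axi_burst_bytes(uint32_t beats, uint32_t data_width_bits)
{
    return beats * (data_width_bits / 8u);
}

/* Beats in the next burst starting at addr.
 * Assumes addr and remaining_bytes are multiples of bytes_per_beat. */
uint32_t axi_next_burst_beats(uint64_t addr, uint64_t remaining_bytes,
                              uint32_t bytes_per_beat)
{
    uint64_t to_boundary = AXI4_BOUNDARY - (addr % AXI4_BOUNDARY);
    uint64_t bytes = remaining_bytes < to_boundary ? remaining_bytes : to_boundary;
    uint64_t beats = bytes / bytes_per_beat;
    if (beats > AXI4_MAX_BEATS) {
        beats = AXI4_MAX_BEATS;
    }
    return (uint32_t)beats;
}

/* Number of bursts needed to move total_bytes starting at addr. */
uint32_t axi_count_bursts(uint64_t addr, uint64_t total_bytes,
                          uint32_t bytes_per_beat)
{
    uint32_t n = 0;
    while (total_bytes > 0) {
        uint32_t beats = axi_next_burst_beats(addr, total_bytes, bytes_per_beat);
        if (beats == 0) {
            break;  /* unaligned remainder: outside this sketch's assumptions */
        }
        uint64_t bytes = (uint64_t)beats * bytes_per_beat;
        addr        += bytes;
        total_bytes -= bytes;
        n++;
    }
    return n;
}
```

With a 128-bit bus (16 bytes per beat), an 8 KB transfer that starts 256 bytes below a 4 KB boundary takes three bursts: 16 beats up to the boundary, a full 256-beat burst, and a final 240-beat burst.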

6. Model 4 — Custom ISA Extension

RISC-V explicitly sets aside 4 major opcodes for user-defined custom instructions: custom-0 (0x0B), custom-1 (0x2B), custom-2 (0x5B), and custom-3 (0x7B). custom-0 and custom-1 are reserved for custom use. custom-2 and custom-3 are reserved for possible future RV128 use but are otherwise avoided by standard extensions, so all four are safe for custom use on RV32 and RV64.

RISC-V custom opcode spaces:

custom-0: opcode = 0x0B (0000_1011), reserved for custom extensions
custom-1: opcode = 0x2B (0010_1011), reserved for custom extensions
custom-2: opcode = 0x5B (0101_1011), custom on RV32/RV64 (reserved for RV128)
custom-3: opcode = 0x7B (0111_1011), custom on RV32/RV64 (reserved for RV128)

Any instruction format (R/I/S/U) can be used under these opcodes. In the R-type format:

funct3 (3 bits) × funct7 (7 bits) = up to 1024 distinct operations per opcode space
→ 4096 total custom instruction variants

Example:

.insn r 0x0B, 0, 0, a0, a1, a2
→ custom-0, funct3=0, funct7=0
→ rs1=a1, rs2=a2, rd=a0
→ CPU decodes and routes to your custom execution unit
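As a quick check on the encoding, the R-type fields can be packed by hand in C. This is only an illustrative sketch and encode_custom_r is a hypothetical helper name; the bit positions are the standard R-type layout, and the argument order mirrors the .insn r directive (opcode, funct3, funct7, rd, rs1, rs2).

```c
#include <stdint.h>

/* Pack R-type fields into a 32-bit instruction word.
 * Each field is masked to its width before shifting. */
uint32_t encode_custom_r(uint32_t opcode, uint32_t funct3, uint32_t funct7,
                         uint32_t rd, uint32_t rs1, uint32_t rs2)
{
    return ((funct7 & 0x7Fu) << 25) |
           ((rs2    & 0x1Fu) << 20) |
           ((rs1    & 0x1Fu) << 15) |
           ((funct3 & 0x07u) << 12) |
           ((rd     & 0x1Fu) << 7)  |
            (opcode & 0x7Fu);
}
```

With a0 = x10, a1 = x11, and a2 = x12, encode_custom_r(0x0B, 0, 0, 10, 11, 12) yields 0x00C5850B, which is the word the assembler should emit for the .insn example above.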

Custom ISA Requires Toolchain Modification

Full compiler support for custom instructions means modifying GCC or LLVM to recognise the new opcodes and generate them from C intrinsics. For prototyping you can skip that step: the GNU assembler supports .insn directives, which emit arbitrary instruction encodings from inline assembly without any toolchain patches. Production custom ISA extensions typically add compiler intrinsics (for example, a hypothetical __builtin_riscv_my_accel(a, b)) mapped to the custom instruction encoding.
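A minimal sketch of the prototyping route: wrap the .insn directive in an inline-assembly helper so C code can call the custom instruction like a function. The name my_accel_op is hypothetical, and the non-RISC-V fallback (a plain multiply) is only a placeholder software model so the code also builds on a development host. It says nothing about what a real accelerator would compute.

```c
#include <stdint.h>

/* Issue custom-0 with funct3 = 0, funct7 = 0; rd = result, rs1 = a, rs2 = b.
 * unsigned long matches the register width on both RV32 and RV64. */
static inline unsigned long my_accel_op(unsigned long a, unsigned long b)
{
#if defined(__riscv)
    unsigned long rd;
    __asm__ volatile(".insn r 0x0B, 0, 0, %0, %1, %2"
                     : "=r"(rd)
                     : "r"(a), "r"(b));
    return rd;
#else
    /* Assumed placeholder behaviour for host builds only. */
    return a * b;
#endif
}
```

On a RISC-V core that does not implement the instruction, executing it raises an illegal-instruction exception, so the helper should only be called on hardware or a simulator that includes the accelerator.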

7. Complete Tradeoff Comparison

Property	MMIO	RoCC	AXI4	Custom ISA
Invoke latency	100s cycles (store+poll)	1–10 cycles	50–500 cycles	1–5 cycles
Data bandwidth	Memory-limited	2 × 64-bit/cycle (rs1,rs2)	Full AXI burst (GB/s)	2 × 64-bit/cycle
CPU compatibility	Any CPU	Rocket/Chipyard only	Any CPU with AXI4	Custom pipeline only
HW complexity	Low (simple bus)	Medium (RoCC protocol)	Medium–High (AXI4)	High (pipeline modification)
SW complexity	Low (C MMIO driver)	Medium (custom asm)	Low–Medium (driver + DMA)	High (toolchain)
Interrupt support	Yes (standard IRQ)	Yes (via RoCC channel)	Yes (standard IRQ)	N/A (synchronous)
DMA support	External DMA needed	Built-in L1 cache access	Built-in AXI4 master	No (register operands)
Best for	Large data blocks, any core	Low-latency scalar ops	High-BW, standard SoC	New compute primitives
Real example	Most FPGA SoCs	Hwacha vector (Berkeley)	Xilinx Vitis AI DPU	RISC-V V extension (vectors)

8. Real-World Examples

System	RISC-V Core	Accelerator	Interface	Application
Hwacha	Rocket (Berkeley)	Vector accelerator	RoCC	HPC / scientific compute
Gemmini	Rocket + Chipyard	Systolic array (16×16)	RoCC + DMA	Deep learning inference (Berkeley)
CVA6 + NLP Accel	CVA6 (ETH Zurich)	NLP/BERT accelerator	AXI4	Edge NLP inference
SiFive X280	SiFive X280	Custom vector extensions	Custom ISA (V-ext)	ML / DSP
Ara (ETH Zurich)	CVA6	Vector coprocessor (4–16 lanes)	AXI4 + custom ISA	SIMD / vector compute
OpenTitan	Ibex (RV32)	AES / SHA / HMAC engines	MMIO (TL-UL bus)	Security / cryptography
CHIPS Alliance SoC	Rocket / BOOM	Custom DSP / ML blocks	RoCC + AXI4	Open silicon research

Gemmini — The Open-Source RISC-V Systolic Array

Gemmini from UC Berkeley is the closest open-source example to what we'll build in this course. It's a parametric systolic array generator that integrates via RoCC into the Rocket chip — configurable tile size, datatype (INT8/FP32), scratchpad size, and DMA bandwidth. It ships with a runtime library and can run full DNN inference. We'll use Gemmini's architecture as inspiration for our own Day 4–5 systolic array design.

9. Course Roadmap — What You'll Build in 15 Days

Day	Topic	What You Build
1	Architecture Overview (this page)	Understanding of all 4 models + which to use when
2	Custom ISA Extension	custom-0 instruction + GCC .insn assembler usage
3	RoCC Interface Deep Dive	RoCC MAC coprocessor in Verilog + protocol testbench
4	Systolic Array Design	8×8 weight-stationary systolic array + testbench
5	Systolic Array via RoCC	Wire systolic array to RoCC + custom instruction test
6	AXI4-Lite MMIO Accelerator	Complete AXI4-Lite slave with CSR file
7	DMA Engine	AXI4 master DMA with burst + scatter-gather
8	INT8 NN Inference Engine	Quantized MAC array with ReLU and bias
9	Bare-Metal C Driver	Complete inference API in C for the NN engine
10	Full RISC-V SoC Integration	Top-level SoC with AXI4 crossbar + all components
11	Performance Optimization	Tiling large matrices, double-buffering
12	Verification	SVA properties on AXI4 + formal RoCC protocol check
13	FPGA Implementation	Bitstream for Arty A7 running real inference
14	Physical Design	Floorplan + power domains for the full SoC
15	Capstone	End-to-end RISC-V AI SoC with benchmarks

10. Day 1 Interview Q&A

#	Question	Key Answer Points
1	What is the RoCC interface and what are its advantages?	Tightly-coupled coprocessor interface in Rocket chip. Passes CPU register values directly to accelerator via custom instructions. No memory roundtrip → 1–10 cycle latency. Disadvantage: tied to the Rocket/Chipyard ecosystem (Rocket and BOOM), not portable to cores such as CVA6 without reimplementation.
2	Why would you choose AXI4 over RoCC for an accelerator?	AXI4 is core-agnostic (works with any CPU), supports high-bandwidth DMA bursts for large data, is industry-standard (tools, IP, verification), and scales to complex multi-master SoCs. RoCC is better when latency of individual operations is the constraint and you're in the Rocket ecosystem.
3	What RISC-V opcodes can you use for custom instructions?	custom-0 (0x0B), custom-1 (0x2B), custom-2 (0x5B), custom-3 (0x7B). Standard RISC-V extensions avoid these opcodes (custom-2 and custom-3 are reserved only for possible RV128 use). Any format (R/I/S/U) can be used; in R-type, funct3+funct7 give 1024 sub-operations per opcode = 4096 total custom instruction variants.
4	What is a systolic array and why is it good for matrix multiply?	A 2D grid of Processing Elements (PEs) where data flows in a pipelined wave — each PE does one MAC (a×b+c) and passes results to its neighbor. For matrix multiply, inputs stream in from left (activations) and top (weights), and partial sums accumulate through the array. Every PE operates every cycle → compute efficiency near 100%, unlike a CPU which loads/stores between MACs.
5	What is Gemmini?	Open-source parametric systolic array generator from UC Berkeley that integrates with Rocket chip via RoCC. Configurable tile size, INT8/FP32, DMA, scratchpad. Used for research and production RISC-V AI SoC prototyping. Inspiration for what we build in this course.
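The systolic-array answer in row 4 can be illustrated with a small cycle-level software model. This sketch uses an output-stationary schedule (each PE keeps its own partial sum), chosen here only because it is the simplest to model; the Day 4 design in this course is weight-stationary. The array size SA_N and the function name systolic_matmul are assumptions made for this example.

```c
#include <stdint.h>

#define SA_N 4u  /* assumed array size: SA_N x SA_N processing elements */

/* Compute C = A * B on a modelled SA_N x SA_N array.
 * A is SA_N x k_dim, B is k_dim x SA_N, C is SA_N x SA_N, all row-major.
 * Activations enter from the left and weights from the top, each skewed by
 * one cycle per row/column, so operand pair k reaches PE(i,j) at cycle
 * k + i + j. Returns the number of cycles until the last PE finishes. */
uint32_t systolic_matmul(const int8_t *a, const int8_t *b, int32_t *c,
                         uint32_t k_dim)
{
    uint32_t cycles = k_dim + 2u * (SA_N - 1u);

    for (uint32_t i = 0; i < SA_N * SA_N; i++) {
        c[i] = 0;
    }

    for (uint32_t t = 0; t < cycles; t++) {
        for (uint32_t i = 0; i < SA_N; i++) {
            for (uint32_t j = 0; j < SA_N; j++) {
                if (t < i + j) {
                    continue;           /* wavefront has not arrived yet */
                }
                uint32_t k = t - i - j;
                if (k >= k_dim) {
                    continue;           /* this PE has already drained   */
                }
                /* One multiply-accumulate per active PE per cycle. */
                c[i * SA_N + j] += (int32_t)a[i * k_dim + k] *
                                   (int32_t)b[k * SA_N + j];
            }
        }
    }
    return cycles;
}
```

The model shows why utilisation approaches 100% for long inner dimensions: the fill and drain overhead is a fixed 2 × (SA_N − 1) cycles, while every PE performs one MAC per cycle for k_dim cycles in between.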

Day 1 Knowledge Checklist

☐ Can explain the 4 accelerator integration models and when to use each
☐ Know the RoCC command/response channel signal names and handshake
☐ Know the 4 RISC-V custom opcode values and their encoding space
☐ Understand why AXI4 is more portable than RoCC
☐ Can describe what a systolic array does at a high level
☐ Can write a minimal MMIO driver in C for a custom accelerator
☐ Know 3 real-world RISC-V + accelerator systems (Hwacha, Gemmini, Ara)
