
RISC-V + Accelerator Architecture Overview

The 4 ways to wire a custom hardware accelerator into a RISC-V processor: memory-mapped I/O (MMIO), RoCC tightly-coupled coprocessor, AXI4 loosely-coupled accelerator, and custom ISA extension. Includes full block diagrams, latency/bandwidth tradeoffs, and real chip examples.

By EcrioniX Engineering Team · Published June 19, 2026 · ~4,800 words · 16 min read

1. Why RISC-V + Custom Accelerators?

General-purpose processors are inefficient at domain-specific compute. A RISC-V CPU running a matrix multiply spends most of its time moving data between registers and memory, managing loop counters, and executing scalar multiply-accumulate operations one at a time. A custom systolic array accelerator running the same matrix multiply can achieve 100–1000× higher throughput by doing thousands of multiplications in parallel and reusing operands inside the array instead of refetching them from memory.

But a pure accelerator without a general-purpose processor is inflexible — you cannot easily change the algorithm, handle exceptions, run control logic, or communicate over USB/UART without substantial additional hardware. RISC-V + Accelerator gives you the best of both worlds: a flexible, programmable CPU for control and irregular compute, and a hardwired engine for the performance-critical innermost loop.

Performance

Accelerator runs 100–1000× faster than CPU for its target computation (matrix multiply, FFT, AES, SHA)

Flexibility

RISC-V CPU handles control flow, OS, communication, and anything the accelerator can't do

Power

Hardwired datapath eliminates fetch/decode overhead — 10–50× better TOPS/W vs CPU execution

Area

Small accelerator (3–5% of die) can handle 80%+ of runtime for targeted workloads
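The fraction of runtime the accelerator covers matters as much as its raw speed, because the part left on the CPU caps the system-level gain (Amdahl's law). A minimal sketch, assuming a hypothetical helper `overall_speedup` and illustrative inputs taken from the ranges quoted above rather than from any measurement:

```c
#include <assert.h>

/* Amdahl's law: a fraction f of the original runtime is accelerated
 * by a factor s; the remaining (1 - f) still runs on the CPU. */
static double overall_speedup(double f, double s)
{
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}
```

With f = 0.80 the whole system speeds up by a little under 5×, whether the kernel itself runs 100× or 1000× faster. Raising coverage to f = 0.99 with a 100× kernel gives roughly 50×. Choosing what to offload is therefore as important as how fast the offloaded block is.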

Real-World Impact

Google's first-generation TPU pairs a 256×256 systolic array of 8-bit multiply-accumulate units with a host CPU: the accelerator delivers a peak of 92 TOPS while the host provides programmability. The same principle applies to RISC-V SoCs: a small, well-targeted accelerator transforms the capability of the whole system for a fraction of the die area.

2. The 4 Integration Models — Overview

There are four fundamental ways to connect a hardware accelerator to a RISC-V processor. The choice depends on your latency requirement, data bandwidth need, software complexity budget, and which RISC-V core you're targeting.

[Figure 1: block diagrams of the 4 RISC-V accelerator integration models]
① MMIO: RISC-V CPU reaches the accelerator's MMIO registers over the bus; data sits in shared SRAM. Latency: 100s of cycles. Bandwidth: DMA-limited. Use when: any CPU, maximum portability, large data blocks, existing AXI ecosystem.
② RoCC: RISC-V pipeline connected to the RoCC accelerator through the RoCC cmd and resp channels; shared register file. Latency: 1–10 cycles. Bandwidth: register-file speed. Use when: Rocket/Chipyard, low-latency scalar ops, register-level data.
③ AXI4: RISC-V CPU and accelerator connected over AXI4 to DDR / SRAM. Latency: 50–500 cycles. Bandwidth: AXI bus width. Use when: any CPU, standard bus, high-bandwidth DMA transfer, industry-standard interface.
④ Custom ISA: custom opcode decoded in the RISC-V decode stage and executed by an EX unit fed from the pipeline registers. Latency: 1–5 cycles. Bandwidth: register-file speed. Use when: custom CPU pipeline, full toolchain access, new compute primitive.
Left to right (① to ④): simpler to more complex HW interface; higher to lower latency; any RISC-V core to Rocket/Chipyard-specific.
Fig 1: The 4 RISC-V accelerator integration models. MMIO and AXI4 are loosely coupled and work with any CPU. RoCC and Custom ISA are tightly coupled, offering lower latency at the cost of CPU-specific implementation.

3. Model 1 — Memory-Mapped I/O (MMIO)

MMIO is the simplest integration model and the one that works with any RISC-V core — or any processor at all. The accelerator appears as a peripheral at a specific address range in the CPU's address space. The CPU configures the accelerator by writing to its control registers, triggers a computation by writing to a start register, and polls a status register or waits for an interrupt to know when results are ready.

How MMIO Works

Address map example:
  BASE_ADDR = 0x1000_0000
  BASE + 0x00 → CTRL     (write: bit[0]=start, bit[1]=reset)
  BASE + 0x04 → STATUS   (read: bit[0]=busy, bit[1]=done, bit[2]=error)
  BASE + 0x08 → SRC_ADDR (pointer to input data in SRAM)
  BASE + 0x0C → DST_ADDR (pointer to output buffer in SRAM)
  BASE + 0x10 → LENGTH   (number of elements)
  BASE + 0x14 → RESULT   (scalar result if small enough to fit a register)

CPU workflow:
  1. STORE src_addr → BASE+0x08
  2. STORE dst_addr → BASE+0x0C
  3. STORE length   → BASE+0x10
  4. STORE 0x1      → BASE+0x00 (start bit)
  5. LOOP: LOAD BASE+0x04 until bit[1]=done (or wait for IRQ)
  6. LOAD BASE+0x14 or read from dst_addr
Verilog — MMIO accelerator control register block
module mmio_accel_ctrl #(
  parameter DATA_W = 32,
  parameter ADDR_W = 8
)(
  input  logic              clk, rst_n,
  // Simple bus (expand to AXI4-Lite in Day 6)
  input  logic [ADDR_W-1:0] addr,      // byte offset from BASE
  input  logic [DATA_W-1:0] wdata,
  input  logic              wen, ren,  // ren is unused: reads are combinational
  output logic [DATA_W-1:0] rdata,
  // To accelerator datapath
  output logic              start,
  output logic [31:0]       src_addr, dst_addr, length,
  input  logic              busy, done, error
);

  logic [DATA_W-1:0] ctrl_reg, src_reg, dst_reg, len_reg;

  // Registers are 32-bit words at byte offsets 0x00, 0x04, 0x08, ...
  // so the register index is addr[4:2], not addr[3:0].
  logic [2:0] word_idx;
  assign word_idx = addr[4:2];

  // Write path
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      ctrl_reg <= '0;
      src_reg  <= '0;
      dst_reg  <= '0;
      len_reg  <= '0;
    end else begin
      // start is a one-cycle pulse: clear it every cycle,
      // a CTRL write below overrides this (last assignment wins)
      ctrl_reg[0] <= 1'b0;
      if (wen) begin
        case (word_idx)
          3'd0: ctrl_reg <= wdata;  // 0x00 CTRL
          3'd2: src_reg  <= wdata;  // 0x08 SRC_ADDR
          3'd3: dst_reg  <= wdata;  // 0x0C DST_ADDR
          3'd4: len_reg  <= wdata;  // 0x10 LENGTH
          default: ;                // STATUS is read-only; RESULT (0x14) is not implemented here
        endcase
      end
    end
  end

  // Read path
  always_comb begin
    case (word_idx)
      3'd0: rdata = ctrl_reg;
      3'd1: rdata = {29'b0, error, done, busy};  // 0x04 STATUS
      3'd2: rdata = src_reg;
      3'd3: rdata = dst_reg;
      3'd4: rdata = len_reg;
      default: rdata = '0;
    endcase
  end

  assign start    = ctrl_reg[0];
  assign src_addr = src_reg;
  assign dst_addr = dst_reg;
  assign length   = len_reg;

endmodule
C — bare-metal MMIO driver
#include <stdint.h>

#define ACCEL_BASE   0x10000000UL
#define CTRL_REG     (*(volatile uint32_t *)(ACCEL_BASE + 0x00))
#define STATUS_REG   (*(volatile uint32_t *)(ACCEL_BASE + 0x04))
#define SRC_REG      (*(volatile uint32_t *)(ACCEL_BASE + 0x08))
#define DST_REG      (*(volatile uint32_t *)(ACCEL_BASE + 0x0C))
#define LEN_REG      (*(volatile uint32_t *)(ACCEL_BASE + 0x10))

#define STATUS_BUSY  (1u << 0)
#define STATUS_DONE  (1u << 1)
#define CTRL_START   (1u << 0)

void accel_run(uint32_t *src, uint32_t *dst, uint32_t len)
{
    // The address registers are 32 bits wide, so on RV64 the buffers
    // must sit in the low 4 GB; go through uintptr_t to make the cast explicit.
    SRC_REG  = (uint32_t)(uintptr_t)src;
    DST_REG  = (uint32_t)(uintptr_t)dst;
    LEN_REG  = len;
    CTRL_REG = CTRL_START;                    // kick off
    while (!(STATUS_REG & STATUS_DONE)) { }   // poll until done
}
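A driver that polls forever hangs the CPU if the accelerator never raises done. The sketch below shows one way to bound the wait and report the error bit. The struct `accel_regs_t`, the function `accel_run_timeout` and the return codes are hypothetical names for illustration; only the register offsets and STATUS bit positions come from the address map above. Taking the register block as a pointer also lets the same code run against a plain struct in a host-side unit test.

```c
#include <assert.h>
#include <stdint.h>

/* Register layout mirroring the address map (byte offsets 0x00..0x14). */
typedef struct {
    volatile uint32_t ctrl;      /* 0x00: bit0 = start, bit1 = reset */
    volatile uint32_t status;    /* 0x04: bit0 = busy, bit1 = done, bit2 = error */
    volatile uint32_t src_addr;  /* 0x08 */
    volatile uint32_t dst_addr;  /* 0x0C */
    volatile uint32_t length;    /* 0x10 */
    volatile uint32_t result;    /* 0x14 */
} accel_regs_t;

/* Hypothetical return codes. */
enum { ACCEL_OK = 0, ACCEL_ERR_HW = -1, ACCEL_ERR_TIMEOUT = -2 };

/* Program one job, start it, then read STATUS at most max_polls times. */
static int accel_run_timeout(accel_regs_t *regs, uint32_t src, uint32_t dst,
                             uint32_t len, uint32_t max_polls)
{
    assert(regs != 0);
    regs->src_addr = src;
    regs->dst_addr = dst;
    regs->length   = len;
    regs->ctrl     = 1u << 0;                        /* start */

    for (uint32_t i = 0; i < max_polls; i++) {
        uint32_t s = regs->status;
        if (s & (1u << 2)) return ACCEL_ERR_HW;      /* error bit set */
        if (s & (1u << 1)) return ACCEL_OK;          /* done bit set */
    }
    return ACCEL_ERR_TIMEOUT;
}
```

On hardware the pointer would be `(accel_regs_t *)0x10000000UL`; in a test it can point at an ordinary variable whose status field is preset.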

4. Model 2 — RoCC Tightly-Coupled Coprocessor

RoCC (Rocket Custom Coprocessor) is a protocol defined by the UC Berkeley Rocket chip generator for attaching custom accelerators inside the CPU pipeline. The key insight: custom RISC-V instructions (using the custom-0 to custom-3 opcode space) are decoded by the CPU and dispatched to the RoCC accelerator alongside the register file values — no memory access, no bus transaction, just direct register-level operand passing.

RoCC Command Channel

RoCC custom instruction encoding (R-type, custom-0 opcode 0x0B):

  31      25 24   20 19   15  14   13    12   11    7 6        0
  [ funct7 ][ rs2  ][ rs1  ][ xd ][ xs1 ][ xs2 ][  rd  ][ opcode ]
  [ 7 bits ][5 bits][5 bits][ 1  ][  1  ][  1  ][5 bits][  0x0B  ]

  funct7        = 7 bits, selecting up to 128 custom operations per opcode
  xd, xs1, xs2  = Rocket's RoCC uses the funct3 position for three flags:
                  xd=1 means the instruction writes rd, xs1/xs2=1 mean it reads rs1/rs2
  rs1, rs2      = source register indices (CPU sends their VALUES to RoCC)
  rd            = destination register index (RoCC writes result here)

RoCC command channel signals:
  cmd.valid      → CPU sending instruction to RoCC
  cmd.ready      ← RoCC can accept instruction
  cmd.bits.inst  → the encoded instruction word
  cmd.bits.rs1   → value of register rs1
  cmd.bits.rs2   → value of register rs2

RoCC response channel:
  resp.valid     ← RoCC has a result ready
  resp.ready     → CPU can accept result
  resp.bits.rd   ← destination register index
  resp.bits.data ← result value to write to CPU register file
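To make the field layout concrete, here is a small host-side decoder that splits a 32-bit instruction word the way an accelerator front end would. It is a sketch: the type `rocc_inst_t` and the function `rocc_decode` are hypothetical names, and it assumes Rocket's convention that bits 14:12 (the funct3 position) carry the xd, xs1 and xs2 flags.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t funct7;  /* bits 31:25, operation selector */
    uint8_t rs2;     /* bits 24:20, source register index */
    uint8_t rs1;     /* bits 19:15, source register index */
    uint8_t xd;      /* bit 14, instruction writes rd */
    uint8_t xs1;     /* bit 13, instruction reads rs1 */
    uint8_t xs2;     /* bit 12, instruction reads rs2 */
    uint8_t rd;      /* bits 11:7, destination register index */
    uint8_t opcode;  /* bits 6:0, 0x0B for custom-0 */
} rocc_inst_t;

/* Split an instruction word into its RoCC fields. */
static rocc_inst_t rocc_decode(uint32_t inst)
{
    rocc_inst_t d;
    d.funct7 = (uint8_t)((inst >> 25) & 0x7Fu);
    d.rs2    = (uint8_t)((inst >> 20) & 0x1Fu);
    d.rs1    = (uint8_t)((inst >> 15) & 0x1Fu);
    d.xd     = (uint8_t)((inst >> 14) & 0x1u);
    d.xs1    = (uint8_t)((inst >> 13) & 0x1u);
    d.xs2    = (uint8_t)((inst >> 12) & 0x1u);
    d.rd     = (uint8_t)((inst >> 7)  & 0x1Fu);
    d.opcode = (uint8_t)(inst & 0x7Fu);
    return d;
}
```

For example, the word 0x0062F38B decodes to funct7=0, rs2=x6, rs1=x5, xd=xs1=xs2=1, rd=x7, opcode=0x0B: a custom-0 operation that reads x5 and x6 and writes x7.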
[Figure 2: RoCC interface, tightly coupled to the RISC-V pipeline. The pipeline (Fetch, Decode, Execute, Writeback) recognises a custom-0 instruction in Decode and sends inst + rs1 + rs2 values from the register file (example: rs1=x5=0x200, rs2=x6=0x300) over the cmd channel. The RoCC accelerator (funct7/3 decode, MAC / systolic datapath, result register, optional memory interface to the L1/L2 cache) returns rd + result data over the resp channel, and Writeback stores it in rd=x7.]
Fig 2: RoCC tightly-coupled interface. Custom RISC-V instructions pass register values directly to the coprocessor. Results write back to CPU registers with no memory roundtrip — 1–10 cycle latency.
Verilog — minimal RoCC accelerator skeleton
// Minimal RoCC coprocessor: receives a custom-0 instruction, returns the result
module rocc_accel (
  input  logic        clk, rst_n,
  // RoCC command channel (from CPU)
  input  logic        cmd_valid,
  output logic        cmd_ready,
  input  logic [6:0]  cmd_inst_funct7,
  input  logic [2:0]  cmd_inst_funct3,  // in Rocket's RoCC these bits are {xd, xs1, xs2}
  input  logic [4:0]  cmd_inst_rd,
  input  logic [63:0] cmd_rs1, cmd_rs2,
  // RoCC response channel (to CPU)
  output logic        resp_valid,
  input  logic        resp_ready,
  output logic [4:0]  resp_rd,
  output logic [63:0] resp_data
);

  // Simple single-cycle MAC: result = rs1 * rs2 + accumulator
  logic [63:0] accumulator, acc_next;
  logic        cmd_fire;

  // Accept a new command only when the response register is free or
  // being drained this cycle, so a pending response is never overwritten
  assign cmd_ready = !resp_valid || resp_ready;
  assign cmd_fire  = cmd_valid && cmd_ready;

  // Decode on funct7 only: the funct3 position holds the xd/xs1/xs2 flags
  always_comb begin
    case (cmd_inst_funct7)
      7'd0:    acc_next = cmd_rs1 * cmd_rs2 + accumulator;  // MAC
      7'd1:    acc_next = '0;                               // CLEAR
      default: acc_next = accumulator;
    endcase
  end

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      accumulator <= '0;
      resp_valid  <= 1'b0;
      resp_rd     <= '0;
      resp_data   <= '0;
    end else begin
      if (resp_valid && resp_ready)
        resp_valid <= 1'b0;
      if (cmd_fire) begin
        accumulator <= acc_next;
        resp_rd     <= cmd_inst_rd;
        resp_data   <= acc_next;            // return the updated value, not the stale one
        resp_valid  <= cmd_inst_funct3[2];  // respond only when xd=1
      end
    end
  end

endmodule

5. Model 3 — AXI4 Loosely-Coupled Accelerator

AXI4 is the industry-standard on-chip bus protocol, defined by Arm as part of the AMBA specification. An AXI4-connected accelerator usually has two ports: an AXI4-Lite slave port through which the CPU writes its control registers, and an AXI4 master port through which the accelerator DMA-reads input data from memory and DMA-writes results back. This model is widely used in production SoCs because it works with any CPU, uses standard tools, and separates the compute from the data movement.

AXI4 Accelerator Anatomy

AXI4 vs AXI4-Lite — Know the Difference

AXI4-Lite: Single-beat transfers only (no burst). Used for control registers: simple, small, and synthesizes to minimal logic.
AXI4 full: Supports bursts up to 256 beats, multiple outstanding transactions, exclusive access. Used for data DMA where bandwidth matters: a 256-beat burst at 128-bit width moves 256 × 16 B = 4 KB per transaction at peak bus bandwidth.
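A DMA engine has to cut a large transfer into legal bursts. Besides the 256-beat limit, the AXI specification forbids a burst from crossing a 4 KB address boundary, so an unaligned start address shortens the first burst. The sketch below computes burst lengths under those two rules. The function names are hypothetical, and it assumes incrementing bursts with the address and byte count aligned to the beat size.

```c
#include <assert.h>
#include <stdint.h>

#define AXI_MAX_BEATS 256u   /* AXI4 incrementing-burst limit */
#define AXI_BOUNDARY  4096u  /* a burst must not cross a 4 KB boundary */

/* Beats in the next burst starting at addr.
 * beat_bytes is the bus width in bytes (16 for a 128-bit bus). */
static uint32_t axi_next_burst_beats(uint64_t addr, uint64_t bytes_left,
                                     uint32_t beat_bytes)
{
    assert(beat_bytes != 0 && beat_bytes <= 128u);
    assert((beat_bytes & (beat_bytes - 1u)) == 0);           /* power of two */
    assert(addr % beat_bytes == 0 && bytes_left % beat_bytes == 0);

    uint64_t to_boundary = AXI_BOUNDARY - (addr % AXI_BOUNDARY);
    uint64_t bytes = bytes_left < to_boundary ? bytes_left : to_boundary;
    uint64_t beats = bytes / beat_bytes;
    return (uint32_t)(beats > AXI_MAX_BEATS ? AXI_MAX_BEATS : beats);
}

/* Number of bursts needed to move a whole buffer. */
static uint32_t axi_count_bursts(uint64_t addr, uint64_t bytes,
                                 uint32_t beat_bytes)
{
    uint32_t n = 0;
    while (bytes > 0) {
        uint32_t beats = axi_next_burst_beats(addr, bytes, beat_bytes);
        addr  += (uint64_t)beats * beat_bytes;
        bytes -= (uint64_t)beats * beat_bytes;
        n++;
    }
    return n;
}
```

On a 128-bit bus, a 4 KB buffer at address 0x1000 goes out as one 256-beat burst. The same buffer at 0x1F00 needs two bursts (16 beats up to the boundary, then 240). On a 32-bit bus the 256-beat limit dominates instead, and 4 KB takes four bursts.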

6. Model 4 — Custom ISA Extension

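RISC-V sets aside 4 major opcodes for user-defined custom instructions: custom-0 (0x0B), custom-1 (0x2B), custom-2 (0x5B), and custom-3 (0x7B). The specification states that custom-0 and custom-1 will be avoided by standard extensions. custom-2 and custom-3 are reserved for future use by RV128 but are otherwise avoided by standard extensions, so they are also safe for custom use on RV32 and RV64.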

RISC-V custom opcode spaces:
  custom-0: opcode = 0x0B (000_1011), avoided by standard extensions
  custom-1: opcode = 0x2B (010_1011), avoided by standard extensions
  custom-2: opcode = 0x5B (101_1011), reserved for RV128, free for custom use on RV32/RV64
  custom-3: opcode = 0x7B (111_1011), reserved for RV128, free for custom use on RV32/RV64

Each opcode can use the standard R/I/S/U instruction formats. With the R-type format:
  funct3 (3 bits) × funct7 (7 bits) = up to 1024 distinct operations per opcode
  → 4096 total custom instruction variants across the 4 opcodes

Example:
  .insn r 0x0B, 0, 0, a0, a1, a2
  → custom-0, funct3=0, funct7=0
  → rs1=a1, rs2=a2, rd=a0
  → CPU decodes and routes to your custom execution unit

Custom ISA Requires Toolchain Modification

Full compiler support for custom instructions means modifying GCC or LLVM so that the compiler recognises the new opcodes and generates them from C intrinsics. For prototyping you can avoid that work: the GNU assembler's .insn directive emits arbitrary instruction encodings, so custom instructions can be issued from inline assembly without any toolchain patches. Production custom ISA extensions typically add compiler intrinsics (for example, a hypothetical __builtin_riscv_my_accel(a, b)) mapped to the custom instruction encoding.
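As a sketch of the prototyping route, the block below pairs a host-side R-type encoder (useful for checking what the assembler should emit) with an inline-assembly wrapper around `.insn`. The names `encode_rtype` and `my_accel_op` are hypothetical, and the wrapper only does something meaningful on a core that implements custom-0 with funct3=0 and funct7=0. On any other RISC-V core it raises an illegal-instruction trap.

```c
#include <assert.h>
#include <stdint.h>

/* Pack an R-type instruction word from its fields. */
static uint32_t encode_rtype(uint32_t opcode, uint32_t funct3, uint32_t funct7,
                             uint32_t rd, uint32_t rs1, uint32_t rs2)
{
    assert(opcode < 128u && funct3 < 8u && funct7 < 128u);
    assert(rd < 32u && rs1 < 32u && rs2 < 32u);
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}

#if defined(__riscv)
/* Issue custom-0 (funct3=0, funct7=0) via .insn; the compiler picks the
 * registers, the assembler builds the encoding. Compiled only for RISC-V. */
static inline unsigned long my_accel_op(unsigned long a, unsigned long b)
{
    unsigned long result;
    __asm__ volatile (".insn r 0x0B, 0, 0, %0, %1, %2"
                      : "=r"(result)
                      : "r"(a), "r"(b));
    return result;
}
#endif
```

`encode_rtype(0x0B, 0, 0, 10, 11, 12)` returns 0x00C5850B, the word for `.insn r 0x0B, 0, 0, a0, a1, a2` (a0, a1, a2 are x10, x11, x12). Comparing that value with the disassembly is a quick check that the wrapper emits the intended instruction.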

7. Complete Tradeoff Comparison

Property | MMIO | RoCC | AXI4 | Custom ISA
Invoke latency | 100s of cycles (store + poll) | 1–10 cycles | 50–500 cycles | 1–5 cycles
Data bandwidth | Memory-limited | 2 × 64-bit/cycle (rs1, rs2) | Full AXI burst (GB/s) | 2 × 64-bit/cycle
CPU compatibility | Any CPU | Rocket/Chipyard only | Any CPU with AXI4 | Custom pipeline only
HW complexity | Low (simple bus) | Medium (RoCC protocol) | Medium–High (AXI4) | High (pipeline modification)
SW complexity | Low (C MMIO driver) | Medium (custom asm) | Low–Medium (driver + DMA) | High (toolchain)
Interrupt support | Yes (standard IRQ) | Yes (via RoCC channel) | Yes (standard IRQ) | N/A (synchronous)
DMA support | External DMA needed | Built-in L1 cache access | Built-in AXI4 master | No (register operands)
Best for | Large data blocks, any core | Low-latency scalar ops | High-BW, standard SoC | New compute primitives
Real example | Most FPGA SoCs | Hwacha vector (Berkeley) | Xilinx Vitis AI DPU | RISC-V V extension (vectors)
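The invoke-latency row sets a minimum job size below which offloading loses to simply running the loop on the CPU. A rough way to estimate that break-even point is sketched below. The cost model, the function name `min_elems_for_offload` and the example numbers are assumptions for illustration, not measurements: CPU time is n × cycles-per-element, and accelerator time is the invoke overhead plus the CPU time divided by the kernel speedup.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest element count n for which offloading is strictly faster.
 *   t_cpu   = n * cpu_cycles_per_elem
 *   t_accel = invoke_cycles + t_cpu / speedup
 * t_accel < t_cpu  <=>  n > invoke_cycles * s / (c * (s - 1)) */
static uint64_t min_elems_for_offload(uint64_t invoke_cycles,
                                      uint64_t cpu_cycles_per_elem,
                                      uint64_t speedup)
{
    assert(cpu_cycles_per_elem > 0 && speedup > 1);
    return invoke_cycles * speedup /
           (cpu_cycles_per_elem * (speedup - 1)) + 1;
}
```

With an assumed 4 CPU cycles per element and a 100× kernel, a 200-cycle MMIO invocation only pays off from 51 elements upward, while a 5-cycle RoCC invocation pays off from 2 elements. This is why the table recommends MMIO and AXI4 for large data blocks and the tightly coupled models for fine-grained operations.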

8. Real-World Examples

System | RISC-V Core | Accelerator | Interface | Application
Hwacha | Rocket (Berkeley) | Vector accelerator | RoCC | HPC / scientific compute
Gemmini | Rocket + Chipyard | Systolic array (16×16) | RoCC + DMA | Deep learning inference (Berkeley)
CVA6 + NLP Accel | CVA6 (ETH Zurich) | NLP/BERT accelerator | AXI4 | Edge NLP inference
SiFive X280 | SiFive X280 | Custom vector extensions | Custom ISA (V-ext) | ML / DSP
Ara (ETH Zurich) | CVA6 | Vector coprocessor (4–16 lanes) | AXI4 + custom ISA | SIMD / vector compute
OpenTitan | Ibex (RV32) | AES / SHA / HMAC engines | MMIO (TL-UL bus) | Security / cryptography
CHIPS Alliance SoC | Rocket / BOOM | Custom DSP / ML blocks | RoCC + AXI4 | Open silicon research

Gemmini — The Open-Source RISC-V Systolic Array

Gemmini from UC Berkeley is the closest open-source example to what we'll build in this course. It's a parametric systolic array generator that integrates via RoCC into the Rocket chip — configurable tile size, datatype (INT8/FP32), scratchpad size, and DMA bandwidth. It ships with a runtime library and can run full DNN inference. We'll use Gemmini's architecture as inspiration for our own Day 4–5 systolic array design.

9. Course Roadmap — What You'll Build in 15 Days

Day | Topic | What You Build
1 | Architecture Overview (this page) | Understanding of all 4 models + which to use when
2 | Custom ISA Extension | custom-0 instruction + GCC .insn assembler usage
3 | RoCC Interface Deep Dive | RoCC MAC coprocessor in Verilog + protocol testbench
4 | Systolic Array Design | 8×8 weight-stationary systolic array + testbench
5 | Systolic Array via RoCC | Wire systolic array to RoCC + custom instruction test
6 | AXI4-Lite MMIO Accelerator | Complete AXI4-Lite slave with CSR file
7 | DMA Engine | AXI4 master DMA with burst + scatter-gather
8 | INT8 NN Inference Engine | Quantized MAC array with ReLU and bias
9 | Bare-Metal C Driver | Complete inference API in C for the NN engine
10 | Full RISC-V SoC Integration | Top-level SoC with AXI4 crossbar + all components
11 | Performance Optimization | Tiling large matrices, double-buffering
12 | Verification | SVA properties on AXI4 + formal RoCC protocol check
13 | FPGA Implementation | Bitstream for Arty A7 running real inference
14 | Physical Design | Floorplan + power domains for the full SoC
15 | Capstone | End-to-end RISC-V AI SoC with benchmarks

10. Day 1 Interview Q&A

# | Question | Key Answer Points
1 | What is the RoCC interface and what are its advantages? | Tightly-coupled coprocessor interface in Rocket chip. Passes CPU register values directly to accelerator via custom instructions. No memory roundtrip → 1–10 cycle latency. Disadvantage: tied to the Rocket/Chipyard ecosystem (Rocket, BOOM), not portable to other cores such as CVA6 without reimplementation.
2 | Why would you choose AXI4 over RoCC for an accelerator? | AXI4 is core-agnostic (works with any CPU), supports high-bandwidth DMA bursts for large data, is industry-standard (tools, IP, verification), and scales to complex multi-master SoCs. RoCC is better when latency of individual operations is the constraint and you're in the Rocket ecosystem.
3 | What RISC-V opcodes can you use for custom instructions? | custom-0 (0x0B), custom-1 (0x2B), custom-2 (0x5B), custom-3 (0x7B). custom-0 and custom-1 are avoided by standard extensions; custom-2 and custom-3 are reserved for RV128 but usable for custom extensions on RV32/RV64. With R-type encoding, funct3 + funct7 give 1024 sub-operations per opcode = 4096 total custom instruction variants.
4 | What is a systolic array and why is it good for matrix multiply? | A 2D grid of Processing Elements (PEs) where data flows in a pipelined wave: each PE does one MAC (a×b+c) and passes results to its neighbor. For matrix multiply, inputs stream in from left (activations) and top (weights), and partial sums accumulate through the array. In steady state every PE operates every cycle → compute efficiency near 100%, unlike a CPU which loads/stores between MACs.
5 | What is Gemmini? | Open-source parametric systolic array generator from UC Berkeley that integrates with Rocket chip via RoCC. Configurable tile size, INT8/FP32, DMA, scratchpad. Used for research and production RISC-V AI SoC prototyping. Inspiration for what we build in this course.
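The data flow behind question 4 is easier to see in a small cycle-by-cycle model. The sketch below simulates an output-stationary array, where each PE keeps its own partial sum. This is a deliberate simplification and not the weight-stationary design planned for Day 4. The names `SA_DIM` and `systolic_matmul` and the 4×4 size are illustrative assumptions. Row i of A enters from the left delayed by i cycles and column j of B enters from the top delayed by j cycles, so the operand pair (a[i][k], b[k][j]) meets at PE(i,j) in cycle k + i + j.

```c
#include <assert.h>
#include <stdint.h>

#define SA_DIM 4  /* illustrative array size */

/* Cycle-by-cycle model of an SA_DIM x SA_DIM output-stationary
 * systolic array computing c = a * b. Returns the cycle count. */
static int systolic_matmul(int32_t a[SA_DIM][SA_DIM],
                           int32_t b[SA_DIM][SA_DIM],
                           int32_t c[SA_DIM][SA_DIM])
{
    const int cycles = 3 * SA_DIM - 2;  /* last MAC happens in cycle 3*SA_DIM - 3 */
    assert(a != 0 && b != 0 && c != 0);

    for (int i = 0; i < SA_DIM; i++)
        for (int j = 0; j < SA_DIM; j++)
            c[i][j] = 0;

    for (int t = 0; t < cycles; t++) {
        for (int i = 0; i < SA_DIM; i++) {
            for (int j = 0; j < SA_DIM; j++) {
                int k = t - i - j;      /* operand pair reaching PE(i,j) now */
                if (k >= 0 && k < SA_DIM)
                    c[i][j] += a[i][k] * b[k][j];   /* one MAC per PE per cycle */
            }
        }
    }
    return cycles;
}
```

A single 4×4 tile takes 10 cycles for 64 MACs on 16 PEs, so fill and drain leave the array well under full utilisation. The near-100% figure applies when tiles stream back to back and the skew is overlapped.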

