
RoCC Interface Deep Dive

Every signal on the Rocket Custom Coprocessor interface, the valid/ready handshake timing, optional memory channel, a complete pipelined MAC coprocessor in Verilog, Chipyard integration steps, and a full protocol testbench.

By EcrioniX Engineering Team · Published June 19, 2026 · ~4,800 words · 16 min read

1. What Is RoCC?

RoCC (Rocket Custom Coprocessor) is a standardised tightly-coupled accelerator interface defined by the UC Berkeley Rocket chip generator. It allows a custom accelerator to be attached inside the Rocket CPU pipeline — instructions are decoded by the CPU and dispatched to the coprocessor with register values piggybacked, results return directly to CPU registers, and (optionally) the coprocessor can issue its own load/store requests into the L1/L2 cache hierarchy.

The critical property of "tightly coupled" is that no memory round trip is required for operand passing. The CPU reads rs1 and rs2 from the integer register file and forwards their values over the RoCC command channel, so the coprocessor never touches memory to get its inputs. This is what keeps invoke latency in the range of 1–10 cycles.

RoCC vs AXI4 — one-line summary

RoCC: operands arrive as register values over a direct pipeline connection — zero memory latency, Rocket-specific. AXI4: operands must be written to SRAM first, then the accelerator DMA-reads them — adds 50–500 cycles but works with any CPU and scales to gigabytes of data.

2. Complete RoCC Signal Reference

Signal | Dir | Width | Description

Command Channel (CPU → RoCC)
cmd.valid | CPU→RoCC | 1 | CPU has a command ready to dispatch
cmd.ready | RoCC→CPU | 1 | RoCC can accept a new command this cycle
cmd.bits.inst | CPU→RoCC | 32 | Full 32-bit instruction word (funct7, rs2, rs1, bits [14:12], rd, opcode). In Rocket's RoCCInstruction, bits [14:12] are the xd/xs1/xs2 flags; the MAC example in this article reads them as a 3-bit funct3 selector.
cmd.bits.rs1 | CPU→RoCC | 64 | Value of register rs1 from integer register file
cmd.bits.rs2 | CPU→RoCC | 64 | Value of register rs2 from integer register file
cmd.bits.status | CPU→RoCC | varies | Processor privilege/mode status (M/S/U, interrupt enables)

Response Channel (RoCC → CPU)
resp.valid | RoCC→CPU | 1 | RoCC has a result ready to write back
resp.ready | CPU→RoCC | 1 | CPU register file can accept a write this cycle
resp.bits.rd | RoCC→CPU | 5 | Destination register index to write result into
resp.bits.data | RoCC→CPU | 64 | Result value to write to rd in register file

Memory Channel (RoCC ↔ L1 Cache, optional)
mem.req.valid | RoCC→Cache | 1 | RoCC issuing a load/store request
mem.req.ready | Cache→RoCC | 1 | Cache can accept the request
mem.req.bits.addr | RoCC→Cache | 40 | Physical address for the load/store
mem.req.bits.cmd | RoCC→Cache | 5 | M_XRD (load) or M_XWR (store)
mem.req.bits.data | RoCC→Cache | 64 | Store data (ignored for loads)
mem.resp.valid | Cache→RoCC | 1 | Cache returning load data
mem.resp.bits.data | Cache→RoCC | 64 | Loaded data from cache/memory

Control Signals
busy | RoCC→CPU | 1 | RoCC is processing; CPU won't context-switch
interrupt | RoCC→CPU | 1 | RoCC requesting an interrupt (error, completion)
exception | CPU→RoCC | 1 | CPU has taken an exception; RoCC should flush
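
To make the cmd.bits.inst row concrete, the sketch below splits the 32-bit word into named fields. It assumes the field order of Rocket's RoCCInstruction bundle (funct, rs2, rs1, xd, xs1, xs2, rd, opcode, from MSB to LSB); the package name rocc_decode_pkg, the type rocc_inst_t and the function decode are hypothetical names chosen for illustration.

```systemverilog
// Hypothetical helper package; field order assumed from Rocket's RoCCInstruction.
package rocc_decode_pkg;
  typedef struct packed {
    logic [6:0] funct;   // inst[31:25] operation selector
    logic [4:0] rs2;     // inst[24:20] source register 2 index
    logic [4:0] rs1;     // inst[19:15] source register 1 index
    logic       xd;      // inst[14]    1 = instruction writes rd (response expected)
    logic       xs1;     // inst[13]    1 = instruction reads rs1
    logic       xs2;     // inst[12]    1 = instruction reads rs2
    logic [4:0] rd;      // inst[11:7]  destination register index
    logic [6:0] opcode;  // inst[6:0]   custom-0 is 7'h0B
  } rocc_inst_t;

  // A packed struct maps its first member to the MSBs, so a cast is enough.
  function automatic rocc_inst_t decode(input logic [31:0] inst);
    return rocc_inst_t'(inst);
  endfunction
endpackage

// Tiny self-check: build a custom-0 word by hand and decode it again.
module tb_rocc_decode;
  import rocc_decode_pkg::*;
  rocc_inst_t f;
  initial begin
    // funct=0, rs2=x12, rs1=x11, xd=xs1=xs2=1, rd=x10, opcode=custom-0
    f = decode({7'd0, 5'd12, 5'd11, 3'b111, 5'd10, 7'h0B});
    assert (f.opcode == 7'h0B && f.rd == 5'd10 && f.rs1 == 5'd11 &&
            f.rs2 == 5'd12 && f.xd && f.xs1 && f.xs2)
      else $error("decode mismatch");
    $display("rd=x%0d rs1=x%0d rs2=x%0d xd=%0b", f.rd, f.rs1, f.rs2, f.xd);
    $finish;
  end
endmodule
```

A design that follows this layout would pick its operation from funct and leave bits [14:12] to the xd/xs1/xs2 flags, which is worth keeping in mind when reading the funct3-based MAC example later in this article.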

3. Valid/Ready Handshake — Timing Diagrams

RoCC handshake rule (same as AXI4 decoupled):
  Transaction occurs when: valid == 1 AND ready == 1 on the SAME rising clock edge

cmd channel (CPU→RoCC):
  CPU asserts cmd.valid with cmd.bits.* stable
  RoCC asserts cmd.ready when it can accept
  → transfer on cycle where both are high

resp channel (RoCC→CPU):
  RoCC asserts resp.valid with resp.bits.* stable
  CPU asserts resp.ready when register file is available
  → transfer on cycle where both are high

Back-pressure:
  If RoCC deasserts cmd.ready → CPU pipeline stalls (inserts bubbles)
  If CPU deasserts resp.ready → RoCC must hold resp.valid + resp.bits stable

Multi-cycle coprocessor:
  Cycle 0:        cmd.valid=1, cmd.ready=1 → command accepted
  Cycle 1 to N-1: RoCC computing, cmd.ready=0 (busy), resp.valid=0
  Cycle N:        resp.valid=1, resp.bits=result
  Cycle N+1:      resp.ready=1 from CPU → transfer complete
Fig 1: RoCC command + response handshake timing (CLK cycles C0 to C6; traces: cmd.valid, cmd.ready, resp.valid, resp.ready). The command is accepted at C0 (cmd.valid ∧ cmd.ready). RoCC deasserts cmd.ready while computing. The response is accepted at C5 (resp.valid ∧ resp.ready). CPU stalls between C0 and C5.
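
The hold rule on the response channel (once resp.valid is high and resp.ready is low, resp.valid and resp.bits must not change) is easy to get wrong and easy to check in simulation. Below is a minimal SVA sketch, assuming flattened port names (resp_valid, resp_ready, resp_rd, resp_data) and an active-low reset rst_n as used by this article's Verilog; the module name rocc_resp_checker is hypothetical.

```systemverilog
// Hypothetical protocol checker for the RoCC response channel.
module rocc_resp_checker #(
  parameter XLEN = 64
)(
  input logic            clk,
  input logic            rst_n,
  input logic            resp_valid,
  input logic            resp_ready,
  input logic [4:0]      resp_rd,
  input logic [XLEN-1:0] resp_data
);
  // If a response is offered but not taken, the next cycle must still offer
  // the same response: valid stays high and the payload does not change.
  property p_resp_hold;
    @(posedge clk) disable iff (!rst_n)
      (resp_valid && !resp_ready) |=>
        (resp_valid && $stable(resp_rd) && $stable(resp_data));
  endproperty

  a_resp_hold: assert property (p_resp_hold)
    else $error("resp changed while waiting for resp_ready");

  // Coverage point: at least one completed transfer was observed.
  c_resp_xfer: cover property (
    @(posedge clk) disable iff (!rst_n) resp_valid && resp_ready);
endmodule

// Usage sketch: attach the checker to every rocc_mac instance without
// editing the design. The .* connection relies on matching signal names.
//   bind rocc_mac rocc_resp_checker #(.XLEN(XLEN)) u_resp_chk (.*);
```

The same property shape works for the command channel with the roles swapped, since there the CPU is the side that must hold cmd.valid and cmd.bits.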

4. RoCC Memory Channel

The optional RoCC memory interface lets the coprocessor issue its own load and store requests directly into the L1 data cache, without any software involvement. This is how accelerators like Hwacha and Gemmini transfer large data arrays — the CPU kicks off the operation with a single custom instruction, and the coprocessor DMA-reads hundreds of cache lines by itself.

RoCC memory request flow:
  1. Coprocessor asserts mem.req.valid with:
       mem.req.bits.addr = physical address
       mem.req.bits.cmd  = M_XRD (load) or M_XWR (store)
       mem.req.bits.typ  = MT_D (doubleword), MT_W (word), etc.
       mem.req.bits.data = store data (for writes)
  2. L1 cache asserts mem.req.ready when it can accept
  3. Transaction occurs (valid ∧ ready)
  4. For loads: cache returns mem.resp.valid + mem.resp.bits.data
     after L1 hit latency (~4 cycles) or L2 miss (~20 cycles)

Address translation:
  RoCC must use PHYSICAL addresses when mem.req.phys=1
  If mem.req.phys=0, the L1 TLB handles virtual→physical translation
  → simpler: set phys=1 and use physical addresses directly
    (requires that software maps buffers to known physical addresses)

Memory ordering:
  RoCC memory requests are ordered relative to each other by the coprocessor
  They are NOT ordered relative to CPU stores unless the software uses
  fence instructions or the coprocessor checks the busy signal
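
The request flow above can be sketched as a three-state FSM that issues one load and waits for the data. This is an illustration under stated assumptions, not Rocket's actual interface: the flattened port names, the module name rocc_load_fsm and the start/done controls are hypothetical, and the real L1 port carries extra fields (typ or size, tag, phys and others) that are left out here. The M_XRD encoding is taken from rocket-chip's memory-op constants and should be verified against the version in use.

```systemverilog
// Hypothetical single-load sequencer for the RoCC memory channel.
module rocc_load_fsm #(
  parameter XLEN   = 64,
  parameter ADDR_W = 40
)(
  input  logic              clk,
  input  logic              rst_n,
  input  logic              start,          // pulse: begin one load
  input  logic [ADDR_W-1:0] start_addr,     // address to load from
  output logic              mem_req_valid,
  input  logic              mem_req_ready,
  output logic [ADDR_W-1:0] mem_req_addr,
  output logic [4:0]        mem_req_cmd,
  input  logic              mem_resp_valid,
  input  logic [XLEN-1:0]   mem_resp_data,
  output logic              done,           // pulse: load_data is valid
  output logic [XLEN-1:0]   load_data
);
  localparam logic [4:0] M_XRD = 5'b00000;  // load (assumed encoding)

  typedef enum logic [1:0] {IDLE, REQ, WAIT_RESP} state_t;
  state_t state;

  // The request is held until the cache accepts it (valid && ready).
  assign mem_req_valid = (state == REQ);
  assign mem_req_cmd   = M_XRD;

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      state        <= IDLE;
      mem_req_addr <= '0;
      load_data    <= '0;
      done         <= 1'b0;
    end else begin
      done <= 1'b0;                          // done is a one-cycle pulse
      case (state)
        IDLE: if (start) begin
          mem_req_addr <= start_addr;        // latch address, keep it stable
          state        <= REQ;
        end
        REQ: if (mem_req_ready) state <= WAIT_RESP;   // step 3: accepted
        WAIT_RESP: if (mem_resp_valid) begin          // step 4: data returns
          load_data <= mem_resp_data;
          done      <= 1'b1;
          state     <= IDLE;
        end
        default: state <= IDLE;
      endcase
    end
  end
endmodule
```

Because the FSM allows only one request in flight, it trades bandwidth for simplicity; a higher-throughput design would tag requests and track several outstanding loads.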

5. Verilog — Complete RoCC MAC Coprocessor

Verilog — RoCC MAC coprocessor with pipelined multiply
// RoCC-compatible MAC coprocessor
// Supports 3 custom-0 operations:
//   funct7=0, funct3=0 → MAC:   accum += rs1 * rs2 (responds with the old accum)
//   funct7=0, funct3=1 → CLEAR: accum = 0          (no response)
//   funct7=0, funct3=2 → READ:  rd = accum[XLEN-1:0]
//
// NOTE: Rocket's RoCCInstruction uses inst[14:12] as the xd/xs1/xs2 flags.
// This example reads those bits as a 3-bit funct3 selector for simplicity;
// a Rocket-accurate design would select the operation with funct7 instead.
//
// Pipelined: 2 register stages ahead of the accumulator.
// All operations flow through the same pipeline, so they execute in order.
// Accepts back-to-back instructions unless a response is waiting on resp_ready.

module rocc_mac #(
  parameter XLEN = 64   // 64 for RV64, 32 for RV32
)(
  input  logic            clk, rst_n,
  // RoCC Command Channel
  input  logic            cmd_valid,
  output logic            cmd_ready,
  input  logic [31:0]     cmd_inst,    // full instruction word
  input  logic [XLEN-1:0] cmd_rs1,
  input  logic [XLEN-1:0] cmd_rs2,
  // RoCC Response Channel
  output logic            resp_valid,
  input  logic            resp_ready,
  output logic [4:0]      resp_rd,
  output logic [XLEN-1:0] resp_data,
  // Control
  output logic            busy,
  output logic            interrupt
);

  localparam logic [2:0] OP_MAC = 3'd0, OP_CLEAR = 3'd1, OP_READ = 3'd2;

  // Instruction field extraction
  logic [6:0] funct7;   // decoded for completeness; unused in this example
  logic [2:0] funct3;
  logic [4:0] rd;
  assign funct7 = cmd_inst[31:25];
  assign funct3 = cmd_inst[14:12];
  assign rd     = cmd_inst[11:7];

  // Accumulator: 2*XLEN bits wide to hold a full XLEN x XLEN product
  logic [2*XLEN-1:0] accumulator;

  // 2-stage pipeline: each stage carries the operation, rd and the product
  logic              pipe1_valid, pipe2_valid;
  logic [2:0]        pipe1_op,    pipe2_op;
  logic [4:0]        pipe1_rd,    pipe2_rd;
  logic [2*XLEN-1:0] pipe1_product, pipe2_product;

  // The pipeline moves whenever the response register is free or being drained.
  // While a response waits for resp_ready, everything (including resp_*) holds.
  logic advance;
  assign advance   = !resp_valid || resp_ready;
  assign cmd_ready = advance;
  assign busy      = pipe1_valid || pipe2_valid || resp_valid;
  assign interrupt = 1'b0;   // no interrupt support in this impl

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      accumulator   <= '0;
      pipe1_valid   <= 1'b0;  pipe1_op <= '0;  pipe1_rd <= '0;  pipe1_product <= '0;
      pipe2_valid   <= 1'b0;  pipe2_op <= '0;  pipe2_rd <= '0;  pipe2_product <= '0;
      resp_valid    <= 1'b0;  resp_rd  <= '0;  resp_data <= '0;
    end else if (advance) begin
      // Stage 1: accept command (cmd_ready is high here), register the product
      pipe1_valid   <= cmd_valid;
      pipe1_op      <= funct3;
      pipe1_rd      <= rd;
      pipe1_product <= $signed(cmd_rs1) * $signed(cmd_rs2);

      // Stage 2: second pipeline register
      pipe2_valid   <= pipe1_valid;
      pipe2_op      <= pipe1_op;
      pipe2_rd      <= pipe1_rd;
      pipe2_product <= pipe1_product;

      // Execute: update the accumulator and produce a response if needed
      resp_valid <= 1'b0;   // previous response (if any) was accepted this cycle
      if (pipe2_valid) begin
        case (pipe2_op)
          OP_MAC: begin
            accumulator <= accumulator + pipe2_product;
            resp_valid  <= 1'b1;
            resp_rd     <= pipe2_rd;
            resp_data   <= accumulator[XLEN-1:0];   // return old accum value
          end
          OP_CLEAR: accumulator <= '0;
          OP_READ: begin
            resp_valid <= 1'b1;
            resp_rd    <= pipe2_rd;
            resp_data  <= accumulator[XLEN-1:0];
          end
          default: ;   // unknown operation: ignore
        endcase
      end
    end
  end

endmodule

6. Chipyard Integration

Chipyard is the UC Berkeley SoC generator framework; its Rocket core supports RoCC. Integrating a custom RoCC accelerator requires a Scala config mixin that instantiates your accelerator and wires it to the Rocket tile.

Scala — Chipyard RoCC integration (LazyRoCC pattern)
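// In src/main/scala/rocc/MacAccelerator.scala
package chipyard.rocc

import chisel3._
import chisel3.util._
import chisel3.experimental.IntParam
import freechips.rocketchip.tile._
import freechips.rocketchip.config._   // newer rocket-chip versions: org.chipsalliance.cde.config._
import freechips.rocketchip.diplomacy._

// BlackBox wrapper for the Verilog module; port names must match rocc_mac exactly.
// Assumption: the Verilog source is available as the resource /vsrc/rocc_mac.sv.
class rocc_mac(xLen: Int) extends BlackBox(Map("XLEN" -> IntParam(xLen)))
    with HasBlackBoxResource {
  val io = IO(new Bundle {
    val clk        = Input(Clock())
    val rst_n      = Input(Bool())
    val cmd_valid  = Input(Bool())
    val cmd_ready  = Output(Bool())
    val cmd_inst   = Input(UInt(32.W))
    val cmd_rs1    = Input(UInt(xLen.W))
    val cmd_rs2    = Input(UInt(xLen.W))
    val resp_valid = Output(Bool())
    val resp_ready = Input(Bool())
    val resp_rd    = Output(UInt(5.W))
    val resp_data  = Output(UInt(xLen.W))
    val busy       = Output(Bool())
    val interrupt  = Output(Bool())
  })
  addResource("/vsrc/rocc_mac.sv")
}

// LazyRoCC: Diplomacy wrapper for the RoCC accelerator
class MacAccelerator(opcodes: OpcodeSet)(implicit p: Parameters)
    extends LazyRoCC(opcodes, nPTWPorts = 0) {
  override lazy val module = new MacAcceleratorModule(this)
}

class MacAcceleratorModule(outer: MacAccelerator)(implicit p: Parameters)
    extends LazyRoCCModuleImp(outer) {
  // Wire to your Verilog (or write directly in Chisel)
  val mac = Module(new rocc_mac(xLen = 64))

  // Connect RoCC command channel
  mac.io.clk       := clock
  mac.io.rst_n     := !reset.asBool
  mac.io.cmd_valid := io.cmd.valid
  io.cmd.ready     := mac.io.cmd_ready
  mac.io.cmd_inst  := io.cmd.bits.inst.asUInt
  mac.io.cmd_rs1   := io.cmd.bits.rs1
  mac.io.cmd_rs2   := io.cmd.bits.rs2

  // Connect RoCC response channel
  io.resp.valid     := mac.io.resp_valid
  mac.io.resp_ready := io.resp.ready
  io.resp.bits.rd   := mac.io.resp_rd
  io.resp.bits.data := mac.io.resp_data

  // Control
  io.busy      := mac.io.busy
  io.interrupt := mac.io.interrupt

  // The memory channel is unused by this accelerator
  io.mem.req.valid := false.B
}

// Config mixin: add to your SoC config
class WithMacAccelerator extends Config((site, here, up) => {
  case BuildRoCC => up(BuildRoCC, site) ++ Seq(
    (p: Parameters) => {
      val mac = LazyModule(new MacAccelerator(OpcodeSet.custom0)(p))  // custom-0 opcode (0x0B)
      mac
    }
  )
})

// Final SoC config
class MacRocketConfig extends Config(
  new WithMacAccelerator ++
  new freechips.rocketchip.system.DefaultConfig
)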

7. Protocol Testbench

Verilog — RoCC MAC protocol testbench
`timescale 1ns/1ps
module tb_rocc_mac;
  localparam XLEN = 64;

  logic clk = 0, rst_n;
  logic cmd_valid, resp_ready;
  logic [31:0] cmd_inst;
  logic [XLEN-1:0] cmd_rs1, cmd_rs2;
  logic cmd_ready, resp_valid, busy, interrupt;
  logic [4:0] resp_rd;
  logic [XLEN-1:0] resp_data;
  logic [XLEN-1:0] res;
  int errors = 0;

  rocc_mac #(.XLEN(XLEN)) dut (.*);

  always #5 clk = ~clk;

  // Build instruction word: funct7|rs2|rs1|funct3|rd|opcode
  function automatic [31:0] make_insn(
    input [6:0] f7, input [4:0] rs2, input [4:0] rs1,
    input [2:0] f3, input [4:0] rd
  );
    make_insn = {f7, rs2, rs1, f3, rd, 7'h0B};
  endfunction

  // Send one command; wait for a response only if the operation produces one
  // (CLEAR does not respond, so waiting on it would hang the test).
  task automatic send_cmd(
    input  [31:0]     inst,
    input  [XLEN-1:0] rs1, rs2,
    input  logic      expect_resp,
    output [XLEN-1:0] result
  );
    cmd_valid <= 1; cmd_inst <= inst; cmd_rs1 <= rs1; cmd_rs2 <= rs2;
    // The command transfers on the first clock edge where cmd_ready is high
    do @(posedge clk); while (!cmd_ready);
    cmd_valid <= 0;
    result = '0;
    if (expect_resp) begin
      resp_ready <= 1;
      // The response transfers on the first edge where resp_valid is high
      do @(posedge clk); while (!resp_valid);
      result = resp_data;
      resp_ready <= 0;
    end
  endtask

  initial begin
    rst_n = 0; cmd_valid = 0; resp_ready = 0;
    repeat (4) @(posedge clk);
    rst_n <= 1;
    @(posedge clk);

    // Test 1: CLEAR (funct3=1), no response expected
    send_cmd(make_insn(0,0,0,1,10), 0, 0, 1'b0, res);
    $display("CLEAR done");

    // Test 2: MAC 3*4 (funct3=0, rd=x10, rs1=x11, rs2=x12)
    send_cmd(make_insn(0,12,11,0,10), 3, 4, 1'b1, res);
    $display("MAC(3,4): resp_data(prev accum)=%0d (expect 0)", res);
    assert (res == 0) else begin $error("MAC1 wrong"); errors++; end

    // Test 3: MAC 5*6 = 30, accum should now be 12+30=42
    send_cmd(make_insn(0,12,11,0,10), 5, 6, 1'b1, res);
    $display("MAC(5,6): resp_data=%0d (expect 12)", res);
    assert (res == 12) else begin $error("MAC2 wrong"); errors++; end

    // Test 4: READ accumulator (funct3=2)
    send_cmd(make_insn(0,0,0,2,10), 0, 0, 1'b1, res);
    $display("READ accum=%0d (expect 42)", res);
    assert (res == 42) else begin $error("READ wrong"); errors++; end

    if (errors == 0) $display("All RoCC MAC tests PASSED");
    else             $display("%0d RoCC MAC test(s) FAILED", errors);
    $finish;
  end
endmodule

8. Interview Q&A

# | Question | Answer Points
1 | What happens when a RoCC coprocessor deasserts cmd.ready? | The CPU pipeline stalls: it inserts pipeline bubbles and holds the custom instruction in the decode/issue stage until cmd.ready goes high again. This is the standard back-pressure mechanism. The CPU does not proceed to the next instruction while a RoCC command is pending and not yet accepted.
2 | Can a RoCC coprocessor access memory without going through the CPU? | Yes, via the optional RoCC memory channel. The coprocessor issues load/store requests directly to the L1 data cache using mem.req.valid/ready/bits. The cache processes these independently of the CPU's own memory requests. Each request on this port moves at most one 64-bit doubleword (mem.req.bits.data is 64 bits wide), so bandwidth comes from keeping several requests outstanding rather than from wide transfers.
3 | What is the busy signal used for in RoCC? | The busy signal tells the CPU that the coprocessor has outstanding work. When busy=1, the CPU will not context-switch (save/restore architectural state for another thread) because the coprocessor's internal state (e.g., the accumulator) is not part of the standard architectural register file and would be lost. The OS will only allow context switching after busy deasserts.
4 | How does the RoCC response channel differ from a simple wire back to the register file? | The response uses a valid/ready handshake because the register file write port may be busy on the cycle the result is ready (another instruction might be using the write port). The RoCC coprocessor holds resp.valid + resp.bits stable until the CPU asserts resp.ready, at which point the register file write happens. This decouples the coprocessor's compute latency from the register file's availability.
5 | What is LazyRoCC in Chipyard/Rocket-chip? | LazyRoCC is the Diplomacy-based wrapper class for RoCC accelerators in the Rocket-chip/Chipyard framework. The "lazy" refers to Diplomacy's LazyModule mechanism, which defers hardware elaboration until parameter negotiation is complete. You extend LazyRoCC, specify which opcode set to intercept (e.g. OpcodeSet.custom0), and implement the module logic in LazyRoCCModuleImp. The config mixin adds your accelerator to the Rocket tile's BuildRoCC sequence.

