This guide covers every signal on the Rocket Custom Coprocessor interface, the valid/ready handshake timing, the optional memory channel, a complete pipelined MAC coprocessor in Verilog, Chipyard integration steps, and a full protocol testbench.
RoCC (Rocket Custom Coprocessor) is a standardised tightly-coupled accelerator interface defined by the UC Berkeley Rocket chip generator. It allows a custom accelerator to be attached inside the Rocket CPU pipeline — instructions are decoded by the CPU and dispatched to the coprocessor with register values piggybacked, results return directly to CPU registers, and (optionally) the coprocessor can issue its own load/store requests into the L1/L2 cache hierarchy.
The critical property of "tightly coupled" is that no memory round trip is required for operand passing. The CPU reads rs1 and rs2 from the integer register file and forwards their values over the RoCC command channel, so the coprocessor never touches memory to get its inputs. This is what keeps invoke latency in the range of 1–10 cycles.
RoCC: operands arrive as register values over a direct pipeline connection — zero memory latency, Rocket-specific. AXI4: operands must be written to SRAM first, then the accelerator DMA-reads them — adds 50–500 cycles but works with any CPU and scales to gigabytes of data.
| Signal | Dir | Width | Description |
|---|---|---|---|
| Command Channel (CPU → RoCC) | | | |
| cmd.valid | CPU→RoCC | 1 | CPU has a command ready to dispatch |
| cmd.ready | RoCC→CPU | 1 | RoCC can accept a new command this cycle |
| cmd.bits.inst | CPU→RoCC | 32 | Full 32-bit instruction word (funct7, rs2, rs1, the xd/xs1/xs2 flags in the funct3 position, rd, opcode) |
| cmd.bits.rs1 | CPU→RoCC | 64 | Value of register rs1 from integer register file |
| cmd.bits.rs2 | CPU→RoCC | 64 | Value of register rs2 from integer register file |
| cmd.bits.status | CPU→RoCC | varies | Processor privilege/mode status (M/S/U, interrupt enables) |
| Response Channel (RoCC → CPU) | | | |
| resp.valid | RoCC→CPU | 1 | RoCC has a result ready to write back |
| resp.ready | CPU→RoCC | 1 | CPU register file can accept a write this cycle |
| resp.bits.rd | RoCC→CPU | 5 | Destination register index to write result into |
| resp.bits.data | RoCC→CPU | 64 | Result value to write to rd in register file |
| Memory Channel (RoCC ↔ L1 Cache, optional) | | | |
| mem.req.valid | RoCC→Cache | 1 | RoCC issuing a load/store request |
| mem.req.ready | Cache→RoCC | 1 | Cache can accept the request |
| mem.req.bits.addr | RoCC→Cache | 40 | Address for the load/store (translated by the cache's TLB unless the request is marked physical) |
| mem.req.bits.cmd | RoCC→Cache | 5 | M_XRD (load) or M_XWR (store) |
| mem.req.bits.data | RoCC→Cache | 64 | Store data (ignored for loads) |
| mem.resp.valid | Cache→RoCC | 1 | Cache returning load data |
| mem.resp.bits.data | Cache→RoCC | 64 | Loaded data from cache/memory |
| Control Signals | | | |
| busy | RoCC→CPU | 1 | RoCC has outstanding work; Rocket holds FENCE instructions until it deasserts |
| interrupt | RoCC→CPU | 1 | RoCC requesting an interrupt (error, completion) |
| exception | CPU→RoCC | 1 | CPU has taken an exception — RoCC should flush |
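To make the layout of cmd.bits.inst concrete, here is a minimal sketch of unpacking the instruction word with a packed struct. It assumes the standard RoCC R-type layout, in which bits [14:12] carry the xd, xs1 and xs2 flags. The names `rocc_pkg`, `rocc_inst_t` and `rocc_inst_demo` are hypothetical and exist only for this illustration.

```systemverilog
// Hypothetical helper package for illustration; not part of the article's design.
package rocc_pkg;
    typedef struct packed {
        logic [6:0] funct7;  // inst[31:25]: operation selector
        logic [4:0] rs2;     // inst[24:20]: source register 2 index
        logic [4:0] rs1;     // inst[19:15]: source register 1 index
        logic       xd;      // inst[14]:    CPU expects a response written to rd
        logic       xs1;     // inst[13]:    rs1 value is used
        logic       xs2;     // inst[12]:    rs2 value is used
        logic [4:0] rd;      // inst[11:7]:  destination register index
        logic [6:0] opcode;  // inst[6:0]:   custom-0 is 7'h0B
    } rocc_inst_t;
endpackage

module rocc_inst_demo;
    import rocc_pkg::*;
    rocc_inst_t inst;
    initial begin
        // funct7=0, rs2=x12, rs1=x11, xd=xs1=xs2=1, rd=x10, opcode=custom-0
        inst = 32'h00C5_F50B;
        $display("funct7=%0d rs2=x%0d rs1=x%0d rd=x%0d xd=%b opcode=0x%02h",
                 inst.funct7, inst.rs2, inst.rs1, inst.rd, inst.xd, inst.opcode);
        // prints: funct7=0 rs2=x12 rs1=x11 rd=x10 xd=1 opcode=0x0b
        $finish;
    end
endmodule
```

Because a packed struct is just a named view of a 32-bit vector, the same type can be assigned directly from a `[31:0]` port such as the command channel's instruction input.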
The optional RoCC memory interface lets the coprocessor issue its own load and store requests directly into the L1 data cache, without any software involvement: the CPU kicks off the operation with a single custom instruction, and the coprocessor then fetches its data by itself. Each request on this port moves at most one register-width word, so bulk-transfer accelerators such as Hwacha and Gemmini move their large data arrays mainly through a separate TileLink master port, which LazyRoCC also exposes.
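A single load over the memory channel can be sketched as a three-state machine: present the request until the cache accepts it, then wait for the response. This is a simplified illustration under stated assumptions. The module name `rocc_load_fsm`, the `start`/`done` controls and the flat port names are hypothetical, and the real cache interface carries further fields (tag, size, and so on) and can reject a request, none of which is modelled here.

```systemverilog
// Hypothetical single-load state machine on a simplified RoCC memory channel.
module rocc_load_fsm (
    input  logic        clk, rst_n,
    input  logic        start,           // pulse: begin one load
    input  logic [39:0] start_addr,      // address to load from
    output logic        done,            // pulse: load_data is valid
    output logic [63:0] load_data,
    // Simplified memory channel (subset of the signals in the table above)
    output logic        mem_req_valid,
    input  logic        mem_req_ready,
    output logic [39:0] mem_req_addr,
    output logic [4:0]  mem_req_cmd,
    input  logic        mem_resp_valid,
    input  logic [63:0] mem_resp_data
);
    localparam logic [4:0] M_XRD = 5'b00000;  // load command encoding

    typedef enum logic [1:0] {IDLE, REQ, WAIT_RESP} state_t;
    state_t state;

    // The request is presented for as long as we sit in REQ
    assign mem_req_valid = (state == REQ);
    assign mem_req_cmd   = M_XRD;

    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            state        <= IDLE;
            done         <= 1'b0;
            load_data    <= '0;
            mem_req_addr <= '0;
        end else begin
            done <= 1'b0;  // done is a one-cycle pulse
            case (state)
                IDLE: if (start) begin
                    mem_req_addr <= start_addr;
                    state        <= REQ;
                end
                // Hold the request until the cache accepts it
                REQ: if (mem_req_ready) state <= WAIT_RESP;
                // Capture the data when the cache responds
                WAIT_RESP: if (mem_resp_valid) begin
                    load_data <= mem_resp_data;
                    done      <= 1'b1;
                    state     <= IDLE;
                end
                default: state <= IDLE;
            endcase
        end
    end
endmodule
```

An accelerator that streams an array would wrap this in an address counter and keep busy asserted until the last response has arrived.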
```systemverilog
// RoCC-compatible MAC coprocessor
// Supports 3 custom-0 operations (funct7 must be 0):
//   funct3=0 → MAC:   accum += rs1 * rs2, rd = accum before the update
//   funct3=1 → CLEAR: accum = 0 (no response)
//   funct3=2 → READ:  rd = accum[XLEN-1:0]
//
// Note: on a real Rocket core, inst[14:12] carries the xd/xs1/xs2 flags, so a
// production design would select the operation with funct7 instead. The
// funct3 encoding is kept here for standalone simulation.
//
// Pipelined: 2-cycle multiply latency
// Accepts back-to-back MACs; CLEAR and READ wait for the pipeline to drain
module rocc_mac #(
    parameter XLEN = 64  // 64 for RV64, 32 for RV32
)(
    input  logic            clk, rst_n,
    // ─── RoCC Command Channel ───────────────────────────────
    input  logic            cmd_valid,
    output logic            cmd_ready,
    input  logic [31:0]     cmd_inst,   // full instruction word
    input  logic [XLEN-1:0] cmd_rs1,
    input  logic [XLEN-1:0] cmd_rs2,
    // ─── RoCC Response Channel ──────────────────────────────
    output logic            resp_valid,
    input  logic            resp_ready,
    output logic [4:0]      resp_rd,
    output logic [XLEN-1:0] resp_data,
    // ─── Control ────────────────────────────────────────────
    output logic            busy,
    output logic            interrupt
);
    // Instruction field extraction
    logic [6:0] funct7;
    logic [2:0] funct3;
    logic [4:0] rd;
    assign funct7 = cmd_inst[31:25];
    assign funct3 = cmd_inst[14:12];
    assign rd     = cmd_inst[11:7];

    // Accumulator: 2*XLEN bits (128 for RV64) so a full product fits
    logic [2*XLEN-1:0] accumulator;

    // 2-stage multiply pipeline
    logic              pipe1_valid;
    logic [4:0]        pipe1_rd;
    logic [2*XLEN-1:0] pipe1_product;
    logic              pipe2_valid;
    logic [4:0]        pipe2_rd;
    logic [2*XLEN-1:0] pipe2_product;

    // Back-pressure: freeze the pipeline while a response waits for the CPU
    logic resp_stall, drained, is_mac;
    assign resp_stall = resp_valid && !resp_ready;
    assign drained    = !pipe1_valid && !pipe2_valid;
    assign is_mac     = (funct7 == 7'd0) && (funct3 == 3'd0);

    // A MAC may enter whenever the pipeline advances. CLEAR and READ must
    // wait until every in-flight MAC has updated the accumulator.
    assign cmd_ready = !resp_stall && (is_mac || drained);
    assign busy      = pipe1_valid || pipe2_valid || resp_valid;
    assign interrupt = 1'b0;  // no interrupt support in this impl

    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            accumulator <= '0;
            pipe1_valid <= 0; pipe1_rd <= '0; pipe1_product <= '0;
            pipe2_valid <= 0; pipe2_rd <= '0; pipe2_product <= '0;
            resp_valid  <= 0; resp_rd  <= '0; resp_data     <= '0;
        end else if (!resp_stall) begin
            // Response register is empty or being accepted this cycle
            resp_valid <= 1'b0;

            // ── Stage 2: add to accumulator, respond ────────────
            if (pipe2_valid) begin
                accumulator <= accumulator + pipe2_product;
                resp_valid  <= 1'b1;
                resp_rd     <= pipe2_rd;
                resp_data   <= accumulator[XLEN-1:0];  // return old accum value
            end

            // ── Advance stage 1 into stage 2 ────────────────────
            pipe2_valid   <= pipe1_valid;
            pipe2_rd      <= pipe1_rd;
            pipe2_product <= pipe1_product;

            // ── Stage 1: accept command, start multiply ─────────
            pipe1_valid <= 1'b0;  // empty unless a MAC is accepted below
            if (cmd_valid && cmd_ready) begin
                case ({funct7, funct3})
                    {7'd0, 3'd0}: begin  // MAC: pipeline the multiply
                        pipe1_valid   <= 1'b1;
                        pipe1_rd      <= rd;
                        pipe1_product <= $signed(cmd_rs1) * $signed(cmd_rs2);
                    end
                    {7'd0, 3'd1}: begin  // CLEAR (pipeline is drained here)
                        accumulator <= '0;
                    end
                    {7'd0, 3'd2}: begin  // READ: respond directly, bypassing the pipeline
                        resp_valid <= 1'b1;
                        resp_rd    <= rd;
                        resp_data  <= accumulator[XLEN-1:0];
                    end
                    default: ;  // unknown operation: accept and ignore
                endcase
            end
        end
    end
endmodule
```

Chipyard is the UC Berkeley SoC generator framework built around the Rocket core with RoCC support. Integrating a custom RoCC accelerator requires a Scala config mixin that instantiates your accelerator and wires it to the Rocket tile.
```scala
// In src/main/scala/rocc/MacAccelerator.scala
package chipyard.rocc

import chisel3._
import chisel3.util._
import chisel3.experimental.IntParam
import freechips.rocketchip.tile._
import freechips.rocketchip.config._ // newer rocket-chip releases: org.chipsalliance.cde.config._
import freechips.rocketchip.diplomacy._

// BlackBox declaration for the Verilog module; the class name must match the
// Verilog module name. The resource path below is an assumption: point it at
// wherever rocc_mac.sv lives in your resources directory.
class rocc_mac(xLen: Int) extends BlackBox(Map("XLEN" -> IntParam(xLen)))
    with HasBlackBoxResource {
  val io = IO(new Bundle {
    val clk        = Input(Clock())
    val rst_n      = Input(Bool())
    val cmd_valid  = Input(Bool())
    val cmd_ready  = Output(Bool())
    val cmd_inst   = Input(UInt(32.W))
    val cmd_rs1    = Input(UInt(xLen.W))
    val cmd_rs2    = Input(UInt(xLen.W))
    val resp_valid = Output(Bool())
    val resp_ready = Input(Bool())
    val resp_rd    = Output(UInt(5.W))
    val resp_data  = Output(UInt(xLen.W))
    val busy       = Output(Bool())
    val interrupt  = Output(Bool())
  })
  addResource("/vsrc/rocc_mac.sv")
}

// LazyRoCC: Diplomacy wrapper for the RoCC accelerator
class MacAccelerator(opcodes: OpcodeSet)(implicit p: Parameters)
    extends LazyRoCC(opcodes, nPTWPorts = 0) {
  override lazy val module = new MacAcceleratorModule(this)
}

class MacAcceleratorModule(outer: MacAccelerator)(implicit p: Parameters)
    extends LazyRoCCModuleImp(outer) {
  // Wire to your Verilog (or write directly in Chisel)
  val mac = Module(new rocc_mac(xLen = 64)) // BlackBox for Verilog

  // Connect RoCC command channel
  mac.io.clk       := clock
  mac.io.rst_n     := !reset.asBool
  mac.io.cmd_valid := io.cmd.valid
  io.cmd.ready     := mac.io.cmd_ready
  mac.io.cmd_inst  := io.cmd.bits.inst.asUInt
  mac.io.cmd_rs1   := io.cmd.bits.rs1
  mac.io.cmd_rs2   := io.cmd.bits.rs2

  // Connect RoCC response channel
  io.resp.valid     := mac.io.resp_valid
  mac.io.resp_ready := io.resp.ready
  io.resp.bits.rd   := mac.io.resp_rd
  io.resp.bits.data := mac.io.resp_data

  // Control
  io.busy      := mac.io.busy
  io.interrupt := mac.io.interrupt

  // Memory channel is unused: never issue a request. Depending on the
  // rocket-chip version, further mem fields may need explicit tie-offs.
  io.mem.req.valid := false.B
}

// Config mixin: add to your SoC config
class WithMacAccelerator extends Config((site, here, up) => {
  case BuildRoCC => up(BuildRoCC, site) ++ Seq(
    (p: Parameters) => {
      val mac = LazyModule(new MacAccelerator(
        OpcodeSet.custom0)(p)) // custom-0 opcode (0x0B)
      mac
    }
  )
})

// Final SoC config
class MacRocketConfig extends Config(
  new WithMacAccelerator ++
  new freechips.rocketchip.system.DefaultConfig
)
```

```systemverilog
`timescale 1ns/1ps
module tb_rocc_mac;
    localparam XLEN = 64;
    logic clk = 0, rst_n;
    logic cmd_valid, resp_ready;
    logic [31:0] cmd_inst;
    logic [XLEN-1:0] cmd_rs1, cmd_rs2;
    logic cmd_ready, resp_valid, busy, interrupt;
    logic [4:0] resp_rd;
    logic [XLEN-1:0] resp_data;
    logic [XLEN-1:0] res;  // result captured by send_cmd

    rocc_mac #(.XLEN(XLEN)) dut (.*);
    always #5 clk = ~clk;

    // Build instruction word: funct7|rs2|rs1|funct3|rd|opcode
    function automatic [31:0] make_insn(
        input [6:0] f7, input [4:0] rs2, input [4:0] rs1,
        input [2:0] f3, input [4:0] rd
    );
        make_insn = {f7, rs2, rs1, f3, rd, 7'h0B};
    endfunction

    // Send one command; wait for a response only when the operation produces one
    task automatic send_cmd(
        input  [31:0]     inst,
        input  [XLEN-1:0] rs1, rs2,
        input  bit        wait_resp,
        output [XLEN-1:0] result
    );
        @(posedge clk);
        cmd_valid <= 1; cmd_inst <= inst;
        cmd_rs1 <= rs1; cmd_rs2 <= rs2;
        // The handshake completes on the first clock edge where cmd_ready is high
        do @(posedge clk); while (!cmd_ready);
        cmd_valid <= 0;
        result = '0;
        if (wait_resp) begin
            resp_ready <= 1;
            // The response transfers on the first edge where resp_valid is high
            do @(posedge clk); while (!resp_valid);
            result = resp_data;
            resp_ready <= 0;
        end
    endtask

    initial begin
        rst_n = 0; cmd_valid = 0; resp_ready = 0;
        repeat (4) @(posedge clk);
        rst_n = 1;
        // Test 1: CLEAR (funct3=1) produces no response, so do not wait for one
        send_cmd(make_insn(0,0,0,1,10), 0, 0, 1'b0, res);
        $display("CLEAR done");
        // Test 2: MAC 3*4 (funct3=0, rd=x10, rs1=x11, rs2=x12)
        send_cmd(make_insn(0,12,11,0,10), 3, 4, 1'b1, res);
        $display("MAC(3,4): resp_data(prev accum)=%0d (expect 0)", res);
        assert (res == 0) else $error("MAC1 wrong");
        // Test 3: MAC 5*6 = 30, accum should now be 12+30=42
        send_cmd(make_insn(0,12,11,0,10), 5, 6, 1'b1, res);
        $display("MAC(5,6): resp_data=%0d (expect 12)", res);
        assert (res == 12) else $error("MAC2 wrong");
        // Test 4: READ accumulator (funct3=2)
        send_cmd(make_insn(0,0,0,2,10), 0, 0, 1'b1, res);
        $display("READ accum=%0d (expect 42)", res);
        assert (res == 42) else $error("READ wrong");
        $display("RoCC MAC tests finished");
        $finish;
    end
endmodule
```

| # | Question | Answer Points |
|---|---|---|
| 1 | What happens when a RoCC coprocessor deasserts cmd.ready? | The CPU pipeline stalls — it inserts pipeline bubbles and holds the custom instruction in the decode/issue stage until cmd.ready goes high again. This is the standard back-pressure mechanism. The CPU does not proceed to the next instruction while a RoCC command is pending and not yet accepted. |
| 2 | Can a RoCC coprocessor access memory without going through the CPU? | Yes, via the optional RoCC memory channel. The coprocessor issues load/store requests directly to the L1 data cache using mem.req.valid/ready/bits. The cache processes these independently of the CPU's own memory requests. Each request moves at most one register-width word, and several requests can be outstanding at once. For cache-line-sized bulk transfers, an accelerator typically adds its own TileLink master port instead. |
| 3 | What is the busy signal used for in RoCC? | The busy signal tells the CPU that the coprocessor has outstanding work, including in-flight memory requests. In Rocket it gates fences: a FENCE instruction stalls while busy=1, so software can issue a fence to wait until the accelerator has finished. This matters for context switching because the coprocessor's internal state (e.g., the accumulator) is not part of the standard architectural register file. An OS that shares the accelerator between threads has to wait for busy to deassert and then save and restore that state itself. |
| 4 | How does the RoCC response channel differ from a simple wire back to the register file? | The response uses a valid/ready handshake because the register file write port may be busy on the cycle the result is ready (another instruction might be using the write port). The RoCC coprocessor holds resp.valid + resp.bits stable until the CPU asserts resp.ready, at which point the register file write happens. This decouples the coprocessor's compute latency from the register file's availability. |
| 5 | What is LazyRoCC in Chipyard/Rocket-chip? | LazyRoCC is the Diplomacy-based wrapper class for RoCC accelerators in the Rocket-chip/Chipyard framework. The "lazy" refers to Diplomacy's LazyModule mechanism: parameter negotiation between modules runs first, and the hardware itself is elaborated lazily afterwards. You extend LazyRoCC, specify which opcode set to intercept (e.g. OpcodeSet.custom0), and implement the module logic in LazyRoCCModuleImp. The config mixin adds your accelerator to the Rocket tile's BuildRoCC sequence. |
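The hold-until-accepted rule from question 4 can be checked automatically in simulation. The following is a minimal sketch of an assertion checker, assuming a 64-bit response and the flat signal names used by the MAC module. The module name `rocc_resp_checker` and the assertion labels are hypothetical.

```systemverilog
// Hypothetical protocol checker for the RoCC response channel.
module rocc_resp_checker (
    input logic        clk, rst_n,
    input logic        resp_valid, resp_ready,
    input logic [4:0]  resp_rd,
    input logic [63:0] resp_data
);
    // Once raised, resp_valid must stay high until the CPU accepts it
    property p_valid_held;
        @(posedge clk) disable iff (!rst_n)
        (resp_valid && !resp_ready) |=> resp_valid;
    endproperty

    // While the response is waiting, its payload must not change
    property p_bits_stable;
        @(posedge clk) disable iff (!rst_n)
        (resp_valid && !resp_ready) |=> ($stable(resp_rd) && $stable(resp_data));
    endproperty

    a_valid_held:  assert property (p_valid_held)
        else $error("resp_valid dropped before resp_ready");
    a_bits_stable: assert property (p_bits_stable)
        else $error("resp_rd/resp_data changed while waiting for resp_ready");
endmodule
```

One way to attach it without editing the design is a bind statement in the testbench, for example `bind rocc_mac rocc_resp_checker chk (.*);`, which connects the checker ports to the same-named signals inside each rocc_mac instance.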