What are the RoCC interface signals?

The RoCC interface has three channels. Command channel (CPU→RoCC): cmd.valid, cmd.ready, cmd.bits.inst (32-bit instruction word), cmd.bits.rs1 (register value), cmd.bits.rs2 (register value), cmd.bits.status (processor privilege/status). Response channel (RoCC→CPU): resp.valid, resp.ready, resp.bits.rd (destination register index), resp.bits.data (result value). Memory channel (RoCC→L1/L2 cache, optional): mem.req.valid/ready/bits and mem.resp.valid/bits — allows the coprocessor to issue its own load/store operations into the cache hierarchy.

How does the RoCC valid/ready handshake work?

RoCC uses a standard decoupled (valid/ready) handshake on both the command and response channels. A transaction occurs when valid and ready are both high on the same rising clock edge. For commands: the CPU asserts cmd.valid, and the RoCC coprocessor asserts cmd.ready when it can accept the instruction. If the RoCC is busy, it deasserts cmd.ready and the CPU pipeline stalls until the command is accepted. For responses: the RoCC asserts resp.valid when a result is ready, and the CPU asserts resp.ready when it can write the result to the register file. The RoCC must hold resp.valid high, with resp.bits unchanged, until the CPU accepts with resp.ready.

RoCC Interface Deep Dive — RISC-V Tightly Coupled Coprocessor | Day 3

1. What Is RoCC?

RoCC (Rocket Custom Coprocessor) is a standardised tightly-coupled accelerator interface defined by the UC Berkeley Rocket chip generator. It allows a custom accelerator to be attached inside the Rocket CPU pipeline — instructions are decoded by the CPU and dispatched to the coprocessor with register values piggybacked, results return directly to CPU registers, and (optionally) the coprocessor can issue its own load/store requests into the L1/L2 cache hierarchy.

The critical property of "tightly coupled" is that no memory roundtrip is required for operand passing. The CPU reads rs1 and rs2 from the integer register file and forwards their values over the RoCC command channel, so the coprocessor never touches memory to get its inputs. This is what keeps invoke latency in the 1–10 cycle range.

RoCC vs AXI4 — one-line summary

RoCC: operands arrive as register values over a direct pipeline connection — zero memory latency, Rocket-specific. AXI4: operands must be written to SRAM first, then the accelerator DMA-reads them — adds 50–500 cycles but works with any CPU and scales to gigabytes of data.

2. Complete RoCC Signal Reference

Signal	Dir	Width	Description
Command Channel (CPU → RoCC)
cmd.valid	CPU→RoCC	1	CPU has a command ready to dispatch
cmd.ready	RoCC→CPU	1	RoCC can accept a new command this cycle
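cmd.bits.inst	CPU→RoCC	32	Full 32-bit instruction word (funct7, rs2, rs1, bits 14:12, rd, opcode). Rocket defines bits 14:12 as the xd/xs1/xs2 flags; this tutorial treats them as a funct3 field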
cmd.bits.rs1	CPU→RoCC	64	Value of register rs1 from integer register file
cmd.bits.rs2	CPU→RoCC	64	Value of register rs2 from integer register file
cmd.bits.status	CPU→RoCC	varies	Processor privilege/mode status (M/S/U, interrupt enables)
Response Channel (RoCC → CPU)
resp.valid	RoCC→CPU	1	RoCC has a result ready to write back
resp.ready	CPU→RoCC	1	CPU register file can accept a write this cycle
resp.bits.rd	RoCC→CPU	5	Destination register index to write result into
resp.bits.data	RoCC→CPU	64	Result value to write to rd in register file
Memory Channel (RoCC ↔ L1 Cache, optional)
mem.req.valid	RoCC→Cache	1	RoCC issuing a load/store request
mem.req.ready	Cache→RoCC	1	Cache can accept the request
mem.req.bits.addr	RoCC→Cache	40	Physical address for the load/store
mem.req.bits.cmd	RoCC→Cache	5	M_XRD (load) or M_XWR (store)
mem.req.bits.data	RoCC→Cache	64	Store data (ignored for loads)
mem.resp.valid	Cache→RoCC	1	Cache returning load data
mem.resp.bits.data	Cache→RoCC	64	Loaded data from cache/memory
Control Signals
busy	RoCC→CPU	1	RoCC is processing — CPU won't context-switch
interrupt	RoCC→CPU	1	RoCC requesting an interrupt (error, completion)
exception	CPU→RoCC	1	CPU has taken an exception — RoCC should flush
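
The cmd.bits.inst layout in the table can be made concrete with a small sketch. The Python below is only an illustrative model, not part of Rocket or Chipyard: encode_custom0 and decode are hypothetical helper names, and the field positions assume the R-type layout this tutorial uses (funct7, rs2, rs1, funct3, rd, opcode) with custom-0 encoded as opcode 0x0B.

```python
CUSTOM0_OPCODE = 0x0B  # custom-0 major opcode, as used in this tutorial


def encode_custom0(funct7, rs2, rs1, funct3, rd):
    """Pack R-type style fields into a 32-bit instruction word."""
    fields = (("funct7", funct7, 7), ("rs2", rs2, 5), ("rs1", rs1, 5),
              ("funct3", funct3, 3), ("rd", rd, 5))
    for name, value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | CUSTOM0_OPCODE)


def decode(inst):
    """Split a 32-bit instruction word back into its fields."""
    return {
        "funct7": (inst >> 25) & 0x7F,
        "rs2":    (inst >> 20) & 0x1F,
        "rs1":    (inst >> 15) & 0x1F,
        "funct3": (inst >> 12) & 0x7,
        "rd":     (inst >> 7) & 0x1F,
        "opcode": inst & 0x7F,
    }


# MAC with rd=x10, rs1=x11, rs2=x12 (the encoding used by the testbench)
word = encode_custom0(funct7=0, rs2=12, rs1=11, funct3=0, rd=10)
print(hex(word))  # 0xc5850b
```

Note that rs1 and rs2 inside the instruction word are register indices; the register values travel separately on cmd.bits.rs1 and cmd.bits.rs2.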

3. Valid/Ready Handshake — Timing Diagrams

RoCC handshake rule (same as AXI4 decoupled):
  A transaction occurs when valid == 1 AND ready == 1 on the SAME rising clock edge.

cmd channel (CPU→RoCC):
  CPU asserts cmd.valid with cmd.bits.* stable
  RoCC asserts cmd.ready when it can accept
  → transfer on the cycle where both are high

resp channel (RoCC→CPU):
  RoCC asserts resp.valid with resp.bits.* stable
  CPU asserts resp.ready when the register file is available
  → transfer on the cycle where both are high

Back-pressure:
  If RoCC deasserts cmd.ready → CPU pipeline stalls (inserts bubbles)
  If CPU deasserts resp.ready → RoCC must hold resp.valid + resp.bits stable

Multi-cycle coprocessor:
  Cycle 0:        cmd.valid=1, cmd.ready=1 → command accepted
  Cycle 1..N-1:   RoCC computing, cmd.ready=0 (busy), resp.valid=0
  Cycle N:        resp.valid=1, resp.bits=result
  Cycle N+1:      resp.ready=1 from CPU → transfer complete

Fig 1: RoCC command accepted at C0 (cmd.valid ∧ cmd.ready). RoCC deasserts cmd.ready while computing. Response accepted at C5 (resp.valid ∧ resp.ready). CPU stalls between C0 and C5.
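
The handshake rule can also be checked with a tiny cycle-by-cycle model. The sketch below is a simplification written for this page (simulate_handshake is a hypothetical name): it models a single channel, assumes the producer asserts valid whenever it has data, and takes the consumer's ready value per cycle as an input list.

```python
def simulate_handshake(items, ready_pattern):
    """Cycle-by-cycle model of one valid/ready channel.

    items:         payloads the producer wants to send, in order.
    ready_pattern: the consumer's ready value for each cycle.
    Returns a list of (cycle, payload) for every completed transfer.
    """
    transfers = []
    pending = list(items)
    for cycle, ready in enumerate(ready_pattern):
        valid = bool(pending)          # producer asserts valid while it has data
        if valid and ready:            # transfer only when both are high
            transfers.append((cycle, pending.pop(0)))
        # valid and not ready: the payload stays at the head, i.e. held stable
    return transfers


log = simulate_handshake(["A", "B"], [False, True, False, False, True])
print(log)  # [(1, 'A'), (4, 'B')]
```

Payload "B" is offered from cycle 2 onward but only transfers in cycle 4, which is the back-pressure case described above: the sender holds its data until the receiver raises ready.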

4. RoCC Memory Channel

The optional RoCC memory interface lets the coprocessor issue its own load and store requests directly into the L1 data cache, without any software involvement. This is how accelerators like Hwacha and Gemmini transfer large data arrays — the CPU kicks off the operation with a single custom instruction, and the coprocessor DMA-reads hundreds of cache lines by itself.

RoCC memory request flow:
  1. Coprocessor asserts mem.req.valid with:
       mem.req.bits.addr = address of the access (physical when phys=1)
       mem.req.bits.cmd  = M_XRD (load) or M_XWR (store)
       mem.req.bits.typ  = MT_D (doubleword), MT_W (word), etc.
       mem.req.bits.data = store data (for writes)
  2. L1 cache asserts mem.req.ready when it can accept
  3. Transaction occurs (valid ∧ ready)
  4. For loads: cache returns mem.resp.valid + mem.resp.bits.data
     after the L1 hit latency (~4 cycles), or ~20 cycles when the access
     misses in L1 and is served from L2

Address translation:
  If mem.req.phys=1, the address is treated as PHYSICAL and is not translated
  If mem.req.phys=0, the L1 TLB handles virtual→physical translation
  → simpler: set phys=1 and use physical addresses directly
    (requires that software maps buffers to known physical addresses)

Memory ordering:
  RoCC memory requests are ordered relative to each other by the coprocessor
  They are NOT ordered relative to CPU stores unless the software uses
  fence instructions or the coprocessor checks the busy signal
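
To get a feel for what those latencies mean for a burst of loads, here is a rough back-of-the-envelope model. It is a sketch under strong assumptions that are not taken from Rocket itself: one request is accepted per cycle (mem.req.ready always high), the set of lines resident in L1 is fixed for the duration of the burst, and the two latency constants are simply the approximate figures quoted above. The function name load_completion_cycles is hypothetical.

```python
L1_HIT_CYCLES = 4    # approximate L1 hit latency quoted above
L1_MISS_CYCLES = 20  # approximate latency when the access is served from L2


def load_completion_cycles(addresses, cached_lines, line_bytes=64):
    """Return the cycle at which each load's response would arrive.

    addresses:    byte addresses, issued one per cycle starting at cycle 0.
    cached_lines: line indices (address // line_bytes) assumed resident in L1.
    """
    cached = set(cached_lines)
    done = []
    for issue_cycle, addr in enumerate(addresses):
        line = addr // line_bytes
        latency = L1_HIT_CYCLES if line in cached else L1_MISS_CYCLES
        done.append(issue_cycle + latency)
    return done


# Two loads to a resident line, then one to a line that is not in L1
print(load_completion_cycles([0x1000, 0x1008, 0x2000], {0x1000 // 64}))
# [4, 5, 22]
```

Because requests are pipelined, the hit responses arrive one cycle apart even though each takes about four cycles, which is why keeping several requests in flight matters more than the latency of any single one.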

5. Verilog — Complete RoCC MAC Coprocessor

Verilog — RoCC MAC coprocessor with pipelined multiply

// RoCC-compatible MAC coprocessor
// Supports 3 custom-0 operations:
//   funct7=0, funct3=0 → MAC:   accum += rs1 * rs2
//   funct7=0, funct3=1 → CLEAR: accum = 0
//   funct7=0, funct3=2 → READ:  rd    = accum[63:0]
//
// Pipelined: 2-cycle multiply latency
// Accepts back-to-back instructions when pipeline is not full

module rocc_mac #(
  parameter XLEN = 64   // 64 for RV64, 32 for RV32
)(
  input  logic              clk, rst_n,
  // ─── RoCC Command Channel ───────────────────────────────
  input  logic              cmd_valid,
  output logic              cmd_ready,
  input  logic [31:0]       cmd_inst,          // full instruction word
  input  logic [XLEN-1:0]   cmd_rs1,
  input  logic [XLEN-1:0]   cmd_rs2,
  // ─── RoCC Response Channel ──────────────────────────────
  output logic              resp_valid,
  input  logic              resp_ready,
  output logic [4:0]        resp_rd,
  output logic [XLEN-1:0]   resp_data,
  // ─── Control ────────────────────────────────────────────
  output logic              busy,
  output logic              interrupt
);

  // Instruction field extraction
  // NOTE: Rocket's RoCCInstruction defines bits [14:12] as the xd/xs1/xs2
  // flags; this tutorial design reuses them as a funct3-style operation select.
  logic [6:0] funct7;
  logic [2:0] funct3;
  logic [4:0] rd;
  assign funct7 = cmd_inst[31:25];
  assign funct3 = cmd_inst[14:12];
  assign rd     = cmd_inst[11:7];

  localparam logic [2:0] OP_MAC = 3'd0, OP_CLEAR = 3'd1, OP_READ = 3'd2;

  // Accumulator: 2*XLEN bits wide so a single XLEN x XLEN product always fits
  logic signed [2*XLEN-1:0] accumulator;

  // 2-stage pipeline. Every operation travels through both stages so that
  // CLEAR and READ stay ordered behind any MAC that is still in flight.
  logic                     pipe1_valid,   pipe2_valid;
  logic [2:0]               pipe1_op,      pipe2_op;
  logic [4:0]               pipe1_rd,      pipe2_rd;
  logic signed [2*XLEN-1:0] pipe1_product, pipe2_product;

  // The pipeline advances unless a response is still waiting for the CPU
  logic advance;
  assign advance   = !resp_valid || resp_ready;
  assign cmd_ready = advance;
  assign busy      = pipe1_valid || pipe2_valid || resp_valid;
  assign interrupt = 1'b0;  // no interrupt support in this impl

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      accumulator <= '0;
      pipe1_valid <= 1'b0; pipe1_op <= '0; pipe1_rd <= '0; pipe1_product <= '0;
      pipe2_valid <= 1'b0; pipe2_op <= '0; pipe2_rd <= '0; pipe2_product <= '0;
      resp_valid  <= 1'b0; resp_rd  <= '0; resp_data <= '0;
    end else if (advance) begin

      // ── Stage 1: accept command, start multiply ──────────
      // Only funct7=0 with a known operation enters the pipeline;
      // pipe1_valid is rewritten every cycle, so a command is never replayed
      pipe1_valid   <= cmd_valid && (funct7 == 7'd0) && (funct3 <= OP_READ);
      pipe1_op      <= funct3;
      pipe1_rd      <= rd;
      pipe1_product <= $signed(cmd_rs1) * $signed(cmd_rs2);

      // ── Stage 2: second multiply pipeline register ──────
      pipe2_valid   <= pipe1_valid;
      pipe2_op      <= pipe1_op;
      pipe2_rd      <= pipe1_rd;
      pipe2_product <= pipe1_product;

      // ── Commit: update accumulator, produce response ────
      // advance=1 means any earlier response is being accepted this cycle
      resp_valid <= 1'b0;
      if (pipe2_valid) begin
        case (pipe2_op)
          OP_MAC: begin
            accumulator <= accumulator + pipe2_product;
            resp_valid  <= 1'b1;
            resp_rd     <= pipe2_rd;
            resp_data   <= accumulator[XLEN-1:0];  // return old accum value
          end
          OP_CLEAR: accumulator <= '0;             // CLEAR sends no response
          OP_READ: begin
            resp_valid  <= 1'b1;
            resp_rd     <= pipe2_rd;
            resp_data   <= accumulator[XLEN-1:0];
          end
          default: ;
        endcase
      end

    end
  end

endmodule
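
When checking a design like this, it helps to have a software reference for the values the hardware should return. The following is a minimal behavioural sketch, assuming the semantics stated in the header comment of the listing (MAC responds with the accumulator value from before the update, CLEAR sends no response, READ returns the low XLEN bits). MacGoldenModel is a hypothetical name, and the model ignores timing entirely.

```python
class MacGoldenModel:
    """Timing-free reference model of the MAC coprocessor's state."""

    def __init__(self, xlen=64):
        self.xlen = xlen
        self.acc_bits = 2 * xlen   # accumulator is twice as wide as a register
        self.accumulator = 0

    @staticmethod
    def _to_signed(value, bits):
        """Interpret the low `bits` bits of value as two's complement."""
        value &= (1 << bits) - 1
        return value - (1 << bits) if value >> (bits - 1) else value

    def execute(self, funct3, rs1=0, rs2=0):
        """Apply one command; return response data, or None if no response."""
        mask = (1 << self.xlen) - 1
        if funct3 == 0:    # MAC: respond with the pre-update accumulator
            old = self.accumulator & mask
            product = (self._to_signed(rs1, self.xlen)
                       * self._to_signed(rs2, self.xlen))
            self.accumulator = self._to_signed(self.accumulator + product,
                                               self.acc_bits)
            return old
        if funct3 == 1:    # CLEAR: no response
            self.accumulator = 0
            return None
        if funct3 == 2:    # READ: low XLEN bits of the accumulator
            return self.accumulator & mask
        return None        # unknown operations are ignored


model = MacGoldenModel()
model.execute(1)                                   # CLEAR
print(model.execute(0, 3, 4), model.execute(0, 5, 6), model.execute(2))
# 0 12 42
```

The printed values are the same ones the protocol testbench expects from the hardware for the sequence CLEAR, MAC(3,4), MAC(5,6), READ.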

6. Chipyard Integration

Chipyard is the UC Berkeley SoC design framework that includes the Rocket core with RoCC support. Integrating a custom RoCC accelerator requires a Scala config mixin that instantiates your accelerator and wires it to the Rocket tile.

Scala — Chipyard RoCC integration (LazyRoCC pattern)

// In src/main/scala/rocc/MacAccelerator.scala
package chipyard.rocc

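import chisel3._
import chisel3.util._
import chisel3.experimental.IntParam
import freechips.rocketchip.tile._
import freechips.rocketchip.config._   // org.chipsalliance.cde.config._ in newer rocket-chip releases
import freechips.rocketchip.diplomacy._

// BlackBox wrapper for the Verilog module. The class name and port names
// must match rocc_mac exactly; the Verilog source itself still has to be
// added to the build (for example with HasBlackBoxResource).
class rocc_mac(xLen: Int) extends BlackBox(Map("XLEN" -> IntParam(xLen))) {
  val io = IO(new Bundle {
    val clk        = Input(Clock())
    val rst_n      = Input(Bool())
    val cmd_valid  = Input(Bool())
    val cmd_ready  = Output(Bool())
    val cmd_inst   = Input(UInt(32.W))
    val cmd_rs1    = Input(UInt(xLen.W))
    val cmd_rs2    = Input(UInt(xLen.W))
    val resp_valid = Output(Bool())
    val resp_ready = Input(Bool())
    val resp_rd    = Output(UInt(5.W))
    val resp_data  = Output(UInt(xLen.W))
    val busy       = Output(Bool())
    val interrupt  = Output(Bool())
  })
}

// LazyRoCC: Diplomacy wrapper for the RoCC accelerator
class MacAccelerator(opcodes: OpcodeSet)(implicit p: Parameters)
    extends LazyRoCC(opcodes, nPTWPorts = 0) {
  override lazy val module = new MacAcceleratorModule(this)
}

class MacAcceleratorModule(outer: MacAccelerator)(implicit p: Parameters)
    extends LazyRoCCModuleImp(outer) {

  // Wire to your Verilog (or write directly in Chisel)
  val mac = Module(new rocc_mac(xLen = 64))  // BlackBox for Verilog

  // Connect RoCC command channel
  mac.io.clk        := clock
  mac.io.rst_n      := !reset.asBool           // Verilog reset is active-low
  mac.io.cmd_valid  := io.cmd.valid
  io.cmd.ready      := mac.io.cmd_ready
  mac.io.cmd_inst   := io.cmd.bits.inst.asUInt
  mac.io.cmd_rs1    := io.cmd.bits.rs1
  mac.io.cmd_rs2    := io.cmd.bits.rs2

  // Connect RoCC response channel
  io.resp.valid       := mac.io.resp_valid
  mac.io.resp_ready   := io.resp.ready
  io.resp.bits.rd     := mac.io.resp_rd
  io.resp.bits.data   := mac.io.resp_data

  // Control
  io.busy       := mac.io.busy
  io.interrupt  := mac.io.interrupt

  // The memory channel is not used by this accelerator
  io.mem.req.valid := false.B
}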

// Config mixin — add to your SoC config
class WithMacAccelerator extends Config((site, here, up) => {
  case BuildRoCC => up(BuildRoCC, site) ++ Seq(
    (p: Parameters) => {
      val mac = LazyModule(new MacAccelerator(
        OpcodeSet.custom0)(p))  // custom-0 opcode (0x0B)
      mac
    }
  )
})

// Final SoC config
class MacRocketConfig extends Config(
  new WithMacAccelerator ++
  new freechips.rocketchip.system.DefaultConfig
)

7. Protocol Testbench

Verilog — RoCC MAC protocol testbench

`timescale 1ns/1ps
module tb_rocc_mac;
  localparam XLEN = 64;

  logic              clk=0, rst_n;
  logic              cmd_valid, resp_ready;
  logic [31:0]       cmd_inst;
  logic [XLEN-1:0]   cmd_rs1, cmd_rs2;
  logic              cmd_ready, resp_valid, busy, interrupt;
  logic [4:0]        resp_rd;
  logic [XLEN-1:0]   resp_data;
  logic [XLEN-1:0]   res;          // last response captured by send_cmd
  int                errors = 0;   // number of failed checks

  rocc_mac #(.XLEN(XLEN)) dut (.*);

  always #5 clk = ~clk;

  // Build instruction word: funct7|rs2|rs1|funct3|rd|opcode
  function automatic [31:0] make_insn(
    input [6:0] f7, input [4:0] rs2, input [4:0] rs1,
    input [2:0] f3, input [4:0] rd
  );
    make_insn = {f7, rs2, rs1, f3, rd, 7'h0B};
  endfunction

  // Send one command; wait for a response only if the operation produces one
  task automatic send_cmd(
    input  [31:0]     inst,
    input  [XLEN-1:0] rs1, rs2,
    input             expect_resp,
    output [XLEN-1:0] result
  );
    @(posedge clk);
    cmd_valid <= 1; cmd_inst <= inst;
    cmd_rs1   <= rs1; cmd_rs2 <= rs2;
    // Hold the command until the edge where cmd_valid && cmd_ready,
    // then drop cmd_valid so the command is issued exactly once
    do @(posedge clk); while (!cmd_ready);
    cmd_valid <= 0;
    result = '0;
    if (expect_resp) begin
      resp_ready <= 1;
      // Sample the response on the edge where resp_valid && resp_ready
      do @(posedge clk); while (!resp_valid);
      result = resp_data;
      resp_ready <= 0;
    end
  endtask

  initial begin
    rst_n=0; cmd_valid=0; resp_ready=0;
    cmd_inst='0; cmd_rs1='0; cmd_rs2='0;
    repeat(4) @(posedge clk);
    rst_n = 1;

    // Test 1: CLEAR (funct3=1) produces no response, so do not wait for one
    send_cmd(make_insn(0,0,0,1,10), 0, 0, 1'b0, res);
    $display("CLEAR done");

    // Test 2: MAC 3*4 (funct3=0, rd=x10, rs1=x11, rs2=x12)
    send_cmd(make_insn(0,12,11,0,10), 3, 4, 1'b1, res);
    $display("MAC(3,4): resp_data(prev accum)=%0d (expect 0)", res);
    assert(res==0) else begin $error("MAC1 wrong"); errors++; end

    // Test 3: MAC 5*6 = 30, accum should now be 12+30=42
    send_cmd(make_insn(0,12,11,0,10), 5, 6, 1'b1, res);
    $display("MAC(5,6): resp_data=%0d (expect 12)", res);
    assert(res==12) else begin $error("MAC2 wrong"); errors++; end

    // Test 4: READ accumulator (funct3=2)
    send_cmd(make_insn(0,0,0,2,10), 0, 0, 1'b1, res);
    $display("READ accum=%0d (expect 42)", res);
    assert(res==42) else begin $error("READ wrong"); errors++; end

    if (errors == 0) $display("All RoCC MAC tests PASSED");
    else             $display("%0d RoCC MAC test(s) FAILED", errors);
    $finish;
  end
endmodule

8. Interview Q&A

#	Question	Answer Points
1	What happens when a RoCC coprocessor deasserts cmd.ready?	The CPU pipeline stalls — it inserts pipeline bubbles and holds the custom instruction in the decode/issue stage until cmd.ready goes high again. This is the standard back-pressure mechanism. The CPU does not proceed to the next instruction while a RoCC command is pending and not yet accepted.
2	Can a RoCC coprocessor access memory without going through the CPU?	Yes, via the optional RoCC memory channel. The coprocessor issues load/store requests directly to the L1 data cache using mem.req.valid/ready/bits. The cache processes these independently of the CPU's own memory requests. Each request moves at most one 64-bit doubleword (mem.req.bits.data and mem.resp.bits.data are 64 bits wide), so high bandwidth comes from keeping multiple requests outstanding.
3	What is the busy signal used for in RoCC?	The busy signal tells the CPU that the coprocessor has outstanding work. In Rocket, a fence instruction stalls while busy=1, which lets software wait for in-flight coprocessor operations to finish. This matters for context switching: the coprocessor's internal state (e.g., the accumulator) is not part of the standard architectural register file and would be lost or corrupted if another thread took over mid-operation, so the OS should only context-switch after busy deasserts.
4	How does the RoCC response channel differ from a simple wire back to the register file?	The response uses a valid/ready handshake because the register file write port may be busy on the cycle the result is ready (another instruction might be using the write port). The RoCC coprocessor holds resp.valid + resp.bits stable until the CPU asserts resp.ready, at which point the register file write happens. This decouples the coprocessor's compute latency from the register file's availability.
5	What is LazyRoCC in Chipyard/Rocket-chip?	LazyRoCC is the Diplomacy-based wrapper class for RoCC accelerators in the Rocket-chip/Chipyard framework. The "lazy" refers to Diplomacy's LazyModule mechanism: hardware elaboration is deferred until Diplomacy has finished parameter negotiation. You extend LazyRoCC, specify which opcode set to intercept (e.g. OpcodeSet.custom0), and implement the module logic in LazyRoCCModuleImp. The config mixin adds your accelerator to the Rocket tile's BuildRoCC sequence.

Day 3 Knowledge Checklist

☐ Name all RoCC command channel signals and their directions
☐ Name all RoCC response channel signals and their directions
☐ Explain the valid/ready handshake rule (when does a transfer happen?)
☐ Describe what happens when cmd.ready is deasserted
☐ Explain what the busy signal prevents the CPU from doing
☐ Write a minimal RoCC coprocessor in Verilog (cmd→compute→resp)
☐ Describe how the RoCC memory channel accesses the L1 cache
☐ Know what LazyRoCC is and how to integrate in Chipyard
