RISC-V + Accelerator · Day 05 of 15

Systolic Array via RoCC
Full End-to-End Integration

By EcrioniX · Updated June 2026 · ~50 min read
RoCC Wrapper · FSM Controller · Weight Load · Run MatMul · Chipyard Scala · C Driver · funct7 decode

Connecting Systolic Array to RoCC

In Day 4 we built a standalone 2×2 systolic array. Now we wrap it with a RoCC interface so the RISC-V CPU can control it using custom instructions. The key design decisions: what does each funct7 command do, how does the FSM sequence load→compute→respond, and how does the C driver invoke it?

Integration Architecture

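The CPU issues .insn r 0x0B, funct3, funct7, rd, rs1, rs2 → RoCC cmd channel → the controller FSM decodes funct7 → drives the systolic array phases → sends the result back via the resp channel. In the RoCC format, funct3 carries the xd/xs1/xs2 flags that say whether rd, rs1 and rs2 are used. Memory access (weight/activation DMA) uses the RoCC mem channel.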
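To make the instruction format concrete, here is a minimal sketch of how the fields of a custom-0 RoCC instruction pack into a 32-bit word. It follows the standard RISC-V R-type layout, with the funct3 bits acting as the RoCC xd/xs1/xs2 flags. The helper names (rocc_encode, rocc_funct7) are hypothetical and exist only for illustration; in practice the assembler does this packing for you from the .insn directive.

```c
#include <stdint.h>

#define ROCC_OPCODE_CUSTOM0 0x0Bu  /* custom-0 major opcode */
#define ROCC_XD_XS1_XS2     0x7u   /* funct3 with xd=1, xs1=1, xs2=1 */

/* Pack an R-type custom instruction (hypothetical helper).
 * Layout: funct7[31:25] rs2[24:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0] */
static uint32_t rocc_encode(uint32_t opcode, uint32_t rd, uint32_t funct3,
                            uint32_t rs1, uint32_t rs2, uint32_t funct7)
{
    return ((funct7 & 0x7Fu) << 25) |
           ((rs2    & 0x1Fu) << 20) |
           ((rs1    & 0x1Fu) << 15) |
           ((funct3 & 0x07u) << 12) |
           ((rd     & 0x1Fu) <<  7) |
            (opcode & 0x7Fu);
}

/* Extract the field the controller FSM dispatches on. */
static uint32_t rocc_funct7(uint32_t insn)
{
    return insn >> 25;
}
```

For example, rocc_encode(ROCC_OPCODE_CUSTOM0, 10, ROCC_XD_XS1_XS2, 11, 12, 0x01) describes a RUN_MATMUL command that reads a1 (x11) and a2 (x12) and writes a0 (x10).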

Command Encoding

funct7   Operation     rs1                       rs2                    Response
0x00     LOAD_WEIGHT   Weight matrix base addr   Matrix size N          ACK when done
0x01     RUN_MATMUL    Input matrix A addr       Output matrix C addr   Cycle count
0x02     READ_STATUS   -                         -                      busy/idle status
Verilog — RoCC wrapper FSM for systolic array
module rocc_systolic (
  input         clk, rst,
  // RoCC cmd channel
  input         cmd_valid,
  output        cmd_ready,
  input  [6:0] funct7,
  input  [4:0] rd,
  input  [63:0] rs1, rs2,
  // RoCC resp channel
  output reg    resp_valid,
  input         resp_ready,
  output reg [4:0]  resp_rd,
  output reg [63:0] resp_data,
  // busy: high while a command is in flight (the core holds fences while it is set)
  output        busy
);
  // FSM states
  localparam IDLE        = 3'd0;
  localparam LOAD_W     = 3'd1;  // loading weights into PE array
  localparam WAIT_LOAD  = 3'd2;  // waiting for DMA weight load
  localparam RUN        = 3'd3;  // streaming activations, collecting output
  localparam RESPOND   = 3'd4;  // sending result back

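  reg [2:0]  state;
  reg [63:0] w_addr, a_addr, c_addr;
  reg [31:0] cycle_cnt;
  reg [4:0]  saved_rd;

  // Completion flags used by the FSM below. The DMA engine and the
  // systolic array that drive them are not shown in this listing.
  wire dma_done;
  wire matmul_done;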

  assign cmd_ready = (state == IDLE);
  assign busy      = (state != IDLE);

  always @(posedge clk) begin
    if (rst) begin
      state      <= IDLE;
      resp_valid <= 1'b0;  // no response pending after reset
    end
    else case (state)

      IDLE: if (cmd_valid) begin
        saved_rd <= rd;
        cycle_cnt <= 0;
        case (funct7)
          7'h00: begin  // LOAD_WEIGHT
            w_addr <= rs1; state <= LOAD_W;
          end
          7'h01: begin  // RUN_MATMUL
            a_addr <= rs1; c_addr <= rs2; state <= RUN;
          end
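          7'h02: begin  // READ_STATUS: only accepted in IDLE, so it reports idle (0)
            resp_data <= 64'd0; state <= RESPOND;
          end
          default: begin  // unknown funct7: still respond so the CPU is not left waiting on rd
            resp_data <= 64'd0; state <= RESPOND;
          end
        endcase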
      end

      LOAD_W:    state <= WAIT_LOAD;  // initiate DMA via mem channel
      WAIT_LOAD: if (dma_done) state <= RESPOND;

      RUN: begin
        cycle_cnt <= cycle_cnt + 1;
        if (matmul_done) begin
          resp_data <= cycle_cnt; state <= RESPOND;
        end
      end

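      RESPOND: begin
        if (resp_valid && resp_ready) begin
          // handshake completed: the core has taken the response
          resp_valid <= 0; state <= IDLE;
        end else begin
          // raise valid and hold it until the core is ready
          resp_valid <= 1; resp_rd <= saved_rd;
        end
      end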
    endcase
  end
endmodule
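A cycle-level software model of the same FSM can be handy as a reference when writing the testbench. The following is a sketch, not a replacement for RTL simulation: it mirrors the state names and the load→compute→respond sequence of the module above, assumes the response is held valid until the core accepts it, and uses hypothetical names (fsm_t, fsm_in_t, fsm_step) that do not appear in the RTL.

```c
#include <stdbool.h>
#include <stdint.h>

enum { ST_IDLE, ST_LOAD_W, ST_WAIT_LOAD, ST_RUN, ST_RESPOND };
enum { F_LOAD_WEIGHT = 0x00, F_RUN_MATMUL = 0x01, F_READ_STATUS = 0x02 };

typedef struct {            /* registered state of the wrapper */
    int      state;
    uint32_t cycle_cnt;
    uint64_t resp_data;
    uint8_t  saved_rd, resp_rd;
    bool     resp_valid;
} fsm_t;

typedef struct {            /* inputs sampled at each clock edge */
    bool    cmd_valid;
    uint8_t funct7, rd;
    bool    dma_done, matmul_done, resp_ready;
} fsm_in_t;

static bool fsm_cmd_ready(const fsm_t *f) { return f->state == ST_IDLE; }
static bool fsm_busy(const fsm_t *f)      { return f->state != ST_IDLE; }

static void fsm_reset(fsm_t *f)
{
    f->state = ST_IDLE;
    f->cycle_cnt = 0;
    f->resp_data = 0;
    f->saved_rd = f->resp_rd = 0;
    f->resp_valid = false;
}

/* Advance one clock edge. The next state is computed from a snapshot of
 * the current state, which mimics Verilog non-blocking assignments. */
static void fsm_step(fsm_t *f, const fsm_in_t *in)
{
    fsm_t n = *f;
    switch (f->state) {
    case ST_IDLE:
        if (in->cmd_valid) {
            n.saved_rd  = in->rd;
            n.cycle_cnt = 0;
            switch (in->funct7) {
            case F_LOAD_WEIGHT: n.state = ST_LOAD_W; break;
            case F_RUN_MATMUL:  n.state = ST_RUN;    break;
            case F_READ_STATUS: n.resp_data = 0; n.state = ST_RESPOND; break;
            default: break;  /* unknown command: stay idle in this model */
            }
        }
        break;
    case ST_LOAD_W:
        n.state = ST_WAIT_LOAD;
        break;
    case ST_WAIT_LOAD:
        if (in->dma_done) n.state = ST_RESPOND;
        break;
    case ST_RUN:
        n.cycle_cnt = f->cycle_cnt + 1;
        if (in->matmul_done) {
            n.resp_data = f->cycle_cnt;  /* value before this edge's increment */
            n.state = ST_RESPOND;
        }
        break;
    case ST_RESPOND:
        if (f->resp_valid && in->resp_ready) {
            n.resp_valid = false;
            n.state = ST_IDLE;
        } else {
            n.resp_valid = true;
            n.resp_rd = f->saved_rd;
        }
        break;
    }
    *f = n;
}
```

Stepping the model with the same stimulus as the Verilog testbench and comparing state, resp_valid and resp_data each cycle is one way to catch handshake mistakes early.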

Chipyard LazyRoCC Scala Wrapper

Scala — Chipyard LazyRoCC integration
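// In src/main/scala/SystolicAccelerator.scala
// Import paths assume a recent rocket-chip; older trees use
// freechips.rocketchip.config.{Config, Parameters} instead.
import chisel3._
import org.chipsalliance.cde.config.{Config, Parameters}
import freechips.rocketchip.diplomacy.LazyModule
import freechips.rocketchip.tile.{BuildRoCC, LazyRoCC, LazyRoCCModuleImp, OpcodeSet}
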
class SystolicAccelerator(implicit p: Parameters)
  extends LazyRoCC(
    opcodes = OpcodeSet.custom0,   // uses 0x0B opcode space
    nPTWPorts = 0
  ) {
  override lazy val module = new SystolicAcceleratorImp(this)
}

class SystolicAcceleratorImp(outer: SystolicAccelerator)
  extends LazyRoCCModuleImp(outer) {

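  // RoccSystolicBlackBox is a Chisel BlackBox (definition not shown here)
  // whose ports mirror the Verilog rocc_systolic module.
  val sys = Module(new RoccSystolicBlackBox)

  sys.io.clk       := clock
  sys.io.rst       := reset.asBool  // assumes rst is declared as a Bool input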
  sys.io.cmd_valid := io.cmd.valid
  io.cmd.ready     := sys.io.cmd_ready
  sys.io.funct7    := io.cmd.bits.inst.funct
  sys.io.rd        := io.cmd.bits.inst.rd
  sys.io.rs1       := io.cmd.bits.rs1
  sys.io.rs2       := io.cmd.bits.rs2

  io.resp.valid       := sys.io.resp_valid
  sys.io.resp_ready   := io.resp.ready
  io.resp.bits.rd     := sys.io.resp_rd
  io.resp.bits.data   := sys.io.resp_data
  io.busy             := sys.io.busy
  io.interrupt        := false.B
}

// Config mixin
class WithSystolicAccelerator extends Config((site, here, up) => {
  case BuildRoCC => up(BuildRoCC, site) ++ Seq(
    (p: Parameters) => {
      val acc = LazyModule(new SystolicAccelerator()(p))
      acc
    }
  )
})

C Driver

C — Host driver for systolic matrix multiply
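#include <stdint.h>
#include <stdio.h>   // printf

// Custom instruction macros: funct7 encodes the operation
#define ROCC_FUNCT7_LOAD_WEIGHT  0x00
#define ROCC_FUNCT7_RUN_MATMUL   0x01
#define ROCC_FUNCT7_STATUS       0x02

// Two-step stringification so the named constants expand to their values
#define ROCC_STR_(x) #x
#define ROCC_STR(x)  ROCC_STR_(x)

// funct3 = 0x7 sets xd, xs1 and xs2, so the core passes rs1/rs2 and waits for rd.
// The "memory" clobber keeps the compiler from moving loads/stores across the command.
#define ROCC_INSN(funct7, rd, rs1, rs2) \
  __asm__ volatile ( \
    ".insn r 0x0B, 0x7, " ROCC_STR(funct7) ", %0, %1, %2\n" \
    : "=r"(rd) : "r"(rs1), "r"(rs2) : "memory" \
  )

/* Load weight matrix B into systolic array PE registers */
static inline void sys_load_weight(const int8_t *B, int N) {
  uint64_t ack;
  ROCC_INSN(ROCC_FUNCT7_LOAD_WEIGHT, ack, (uint64_t)(uintptr_t)B, (uint64_t)N);
  (void)ack;  /* the next RoCC command is not accepted until the load has finished */
}

/* Run matrix multiply: C = A × B (B already loaded) */
static inline uint64_t sys_matmul(const int8_t *A, int32_t *C) {
  uint64_t cycles;
  ROCC_INSN(ROCC_FUNCT7_RUN_MATMUL, cycles, (uint64_t)(uintptr_t)A, (uint64_t)(uintptr_t)C);
  return cycles;   /* returns cycle count for profiling */
}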

/* Example: 4×4 INT8 matrix multiply */
void example() {
  int8_t   A[4][4] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}};
  int8_t   B[4][4] = {{1,0,0,0},{0,1,0,0},{0,0,1,0},{0,0,0,1}};  // identity
  int32_t  C[4][4];

  sys_load_weight((int8_t*)B, 4);
  uint64_t cycles = sys_matmul((int8_t*)A, (int32_t*)C);
  // C should equal A (identity multiply)
  printf("MatMul done in %llu cycles\n", (unsigned long long)cycles);
}

Day 5 — Interview Questions

Q1. How do you map multiple operations (load weight, run matmul, read status) to a single RoCC accelerator?
Use the funct7 field of the custom instruction to encode the operation type. The RoCC command exposes it as cmd.bits.inst.funct (7 bits), and this is the primary dispatch field: the wrapper FSM reads funct7 when cmd_valid fires and enters the matching state, while rs1/rs2 carry addresses or parameters. This gives 128 distinct operations per custom opcode (custom-0/1/2/3). The funct3 field cannot be used for further sub-operations, because in the RoCC instruction format its three bits are the xd, xs1 and xs2 flags, which tell the core whether the instruction writes rd and reads rs1 and rs2. An accelerator that needs more than 128 operations can claim additional custom opcodes or pass a sub-command in rs1/rs2.
Q2. Why does the RoCC wrapper assert busy=1 during matrix multiply?
busy=1 tells the core that the accelerator still has a command in flight. In the Rocket core this signal does not freeze the pipeline on its own: its main effect is that a fence instruction stalls while busy is high, which lets software wait for outstanding accelerator work, including memory accesses. New RoCC commands are held back separately by cmd_ready, which this FSM keeps low whenever it is outside IDLE, so a second command issued during RUN waits at the cmd channel rather than being lost. Because the systolic array takes multiple cycles to finish a matrix multiply, keeping busy high from command acceptance until the response is sent gives the core an accurate view of when the accelerator's results (loaded weights, output matrix C) are safe to rely on.
Q3. What is the role of the LazyRoCC Diplomacy wrapper in Chipyard?
LazyRoCC is an abstract Diplomacy class that handles parameter negotiation and port generation for RoCC accelerators. By extending LazyRoCC and specifying OpcodeSet.custom0, the accelerator is wired to the Rocket core's RoCC interface (cmd, resp and the mem channel into the L1 data cache), can attach TileLink nodes for its own memory traffic, and gets Page Table Walker ports for virtual address translation if nPTWPorts > 0. The concrete implementation goes in LazyRoCCModuleImp; the Verilog RTL is typically wrapped as a BlackBox and instantiated there. The Config mixin (WithSystolicAccelerator) adds it to the chip configuration without modifying the core files.
Q4. How does the RoCC mem channel differ from the cmd/resp channels?
The cmd/resp channels carry small data: two 64-bit register values (rs1, rs2) and a 64-bit resp.data. They are designed for passing addresses, config values, or scalar results, not bulk data. The mem channel is a request/response port into the core's L1 data cache (the HellaCache interface in rocket-chip), allowing the accelerator to perform load/store operations to arbitrary memory addresses; designs that need higher bandwidth can instead attach a separate TileLink node through LazyRoCC. For matrix multiply, the weight matrix (potentially megabytes) must be transferred through one of these memory paths: the accelerator sends read requests for each row/element and buffers them into the PE array. This is effectively an in-accelerator DMA engine.
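As an illustration of the request stream such a DMA engine would generate, the sketch below lists the aligned word addresses that cover an N×N INT8 weight matrix. It assumes one 8-byte read per request and row-major storage; the request width and the helper name weight_beat_addrs are assumptions made for this example, not part of the design above.

```c
#include <stddef.h>
#include <stdint.h>

#define BEAT_BYTES 8u  /* assumed request width: one 64-bit word */

/* Write the aligned read addresses covering an n x n int8 matrix that
 * starts at `base` into `addrs` (at most `max_beats` entries are stored).
 * Returns the total number of reads needed, which may exceed max_beats. */
static size_t weight_beat_addrs(uint64_t base, uint32_t n,
                                uint64_t *addrs, size_t max_beats)
{
    uint64_t bytes = (uint64_t)n * n;   /* one byte per int8 weight */
    if (bytes == 0)
        return 0;

    uint64_t first = base & ~(uint64_t)(BEAT_BYTES - 1);  /* align down */
    uint64_t end   = base + bytes;                        /* one past the last byte */
    size_t count = 0;

    for (uint64_t a = first; a < end; a += BEAT_BYTES) {
        if (count < max_beats)
            addrs[count] = a;
        count++;
    }
    return count;
}
```

A 4×4 INT8 matrix at an 8-byte-aligned address needs two such reads; the same matrix at an address offset by 4 bytes straddles three words, which is one reason drivers usually align the weight buffer.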
Q5. What happens if the CPU issues a RoCC instruction while busy=1?
The command is not accepted until the accelerator is ready for it. In this design busy=1 coincides with cmd_ready=0, so the new instruction waits at the cmd channel, and the core stalls behind it, until the FSM returns to IDLE after sending its resp. In addition, an instruction issued with xd=1 marks rd as pending, and any later instruction that reads rd stalls until the response arrives. Together these ensure ordering: the previous accelerator operation fully completes before the next begins, which makes each command the hardware equivalent of a blocking function call. Note that READ_STATUS is only accepted in IDLE in this FSM, so it cannot poll a running operation; overlapping CPU work with accelerator computation through polling would require a status command that the wrapper accepts while busy.
Q6. How would you verify the RoCC systolic accelerator in simulation?
Three levels: (1) Unit test: directly drive the Verilog rocc_systolic module's cmd/resp ports with a testbench and verify FSM state transitions, correct resp_data, and busy timing. (2) Integration test: use Chipyard's Verilator simulation. Write a bare-metal C test that calls sys_load_weight() and sys_matmul(), run it on the simulated Rocket+RoCC system, and compare the output matrix C against a golden software reference. (3) FPGA: run on an FPGA board (VCU118/ZCU106) with a larger test suite. The key check is correctness: C[i][j] must match Σ A[i][k]×B[k][j] for all i,j.
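The golden software reference mentioned in level (2) can be as simple as a triple loop. The sketch below assumes row-major N×N matrices with INT8 inputs and INT32 accumulation, matching the types used by the C driver; matmul_ref and count_mismatches are hypothetical helper names.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference C = A x B for row-major n x n matrices (int8 in, int32 out). */
static void matmul_ref(const int8_t *A, const int8_t *B, int32_t *C, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            int32_t acc = 0;
            for (size_t k = 0; k < n; k++)
                acc += (int32_t)A[i * n + k] * (int32_t)B[k * n + j];
            C[i * n + j] = acc;
        }
    }
}

/* Count the elements where the accelerator output differs from the reference. */
static size_t count_mismatches(const int32_t *got, const int32_t *want, size_t n)
{
    size_t bad = 0;
    for (size_t i = 0; i < n * n; i++)
        if (got[i] != want[i])
            bad++;
    return bad;
}
```

In a bare-metal test, the program would run sys_matmul() into one buffer and matmul_ref() into another, then fail if count_mismatches() returns anything other than zero.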