Connecting Systolic Array to RoCC
In Day 4 we built a standalone 2×2 systolic array. Now we wrap it with a RoCC interface so the RISC-V CPU can control it using custom instructions. The key design decisions: what does each funct7 command do, how does the FSM sequence load→compute→respond, and how does the C driver invoke it?
Integration Architecture
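The CPU issues .insn r 0x0B, funct3, funct7, rd, rs1, rs2 → RoCC cmd channel → controller FSM decodes funct7 → FSM drives the systolic array phases → result returns via the resp channel. In the RoCC instruction format the three funct3 bits are the xd/xs1/xs2 flags, which tell the core whether rd is written and whether rs1/rs2 are read, so a command that passes both operands and expects a response uses funct3 = 0x7. Memory access (weight/activation DMA) uses the RoCC mem channel.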
Command Encoding
| funct7 | Operation | rs1 | rs2 | Response |
|---|---|---|---|---|
| 0x00 | LOAD_WEIGHT | Weight matrix base addr | Matrix size N | ACK when done |
| 0x01 | RUN_MATMUL | Input matrix A addr | Output matrix C addr | Cycle count |
| 0x02 | READ_STATUS | unused | unused | busy/idle status |
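To make the encoding concrete, here is a minimal sketch of how one command from the table is packed into a 32-bit instruction word. The helper name rocc_encode and the enum names are hypothetical and exist only for illustration; the bit layout assumed is the standard RoCC R-type layout, where the funct3 position holds the xd, xs1, and xs2 flags.

```c
#include <assert.h>
#include <stdint.h>

/* funct7 values from the command table (enum names are illustrative) */
enum { SYS_LOAD_WEIGHT = 0x00, SYS_RUN_MATMUL = 0x01, SYS_READ_STATUS = 0x02 };

#define OPCODE_CUSTOM0 0x0Bu

/* Hypothetical helper: pack one RoCC instruction word.
 * Layout: funct7[31:25] rs2[24:20] rs1[19:15] xd[14] xs1[13] xs2[12]
 *         rd[11:7] opcode[6:0]
 * rd, rs1, rs2 are register numbers (0..31), not register contents. */
static uint32_t rocc_encode(uint32_t funct7, uint32_t rd,
                            uint32_t rs1, uint32_t rs2,
                            uint32_t xd, uint32_t xs1, uint32_t xs2)
{
    assert(funct7 < 128 && rd < 32 && rs1 < 32 && rs2 < 32);
    assert(xd < 2 && xs1 < 2 && xs2 < 2);
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (xd << 14) | (xs1 << 13) | (xs2 << 12) |
           (rd << 7) | OPCODE_CUSTOM0;
}
```

For example, RUN_MATMUL with rd = x10, rs1 = x11, rs2 = x12 and all three flags set encodes to 0x02C5F50B, which is the word the assembler would emit for the matching .insn r line.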
Verilog — RoCC wrapper FSM for systolic array
module rocc_systolic (
  input             clk, rst,
  // RoCC cmd channel
  input             cmd_valid,
  output            cmd_ready,
  input      [6:0]  funct7,
  input      [4:0]  rd,
  input      [63:0] rs1, rs2,
  // RoCC resp channel
  output reg        resp_valid,
  input             resp_ready,
  output reg [4:0]  resp_rd,
  output reg [63:0] resp_data,
  // high while a command is in flight
  output            busy
);
  // FSM states
  localparam IDLE      = 3'd0;
  localparam LOAD_W    = 3'd1; // loading weights into PE array
  localparam WAIT_LOAD = 3'd2; // waiting for DMA weight load
  localparam RUN       = 3'd3; // streaming activations, collecting output
  localparam RESPOND   = 3'd4; // sending result back

  reg [2:0]  state;
  reg [63:0] w_addr, a_addr, c_addr;
  reg [31:0] cycle_cnt;
  reg [4:0]  saved_rd;

  // Completion flags; driven by the DMA engine and the array datapath,
  // both of which are omitted from this listing.
  wire dma_done, matmul_done;

  assign cmd_ready = (state == IDLE);
  assign busy      = (state != IDLE);

  always @(posedge clk) begin
    if (rst) begin
      state      <= IDLE;
      resp_valid <= 1'b0;   // never present a stale response after reset
    end else case (state)
      IDLE: if (cmd_valid) begin
        saved_rd  <= rd;
        cycle_cnt <= 0;
        case (funct7)
          7'h00: begin // LOAD_WEIGHT
            w_addr <= rs1;
            state  <= LOAD_W;
          end
          7'h01: begin // RUN_MATMUL
            a_addr <= rs1;
            c_addr <= rs2;
            state  <= RUN;
          end
          7'h02: begin // READ_STATUS: immediate response (only accepted when idle)
            resp_data <= 64'd0;
            state     <= RESPOND;
          end
        endcase
      end
      LOAD_W: state <= WAIT_LOAD; // initiate DMA via mem channel
      WAIT_LOAD: if (dma_done) begin
        resp_data <= 64'd0;       // ACK value
        state     <= RESPOND;
      end
      RUN: begin
        cycle_cnt <= cycle_cnt + 1;
        if (matmul_done) begin
          resp_data <= {32'd0, cycle_cnt};
          state     <= RESPOND;
        end
      end
      RESPOND: begin
        resp_rd <= saved_rd;
        // Hold resp_valid until the core takes the response. Leaving on
        // resp_ready alone would drop the response when resp_ready is
        // already high on entry, before resp_valid was ever asserted.
        if (resp_valid && resp_ready) begin
          resp_valid <= 1'b0;
          state      <= IDLE;
        end else
          resp_valid <= 1'b1;
      end
    endcase
  end
endmodule
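A cycle-stepped software model of the FSM can help when writing a testbench, since expected state sequences can be computed on the host and compared against simulation. The sketch below is one possible model, not part of the original design: the type and function names (sys_model_t, sys_inputs_t, sys_model_step) are hypothetical, and it assumes the response is completed by a valid/ready handshake and that LOAD_WEIGHT acknowledges with 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { ST_IDLE, ST_LOAD_W, ST_WAIT_LOAD, ST_RUN, ST_RESPOND } sys_state_t;

typedef struct {
    sys_state_t state;
    uint32_t    cycle_cnt;
    uint64_t    resp_data;
    bool        resp_valid;
} sys_model_t;

/* Inputs sampled at one rising clock edge. */
typedef struct {
    bool    cmd_valid;
    uint8_t funct7;
    bool    dma_done;
    bool    matmul_done;
    bool    resp_ready;
} sys_inputs_t;

static void sys_model_reset(sys_model_t *m)
{
    assert(m != NULL);
    m->state = ST_IDLE;
    m->cycle_cnt = 0;
    m->resp_data = 0;
    m->resp_valid = false;
}

/* Advance the model by one clock edge. */
static void sys_model_step(sys_model_t *m, const sys_inputs_t *in)
{
    assert(m != NULL && in != NULL);
    switch (m->state) {
    case ST_IDLE:
        if (in->cmd_valid) {
            m->cycle_cnt = 0;
            if (in->funct7 == 0x00) {
                m->state = ST_LOAD_W;
            } else if (in->funct7 == 0x01) {
                m->state = ST_RUN;
            } else if (in->funct7 == 0x02) {
                m->resp_data = 0;          /* idle status */
                m->state = ST_RESPOND;
            }
        }
        break;
    case ST_LOAD_W:
        m->state = ST_WAIT_LOAD;
        break;
    case ST_WAIT_LOAD:
        if (in->dma_done) {
            m->resp_data = 0;              /* assumed ACK value */
            m->state = ST_RESPOND;
        }
        break;
    case ST_RUN: {
        /* Nonblocking semantics: resp_data sees the count before the increment. */
        uint32_t before = m->cycle_cnt;
        m->cycle_cnt = before + 1;
        if (in->matmul_done) {
            m->resp_data = before;
            m->state = ST_RESPOND;
        }
        break;
    }
    case ST_RESPOND:
        if (m->resp_valid && in->resp_ready) {
            m->resp_valid = false;         /* handshake done */
            m->state = ST_IDLE;
        } else {
            m->resp_valid = true;
        }
        break;
    }
}
```

A testbench could call sys_model_step once per simulated clock with the same inputs it drives into the RTL and flag any cycle where the state or resp_data disagree.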
Chipyard LazyRoCC Scala Wrapper
Scala — Chipyard LazyRoCC integration
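// In src/main/scala/SystolicAccelerator.scala
// Import paths vary with the Chipyard / Rocket Chip version
// (older releases use freechips.rocketchip.config for Config and Parameters).
import chisel3._
import org.chipsalliance.cde.config.{Config, Parameters}
import freechips.rocketchip.diplomacy.LazyModule
import freechips.rocketchip.tile.{BuildRoCC, LazyRoCC, LazyRoCCModuleImp, OpcodeSet}

class SystolicAccelerator(implicit p: Parameters) extends LazyRoCC(
  opcodes   = OpcodeSet.custom0, // uses 0x0B opcode space
  nPTWPorts = 0
) {
  override lazy val module = new SystolicAcceleratorImp(this)
}

class SystolicAcceleratorImp(outer: SystolicAccelerator) extends LazyRoCCModuleImp(outer) {
  // RoccSystolicBlackBox wraps the Verilog rocc_systolic module (definition not shown)
  val sys = Module(new RoccSystolicBlackBox)

  sys.io.clk := clock
  sys.io.rst := reset.asBool

  // cmd channel
  sys.io.cmd_valid := io.cmd.valid
  io.cmd.ready     := sys.io.cmd_ready
  sys.io.funct7    := io.cmd.bits.inst.funct
  sys.io.rd        := io.cmd.bits.inst.rd
  sys.io.rs1       := io.cmd.bits.rs1
  sys.io.rs2       := io.cmd.bits.rs2

  // resp channel
  io.resp.valid     := sys.io.resp_valid
  sys.io.resp_ready := io.resp.ready
  io.resp.bits.rd   := sys.io.resp_rd
  io.resp.bits.data := sys.io.resp_data

  io.busy      := sys.io.busy
  io.interrupt := false.B
  // io.mem (the DMA path) is not connected in this listing; tie it off
  // or wire it to the DMA engine before elaborating.
}

// Config mixin
class WithSystolicAccelerator extends Config((site, here, up) => {
  case BuildRoCC => up(BuildRoCC, site) ++ Seq(
    (p: Parameters) => {
      val acc = LazyModule(new SystolicAccelerator()(p))
      acc
    }
  )
})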
C Driver
C — Host driver for systolic matrix multiply
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

// Custom instruction macros: funct7 encodes the operation
#define ROCC_FUNCT7_LOAD_WEIGHT 0x00
#define ROCC_FUNCT7_RUN_MATMUL  0x01
#define ROCC_FUNCT7_STATUS      0x02

// funct3 = 0x7 sets xd, xs1, and xs2, so the core forwards rs1/rs2 and
// waits for the response in rd (funct3 = 0 would send no operands).
// The "memory" clobber stops the compiler from reordering accesses to
// the matrices around the instruction. funct7 must be a literal here
// because it is stringified into the assembly text.
#define ROCC_INSN(funct7, rd, rs1, rs2) \
  __asm__ volatile ( \
    ".insn r 0x0B, 0x7, " #funct7 ", %0, %1, %2\n" \
    : "=r"(rd) : "r"(rs1), "r"(rs2) : "memory" \
  )

/* Load weight matrix B into systolic array PE registers */
static inline void sys_load_weight(const int8_t *B, int N) {
  uint64_t ack;
  ROCC_INSN(0, ack, (uint64_t)(uintptr_t)B, (uint64_t)N);
  (void)ack; /* blocks until weights are loaded */
}

/* Run matrix multiply: C = A × B (B already loaded) */
static inline uint64_t sys_matmul(const int8_t *A, int32_t *C) {
  uint64_t cycles;
  ROCC_INSN(1, cycles, (uint64_t)(uintptr_t)A, (uint64_t)(uintptr_t)C);
  return cycles; /* returns cycle count for profiling */
}

/* Example: 4×4 INT8 matrix multiply */
void example(void) {
  int8_t A[4][4] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}};
  int8_t B[4][4] = {{1,0,0,0},{0,1,0,0},{0,0,1,0},{0,0,0,1}}; /* identity */
  int32_t C[4][4];

  sys_load_weight((const int8_t *)B, 4);
  uint64_t cycles = sys_matmul((const int8_t *)A, (int32_t *)C);

  /* C should equal A (identity multiply) */
  printf("MatMul done in %" PRIu64 " cycles\n", cycles);
}
Day 5 — Interview Questions
Q1. How do you map multiple operations (load weight, run matmul, read status) to a single RoCC accelerator?
Use the funct7 field (7 bits) of the custom instruction to encode the operation type. The RoCC cmd exposes it as cmd.bits.inst.funct, and this is the primary dispatch field. The RoCC wrapper FSM reads funct7 when cmd_valid fires and enters the appropriate state; rs1/rs2 carry addresses or parameters. This gives 128 distinct operations per custom opcode space (custom-0/1/2/3). The funct3 field does not add further sub-operations: in the RoCC format its three bits are the xd, xs1, and xs2 flags, which tell the core whether the instruction writes rd and reads rs1/rs2. If 128 operations are not enough, use a second custom opcode or encode a sub-command in an operand register.
Q2. Why does the RoCC wrapper assert busy=1 during matrix multiply?
busy=1 tells the core that the accelerator still has work in flight. The systolic array takes many cycles to complete a matrix multiply, so the wrapper holds busy high from the moment a command is accepted until the response has been sent. busy is not what blocks the next command: while the FSM is outside IDLE, cmd_ready=0 already back-pressures the cmd channel, and any instruction that reads the destination register waits until the resp arrives. In Rocket, busy matters mainly for ordering: a fence does not complete while busy is high, so software can place a fence after RUN_MATMUL to be sure the accelerator's memory traffic has finished before the CPU reads C.
Q3. What is the role of the LazyRoCC Diplomacy wrapper in Chipyard?
LazyRoCC is an abstract Scala class built on Diplomacy that handles parameter negotiation and TileLink node generation for RoCC accelerators. By extending LazyRoCC and specifying OpcodeSet.custom0, the accelerator is automatically attached to the Rocket tile's RoCC interface (cmd/resp), to the L1 data cache port behind the mem channel, to optional TileLink nodes for direct memory access, and to the Page Table Walker (for virtual address translation if nPTWPorts > 0). The concrete implementation goes in LazyRoCCModuleImp, where the Verilog RTL is typically wrapped as a BlackBox and instantiated. The Config mixin (WithSystolicAccelerator) adds the accelerator to the chip configuration without modifying the core files.
Q4. How does the RoCC mem channel differ from the cmd/resp channels?
The cmd/resp channels carry small data: the 64-bit rs1 and rs2 register values and a 64-bit resp.data. They are designed for passing addresses, config values, or scalar results, not bulk data. The mem channel is a request/response port into the L1 data cache (the HellaCache interface in Rocket, not TileLink), which lets the accelerator load and store arbitrary addresses one word at a time. For higher bandwidth, a LazyRoCC accelerator can add its own TileLink client node and reach L2 directly. For matrix multiply, the weight matrix (potentially megabytes) cannot travel in rs1/rs2, so it is fetched through memory: the accelerator sends read requests for each row/element and buffers the data into the PE array. This is effectively an in-accelerator DMA engine.
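A little arithmetic shows why bulk data needs its own path. The sketch below counts the read requests needed to fetch an n×n INT8 weight matrix and computes the address of each request, assuming each request returns 8 bytes (one 64-bit word). The function names are hypothetical and the 8-byte beat size is an assumption, not something the design above specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of read requests needed to fetch an n x n INT8 matrix when each
 * request returns beat_bytes bytes (8 is assumed in the examples). */
static size_t sys_weight_requests(size_t n, size_t beat_bytes)
{
    assert(beat_bytes > 0);
    size_t total_bytes = n * n * sizeof(int8_t);
    return (total_bytes + beat_bytes - 1) / beat_bytes;  /* round up */
}

/* Address of the i-th request when walking the matrix linearly from base. */
static uint64_t sys_weight_req_addr(uint64_t base, size_t i, size_t beat_bytes)
{
    return base + (uint64_t)i * (uint64_t)beat_bytes;
}
```

Under these assumptions a 4×4 matrix needs 2 requests, while a 1024×1024 matrix needs 131072, which is why the request loop belongs in hardware rather than in a sequence of custom instructions.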
Q5. What happens if the CPU issues a RoCC instruction while busy=1?
In this design busy=1 coincides with cmd_ready=0, so a new RoCC instruction is not accepted: it waits on the cmd channel, and the in-order core stalls behind it until the FSM returns to IDLE after sending its resp. This ensures ordering: the previous accelerator operation fully completes before the next begins, much like a blocking function call. Note that the core does not freeze the moment a command is issued; it keeps executing until an instruction needs the pending rd or another RoCC command, so independent CPU work can overlap with the accelerator. Polling with READ_STATUS is another way to overlap work, but it requires an FSM that accepts READ_STATUS while a matmul is running; the wrapper above only accepts commands in IDLE, so there READ_STATUS always reports idle.
Q6. How would you verify the RoCC systolic accelerator in simulation?
Three levels: (1) Unit test: drive the cmd/resp ports of the Verilog rocc_systolic module directly from a testbench and verify FSM state transitions, correct resp_data, and busy timing. (2) Integration test: use Chipyard's Verilator simulation: write a bare-metal C test that calls sys_load_weight() and sys_matmul(), run it on the simulated Rocket+RoCC system, and compare the output matrix C against a golden software reference. (3) FPGA: run on an FPGA board (VCU118/ZCU106) with a larger test suite. The key check is correctness: C[i][j] must match Σ A[i][k]×B[k][j] for all i,j.
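The golden software reference mentioned in level (2) can be very small. The following is a minimal sketch, assuming row-major n×n matrices with INT8 inputs and INT32 accumulators to match the driver's types; ref_matmul and count_mismatches are hypothetical names introduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Reference C = A x B for row-major n x n matrices.
 * INT8 inputs are widened to INT32 before multiplying. */
static void ref_matmul(const int8_t *A, const int8_t *B, int32_t *C, int n)
{
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            int32_t acc = 0;
            for (int k = 0; k < n; k++) {
                acc += (int32_t)A[i * n + k] * (int32_t)B[k * n + j];
            }
            C[i * n + j] = acc;
        }
    }
}

/* Count elements where the accelerator output differs from the reference. */
static int count_mismatches(const int32_t *got, const int32_t *want, int n)
{
    int bad = 0;
    for (int i = 0; i < n * n; i++) {
        if (got[i] != want[i]) {
            bad++;
        }
    }
    return bad;
}
```

In the bare-metal test, the matrix written by sys_matmul() would be passed as got and the ref_matmul() result as want; a nonzero count fails the test. Random and boundary-value inputs (for example -128 and 127) are worth including alongside the identity case, since an identity multiply cannot expose accumulation or sign-extension bugs.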