RISC-V permanently reserves four opcode spaces for user-defined instructions. Learn exactly how to encode, emit, decode, and execute custom instructions — from bit-field layout to GCC inline assembly to Verilog decode logic.
The RISC-V ISA specification divides the 32-bit instruction space into opcode groups using bits [6:0]. Standard extensions (I, M, A, F, D, C, V…) occupy specific opcodes. The spec also sets aside four major opcodes, custom-0 through custom-3, for custom use. custom-0 and custom-1 will be avoided by standard extensions outright. custom-2 and custom-3 are additionally reserved for a possible future RV128 but are otherwise avoided too, so all four are usable for custom instructions on RV32 and RV64.
RISC-V's modularity promise means that any conforming implementation can ignore extensions it doesn't support. The custom opcode spaces are deliberately left out of the standard allocation process, so no RISC-V International working group will assign them to a standard extension. This gives SoC designers a stable, conflict-free space for proprietary instructions that won't collide with encodings added by future GCC/LLVM updates.
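As a rough illustration of how the opcode field identifies a custom instruction, here is a minimal C sketch. The helper names (`major_opcode`, `is_custom_instr`, `custom_opcode_selfcheck`) are hypothetical, and the hex values are the ones listed in the opcode table later in this article.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Major opcode values (instruction bits [6:0]) of the four custom spaces. */
enum {
    OPC_CUSTOM0 = 0x0B,
    OPC_CUSTOM1 = 0x2B,
    OPC_CUSTOM2 = 0x5B, /* also earmarked for a possible future RV128 */
    OPC_CUSTOM3 = 0x7B  /* also earmarked for a possible future RV128 */
};

/* Hypothetical helper: extract the 7-bit major opcode from a 32-bit word. */
static inline uint32_t major_opcode(uint32_t instr) {
    return instr & 0x7Fu;
}

/* Hypothetical helper: true if the word falls in any custom opcode space. */
static inline bool is_custom_instr(uint32_t instr) {
    switch (major_opcode(instr)) {
    case OPC_CUSTOM0:
    case OPC_CUSTOM1:
    case OPC_CUSTOM2:
    case OPC_CUSTOM3:
        return true;
    default:
        return false;
    }
}

/* Usage sketch: 0x00C5850B carries opcode 0x0B (custom-0), while
 * 0x00000013 (ADDI x0, x0, 0) is a standard instruction. */
static inline void custom_opcode_selfcheck(void) {
    assert(is_custom_instr(0x00C5850Bu));
    assert(!is_custom_instr(0x00000013u));
}
```

A simulator or disassembler could use a check like this to route unknown words to a custom handler instead of raising an illegal-instruction error.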
R-type is the most useful encoding for custom instructions that operate on CPU registers. The format gives you two source registers (rs1, rs2), one destination register (rd), and 10 bits of sub-operation selection (funct3 + funct7).
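The field layout can be made concrete with a small encoder. This is a sketch rather than part of any toolchain: `encode_rtype` is a hypothetical helper, and its argument order simply mirrors the assembler's `.insn r opcode, funct3, funct7, rd, rs1, rs2` form.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack the six R-type fields into a 32-bit word.
 * Layout: funct7[31:25] rs2[24:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0] */
static inline uint32_t encode_rtype(uint32_t opcode, uint32_t funct3,
                                    uint32_t funct7, uint32_t rd,
                                    uint32_t rs1, uint32_t rs2) {
    /* Reject values that would spill into a neighbouring field. */
    assert(opcode < 128 && funct3 < 8 && funct7 < 128);
    assert(rd < 32 && rs1 < 32 && rs2 < 32);
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15)
         | (funct3 << 12) | (rd << 7) | opcode;
}
```

For example, `encode_rtype(0x0B, 0, 0, 10, 11, 12)` yields `0x00C5850B`, the word for a custom-0 instruction with rd = a0 (x10), rs1 = a1 (x11), and rs2 = a2 (x12).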
The RISC-V GNU assembler (GAS) provides the .insn directive to emit arbitrary instruction encodings without modifying the assembler or compiler. This is the fastest path to testing a custom instruction in simulation — no toolchain patches required.
```asm
# RISC-V assembly using .insn for custom-0 instructions
# Assemble with: riscv64-unknown-elf-as -march=rv32i custom_test.s

    .section .text
    .global custom_mac_test

# Custom instruction definitions (using the .insn directive)
#   CUSTOM_MAC:   accumulator += rs1 * rs2   (funct3=0, funct7=0)
#                 rd receives the accumulator; in the RTL later in this
#                 article that is the value from before the update
#   CUSTOM_CLEAR: accumulator = 0            (funct3=1, funct7=0)

custom_mac_test:
    li   a1, 3                        # a1 = 3 (rs1)
    li   a2, 4                        # a2 = 4 (rs2)

    # Clear the accumulator first
    .insn r 0x0B, 1, 0, a0, a0, a0    # CUSTOM_CLEAR (rd=a0, ignored)

    # accumulator = 0 + 3*4 = 12
    .insn r 0x0B, 0, 0, a0, a1, a2    # CUSTOM_MAC: accumulator += a1*a2

    # accumulator = 12 + 3*4 = 24
    .insn r 0x0B, 0, 0, a0, a1, a2    # CUSTOM_MAC: accumulator += a1*a2
    ret
```

For production use, wrap the .insn directive in a static inline C function using GCC inline assembly. This gives you a callable C function with proper register constraints, and the compiler handles register allocation automatically.
```c
#ifndef CUSTOM_INSN_H
#define CUSTOM_INSN_H

#include <stdint.h>

/*
 * CUSTOM_MAC: accumulator += rs1 * rs2
 * rd returns the accumulator; in the RTL later in this article that is
 * the value from before the update.
 * Uses custom-0 opcode (0x0B), funct3=0, funct7=0
 */
static inline uint32_t custom_mac(uint32_t rs1, uint32_t rs2) {
    uint32_t rd;
    __asm__ volatile (
        ".insn r 0x0B, 0, 0, %0, %1, %2"
        : "=r"(rd)              /* output: rd register */
        : "r"(rs1), "r"(rs2)    /* inputs: rs1, rs2 */
        :                       /* no clobbers */
    );
    return rd;
}

/*
 * CUSTOM_CLEAR: clear the on-chip accumulator
 * Uses custom-0 opcode (0x0B), funct3=1, funct7=0
 */
static inline void custom_clear(void) {
    __asm__ volatile (
        ".insn r 0x0B, 1, 0, x0, x0, x0"
        :   /* no outputs */
        :   /* no inputs */
        :   /* no clobbers */
    );
}

/*
 * CUSTOM_DOT4: 4-element dot product (packed 8-bit in 32-bit word)
 * a = {a3,a2,a1,a0} packed bytes, b = {b3,b2,b1,b0}
 * accumulator += a0*b0 + a1*b1 + a2*b2 + a3*b3
 * rd returns the accumulator, with the same timing as CUSTOM_MAC.
 * Uses custom-0 opcode (0x0B), funct3=2, funct7=0
 */
static inline uint32_t custom_dot4(uint32_t a, uint32_t b) {
    uint32_t rd;
    __asm__ volatile (
        ".insn r 0x0B, 2, 0, %0, %1, %2"
        : "=r"(rd)
        : "r"(a), "r"(b)
    );
    return rd;
}

#endif /* CUSTOM_INSN_H */
```

```c
#include "custom_insn.h"
#include <stdio.h>

/* Dot product of two int8 arrays using the custom instruction.
 * len must be a multiple of 4. */
int32_t dot_product(int8_t *a, int8_t *b, int len) {
    custom_clear();                     /* reset accumulator */
    for (int i = 0; i < len; i += 4) {
        /* Pack 4 bytes into a 32-bit word. Widen to uint32_t before
         * shifting so that << 24 cannot overflow a signed int. */
        uint32_t va = (uint32_t)(uint8_t)a[i]
                    | ((uint32_t)(uint8_t)a[i+1] << 8)
                    | ((uint32_t)(uint8_t)a[i+2] << 16)
                    | ((uint32_t)(uint8_t)a[i+3] << 24);
        uint32_t vb = (uint32_t)(uint8_t)b[i]
                    | ((uint32_t)(uint8_t)b[i+1] << 8)
                    | ((uint32_t)(uint8_t)b[i+2] << 16)
                    | ((uint32_t)(uint8_t)b[i+3] << 24);
        custom_dot4(va, vb);            /* accumulates internally */
    }
    /* Final MAC with zero operands to read the accumulator */
    return (int32_t)custom_mac(0, 0);
}

int main(void) {
    int8_t a[] = {1, 2, 3, 4};
    int8_t b[] = {5, 6, 7, 8};
    /* Expected: 1*5 + 2*6 + 3*7 + 4*8 = 5+12+21+32 = 70 */
    printf("dot product = %d\n", (int)dot_product(a, b, 4));
    return 0;
}

/* Compile (no -nostdlib: printf needs the C library and startup code):
   riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32
       -O2 -o dot_test dot_test.c */
```

In a simple in-order RISC-V pipeline (like the one built in RISC-V From Scratch), custom instructions are decoded at the Decode stage. The opcode field selects the custom execution unit, and funct3/funct7 select the specific operation.
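Before looking at RTL, the decode step can be sketched as a small C reference model. This is a sketch under stated assumptions, not the article's decoder: the type and function names are hypothetical, only the three operations from the header above are modelled, and it requires funct7 to be exactly zero, which is stricter than a hardware decoder has to be.

```c
#include <assert.h>
#include <stdint.h>

/* R-type fields at the bit positions a decode stage extracts. */
typedef struct {
    uint32_t opcode;  /* bits [6:0]   */
    uint32_t rd;      /* bits [11:7]  */
    uint32_t funct3;  /* bits [14:12] */
    uint32_t rs1;     /* bits [19:15] */
    uint32_t rs2;     /* bits [24:20] */
    uint32_t funct7;  /* bits [31:25] */
} rtype_fields;

/* Hypothetical operation codes, used by this sketch only. */
enum { OP_NOT_CUSTOM = -1, OP_MAC = 0, OP_CLEAR = 1, OP_DOT4 = 2, OP_ILLEGAL = 7 };

/* Split a 32-bit instruction word into its R-type fields. */
static inline rtype_fields decode_rtype(uint32_t instr) {
    rtype_fields f;
    f.opcode = instr & 0x7Fu;
    f.rd     = (instr >> 7)  & 0x1Fu;
    f.funct3 = (instr >> 12) & 0x07u;
    f.rs1    = (instr >> 15) & 0x1Fu;
    f.rs2    = (instr >> 20) & 0x1Fu;
    f.funct7 = (instr >> 25) & 0x7Fu;
    return f;
}

/* The opcode selects the custom unit; funct7/funct3 select the operation. */
static inline int select_custom_op(uint32_t instr) {
    rtype_fields f = decode_rtype(instr);
    if (f.opcode != 0x0Bu) return OP_NOT_CUSTOM;  /* not custom-0 */
    if (f.funct7 != 0u)    return OP_ILLEGAL;     /* assumption: funct7 must be 0 */
    switch (f.funct3) {
    case 0:  return OP_MAC;
    case 1:  return OP_CLEAR;
    case 2:  return OP_DOT4;
    default: return OP_ILLEGAL;
    }
}

/* Usage sketch with three hand-encoded words. */
static inline void decode_selfcheck(void) {
    assert(select_custom_op(0x00C5850Bu) == OP_MAC);        /* .insn r 0x0B, 0, 0, a0, a1, a2 */
    assert(select_custom_op(0x0000100Bu) == OP_CLEAR);      /* .insn r 0x0B, 1, 0, x0, x0, x0 */
    assert(select_custom_op(0x00000013u) == OP_NOT_CUSTOM); /* ADDI x0, x0, 0 */
}
```

A model like this is handy as a golden reference when checking a hardware decoder in simulation, since both should agree on every instruction word they are given.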
```systemverilog
// Custom instruction decoder: plug into the main decode stage
// Detects custom-0 (0x0B) instructions and routes to custom EX unit
module custom_decode (
    input  logic [31:0] instr,      // raw 32-bit instruction word
    output logic        is_custom,  // 1 = this is a custom instruction
    output logic [6:0]  funct7,
    output logic [2:0]  funct3,
    output logic [4:0]  rd, rs1, rs2,
    output logic [2:0]  custom_op   // decoded operation for EX unit
);
    localparam CUSTOM0_OP = 7'b000_1011;  // 0x0B

    // Extract fields
    assign funct7 = instr[31:25];
    assign rs2    = instr[24:20];
    assign rs1    = instr[19:15];
    assign funct3 = instr[14:12];
    assign rd     = instr[11:7];

    // Detect custom-0 opcode
    assign is_custom = (instr[6:0] == CUSTOM0_OP);

    // Map {funct7[0], funct3} to custom operation code.
    // Only funct7[0] is examined, so funct7[6:1] are don't-cares here.
    always_comb begin
        custom_op = 3'b000;  // default; only meaningful when is_custom = 1
        if (is_custom) begin
            case ({funct7[0], funct3})
                4'b0_000: custom_op = 3'd0;  // CUSTOM_MAC
                4'b0_001: custom_op = 3'd1;  // CUSTOM_CLEAR
                4'b0_010: custom_op = 3'd2;  // CUSTOM_DOT4
                4'b0_011: custom_op = 3'd3;  // CUSTOM_AES_ENC
                4'b1_000: custom_op = 3'd4;  // CUSTOM_SHA256
                default:  custom_op = 3'd7;  // ILLEGAL_CUSTOM
            endcase
        end
    end
endmodule
```

```systemverilog
// Custom execution unit: connected to the pipeline after decode
module custom_exec (
    input  logic        clk, rst_n,
    input  logic        valid,      // instruction is valid (from decode)
    input  logic [2:0]  custom_op,  // operation from decoder
    input  logic [31:0] rs1_val,    // register file read value for rs1
    input  logic [31:0] rs2_val,    // register file read value for rs2
    output logic        done,       // registered: high the cycle after valid
    output logic [31:0] result      // value to write back to rd
);
    logic [63:0] accumulator;  // 64-bit accumulator (wider than 32-bit rd)

    localparam MAC   = 3'd0;
    localparam CLEAR = 3'd1;
    localparam DOT4  = 3'd2;

    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            accumulator <= '0;
            done        <= 1'b0;
            result      <= '0;
        end else if (valid) begin
            done <= 1'b1;  // single-cycle for these ops
            case (custom_op)
                MAC: begin
                    accumulator <= accumulator + (rs1_val * rs2_val);
                    // Non-blocking read: result is the pre-update accumulator
                    result      <= accumulator[31:0];
                end
                CLEAR: begin
                    accumulator <= '0;
                    result      <= '0;
                end
                DOT4: begin
                    // Unpack 4×8-bit signed elements and form the dot product
                    automatic logic [63:0] dp;
                    dp = $signed(rs1_val[7:0])   * $signed(rs2_val[7:0])
                       + $signed(rs1_val[15:8])  * $signed(rs2_val[15:8])
                       + $signed(rs1_val[23:16]) * $signed(rs2_val[23:16])
                       + $signed(rs1_val[31:24]) * $signed(rs2_val[31:24]);
                    accumulator <= accumulator + dp;
                    result      <= accumulator[31:0];  // pre-update accumulator
                end
                default: begin
                    done   <= 1'b0;
                    result <= '0;
                end
            endcase
        end else begin
            done <= 1'b0;
        end
    end
endmodule
```

```systemverilog
`timescale 1ns/1ps
module tb_custom_exec;
    logic        clk = 0, rst_n;
    logic        valid;
    logic [2:0]  custom_op;
    logic [31:0] rs1_val, rs2_val;
    logic        done;
    logic [31:0] result;

    custom_exec dut (.*);

    always #5 clk = ~clk;  // 100 MHz

    // Drive one operation with valid high for a single clock cycle
    task send_op(input [2:0] op, input [31:0] a, b);
        @(posedge clk);
        valid   <= 1; custom_op <= op;
        rs1_val <= a; rs2_val   <= b;
        @(posedge clk);
        valid <= 0;
        @(posedge clk);
    endtask

    initial begin
        rst_n = 0; valid = 0;
        repeat(3) @(posedge clk);
        rst_n = 1;

        // Test 1: CLEAR
        send_op(3'd1, 0, 0);
        $display("After CLEAR: accum should be 0");

        // Test 2: MAC 3*4 = 12
        send_op(3'd0, 32'd3, 32'd4);
        @(posedge clk);
        assert (result == 32'd0) else $error("MAC1 result wrong: %0d", result);  // reads prev accum
        $display("MAC(3,4) done, accum = 12");

        // Test 3: MAC 3*4 again = 24
        send_op(3'd0, 32'd3, 32'd4);
        @(posedge clk);
        assert (result == 32'd12) else $error("MAC2 result wrong: %0d", result);
        $display("MAC(3,4) done, accum = 24, result (prev accum) = %0d", result);

        // Test 4: DOT4, [1,2,3,4]·[5,6,7,8] = 70
        send_op(3'd2, 32'h04030201, 32'h08070605);
        @(posedge clk);
        assert (result == 32'd24) else $error("DOT4 result wrong: %0d", result);  // reads prev accum
        $display("DOT4 done, accum = 24 + 70 = 94");

        $display("All tests completed; check the log for assertion errors");
        $finish;
    end
endmodule
```

| Opcode | Hex | Recommended Use | Notes |
|---|---|---|---|
| custom-0 | 0x0B | Math / ML accelerator (MAC, DOT, GEMM) | Most widely used, best tool support |
| custom-1 | 0x2B | Crypto accelerator (AES, SHA, HMAC) | Separate namespace from compute |
| custom-2 | 0x5B | DSP / signal processing (FFT, FIR) | May conflict with future rv128 |
| custom-3 | 0x7B | Debug / profiling / special ops | Avoid in shipping silicon — rv128 risk |
The RISC-V spec labels custom-2 and custom-3 as custom-2/rv128 and custom-3/rv128: they are reserved for future use by RV128 (128-bit RISC-V) but are otherwise avoided by standard extensions, so they remain usable for custom instructions on RV32 and RV64. For production chips shipping today, prefer custom-0 and custom-1, which carry no such reservation.
| # | Question | Answer Points |
|---|---|---|
| 1 | How many custom operations can you define per custom opcode space? | Using R-type encoding: funct3 (3 bits) × funct7 (7 bits) = 8 × 128 = 1024 distinct operations per opcode space. Across all 4 custom opcode spaces: 4096 total. Using I-type, the immediate field can encode additional sub-modes. |
| 2 | What is the .insn directive and when would you use it? | A GAS (GNU Assembler) directive that emits any instruction encoding without compiler modification. Used to prototype custom instructions without patching GCC/LLVM. Syntax: .insn r 0x0B, funct3, funct7, rd, rs1, rs2. Good for simulation and early bring-up; production code uses compiler intrinsics or a full toolchain patch. |
| 3 | What is the difference between a custom ISA extension and the RoCC interface? | Custom ISA extension modifies the CPU pipeline directly — you add decode and execute logic inside the processor core. RoCC is a standard tightly-coupled coprocessor interface where the custom instruction is intercepted at decode and dispatched to an external (but closely attached) coprocessor via the cmd/resp channels. RoCC is easier to integrate (no pipeline modification) but only works with Rocket/Chipyard cores. Custom ISA can be used with any RISC-V core you have source access to. |
| 4 | How do you tell GCC to allocate registers for a custom instruction? | Use GCC inline assembly with output/input constraints: =r for output register (rd), r for input registers (rs1, rs2). The compiler resolves which physical registers to use; the .insn line refers to them by placeholder. Example: __asm__ volatile (".insn r 0x0B, 0, 0, %0, %1, %2" : "=r"(rd) : "r"(rs1), "r"(rs2)); |
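The I-type option mentioned in question 1 can be sketched the same way as R-type. In this minimal sketch, `encode_itype` is a hypothetical helper whose argument order mirrors the assembler's `.insn i opcode, funct3, rd, rs1, simm12` form; treating the 12-bit immediate as a sub-mode selector is a design choice of the custom unit, not something the format requires.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack I-type fields into a 32-bit word.
 * Layout: imm[31:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0] */
static inline uint32_t encode_itype(uint32_t opcode, uint32_t funct3,
                                    uint32_t rd, uint32_t rs1, int32_t imm12) {
    assert(opcode < 128 && funct3 < 8 && rd < 32 && rs1 < 32);
    assert(imm12 >= -2048 && imm12 <= 2047);  /* signed 12-bit range */
    return (((uint32_t)imm12 & 0xFFFu) << 20) | (rs1 << 15)
         | (funct3 << 12) | (rd << 7) | opcode;
}
```

For example, `encode_itype(0x0B, 0, 10, 11, 5)` yields `0x0055850B`: a custom-0 instruction with rd = a0, rs1 = a1, and the immediate 5 available to the hardware as a mode or parameter.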