How do I add a custom instruction to RISC-V?

RISC-V sets aside four major opcodes for custom instructions: custom-0 (0x0B), custom-1 (0x2B), custom-2 (0x5B), and custom-3 (0x7B). To add a custom instruction: (1) choose an opcode space and encode your instruction in R/I/S/U format, using the funct3 and funct7 fields to distinguish sub-operations; (2) add decode logic in your CPU pipeline or use the RoCC interface to intercept these opcodes; (3) use the GNU assembler .insn directive to emit the instruction from C/assembly without modifying the compiler; (4) implement the execution unit in RTL. The RISC-V ISA spec states that standard extensions will avoid custom-0 and custom-1; custom-2 and custom-3 are additionally reserved for a future RV128, so they are available for custom use on RV32 and RV64 only.

What is the RISC-V .insn directive?

The .insn directive in the RISC-V GNU assembler (GAS) lets you emit any custom instruction encoding without modifying the assembler or compiler. Syntax for R-type: .insn r opcode, funct3, funct7, rd, rs1, rs2. Example: .insn r 0x0B, 0, 0, a0, a1, a2 emits a custom-0 R-type instruction with funct3=0, funct7=0, rd=a0, rs1=a1, rs2=a2. You can wrap this in a static inline C function using inline assembly (__asm__) to make the custom instruction callable from C.

RISC-V Custom ISA Extension — Adding Custom Instructions to RISC-V | Day 2

1. The Four Custom Opcode Spaces

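The RISC-V ISA specification divides the 32-bit instruction space into opcode groups using bits [6:0]. Standard extensions (I, M, A, F, D, V…) occupy specific opcodes. The spec sets aside four major opcodes for custom use. It states that standard extensions will avoid custom-0 and custom-1, which makes them the safest choice for long-lived custom instructions. custom-2 and custom-3 are reserved for future use by RV128 but are otherwise avoided by standard extensions, so they are available for custom instructions on RV32 and RV64.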

RISC-V 32-bit instruction opcode map (opcode bits [6:2]; bits [1:0] are always 11):

Opcode [6:2]	Hex	Name	Reserved for
00010	0x0B	custom-0	Custom extensions (avoided by standard extensions)
01010	0x2B	custom-1	Custom extensions (avoided by standard extensions)
10110	0x5B	custom-2/rv128	Custom extensions on RV32/RV64; reserved for RV128
11110	0x7B	custom-3/rv128	Custom extensions on RV32/RV64; reserved for RV128

Full 7-bit opcode values:
custom-0: 0x0B = 000_1011
custom-1: 0x2B = 010_1011
custom-2: 0x5B = 101_1011
custom-3: 0x7B = 111_1011

Instruction variants per opcode space:
Each opcode supports R / I / S / U type encoding
R-type: funct3 (3 bits) × funct7 (7 bits) = 8 × 128 = 1024 distinct operations
I-type: funct3 (3 bits) selects the operation; imm[11:0] carries a 12-bit immediate operand
Total across 4 spaces: 4 × 1024 = 4096 distinct R-type custom instructions
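The opcode map above can be turned into a small host-side check, for example in an instruction-set simulator or a trace filter. This is a minimal sketch: the function name custom_space is a hypothetical helper, not part of any RISC-V toolchain.

```c
#include <stdint.h>

/* Hypothetical helper: classify a 32-bit instruction word by its
 * major opcode (bits [6:0]).
 * Returns 0..3 for custom-0..custom-3, or -1 for any other opcode. */
static int custom_space(uint32_t instr) {
    switch (instr & 0x7Fu) {
    case 0x0B: return 0;   /* custom-0 */
    case 0x2B: return 1;   /* custom-1 */
    case 0x5B: return 2;   /* custom-2/rv128 */
    case 0x7B: return 3;   /* custom-3/rv128 */
    default:   return -1;  /* standard or otherwise reserved opcode */
    }
}
```

Only the low seven bits are inspected, so the same check works for every instruction format that uses these opcodes.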

Why "permanently reserved"?

RISC-V's modularity promise means that any conforming implementation can ignore extensions it doesn't support. The custom opcode spaces are deliberately kept out of the standard allocation process: the spec states that standard extensions will avoid custom-0 and custom-1. This gives SoC designers a stable space for proprietary instructions that won't collide with encodings introduced by future standard extensions, and therefore with the instructions that future GCC/LLVM releases emit.

2. Custom Instruction Encoding — R-Type Deep Dive

R-type is the most useful encoding for custom instructions that operate on CPU registers. The format gives you two source registers (rs1, rs2), one destination register (rd), and 10 bits of sub-operation selection (funct3 + funct7).

Fig 1: R-type custom-0 instruction encoding. funct7+funct3 together select 1 of 1024 possible sub-operations. rd, rs1, rs2 index the CPU integer register file.
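The field layout in the caption (funct7 in bits [31:25], rs2 in [24:20], rs1 in [19:15], funct3 in [14:12], rd in [11:7], opcode in [6:0]) can be sketched as a host-side encoder. The name encode_rtype is a hypothetical helper; register arguments are plain register numbers (a0 = x10, a1 = x11, a2 = x12).

```c
#include <stdint.h>

/* Hypothetical helper: pack the six R-type fields into a 32-bit word.
 * Each field is masked to its width so out-of-range values cannot
 * spill into neighbouring fields. */
static uint32_t encode_rtype(uint32_t opcode, uint32_t funct3, uint32_t funct7,
                             uint32_t rd, uint32_t rs1, uint32_t rs2) {
    return ((funct7 & 0x7Fu) << 25)   /* bits [31:25] */
         | ((rs2    & 0x1Fu) << 20)   /* bits [24:20] */
         | ((rs1    & 0x1Fu) << 15)   /* bits [19:15] */
         | ((funct3 & 0x07u) << 12)   /* bits [14:12] */
         | ((rd     & 0x1Fu) << 7)    /* bits [11:7]  */
         |  (opcode & 0x7Fu);         /* bits [6:0]   */
}
```

A helper like this is handy for generating test vectors for a decoder, or for cross-checking the words the assembler emits.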

Sub-operation encoding strategy with funct7 + funct3:

{funct7, funct3} = 10-bit sub-operation selector

Example allocation for a custom math accelerator on custom-0 (0x0B):
{7'b000_0000, 3'b000} → CUSTOM_MAC (multiply-accumulate: accum += rs1*rs2)
{7'b000_0000, 3'b001} → CUSTOM_CLEAR (clear accumulator: accum = 0)
{7'b000_0000, 3'b010} → CUSTOM_DOT4 (4-element dot product)
{7'b000_0001, 3'b000} → CUSTOM_AES_ENC (AES round encrypt)
{7'b000_0001, 3'b001} → CUSTOM_AES_DEC (AES round decrypt)
{7'b000_0010, 3'b000} → CUSTOM_SHA256 (SHA256 compression round)

Using custom-1 (0x2B) for a separate crypto unit:
{7'b000_0000, 3'b000} → CRYPTO_HASH
{7'b000_0000, 3'b001} → CRYPTO_VERIFY
... up to 1024 operations per opcode space
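To make the 10-bit selector concrete, the sketch below extracts {funct7, funct3} from an instruction word and maps it to the custom-0 allocation listed above. The function names and the SEL macro are hypothetical; the operation names are the ones from the example allocation.

```c
#include <stdint.h>

/* Build a 10-bit selector from funct7 (upper 7 bits) and funct3 (lower 3). */
#define SEL(f7, f3) ((((uint32_t)(f7)) << 3) | ((uint32_t)(f3)))

/* Hypothetical helper: extract {funct7, funct3} from an R-type word. */
static uint32_t subop_selector(uint32_t instr) {
    uint32_t funct7 = (instr >> 25) & 0x7Fu;
    uint32_t funct3 = (instr >> 12) & 0x07u;
    return SEL(funct7, funct3);
}

/* Hypothetical helper: name the custom-0 operation, following the
 * example allocation in this section. */
static const char *custom0_name(uint32_t instr) {
    if ((instr & 0x7Fu) != 0x0Bu)
        return "NOT_CUSTOM0";
    switch (subop_selector(instr)) {
    case SEL(0, 0): return "CUSTOM_MAC";
    case SEL(0, 1): return "CUSTOM_CLEAR";
    case SEL(0, 2): return "CUSTOM_DOT4";
    case SEL(1, 0): return "CUSTOM_AES_ENC";
    case SEL(1, 1): return "CUSTOM_AES_DEC";
    case SEL(2, 0): return "CUSTOM_SHA256";
    default:        return "UNASSIGNED";
    }
}
```

Matching on all ten bits means an unassigned funct7 value never aliases onto a valid operation.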

3. GCC .insn Directive — Emitting Custom Instructions

The RISC-V GNU assembler (GAS) provides the .insn directive to emit arbitrary instruction encodings without modifying the assembler or compiler. This is the fastest path to testing a custom instruction in simulation — no toolchain patches required.

.insn Directive Syntax

.insn type opcode, [operands...]

R-type syntax:
.insn r opcode, funct3, funct7, rd, rs1, rs2
  opcode = immediate (0x0B for custom-0)
  funct3 = immediate (0–7)
  funct7 = immediate (0–127)
  rd = destination register
  rs1, rs2 = source registers

Example: emit custom_mac a0, a1, a2:
.insn r 0x0B, 0, 0, a0, a1, a2
  → fields: funct7=0, rs2=a2, rs1=a1, funct3=0, rd=a0, opcode=0x0B
  → encoded word: 0x00C5850B

I-type syntax (immediate operand):
.insn i opcode, funct3, rd, rs1, imm
  imm = signed 12-bit immediate (-2048 to 2047)

Example: .insn i 0x0B, 1, a0, a1, 42
  → fields: imm=42, rs1=a1, funct3=1, rd=a0, opcode=0x0B
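The I-type form can be checked on the host the same way as the R-type form. The sketch below assumes the standard I-type layout (imm[11:0] in bits [31:20], then rs1, funct3, rd, opcode); encode_itype is a hypothetical helper name.

```c
#include <stdint.h>

/* Hypothetical helper: pack an I-type instruction word.
 * imm is a signed value; only its low 12 bits are encoded. */
static uint32_t encode_itype(uint32_t opcode, uint32_t funct3,
                             uint32_t rd, uint32_t rs1, int32_t imm) {
    return (((uint32_t)imm & 0xFFFu) << 20)  /* imm[11:0], bits [31:20] */
         | ((rs1    & 0x1Fu) << 15)          /* bits [19:15] */
         | ((funct3 & 0x07u) << 12)          /* bits [14:12] */
         | ((rd     & 0x1Fu) << 7)           /* bits [11:7]  */
         |  (opcode & 0x7Fu);                /* bits [6:0]   */
}
```

Comparing this value with the word shown in an objdump listing of the .insn i example is a quick way to confirm the operand order.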

Assembly — custom instructions with .insn directive

# RISC-V assembly using .insn for custom-0 instructions
# Assemble with: riscv64-unknown-elf-as -march=rv32i custom_test.s

.section .text
.global custom_mac_test

# Custom instruction definitions (using .insn directive)
# CUSTOM_MAC:   accum += rs1 * rs2; rd = accum before the update,
#               matching the RTL in section 5  (funct3=0, funct7=0)
# CUSTOM_CLEAR: clear accumulator              (funct3=1, funct7=0)

custom_mac_test:
    li   a1, 3          # a1 = 3 (rs1)
    li   a2, 4          # a2 = 4 (rs2)

    # Clear the accumulator first
    .insn r 0x0B, 1, 0, a0, a0, a0   # CUSTOM_CLEAR (register operands unused)

    # accum = 0 + 3*4 = 12; a0 = 0 (accumulator before the update)
    .insn r 0x0B, 0, 0, a0, a1, a2   # CUSTOM_MAC

    # accum = 12 + 3*4 = 24; a0 = 12 (accumulator before the update)
    .insn r 0x0B, 0, 0, a0, a1, a2   # CUSTOM_MAC

    ret

4. C Intrinsics — Calling Custom Instructions from C

To call a custom instruction from C, wrap the .insn directive in a static inline function using GCC inline assembly. The register constraints tell the compiler which operands are inputs and which is the output, so it handles register allocation automatically.

C — custom instruction macros and intrinsics

#ifndef CUSTOM_INSN_H
#define CUSTOM_INSN_H

#include <stdint.h>

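/*
 * CUSTOM_MAC: accumulator += rs1 * rs2
 * Returns the accumulator value from before the update, matching the RTL
 * in section 5, so custom_mac(0, 0) reads the accumulator unchanged.
 * Uses custom-0 opcode (0x0B), funct3=0, funct7=0
 */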
static inline uint32_t custom_mac(uint32_t rs1, uint32_t rs2) {
    uint32_t rd;
    __asm__ volatile (
        ".insn r 0x0B, 0, 0, %0, %1, %2"
        : "=r"(rd)          /* output: rd register */
        : "r"(rs1), "r"(rs2) /* inputs: rs1, rs2 */
        :                    /* no clobbers */
    );
    return rd;
}

/*
 * CUSTOM_CLEAR: clear the on-chip accumulator
 * Uses custom-0 opcode (0x0B), funct3=1, funct7=0
 */
static inline void custom_clear(void) {
    __asm__ volatile (
        ".insn r 0x0B, 1, 0, x0, x0, x0"
        :   /* no outputs */
        :   /* no inputs */
        :   /* no clobbers */
    );
}

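/*
 * CUSTOM_DOT4: 4-element dot product (packed signed 8-bit lanes in a 32-bit word)
 * a = {a3,a2,a1,a0} packed bytes, b = {b3,b2,b1,b0}
 * accumulator += a0*b0 + a1*b1 + a2*b2 + a3*b3
 * Read the running total back with custom_mac(0, 0).
 * Uses custom-0 opcode (0x0B), funct3=2, funct7=0
 */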
static inline uint32_t custom_dot4(uint32_t a, uint32_t b) {
    uint32_t rd;
    __asm__ volatile (
        ".insn r 0x0B, 2, 0, %0, %1, %2"
        : "=r"(rd)
        : "r"(a), "r"(b)
    );
    return rd;
}

#endif /* CUSTOM_INSN_H */

C — using the custom instruction intrinsics

#include "custom_insn.h"
#include <stdio.h>

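/* Dot product of two int8 arrays using the custom instructions.
 * len must be a multiple of 4; a tail loop is omitted for brevity. */
int32_t dot_product(const int8_t *a, const int8_t *b, int len) {
    custom_clear();          /* reset accumulator */
    for (int i = 0; i < len; i += 4) {
        /* Pack 4 bytes into a 32-bit word. Each byte is widened to
           uint32_t before shifting so that a byte >= 0x80 shifted
           left by 24 cannot overflow a signed int. */
        uint32_t va = (uint32_t)(uint8_t)a[i]
                    | ((uint32_t)(uint8_t)a[i+1] << 8)
                    | ((uint32_t)(uint8_t)a[i+2] << 16)
                    | ((uint32_t)(uint8_t)a[i+3] << 24);
        uint32_t vb = (uint32_t)(uint8_t)b[i]
                    | ((uint32_t)(uint8_t)b[i+1] << 8)
                    | ((uint32_t)(uint8_t)b[i+2] << 16)
                    | ((uint32_t)(uint8_t)b[i+3] << 24);
        (void)custom_dot4(va, vb);  /* accumulates internally; rd unused here */
    }
    /* A MAC with zero operands adds nothing, so it just reads the accumulator */
    return (int32_t)custom_mac(0, 0);
}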

int main(void) {
    int8_t a[] = {1, 2, 3, 4};
    int8_t b[] = {5, 6, 7, 8};
    /* Expected: 1*5 + 2*6 + 3*7 + 4*8 = 5+12+21+32 = 70 */
    printf("dot product = %d\n", dot_product(a, b, 4));
    return 0;
}

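/* Compile: riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32
            -O2 -o dot_test dot_test.c
   printf needs the C library, so -nostdlib is not used here; this assumes
   a toolchain built with an rv32i/ilp32 multilib. Running the binary
   requires a core or simulator that implements the custom-0 operations. */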
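When bringing up the hardware it helps to have a plain C reference to compare against. The sketch below computes the same signed int8 dot product without any custom instructions, so it runs on any host; dot_product_ref is a hypothetical name introduced for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model: element-wise signed 8-bit dot product.
 * Each product is widened to 32 bits before accumulation. */
static int32_t dot_product_ref(const int8_t *a, const int8_t *b, size_t len) {
    int32_t acc = 0;
    for (size_t i = 0; i < len; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}
```

On the target, comparing dot_product(a, b, len) with dot_product_ref(a, b, len) over randomised inputs gives a quick check of both the packing code and the accelerator.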

5. Verilog Decode Logic — Intercepting Custom Instructions

In a simple in-order RISC-V pipeline (like the one built in RISC-V From Scratch), custom instructions are decoded at the Decode stage. The opcode field selects the custom execution unit, and funct3/funct7 selects the specific operation.

Verilog — custom instruction decode and dispatch

// Custom instruction decoder — plug into the main decode stage
// Detects custom-0 (0x0B) instructions and routes to custom EX unit
module custom_decode (
  input  logic [31:0] instr,          // raw 32-bit instruction word
  output logic        is_custom,      // 1 = this is a custom instruction
  output logic [6:0]  funct7,
  output logic [2:0]  funct3,
  output logic [4:0]  rd, rs1, rs2,
  output logic [2:0]  custom_op       // decoded operation for EX unit
);

  localparam CUSTOM0_OP = 7'b000_1011; // 0x0B

  // Extract fields
  assign funct7 = instr[31:25];
  assign rs2    = instr[24:20];
  assign rs1    = instr[19:15];
  assign funct3 = instr[14:12];
  assign rd     = instr[11:7];

  // Detect custom-0 opcode
  assign is_custom = (instr[6:0] == CUSTOM0_OP);

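  // Map the full {funct7, funct3} selector to an operation code.
  // Matching all 10 bits keeps unassigned encodings from aliasing onto
  // valid operations. Assignments follow the section 2 allocation;
  // CUSTOM_AES_DEC is not decoded in this example.
  always_comb begin
    custom_op = 3'd7; // default: no valid custom operation
    if (is_custom) begin
      case ({funct7, funct3})
        10'b0000000_000: custom_op = 3'd0; // CUSTOM_MAC
        10'b0000000_001: custom_op = 3'd1; // CUSTOM_CLEAR
        10'b0000000_010: custom_op = 3'd2; // CUSTOM_DOT4
        10'b0000001_000: custom_op = 3'd3; // CUSTOM_AES_ENC
        10'b0000010_000: custom_op = 3'd4; // CUSTOM_SHA256
        default:         custom_op = 3'd7; // ILLEGAL_CUSTOM
      endcase
    end
  end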

endmodule

Verilog — custom execution unit (MAC + DOT4)

// Custom execution unit — connected to the pipeline after decode
module custom_exec (
  input  logic        clk, rst_n,
  input  logic        valid,          // instruction is valid (from decode)
  input  logic [2:0]  custom_op,      // operation from decoder
  input  logic [31:0] rs1_val,        // register file read value for rs1
  input  logic [31:0] rs2_val,        // register file read value for rs2
  output logic        done,           // registered: high for one cycle after valid is sampled
  output logic [31:0] result          // value to write back to rd
);

  logic [63:0] accumulator;           // 64-bit accumulator (wider than 32-bit rd)

  localparam MAC    = 3'd0;
  localparam CLEAR  = 3'd1;
  localparam DOT4   = 3'd2;

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      accumulator <= '0;
      done        <= 1'b0;
      result      <= '0;
    end else if (valid) begin
      done <= 1'b1;  // single-cycle for these ops
      case (custom_op)
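        MAC: begin
          // Nonblocking assignments read the old accumulator, so result
          // (and therefore rd) holds the value from before this update.
          accumulator <= accumulator + (rs1_val * rs2_val);
          result      <= accumulator[31:0];
        end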
        CLEAR: begin
          accumulator <= '0;
          result      <= '0;
        end
        DOT4: begin
          // Unpack 4×8-bit elements and dot product
          automatic logic [63:0] dp;
          dp = $signed(rs1_val[7:0])   * $signed(rs2_val[7:0])
             + $signed(rs1_val[15:8])  * $signed(rs2_val[15:8])
             + $signed(rs1_val[23:16]) * $signed(rs2_val[23:16])
             + $signed(rs1_val[31:24]) * $signed(rs2_val[31:24]);
          accumulator <= accumulator + dp;
          result      <= accumulator[31:0];
        end
        default: begin
          done   <= 1'b0;
          result <= '0;
        end
      endcase
    end else begin
      done <= 1'b0;
    end
  end

endmodule
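A cycle-free C model of the execution unit is useful as a golden reference for the testbench. The sketch below mirrors the RTL as written above, including the detail that MAC and DOT4 return the accumulator value from before the update. The type and function names (custom_exec_model, custom_exec_step, sext8) are hypothetical.

```c
#include <stdint.h>

enum { OP_MAC = 0, OP_CLEAR = 1, OP_DOT4 = 2 };

/* Hypothetical behavioural model of custom_exec: one 64-bit accumulator. */
typedef struct {
    uint64_t accumulator;
} custom_exec_model;

/* Sign-extend the low 8 bits of v without relying on
 * implementation-defined narrowing conversions. */
static int32_t sext8(uint32_t v) {
    int32_t x = (int32_t)(v & 0xFFu);
    return (x >= 128) ? x - 256 : x;
}

/* Execute one operation; returns the value the RTL writes to result. */
static uint32_t custom_exec_step(custom_exec_model *m, unsigned op,
                                 uint32_t rs1, uint32_t rs2) {
    uint32_t result = (uint32_t)m->accumulator;  /* pre-update value */
    switch (op) {
    case OP_MAC:    /* unsigned 32x32 -> 64-bit product, as in the RTL */
        m->accumulator += (uint64_t)rs1 * (uint64_t)rs2;
        break;
    case OP_CLEAR:
        m->accumulator = 0;
        result = 0;
        break;
    case OP_DOT4: { /* four signed 8-bit lanes */
        int64_t dp = 0;
        for (int lane = 0; lane < 4; lane++) {
            dp += (int64_t)sext8(rs1 >> (8 * lane))
                * (int64_t)sext8(rs2 >> (8 * lane));
        }
        m->accumulator += (uint64_t)dp;
        break;
    }
    default:        /* unknown op: the RTL drives result to zero */
        result = 0;
        break;
    }
    return result;
}
```

Feeding the model the same operation sequence as the simulation and comparing return values against result catches mismatches between the documented and implemented semantics early.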

6. Testbench — Verifying Custom Instruction Decode

Verilog — custom instruction testbench

`timescale 1ns/1ps
module tb_custom_exec;
  logic        clk = 0, rst_n;
  logic        valid;
  logic [2:0]  custom_op;
  logic [31:0] rs1_val, rs2_val;
  logic        done;
  logic [31:0] result;

  custom_exec dut (.*);

  always #5 clk = ~clk;  // 100 MHz

  task send_op(input [2:0] op, input [31:0] a, b);
    @(posedge clk);
    valid     <= 1; custom_op <= op;
    rs1_val   <= a; rs2_val   <= b;
    @(posedge clk);
    valid <= 0;
    @(posedge clk);
  endtask

  initial begin
    rst_n = 0; valid = 0;
    repeat(3) @(posedge clk);
    rst_n = 1;

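    // Test 1: CLEAR
    send_op(3'd1, 0, 0);
    assert (result == 32'd0) else $error("CLEAR result wrong: %0d", result);
    $display("After CLEAR: accum = 0");

    // Test 2: MAC 3*4; accum becomes 12, result is the previous accum (0)
    send_op(3'd0, 32'd3, 32'd4);
    @(posedge clk);
    assert (result == 32'd0) else $error("MAC1 result wrong: %0d", result);
    $display("MAC(3,4) done, accum = 12");

    // Test 3: MAC 3*4 again; accum becomes 24, result is the previous accum (12)
    send_op(3'd0, 32'd3, 32'd4);
    @(posedge clk);
    assert (result == 32'd12) else $error("MAC2 result wrong: %0d", result);
    $display("MAC(3,4) done, accum = 24, result (prev accum) = %0d", result);

    // Test 4: DOT4 of [1,2,3,4] and [5,6,7,8] adds 70; result is the previous accum (24)
    send_op(3'd2, 32'h04030201, 32'h08070605);
    @(posedge clk);
    assert (result == 32'd24) else $error("DOT4 result wrong: %0d", result);

    // Test 5: MAC(0,0) adds nothing and reads back accum = 24 + 70 = 94
    send_op(3'd0, 32'd0, 32'd0);
    @(posedge clk);
    assert (result == 32'd94) else $error("Readback wrong: %0d", result);

    $display("Test sequence complete; any failures are reported above");
    $finish;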
  end
endmodule

7. Custom Opcode Space Usage Table

Opcode	Hex	Recommended Use	Notes
custom-0	0x0B	Math / ML accelerator (MAC, DOT, GEMM)	Recommended by the spec for custom extensions
custom-1	0x2B	Crypto accelerator (AES, SHA, HMAC)	Separate namespace from compute
custom-2	0x5B	DSP / signal processing (FFT, FIR)	May conflict with future rv128
custom-3	0x7B	Debug / profiling / special ops	Avoid in shipping silicon — rv128 risk

custom-2 and custom-3 — Use With Caution

The RISC-V spec labels these opcodes custom-2/rv128 and custom-3/rv128: they are reserved for future use by RV128 (128-bit RISC-V) but are otherwise avoided by standard extensions, so they can be used for custom extensions on RV32 and RV64. For production chips shipping today, prefer custom-0 and custom-1, which the spec recommends for custom extensions and which carry no RV128 caveat.

8. Interview Q&A

#	Question	Answer Points
1	How many custom operations can you define per custom opcode space?	Using R-type encoding: funct3 (3 bits) × funct7 (7 bits) = 8 × 128 = 1024 distinct operations per opcode space. Across all 4 custom opcode spaces: 4096 total. Using I-type, the immediate field can encode additional sub-modes.
2	What is the .insn directive and when would you use it?	A GAS (GNU Assembler) directive that emits any instruction encoding without compiler modification. Used to prototype custom instructions without patching GCC/LLVM. Syntax: .insn r 0x0B, funct3, funct7, rd, rs1, rs2. Good for simulation and early bring-up; production code uses compiler intrinsics or a full toolchain patch.
3	What is the difference between a custom ISA extension and the RoCC interface?	A custom ISA extension modifies the CPU pipeline directly: you add decode and execute logic inside the processor core. RoCC (Rocket Custom Coprocessor) is the tightly coupled coprocessor interface of the Rocket Chip generator, where the custom instruction is intercepted at decode and dispatched to an external (but closely attached) coprocessor via the cmd/resp channels. RoCC is easier to integrate (no pipeline modification) but only works with cores that implement the interface, such as Rocket and BOOM in Chipyard. A custom ISA extension can be added to any RISC-V core you have source access to.
4	How do you tell GCC to allocate registers for a custom instruction?	Use GCC inline assembly with output/input constraints: =r for output register (rd), r for input registers (rs1, rs2). The compiler resolves which physical registers to use; the .insn line refers to them by placeholder. Example: __asm__ volatile (".insn r 0x0B, 0, 0, %0, %1, %2" : "=r"(rd) : "r"(rs1), "r"(rs2));

Day 2 Knowledge Checklist

☐ Know the 4 custom opcode hex values (0x0B, 0x2B, 0x5B, 0x7B)
☐ Draw the R-type custom instruction bit layout (funct7/rs2/rs1/funct3/rd/opcode)
☐ Calculate how many distinct operations fit in one opcode space
☐ Write a .insn directive to emit a custom R-type instruction
☐ Wrap a custom instruction in a C inline assembly macro
☐ Write a Verilog decode block that detects custom-0 opcode
☐ Explain the difference between custom ISA and RoCC
