RISC-V + Accelerator · Day 07 of 15

AXI4 Integration
MMIO, Control Plane & Tiling Strategy

By EcrioniX · Updated June 2026 · ~45 min read
AXI4-Lite Slave · AXI4 Master · MMIO Registers · Tiling · Address Map · Interrupt · C Driver

AXI4 for Accelerator Integration

When an accelerator is connected as a standalone IP block (not via RoCC but as an MMIO peripheral), AXI4 is the standard interconnect. The accelerator has two AXI ports: an AXI4-Lite slave for the control plane (CPU configures registers) and an AXI4 full master for the data plane (DMA reads/writes memory).

Two-Port Architecture

Control plane (AXI4-Lite slave): the CPU writes start/stop, addresses, and sizes to MMIO registers.
Data plane (AXI4 full master): the accelerator's DMA reads the A matrix and B weights from DRAM and writes the C result back, all without CPU involvement.

MMIO Register Map

Offset  Register  Bits                              R/W  Description
0x00    CTRL      [0]=start, [1]=reset, [2]=int_en  W    Control: start computation, soft reset, interrupt enable
0x04    STATUS    [0]=busy, [1]=done, [2]=error     R    Accelerator status: poll or wait for interrupt
0x08    MATRIX_N  [15:0]=N                          W    Matrix dimension N (NxN matrices)
0x10    A_ADDR    [31:0]                            W    Physical base address of matrix A in DRAM
0x18    B_ADDR    [31:0]                            W    Physical base address of matrix B (weights)
0x20    C_ADDR    [31:0]                            W    Physical base address of result matrix C
0x28    CYCLES    [31:0]                            R    Cycle count of last operation (performance)
0x2C    TILE_CFG  [15:0]=tile_M, [31:16]=tile_N     W    Tile dimensions for large matrix tiling
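The register map can be mirrored in a small set of C definitions so that driver code never hard-codes bit positions. The sketch below is illustrative only: the names `ACC_OFF_*`, `ACC_CTRL_*`, `tile_cfg_pack` and `ctrl_start_word` are assumptions introduced here, while the offsets and bit positions are taken from the table.

```c
#include <assert.h>
#include <stdint.h>

/* Register offsets, copied from the register map table. */
enum {
    ACC_OFF_CTRL     = 0x00,
    ACC_OFF_STATUS   = 0x04,
    ACC_OFF_MATRIX_N = 0x08,
    ACC_OFF_A_ADDR   = 0x10,
    ACC_OFF_B_ADDR   = 0x18,
    ACC_OFF_C_ADDR   = 0x20,
    ACC_OFF_CYCLES   = 0x28,
    ACC_OFF_TILE_CFG = 0x2C
};

/* CTRL bits */
#define ACC_CTRL_START   (1u << 0)
#define ACC_CTRL_RESET   (1u << 1)
#define ACC_CTRL_INT_EN  (1u << 2)

/* STATUS bits */
#define ACC_STATUS_BUSY  (1u << 0)
#define ACC_STATUS_DONE  (1u << 1)
#define ACC_STATUS_ERROR (1u << 2)

/* Pack TILE_CFG: tile_M in bits [15:0], tile_N in bits [31:16]. */
static inline uint32_t tile_cfg_pack(uint16_t tile_m, uint16_t tile_n)
{
    return ((uint32_t)tile_n << 16) | tile_m;
}

static inline uint16_t tile_cfg_tile_m(uint32_t cfg) { return (uint16_t)(cfg & 0xFFFFu); }
static inline uint16_t tile_cfg_tile_n(uint32_t cfg) { return (uint16_t)(cfg >> 16); }

/* CTRL word that starts a run, optionally with the interrupt enabled. */
static inline uint32_t ctrl_start_word(int irq_enable)
{
    return ACC_CTRL_START | (irq_enable ? ACC_CTRL_INT_EN : 0u);
}
```

Building the whole CTRL word in one place matters if, as in a simple register block, start and int_en are latched from the same write: a write carrying only the start bit would then clear int_en.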
Verilog — AXI4-Lite slave register block
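// Minimal AXI4-Lite register block. BRESP/RRESP and WSTRB are omitted for
// brevity: tie both responses to OKAY (2'b00) in the wrapper. CTRL[1] (soft
// reset) and TILE_CFG from the register map are not implemented here.
module axilite_regs #(parameter AW=32, DW=32) (
  input            clk, rst,   // rst: synchronous, active-high (assumed)
  // AXI4-Lite Write Address
  input            awvalid, output awready,
  input [AW-1:0] awaddr,
  // AXI4-Lite Write Data
  input            wvalid, output wready,
  input [DW-1:0] wdata,
  // AXI4-Lite Write Response
  output reg       bvalid, input bready,
  // AXI4-Lite Read Address
  input            arvalid, output arready,
  input [AW-1:0] araddr,
  // AXI4-Lite Read Data
  output reg       rvalid, input rready,
  output reg [DW-1:0] rdata,
  // Register outputs to accelerator core
  output reg        reg_start, reg_int_en,
  output reg [15:0] reg_N,
  output reg [31:0] reg_A_addr, reg_B_addr, reg_C_addr,
  input             accel_busy, accel_done,
  input  [31:0]    accel_cycles
);
  reg [AW-1:0] aw_addr_r;
  reg          aw_pend;   // write address captured, waiting for its data

  // Accept a new address only when no write is in flight, and accept data
  // only after its address has been captured, so no W beat is ever dropped
  // and no B response is overwritten.
  assign awready = ~aw_pend & ~bvalid;
  assign wready  =  aw_pend;
  assign arready = ~rvalid;

  always @(posedge clk) begin
    if (rst) begin
      aw_pend    <= 0;  bvalid     <= 0;  rvalid     <= 0;
      reg_start  <= 0;  reg_int_en <= 0;  reg_N      <= 0;
      reg_A_addr <= 0;  reg_B_addr <= 0;  reg_C_addr <= 0;
    end else begin
      reg_start <= 0;  // auto-clear: start is a one-cycle pulse
      // Write channel
      if (awvalid && awready) begin aw_addr_r <= awaddr; aw_pend <= 1; end
      if (wvalid && wready) begin
        case (aw_addr_r[5:0])
          6'h00: begin reg_start <= wdata[0]; reg_int_en <= wdata[2]; end
          6'h08: reg_N      <= wdata[15:0];
          6'h10: reg_A_addr <= wdata;
          6'h18: reg_B_addr <= wdata;
          6'h20: reg_C_addr <= wdata;
          default: ;  // writes to other offsets are ignored
        endcase
        aw_pend <= 0; bvalid <= 1;
      end
      if (bvalid && bready) bvalid <= 0;
      // Read channel
      if (arvalid && arready) begin
        rvalid <= 1;
        case (araddr[5:0])
          // STATUS: [1]=done, [0]=busy; [2]=error reads 0 (no error input here)
          6'h04: rdata <= {30'b0, accel_done, accel_busy};
          6'h28: rdata <= accel_cycles;
          default: rdata <= 32'hDEAD;  // marker for unmapped/write-only offsets
        endcase
      end
      if (rvalid && rready) rvalid <= 0;
    end
  end
endmodule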

Tiling Strategy for Large Matrices

A P×P systolic array cannot process matrices larger than P×P in one pass. Tiling divides each matrix into P×P sub-matrices (tiles) and processes them sequentially, accumulating partial results into each output tile.

C[M×K] = A[M×N] × B[N×K]
Tile sizes: tile_M, tile_N, tile_K (usually = array dimension P)

Tiling loop:
for mi in range(0, M, P):
  for ki in range(0, K, P):
    C[mi:mi+P, ki:ki+P] = 0   // init output tile
    for ni in range(0, N, P):
      // Load A tile [mi:mi+P, ni:ni+P] → SPM bank A
      // Load B tile [ni:ni+P, ki:ki+P] → weight array (via RoCC)
      // run_matmul → partial sum accumulates in C tile
    // Write C tile [mi:mi+P, ki:ki+P] → DRAM

Total tiles = (M/P) × (N/P) × (K/P)
Total cycles ≈ tiles × (3P - 1) + DMA_overhead
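The tiling loop can be checked on a host machine before any hardware exists. The following is a minimal software model, assuming row-major storage, int8 inputs with int32 accumulation, dimensions that are multiples of P, and an illustrative P = 4. The function `tile_mac` stands in for one array invocation; all function names are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define P 4  /* assumed array dimension, for illustration only */

/* Stand-in for one array invocation: c_tile += a_tile x b_tile (P x P each). */
static void tile_mac(const int8_t *a, const int8_t *b, int32_t *c)
{
    for (int i = 0; i < P; i++)
        for (int j = 0; j < P; j++)
            for (int k = 0; k < P; k++)
                c[i * P + j] += (int32_t)a[i * P + k] * b[k * P + j];
}

/* Copy a P x P tile out of a row-major matrix with `cols` columns. */
static void load_tile(const int8_t *src, int cols, int r0, int c0, int8_t *tile)
{
    for (int i = 0; i < P; i++)
        memcpy(&tile[i * P], &src[(r0 + i) * cols + c0], P);
}

/* C[MxK] = A[MxN] x B[NxK]; M, N, K must be multiples of P.
 * Returns the number of tile invocations performed. */
int tiled_matmul(const int8_t *A, const int8_t *B, int32_t *C, int M, int N, int K)
{
    assert(M % P == 0 && N % P == 0 && K % P == 0);
    int8_t  a_tile[P * P], b_tile[P * P];
    int32_t c_tile[P * P];
    int tiles = 0;

    for (int mi = 0; mi < M; mi += P)
        for (int ki = 0; ki < K; ki += P) {
            memset(c_tile, 0, sizeof c_tile);          /* init output tile */
            for (int ni = 0; ni < N; ni += P) {        /* accumulation dimension */
                load_tile(A, N, mi, ni, a_tile);
                load_tile(B, K, ni, ki, b_tile);
                tile_mac(a_tile, b_tile, c_tile);
                tiles++;
            }
            for (int i = 0; i < P; i++)                /* write finished C tile */
                memcpy(&C[(mi + i) * K + ki], &c_tile[i * P], P * sizeof(int32_t));
        }
    return tiles;
}

/* Untiled reference used to check the tiled result. */
void naive_matmul(const int8_t *A, const int8_t *B, int32_t *C, int M, int N, int K)
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < K; j++) {
            int32_t acc = 0;
            for (int n = 0; n < N; n++)
                acc += (int32_t)A[i * N + n] * B[n * K + j];
            C[i * K + j] = acc;
        }
}
```

Comparing `tiled_matmul` against `naive_matmul` on random inputs gives a golden reference that the driver and the RTL testbench can both reuse.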
C — MMIO driver for AXI4-connected accelerator
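#include <stddef.h>
#include <stdint.h>

#define ACC_BASE    0x60000000UL
#define REG_CTRL    (ACC_BASE + 0x00)
#define REG_STATUS  (ACC_BASE + 0x04)
#define REG_N       (ACC_BASE + 0x08)
#define REG_A_ADDR  (ACC_BASE + 0x10)
#define REG_B_ADDR  (ACC_BASE + 0x18)
#define REG_C_ADDR  (ACC_BASE + 0x20)
#define REG_CYCLES  (ACC_BASE + 0x28)

#define MMIO_W(addr, val)  (*(volatile uint32_t*)(addr) = (val))
#define MMIO_R(addr)       (*(volatile uint32_t*)(addr))

#define STATUS_BUSY  (1u << 0)
#define STATUS_DONE  (1u << 1)
#define CTRL_START   (1u << 0)

// Platform cache-maintenance helpers; assumed to be supplied by the BSP
void cache_flush(void *addr, size_t len);
void cache_inval(void *addr, size_t len);

uint32_t accel_matmul(const int8_t *A, const int8_t *B, int32_t *C, int N) {
  size_t in_bytes  = (size_t)N * N;                    // int8 inputs
  size_t out_bytes = (size_t)N * N * sizeof(int32_t);  // int32 result

  // Configure registers (address registers are 32 bits wide, so the
  // buffers must live in the low 4 GB of physical memory)
  MMIO_W(REG_N,      (uint32_t)N);
  MMIO_W(REG_A_ADDR, (uint32_t)(uintptr_t)A);
  MMIO_W(REG_B_ADDR, (uint32_t)(uintptr_t)B);
  MMIO_W(REG_C_ADDR, (uint32_t)(uintptr_t)C);

  // Write back dirty input lines so the DMA reads current data, and drop
  // cached lines of C so a later eviction cannot overwrite the DMA result
  cache_flush((void*)A, in_bytes); cache_flush((void*)B, in_bytes);
  cache_inval((void*)C, out_bytes);

  // Kick off accelerator
  MMIO_W(REG_CTRL, CTRL_START);

  // Poll until done (or use interrupt)
  while (!(MMIO_R(REG_STATUS) & STATUS_DONE));

  // Invalidate C again so the CPU reads the DMA-written result rather than
  // any line fetched while the accelerator was running
  cache_inval((void*)C, out_bytes);

  return MMIO_R(REG_CYCLES);
}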
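A bare polling loop hangs forever if the accelerator never raises done. One possible hardening, sketched below, bounds the number of polls and also checks the error bit that the register map places at STATUS[2]. The function `accel_wait`, its result codes, and the poll-count limit are assumptions for illustration; taking the register as a pointer lets the same code run against a mock variable on a host.

```c
#include <assert.h>
#include <stdint.h>

#define STATUS_BUSY   (1u << 0)
#define STATUS_DONE   (1u << 1)
#define STATUS_ERROR  (1u << 2)

enum accel_wait_result { ACCEL_OK = 0, ACCEL_ERROR = -1, ACCEL_TIMEOUT = -2 };

/* Poll a STATUS register until done or error, giving up after max_polls reads. */
int accel_wait(volatile const uint32_t *status_reg, uint32_t max_polls)
{
    for (uint32_t i = 0; i < max_polls; i++) {
        uint32_t s = *status_reg;          /* one MMIO read per iteration */
        if (s & STATUS_ERROR) return ACCEL_ERROR;
        if (s & STATUS_DONE)  return ACCEL_OK;
    }
    return ACCEL_TIMEOUT;
}
```

On the target, the call would look like `accel_wait((volatile const uint32_t *)REG_STATUS, limit)`, with `limit` chosen from the expected cycle count plus a margin.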

Day 7 — Interview Questions

Q1. What is the difference between AXI4-Lite and AXI4 full, and which is used for accelerator control vs data?
AXI4-Lite is a simplified subset of AXI4: every transaction is a single beat (no bursts), the data bus is fixed at 32 or 64 bits, and there are no ID signals, so responses return in order. That makes it a good fit for the control plane: the CPU writes configuration registers (addresses, matrix size, start bit) in the accelerator's MMIO register block. AXI4 full supports burst transfers (up to 256 beats for INCR bursts), multiple outstanding transactions tracked by AXI IDs, and narrow or unaligned transfers. The accelerator's DMA uses full AXI4 as a master to fetch megabytes of matrix data from DRAM efficiently. A typical accelerator has one AXI4-Lite slave (control) plus one or two AXI4 full master ports (data read/write).
Q2. Why is the start bit auto-cleared in the MMIO register?
The start bit triggers a one-shot action: it tells the accelerator to begin one computation. If it remained set (sticky), the core would still see start asserted when it finishes and could launch again, and any later write to CTRL that still carried the bit (for example, a write meant only to change int_en) would retrigger it. Clearing the bit automatically on the clock cycle after it is written turns it into a single-cycle pulse, so the accelerator starts exactly once per CPU write. The CPU checks the STATUS register (busy/done bits) to learn when the operation completes; it does not need the start bit to stay set after triggering.
Q3. Explain the tiling algorithm for matrix multiply on a fixed-size systolic array.
For matrices A[M×N] and B[N×K] on a P×P systolic array: divide each dimension into P-sized tiles. Three nested loops: outer loops over M and K tiles (output tile coordinates), inner loop over N tiles (accumulation dimension). For each (mi, ki, ni) tile combination: load the B tile into the weight array, stream the A tile through the array, and accumulate the partial sum into the C tile. After all N tiles are processed for one (mi, ki) output tile, write the final C tile to DRAM. Assuming M, N and K are multiples of P, the total number of array invocations is (M/P × K/P × N/P): the same multiply-accumulate count as direct computation, but broken into P×P chunks that fit the hardware.
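As a worked instance of the invocation count and of the tiles × (3P - 1) compute estimate from the tiling section, the helper below uses ceiling division so that dimensions that are not multiples of P are also covered. That rounding implies edge tiles are padded, which is an assumption of this sketch rather than something the original design states; the function names are likewise illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Number of P-sized tiles needed to cover `dim` (ceiling division). */
static uint32_t tiles_along(uint32_t dim, uint32_t p)
{
    assert(p > 0);
    return (dim + p - 1) / p;
}

/* Total array invocations for C[MxK] = A[MxN] x B[NxK] on a p x p array. */
uint64_t tile_count(uint32_t M, uint32_t N, uint32_t K, uint32_t p)
{
    return (uint64_t)tiles_along(M, p) * tiles_along(N, p) * tiles_along(K, p);
}

/* Compute-only estimate: tiles x (3P - 1). DMA overhead is not included. */
uint64_t compute_cycles_estimate(uint32_t M, uint32_t N, uint32_t K, uint32_t p)
{
    return tile_count(M, N, K, p) * (3ull * p - 1);
}
```

For M = N = K = 64 on an 8×8 array this gives 8 × 8 × 8 = 512 invocations and 512 × 23 = 11,776 compute cycles before DMA overhead.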
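Q4. What is address alignment and why does it matter for AXI4 bursts?
AXI4 does not require an INCR burst to start on a multiple of its total length. The hard rules are that a burst must never cross a 4 KB address boundary and that a WRAP burst must start at an address aligned to the transfer size. An unaligned INCR start address is legal; the first beat simply carries fewer valid bytes. Alignment still matters for performance. For a 64-bit (8-byte) data bus with 16-beat bursts, each burst moves 8×16 = 128 bytes. If tile base addresses are multiples of 128 bytes, every beat uses all byte lanes and no burst can straddle a 4 KB boundary, because 128 divides 4096. A misaligned transfer must be split into extra, shorter bursts at each boundary, which adds address phases and lowers effective bandwidth. In practice, the DMA engine should align matrix tile base addresses to at least 64-byte cache line boundaries (and ideally to full burst boundaries) for maximum throughput. The software driver is responsible for allocating aligned buffers (using memalign or posix_memalign) and ensuring tile address calculations produce aligned addresses.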
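One rule a DMA engine must always honor is that an AXI burst may not cross a 4 KB address boundary. The sketch below shows one way a descriptor generator could split a transfer into legal INCR bursts. The 64-bit bus width, the 16-beat burst cap, and the names `axi_burst_t` and `split_bursts` are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define AXI_4K          4096u
#define BEAT_BYTES      8u    /* assumed 64-bit data bus */
#define MAX_BURST_BEATS 16u   /* assumed DMA burst cap (AXI4 INCR allows up to 256) */

typedef struct { uint64_t addr; uint32_t beats; } axi_burst_t;

/* Split [addr, addr+len) into INCR bursts that never cross a 4 KB boundary.
 * addr and len must be multiples of BEAT_BYTES. Returns the number of bursts
 * written to out[], or -1 if out[] is too small. */
int split_bursts(uint64_t addr, uint32_t len, axi_burst_t *out, int max_out)
{
    assert(addr % BEAT_BYTES == 0 && len % BEAT_BYTES == 0);
    int n = 0;
    while (len > 0) {
        uint32_t to_4k = AXI_4K - (uint32_t)(addr % AXI_4K); /* bytes left in this 4 KB page */
        uint32_t chunk = MAX_BURST_BEATS * BEAT_BYTES;
        if (chunk > len)   chunk = len;
        if (chunk > to_4k) chunk = to_4k;
        if (n == max_out) return -1;
        out[n].addr  = addr;
        out[n].beats = chunk / BEAT_BYTES;
        n++;
        addr += chunk;
        len  -= chunk;
    }
    return n;
}
```

A 256-byte transfer starting at 0x1000 becomes two full 16-beat bursts, while the same transfer starting at 0x0FC0 needs three bursts (8, 16 and 8 beats) because the first one must stop at the 4 KB boundary. That is the cost of misalignment in concrete terms.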
Q5. What is the write response channel (B channel) in AXI4 and can it be ignored?
The B channel carries a write response (BRESP: OKAY, SLVERR, DECERR) from the slave back to the master, acknowledging that the write was accepted and executed. In simulation and basic designs it is tempting to ignore it, but in production SoC designs it must be properly handled: (1) A write is complete only when its response arrives. A master that supports a single outstanding write must wait for BVALID before issuing the next one, and a master with several writes in flight must still count responses to know when its data has reached the slave. (2) BRESP indicates errors: SLVERR means the slave rejected the write (e.g., a write to a read-only register), and DECERR means no slave exists at that address. (3) A master that never accepts BVALID leaves responses queued and can stall the AXI interconnect once the response buffering fills. Always connect bready = 1 at minimum, or implement a proper response handler.
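A response handler does not need to be elaborate. The sketch below decodes the standard two-bit AXI response encoding and keeps the minimal bookkeeping a DMA model or driver-side checker might want: a count of writes in flight and a sticky error flag. The `write_tracker_t` structure and the function names are hypothetical; the response encodings are the standard AXI values.

```c
#include <assert.h>
#include <stdint.h>

/* AXI response encodings (BRESP and RRESP share them). */
enum axi_resp { AXI_OKAY = 0, AXI_EXOKAY = 1, AXI_SLVERR = 2, AXI_DECERR = 3 };

static inline int axi_resp_is_error(unsigned resp)
{
    resp &= 0x3u;
    return resp == AXI_SLVERR || resp == AXI_DECERR;
}

static inline const char *axi_resp_name(unsigned resp)
{
    static const char *const names[4] = { "OKAY", "EXOKAY", "SLVERR", "DECERR" };
    return names[resp & 0x3u];
}

/* Hypothetical bookkeeping: writes in flight plus a sticky error flag. */
typedef struct { uint32_t outstanding; int failed; } write_tracker_t;

static inline void on_write_issued(write_tracker_t *t) { t->outstanding++; }

static inline void on_bresp(write_tracker_t *t, unsigned bresp)
{
    assert(t->outstanding > 0);   /* a response without a request is a protocol bug */
    t->outstanding--;
    if (axi_resp_is_error(bresp)) t->failed = 1;
}
```

The transfer can be reported as finished only when `outstanding` returns to zero, and as successful only if `failed` is still clear at that point.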
Q6. How do you implement interrupt-driven accelerator completion instead of polling?
The accelerator drives an interrupt output (irq) high when done=1 and int_en=1 (int_en is configured in the CTRL register). The interrupt line connects to the RISC-V PLIC (Platform-Level Interrupt Controller), which routes it to the CPU's external interrupt input. In software: (1) register an interrupt handler that claims the interrupt from the PLIC, reads the STATUS register, clears the done flag, and signals completion to the PLIC; (2) set int_en=1 before starting; (3) put the CPU in WFI (Wait For Interrupt) or let it continue other work. When the accelerator finishes, the interrupt wakes the CPU from WFI and the handler runs. Interrupt-driven completion lets the CPU overlap other computation with the accelerator, improving overall throughput compared with polling, which wastes CPU cycles in a busy-wait loop.
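The device-specific part of such a handler can be kept small and testable by separating it from the trap and PLIC plumbing. The sketch below shows only that part, under stated assumptions: `accel_completion_t` and `accel_irq_body` are hypothetical names, the registers are passed as pointers so the code runs against mock variables on a host, and the step that clears the done condition in hardware is left out because the register map does not say how it is cleared.

```c
#include <assert.h>
#include <stdint.h>

#define STATUS_DONE  (1u << 1)
#define STATUS_ERROR (1u << 2)

/* Completion state shared between the handler and the waiting code. */
typedef struct {
    volatile int      done;
    volatile int      error;
    volatile uint32_t cycles;
} accel_completion_t;

/* Device-specific body of the accelerator interrupt handler.
 * Called from the external-interrupt trap after the PLIC claim. */
void accel_irq_body(volatile const uint32_t *status_reg,
                    volatile const uint32_t *cycles_reg,
                    accel_completion_t *c)
{
    uint32_t s = *status_reg;
    if (!(s & (STATUS_DONE | STATUS_ERROR)))
        return;                       /* spurious interrupt: nothing to report */
    c->error  = (s & STATUS_ERROR) != 0;
    c->cycles = *cycles_reg;
    c->done   = 1;                    /* publish last, after the results are stored */
}
```

The waiting side would loop on `while (!c.done)` with a `wfi` instruction in the loop body, rechecking the flag after every wake-up since WFI may return for unrelated interrupts.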