Driver Architecture
A bare-metal accelerator driver has three layers: register access (volatile MMIO macros), transport (polling or interrupt), and API (high-level functions like accel_matmul()). No OS, no malloc, no libc — just direct hardware control.
Why volatile?
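Without volatile, the compiler may assume that repeated reads of the same address return the same value and keep STATUS in a CPU register, so your polling loop never sees the done bit go high. volatile forces the compiler to emit a real load or store for every access written in the source. It does not bypass the data cache, so the MMIO region must also be mapped as uncached device memory.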
C — Full driver header: accel.h
#pragma once
#include <stdint.h>
#include <stddef.h>

// Accelerator MMIO base (from SoC address map)
#define ACCEL_BASE   0x60000000UL
#define REG_CTRL     (ACCEL_BASE + 0x00)
#define REG_STATUS   (ACCEL_BASE + 0x04)
#define REG_N        (ACCEL_BASE + 0x08)
#define REG_A_ADDR   (ACCEL_BASE + 0x10)
#define REG_B_ADDR   (ACCEL_BASE + 0x18)
#define REG_C_ADDR   (ACCEL_BASE + 0x20)
#define REG_CYCLES   (ACCEL_BASE + 0x28)

#define CTRL_START   (1u << 0)
#define CTRL_RESET   (1u << 1)
#define CTRL_INT_EN  (1u << 2)

#define STATUS_BUSY  (1u << 0)
#define STATUS_DONE  (1u << 1)
#define STATUS_ERR   (1u << 2)

#define MMIO_W(a,v)  (*(volatile uint32_t*)(a) = (uint32_t)(v))
#define MMIO_R(a)    (*(volatile uint32_t*)(a))

// Cache line size for flush alignment
#define CACHE_LINE   64
#define ALIGN64      __attribute__((aligned(64)))

void     accel_reset(void);
uint32_t accel_matmul_poll(const int8_t *A, const int8_t *B, int32_t *C, int N);
void     accel_matmul_async(const int8_t *A, const int8_t *B, int32_t *C, int N);
void     accel_irq_handler(void);   // register with PLIC
int      accel_wait_done(uint32_t timeout_ms);
C — Driver implementation: accel.c
#include "accel.h"

// Set by the ISR, polled by accel_wait_done()
static volatile int _done_flag = 0;

// Write back and invalidate every cache line covering [ptr, ptr+bytes) (needs Zicbom)
static void cache_flush_range(const void *ptr, size_t bytes) {
    uintptr_t p   = (uintptr_t)ptr & ~(uintptr_t)(CACHE_LINE-1);
    uintptr_t end = (uintptr_t)ptr + bytes;
    for (; p < end; p += CACHE_LINE)
        __asm__ volatile("cbo.flush (%0)" :: "r"(p) : "memory");
}

// Invalidate (without write-back) every cache line covering [ptr, ptr+bytes)
static void cache_inval_range(void *ptr, size_t bytes) {
    uintptr_t p   = (uintptr_t)ptr & ~(uintptr_t)(CACHE_LINE-1);
    uintptr_t end = (uintptr_t)ptr + bytes;
    for (; p < end; p += CACHE_LINE)
        __asm__ volatile("cbo.inval (%0)" :: "r"(p) : "memory");
}

void accel_reset(void) {
    MMIO_W(REG_CTRL, CTRL_RESET);
    for (volatile int i=0; i<16; i++);   // hold reset ~16 cycles
    MMIO_W(REG_CTRL, 0);
}

uint32_t accel_matmul_poll(const int8_t *A, const int8_t *B, int32_t *C, int N) {
    size_t ab_sz = (size_t)N*N*sizeof(int8_t);
    size_t c_sz  = (size_t)N*N*sizeof(int32_t);
    cache_flush_range(A, ab_sz);   // push CPU writes to DRAM
    cache_flush_range(B, ab_sz);
    cache_inval_range(C, c_sz);    // discard stale CPU view of C
    MMIO_W(REG_N, N);
    MMIO_W(REG_A_ADDR, (uint32_t)(uintptr_t)A);
    MMIO_W(REG_B_ADDR, (uint32_t)(uintptr_t)B);
    MMIO_W(REG_C_ADDR, (uint32_t)(uintptr_t)C);
    MMIO_W(REG_CTRL, CTRL_START);
    while (!(MMIO_R(REG_STATUS) & STATUS_DONE));   // poll (no timeout here)
    return MMIO_R(REG_CYCLES);
}

void accel_matmul_async(const int8_t *A, const int8_t *B, int32_t *C, int N) {
    // Same sizes and pointer casts as the polling version
    size_t ab_sz = (size_t)N*N*sizeof(int8_t);
    size_t c_sz  = (size_t)N*N*sizeof(int32_t);
    cache_flush_range(A, ab_sz);
    cache_flush_range(B, ab_sz);
    cache_inval_range(C, c_sz);
    _done_flag = 0;
    MMIO_W(REG_N, N);
    MMIO_W(REG_A_ADDR, (uint32_t)(uintptr_t)A);
    MMIO_W(REG_B_ADDR, (uint32_t)(uintptr_t)B);
    MMIO_W(REG_C_ADDR, (uint32_t)(uintptr_t)C);
    MMIO_W(REG_CTRL, CTRL_START | CTRL_INT_EN);   // fire + enable IRQ
}

void accel_irq_handler(void) {
    uint32_t st = MMIO_R(REG_STATUS);
    if (st & STATUS_DONE) _done_flag = 1;
    MMIO_W(REG_CTRL, 0);   // clear int_en to deassert IRQ
}

int accel_wait_done(uint32_t timeout_ms) {
    // Assumes RV64: a single csrr reads the full 64-bit counter.
    // On RV32 the upper half lives in the timeh CSR.
    uint64_t t0, now, freq = 1000000;   // 1 MHz mtime tick
    __asm__ volatile("csrr %0, time" : "=r"(t0));
    do {
        if (_done_flag) return 0;
        __asm__ volatile("csrr %0, time" : "=r"(now));
    } while ((now - t0) < (uint64_t)timeout_ms * (freq/1000));
    return -1;   // timeout
}
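During bring-up it can help to compare the accelerator's output against a plain software matmul. The sketch below assumes row-major N×N matrices and that the accelerator computes C = A × B with int32 accumulation; neither is stated by the register map above. The names ref_matmul, results_match and the demo_* arrays are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Reference C = A x B for row-major N x N matrices (layout is an assumption). */
static void ref_matmul(const int8_t *A, const int8_t *B, int32_t *C, int N)
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            int32_t acc = 0;
            for (int k = 0; k < N; k++)
                acc += (int32_t)A[(size_t)i * N + k] * (int32_t)B[(size_t)k * N + j];
            C[(size_t)i * N + j] = acc;
        }
    }
}

/* Returns 1 if the two N x N result matrices are identical, else 0. */
static int results_match(const int32_t *X, const int32_t *Y, int N)
{
    for (size_t i = 0; i < (size_t)N * N; i++)
        if (X[i] != Y[i])
            return 0;
    return 1;
}

/* Small fixed inputs for a 2 x 2 smoke test. */
static const int8_t  demo_A[4]        = { 1, 2, 3, 4 };
static const int8_t  demo_B[4]        = { 5, 6, 7, 8 };
static const int32_t demo_expected[4] = { 19, 22, 43, 50 };
static int32_t       demo_C[4];
```

On target, the same comparison would run results_match() on the buffer filled by accel_matmul_poll() and the one filled by ref_matmul().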
Polling vs Interrupt Comparison
| Aspect | Polling | Interrupt-Driven |
|---|---|---|
| CPU during wait | Spinning (wastes cycles) | WFI or doing other work |
| Latency | Very low (immediate detect) | IRQ latency (~10–50 cycles) |
| Code complexity | Simple loop | ISR registration, PLIC config |
| Best use case | Short ops (<10µs) | Long ops, overlapping work |
| Power | High (CPU active) | Low (CPU can sleep in WFI) |
Day 9 — Interview Questions
Q1. Why must MMIO pointers be declared volatile in C?
Without volatile, the C compiler assumes that a memory location's value doesn't change unless the current program writes to it. For STATUS registers, the hardware changes the value asynchronously: the done bit goes high when the accelerator finishes. If the compiler caches the first read (0x00 = not done) in a register and never re-reads the address, the polling loop becomes infinite. volatile forces every access written in the source to generate a real load or store to that exact address, so the compiler cannot eliminate it or reorder it relative to other volatile accesses. It does not constrain ordering against non-volatile memory accesses or reordering by the CPU itself; those need fences. This applies to both reads (status polling) and writes (register configuration).
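A minimal sketch of a bounded polling loop built on a volatile read. So that it can run on a host, the register is an ordinary variable (fake_status, a hypothetical stand-in) rather than a real MMIO address, and REG_READ and poll_done are illustrative names. The volatile cast is what obliges the compiler to issue one load per iteration.

```c
#include <stdint.h>

#define STATUS_DONE (1u << 1)

/* Same shape as MMIO_R in accel.h: one real load per use. */
#define REG_READ(addr) (*(volatile uint32_t *)(addr))

/* Hypothetical stand-in for the accelerator's STATUS register. */
static uint32_t fake_status;

/* Poll until DONE is set or max_polls reads have been issued.
   Returns the number of reads performed, or -1 if DONE never appeared. */
static int poll_done(uintptr_t status_addr, int max_polls)
{
    for (int i = 1; i <= max_polls; i++) {
        if (REG_READ(status_addr) & STATUS_DONE)
            return i;
    }
    return -1;
}
```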
Q2. What is the correct order of operations when setting up a DMA transfer?
The correct sequence is: (1) Allocate and fill the source buffers (A, B matrices), (2) cache_flush the source buffers to push CPU cache contents to DRAM so the DMA sees correct data, (3) cache_inval the destination buffer C to discard any stale CPU cache lines for C so the CPU doesn't read old data after the DMA writes, (4) Write the DMA addresses and sizes to the MMIO registers, (5) Start the accelerator. After completion: (6) The DMA has written C to DRAM, (7) The CPU reads C, and because its lines were invalidated the reads miss in the cache and fetch fresh data from DRAM. On cores that can prefetch speculatively, invalidating C again after completion is the safer choice, since lines may have been refetched while the DMA was running. Skipping step 2 causes the DMA to read stale data from DRAM. Skipping step 3 causes the CPU to read old cached data instead of the DMA results.
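The ordering of steps 2 to 5 can be sketched with stubs that only record which step ran. Everything here is hypothetical scaffolding (flush_stub, inval_stub, write_regs_stub, start_stub, dma_launch and the trace buffer); on target the stubs would be replaced by the real cache and MMIO primitives.

```c
#include <stddef.h>
#include <stdint.h>

/* One character per step performed, in order. */
static char   trace[16];
static size_t trace_len;

static void record(char step)
{
    if (trace_len < sizeof trace - 1)
        trace[trace_len++] = step;
}

/* Hypothetical stand-ins for the real primitives. */
static void flush_stub(const void *p, size_t n) { (void)p; (void)n; record('F'); }
static void inval_stub(void *p, size_t n)       { (void)p; (void)n; record('I'); }
static void write_regs_stub(const void *a, const void *b, void *c, int n)
{
    (void)a; (void)b; (void)c; (void)n;
    record('W');
}
static void start_stub(void) { record('S'); }

/* Steps 2-5: flush sources, invalidate destination, program registers, start. */
static void dma_launch(const int8_t *A, const int8_t *B, int32_t *C, int N)
{
    size_t ab_sz = (size_t)N * N * sizeof(int8_t);
    size_t c_sz  = (size_t)N * N * sizeof(int32_t);

    flush_stub(A, ab_sz);
    flush_stub(B, ab_sz);
    inval_stub(C, c_sz);
    write_regs_stub(A, B, C, N);
    start_stub();               /* must be the last step */
}

/* Tiny buffers so the sequence can be exercised on a host. */
static int8_t  seq_A[4], seq_B[4];
static int32_t seq_C[4];
```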
Q3. How do you register an interrupt handler for the accelerator on a RISC-V bare-metal system?
On RISC-V, external interrupts route through the PLIC (Platform Level Interrupt Controller). To register a handler: (1) Set a nonzero interrupt priority at the PLIC source register (PLIC_BASE + 4×irq_id), (2) Enable the interrupt at the PLIC hart enable register, (3) Set the PLIC threshold to 0 (allow all nonzero priorities), (4) Set the mie CSR bit for external interrupts (MEIE, bit 11), (5) Set mstatus.MIE=1 to enable global interrupts. When the accelerator asserts its IRQ line, the PLIC raises the hart's machine external interrupt and the CPU jumps to mtvec (trap vector). Your trap handler checks mcause (0x8000000B on RV32, 0x800000000000000B on RV64: interrupt bit set, cause 11 = machine external interrupt), reads the PLIC claim register to obtain the IRQ ID, and dispatches to accel_irq_handler(). After the handler, write the IRQ ID to the PLIC complete register.
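The register addresses involved can be sketched as small helpers. The offsets follow the memory map in the RISC-V PLIC specification; the base address, context number and IRQ ID are platform-specific and must come from your SoC documentation. The helper names and enable_machine_external_irq() are hypothetical, and the CSR part only compiles for RISC-V targets.

```c
#include <stdint.h>

/* Offsets from the RISC-V PLIC specification's memory map. */
#define PLIC_PRIORITY_OFF   0x000000u
#define PLIC_ENABLE_OFF     0x002000u
#define PLIC_ENABLE_STRIDE  0x80u       /* per context */
#define PLIC_CONTEXT_OFF    0x200000u
#define PLIC_CONTEXT_STRIDE 0x1000u     /* per context */

/* Step 1: priority register for one interrupt source. */
static uintptr_t plic_priority_addr(uintptr_t base, uint32_t irq)
{
    return base + PLIC_PRIORITY_OFF + 4u * irq;
}

/* Step 2: enable word and bit for one source in one context. */
static uintptr_t plic_enable_addr(uintptr_t base, uint32_t ctx, uint32_t irq)
{
    return base + PLIC_ENABLE_OFF + PLIC_ENABLE_STRIDE * ctx + 4u * (irq / 32u);
}

static uint32_t plic_enable_bit(uint32_t irq)
{
    return 1u << (irq % 32u);
}

/* Step 3: priority threshold for one context. */
static uintptr_t plic_threshold_addr(uintptr_t base, uint32_t ctx)
{
    return base + PLIC_CONTEXT_OFF + PLIC_CONTEXT_STRIDE * ctx;
}

/* Read to claim, write the same ID back to complete. */
static uintptr_t plic_claim_complete_addr(uintptr_t base, uint32_t ctx)
{
    return plic_threshold_addr(base, ctx) + 4u;
}

#if defined(__riscv)
/* Steps 4-5: set mie.MEIE (bit 11), then mstatus.MIE (bit 3). */
static inline void enable_machine_external_irq(void)
{
    unsigned long meie = 1ul << 11;
    __asm__ volatile("csrs mie, %0" :: "r"(meie));
    __asm__ volatile("csrsi mstatus, 8");
}
#endif
```

With these, registration amounts to volatile 32-bit writes to the computed addresses (for example via MMIO_W from accel.h), followed by enable_machine_external_irq().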
Q4. What is buffer alignment and why does it matter for DMA?
DMA engines typically require source and destination buffers to be aligned to a boundary set by the bus or cache (e.g., the AXI data width, a 64-byte cache line, or a 128-byte burst boundary). Misaligned buffers force the DMA to issue partial-beat transactions at the start and end, reducing efficiency and sometimes causing protocol violations. Cache maintenance also works on whole lines, so a buffer that shares a line with unrelated data can have that data flushed or invalidated along with it. In bare-metal C, declare buffers with __attribute__((aligned(64))) (the ALIGN64 macro in accel.h) or C11 _Alignas(64); posix_memalign() is only an option on hosted systems that provide a libc. Also ensure tile address arithmetic produces aligned addresses: if the base is 64-byte aligned and each tile is a multiple of 64 bytes in size, all tile addresses are automatically aligned. Misalignment bugs are often silent: the DMA may corrupt data at the boundary without raising an error flag.
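A minimal sketch of a statically allocated, cache-line-aligned tile buffer with an alignment check. TILE_BYTES and NUM_TILES are illustrative values, and is_aligned, tile_ptr and all_tiles_aligned are hypothetical helper names. The attribute syntax assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64
#define TILE_BYTES 256   /* illustrative; a multiple of CACHE_LINE */
#define NUM_TILES  4

/* Static storage: no malloc needed, base aligned to one cache line. */
static int8_t tiles[NUM_TILES * TILE_BYTES] __attribute__((aligned(CACHE_LINE)));

/* align must be a power of two. */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

static const int8_t *tile_ptr(size_t i)
{
    return tiles + i * TILE_BYTES;
}

/* Aligned base plus line-multiple tile size keeps every tile aligned. */
static int all_tiles_aligned(void)
{
    for (size_t i = 0; i < NUM_TILES; i++)
        if (!is_aligned(tile_ptr(i), CACHE_LINE))
            return 0;
    return 1;
}
```

A check like is_aligned() is cheap enough to place at the top of the driver's entry points as a debug assertion.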
Q5. What is a memory barrier and when do you need one in a driver?
A memory barrier (fence) prevents the CPU and compiler from reordering memory accesses across the barrier point. In RISC-V assembly, fence rw,rw orders all prior memory reads and writes before any subsequent ones; ordering against device I/O accesses additionally needs the i and o bits (e.g., fence iorw,iorw). In bare-metal drivers, barriers are needed: (1) Before writing the START bit, since all configuration registers must be written and visible to hardware before the start, (2) After the done bit is read, since result data must be fetched after the done flag, not speculatively, (3) Around cache flush/inval operations. In C, use __asm__ volatile("fence rw,rw" ::: "memory") for a hardware fence, or __asm__ volatile("" ::: "memory") for a compiler-only fence. Using volatile on MMIO macros handles most cases, but DMA completion barriers between MMIO and memory accesses may still need explicit fences.
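Cases (1) and (2) can be sketched as follows. The sketch deliberately uses the conservative full fence on RISC-V and falls back to a compiler-only barrier elsewhere so it still builds on a host. IO_FENCE, start_after_config, read_result_if_done and the fake_* variables are hypothetical names standing in for the real registers and result buffer.

```c
#include <stdint.h>

#if defined(__riscv)
/* Full fence over memory and device I/O; narrower forms are possible
   once the exact ordering requirement is known. */
#define IO_FENCE() __asm__ volatile("fence iorw,iorw" ::: "memory")
#else
/* Host fallback: compiler barrier only. */
#define IO_FENCE() __asm__ volatile("" ::: "memory")
#endif

#define CTRL_START  (1u << 0)
#define STATUS_DONE (1u << 1)

/* Hypothetical stand-ins for CTRL, STATUS and one word of the result. */
static uint32_t fake_ctrl, fake_status;
static int32_t  fake_result, fake_out;

/* Case 1: everything written so far is ordered before the START write. */
static void start_after_config(volatile uint32_t *ctrl)
{
    IO_FENCE();
    *ctrl = CTRL_START;
}

/* Case 2: the result load may not pass the status read.
   Returns 1 and stores the result word if DONE is set, else 0. */
static int read_result_if_done(const volatile uint32_t *status,
                               const int32_t *result, int32_t *out)
{
    if (!(*status & STATUS_DONE))
        return 0;
    IO_FENCE();
    *out = *result;
    return 1;
}
```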
Q6. How do you implement a timeout for accelerator polling in bare-metal C?
Read the RISC-V time CSR (0xC01) before the poll loop. It mirrors the platform's real-time counter (mtime, typically provided by the CLINT), ticking at a fixed frequency (e.g., 1 MHz or 10 MHz). Convert timeout_ms to ticks: ticks = timeout_ms × (freq / 1000). In the poll loop, re-read the time CSR and compare (now - start) >= ticks. If true, exit with an error code. This avoids depending on the cycle counter, whose rate follows the core clock and so changes with frequency scaling or clock gating, and instead uses real wall time. On RV32 the counter is split across time and timeh, so read both halves. Always implement a timeout in production drivers: a stuck accelerator (hardware bug, protocol deadlock, power issue) would otherwise hang the CPU indefinitely in a pure poll loop.
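The tick conversion and elapsed-time comparison can be sketched with the time source passed in as a function pointer, so the logic can be exercised on a host. On target, the time_fn would wrap a read of the time CSR. ms_to_ticks, wait_flag, fake_now and fake_done are hypothetical names, and fake_now simply advances 100 ticks per call.

```c
#include <stdint.h>

typedef uint64_t (*time_fn)(void);

/* Multiply before dividing so frequencies that are not a multiple
   of 1000 Hz do not lose precision. */
static uint64_t ms_to_ticks(uint32_t timeout_ms, uint64_t freq_hz)
{
    return (uint64_t)timeout_ms * freq_hz / 1000u;
}

/* Returns 0 once *flag is nonzero, or -1 after timeout_ms has elapsed.
   Unsigned subtraction keeps the comparison valid across wraparound. */
static int wait_flag(const volatile int *flag, uint32_t timeout_ms,
                     uint64_t freq_hz, time_fn now)
{
    uint64_t start = now();
    uint64_t ticks = ms_to_ticks(timeout_ms, freq_hz);

    for (;;) {
        if (*flag)
            return 0;
        if (now() - start >= ticks)
            return -1;
    }
}

/* Host stand-in for the time CSR: advances 100 ticks per call. */
static uint64_t fake_now(void)
{
    static uint64_t t;
    t += 100;
    return t;
}

/* Host stand-in for the flag set by the ISR. */
static volatile int fake_done;
```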