RISC-V + Accelerator · Day 15 of 15 · FINAL

Capstone — RISC-V AI SoC
End-to-End INT8 Inference on FPGA

By EcrioniX · Updated June 2026 · ~60 min read

Course Complete

You've built a fully functioning RISC-V AI SoC from first principles — RTL, software driver, FPGA, and physical design. This final day brings it all together with a live benchmark.

What We Built — The Complete Stack

Day | Topic | Key Deliverable
1–3 | RISC-V Accelerator Fundamentals | MMIO, custom ISA, RoCC protocol
4–5 | Systolic Array Design + RoCC Integration | PE Verilog, LazyRoCC Scala, C driver
6–8 | Memory, AXI4, Profiling | DMA, tiling, CSR counters
9–10 | Software Driver + SoC Integration | Bare-metal C driver, full SoC RTL
11–12 | Optimisation + Verification | Double-buffering, SVA testbench
13–14 | FPGA + Physical Design | Arty A7 bitstream, power domain UPF
15 | Capstone | End-to-end INT8 inference + benchmark

Capstone: 3-Layer INT8 MLP Inference

We run a quantised 3-layer MLP (digit classifier, MNIST-style): input 784 → hidden 256 → hidden 128 → output 10. All weight matrices are INT8. The CPU handles ReLU and bias-add; the systolic array handles all matrix multiplies.

C — Complete INT8 MLP inference pipeline
#include <stdint.h>
#include "accel.h"
#include "weights.h"   // auto-generated INT8 weight arrays

// Add the per-output bias, then apply ReLU in place.
static void relu_bias(int32_t *y, const int32_t *bias, int n) {
    for (int i = 0; i < n; i++) {
        y[i] += bias[i];
        if (y[i] < 0) y[i] = 0;
    }
}

// Scale INT32 accumulators back to INT8, saturating to [-128, 127].
static void requantise(int32_t *in, int8_t *out, int n, float scale) {
    for (int i = 0; i < n; i++) {
        int32_t v = (int32_t)(in[i] * scale);   // truncates toward zero
        out[i] = (int8_t)(v > 127 ? 127 : v < -128 ? -128 : v);
    }
}

int mlp_infer(const int8_t *input) {   // returns predicted class 0-9
    static ALIGN64 int8_t  act0[784], act1[256], act2[128];
    static ALIGN64 int32_t out0[256], out1[128], out2[10];

    // Stage the input in the aligned buffer read by the DMA
    for (int i = 0; i < 784; i++) act0[i] = input[i];

    // Layer 1: 256 × 784 weight matrix
    accel_matmul_poll(W1, act0, out0, 256);   // out0 = W1 × act0
    relu_bias(out0, b1, 256);
    requantise(out0, act1, 256, scale1);

    // Layer 2: 128 × 256
    accel_matmul_poll(W2, act1, out1, 128);
    relu_bias(out1, b2, 128);
    requantise(out1, act2, 128, scale2);

    // Layer 3: 10 × 128 (output logits)
    accel_matmul_poll(W3, act2, out2, 10);
    relu_bias(out2, b3, 10);   // ReLU here only affects argmax if every logit is <= 0

    // Argmax
    int best = 0;
    for (int i = 1; i < 10; i++)
        if (out2[i] > out2[best]) best = i;
    return best;
}
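To check mlp_infer() on a host machine without the accelerator, a plain C reference model can stand in for the hardware matrix multiply. The sketch below assumes row-major weights and a matrix-vector product with INT32 accumulation. The name ref_matvec_i8 and its explicit rows/cols arguments are hypothetical and are not part of accel.h.

```c
#include <stdint.h>

/* Hypothetical host-side reference: out = W x in, with W stored row-major
 * as rows x cols INT8 values and each dot product accumulated in INT32. */
static void ref_matvec_i8(const int8_t *W, const int8_t *in,
                          int32_t *out, int rows, int cols)
{
    for (int r = 0; r < rows; r++) {
        int32_t acc = 0;
        for (int c = 0; c < cols; c++) {
            /* Each INT8 x INT8 product fits easily in 32 bits. */
            acc += (int32_t)W[r * cols + c] * (int32_t)in[c];
        }
        out[r] = acc;
    }
}
```

Comparing the accelerator's INT32 outputs against this model layer by layer is one way to separate hardware bugs from quantisation errors.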

Benchmark Results (Arty A7, 50 MHz)

Implementation | Cycles (1 inference) | Time @ 50 MHz | Speedup
CPU only (C triple-loop) | ~18,400,000 | ~368 ms | 1× (baseline)
Accelerator (no double-buffer) | ~820,000 | ~16.4 ms | 22×
Accelerator + double-buffer | ~480,000 | ~9.6 ms | 38×
Accelerator + INT8 + DB | ~120,000 | ~2.4 ms | 153×
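The time and speedup columns follow directly from the cycle counts and the 50 MHz clock. A minimal sketch of that arithmetic, with hypothetical helper names:

```c
#include <stdint.h>

/* Milliseconds taken by `cycles` clock cycles at a clock of `clk_hz`. */
static double cycles_to_ms(uint64_t cycles, double clk_hz)
{
    return (double)cycles / clk_hz * 1e3;
}

/* Speedup of an accelerated run relative to the baseline cycle count. */
static double speedup(uint64_t baseline_cycles, uint64_t accel_cycles)
{
    return (double)baseline_cycles / (double)accel_cycles;
}
```

For example, speedup(18400000, 820000) is roughly 22.4, which the table rounds to 22×.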

Career Path: Accelerator Design Engineer

RTL Design Engineer

Design systolic arrays, DMA engines, custom datapaths in Verilog/VHDL. Entry roles at SiFive, Qualcomm, Intel, Arm.

AI Accelerator Engineer

NPU, matrix engine, vector unit design. High demand at NVIDIA, Apple (ANE), AMD, Tenstorrent, Groq, Cerebras.

CPU Microarch Engineer

RISC-V pipeline, OOO, branch prediction. At Ventana, Esperanto, MIPS/Wave Computing, IBM, SambaNova.

Verification Engineer

UVM, SVA, formal for accelerator IPs. Growing demand as chip complexity increases. High pay, clear career ladder.

Physical Design Engineer

Floorplan, CTS, timing closure for AI chips. 7nm/5nm/3nm experience commands top salaries ($200K+ US).

RISC-V Startup

Full-stack ownership. Build your own RISC-V SoC with accelerator, tape it out, ship silicon. High risk, high reward.

Capstone — Interview Questions

Q1. Walk me through the complete data flow of an INT8 MLP inference on your RISC-V AI SoC.
The CPU loads the quantised input vector (INT8, size 784) into aligned SRAM. For each layer: (1) The CPU writes the weight matrix base address, activation address, output address, and matrix size to the accelerator's MMIO registers, (2) Asserts START — the DMA reads the weight and activation tiles from SRAM into the scratchpad double-buffer, (3) The systolic array streams activations through the PE grid, accumulating INT32 partial sums, (4) On completion (STATUS_DONE=1), the CPU reads the INT32 output buffer, adds INT32 bias, applies ReLU (max(0,x)), and requantises back to INT8 using the layer scale factor, (5) Repeat for layers 2 and 3. After the final layer, the CPU computes argmax on 10 INT32 logits and returns the predicted class. Total flow: CPU orchestrates, accelerator computes, DMA moves data, CPU handles nonlinearities and quantisation.
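Steps (1), (2), and (4) of the answer can be sketched as a register-programming routine. The register layout, field names, and bit positions below are hypothetical; the real definitions come from the course's accel.h. The sketch also assumes that writing START clears a stale DONE flag from the previous layer.

```c
#include <stdint.h>

/* Hypothetical register block; real offsets are defined by the SoC. */
typedef struct {
    volatile uint32_t weight_addr;
    volatile uint32_t act_addr;
    volatile uint32_t out_addr;
    volatile uint32_t size;
    volatile uint32_t ctrl;     /* bit 0 = START */
    volatile uint32_t status;   /* bit 0 = DONE  */
} accel_regs_t;

#define CTRL_START  (1u << 0)
#define STATUS_DONE (1u << 0)

/* Program one layer, start it, and poll until DONE is set.
 * Returns the number of polls so the caller can profile the wait. */
static uint32_t accel_run_layer(accel_regs_t *regs, uint32_t w, uint32_t a,
                                uint32_t o, uint32_t n)
{
    regs->weight_addr = w;
    regs->act_addr    = a;
    regs->out_addr    = o;
    regs->size        = n;
    regs->ctrl        = CTRL_START;   /* written last: kicks off the DMA */

    uint32_t polls = 0;
    while ((regs->status & STATUS_DONE) == 0)
        polls++;
    return polls;
}
```

On the target, regs would point at the accelerator's MMIO base address. On a host, the same function can be exercised against an ordinary struct in memory.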
Q2. Why did we get only 22× speedup (no double-buffer) when the systolic array is theoretically much faster?
Amdahl's Law applies: even if the matrix multiply is 100× faster, overhead limits total speedup. The unaccelerated portions include: (1) DMA setup overhead (MMIO writes, cache flush), ~500 cycles per tile, (2) ReLU and requantise, CPU-only, ~3N operations, (3) Tiling loop overhead: N²/P² tile iterations with prologue/epilogue costs, (4) DMA stall time, when the array finishes before the DMA loads the next tile (memory-bound scenario on FPGA). With double-buffering, the 22× became 38× by hiding DMA latency behind compute. The further jump to 153× came from INT8 quantisation: 4× smaller data means 4× higher arithmetic intensity (same number of operations, 4× less memory traffic), moving the design from memory-bound to compute-bound on the roofline.
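Amdahl's Law is easy to evaluate numerically. The sketch below uses illustrative values for the accelerated fraction f and the kernel speedup s; they are not measurements from this SoC.

```c
/* Amdahl's Law: overall speedup when a fraction f of the runtime
 * (0 <= f <= 1) is accelerated by a factor s and the rest is unchanged. */
static double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

/* Upper bound on the overall speedup as s grows without limit. */
static double amdahl_limit(double f)
{
    return 1.0 / (1.0 - f);
}
```

With f = 0.99 and s = 100 the overall speedup is only about 50×, and no kernel speedup can push it past 100×, which is why the CPU-side work matters so much.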
Q3. How does requantisation work between layers and why is it necessary?
After INT8 × INT8 matrix multiply, the accumulator holds INT32 values (up to 8-bit × 8-bit × N accumulations = 16 + log2(N) bits). Before feeding these into the next layer's INT8 multiplier, we must convert back to INT8. Requantisation: (1) Add the INT32 bias (pre-scaled to INT32 range), (2) Apply ReLU (clamp to ≥0), (3) Multiply by the inter-layer scale factor (a small float computed during calibration, e.g., 0.0039), (4) Round and saturate to INT8 range [−128, 127]. The scale factor corrects for the accumulated magnitude: if layer 1 input had scale S1 and weights had scale S2, the output has scale S1×S2 and must be rescaled to the next layer's expected input scale S3. This scale chain is calibrated during post-training quantisation using the calibration dataset.
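The four requantisation steps can be sketched for a single accumulator value. This version rounds to nearest, as the answer describes, instead of truncating. The function name is hypothetical, and the sketch assumes a positive scale and that acc + bias does not overflow INT32.

```c
#include <stdint.h>

/* Requantise one INT32 accumulator to INT8:
 * add bias, apply ReLU, scale, round to nearest, saturate. */
static int8_t requant_one(int32_t acc, int32_t bias, float scale)
{
    int32_t v = acc + bias;               /* (1) add INT32 bias      */
    if (v < 0) v = 0;                     /* (2) ReLU                */
    float scaled = (float)v * scale;      /* (3) apply layer scale   */
    if (scaled >= 127.0f) return 127;     /* (4) saturate before the cast */
    /* After ReLU the value is non-negative, so adding 0.5 and truncating
     * rounds to nearest, and the lower bound of -128 can never be hit. */
    return (int8_t)(int32_t)(scaled + 0.5f);
}
```

With the example scale of 0.0039, an accumulator of 1000 plus a bias of 24 gives 1024 × 0.0039 ≈ 3.99, which rounds to 4.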
Q4. How would you scale this design from a 4×4 to a 128×128 systolic array?
Three challenges scale with array size: (1) Area and power: a 128×128 array has 16,384 PEs vs 16. At ~200 gates per multiplier, the multipliers alone come to ~3.3M gates before accumulators and pipeline registers are counted, which calls for an advanced process node (7nm or smaller) and aggressive power gating per row/column, (2) Data bandwidth: if every MAC fetched both of its INT8 operands from memory, the feed rate would be P² MACs/cycle × 2 bytes = 128² × 2 = 32,768 bytes/cycle, or about 32.8 TB/s at 1 GHz. HBM (High Bandwidth Memory) delivers 1–2 TB/s in practice, so the design is memory-bound unless operands are reused heavily inside the array and in a large scratchpad (MBs, not KBs), (3) Clock tree: the registers in 16,384 PEs need an extremely well-balanced CTS, with clock gating per row to save power when rows are inactive during fill/drain. In practice, Google's TPU v2 and v3 use 128×128 systolic arrays backed by HBM and large on-chip memory, and TPU v1 used a 256×256 array. NVIDIA's Tensor Cores take a different route, replicating many much smaller matrix units across the GPU.
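The scaling figures in this answer can be reproduced with a few helpers. The names are hypothetical, the ~200 gates per multiplier is the estimate from the text, and the edge-traffic figure assumes an output-stationary dataflow in which P activations and P weights enter the array each cycle.

```c
#include <stdint.h>

/* Number of processing elements in a P x P array. */
static uint64_t pe_count(uint64_t p) { return p * p; }

/* Multiplier gate estimate, using the text's ~200 gates per multiplier. */
static uint64_t multiplier_gates(uint64_t p) { return pe_count(p) * 200u; }

/* Operand traffic in bytes per cycle if every MAC fetched both of its
 * INT8 inputs from memory (no reuse inside the array). */
static uint64_t naive_bytes_per_cycle(uint64_t p) { return pe_count(p) * 2u; }

/* Operand traffic at the array edges when inputs are reused as they flow
 * through the grid: P activations plus P weights per cycle (assumed
 * output-stationary dataflow). */
static uint64_t edge_bytes_per_cycle(uint64_t p) { return 2u * p; }
```

For P = 128 the naive figure is 32,768 bytes per cycle while the edge figure is 256, a factor of P apart, which is the reuse a systolic array exists to provide.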
Q5. If you had to add a second accelerator (e.g., a vector unit for ReLU), how would you integrate it?
Add a second AXI4-Lite slave at a new MMIO address (e.g., 0x6100_0000) in the crossbar. The vector unit takes a pointer to the INT32 output buffer, a pointer to the INT32 bias array, the length N, and a scale factor. It reads via AXI4 master, applies bias+ReLU+requantise in a single pass (1 cycle per element on a SIMD-width vector), and writes INT8 back to the activation buffer. This removes the CPU bottleneck for large layers (at N=4096, CPU requantise takes ~4096 × 5 = 20,480 cycles; vector unit at 16-wide SIMD = 256 cycles). Integration steps: (1) Add RTL, (2) Add MMIO address to crossbar, (3) Add boot driver functions, (4) Modify mlp_infer() to call accel_relu_requant() after each accel_matmul_poll(), (5) Add SVA property: relu_out[i] must always be ≥ 0.
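The cycle comparison in this answer is simple to reproduce. The sketch below uses the text's estimate of about 5 CPU cycles per element and one SIMD group per cycle for the vector unit, and it ignores AXI latency and setup cost. The helper names are hypothetical.

```c
#include <stdint.h>

/* CPU estimate from the text: about 5 cycles per element. */
static uint64_t cpu_requant_cycles(uint64_t n) { return n * 5u; }

/* Vector unit estimate: one SIMD-wide group per cycle, rounded up so a
 * partial final group still costs a full cycle. */
static uint64_t vec_requant_cycles(uint64_t n, uint64_t simd_width)
{
    return (n + simd_width - 1u) / simd_width;
}
```

At N = 4096 and 16-wide SIMD this gives 20,480 cycles against 256, an 80× reduction in the post-processing step.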
Q6. What would you change if this design had to pass tapeout at TSMC 7nm?
Key changes for real silicon tapeout: (1) Replace FPGA-specific block RAMs with foundry SRAM compilers (Arm Artisan, Synopsys DesignWare) and redo placement, (2) Add full UPF/CPF power intent with level shifters for VDD_ACCEL ≠ VDD_AON, (3) Add ESD protection rings and IO pads (pad frame), (4) Run DRC (Design Rule Check), LVS (Layout vs Schematic), and antenna check — TSMC rules are hundreds of pages; wire routing must be sign-off clean, (5) Add redundancy: column repair for the SRAM (one or two redundant columns to fix stuck bits found post-silicon), (6) Add MBIST for all SRAMs and scan chains for all logic (DFT), (7) Add electromigration signoff on all power rails (reference: EM calculator at ecrionix.org/em-calculator/), (8) Thermal analysis — 7nm at 1 GHz may exceed 0.5W/mm²; require thermal vias and package selection, (9) Multiple PVT corners: TT/FF/SS at −40°C to 125°C must all pass timing. The full tapeout cycle at TSMC 7nm: RTL-to-GDS takes 6–12 months and costs $3–10M for the mask set.