You've built a fully functioning RISC-V AI SoC from first principles — RTL, software driver, FPGA, and physical design. This final day brings it all together with a live benchmark.
What We Built — The Complete Stack
Day | Topic | Key Deliverable
1–3 | RISC-V Accelerator Fundamentals | MMIO, custom ISA, RoCC protocol
4–5 | Systolic Array Design + RoCC Integration | PE Verilog, LazyRoCC Scala, C driver
6–8 | Memory, AXI4, Profiling | DMA, tiling, CSR counters
9–10 | Software Driver + SoC Integration | Bare-metal C driver, full SoC RTL
11–12 | Optimisation + Verification | Double-buffering, SVA testbench
13–14 | FPGA + Physical Design | Arty A7 bitstream, power domain UPF
15 | Capstone | End-to-end INT8 inference + benchmark
Capstone: 3-Layer INT8 MLP Inference
We run a quantised 3-layer MLP (digit classifier, MNIST-style): input 784 → hidden 256 → hidden 128 → output 10. All weight matrices are INT8. The CPU handles ReLU and bias-add; the systolic array handles all matrix multiplies.
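As a quick sanity check on the workload size, the layer widths above translate directly into multiply-accumulate (MAC) counts. The following is a minimal sketch; the names `MLP_DIMS`, `layer_macs`, and `mlp_total_macs` are illustrative and not part of the course driver.

```c
#include <stddef.h>
#include <stdint.h>

/* Layer widths from the text: 784 -> 256 -> 128 -> 10. */
static const uint32_t MLP_DIMS[] = {784, 256, 128, 10};
#define MLP_NUM_LAYERS 3

/* MACs for one fully connected layer: one multiply-accumulate per weight. */
uint32_t layer_macs(size_t layer)
{
    return MLP_DIMS[layer] * MLP_DIMS[layer + 1];
}

/* Total MACs for one inference; with INT8 weights this is also the
 * number of weight bytes that must be streamed to the array. */
uint32_t mlp_total_macs(void)
{
    uint32_t total = 0;
    for (size_t i = 0; i < MLP_NUM_LAYERS; i++)
        total += layer_macs(i);
    return total;
}
```

The first layer (784 × 256 = 200,704 MACs) dominates the total of 234,752, so it is the layer where tiling and DMA behaviour matter most.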
UVM, SVA, formal for accelerator IPs. Growing demand as chip complexity increases. High pay, clear career ladder.
Physical Design Engineer
Floorplan, CTS, timing closure for AI chips. 7nm/5nm/3nm experience commands top salaries ($200K+ US).
RISC-V Startup
Full-stack ownership. Build your own RISC-V SoC with accelerator, tape it out, ship silicon. High risk, high reward.
Capstone — Interview Questions
Q1. Walk me through the complete data flow of an INT8 MLP inference on your RISC-V AI SoC.
The CPU loads the quantised input vector (INT8, size 784) into aligned SRAM. For each layer: (1) The CPU writes the weight matrix base address, activation address, output address, and matrix size to the accelerator's MMIO registers, (2) The CPU asserts START, and the DMA reads the weight and activation tiles from SRAM into the scratchpad double-buffer, (3) The systolic array streams activations through the PE grid, accumulating INT32 partial sums, (4) On completion (STATUS_DONE=1), the CPU reads the INT32 output buffer, adds INT32 bias, applies ReLU (max(0,x)), and requantises back to INT8 using the layer scale factor, (5) Repeat for layers 2 and 3, except that the final layer's outputs stay as INT32 logits instead of being requantised. After the final layer, the CPU computes argmax on the 10 INT32 logits and returns the predicted class. Total flow: CPU orchestrates, accelerator computes, DMA moves data, CPU handles nonlinearities and quantisation.
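The flow above can be sketched as a host-side reference model, which is handy for checking accelerator results bit for bit. This is a sketch under stated assumptions: `matmul_int8_ref` stands in for the accelerator and DMA (steps 1 to 3), the `layer_t` struct and all function names are hypothetical, and the final layer is assumed to keep its biased INT32 outputs as logits without ReLU.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_WIDTH 784  /* widest vector in the 784-256-128-10 MLP */

/* One fully connected layer; field names are illustrative. */
typedef struct {
    const int8_t  *w;     /* weights, row-major [out_dim][in_dim] */
    const int32_t *bias;  /* INT32 bias, one per output */
    size_t in_dim, out_dim;
    float scale;          /* requantisation scale for this layer's output */
} layer_t;

/* Host stand-in for the accelerator: INT8 x INT8 -> INT32 accumulate. */
void matmul_int8_ref(const layer_t *l, const int8_t *x, int32_t *acc)
{
    for (size_t o = 0; o < l->out_dim; o++) {
        int32_t sum = 0;
        for (size_t i = 0; i < l->in_dim; i++)
            sum += (int32_t)l->w[o * l->in_dim + i] * (int32_t)x[i];
        acc[o] = sum;
    }
}

/* CPU-side step (4): ReLU, scale, round to nearest, saturate to INT8. */
int8_t relu_requant(int32_t v, float scale)
{
    if (v < 0) v = 0;
    float scaled = (float)v * scale + 0.5f;  /* v >= 0, so +0.5 rounds */
    return scaled >= 127.0f ? 127 : (int8_t)scaled;
}

/* Runs all layers (each dimension must be <= MAX_WIDTH) and returns
 * the argmax of the final INT32 logits. */
int mlp_infer_ref(const layer_t *layers, size_t n_layers, const int8_t *input)
{
    int8_t  act[MAX_WIDTH];
    int32_t acc[MAX_WIDTH];

    for (size_t i = 0; i < layers[0].in_dim; i++)
        act[i] = input[i];

    for (size_t l = 0; l < n_layers; l++) {
        matmul_int8_ref(&layers[l], act, acc);
        for (size_t o = 0; o < layers[l].out_dim; o++) {
            acc[o] += layers[l].bias[o];
            if (l + 1 < n_layers)  /* hidden layers only (assumption) */
                act[o] = relu_requant(acc[o], layers[l].scale);
        }
    }

    const layer_t *last = &layers[n_layers - 1];
    int best = 0;
    for (size_t o = 1; o < last->out_dim; o++)
        if (acc[o] > acc[best])
            best = (int)o;
    return best;
}
```

On the real SoC, the call to `matmul_int8_ref` would be replaced by the MMIO writes, START, and STATUS_DONE polling described in steps (1) to (4); the surrounding CPU-side loop keeps the same shape.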
Q2. Why did we get only 38× speedup (with double-buffering) when the systolic array is theoretically much faster?
Amdahl's Law applies: even if the matrix multiply is 100× faster, the unaccelerated overhead limits total speedup. The unaccelerated portions include: (1) DMA setup overhead (MMIO writes, cache flush), ~500 cycles per tile, (2) ReLU and requantise, CPU-only, ~3N operations, (3) Tiling loop overhead: N²/P² tile iterations with prologue/epilogue costs, (4) DMA stall time, when the array finishes before the DMA loads the next tile (memory-bound scenario on FPGA). Double-buffering raised the speedup from 22× to 38× by hiding DMA latency behind compute. The further jump to 153× came from INT8 quantisation: data 4× smaller than 32-bit values means 4× higher arithmetic intensity (same operation count, 4× less memory traffic), moving the design from memory-bound to compute-bound on the roofline.
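Amdahl's Law can be written as speedup = 1 / ((1 − f) + f / s), where f is the fraction of baseline runtime that is accelerated and s is the acceleration factor. A minimal sketch follows; the function names are illustrative, and any fractions plugged in are hypothetical rather than measured on this SoC.

```c
/* Amdahl's Law: overall speedup when a fraction f of the baseline
 * runtime is accelerated by a factor s and the rest is unchanged. */
double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

/* Ceiling as s grows without bound: the unaccelerated fraction
 * alone sets the limit. */
double amdahl_limit(double f)
{
    return 1.0 / (1.0 - f);
}
```

For example, if 99% of runtime is matrix multiply and it runs 100× faster, `amdahl_speedup(0.99, 100.0)` gives only about 50×, and no accelerator speed can push past `amdahl_limit(0.99)`, roughly 100×.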
Q3. How does requantisation work between layers and why is it necessary?
After INT8 × INT8 matrix multiply, the accumulator holds INT32 values: each 8-bit × 8-bit product needs 16 bits, and summing N of them adds about log2(N) more, so 16 + log2(N) bits in total. Before feeding these into the next layer's INT8 multiplier, we must convert back to INT8. Requantisation: (1) Add the INT32 bias (pre-scaled to INT32 range), (2) Apply ReLU (clamp to ≥0), (3) Multiply by the inter-layer scale factor (a small float computed during calibration, e.g., 0.0039), (4) Round and saturate to INT8 range [−128, 127]. The scale factor corrects for the accumulated magnitude: if layer 1 input had scale S1 and weights had scale S2, the output has scale S1×S2 and must be rescaled to the next layer's expected input scale S3, so the multiplier is S1×S2/S3. This scale chain is calibrated during post-training quantisation using the calibration dataset.
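The four requantisation steps and the scale chain can be sketched in a few lines of C. This is a minimal illustration using a float multiplier as the text describes (production kernels often use a fixed-point multiplier and shift instead); the function names are hypothetical, and the multiplier is assumed to be positive.

```c
#include <stdint.h>

/* Combined rescale factor: accumulator scale (S1 * S2) divided by the
 * next layer's expected input scale S3. */
float requant_multiplier(float s_in, float s_w, float s_next)
{
    return (s_in * s_w) / s_next;
}

/* Worst-case accumulator width: 16-bit products plus ceil(log2(n))
 * bits of growth from summing n of them. */
unsigned acc_bits(unsigned n)
{
    unsigned growth = 0;
    while (growth < 31 && (1u << growth) < n)
        growth++;
    return 16 + growth;
}

/* Steps (1)-(4): bias add, ReLU, scale, round and saturate. */
int8_t requantise(int32_t acc, int32_t bias, float m)
{
    int64_t v = (int64_t)acc + bias;   /* (1) add bias without overflow */
    if (v < 0)
        v = 0;                         /* (2) ReLU */
    float scaled = (float)v * m;       /* (3) apply scale factor */
    if (scaled >= 127.0f)
        return 127;                    /* (4) saturate at INT8 max */
    return (int8_t)(scaled + 0.5f);    /* (4) round; value is >= 0 */
}
```

Because ReLU runs before scaling, the result always lands in [0, 127], so only the upper saturation bound is ever exercised here. For the 784-input first layer, `acc_bits(784)` gives 26 bits, comfortably inside an INT32 accumulator.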
Q4. How would you scale this design from a 4×4 to a 128×128 systolic array?
Three challenges scale with array size: (1) Area and power: a 128×128 array has 16,384 PEs vs 16; each PE has a multiplier (~200 gates) plus accumulator, so the multipliers alone total ~3.3M gates, requiring an advanced process node (7nm or smaller) and aggressive power gating per row/column, (2) Data bandwidth: if every MAC fetched both operands from memory, the feed rate would be P² MACs/cycle × 2 bytes/MAC = 128² × 2 = 32,768 bytes/cycle, which at 1 GHz is about 32.8 TB/s. Even HBM (High Bandwidth Memory) delivers only 1–2 TB/s in practice, so the design is memory-bound unless a large scratchpad (MBs, not KBs) maximises data reuse, (3) Clock tree: the registers in 16,384 PEs need extremely balanced CTS with clock gating per row (save power when rows are inactive during fill/drain). In practice, Google's TPU uses 128×128 or larger systolic arrays (256×256 in the first generation), while NVIDIA's Tensor Cores replicate many much smaller matrix units; large-array designs rely on these same techniques: HBM, large on-chip SRAM, aggressive power gating.
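The area and bandwidth arithmetic above is easy to reproduce for any array size. The following sketch uses the text's figures (~200 gates per multiplier, 2 operand bytes per MAC, no data reuse); the struct and function names are illustrative only, and the gate count covers multipliers but not accumulators or control logic.

```c
#include <stdint.h>

typedef struct {
    uint64_t pes;             /* P * P processing elements */
    uint64_t mult_gates;      /* multiplier gates only */
    uint64_t bytes_per_cycle; /* no-reuse operand traffic, 2 bytes per MAC */
    double   tb_per_s;        /* same traffic at clock_hz, in 1e12 bytes/s */
} array_budget_t;

/* First-order budget for a P x P systolic array. */
array_budget_t array_budget(uint32_t p, uint32_t gates_per_mult, double clock_hz)
{
    array_budget_t b;
    b.pes = (uint64_t)p * p;
    b.mult_gates = b.pes * gates_per_mult;
    b.bytes_per_cycle = b.pes * 2;
    b.tb_per_s = (double)b.bytes_per_cycle * clock_hz / 1e12;
    return b;
}
```

Calling `array_budget(128, 200, 1e9)` reproduces the numbers in the answer: 16,384 PEs, 3,276,800 multiplier gates, 32,768 bytes/cycle, and about 32.8 TB/s. The same call with `p = 4` shows why the original 4×4 array fits comfortably on an FPGA.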
Q5. If you had to add a second accelerator (e.g., a vector unit for ReLU), how would you integrate it?
Add a second AXI4-Lite slave at a new MMIO address (e.g., 0x6100_0000) in the crossbar. The vector unit takes a pointer to the INT32 output buffer, a pointer to the INT32 bias array, the length N, and a scale factor. It reads via AXI4 master, applies bias+ReLU+requantise in a single pass (one SIMD-width group of elements per cycle), and writes INT8 back to the activation buffer. This removes the CPU bottleneck for large layers (at N=4096, CPU requantise takes ~4096 × 5 = 20,480 cycles; a 16-wide SIMD vector unit takes 4096 / 16 = 256 cycles). Integration steps: (1) Add RTL, (2) Add MMIO address to crossbar, (3) Add bare-metal driver functions, (4) Modify mlp_infer() to call accel_relu_requant() after each accel_matmul_poll(), (5) Add SVA property: relu_out[i] must always be ≥ 0.
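The cycle comparison in this answer is a first-order model that can be written down directly. The sketch below uses hypothetical helper names and ignores AXI latency, pipeline fill, and MMIO setup cost, so it gives a lower bound for the vector unit rather than a measured figure.

```c
#include <stdint.h>

/* CPU path: a fixed number of cycles per element (the text assumes ~5). */
uint64_t cpu_requant_cycles(uint64_t n, uint64_t cycles_per_elem)
{
    return n * cycles_per_elem;
}

/* Vector path: one cycle per SIMD-wide group of elements, rounding up
 * when the last group is only partially filled. */
uint64_t vec_requant_cycles(uint64_t n, uint64_t simd_width)
{
    return (n + simd_width - 1) / simd_width;
}
```

With N = 4096 this gives 20,480 CPU cycles versus 256 vector-unit cycles, an 80× reduction for the post-processing stage. For the 10-element output layer the vector unit needs a single cycle, so fixed setup overhead would dominate there.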
Q6. What would you change if this design had to pass tapeout at TSMC 7nm?
Key changes for real silicon tapeout: (1) Replace FPGA-specific block RAMs with foundry SRAM compilers (Arm Artisan, Synopsys DesignWare) and redo placement, (2) Add full UPF/CPF power intent with level shifters for VDD_ACCEL ≠ VDD_AON, (3) Add ESD protection rings and IO pads (pad frame), (4) Run DRC (Design Rule Check), LVS (Layout vs Schematic), and antenna check; TSMC rules are hundreds of pages, and wire routing must be sign-off clean, (5) Add redundancy: column repair for the SRAM (one or two redundant columns to fix stuck bits found post-silicon), (6) Add MBIST for all SRAMs and scan chains for all logic (DFT), (7) Add electromigration signoff on all power rails (reference: EM calculator at ecrionix.org/em-calculator/), (8) Thermal analysis: 7nm at 1 GHz may exceed 0.5W/mm², which calls for thermal vias and careful package selection, (9) Multiple PVT corners: TT/FF/SS at −40°C to 125°C must all pass timing. A full tapeout at TSMC 7nm takes 6–12 months from RTL to GDS, and the mask set costs $3–10M.