You have built a complete RV32I CPU, pipelined it, added peripherals, and run it on an FPGA. For the final lesson we look at the next major performance step, caches, and map out what would be needed to run Linux on your core. That gap is what separates a microcontroller-class design from a full application processor.
Modern DRAM has 50–100 ns access latency. At 500 MHz that is 25–50 cycles wasted on every cache miss. A small, fast SRAM cache (L1: 32 KB, 1–4 cycles) sits between the CPU and DRAM and exploits temporal locality (recently used data is likely needed again) and spatial locality (nearby data is likely needed soon).
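To put a rough number on the benefit, the standard average memory access time (AMAT) estimate can be applied. The 1-cycle hit time and the 5% miss rate below are illustrative assumptions, not measurements from this course; the 50-cycle miss penalty is the upper DRAM figure quoted above.

$$
\text{AMAT} = t_{\text{hit}} + m \cdot t_{\text{miss}} = 1 + 0.05 \times 50 = 3.5 \text{ cycles}
$$

Under those assumptions the average access costs about 3.5 cycles instead of 50, which is why even a small L1 pays for itself.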
For a 16-line, 4-byte-per-line direct-mapped cache on a 32-bit address:
32-bit address:  [ TAG (26 bits) | INDEX (4 bits) | OFFSET (2 bits) ]
                   bits 31..6      bits 5..2        bits 1..0
INDEX:  selects which of the 16 cache lines to check
OFFSET: selects which byte within the 4-byte cache line
TAG:    stored alongside the line; must match the address tag for a hit
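As a worked example of the split above, here is a minimal sketch that slices a few addresses into their fields. The module names `addr_split` and `addr_split_demo` are hypothetical and not part of the course sources; the field widths are the 26/4/2 split described above.

```verilog
// addr_split.v: hypothetical helper, shown only to illustrate the field split
module addr_split #(
    parameter IDX_BITS = 4,   // log2(16 lines)
    parameter OFF_BITS = 2    // log2(4 bytes per line)
)(
    input  [31:0]                   addr,
    output [31-IDX_BITS-OFF_BITS:0] tag,     // 26 bits with the defaults
    output [IDX_BITS-1:0]           index,
    output [OFF_BITS-1:0]           offset
);
    assign offset = addr[OFF_BITS-1:0];
    assign index  = addr[OFF_BITS+IDX_BITS-1:OFF_BITS];
    assign tag    = addr[31:OFF_BITS+IDX_BITS];
endmodule

// Simulation-only demo: prints the fields for three addresses
module addr_split_demo;
    reg  [31:0] addr;
    wire [25:0] tag;
    wire [3:0]  index;
    wire [1:0]  offset;

    addr_split u_split (.addr(addr), .tag(tag), .index(index), .offset(offset));

    initial begin
        addr = 32'h0000_0010; #1;   // tag 0, index 4, offset 0
        $display("addr=%h tag=%0d index=%0d offset=%0d", addr, tag, index, offset);
        addr = 32'h0000_0110; #1;   // tag 4, index 4, offset 0
        $display("addr=%h tag=%0d index=%0d offset=%0d", addr, tag, index, offset);
        addr = 32'h0000_0013; #1;   // tag 0, index 4, offset 3
        $display("addr=%h tag=%0d index=%0d offset=%0d", addr, tag, index, offset);
    end
endmodule
```

Addresses 0x0000_0010 and 0x0000_0110 share index 4 but carry different tags, so in a direct-mapped cache they compete for the same line. That is the conflict-miss case.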
// icache.v: 16-line direct-mapped instruction cache
// Line size: 1 word (4 bytes). Tag, index, and offset are sliced from the 32-bit address.
// On a miss, the line is filled from the backing imem (next-level store).
// The fill happens in the same cycle, so the backing store must return
// mem_rdata combinationally for the address on mem_addr.
module icache #(
    parameter LINES = 16              // must be a power of 2 (at least 2)
)(
    input             clk, rst,
    input      [31:0] addr,           // fetch address from PC
    input             req,            // 1 = fetch requested
    output reg [31:0] rdata,          // instruction data (registered)
    output reg        hit,            // 1 = the last sampled request hit (registered, held while req = 0)
    // Backing store (e.g. BRAM imem)
    output     [31:0] mem_addr,
    input      [31:0] mem_rdata
);
    localparam IDX_BITS = $clog2(LINES);              // 4 when LINES = 16
    localparam OFF_BITS = 2;                          // log2(4 bytes) = 2
    localparam TAG_BITS = 32 - IDX_BITS - OFF_BITS;   // 26 when LINES = 16

    reg [TAG_BITS-1:0] tag_array  [0:LINES-1];
    reg [31:0]         data_array [0:LINES-1];
    reg                valid      [0:LINES-1];

    // offset is unused here because each line holds exactly one word
    wire [OFF_BITS-1:0] offset = addr[OFF_BITS-1:0];
    wire [IDX_BITS-1:0] index  = addr[OFF_BITS+IDX_BITS-1:OFF_BITS];
    wire [TAG_BITS-1:0] tag    = addr[31:OFF_BITS+IDX_BITS];

    // Connect miss path to backing store
    assign mem_addr = addr;

    integer i;
    always @(posedge clk or posedge rst) begin
        if (rst) begin
            // Invalidate every line; tags and data need no reset
            for (i = 0; i < LINES; i = i+1)
                valid[i] <= 1'b0;
            hit <= 1'b0;
        end else if (req) begin
            if (valid[index] && (tag_array[index] == tag)) begin
                // Cache hit
                rdata <= data_array[index];
                hit   <= 1'b1;
            end else begin
                // Cache miss: fill from backing store
                data_array[index] <= mem_rdata;
                tag_array[index]  <= tag;
                valid[index]      <= 1'b1;
                rdata             <= mem_rdata;
                hit               <= 1'b0;   // miss now; the same address hits on the next request
            end
        end
    end
endmodule
// tb_icache.v: verify cache hit after first miss
`timescale 1ns/1ps
module tb_icache;
    reg clk=0, rst=1;
    always #5 clk=~clk;

    reg  [31:0] addr;
    reg         req;
    wire [31:0] rdata;
    wire        hit;
    wire [31:0] mem_addr;

    // Stub memory: returns 0xDEAD_BEEF for any address.
    // A continuous assignment is used because an always @(*) block with a
    // constant right-hand side has no sensitivities and may never execute,
    // leaving mem_rdata at X.
    wire [31:0] mem_rdata = 32'hDEAD_BEEF;

    icache dut(.clk(clk),.rst(rst),.addr(addr),.req(req),
               .rdata(rdata),.hit(hit),
               .mem_addr(mem_addr),.mem_rdata(mem_rdata));

    // Stimulus changes on the falling edge so it never races with the
    // DUT, which samples its inputs on the rising edge.
    initial begin
        $dumpfile("tb_icache.vcd"); $dumpvars(0,tb_icache);
        req=0; addr=0;
        @(negedge clk); @(negedge clk); rst=0;

        // First access: expect MISS
        addr=32'h0000_0010; req=1;
        @(negedge clk); req=0;   // request was sampled at the rising edge in between
        if(!hit) $display("PASS: first access = miss (cold)");
        else $display("FAIL: expected miss on cold cache");

        // Second access same addr: expect HIT
        req=1;
        @(negedge clk); req=0;
        if(hit && rdata===32'hDEAD_BEEF) $display("PASS: second access = hit, data=DEADBEEF");
        else $display("FAIL: expected hit, hit=%b data=%h",hit,rdata);

        // Different address (same index 4, tag 4 instead of 0): expect MISS (conflict)
        addr=32'h0000_0110; req=1;
        @(negedge clk); req=0;
        if(!hit) $display("PASS: conflict miss on different tag");
        else $display("FAIL: expected conflict miss");

        $finish;
    end
endmodule
| Feature | Why Linux needs it | Complexity |
|---|---|---|
| S-mode privilege | The Linux kernel runs in S-mode; user applications run in U-mode | Medium: adds supervisor CSRs, privilege switching |
| Sv32 MMU | Virtual memory, required for process isolation and a private address space per process | High: page-table walker, TLB, PTEs |
| CLINT timer | Core-Local Interruptor: provides mtime/mtimecmp for timer interrupts | Low: simple MMIO counter and comparator |
| PLIC | Platform-Level Interrupt Controller: routes external interrupts to harts | Medium: priority encoder + claim/complete protocol |
| OpenSBI | Open Source RISC-V Supervisor Binary Interface: M-mode firmware that Linux calls via ECALL | Software only: open source firmware |
| Device tree | Describes hardware to Linux (memory ranges, UART, PLIC addresses) | Low: a .dts text file |
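To show why the timer is rated "Low", here is a minimal sketch of the mtime/mtimecmp pair. The module name `clint_sketch`, the `sel`/`we` bus interface, and the choice to tick `mtime` once per clock are all assumptions made for illustration; this is not the register map of any real CLINT, and `mtime` is read-only here to keep the sketch short.

```verilog
// clint_sketch.v: hypothetical minimal machine timer (illustrative only)
module clint_sketch (
    input             clk,
    input             rst,        // synchronous reset
    input             we,         // write strobe (assumed bus interface)
    input      [1:0]  sel,        // 0: mtime lo, 1: mtime hi, 2: mtimecmp lo, 3: mtimecmp hi
    input      [31:0] wdata,
    output reg [31:0] rdata,
    output            timer_irq   // would drive the machine timer interrupt pending bit
);
    reg [63:0] mtime;
    reg [63:0] mtimecmp;

    // The timer interrupt is pending while mtime >= mtimecmp (unsigned compare)
    assign timer_irq = (mtime >= mtimecmp);

    always @(posedge clk) begin
        if (rst) begin
            mtime    <= 64'd0;
            mtimecmp <= 64'hFFFF_FFFF_FFFF_FFFF;   // no interrupt until software programs it
        end else begin
            mtime <= mtime + 64'd1;                // free-running counter
            if (we && sel == 2'd2) mtimecmp[31:0]  <= wdata;
            if (we && sel == 2'd3) mtimecmp[63:32] <= wdata;
        end
    end

    // Combinational read mux: a 32-bit bus sees each 64-bit register as two halves
    always @(*) begin
        case (sel)
            2'd0:    rdata = mtime[31:0];
            2'd1:    rdata = mtime[63:32];
            2'd2:    rdata = mtimecmp[31:0];
            default: rdata = mtimecmp[63:32];
        endcase
    end
endmodule
```

Because a 32-bit core updates `mtimecmp` in two writes, software normally orders them so the intermediate value never falls below the current time, for example by writing the high half before the low half when moving the deadline forward from the reset value.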
You have gone from a register file in Day 9 to a pipelined, hazard-handled, FPGA-proven RV32I CPU in 25 days. You understand every wire in the datapath, every pipeline register, every hazard, and the path forward to a production-grade Linux core. That knowledge is the foundation for anything in VLSI, FPGA design, or computer architecture.
What next? Explore VLSI Design, FPGA from Scratch, or study CVA6/VexRiscv source code to see how the concepts from this course scale to a full application processor.
The address is split into tag, index, and offset. The index selects a cache line. If the valid bit is set and the stored tag matches the address tag, it is a hit. Otherwise it is a miss and the line is fetched from main memory.
S-mode privilege, Sv32 virtual memory (MMU), a CLINT timer, a PLIC interrupt controller, OpenSBI M-mode firmware, and a device tree. Taken together they are substantially more work than our 25-day RV32I base, with the MMU the largest single piece.
CVA6 (ETH Zurich, RV64GC), VexRiscv (SpinalHDL, configurable), and Rocket Chip (UC Berkeley, full SoC generator). All are open source on GitHub.