DAY 15 · MEMORY

Using Block RAM & ROMs on FPGA

By EcrioniX · Updated Jun 11, 2026

FPGAs contain dedicated memory blocks (tens on small devices, thousands on large ones) that are fast, power-efficient, and cost no LUTs. This lesson shows you how to write Verilog that maps cleanly to Block RAM: a single-port synchronous BRAM and a ROM pre-loaded with $readmemh. Understanding how memory is inferred, and when it is not, is one of the most practically important FPGA skills.

1. Block RAM architecture

A Block RAM (BRAM) is a hard tile on the FPGA die: a dedicated dual-port SRAM that runs at full system clock speed with no LUT usage. Tiles are 18 Kb or 36 Kb on Xilinx 7-series parts; other families use different block sizes.

BRAM inference rule

Read the memory inside an always @(posedge clk) block, and write in the same block with if (we) mem[addr] <= din. Synthesis tools infer BRAM automatically when they see this pattern. If you read combinationally (for example with a continuous assign outside the clocked block), the tool falls back to distributed RAM or registers.
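For contrast with the clocked-read pattern, here is a minimal sketch of the combinational read that the rule warns about. The module name ram_async_read and its default parameters are illustrative and not part of the lesson. With a read like this, a synthesis tool will typically build the memory from LUTs or registers rather than a BRAM tile.

```verilog
// ram_async_read.v: hypothetical counter-example for the inference rule.
// The write is clocked, but the read is combinational, so this pattern
// is normally mapped to distributed (LUT) RAM or registers, not Block RAM.
module ram_async_read #(
    parameter DATA_W = 8,
    parameter ADDR_W = 4     // kept small: LUT RAM gets costly for deep memories
)(
    input  wire              clk,
    input  wire              we,
    input  wire [ADDR_W-1:0] addr,
    input  wire [DATA_W-1:0] din,
    output wire [DATA_W-1:0] dout
);

reg [DATA_W-1:0] mem [0:(1<<ADDR_W)-1];

// Synchronous write, same as in the BRAM pattern
always @(posedge clk)
    if (we)
        mem[addr] <= din;

// Asynchronous read: dout follows addr with no clock edge in between
assign dout = mem[addr];

endmodule
```

Checking the utilisation report after synthesis is the reliable way to confirm which resource the tool actually chose.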

2. Port table — bram_sp

Port   Dir   Width    Description
clk    IN    1        Clock. All operations are synchronous to this edge.
we     IN    1        Write enable. When high on a clock edge, writes din to addr.
addr   IN    ADDR_W   Memory address. Valid range: 0 to 2^ADDR_W−1.
din    IN    DATA_W   Data to write. Only captured when we is high.
dout   OUT   DATA_W   Read data. Registered: valid one cycle after addr is presented.
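Because dout lags addr by one cycle, surrounding logic usually needs a way to know when the output belongs to a given request. The sketch below is one possible approach, not the lesson's module: it adds a hypothetical rd_en request input and a rd_valid flag that is simply rd_en delayed by one clock, so it lines up with dout. The module name bram_rd_valid and both extra ports are assumptions made for illustration.

```verilog
// bram_rd_valid.v: hypothetical sketch of tracking the 1-cycle read latency.
// rd_en and rd_valid are illustrative additions, not ports of bram_sp.
module bram_rd_valid #(
    parameter DATA_W = 8,
    parameter ADDR_W = 10
)(
    input  wire              clk,
    input  wire              we,
    input  wire              rd_en,     // assumed read-request strobe
    input  wire [ADDR_W-1:0] addr,
    input  wire [DATA_W-1:0] din,
    output reg  [DATA_W-1:0] dout,
    output reg               rd_valid   // high in the cycle where dout holds the requested word
);

reg [DATA_W-1:0] mem [0:(1<<ADDR_W)-1];

initial rd_valid = 1'b0;

always @(posedge clk) begin
    if (we)
        mem[addr] <= din;
    dout     <= mem[addr];   // registered read, as in the port table
    rd_valid <= rd_en;       // delayed by one cycle to line up with dout
end

endmodule
```

A consumer can then sample dout only in cycles where rd_valid is high, which keeps the latency handling in one place.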

3. bram_sp.v — single-port synchronous BRAM

bram_sp.v
// bram_sp.v — Single-Port Synchronous Block RAM
// Synthesis tools infer BRAM when read is clocked.
// Parameters: DATA_W = data width, ADDR_W = address width
// Memory depth = 2^ADDR_W words.

module bram_sp #(
    parameter DATA_W = 8,
    parameter ADDR_W = 10    // 1024 x 8-bit = 1 KB
)(
    input  wire             clk,
    input  wire             we,
    input  wire [ADDR_W-1:0] addr,
    input  wire [DATA_W-1:0] din,
    output reg  [DATA_W-1:0] dout
);

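// Declare memory array
// Synthesis tool maps this to BRAM tiles when read is synchronous
reg [DATA_W-1:0] mem [0:(1<<ADDR_W)-1];

// Clocked write and clocked read, per the inference rule in section 1.
// dout is assigned the old contents of mem[addr], so a same-address
// write and read returns the previous value (read-first).
always @(posedge clk) begin
    if (we)
        mem[addr] <= din;
    dout <= mem[addr];
end

endmodule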

4. Read-first vs write-first mode

The code above is read-first mode: when a write and read happen at the same address simultaneously, the old value is returned. This is the safest and most portable mode. Xilinx and Intel both support it. If you need write-first (new value returned on same-cycle write), add a mux in RTL — but beware: write-first is not universally synthesisable to BRAM across all families.
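A minimal sketch of the write-first alternative follows, assuming the same ports as bram_sp. The module name bram_sp_wf is made up for illustration. The only change is that dout takes din directly on a write cycle; whether a given tool maps this to a BRAM tile depends on the device family and tool version, so treat it as something to verify in the synthesis report.

```verilog
// bram_sp_wf.v: hypothetical write-first variant of the single-port RAM.
// On a write cycle dout returns the new data instead of the old word.
module bram_sp_wf #(
    parameter DATA_W = 8,
    parameter ADDR_W = 10
)(
    input  wire              clk,
    input  wire              we,
    input  wire [ADDR_W-1:0] addr,
    input  wire [DATA_W-1:0] din,
    output reg  [DATA_W-1:0] dout
);

reg [DATA_W-1:0] mem [0:(1<<ADDR_W)-1];

always @(posedge clk) begin
    if (we) begin
        mem[addr] <= din;
        dout      <= din;         // write-first: forward the new value
    end else begin
        dout      <= mem[addr];   // normal registered read
    end
end

endmodule
```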

5. Port table — rom_sync

Port   Dir   Width    Description
clk    IN    1        Clock
addr   IN    ADDR_W   Read address
dout   OUT   DATA_W   ROM data output, valid one cycle after addr

6. rom_sync.v — synchronous ROM with $readmemh

rom_sync.v
// rom_sync.v — Synchronous ROM initialised from a hex file
// Create "rom_init.mem" with one hex value per line.
// Synthesis loads these values into BRAM at bitstream time.

module rom_sync #(
    parameter DATA_W   = 8,
    parameter ADDR_W   = 8,          // 256 entries
    parameter MEM_FILE = "rom_init.mem"
)(
    input  wire             clk,
    input  wire [ADDR_W-1:0] addr,
    output reg  [DATA_W-1:0] dout
);

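reg [DATA_W-1:0] rom [0:(1<<ADDR_W)-1];

// Load initial contents from the hex file (one value per line)
initial $readmemh(MEM_FILE, rom);

// Clocked read: dout is valid one cycle after addr
always @(posedge clk)
    dout <= rom[addr];

endmodule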

Example rom_init.mem

rom_init.mem
// rom_init.mem — 16 entries, 8-bit sine-like lookup table (quarter wave)
00
19
32
4A
61
76
89
9A
A9
B5
BD
C3
C6
C7
C6
C3
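To see how $readmemh interprets a file like this, the simulation-only sketch below writes a four-entry file and loads it straight back into an array. The module name readmemh_demo and the file name demo_init.mem are made up for illustration, and the file I/O here is for simulation only, not synthesis.

```verilog
// readmemh_demo.v: hypothetical simulation-only demo of $readmemh.
// Writes a tiny hex file, loads it into an array, and prints the result.
module readmemh_demo;

reg [7:0] rom [0:3];
integer fd;
integer i;

initial begin
    // Create the init file: one hex value per line, comments allowed
    fd = $fopen("demo_init.mem", "w");
    $fdisplay(fd, "// first four entries of the lookup table");
    $fdisplay(fd, "00");
    $fdisplay(fd, "19");
    $fdisplay(fd, "32");
    $fdisplay(fd, "4A");
    $fclose(fd);

    // Load the file into the array, starting at the lowest address
    $readmemh("demo_init.mem", rom);

    for (i = 0; i < 4; i = i + 1)
        $display("rom[%0d] = 0x%02X", i, rom[i]);
end

endmodule
```

If a file holds fewer values than the array has entries, as with the 16-entry file and the default 256-entry rom_sync, the remaining locations are left uninitialised in simulation.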

7. Testbench — tb_bram_sp.v

tb_bram_sp.v
// tb_bram_sp.v — self-checking testbench for bram_sp
`timescale 1ns/1ps

module tb_bram_sp;

parameter DATA_W = 8;
parameter ADDR_W = 4;   // 16 locations for quick test

reg               clk  = 0;
reg               we   = 0;
reg  [ADDR_W-1:0] addr = 0;
reg  [DATA_W-1:0] din  = 0;
wire [DATA_W-1:0] dout;

bram_sp #(.DATA_W(DATA_W),.ADDR_W(ADDR_W)) dut (
    .clk(clk),.we(we),.addr(addr),.din(din),.dout(dout)
);

always #5 clk = ~clk;   // 100 MHz

integer pass_cnt = 0;
integer fail_cnt = 0;
integer i;
reg [DATA_W-1:0] expected;

initial begin
    $dumpfile("tb_bram_sp.vcd");
    $dumpvars(0, tb_bram_sp);

    // --- Phase 1: write a known pattern ---
    // Write addr[i] = i * 5 (mod 256)
    for (i = 0; i < (1 << ADDR_W); i = i + 1) begin
        @(posedge clk);
        we   <= 1;
        addr <= i[ADDR_W-1:0];
        din  <= (i * 5) & 8'hFF;
    end
    @(posedge clk);
    we <= 0;

    // --- Phase 2: read back and verify ---
    // Synchronous BRAM has 1-cycle read latency: drive addr on the falling
    // edge, let one rising edge register the read, then check dout.
    for (i = 0; i < (1 << ADDR_W); i = i + 1) begin
        @(negedge clk);   // set address on negedge
        addr = i[ADDR_W-1:0];
        @(posedge clk);   // clock edge captures address
        @(posedge clk);   // output valid on this edge
        expected = (i * 5) & 8'hFF;
        if (dout === expected) begin
            $display("PASS: addr=%0d  data=0x%02X", i, dout);
            pass_cnt = pass_cnt + 1;
        end else begin
            $display("FAIL: addr=%0d  got=0x%02X  exp=0x%02X", i, dout, expected);
            fail_cnt = fail_cnt + 1;
        end
    end

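    // --- Phase 3: write then read same address ---
    // The DUT is read-first, so dout shows the old word on the write edge
    // and the new word one clock later. Inputs change on negedge to avoid
    // racing the DUT's posedge sampling.
    @(negedge clk);
    addr = 4'h5; din = 8'hAB; we = 1;
    @(posedge clk);   // write captured here
    @(negedge clk);
    we = 0;
    @(posedge clk);   // read of the new value is registered here
    @(negedge clk);   // sample dout once it has settled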
    if (dout === 8'hAB) begin
        $display("PASS: overwrite addr=5 -> 0xAB");
        pass_cnt = pass_cnt + 1;
    end else begin
        $display("FAIL: overwrite addr=5 got=0x%02X exp=0xAB", dout);
        fail_cnt = fail_cnt + 1;
    end

    if (fail_cnt == 0)
        $display("\nALL TESTS PASSED (%0d/%0d)", pass_cnt, pass_cnt+fail_cnt);
    else
        $display("\nFAILED: %0d passed, %0d failed", pass_cnt, fail_cnt);

    $finish;
end

initial begin
    #5000;
    $display("TIMEOUT");
    $finish;
end

endmodule

8. Expected output

PASS: addr=0  data=0x00
PASS: addr=1  data=0x05
PASS: addr=2  data=0x0A
PASS: addr=3  data=0x0F
...
PASS: addr=15  data=0x4B
PASS: overwrite addr=5 -> 0xAB

ALL TESTS PASSED (17/17)

9. BRAM vs Distributed RAM — when to use which

Feature        Block RAM                           Distributed RAM (LUT)
Capacity       18–36 Kb per tile                   64–256 bits per LUT
Read latency   1 clock cycle (registered)          Combinational (0 cycles)
Speed          Very high (500 MHz+)                Moderate (limited by routing)
LUT cost       Zero LUTs                           1 LUT per 64 bits
Best use       FIFOs, frame buffers, large tables  Small shift registers, tiny RAMs
Dual-port      Yes, independent clocks             Limited
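When the tool's automatic choice is not the one you want, vendors generally provide a synthesis attribute to steer it. The sketch below uses Vivado's ram_style attribute to request distributed RAM for a small table; Quartus uses a differently named attribute (ramstyle) with its own values, so treat the attribute line as tool-specific. The module name small_table and the 32-word depth are assumptions chosen for illustration.

```verilog
// small_table.v: hypothetical sketch of steering RAM mapping with an attribute.
// The attribute is a Vivado-specific hint; simulators ignore it.
module small_table #(
    parameter DATA_W = 8,
    parameter ADDR_W = 5      // 32 words: a whole BRAM tile would be mostly unused
)(
    input  wire              clk,
    input  wire              we,
    input  wire [ADDR_W-1:0] addr,
    input  wire [DATA_W-1:0] din,
    output reg  [DATA_W-1:0] dout
);

(* ram_style = "distributed" *)   // use "block" instead to request Block RAM
reg [DATA_W-1:0] mem [0:(1<<ADDR_W)-1];

always @(posedge clk) begin
    if (we)
        mem[addr] <= din;
    dout <= mem[addr];   // registered read is kept, so timing matches bram_sp
end

endmodule
```

Keeping the registered read means the same RTL behaves identically in simulation whichever resource the tool ends up using.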

Key Takeaways

Frequently Asked Questions

What is the difference between BRAM and distributed RAM?

Block RAM is a dedicated hard memory tile — 18 or 36 Kb per block, zero LUT cost. Distributed RAM uses LUTs as small memory cells. Use BRAM for large buffers (FIFOs, frame buffers). Use distributed RAM for very small shift registers or tiny lookup tables.

Why is BRAM output registered?

A BRAM read is clocked so that the memory can run at very high clock speeds. The address is presented on clock N and data appears on clock N+1. This 1-cycle latency is the trade-off for >500 MHz operation on fast devices. Your control logic must account for this delay.

How do you initialise a ROM in Verilog?

Use $readmemh in an initial block with a .mem file containing one hex value per line. Synthesis tools recognise this pattern and pre-load the BRAM at configuration time — no runtime code needed.
