FPGAs contain dedicated memory blocks, from a few dozen on small devices to thousands on large ones, that are fast, power-efficient, and consume no LUTs. This lesson shows how to write Verilog that maps cleanly to Block RAM: a single-port synchronous BRAM and a ROM pre-loaded with $readmemh. Knowing when a memory will be inferred as Block RAM, and when it will not, is one of the most practically useful FPGA skills.
A Block RAM (BRAM) is a hard tile on the FPGA die: a dedicated dual-port SRAM (18 Kb or 36 Kb on Xilinx 7-series devices; other families use different block sizes) that runs at full system clock speed and uses no LUTs.
To infer BRAM, read the memory inside an always @(posedge clk) block and write in the same block with if (we) mem[addr] <= din. Synthesis tools recognise this pattern and map the array to BRAM automatically. If the read is combinational (a continuous assign or an always @(*) block), the tool falls back to distributed RAM or registers, because BRAM has no asynchronous read port.
| Port | Dir | Width | Description |
|---|---|---|---|
| clk | IN | 1 | Clock. All operations are synchronous to this edge. |
| we | IN | 1 | Write enable. When high on a clock edge, writes din to addr. |
| addr | IN | ADDR_W | Memory address. Valid range: 0 to 2^ADDR_W−1. |
| din | IN | DATA_W | Data to write. Only captured when we is high. |
| dout | OUT | DATA_W | Read data. Registered — valid one cycle after addr is presented. |
// bram_sp.v: Single-Port Synchronous Block RAM
// Synthesis tools infer BRAM when read is clocked.
// Parameters: DATA_W = data width, ADDR_W = address width
// Memory depth = 2^ADDR_W words.
module bram_sp #(
    parameter DATA_W = 8,
    parameter ADDR_W = 10   // 1024 x 8-bit = 1 KB
)(
    input  wire              clk,
    input  wire              we,
    input  wire [ADDR_W-1:0] addr,
    input  wire [DATA_W-1:0] din,
    output reg  [DATA_W-1:0] dout
);
    // Declare memory array
    // Synthesis tool maps this to BRAM tiles when read is synchronous
    reg [DATA_W-1:0] mem [0:(1<<ADDR_W)-1];

    // Clocked write and clocked read in the same always block
    always @(posedge clk) begin
        if (we)
            mem[addr] <= din;
        dout <= mem[addr];   // read-first: old value on a same-cycle write
    end
endmodule
The code above is read-first mode: when a write and read happen at the same address simultaneously, the old value is returned. This is the safest and most portable mode. Xilinx and Intel both support it. If you need write-first (new value returned on same-cycle write), add a mux in RTL — but beware: write-first is not universally synthesisable to BRAM across all families.
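As a rough illustration of the write-first variant described above, the sketch below forwards din to dout whenever a write occurs. The module name bram_sp_wf is hypothetical and not part of the lesson. The if/else form is one common way to express the mux; whether it still maps to BRAM depends on the device family and tool, so check the synthesis report.

```verilog
// bram_sp_wf.v: hypothetical write-first variant of bram_sp (illustrative sketch)
module bram_sp_wf #(
    parameter DATA_W = 8,
    parameter ADDR_W = 10
)(
    input  wire              clk,
    input  wire              we,
    input  wire [ADDR_W-1:0] addr,
    input  wire [DATA_W-1:0] din,
    output reg  [DATA_W-1:0] dout
);
    reg [DATA_W-1:0] mem [0:(1<<ADDR_W)-1];

    always @(posedge clk) begin
        if (we) begin
            mem[addr] <= din;
            dout      <= din;        // write-first: new data appears on dout
        end else begin
            dout      <= mem[addr];  // plain registered read
        end
    end
endmodule
```

The only difference from the read-first version is which value dout takes in a cycle where we is high; the read latency is still one clock.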
| Port | Dir | Width | Description |
|---|---|---|---|
| clk | IN | 1 | Clock |
| addr | IN | ADDR_W | Read address |
| dout | OUT | DATA_W | ROM data output, valid one cycle after addr |
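// rom_sync.v: Synchronous ROM initialised from a hex file
// Create "rom_init.mem" with one hex value per line.
// Synthesis embeds these values in the bitstream, so the BRAM is loaded at configuration.
module rom_sync #(
    parameter DATA_W   = 8,
    parameter ADDR_W   = 8,               // 256 entries
    parameter MEM_FILE = "rom_init.mem"
)(
    input  wire              clk,
    input  wire [ADDR_W-1:0] addr,
    output reg  [DATA_W-1:0] dout
);
    reg [DATA_W-1:0] rom [0:(1<<ADDR_W)-1];

    // Pre-load the array from the hex file
    initial $readmemh(MEM_FILE, rom);

    // Registered read: dout is valid one cycle after addr
    always @(posedge clk)
        dout <= rom[addr];
endmodule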
// rom_init.mem: 16 entries, 8-bit sine-like lookup table (quarter wave)
00
19
32
4A
61
76
89
9A
A9
B5
BD
C3
C6
C7
C6
C3
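A quick way to confirm that the file loads as intended is a spot-check testbench. The sketch below is not from the lesson: the name tb_rom_sync and the check task are hypothetical, and it assumes rom_sync and rom_init.mem sit in the simulation directory. It instantiates the ROM with ADDR_W = 4 so that the 16-entry file fills the whole array; with the default ADDR_W = 8, addresses 16 to 255 would stay uninitialised in simulation.

```verilog
// tb_rom_sync.v: hypothetical spot-check testbench for rom_sync (illustrative sketch)
`timescale 1ns/1ps
module tb_rom_sync;
    reg        clk  = 0;
    reg  [3:0] addr = 0;
    wire [7:0] dout;

    // ADDR_W = 4 matches the 16-entry rom_init.mem
    rom_sync #(.DATA_W(8), .ADDR_W(4), .MEM_FILE("rom_init.mem")) dut (
        .clk(clk), .addr(addr), .dout(dout)
    );

    always #5 clk = ~clk;

    // Read one address and compare it with the expected file entry.
    // The address changes on a negedge to stay clear of the sampling edge.
    task check(input [3:0] a, input [7:0] expected);
        begin
            @(negedge clk) addr = a;
            @(posedge clk);   // ROM registers rom[a] on this edge
            @(negedge clk);   // dout has settled by now
            if (dout === expected)
                $display("PASS: rom[%0d] = 0x%02X", a, dout);
            else
                $display("FAIL: rom[%0d] = 0x%02X, expected 0x%02X", a, dout, expected);
        end
    endtask

    initial begin
        check(4'd0,  8'h00);
        check(4'd1,  8'h19);
        check(4'd2,  8'h32);
        check(4'd15, 8'hC3);
        $finish;
    end
endmodule
```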
// tb_bram_sp.v: self-checking testbench for bram_sp
`timescale 1ns/1ps
module tb_bram_sp;
    parameter DATA_W = 8;
    parameter ADDR_W = 4; // 16 locations for quick test

    reg clk = 0;
    reg we = 0;
    reg [ADDR_W-1:0] addr = 0;
    reg [DATA_W-1:0] din = 0;
    wire [DATA_W-1:0] dout;

    bram_sp #(.DATA_W(DATA_W), .ADDR_W(ADDR_W)) dut (
        .clk(clk), .we(we), .addr(addr), .din(din), .dout(dout)
    );

    always #5 clk = ~clk; // 100 MHz

    integer pass_cnt = 0;
    integer fail_cnt = 0;
    integer i;
    reg [DATA_W-1:0] expected;

    initial begin
        $dumpfile("tb_bram_sp.vcd");
        $dumpvars(0, tb_bram_sp);

        // --- Phase 1: write a known pattern ---
        // Write mem[i] = i * 5 (mod 256)
        for (i = 0; i < (1 << ADDR_W); i = i + 1) begin
            @(posedge clk);
            we <= 1;
            addr <= i[ADDR_W-1:0];
            din <= (i * 5) & 8'hFF;
        end
        @(posedge clk);
        we <= 0;

        // --- Phase 2: read back and verify ---
        // Synchronous BRAM has 1-cycle read latency: set the address on a
        // negedge, let one posedge register the data, then sample dout at
        // the following posedge, before that edge updates it again.
        for (i = 0; i < (1 << ADDR_W); i = i + 1) begin
            @(negedge clk); // set address on negedge
            addr = i[ADDR_W-1:0];
            @(posedge clk); // clock edge captures address
            @(posedge clk); // output valid on this edge
            expected = (i * 5) & 8'hFF;
            if (dout === expected) begin
                $display("PASS: addr=%0d data=0x%02X", i, dout);
                pass_cnt = pass_cnt + 1;
            end else begin
                $display("FAIL: addr=%0d got=0x%02X exp=0x%02X", i, dout, expected);
                fail_cnt = fail_cnt + 1;
            end
        end

        // --- Phase 3: write then read same address ---
        @(negedge clk);
        addr = 4'h5; din = 8'hAB; we = 1;
        @(posedge clk); // write happens here; read-first dout gets the old value
        we <= 0;        // nonblocking, so the DUT still sees we = 1 on this edge
        @(posedge clk); // this edge reads the new value into dout
        @(posedge clk); // dout registered on the previous edge is sampled here
        if (dout === 8'hAB) begin
            $display("PASS: overwrite addr=5 -> 0xAB");
            pass_cnt = pass_cnt + 1;
        end else begin
            $display("FAIL: overwrite addr=5 got=0x%02X exp=0xAB", dout);
            fail_cnt = fail_cnt + 1;
        end

        if (fail_cnt == 0)
            $display("\nALL TESTS PASSED (%0d/%0d)", pass_cnt, pass_cnt+fail_cnt);
        else
            $display("\nFAILED: %0d passed, %0d failed", pass_cnt, fail_cnt);
        $finish;
    end

    // Watchdog: stop the simulation if the test sequence hangs
    initial begin
        #5000;
        $display("TIMEOUT");
        $finish;
    end
endmodule
PASS: addr=0 data=0x00
PASS: addr=1 data=0x05
PASS: addr=2 data=0x0A
PASS: addr=3 data=0x0F
...
PASS: addr=15 data=0x4B
PASS: overwrite addr=5 -> 0xAB

ALL TESTS PASSED (17/17)
| Feature | Block RAM | Distributed RAM (LUT) |
|---|---|---|
| Capacity | 18–36 Kb per tile | 64–256 bits per LUT |
| Read latency | 1 clock cycle (registered) | Combinational (0 cycles) |
| Speed | Very high (500 MHz+) | Moderate (limited by routing) |
| LUT cost | Zero LUTs | 1 LUT per 64 bits |
| Best use | FIFOs, frame buffers, large tables | Small shift registers, tiny RAMs |
| Dual-port | Yes, independent clocks | Limited |
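For contrast with the BRAM column, here is a minimal sketch of a memory with a combinational read, the style that tools typically map to distributed (LUT) RAM or registers. The module name dist_ram_async and the 32-word depth are assumptions chosen for illustration, not part of the lesson.

```verilog
// dist_ram_async.v: hypothetical small RAM with asynchronous read (illustrative sketch)
module dist_ram_async #(
    parameter DATA_W = 8,
    parameter ADDR_W = 5     // 32 words: small enough to suit LUT RAM
)(
    input  wire              clk,
    input  wire              we,
    input  wire [ADDR_W-1:0] addr,
    input  wire [DATA_W-1:0] din,
    output wire [DATA_W-1:0] dout
);
    reg [DATA_W-1:0] mem [0:(1<<ADDR_W)-1];

    // Writes are still synchronous
    always @(posedge clk)
        if (we)
            mem[addr] <= din;

    // Combinational read: dout follows addr with no clock edge in between
    assign dout = mem[addr];
endmodule
```

The zero-cycle read is convenient for small tables, but the read path becomes part of the surrounding combinational logic, which is why this style suits only small memories.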
$readmemh in an initial block pre-loads BRAM when the bitstream is loaded, so no run-time initialisation is needed.
Block RAM is a dedicated hard memory tile: 18 or 36 Kb per block, zero LUT cost. Distributed RAM uses LUTs as small memory cells. Use BRAM for large buffers (FIFOs, frame buffers). Use distributed RAM for very small shift registers or tiny lookup tables.
BRAM registers its read port so that it can run at very high clock speeds. The address is sampled on clock edge N and the data is available to downstream logic on edge N+1. This 1-cycle latency is the trade-off for operation at 500 MHz or more on fast devices. Your control logic must account for this delay.
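One simple way for control logic to account for the delay is to pipeline a read-valid flag alongside the data. The sketch below assumes a read-enable input; the module name bram_rd_valid and the signals rd_en and rd_valid are hypothetical names introduced here for illustration.

```verilog
// bram_rd_valid.v: hypothetical BRAM with a read-valid flag (illustrative sketch)
module bram_rd_valid #(
    parameter DATA_W = 8,
    parameter ADDR_W = 10
)(
    input  wire              clk,
    input  wire              we,
    input  wire              rd_en,     // high in the cycle addr is presented
    input  wire [ADDR_W-1:0] addr,
    input  wire [DATA_W-1:0] din,
    output reg  [DATA_W-1:0] dout,
    output reg               rd_valid   // high in the cycle dout holds that read
);
    reg [DATA_W-1:0] mem [0:(1<<ADDR_W)-1];

    initial rd_valid = 1'b0;

    always @(posedge clk) begin
        if (we)
            mem[addr] <= din;
        dout     <= mem[addr];  // data for the address sampled on this edge
        rd_valid <= rd_en;      // delayed by the same one cycle as the data
    end
endmodule
```

Downstream logic then consumes dout only when rd_valid is high, so it never needs to count cycles itself.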
Use $readmemh in an initial block with a .mem file containing one hex value per line. Synthesis tools recognise this pattern and pre-load the BRAM at configuration time — no runtime code needed.