Translate a 34-bit AXI4 byte address into six HBM3 fields — stack, pseudo-channel, bank group, bank, row, and column — with three configurable interleave modes for maximum bank-level parallelism.
Every HBM3 stack is a die-stacked DRAM assembly. Understanding its three-level hierarchy is the prerequisite for understanding why address mapping is non-trivial.
Stacks. A commercial HBM3 subsystem contains 1 to 4 stacks. Each stack places its DRAM dies on a base logic die that concentrates the PHY, ECC, and command decode. In a multi-stack system the stack select bits are the most-significant part of the address.
Channels and pseudo-channels. Each stack exposes 16 independent 64-bit channels. HBM3 splits each channel into a pair of 32-bit pseudo-channels (PC0 and PC1), and each PC has its own command path, row buffer, and bank state machine. A 16-channel stack therefore has 32 pseudo-channels.
Bank groups and banks. Within each pseudo-channel there are 8 bank groups (BG[2:0]) and 4 banks per group (BA[1:0]), yielding 32 banks per pseudo-channel. Each bank is an independent DRAM array with its own row of sense amplifiers and can be activated independently once a prior access has completed.
Address mapping is not merely a translation convenience — it is one of the primary tools a memory controller designer has for maximising sustained bandwidth. The DRAM timing model creates three categories of access cost:
| Scenario | Row buffer state | Cost |
|---|---|---|
| Row hit | Same row already open | ~CL (lowest) |
| Row miss (empty bank) | Bank precharged | tRCD + CL |
| Row conflict | Different row open | tRP + tRCD + CL (highest) |
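To make the three cost classes concrete, the sketch below tallies the latency of a short access trace against a single bank. The timing values are placeholders chosen for illustration, not figures from an HBM3 datasheet, and `access_cost` and `trace_cost` are hypothetical helper names. The model is strictly serial: it ignores any overlap between banks.

```python
# Illustrative timing parameters in controller clock cycles.
# ASSUMPTION: placeholder values, not taken from any HBM3 datasheet.
CL, T_RCD, T_RP = 14, 14, 14

def access_cost(open_row, row):
    """Return (cycles, new_open_row) for one access to a single bank.

    open_row is None when the bank is precharged (no row open).
    """
    if open_row == row:
        return CL, row                      # row hit
    if open_row is None:
        return T_RCD + CL, row              # row miss on a precharged bank
    return T_RP + T_RCD + CL, row           # row conflict: precharge first

def trace_cost(rows):
    """Total cycles for a sequence of row addresses that hit one bank."""
    open_row, total = None, 0
    for row in rows:
        cycles, open_row = access_cost(open_row, row)
        total += cycles
    return total

print(trace_cost([7, 7, 7, 7]))   # one miss, then three hits
print(trace_cost([0, 1, 2, 3]))   # one miss, then three conflicts
```

With these placeholder timings the conflict-heavy trace costs more than twice the hit-heavy one, which is the gap a good address mapping tries to close.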
A naïve linear mapping places consecutive pages into the same bank. A workload that walks a stride equal to one row size will hit the same bank on every access, creating constant row conflicts. A well-designed mapping scatters consecutive cache lines across different bank groups so that row activations in one group overlap the column access latency in another.
Beyond bank groups, the choice of which bits select the pseudo-channel determines whether sequential traffic fans out across PCs or hammers a single one. The address mapper in this module implements three interleave policies that cover the most common workload classes.
In the default mode (i_interleave_mode = 2'b00) the 34-bit byte address is partitioned as follows. Bit 0 is the LSB of the byte address (always 0 for 2-byte aligned burst beats, but we accept any address and decode the column from bits [5:1]).
| Field | Bits | Width | Meaning |
|---|---|---|---|
| STACK[1:0] | [33:32] | 2 b | Selects one of up to 4 stacks in a multi-stack HBM3 subsystem; bits [31:30] are reserved and not decoded |
| ROW[14:0] | [29:15] | 15 b | Row address within the selected bank — 32 768 rows per bank |
| PC[3:0] | [14:11] | 4 b | Pseudo-channel (0–15 within a stack; bits select among 16 PCs) |
| BG[2:0] | [10:8] | 3 b | Bank group within the pseudo-channel (0–7) |
| BA[1:0] | [7:6] | 2 b | Bank within the bank group (0–3) |
| COL[4:0] | [5:1] | 5 b | Column address (bit 0 is byte offset, dropped for 2-byte granularity) |
With this default layout a sequential burst increments the column first (bits [5:1]), then wraps into the next bank, then bank group, then pseudo-channel, and finally opens the next row only when the entire set of banks in all PCs has been visited. This maximises row-hit rate for workloads with spatial locality.
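The walk order described above can be checked with a small software model. This is a minimal sketch, assuming the default-mode bit positions from the table and the stack select in bits [33:32]; `decode_default` is a hypothetical name, not part of the RTL.

```python
def decode_default(addr):
    """Split a 34-bit byte address using the default-mode bit positions."""
    def bits(hi, lo):
        # Extract addr[hi:lo] as an unsigned integer.
        return (addr >> lo) & ((1 << (hi - lo + 1)) - 1)

    return {
        "stack": bits(33, 32),
        "row": bits(29, 15),
        "pc": bits(14, 11),
        "bg": bits(10, 8),
        "ba": bits(7, 6),
        "col": bits(5, 1),
    }

# Step through memory 64 bytes at a time: each step exhausts the column
# field, so the bank index advances first, then the bank group.
for addr in range(0, 6 * 64, 64):
    f = decode_default(addr)
    print(f"0x{addr:04X}: pc={f['pc']} bg={f['bg']} ba={f['ba']} row={f['row']}")

# The row field only changes once bits [14:1] have all wrapped (32 KiB).
print(decode_default(1 << 15)["row"])
```

The first four steps stay in bank group 0 and visit banks 0 to 3; the fifth step moves to bank group 1, bank 0, while the row stays at 0 throughout.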
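In BG-first mode (i_interleave_mode = 2'b01) the bank group bits are promoted to bit positions [7:5], the bank address moves up to [9:8], and the pseudo-channel field moves down to [13:10]. Row and stack remain at the top. Because the module keeps COL at [5:1] in every mode, BG[0] shares bit 5 with COL[4] and bit 14 is left undecoded in this mode, so the bank group sits directly above the 32-byte burst offset rather than above the full column field.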
With bank group in bits [7:5], a sequential address sweep that increments by 32 bytes (one HBM3 burst) will step through BG0 → BG1 → … → BG7 before hitting the same bank group again. Because each bank group has independent sense amplifier arrays, all 8 activations can be pipelined: by the time you return to BG0, tRRD_L has fully expired and the row buffer is ready. This is why BG-first is the preferred mode for DMA engines and streaming neural network layers.
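The bank group rotation is easy to confirm in software. The sketch below assumes the BG-first positions given in the text ([13:10], [9:8], [7:5]) and a 32-byte burst; `bg_first_fields` is a hypothetical helper, not part of the RTL.

```python
BURST_BYTES = 32  # one burst, per the text above

def bg_first_fields(addr):
    """Return (pc, ba, bg) using the BG-first positions [13:10], [9:8], [7:5]."""
    pc = (addr >> 10) & 0xF
    ba = (addr >> 8) & 0x3
    bg = (addr >> 5) & 0x7
    return pc, ba, bg

# Ten consecutive bursts: bg counts 0..7, then wraps and the bank advances.
sweep = [bg_first_fields(i * BURST_BYTES) for i in range(10)]
for i, (pc, ba, bg) in enumerate(sweep):
    print(f"burst {i}: pc={pc} ba={ba} bg={bg}")
```

Bursts 0 to 7 land in eight different bank groups of bank 0; burst 8 returns to bank group 0 but in bank 1.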
Row-first mode (i_interleave_mode = 2'b10) puts the row bits immediately above the column field. This is counterintuitive at first: why would you want to open a new row on every burst? The answer is that for certain streaming workloads the access pattern already guarantees that each cache line comes from a unique row (e.g., scatter-gather DMA, transpose operations). In those cases, no row buffer hit is possible, and the priority shifts to minimising bank conflict overhead rather than maximising row-hit rate.
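By placing the row bits low, row-first maps consecutive 32-byte blocks to consecutive rows within the same bank group and bank; the bank group and bank change only when bits [24:20] change, that is, every 1 MiB. The bank machines can therefore pipeline activate commands back-to-back (spaced by tRRD) only when the access pattern spreads requests across those upper bits, as large-stride scatter-gather and transpose traffic does. Under that condition every new row request targets a different bank, and the activate pipeline stays full even when the row-hit rate is zero.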
The bit assignment for row-first: ROW[14:0] in bits [19:5], BG[2:0] in bits [22:20], BA[1:0] in bits [24:23], PC[3:0] in bits [28:25], STACK[1:0] in bits [33:32]. The module decodes the column from bits [5:1] in this mode as well, so ROW[0] shares bit 5 with COL[4].
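A short model shows how far apart two addresses must be before row-first mode sends them to different banks. It is a sketch that assumes the row-first bit positions listed above; `row_first_fields` is a hypothetical name.

```python
def row_first_fields(addr):
    """Return (pc, ba, bg, row) using the row-first bit positions."""
    row = (addr >> 5) & 0x7FFF   # ROW[14:0] <- addr[19:5]
    bg = (addr >> 20) & 0x7      # BG[2:0]   <- addr[22:20]
    ba = (addr >> 23) & 0x3      # BA[1:0]   <- addr[24:23]
    pc = (addr >> 25) & 0xF      # PC[3:0]   <- addr[28:25]
    return pc, ba, bg, row

# Neighbouring 32-byte blocks, then strides that reach the upper fields.
for addr in (0x00, 0x20, 0x40, 0x0C60, 1 << 20, 1 << 23, 1 << 25):
    pc, ba, bg, row = row_first_fields(addr)
    print(f"0x{addr:08X}: pc={pc} ba={ba} bg={bg} row={row}")
```

Addresses 0x00, 0x20, and 0x40 fall in rows 0, 1, and 2 of the same bank, while a 1 MiB stride (bit 20) is the smallest step that moves to the next bank group.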
The hbm3_addr_map.v module is almost entirely combinational. A single pipeline register synchronises the decoded fields to the clock domain so that downstream modules (the scheduler, the bank state machine, and the read/write datapaths) see clean, registered inputs. All ports use the mandatory i_ / o_ naming convention.
```verilog
// =============================================================
// hbm3_addr_map.v - HBM3 Address Mapper
// Module 8 · EcrioniX HBM3 Controller Build
// Phase 2 · Address Translation & Interleave
// =============================================================
// Inputs
//   i_clk              - system clock (positive-edge)
//   i_rst_n            - active-low synchronous reset
//   i_addr_valid       - pulse high when i_byte_addr is valid
//   i_byte_addr[33:0]  - AXI4 byte address (intra-stack)
//   i_interleave_mode  - 00=default, 01=BG-first, 10=row-first
// Outputs
//   o_stack_id[1:0]    - target stack
//   o_pc_id[3:0]       - pseudo-channel within stack
//   o_bg[2:0]          - bank group within pseudo-channel
//   o_ba[1:0]          - bank within bank group
//   o_row[14:0]        - row address
//   o_col[4:0]         - column address
//   o_addr_valid       - registered valid flag
// =============================================================
module hbm3_addr_map (
    input  wire        i_clk,
    input  wire        i_rst_n,
    input  wire        i_addr_valid,
    input  wire [33:0] i_byte_addr,
    input  wire [1:0]  i_interleave_mode,
    output reg  [1:0]  o_stack_id,
    output reg  [3:0]  o_pc_id,
    output reg  [2:0]  o_bg,
    output reg  [1:0]  o_ba,
    output reg  [14:0] o_row,
    output reg  [4:0]  o_col,
    output reg         o_addr_valid
);

    // -----------------------------------------------------------
    // Internal wires - combinational decode results
    // -----------------------------------------------------------
    wire [1:0]  wire_stack;
    wire [3:0]  wire_pc;
    wire [2:0]  wire_bg;
    wire [1:0]  wire_ba;
    wire [14:0] wire_row;
    wire [4:0]  wire_col;

    // -----------------------------------------------------------
    // STACK is always bits [33:32] - invariant across modes
    // COL   is always bits [5:1]   - invariant across modes
    //   (bit 0 is byte-select within a 2-byte granularity)
    // -----------------------------------------------------------
    assign wire_stack = i_byte_addr[33:32];
    assign wire_col   = i_byte_addr[5:1];

    // -----------------------------------------------------------
    // Interleave-mode-dependent field extraction
    //   MODE 00 - default
    //     ROW [29:15], PC [14:11], BG [10:8], BA [7:6]
    //   MODE 01 - BG-first
    //     ROW [29:15], PC [13:10], BA [9:8], BG [7:5]
    //   MODE 10 - row-first
    //     PC [28:25], BA [24:23], BG [22:20], ROW [19:5]
    // -----------------------------------------------------------
    assign wire_row = (i_interleave_mode == 2'b10) ? i_byte_addr[19:5]
                                                   : i_byte_addr[29:15];

    assign wire_pc  = (i_interleave_mode == 2'b01) ? i_byte_addr[13:10] :
                      (i_interleave_mode == 2'b10) ? i_byte_addr[28:25] :
                                                     i_byte_addr[14:11];

    assign wire_bg  = (i_interleave_mode == 2'b01) ? i_byte_addr[7:5]   :
                      (i_interleave_mode == 2'b10) ? i_byte_addr[22:20] :
                                                     i_byte_addr[10:8];

    assign wire_ba  = (i_interleave_mode == 2'b01) ? i_byte_addr[9:8]   :
                      (i_interleave_mode == 2'b10) ? i_byte_addr[24:23] :
                                                     i_byte_addr[7:6];

    // -----------------------------------------------------------
    // Output register stage
    // Captures the combinational result on the rising clock edge.
    // o_addr_valid follows i_addr_valid by one cycle.
    // -----------------------------------------------------------
    always @(posedge i_clk) begin
        if (!i_rst_n) begin
            o_stack_id   <= 2'b0;
            o_pc_id      <= 4'b0;
            o_bg         <= 3'b0;
            o_ba         <= 2'b0;
            o_row        <= 15'b0;
            o_col        <= 5'b0;
            o_addr_valid <= 1'b0;
        end else begin
            o_addr_valid <= i_addr_valid;
            if (i_addr_valid) begin
                o_stack_id <= wire_stack;
                o_pc_id    <= wire_pc;
                o_bg       <= wire_bg;
                o_ba       <= wire_ba;
                o_row      <= wire_row;
                o_col      <= wire_col;
            end
        end
    end

endmodule
```
i_interleave_mode is expected to be static during a burst (driven by a configuration register). Changing mode mid-burst produces undefined field values for the transition cycle but, provided the mode bits are synchronous to i_clk, does not create metastability, since the mode bits only feed combinational muxes upstream of the output register.

If only the default mode is needed, tie i_interleave_mode to 2'b00 and the synthesis tool will constant-propagate the mode muxes away.

The testbench exercises four decode vectors across all three interleave modes and includes a mode-switch check to confirm that changing i_interleave_mode mid-stream produces the expected output on the very next valid cycle.
```systemverilog
// =============================================================
// tb_hbm3_addr_map.sv - Testbench for hbm3_addr_map
// Tests: default mode, BG-first, row-first, mode switch
// =============================================================
`timescale 1ns/1ps

module tb_hbm3_addr_map;

  // -----------------------------------------------------------
  // DUT connections
  // -----------------------------------------------------------
  logic        i_clk, i_rst_n;
  logic        i_addr_valid;
  logic [33:0] i_byte_addr;
  logic [1:0]  i_interleave_mode;
  logic [1:0]  o_stack_id;
  logic [3:0]  o_pc_id;
  logic [2:0]  o_bg;
  logic [1:0]  o_ba;
  logic [14:0] o_row;
  logic [4:0]  o_col;
  logic        o_addr_valid;

  hbm3_addr_map dut (
    .i_clk             (i_clk),
    .i_rst_n           (i_rst_n),
    .i_addr_valid      (i_addr_valid),
    .i_byte_addr       (i_byte_addr),
    .i_interleave_mode (i_interleave_mode),
    .o_stack_id        (o_stack_id),
    .o_pc_id           (o_pc_id),
    .o_bg              (o_bg),
    .o_ba              (o_ba),
    .o_row             (o_row),
    .o_col             (o_col),
    .o_addr_valid      (o_addr_valid)
  );

  // -----------------------------------------------------------
  // Clock: 1 GHz (1 ns period)
  // -----------------------------------------------------------
  initial i_clk = 1'b0;
  always #0.5 i_clk = ~i_clk;

  // -----------------------------------------------------------
  // Task: apply an address and check the registered outputs.
  // The DUT has a single register stage, so the fields and
  // o_addr_valid are valid just after the capturing clock edge.
  // Checking one edge later would see o_addr_valid back at 0.
  // -----------------------------------------------------------
  task automatic apply_and_check(
    input [33:0] addr,
    input [1:0]  mode,
    input [1:0]  exp_stack,
    input [3:0]  exp_pc,
    input [2:0]  exp_bg,
    input [1:0]  exp_ba,
    input [14:0] exp_row,
    input [4:0]  exp_col,
    input string test_name
  );
    @(negedge i_clk);
    i_interleave_mode = mode;
    i_byte_addr       = addr;
    i_addr_valid      = 1'b1;
    @(posedge i_clk);   // DUT captures the decoded fields
    #0.1;               // let the registered outputs settle
    if (o_addr_valid !== 1'b1)
      $display("FAIL [%s] o_addr_valid=0", test_name);
    else if (o_stack_id !== exp_stack || o_pc_id !== exp_pc ||
             o_bg !== exp_bg || o_ba !== exp_ba ||
             o_row !== exp_row || o_col !== exp_col)
      $display("FAIL [%s] got stack=%0d pc=%0d bg=%0d ba=%0d row=%0d col=%0d",
               test_name, o_stack_id, o_pc_id, o_bg, o_ba, o_row, o_col);
    else
      $display("PASS [%s]", test_name);
    i_addr_valid = 1'b0;
  endtask

  // -----------------------------------------------------------
  // Stimulus
  // -----------------------------------------------------------
  initial begin
    // Reset
    i_rst_n           = 1'b0;
    i_addr_valid      = 1'b0;
    i_byte_addr       = 34'b0;
    i_interleave_mode = 2'b00;
    repeat (4) @(posedge i_clk);
    i_rst_n = 1'b1;
    @(posedge i_clk);

    $display("=== HBM3 Address Mapper Tests ===");

    // Warm-up: the all-zero address must decode to all-zero fields.
    apply_and_check(34'h0, 2'b00,
                    2'd0, 4'd0, 3'd0, 2'd0, 15'd0, 5'd0,
                    "warmup-zero");

    // TEST 1 - default mode: addr = 0x0000_96A0 (bits 15, 12, 10, 9, 7, 5 set)
    //   [29:15]=1  [14:11]=4'b0010  [10:8]=3'b110  [7:6]=2'b10  [5:1]=5'b10000
    //   => stack=0 row=1 pc=2 bg=6 ba=2 col=16
    apply_and_check(34'h0000_96A0, 2'b00,
                    2'd0,    // stack
                    4'd2,    // pc  [14:11]
                    3'd6,    // bg  [10:8]
                    2'd2,    // ba  [7:6]
                    15'd1,   // row [29:15]
                    5'd16,   // col [5:1]
                    "default-mode");

    // TEST 2 - BG-first mode: same bits, different interpretation
    //   [13:10]=4'b0101  [9:8]=2'b10  [7:5]=3'b101  [5:1]=5'b10000
    //   => pc=5 ba=2 bg=5 col=16, row unchanged at 1
    apply_and_check(34'h0000_96A0, 2'b01,
                    2'd0,    // stack
                    4'd5,    // pc  [13:10]
                    3'd5,    // bg  [7:5]
                    2'd2,    // ba  [9:8]
                    15'd1,   // row [29:15]
                    5'd16,   // col [5:1]
                    "bg-first-mode");

    // TEST 3 - Row-first mode: addr = 0x0000_0C60 (bits 11, 10, 6, 5 set)
    //   [19:5] = 0x0C60 >> 5 = 0x63 = 99
    //   [22:20]=0 bg=0, [24:23]=0 ba=0, [28:25]=0 pc=0, stack=0
    //   [5:1] = 5'b10000 = 16
    apply_and_check(34'h0000_0C60, 2'b10,
                    2'd0,    // stack
                    4'd0,    // pc  [28:25]
                    3'd0,    // bg  [22:20]
                    2'd0,    // ba  [24:23]
                    15'd99,  // row [19:5]
                    5'd16,   // col [5:1]
                    "row-first-mode");

    // TEST 4 - Mode switch: flip from default to BG-first mid-stream.
    // Verify that the output matches the NEW mode on the very next valid.
    // In default mode this address gives bg=6, so bg=5 proves the switch.
    @(negedge i_clk);
    i_interleave_mode = 2'b00;
    i_byte_addr       = 34'h0000_96A0;
    i_addr_valid      = 1'b1;
    @(posedge i_clk);
    #0.1;
    i_addr_valid      = 1'b0;
    // Change mode BEFORE the next valid pulse
    i_interleave_mode = 2'b01;
    @(negedge i_clk);
    i_byte_addr       = 34'h0000_96A0;
    i_addr_valid      = 1'b1;
    @(posedge i_clk);
    #0.1;
    i_addr_valid      = 1'b0;
    if (o_bg === 3'd5 && o_ba === 2'd2)
      $display("PASS [mode-switch]");
    else
      $display("FAIL [mode-switch] bg=%0d ba=%0d (exp bg=5 ba=2)", o_bg, o_ba);

    $display("=== All tests complete ===");
    $finish;
  end

endmodule
```
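Expected values for `apply_and_check` are easy to get wrong by hand, so it can help to derive them from a small golden model. The sketch below mirrors the bit slices in the module's assign statements; `golden_decode` and `LAYOUTS` are hypothetical names, and the model is an independent reference rather than part of the testbench.

```python
# Bit slices (hi, lo) per mode, copied from the module's assign statements.
LAYOUTS = {
    0b00: {"row": (29, 15), "pc": (14, 11), "bg": (10, 8), "ba": (7, 6)},
    0b01: {"row": (29, 15), "pc": (13, 10), "ba": (9, 8), "bg": (7, 5)},
    0b10: {"pc": (28, 25), "ba": (24, 23), "bg": (22, 20), "row": (19, 5)},
}

def golden_decode(addr, mode):
    """Reference decode for hbm3_addr_map; returns a dict of field values."""
    def bits(hi, lo):
        return (addr >> lo) & ((1 << (hi - lo + 1)) - 1)

    # Mode 2'b11 falls through to the default layout, as the ternary chains do.
    layout = LAYOUTS.get(mode, LAYOUTS[0b00])
    fields = {name: bits(hi, lo) for name, (hi, lo) in layout.items()}
    fields["stack"] = bits(33, 32)   # invariant across modes
    fields["col"] = bits(5, 1)       # invariant across modes
    return fields

# Decode one address under every mode to see how the interpretation shifts.
for mode in (0b00, 0b01, 0b10):
    print(f"mode {mode:02b}:", golden_decode(0x0C60, mode))
```

Feeding each test address through the model before writing the expected arguments catches off-by-one-bit mistakes in the hand-worked comments.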
The table below summarises the bit positions for all three interleave modes side by side. Use this as a quick reference when integrating the mapper with your AXI4 adapter or debugging address translation in simulation.
| Field | Default (mode 00) | BG-First (mode 01) | Row-First (mode 10) | Width |
|---|---|---|---|---|
| STACK[1:0] | [33:32] | [33:32] | [33:32] | 2 b |
| ROW[14:0] | [29:15] | [29:15] | [19:5] | 15 b |
| PC[3:0] | [14:11] | [13:10] | [28:25] | 4 b |
| BG[2:0] | [10:8] | [7:5] | [22:20] | 3 b |
| BA[1:0] | [7:6] | [9:8] | [24:23] | 2 b |
| COL[4:0] | [5:1] | [5:1] | [5:1] | 5 b |
| Byte offset | [0] | [0] | [0] | 1 b |
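A mapping is only one-to-one if every address bit feeds exactly one field, and that property can be checked mechanically. The sketch below is a hypothetical helper, not part of the RTL; it uses COL at [5:1] in every mode, following the module's unconditional assign, and reports bits claimed by more than one field or by none.

```python
# (hi, lo) slices per mode: STACK, then the mode-dependent fields, then COL.
LAYOUTS = {
    "default":   [(33, 32), (29, 15), (14, 11), (10, 8), (7, 6), (5, 1)],
    "bg_first":  [(33, 32), (29, 15), (13, 10), (9, 8), (7, 5), (5, 1)],
    "row_first": [(33, 32), (28, 25), (24, 23), (22, 20), (19, 5), (5, 1)],
}

def bit_usage(slices, width=34):
    """Return (shared, unused): bits claimed by >1 field, and by no field."""
    count = [0] * width
    for hi, lo in slices:
        for b in range(lo, hi + 1):
            count[b] += 1
    shared = [b for b, n in enumerate(count) if n > 1]
    unused = [b for b, n in enumerate(count) if n == 0]
    return shared, unused

for name, slices in LAYOUTS.items():
    shared, unused = bit_usage(slices)
    print(f"{name:9s} shared={shared} unused={unused}")
```

Bit 0 (the byte offset) and bits [31:30] are undecoded in every mode. Any other entry in either list marks a place where two addresses can alias to the same location, which is worth resolving before integration.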
| Parameter | Value | Bits required |
|---|---|---|
| Stacks per system | 4 | 2 |
| Channels per stack | 16 | 4 |
| Pseudo-channels per channel | 2 (not separately addressed here — folded into PC[3:0]) | — |
| Bank groups per PC | 8 | 3 |
| Banks per BG | 4 | 2 |
| Rows per bank | 32 768 | 15 |
| Columns per row (burst granularity) | 32 | 5 |
| Total addressable bytes (1 stack) | 16 × 8 × 4 × 32 768 × 32 × 2 B = 2^30 B = 1 GiB | 30 |
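The bit budget can be cross-checked by summing the address bits each geometry parameter needs. This is a sketch under one stated assumption: the pseudo-channel count is taken as 16 because PC[3:0] is 4 bits wide, and the names are hypothetical.

```python
# Geometry per stack. "pc_targets" is 16 because PC[3:0] is 4 bits wide
# (an assumption read off the field width, not a datasheet figure).
GEOMETRY = {
    "pc_targets": 16,
    "bank_groups": 8,
    "banks_per_group": 4,
    "rows": 32768,
    "columns": 32,
    "bytes_per_column": 2,
}

def bits_for(count):
    """Address bits needed to select among `count` items (power of two)."""
    assert count > 0 and count & (count - 1) == 0, "expects a power of two"
    return count.bit_length() - 1

field_bits = {name: bits_for(n) for name, n in GEOMETRY.items()}
total_bits = sum(field_bits.values())
total_bytes = 1 << total_bits

print(field_bits)
print(total_bits, "bits ->", total_bytes // 2**20, "MiB per stack")
```

Under these assumptions the decoded fields span 30 address bits, which corresponds to bits [29:0] of the default layout.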
The AXI4 interface produces a 64-bit ARADDR / AWADDR, of which the address mapper consumes only bits [33:0]. In a single-stack system, bits [63:34] must all be zero (or masked) before the address is presented to the mapper. As listed, the mapper takes its stack select from bits [33:32]. In a multi-stack system that instead places the stack select in bits [35:34], bits [63:36] are system fabric routing and must likewise be masked before the address reaches the mapper.
Wire i_byte_addr[33:0] directly to ARADDR[33:0] from the AXI4 read address channel. Assert i_addr_valid when ARVALID && ARREADY — i.e., on the accepted beat of the AXI handshake.
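The masking rule can be modelled in a few lines when building a software reference for the AXI adapter. This is a minimal sketch: `to_mapper_addr` and its `strict` flag are hypothetical, and the strict path stands in for whatever error response the real adapter would give.

```python
MAPPER_ADDR_BITS = 34
MAPPER_MASK = (1 << MAPPER_ADDR_BITS) - 1

def to_mapper_addr(axi_addr, strict=True):
    """Reduce a 64-bit AXI address to the 34 bits the mapper consumes.

    With strict=True, non-zero upper bits raise instead of being masked,
    which mirrors the single-stack rule that bits [63:34] must be zero.
    """
    if not 0 <= axi_addr < (1 << 64):
        raise ValueError("AXI address must fit in 64 bits")
    if strict and (axi_addr >> MAPPER_ADDR_BITS) != 0:
        raise ValueError(f"upper address bits set: 0x{axi_addr:016X}")
    return axi_addr & MAPPER_MASK

print(hex(to_mapper_addr(0x0000_96A0)))
print(hex(to_mapper_addr((0xAB << 34) | 0x0C60, strict=False)))
```

Strict mode is useful in simulation, where a stray upper bit usually points to a misconfigured address decoder rather than a legitimate access.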
The scheduler receives o_stack_id, o_pc_id, o_bg, o_ba, o_row, and o_col as its address inputs. Since the mapper adds one pipeline register stage, the scheduler must also accept the one-cycle address latency. The standard approach is to store the AXI transaction ID and burst length in a small FIFO that is also one cycle deep, so the scheduler sees the address and metadata at the same time.
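The alignment requirement can be illustrated with a toy cycle model. The sketch below assumes a one-entry register for the metadata and passes the raw address through in place of the decoded fields; `RegisterStage` and the request dictionaries are hypothetical.

```python
class RegisterStage:
    """One register stage: a value written on tick N is visible on tick N+1."""

    def __init__(self):
        self._q = None

    def tick(self, d):
        q, self._q = self._q, d
        return q

addr_stage = RegisterStage()   # stands in for the mapper's output register
meta_stage = RegisterStage()   # hypothetical one-deep store for ID and length

requests = [
    {"addr": 0x0000_0C60, "id": 3, "len": 4},
    {"addr": 0x0000_96A0, "id": 7, "len": 1},
]

seen = []
for req in requests + [None]:              # one extra tick flushes the stage
    addr_in = req["addr"] if req else None
    meta_in = (req["id"], req["len"]) if req else None
    addr_out = addr_stage.tick(addr_in)
    meta_out = meta_stage.tick(meta_in)
    if addr_out is not None:
        seen.append((addr_out, meta_out))

print(seen)
```

Because both stages advance on the same tick, each address emerges together with its own ID and burst length, which is the property the scheduler depends on.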
- o_col equals i_byte_addr[5:1] (invariant across modes).
- o_stack_id equals i_byte_addr[33:32] (invariant).
- o_addr_valid is exactly one cycle behind i_addr_valid.

A single HBM3 stack tops out at around 24 GB (in the largest 12-Hi configurations built from 16 Gb dies). 34 bits address 16 GiB at byte granularity, which covers a single stack of up to 16 GB or two 8 GB stacks; a 24 GB stack would need a 35th bit. The upper AXI address bits select the stack or perform system-level routing; the lower 34 bits index within the stack. The address mapper strips the fabric offset and routes only the intra-stack portion.
A pseudo-channel (PC) is a 32-bit sub-channel within HBM3's 64-bit physical channel. Each HBM3 channel contains 2 pseudo-channels, and each HBM3 device has 16 channels, giving 32 pseudo-channels per stack. Pseudo-channels operate independently, with their own bank machines, row buffers, and command queues, so distributing addresses across PC bits is critical for sustaining peak bandwidth. If all traffic hammers one PC, you are wasting 31 out of 32 independent command paths.
BG-first interleaving places the Bank Group bits in the lowest address positions (just above the column bits). For sequential read or write bursts, consecutive cache lines hit different bank groups, each with their own sense amplifiers and data paths. This hides the tRRD_L penalty because you are always crossing to a new bank group. Use BG-first for streaming workloads such as DMA transfers, neural network weight loads, or framebuffer scans where successive accesses are adjacent in memory.
The field extraction logic is purely combinational — it is just wiring (bit-slice assigns). The module registers the extracted fields on the rising clock edge so that downstream pipeline stages see stable, glitch-free inputs and timing analysis has a clean launch/capture boundary. If you need zero-latency decode you can bypass the register stage and consume the wire_* signals directly, though you will need to ensure setup/hold margins are met at the receiving flop.
As listed, the module already decodes a 2-bit o_stack_id from bits [33:32], which leaves bits [31:0] for the per-stack address. To give each of 4 stacks the full 34-bit intra-stack space you need 2 more address bits: extend i_byte_addr to [35:0], assign o_stack_id from bits [35:34], and leave the lower 34-bit field mapping unchanged. The interleave logic is unaffected because it only shuffles bits within the intra-stack address space. Update the AXI4 interface width accordingly and ensure your system address decoder generates the correct 36-bit per-stack base addresses.
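The per-stack base addresses for the widened 36-bit space follow directly from the stack select position. The sketch below assumes the stack select in bits [35:34] as described above; `stack_base` and `split_36bit` are hypothetical helpers for a system address decoder model.

```python
INTRA_STACK_BITS = 34
NUM_STACKS = 4

def stack_base(stack_id):
    """Base address of a stack in the widened 36-bit space."""
    if not 0 <= stack_id < NUM_STACKS:
        raise ValueError("stack_id out of range")
    return stack_id << INTRA_STACK_BITS

def split_36bit(addr):
    """Return (stack_id, intra_stack_addr) for a 36-bit byte address."""
    stack_id = (addr >> INTRA_STACK_BITS) & (NUM_STACKS - 1)   # bits [35:34]
    intra = addr & ((1 << INTRA_STACK_BITS) - 1)               # bits [33:0]
    return stack_id, intra

for s in range(NUM_STACKS):
    print(f"stack {s}: base 0x{stack_base(s):09X}")
```

Each base is a multiple of 2^34 bytes, so the decoder only needs to inspect the top two bits to route a transaction.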