◆ Phase 2 · Module 8

HBM3 Address Mapper

Translate a 34-bit AXI4 byte address into six HBM3 fields — stack, pseudo-channel, bank group, bank, row, and column — with three configurable interleave modes for maximum bank-level parallelism.

34-bit byte addr · 6 output fields · 3 interleave modes · combinational decode · SV testbench · Synthesizable

1. HBM3 Physical Organisation

Every HBM3 stack is a die-stacked DRAM assembly. Understanding its three-level hierarchy is the prerequisite for understanding why address mapping is non-trivial.

[Figure: HBM3 stack hierarchy. One stack holds 16 channels (Channel 0 to Channel N), each split into pseudo-channels PC0 and PC1; each pseudo-channel holds bank groups BG0 to BG7, each with banks BA0 to BA3. 16 ch × 2 PC × 8 BG × 4 BA = 1024 banks per stack. Row address: 15 bits (32 768 rows per bank). Column address: 5 bits, 32 columns × 32 B burst = 1 KB per row.]

Stacks. An HBM3 subsystem contains one or more stacks; this mapper supports up to 4. Each stack is a set of DRAM dies mounted on a base logic die that concentrates the PHY, ECC, and command decode. In a multi-stack system the stack select bits are the most-significant part of the address.

Channels and pseudo-channels. Each stack exposes 16 independent 64-bit channels. HBM3 splits each channel into a pair of 32-bit pseudo-channels (PC0 and PC1); each PC has its own command path, row buffer, and bank state machine. A 16-channel stack therefore has 32 pseudo-channels.

Bank groups and banks. Within each pseudo-channel there are 8 bank groups (BG[2:0]) and 4 banks per group (BA[1:0]), yielding 32 banks per pseudo-channel. Each bank is an independent DRAM array with its own row of sense amplifiers and can be activated independently once a prior access has completed.
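The bank counts follow directly from the hierarchy described above. A minimal Python sketch of that arithmetic, using only the figures quoted in this section (the constant and function names are illustrative and not part of the controller):

```python
# Hierarchy figures quoted in the text above (names are illustrative).
CHANNELS_PER_STACK = 16
PCS_PER_CHANNEL = 2
BANK_GROUPS_PER_PC = 8
BANKS_PER_GROUP = 4


def banks_per_pc() -> int:
    """Banks inside one pseudo-channel."""
    return BANK_GROUPS_PER_PC * BANKS_PER_GROUP


def pcs_per_stack() -> int:
    """Pseudo-channels inside one stack."""
    return CHANNELS_PER_STACK * PCS_PER_CHANNEL


def banks_per_stack() -> int:
    """Independently activatable banks inside one stack."""
    return pcs_per_stack() * banks_per_pc()


if __name__ == "__main__":
    print(banks_per_pc(), pcs_per_stack(), banks_per_stack())
```

The 1024-bank total is the amount of parallelism the address mapping has to expose to the scheduler.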

2. Why Address Mapping Matters

Address mapping is not merely a translation convenience — it is one of the primary tools a memory controller designer has for maximising sustained bandwidth. The DRAM timing model creates three categories of access cost:

Scenario | Row buffer state | Cost
Row hit | Same row already open | ~CL (lowest)
Row miss (empty bank) | Bank precharged | tRCD + CL
Row conflict | Different row open | tRP + tRCD + CL (highest)

A naïve linear mapping places consecutive pages into the same bank. A workload that walks a stride equal to one row size will hit the same bank on every access, creating constant row conflicts. A well-designed mapping scatters consecutive cache lines across different bank groups so that row activations in one group overlap the column access latency in another.
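To make the three cost categories concrete, here is a rough Python sketch of a single bank under an open-page policy. The timing values (14 cycles each for CL, tRCD, and tRP) are placeholders chosen for illustration, not HBM3 datasheet numbers, and the model serialises every access, so it ignores any overlap between banks.

```python
from typing import Iterable, Optional


def access_cost(open_row: Optional[int], row: int,
                t_cl: int, t_rcd: int, t_rp: int) -> int:
    """Latency in clock cycles of one access to a single bank."""
    if open_row == row:
        return t_cl                  # row hit
    if open_row is None:
        return t_rcd + t_cl          # row miss, bank already precharged
    return t_rp + t_rcd + t_cl       # row conflict


def walk_cost(rows: Iterable[int], t_cl: int = 14,
              t_rcd: int = 14, t_rp: int = 14) -> int:
    """Total latency of back-to-back accesses to one bank (open-page policy)."""
    open_row: Optional[int] = None   # bank starts precharged
    total = 0
    for row in rows:
        total += access_cost(open_row, row, t_cl, t_rcd, t_rp)
        open_row = row               # the accessed row stays open
    return total


if __name__ == "__main__":
    print(walk_cost([7, 7, 7, 7]))   # one miss, then three hits
    print(walk_cost([0, 1, 2, 3]))   # one miss, then three conflicts
```

With these placeholder timings, the row-stride walk costs more than twice as much as the walk that stays in one row. A good mapping closes that gap by spreading the accesses across banks.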

Key insight: Every additional bank group that can be kept busy lets the controller overlap one more row activation with column accesses elsewhere, so more of each CAS latency is hidden behind useful data transfer. With 8 bank groups per pseudo-channel and 32 pseudo-channels, HBM3 can sustain many concurrent transactions, but only if the address bits are allocated in the right order.

Beyond bank groups, the choice of which bits select the pseudo-channel determines whether sequential traffic fans out across PCs or hammers a single one. The address mapper in this module implements three interleave policies that cover the most common workload classes.

3. Default Address Map — Bit-Field Breakdown

In the default mode (i_interleave_mode = 2'b00) the 34-bit byte address is partitioned as follows. Bit 0 is the LSB of the byte address (always 0 for 2-byte aligned burst beats, but we accept any address and decode the column from bits [5:1]).

[Figure: default bit-field layout, bit 33 down to bit 0: STACK [33:30] | ROW[14:0] [29:15] | PC [14:11] | BG[2:0] [10:8] | BA [7:6] | COL[4:0] [5:1] | OFF [0]]

Field | Bits | Width | Meaning
STACK[1:0] | [33:30] | 4 b ([33:32] decoded, [31:30] reserved) | Selects one of up to 4 stacks in a multi-stack HBM3 subsystem
ROW[14:0] | [29:15] | 15 b | Row address within the selected bank (32 768 rows per bank)
PC[3:0] | [14:11] | 4 b | Pseudo-channel select within a stack (0–15)
BG[2:0] | [10:8] | 3 b | Bank group within the pseudo-channel (0–7)
BA[1:0] | [7:6] | 2 b | Bank within the bank group (0–3)
COL[4:0] | [5:1] | 5 b | Column address (bit 0 is the byte offset, dropped for 2-byte granularity)

With this default layout a sequential burst increments the column first (bits [5:1]), then wraps into the next bank, then bank group, then pseudo-channel, and finally opens the next row only when the entire set of banks in all PCs has been visited. This maximises row-hit rate for workloads with spatial locality.
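The default layout can be checked with a small software model. The sketch below is a Python stand-in for the bit slices listed in the table, not the RTL itself; the function name `decode_default` is made up for this example, and the stack is taken from bits [33:32] as in the Verilog module.

```python
def decode_default(addr: int) -> dict:
    """Split a 34-bit byte address with the default (mode 00) layout."""
    def bits(hi: int, lo: int) -> int:
        # Extract addr[hi:lo] as an unsigned integer.
        return (addr >> lo) & ((1 << (hi - lo + 1)) - 1)

    return {
        "stack": bits(33, 32),
        "row":   bits(29, 15),
        "pc":    bits(14, 11),
        "bg":    bits(10, 8),
        "ba":    bits(7, 6),
        "col":   bits(5, 1),   # bit 0 is the byte offset and is dropped
    }


if __name__ == "__main__":
    # Smallest address at which each field first becomes 1.
    for addr in (0x2, 0x40, 0x100, 0x800, 0x8000):
        print(hex(addr), decode_default(addr))
```

The loop shows the wrap order described above: column first, then bank at 0x40, bank group at 0x100, pseudo-channel at 0x800, and a new row only at 0x8000.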

4. BG-First Interleave Mode (i_interleave_mode = 2'b01)

In BG-first mode the bank group bits are promoted to bit positions [7:5], immediately above the column field, which becomes byte-granular at [4:0]. The bank address moves up to [9:8] and the pseudo-channel field shifts down to [13:10]. Row and stack remain at the top, which leaves bit 14 undecoded in this mode.

[Figure: BG-first bit-field layout, bit 33 down to bit 0: STACK [33:30] | ROW[14:0] [29:15] | PC [13:10] | BA [9:8] | BG[2:0] [7:5] | COL[4:0] [4:0] (byte)]

With bank group in bits [7:5], a sequential address sweep that increments by 32 bytes (one HBM3 burst) will step through BG0 → BG1 → … → BG7 before hitting the same bank group again. Because each bank group has independent sense amplifier arrays, all 8 activations can be pipelined: by the time you return to BG0, tRRD_L has fully expired and the row buffer is ready. This is why BG-first is the preferred mode for DMA engines and streaming neural network layers.

tRRD_L vs tRRD_S: Same-bank-group consecutive activates must be separated by tRRD_L (≈ 5 ns at 8 Gbps). Different bank groups require only tRRD_S (≈ 2.5 ns). BG-first interleaving means every consecutive activation crosses a bank group boundary, so you always pay tRRD_S, halving the activate-to-activate dead time.
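The bank-group rotation is easy to confirm in software. This Python sketch assumes the BG-first layout exactly as described in this section (COL at [4:0], BG at [7:5], BA at [9:8], PC at [13:10], ROW at [29:15]); the helper names are illustrative.

```python
BURST_BYTES = 32  # one burst, as stated in the text


def decode_bg_first(addr: int) -> dict:
    """Split a byte address with the BG-first (mode 01) layout."""
    def bits(hi: int, lo: int) -> int:
        return (addr >> lo) & ((1 << (hi - lo + 1)) - 1)

    return {
        "stack": bits(33, 32),
        "row":   bits(29, 15),
        "pc":    bits(13, 10),
        "ba":    bits(9, 8),
        "bg":    bits(7, 5),
        "col":   bits(4, 0),   # byte-granular column
    }


def bg_sequence(start: int, bursts: int) -> list:
    """Bank group hit by each of `bursts` consecutive 32-byte bursts."""
    return [decode_bg_first(start + i * BURST_BYTES)["bg"]
            for i in range(bursts)]


if __name__ == "__main__":
    print(bg_sequence(0, 10))   # [0, 1, 2, 3, 4, 5, 6, 7, 0, 1]
```

Every burst lands in a different bank group from its predecessor, so consecutive activates are spaced by tRRD_S. The bank index only advances after all eight groups have been visited (at byte offset 0x100).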

5. Row-First Mode (i_interleave_mode = 2'b10)

Row-first mode puts the row bits immediately above the column field. This is counterintuitive at first — why would you want to open a new row on every burst? The answer is that for certain streaming workloads the access pattern already guarantees that each cache line comes from a unique row (e.g., scatter-gather DMA, transpose operations). In those cases, no row buffer hit is possible, and the priority shifts to minimising bank conflict overhead rather than maximising row-hit rate.

By placing the row bits low, row-first maps consecutive 32-byte blocks to consecutive rows of the same bank, bank group, and pseudo-channel; the bank group, bank, and PC bits ([28:20]) change only once per 1 MiB of address space. The bank machines can therefore pipeline activate commands back-to-back only when the workload's own stride varies those upper bits, so that successive row requests target different banks. When it does, the activate pipeline stays full even when the row-hit rate is zero.

The bit assignment for row-first: COL[4:0] in bits [4:0] (byte granularity), ROW[14:0] in bits [19:5], BG[2:0] in bits [22:20], BA[1:0] in bits [24:23], PC[3:0] in bits [28:25], STACK[1:0] in bits [33:32]. Bits [31:29] are not decoded in this mode.
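A quick way to see what this assignment does to a linear sweep is to decode a few addresses in software. The Python sketch below is table-driven; it assumes the stack select is taken from bits [33:32], as the Verilog module does, and the names `ROW_FIRST`, `decode_row_first`, and `bank_of` are made up for this example.

```python
ROW_FIRST = {            # field: (msb, lsb)
    "stack": (33, 32),
    "pc":    (28, 25),
    "ba":    (24, 23),
    "bg":    (22, 20),
    "row":   (19, 5),
    "col":   (4, 0),
}


def decode_row_first(addr: int) -> dict:
    """Split a byte address with the row-first (mode 10) layout."""
    return {name: (addr >> lo) & ((1 << (hi - lo + 1)) - 1)
            for name, (hi, lo) in ROW_FIRST.items()}


def bank_of(addr: int) -> tuple:
    """Identify the physical bank as (stack, pc, bg, ba)."""
    f = decode_row_first(addr)
    return (f["stack"], f["pc"], f["bg"], f["ba"])


if __name__ == "__main__":
    for addr in (0, 32, 64, 1 << 20):
        print(hex(addr), bank_of(addr), decode_row_first(addr)["row"])
```

Addresses 0, 32, and 64 hit rows 0, 1, and 2 of the same bank. The bank group does not change until the address reaches 1 MiB, so a plain linear sweep does not spread across banks in this mode.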

Warning: Row-first mode gives terrible performance for random-access workloads. Every randomly-addressed cache line opens a new row in a bank that may still have a different row resident. Use this mode only when profiling confirms the workload is truly streaming with no row reuse.

6. Verilog: hbm3_addr_map.v

The module is almost entirely combinational. A single pipeline register synchronises the decoded fields to the clock domain so that downstream modules (the scheduler, the bank state machine, and the read/write datapaths) see clean, registered inputs. All ports use the mandatory i_ / o_ naming convention.

Verilog
// =============================================================
// hbm3_addr_map.v — HBM3 Address Mapper
// Module 8 · EcrioniX HBM3 Controller Build
// Phase 2 · Address Translation & Interleave
// =============================================================
// Inputs
//   i_clk              — system clock (positive-edge)
//   i_rst_n            — active-low synchronous reset
//   i_addr_valid       — pulse high when i_byte_addr is valid
//   i_byte_addr[33:0]  — AXI4 byte address (intra-stack)
//   i_interleave_mode  — 00=default, 01=BG-first, 10=row-first
// Outputs
//   o_stack_id[1:0]    — target stack
//   o_pc_id[3:0]       — pseudo-channel within stack
//   o_bg[2:0]          — bank group within pseudo-channel
//   o_ba[1:0]          — bank within bank group
//   o_row[14:0]        — row address
//   o_col[4:0]         — column address
//   o_addr_valid       — registered valid flag
// =============================================================

module hbm3_addr_map (
    input  wire         i_clk,
    input  wire         i_rst_n,

    input  wire         i_addr_valid,
    input  wire [33:0]  i_byte_addr,
    input  wire [1:0]   i_interleave_mode,

    output reg  [1:0]   o_stack_id,
    output reg  [3:0]   o_pc_id,
    output reg  [2:0]   o_bg,
    output reg  [1:0]   o_ba,
    output reg  [14:0]  o_row,
    output reg  [4:0]   o_col,
    output reg          o_addr_valid
);

// -----------------------------------------------------------
// Internal wires — combinational decode results
// -----------------------------------------------------------
wire [1:0]  wire_stack;
wire [3:0]  wire_pc;
wire [2:0]  wire_bg;
wire [1:0]  wire_ba;
wire [14:0] wire_row;
wire [4:0]  wire_col;

// -----------------------------------------------------------
// STACK is always bits [33:32], invariant across modes
// COL depends on the mode:
//   mode 00 (and reserved 11) : bits [5:1], bit 0 is the byte select
//   modes 01 and 10           : bits [4:0], byte granularity, so COL
//                               does not share bit 5 with BG [7:5]
//                               or ROW [19:5]
// -----------------------------------------------------------
assign wire_stack = i_byte_addr[33:32];
assign wire_col   = (i_interleave_mode == 2'b01 ||
                     i_interleave_mode == 2'b10) ? i_byte_addr[4:0] :
                                                   i_byte_addr[5:1];

// -----------------------------------------------------------
// Interleave-mode-dependent field extraction
// MODE 00 — default
//   ROW [29:15], PC [14:11], BG [10:8], BA [7:6]
// MODE 01 — BG-first
//   ROW [29:15], PC [13:10], BA [9:8], BG [7:5]
// MODE 10 — row-first
//   PC [28:25], BA [24:23], BG [22:20], ROW [19:5]
// -----------------------------------------------------------
assign wire_row = (i_interleave_mode == 2'b10) ? i_byte_addr[19:5]  :
                                                   i_byte_addr[29:15];

assign wire_pc  = (i_interleave_mode == 2'b01) ? i_byte_addr[13:10] :
                  (i_interleave_mode == 2'b10) ? i_byte_addr[28:25] :
                                                   i_byte_addr[14:11];

assign wire_bg  = (i_interleave_mode == 2'b01) ? i_byte_addr[7:5]   :
                  (i_interleave_mode == 2'b10) ? i_byte_addr[22:20] :
                                                   i_byte_addr[10:8];

assign wire_ba  = (i_interleave_mode == 2'b01) ? i_byte_addr[9:8]   :
                  (i_interleave_mode == 2'b10) ? i_byte_addr[24:23] :
                                                   i_byte_addr[7:6];

// -----------------------------------------------------------
// Output register stage
// Captures the combinational result on the rising clock edge.
// o_addr_valid follows i_addr_valid by one cycle.
// -----------------------------------------------------------
always @(posedge i_clk) begin
    if (!i_rst_n) begin
        o_stack_id   <= 2'b0;
        o_pc_id      <= 4'b0;
        o_bg         <= 3'b0;
        o_ba         <= 2'b0;
        o_row        <= 15'b0;
        o_col        <= 5'b0;
        o_addr_valid <= 1'b0;
    end else begin
        o_addr_valid <= i_addr_valid;
        if (i_addr_valid) begin
            o_stack_id <= wire_stack;
            o_pc_id    <= wire_pc;
            o_bg       <= wire_bg;
            o_ba       <= wire_ba;
            o_row      <= wire_row;
            o_col      <= wire_col;
        end
    end
end

endmodule

Design notes

7. SystemVerilog Testbench

The testbench exercises four decode vectors across all three interleave modes and includes a mode-switch check to confirm that changing i_interleave_mode mid-stream produces the expected output on the very next valid cycle.

SystemVerilog
// =============================================================
// tb_hbm3_addr_map.sv — Testbench for hbm3_addr_map
// Tests: default mode, BG-first, row-first, mode switch
// =============================================================
`timescale 1ns/1ps

module tb_hbm3_addr_map;

// -----------------------------------------------------------
// DUT connections
// -----------------------------------------------------------
logic        i_clk, i_rst_n;
logic        i_addr_valid;
logic [33:0] i_byte_addr;
logic [1:0]  i_interleave_mode;

logic [1:0]  o_stack_id;
logic [3:0]  o_pc_id;
logic [2:0]  o_bg;
logic [1:0]  o_ba;
logic [14:0] o_row;
logic [4:0]  o_col;
logic        o_addr_valid;

hbm3_addr_map dut (
    .i_clk             (i_clk),
    .i_rst_n           (i_rst_n),
    .i_addr_valid      (i_addr_valid),
    .i_byte_addr       (i_byte_addr),
    .i_interleave_mode (i_interleave_mode),
    .o_stack_id        (o_stack_id),
    .o_pc_id           (o_pc_id),
    .o_bg              (o_bg),
    .o_ba              (o_ba),
    .o_row             (o_row),
    .o_col             (o_col),
    .o_addr_valid      (o_addr_valid)
);

// -----------------------------------------------------------
// Clock: 1 GHz (1 ns period)
// -----------------------------------------------------------
initial i_clk = 1'b0;
always #0.5 i_clk = ~i_clk;

// -----------------------------------------------------------
// Task: apply address and check output after 1 cycle
// -----------------------------------------------------------
task automatic apply_and_check(
    input [33:0] addr,
    input [1:0]  mode,
    input [1:0]  exp_stack,
    input [3:0]  exp_pc,
    input [2:0]  exp_bg,
    input [1:0]  exp_ba,
    input [14:0] exp_row,
    input [4:0]  exp_col,
    input string  test_name
);
    @(negedge i_clk);
    i_interleave_mode = mode;
    i_byte_addr       = addr;
    i_addr_valid      = 1'b1;
    @(posedge i_clk);         // DUT registers the decoded fields here
    #0.1;                     // let the registered outputs settle
    // Check in this cycle: o_addr_valid is high for one cycle only,
    // so waiting for another edge would sample it after it has dropped.
    if (o_addr_valid !== 1'b1)
        $display("FAIL [%s] o_addr_valid=0", test_name);
    else if (o_stack_id !== exp_stack || o_pc_id !== exp_pc ||
             o_bg      !== exp_bg    || o_ba    !== exp_ba  ||
             o_row     !== exp_row   || o_col   !== exp_col)
        $display("FAIL [%s] got stack=%0d pc=%0d bg=%0d ba=%0d row=%0d col=%0d",
                 test_name, o_stack_id, o_pc_id, o_bg, o_ba, o_row, o_col);
    else
        $display("PASS [%s]", test_name);
    @(negedge i_clk);
    i_addr_valid = 1'b0;      // end the one-cycle valid pulse
endtask

// -----------------------------------------------------------
// Stimulus
// -----------------------------------------------------------
initial begin
    // Reset
    i_rst_n           = 1'b0;
    i_addr_valid      = 1'b0;
    i_byte_addr       = 34'b0;
    i_interleave_mode = 2'b00;
    repeat(4) @(posedge i_clk);
    i_rst_n = 1'b1;
    @(posedge i_clk);

    $display("=== HBM3 Address Mapper Tests ===");

    // ---------------------------------------------------
    // WARM-UP: all-zero address, default mode
    // Every decoded field must read back as zero.
    // ---------------------------------------------------
    apply_and_check(
        34'h0,
        2'b00,
        2'h0, 4'h0, 3'h0, 2'h0, 15'd0, 5'd0,
        "warmup-zero"
    );

    // TEST 1: default mode, addr = 34'h0000_8B60
    // [33:32]=00 [29:15]=000000000000001 [14:11]=0001
    // [10:8]=011 [7:6]=01 [5:1]=10000 [0]=0
    // => stack=0 row=1 pc=1 bg=3 ba=1 col=16
    apply_and_check(
        34'h0000_8B60,
        2'b00,
        2'd0,   // stack
        4'd1,   // pc
        3'd3,   // bg
        2'd1,   // ba
        15'd1,  // row
        5'd16,  // col
        "default-mode"
    );

    // TEST 2: BG-first mode, addr = 34'h0000_86BF
    // [29:15]=1 => row=1, [13:10]=4'b0001 => pc=1
    // [9:8]=2'b10 => ba=2, [7:5]=3'b101 => bg=5
    // The low six address bits are all ones, so col=31
    apply_and_check(
        34'h0000_86BF,
        2'b01,
        2'd0,   // stack
        4'd1,   // pc [13:10]
        3'd5,   // bg [7:5]
        2'd2,   // ba [9:8]
        15'd1,  // row [29:15]
        5'd31,  // col
        "bg-first-mode"
    );

    // TEST 3: row-first mode, addr = 34'h0000_0C7F
    // [19:5]=row, [22:20]=bg, [24:23]=ba, [28:25]=pc, [33:32]=stack
    // 0x0C7F >> 5 = 0x63 => row=99; bg=0, ba=0, pc=0, stack=0
    // The low six address bits are all ones, so col=31
    apply_and_check(
        34'h0000_0C7F,
        2'b10,
        2'd0,   // stack
        4'd0,   // pc
        3'd0,   // bg
        2'd0,   // ba
        15'd99, // row [19:5]
        5'd31,  // col
        "row-first-mode"
    );

    // TEST 4 — Mode switch: flip from default to BG-first mid-stream
    // Verify that the output matches the NEW mode on the very next valid
    @(negedge i_clk);
    i_interleave_mode = 2'b00;
    i_byte_addr       = 34'h0000_16A0;
    i_addr_valid      = 1'b1;
    @(posedge i_clk); #0.1;
    i_addr_valid = 1'b0;
    // Change mode BEFORE the next valid pulse
    i_interleave_mode = 2'b01;
    @(negedge i_clk);
    i_byte_addr  = 34'h0000_16A0;
    i_addr_valid = 1'b1;
    @(posedge i_clk); #0.1;
    i_addr_valid = 1'b0;
    @(posedge i_clk); #0.1;
    if (o_bg === 3'd5 && o_ba === 2'd2)
        $display("PASS [mode-switch]");
    else
        $display("FAIL [mode-switch] bg=%0d ba=%0d (exp bg=5 ba=2)", o_bg, o_ba);

    $display("=== All tests complete ===");
    $finish;
end

endmodule
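Hand-computing test addresses is error-prone, so it can help to generate them from the intended field values. The following Python sketch packs default-mode fields into a 34-bit address and prints it as a Verilog literal. It is a hypothetical helper, not part of the testbench, and it covers only the mode 00 layout.

```python
def pack_default(stack: int, row: int, pc: int, bg: int, ba: int,
                 col: int, byte: int = 0) -> int:
    """Build a 34-bit byte address from default-mode (mode 00) field values."""
    layout = (            # (value, width in bits, lsb position)
        (stack, 2, 32), (row, 15, 15), (pc, 4, 11),
        (bg, 3, 8), (ba, 2, 6), (col, 5, 1), (byte, 1, 0),
    )
    addr = 0
    for value, width, lsb in layout:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        addr |= value << lsb
    return addr


def as_verilog_literal(addr: int) -> str:
    """Format an address as a 34-bit Verilog hex literal."""
    return f"34'h{addr:09X}"


if __name__ == "__main__":
    # stack=0 row=1 pc=1 bg=3 ba=1 col=16
    print(as_verilog_literal(pack_default(0, 1, 1, 3, 1, 16)))
```

For stack=0, row=1, pc=1, bg=3, ba=1, col=16 the helper returns 0x8B60. The range check rejects field values that would spill into a neighbouring field.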

8. Complete Address Field Reference

The table below summarises the bit positions for all three interleave modes side by side. Use this as a quick reference when integrating the mapper with your AXI4 adapter or debugging address translation in simulation.

Field | Default (mode 00) | BG-First (mode 01) | Row-First (mode 10) | Width
STACK[1:0] | [33:32] | [33:32] | [33:32] | 2 b
ROW[14:0] | [29:15] | [29:15] | [19:5] | 15 b
PC[3:0] | [14:11] | [13:10] | [28:25] | 4 b
BG[2:0] | [10:8] | [7:5] | [22:20] | 3 b
BA[1:0] | [7:6] | [9:8] | [24:23] | 2 b
COL[4:0] | [5:1] | [4:0] (byte) | [4:0] (byte) | 5 b
Byte offset | [0] | N/A (in COL) | N/A (in COL) | 1 b
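A field map like this is worth checking mechanically for overlapping or undecoded bits. The Python sketch below does that. It assumes COL is byte-granular at [4:0] in the two non-default modes, following the layouts described in Sections 4 and 5; the dictionary and function names are illustrative.

```python
ADDR_BITS = 34

FIELD_MAP = {   # mode: {field: (msb, lsb)}
    "default":   {"stack": (33, 32), "row": (29, 15), "pc": (14, 11),
                  "bg": (10, 8), "ba": (7, 6), "col": (5, 1)},
    "bg_first":  {"stack": (33, 32), "row": (29, 15), "pc": (13, 10),
                  "ba": (9, 8), "bg": (7, 5), "col": (4, 0)},
    "row_first": {"stack": (33, 32), "pc": (28, 25), "ba": (24, 23),
                  "bg": (22, 20), "row": (19, 5), "col": (4, 0)},
}


def bit_usage(mode: str) -> list:
    """Count how many fields claim each address bit in the given mode."""
    counts = [0] * ADDR_BITS
    for hi, lo in FIELD_MAP[mode].values():
        for bit in range(lo, hi + 1):
            counts[bit] += 1
    return counts


def undecoded_bits(mode: str) -> list:
    """Address bits that no field uses."""
    return [bit for bit, n in enumerate(bit_usage(mode)) if n == 0]


def overlapping_bits(mode: str) -> list:
    """Address bits claimed by more than one field."""
    return [bit for bit, n in enumerate(bit_usage(mode)) if n > 1]


if __name__ == "__main__":
    for mode in FIELD_MAP:
        print(mode, undecoded_bits(mode), overlapping_bits(mode))
```

Under these assumptions no mode has overlapping fields. Bits 30 and 31 are undecoded in every mode. Each mode also leaves one further bit out of the field decode: bit 0 (the byte offset) in default mode, bit 14 in BG-first, and bit 29 in row-first. Addresses that differ only in an undecoded bit alias to the same location.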

Capacity sanity check

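Parameter | Value | Bits required
Stacks per system | 4 | 2
Channels per stack | 16 | 4
Pseudo-channels per channel | 2 | not separately addressed here; folded into PC[3:0]
Bank groups per PC | 8 | 3
Banks per BG | 4 | 2
Rows per bank | 32 768 | 15
Columns per row (burst granularity) | 32 | 5
Total addressable bytes (1 stack, default mode) | 16 × 8 × 4 × 32 768 × 32 × 2 B = 1 GiB | 30

The product multiplies the 16 PC values, 8 bank groups, 4 banks, 32 768 rows, 32 columns, and 2 bytes per column, which matches the 30 decoded bits [29:0]. The full 34-bit address spans 16 GiB: bits [33:32] select the stack and bits [31:30] are not decoded. Current HBM3 stacks ship in 4 GB, 8 GB, and 16 GB configurations. Upper address bits that exceed the physical capacity will wrap or must be gated by the system bus fabric.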

9. Integration Notes & Performance Tips

Connecting to the AXI4 Interface (Module 6)

The AXI4 interface produces a 64-bit ARADDR / AWADDR. The address mapper consumes only bits [33:0], and in the module as written the stack select sits inside that range at bits [33:32]. Bits [63:34] carry system fabric routing and must be zero (or masked) before the address reaches the mapper.

Wire i_byte_addr[33:0] directly to ARADDR[33:0] from the AXI4 read address channel. Assert i_addr_valid when ARVALID && ARREADY — i.e., on the accepted beat of the AXI handshake.
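The masking and handshake rules can be sketched as a behavioural model. This Python fragment is only a software stand-in for the wiring described above; `split_axi_addr` and `addr_valid` are hypothetical helper names, and the only assumptions are a 64-bit AXI address and a 34-bit mapper input.

```python
MAPPER_ADDR_BITS = 34
MAPPER_MASK = (1 << MAPPER_ADDR_BITS) - 1


def split_axi_addr(axi_addr: int) -> tuple:
    """Return (mapper_addr, upper_bits) for a 64-bit AXI address."""
    if not 0 <= axi_addr < (1 << 64):
        raise ValueError("AXI address must fit in 64 bits")
    # Low 34 bits go to i_byte_addr; the rest belongs to the fabric.
    return axi_addr & MAPPER_MASK, axi_addr >> MAPPER_ADDR_BITS


def addr_valid(arvalid: bool, arready: bool) -> bool:
    """i_addr_valid is asserted only on an accepted AR handshake."""
    return arvalid and arready


if __name__ == "__main__":
    mapper_addr, upper = split_axi_addr((5 << 34) | 0x8B60)
    print(hex(mapper_addr), upper)
```

A non-zero `upper` value means the fabric has not stripped its routing bits. In hardware that condition is a natural candidate for an assertion at the mapper input.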

Connecting to the Scheduler (Module 9)

The scheduler receives o_stack_id, o_pc_id, o_bg, o_ba, o_row, and o_col as its address inputs. Since the mapper adds one pipeline register stage, the scheduler must also accept the one-cycle address latency. The standard approach is to store the AXI transaction ID and burst length in a small FIFO that is also one cycle deep, so the scheduler sees the address and metadata at the same time.
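The latency-matching idea can be illustrated with a tiny cycle-level model. In this Python sketch a one-entry register stands in for both the mapper's output stage and the metadata buffer. The class and function names are invented for the example, and the address is passed through undecoded to keep the focus on alignment.

```python
class OneCycleDelay:
    """Single register: each tick returns what was stored on the previous tick."""

    def __init__(self):
        self._reg = None          # reset state

    def tick(self, value):
        out, self._reg = self._reg, value
        return out


def run(transactions):
    """Feed (addr, meta) pairs, one per cycle; return what the scheduler sees."""
    addr_stage = OneCycleDelay()  # stands in for the mapper's output register
    meta_stage = OneCycleDelay()  # one-entry buffer for ID and burst length
    seen = []
    # One extra idle cycle flushes the last transaction out of the registers.
    for addr, meta in list(transactions) + [(None, None)]:
        seen.append((addr_stage.tick(addr), meta_stage.tick(meta)))
    return seen[1:]               # the first cycle only shows the reset state


if __name__ == "__main__":
    print(run([(0x100, "id0/len4"), (0x200, "id1/len8")]))
```

Because both paths pass through exactly one register, each address reaches the scheduler in the same cycle as its own metadata.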

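Mode selection strategy

Default mode (2'b00) suits workloads with spatial locality, because it maximises the row-hit rate. BG-first (2'b01) suits DMA engines and other sequential streams, because consecutive 32-byte bursts rotate through the bank groups. Row-first (2'b10) is for streams with no row reuse, and only after profiling confirms it; it performs badly for random access.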

Formal verification checklist

10. FAQ

Why does HBM3 use a 34-bit address rather than the full 64-bit AXI address?

A single HBM3 stack tops out at around 24 GB (12-high stacks of 16 Gb dies). 34 bits address 16 GiB at byte granularity, which covers stacks up to 16 GB and leaves headroom for multi-stack systems built from smaller stacks. In the module as written the stack select occupies bits [33:32] of that range. AXI address bits above bit 33 perform system-level routing, and the fabric strips them before the address reaches the mapper.

What is a pseudo-channel in HBM3 and why does it matter for addressing?

A pseudo-channel (PC) is a 32-bit sub-channel within HBM3's 64-bit physical channel. Each HBM3 channel contains 2 pseudo-channels, and each HBM3 device has 16 channels, giving 32 pseudo-channels per stack. Pseudo-channels operate independently, with their own bank machines, row buffers, and command queues, so distributing addresses across PC bits is critical for sustaining peak bandwidth. If all traffic hammers one PC, you are wasting 31 out of 32 independent command paths.

What is BG-first interleaving and when should I use it?

BG-first interleaving places the Bank Group bits in the lowest address positions (just above the column bits). For sequential read or write bursts, consecutive cache lines hit different bank groups, each with their own sense amplifiers and data paths. This hides the tRRD_L penalty because you are always crossing to a new bank group. Use BG-first for streaming workloads such as DMA transfers, neural network weight loads, or framebuffer scans where successive accesses are adjacent in memory.

Does the address mapper need a clock, or is it purely combinational?

The field extraction logic is purely combinational — it is just wiring (bit-slice assigns). The module registers the extracted fields on the rising clock edge so that downstream pipeline stages see stable, glitch-free inputs and timing analysis has a clean launch/capture boundary. If you need zero-latency decode you can bypass the register stage and consume the wire_* signals directly, though you will need to ensure setup/hold margins are met at the receiving flop.

How do I extend the mapper for a 4-stack HBM3 configuration?

The module as written already decodes a 2-bit stack select from bits [33:32], which is enough for 4 stacks as long as each stack fits in the 30 decoded bits below it. If each stack needs the full 34-bit range, extend i_byte_addr to [35:0], assign o_stack_id from bits [35:34], and leave the lower 34-bit field mapping unchanged. The interleave logic is unaffected because it only shuffles bits within the intra-stack address space. Update the AXI4 interface width accordingly and ensure your system address decoder generates the correct 36-bit per-stack base addresses.