HBM3 Read Data Path — Module 6 | HBM3 Controller Build

1. Read Data Path Overview

The read data path is the most latency-sensitive component of the HBM3 controller. When the scheduler fires a READ command on the Command/Address (CA) bus, nothing comes back for a very long time — 70 clock cycles at 2 GHz, or 35 nanoseconds. During those 70 cycles, the DRAM sense amplifiers detect and amplify the row data, the column address selects the right word, the output drivers enable, and the signal propagates through the package and TSVs back to the controller.

The read path module must solve four problems simultaneously:

Latency tracking — A shift register pipeline marks exactly which clock cycle each read's data is expected to arrive, without consuming a timer per request.
DQS-based capture — The DRAM drives a differential DQS strobe alongside the DQ data. The PHY samples each bit on DQS edges, not on the controller clock. The controller sees an i_dqs_valid signal from the PHY indicating valid data on i_dq_in.
Burst deserialization — Four consecutive 32-bit beats on i_dq_in must be assembled into one 128-bit word before the host can consume the data.
Read FIFO — Multiple reads may be in-flight simultaneously (pipelined). A read FIFO stores assembled 128-bit words until the host is ready to drain them.

The read path output is simple from the host's perspective: o_rd_data[127:0] with o_rd_valid high when data is available. Everything below that interface is the careful timing machinery described in this module.

2. CAS Latency (CL) — Physics Behind 35 Nanoseconds

CL (CAS Latency, also called RL for Read Latency in some documents) is the number of clock cycles between the READ command and the first DQ data beat. At 2 GHz, CL=70 equals exactly 35 ns. This is not an arbitrary number — it reflects the physical chain of operations inside the DRAM die:

Step	Operation	Approx. Time
1	Command decode in DRAM logic	~1 ns
2	Column address multiplexer routing	~2 ns
3	Sense amplifier sense & amplify (tAA)	~10 ns
4	Data path through bitline to output	~6 ns
5	Output driver enable (tOE)	~4 ns
6	DQ propagation through TSVs + package	~4 ns
7	PHY receiver and DQS alignment	~5 ns
8	Guard-band for PVT variation	~3 ns
Total		~35 ns = CL 70

At faster speed grades, CL increases in absolute cycle count while the latency in nanoseconds changes far less. HBM3-6400 at 3200 MHz uses CL=140, which is 140 × 0.3125 ns ≈ 44 ns. The cycle count doubles relative to CL=70 because the clock period shrinks while the underlying analog delays do not.
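The cycle-to-time conversion behind these figures is the CL count multiplied by the clock period. As a worked check of the two cases quoted in this section (the CL=140 value is this article's own example and is not taken from a datasheet):

```latex
\begin{align*}
t_{CL} &= CL \times t_{CK}, \qquad t_{CK} = \frac{1}{f_{CK}} \\
f_{CK} = 2\ \text{GHz}: \quad t_{CL} &= 70 \times 0.5\ \text{ns} = 35\ \text{ns} \\
f_{CK} = 3.2\ \text{GHz}: \quad t_{CL} &= 140 \times 0.3125\ \text{ns} = 43.75\ \text{ns}
\end{align*}
```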

CL is a read-only characteristic of the DRAM die and speed grade. Unlike write-side ODT which can be adjusted, CL is fixed for a given device and must be precisely programmed into the controller's CL pipeline during initialization.

3. CL Pipeline — Shift Register Latency Tracking

A naive approach to CL tracking would allocate one counter per in-flight read, decrementing each cycle and signaling capture when zero. This works but requires N counters for N concurrent reads — expensive at high queue depths.

Shift Register Approach

The efficient approach uses a single N-bit shift register where N equals the maximum supported CL value (here 128 bits to cover CL up to 128 cycles). Each cycle a new bit is shifted in at position 0: set to 1 if a read command was issued this cycle, 0 otherwise. The bit at position i_cl - 1 is tapped as the capture_expected signal.

When a read command was issued CL cycles ago, its tracking bit has shifted exactly to position i_cl - 1. capture_expected fires for one cycle, telling the capture logic to expect valid data on i_dq_in this cycle and the next 3 cycles (the BL4 window).

Single register serves unlimited concurrent in-flight reads
Runtime-configurable via i_cl[7:0] — no recompilation needed
Shallow logic on the critical path: a single tap multiplexer, indexed by i_cl, selects one register bit
Works correctly for back-to-back reads: each READ command sets its own bit independently

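The shift register must be wide enough for the maximum CL value across all supported speed grades. The 128-bit default does not cover the CL=140 quoted for HBM3-6400: PIPE_W must be raised to at least 140 (plus the CL_TOL margin used by the latency check) for that speed grade. For a fixed 2 GHz design with CL=70, 80 bits is sufficient.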
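The tracker described in this section can be sketched in isolation as follows. This is a minimal illustration, not the module from the full source: the name cl_tracker, the MAX_CL parameter, and the choice to make out-of-range i_cl settings never fire are assumptions made for the example.

```verilog
// Sketch of the shift-register latency tracker described above.
// Module name, MAX_CL, and the out-of-range behaviour are illustrative assumptions.
module cl_tracker #(
    parameter MAX_CL = 128              // pipeline width = largest supported CL
) (
    input  wire       i_clk,
    input  wire       i_rst_n,          // active-low asynchronous reset
    input  wire       i_rd_cmd,         // one-cycle READ pulse
    input  wire [7:0] i_cl,             // programmed CAS latency, 1..MAX_CL
    output wire       o_capture_expected
);
    reg [MAX_CL-1:0] pipe;

    // Shift a 1 in at bit 0 for every READ; older reads move toward the MSB.
    always @(posedge i_clk or negedge i_rst_n) begin
        if (!i_rst_n)
            pipe <= {MAX_CL{1'b0}};
        else
            pipe <= {pipe[MAX_CL-2:0], i_rd_cmd};
    end

    // Tap at i_cl-1. Settings of 0 or above MAX_CL never fire.
    wire [7:0] tap = i_cl - 8'd1;
    assign o_capture_expected = (tap < MAX_CL) ? pipe[tap] : 1'b0;
endmodule
```

With i_cl = 5, a READ sampled on one clock edge raises o_capture_expected after the fifth edge counting that one, for exactly one cycle. A second READ issued in between produces its own independent pulse, because each command occupies its own bit in the register.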

4. DQS-Based Capture — Why DQS Exists

In synchronous DDR memory, data is transferred at both the rising and falling edges of the clock. This means one bit can change every 250 ps at 2 GHz. At such speeds, the controller's internal clock has accumulated enough jitter and skew relative to the DRAM's output that sampling DQ on the controller clock is unreliable.

HBM3 solves this by having the DRAM output a differential DQS (Data Strobe) signal alongside the DQ bits. For reads, DQS leaves the DRAM edge-aligned with DQ: strobe and data toggle together. The PHY implements a DLL (Delay-Locked Loop) that locks to DQS and delays it by 90° (a quarter of the clock period, 125 ps at 2 GHz), which positions the capture edge at the center of the DQ eye, where setup and hold margins are largest.

Controller Interface to PHY

The controller module does not implement DQS capture directly — that is done in the hardened PHY. Instead, the PHY presents two signals to the controller:

i_dq_in[31:0] — the captured 32-bit DQ word, already latched on DQS edges
i_dqs_valid — a synchronous flag in the controller clock domain, high for exactly 4 cycles (BL4) when DQ data is valid

The controller's capture logic simply samples i_dq_in on every cycle where i_dqs_valid is high, using the beat counter to assemble the 128-bit burst.

Read Preamble

Before the first DQS toggle, the DRAM holds DQS low for 1 or 2 tCK (read preamble). This gives the PHY DQS receiver time to enable. The i_dqs_valid signal presented to this controller module is already preamble-compensated by the PHY — it asserts on the first true data beat, not on the preamble.

5. Read Burst Deserializer — 4 Beats to 128 Bits

The burst deserializer is the inverse of the write path serializer. It receives four consecutive 32-bit beats on i_dq_in and assembles them into a single 128-bit word. A 2-bit beat counter (beat_cnt) tracks position within the BL4 window. The mapping is:

Beat	i_dq_in received	Packed into
0 (first)	DQ[31:0]	rd_data[31:0]
1	DQ[31:0]	rd_data[63:32]
2	DQ[31:0]	rd_data[95:64]
3 (last)	DQ[31:0]	rd_data[127:96]

On the rising edge that sees beat 3 (beat_cnt == 2'b11 and i_dqs_valid), the assembled 128-bit word is pushed into the Read FIFO and beat_cnt resets to 0. The FIFO push completes in a single cycle — no stall is needed as long as the read FIFO is not full.

If the read FIFO is full when beat 3 arrives, the assembled burst word is dropped. The host must drain the read FIFO fast enough to prevent this. The scheduler should throttle READ commands when o_rfifo_empty is chronically 0 (indicating the host is consuming slowly).
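The beat-to-slice mapping in the table above can also be written with an indexed part-select instead of a case statement. The following is a hedged sketch of that alternative, not the article's implementation: burst_deser, o_word, and o_word_valid are hypothetical names, and the capture_active arming used in the full source is left out for brevity.

```verilog
// Sketch: BL4 deserializer using an indexed part-select.
// Names are illustrative; arming by capture_expected is intentionally omitted.
module burst_deser (
    input  wire         i_clk,
    input  wire         i_rst_n,       // active-low asynchronous reset
    input  wire         i_dqs_valid,   // one 32-bit beat is valid this cycle
    input  wire [31:0]  i_dq_in,
    output reg  [127:0] o_word,        // assembled burst, beat 0 in bits [31:0]
    output reg          o_word_valid   // one-cycle pulse after beat 3
);
    reg [1:0]  beat_cnt;
    reg [95:0] lower;                  // holds beats 0..2 until beat 3 arrives

    always @(posedge i_clk or negedge i_rst_n) begin
        if (!i_rst_n) begin
            beat_cnt     <= 2'd0;
            lower        <= 96'd0;
            o_word       <= 128'd0;
            o_word_valid <= 1'b0;
        end else begin
            o_word_valid <= 1'b0;
            if (i_dqs_valid) begin
                if (beat_cnt == 2'd3) begin
                    // Beat 3 goes straight into the top slice.
                    o_word       <= {i_dq_in, lower};
                    o_word_valid <= 1'b1;
                end else begin
                    // Beats 0..2 land at bit offsets 0, 32, 64.
                    lower[beat_cnt*32 +: 32] <= i_dq_in;
                end
                beat_cnt <= beat_cnt + 2'd1;   // wraps 3 -> 0
            end
        end
    end
endmodule
```

Feeding the beats 32'h1111_1111, 32'h2222_2222, 32'h3333_3333, 32'h4444_4444 in that order yields o_word = 128'h4444_4444_3333_3333_2222_2222_1111_1111, which matches the table: the first beat occupies the least-significant 32 bits.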

6. Read FIFO — Buffering for Back-to-Back Reads

The read FIFO sits between the burst deserializer and the host data bus. With CL=70 cycles and typical HBM3 pseudo-channel bandwidths supporting 4–8 reads in-flight simultaneously, the FIFO must accommodate multiple assembled bursts while the host is busy with other work.

FIFO Sizing

Entry width: 128 bits (one complete BL4 burst)
Depth: 4 entries (covers typical maximum in-flight read depth of 4 at this abstraction level)
Push: on beat_cnt == 3 with i_dqs_valid
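Pop: automatic, one entry per cycle whenever the FIFO is non-empty (the module has no host-ready input; the host samples o_rd_data while o_rd_valid is high)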

A depth of 4 provides adequate buffering for burst read sequences. The scheduler must track the in-flight read count and keep it at or below the RFIFO depth to prevent overflow. In a full controller implementation, the read data return path includes a request ID tag so that out-of-order responses can be reordered before delivery to the host. This module implements in-order capture only.
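One way a scheduler could enforce that limit is a small credit counter. The sketch below is an assumption-laden illustration rather than part of this module: rd_credit and its ports are hypothetical, with i_rd_done expected to be driven by whatever event frees a FIFO entry (for example o_rd_valid from the read path).

```verilog
// Sketch: in-flight READ limiter for the scheduler.
// Module and port names are hypothetical and not part of hbm3_read_path.
module rd_credit #(
    parameter DEPTH = 4                 // match RFIFO_DEPTH
) (
    input  wire i_clk,
    input  wire i_rst_n,                // active-low asynchronous reset
    input  wire i_rd_issue,             // scheduler wants to issue a READ
    input  wire i_rd_done,              // one burst left the read FIFO
    output wire o_rd_allow              // high while another READ may be issued
);
    localparam CNT_W = $clog2(DEPTH + 1);
    reg [CNT_W-1:0] inflight;

    always @(posedge i_clk or negedge i_rst_n) begin
        if (!i_rst_n)
            inflight <= {CNT_W{1'b0}};
        else begin
            case ({i_rd_issue && o_rd_allow, i_rd_done && (inflight != 0)})
                2'b10:   inflight <= inflight + 1'b1;   // READ accepted
                2'b01:   inflight <= inflight - 1'b1;   // burst delivered
                default: inflight <= inflight;          // none or both: no net change
            endcase
        end
    end

    assign o_rd_allow = (inflight < DEPTH);
endmodule
```

The scheduler would gate its READ issue on o_rd_allow. Requests made while o_rd_allow is low are not counted, so the counter cannot exceed DEPTH.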

o_rd_valid and o_rfifo_empty

o_rd_valid asserts when the read FIFO held at least one entry in the previous cycle. In that same cycle o_rd_data presents the popped word and the read pointer has advanced by one. o_rd_valid and o_rd_data are registered for clean timing to the host interface. o_rfifo_empty is a combinational comparison of the read and write pointers, so it can already be high in the cycle where o_rd_valid presents the last word.

7. Read Path Pipeline Diagram

8. Full Verilog Source — hbm3_read_path.v

Complete synthesizable module. Copy-paste-ready for Vivado, Quartus, or VCS.

verilog — hbm3_read_path.v

// =============================================================
// hbm3_read_path.v
// HBM3 Read Data Path — Module 6
// Phase 2 of the HBM3 Controller Build series
// EcrioniX · https://ecrionix.org/hbm3-controller/read-path/
// =============================================================
// Parameters
//   RFIFO_DEPTH : Read FIFO depth (power of 2, default 4)
//   BL          : Burst Length (fixed 4 for HBM3)
//   DQ_W        : DQ bus width per pseudo-channel (32 bits)
//   CL_TOL      : Latency error tolerance in cycles (default 2)
// =============================================================

module hbm3_read_path #(
    parameter RFIFO_DEPTH = 4,    // must be power of 2
    parameter BL          = 4,    // burst length — HBM3 fixed BL4
    parameter DQ_W        = 32,   // DQ bus width
    parameter CL_TOL      = 2     // ± cycles for latency error flag
) (
    // clock / reset
    input  wire              i_clk,
    input  wire              i_rst_n,

    // read command interface
    input  wire              i_rd_cmd,         // read command pulse from scheduler
    input  wire [7:0]       i_cl,             // CAS Latency in cycles (default 70)

    // DQ bus inputs from PHY
    input  wire [31:0]      i_dq_in,          // DQ data from PHY (one beat)
    input  wire              i_dqs_valid,      // PHY: DQ is valid this cycle

    // host read data interface
    output  reg [127:0]     o_rd_data,        // assembled 128-bit burst
    output  reg              o_rd_valid,       // read data valid to host
    output  wire             o_rfifo_empty,    // read FIFO has no data
    output  reg              o_latency_err     // DQS arrived outside CL window
);

// ─────────────────────────────────────────────
// Local parameters
// ─────────────────────────────────────────────
localparam PIPE_W   = 128;                    // shift register width (max CL)
localparam PTR_W   = $clog2(RFIFO_DEPTH) + 1; // extra bit for full/empty
localparam DEPTH_W = $clog2(RFIFO_DEPTH);

// ─────────────────────────────────────────────
// CL Pipeline — 128-bit shift register
// ─────────────────────────────────────────────
reg [PIPE_W-1:0] cl_pipe;
wire              capture_expected;   // fires CL cycles after each rd_cmd

always @(posedge i_clk or negedge i_rst_n) begin
    if (!i_rst_n)
        cl_pipe <= {PIPE_W{1'b0}};
    else
        cl_pipe <= {cl_pipe[PIPE_W-2:0], i_rd_cmd};
end

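// Tap the pipeline at position i_cl-1. Valid i_cl range is 1..PIPE_W;
// any other setting reads as 0, so no capture is armed.
wire [7:0] cl_tap = i_cl - 8'd1;
assign capture_expected = (cl_tap < PIPE_W) ? cl_pipe[cl_tap] : 1'b0;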

// ─────────────────────────────────────────────
// Latency Error Detection
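//   Pulse for one cycle when a DQS burst starts while no tracking
//   bit lies inside the CL ± CL_TOL window of the pipeline.
//   Bursts that follow each other with no gap in i_dqs_valid are
//   checked only at the start of the first burst.
// ─────────────────────────────────────────────
reg [PIPE_W-1:0] cl_window;          // mask of taps CL-CL_TOL .. CL+CL_TOL
reg              dqs_valid_d;        // i_dqs_valid delayed by one cycle
integer j;

always @(*) begin
    cl_window = {PIPE_W{1'b0}};
    for (j = 0; j <= CL_TOL*2; j = j + 1) begin
        // Guard both ends so the index never under- or overflows
        if ((i_cl + j >= CL_TOL) && (i_cl + j - CL_TOL < PIPE_W))
            cl_window[i_cl + j - CL_TOL] = 1'b1;
    end
end

wire dqs_start = i_dqs_valid && !dqs_valid_d;   // first beat of a burst

always @(posedge i_clk or negedge i_rst_n) begin
    if (!i_rst_n) begin
        dqs_valid_d   <= 1'b0;
        o_latency_err <= 1'b0;
    end else begin
        dqs_valid_d   <= i_dqs_valid;
        // Error: a burst started but no READ is at the expected pipeline depth
        o_latency_err <= dqs_start && !(|(cl_pipe & cl_window));
    end
end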

// ─────────────────────────────────────────────
// Read Burst Deserializer
// ─────────────────────────────────────────────
reg [127:0] capture_buf;
reg [1:0]   beat_cnt;
reg          capture_active;

wire push_fifo;   // push to RFIFO on last beat
assign push_fifo = capture_active && i_dqs_valid && (beat_cnt == 2'd3);

always @(posedge i_clk or negedge i_rst_n) begin
    if (!i_rst_n) begin
        capture_buf    <= 128'b0;
        beat_cnt       <= 2'b0;
        capture_active <= 1'b0;
    end else begin
        if (capture_expected && !capture_active)
            capture_active <= 1'b1;  // arm capture

        if (capture_active && i_dqs_valid) begin
            case (beat_cnt)
                2'd0: capture_buf[31:0]   <= i_dq_in;
                2'd1: capture_buf[63:32]  <= i_dq_in;
                2'd2: capture_buf[95:64]  <= i_dq_in;
                2'd3: capture_buf[127:96] <= i_dq_in;
            endcase

            if (beat_cnt == 2'd3) begin
                beat_cnt       <= 2'b0;
                capture_active <= 1'b0;
            end else
                beat_cnt <= beat_cnt + 1'b1;
        end
    end
end

// ─────────────────────────────────────────────
// Read FIFO — depth=4, width=128
// ─────────────────────────────────────────────
reg [127:0]       rfifo_mem [0:RFIFO_DEPTH-1];
reg [PTR_W-1:0]  rfifo_wr_ptr;
reg [PTR_W-1:0]  rfifo_rd_ptr;

wire rfifo_empty_w = (rfifo_wr_ptr == rfifo_rd_ptr);
wire rfifo_full_w  = (rfifo_wr_ptr[PTR_W-1] != rfifo_rd_ptr[PTR_W-1]) &&
                     (rfifo_wr_ptr[DEPTH_W-1:0] == rfifo_rd_ptr[DEPTH_W-1:0]);
assign o_rfifo_empty = rfifo_empty_w;

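// FIFO storage write (on last beat of deserializer).
// capture_buf[127:96] is not yet updated in this cycle, so beat 3 is
// taken directly from i_dq_in and placed in the top 32 bits.
always @(posedge i_clk) begin
    if (push_fifo && !rfifo_full_w)
        rfifo_mem[rfifo_wr_ptr[DEPTH_W-1:0]] <= {i_dq_in, capture_buf[95:0]};
end

// FIFO pointers + host output. Both pointers live in this one block so
// that each register has a single driver. There is no host-ready input:
// an entry is popped the cycle after it becomes visible in the FIFO.
always @(posedge i_clk or negedge i_rst_n) begin
    if (!i_rst_n) begin
        o_rd_data    <= 128'b0;
        o_rd_valid   <= 1'b0;
        rfifo_wr_ptr <= {PTR_W{1'b0}};
        rfifo_rd_ptr <= {PTR_W{1'b0}};
    end else begin
        if (push_fifo && !rfifo_full_w)
            rfifo_wr_ptr <= rfifo_wr_ptr + 1'b1;

        if (!rfifo_empty_w) begin
            o_rd_data    <= rfifo_mem[rfifo_rd_ptr[DEPTH_W-1:0]];
            o_rd_valid   <= 1'b1;
            rfifo_rd_ptr <= rfifo_rd_ptr + 1'b1;
        end else begin
            o_rd_valid <= 1'b0;
        end
    end
end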

endmodule // hbm3_read_path

9. SystemVerilog Testbench with SVA Assertions

The testbench drives READ commands, simulates DRAM by asserting i_dqs_valid and i_dq_in exactly CL cycles later, and uses SVA to verify o_rd_valid timing, FIFO integrity, and latency error detection.

systemverilog — tb_hbm3_read_path.sv

// ===========================================================
// tb_hbm3_read_path.sv  — Self-checking SV testbench
// ===========================================================
module tb_hbm3_read_path;

timeunit 1ps;        // CLK_PERIOD below is expressed in picoseconds
timeprecision 1ps;

parameter CLK_PERIOD = 500; // 500 ps = 2 GHz
parameter CL         = 70;
parameter BL         = 4;

logic          i_clk       = 0;
logic          i_rst_n     = 0;
logic          i_rd_cmd    = 0;
logic [7:0]   i_cl        = CL;
logic [31:0]  i_dq_in     = 0;
logic          i_dqs_valid = 0;

wire [127:0]  o_rd_data;
wire           o_rd_valid;
wire           o_rfifo_empty;
wire           o_latency_err;

hbm3_read_path #(.RFIFO_DEPTH(4), .BL(4), .DQ_W(32), .CL_TOL(2)) dut (.*);

always #(CLK_PERIOD/2) i_clk = !i_clk;

// ── SVA: o_rd_valid is registered, so it must follow a non-empty FIFO by one cycle ─────
property p_valid_needs_data;
    @(posedge i_clk) disable iff (!i_rst_n)
        o_rd_valid |-> $past(!o_rfifo_empty);
endproperty
assert property (p_valid_needs_data)
    else $error("SVA FAIL: o_rd_valid asserted without FIFO data in the previous cycle");

// ── SVA: o_latency_err must not persist ──────────────────
property p_err_clears;
    @(posedge i_clk) $rose(o_latency_err) |-> ##1 !o_latency_err;
endproperty
assert property (p_err_clears)
    else $error("SVA FAIL: o_latency_err stuck high");

// ── SVA: rd_data must equal expected after valid ──────────
logic [127:0] expected_data;

property p_data_correct;
    @(posedge i_clk) o_rd_valid |-> (o_rd_data == expected_data);
endproperty
assert property (p_data_correct)
    else $error("SVA FAIL: o_rd_data mismatch. Got %0h, expected %0h",
                o_rd_data, expected_data);

// ── Task: simulate DRAM returning data after CL cycles ───
// Non-blocking drives keep the DUT from racing the testbench at the clock edge.
task automatic dram_return_data(input logic [127:0] data);
    repeat (CL) @(posedge i_clk); // wait CL cycles
    // Drive 4 beats on DQ bus, least-significant 32 bits first
    repeat (BL) begin
        @(posedge i_clk);
        i_dqs_valid <= 1;
        i_dq_in     <= data[31:0];
        data        = {32'b0, data[127:32]};
    end
    @(posedge i_clk);
    i_dqs_valid <= 0;
    i_dq_in     <= 0;
endtask

initial begin
    $dumpfile("hbm3_read_path.vcd");
    $dumpvars(0, tb_hbm3_read_path);

    repeat (5) @(posedge i_clk);
    i_rst_n = 1;
    repeat (3) @(posedge i_clk);

    // Test 1: single read
    $display("[T1] Single read transaction");
    expected_data = 128'hDEAD_BEEF_CAFE_BABE_1234_5678_9ABC_DEF0;
    @(posedge i_clk); i_rd_cmd <= 1;
    @(posedge i_clk); i_rd_cmd <= 0;
    dram_return_data(expected_data);   // blocks until the burst has been driven
    repeat (5) @(posedge i_clk);

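    // Test 2: back-to-back pipelined reads
    // Each READ gets its own data return, started on the edge where its
    // i_rd_cmd pulse ends (the same relative timing as Test 1).
    $display("[T2] Pipelined reads");
    expected_data = 128'hAAAA_BBBB_CCCC_DDDD_EEEE_FFFF_1111_2222;
    fork
        begin
            @(posedge i_clk); i_rd_cmd <= 1;
            @(posedge i_clk); i_rd_cmd <= 0;
            repeat (4) @(posedge i_clk);
            @(posedge i_clk); i_rd_cmd <= 1;
            @(posedge i_clk); i_rd_cmd <= 0;
        end
        begin   // data for the first READ
            repeat (2) @(posedge i_clk);
            dram_return_data(expected_data);
        end
        begin   // data for the second READ, issued 6 cycles after the first
            repeat (8) @(posedge i_clk);
            dram_return_data(expected_data);
        end
    join
    repeat (CL + 10) @(posedge i_clk);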

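    // Test 3: latency error (DQS arrives 5 cycles late)
    $display("[T3] Latency error detection — DQS arrives CL+5 late");
    begin : test3
        bit err_seen;   // o_latency_err is not sticky, so latch it while the burst runs
        err_seen      = 0;
        expected_data = {4{32'h0BAD_DA7A}};   // the late burst is still captured
        @(posedge i_clk); i_rd_cmd <= 1;
        @(posedge i_clk); i_rd_cmd <= 0;
        repeat (CL + 5) @(posedge i_clk); // 5 extra cycles
        repeat (BL) begin
            @(posedge i_clk);
            i_dqs_valid <= 1;
            i_dq_in     <= 32'h0BAD_DA7A;
            if (o_latency_err) err_seen = 1;
        end
        @(posedge i_clk); i_dqs_valid <= 0;
        repeat (5) begin
            @(posedge i_clk);
            if (o_latency_err) err_seen = 1;
        end
        if (err_seen)
            $display("[T3 PASS] latency error flagged correctly");
        else
            $error("[T3 FAIL] latency error not flagged");
    end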

    $display("[PASS] All tests complete");
    $finish;
end

endmodule

10. Timing Parameter Reference

Parameter	Symbol	Value (2 GHz)	Cycles	Description
CAS Latency	CL / RL	35 ns	70	READ command to first DQ data valid
Burst Length	BL	4 beats	4	Fixed per HBM3 pseudo-channel
Read Preamble	tRPRE	0.5–1 ns	1–2	DQS low before first valid strobe
Read Postamble	tRPST	0.5 ns	1	DQS low after last valid strobe
DQ output hold	tQH	0.38 tCK	—	DQ held valid after DQS edge
DQ output valid	tDQSQ	70 ps max	—	DQ to DQS skew at DRAM output
Read-to-Read (different bank group)	tCCD_S	4 ns	8	Min gap between consecutive READs
Row Active Time	tRAS	32 ns	64	Min time row stays open after ACT

11. Port Reference — hbm3_read_path

Port	Dir	Width	Description
i_clk	Input	1	Controller clock (2 GHz nominal)
i_rst_n	Input	1	Active-low asynchronous reset
i_rd_cmd	Input	1	Read command pulse (1 cycle) from scheduler
i_cl	Input	8	CAS Latency in cycles (default 70; runtime-configurable)
i_dq_in	Input	32	DQ bus input from PHY — one 32-bit beat per cycle
i_dqs_valid	Input	1	PHY asserts high for exactly 4 cycles when DQ data is valid
o_rd_data	Output	128	Assembled 128-bit BL4 burst word to host
o_rd_valid	Output	1	Read data valid — host may sample o_rd_data
o_rfifo_empty	Output	1	Read FIFO contains no valid entries
o_latency_err	Output	1	DQS arrived outside CL ± CL_TOL window; retrain required

12. FAQ

What does CAS Latency represent physically in HBM3?

CL is the total latency in clock cycles from the DRAM receiving a READ command to the controller receiving the first valid data beat. At 2 GHz, CL=70 equals 35 ns. The main contributors are: sense amplifier sensing time (the sense amps must resolve a small differential voltage on the bitlines, ~10 ns), column path delay through DRAM array logic (~6 ns), output driver enable (~4 ns), TSV and package propagation (~4 ns), and command decode, column routing, PHY DQS alignment, and guard-band margins (~11 ns combined). CL is fixed for a given die and speed bin, so it cannot be reduced through training.

Why is a DQS strobe needed? Can't the controller clock sample DQ directly?

No. At 2 GHz, each DQ bit window is only 250 ps. The DRAM's internal clock phase, the package trace delay, and the controller's internal clock may differ by 100–200 ps, so sampling on the controller clock would place the sample point at an uncontrolled position within the DQ eye with no guaranteed margin. DQS is generated by the DRAM die with a known phase relationship to DQ (edge-aligned at the DRAM output), so the PHY can lock onto DQS with a DLL, delay it by 90°, and sample at the center of the eye with balanced setup and hold margin.

How does the shift register handle multiple in-flight reads?

Each i_rd_cmd pulse sets a '1' at bit 0 of the shift register. This bit moves up by one bit position every clock cycle. After CL cycles it reaches bit position i_cl - 1 and fires capture_expected. Since multiple '1' bits can exist in the register simultaneously (one per in-flight read), the register tracks all in-flight reads without a per-read counter. The only constraint is that reads must be spaced by at least BL=4 cycles so that their capture windows do not overlap on the DQ bus. The scheduler's read-to-read spacing (tCCD) enforces this anyway.

What should firmware do when o_latency_err asserts?

o_latency_err means DQS arrived more than CL_TOL=2 cycles outside the expected window. The most common cause is temperature drift shifting the PHY DLL lock point, or a training calibration that has drifted. The recommended response is: (1) pause new READ commands, (2) drain the read FIFO, (3) trigger PHY read leveling re-calibration, (4) update i_cl if the calibrated latency changed, then resume. In some SoC implementations this is handled by a hardware state machine in the initialization controller.
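Where the recovery is handled in hardware, the four steps above could be sequenced by a small state machine along these lines. This is a sketch under stated assumptions, not a reference design: the module name, o_train_req, i_train_done, and i_cl_trained are hypothetical PHY handshake signals, and a real controller would also wait for reads still in flight rather than relying on the FIFO empty flag alone.

```verilog
// Sketch: retrain sequencing after o_latency_err.
// All names other than the read-path signals are hypothetical.
module rd_retrain_fsm #(
    parameter [7:0] CL_DEFAULT = 8'd70
) (
    input  wire       i_clk,
    input  wire       i_rst_n,         // active-low asynchronous reset
    input  wire       i_latency_err,   // o_latency_err from the read path
    input  wire       i_rfifo_empty,   // o_rfifo_empty from the read path
    input  wire       i_train_done,    // assumed: PHY read leveling finished
    input  wire [7:0] i_cl_trained,    // assumed: latency reported by training
    output reg        o_rd_pause,      // hold off new READ commands
    output reg        o_train_req,     // assumed: one-cycle start pulse to the PHY
    output reg  [7:0] o_cl             // value to drive onto i_cl
);
    localparam S_RUN = 2'd0, S_DRAIN = 2'd1, S_TRAIN = 2'd2;
    reg [1:0] state;

    always @(posedge i_clk or negedge i_rst_n) begin
        if (!i_rst_n) begin
            state       <= S_RUN;
            o_rd_pause  <= 1'b0;
            o_train_req <= 1'b0;
            o_cl        <= CL_DEFAULT;
        end else begin
            o_train_req <= 1'b0;                 // default: no request
            case (state)
                S_RUN: if (i_latency_err) begin
                    o_rd_pause <= 1'b1;          // (1) pause new READs
                    state      <= S_DRAIN;
                end
                S_DRAIN: if (i_rfifo_empty) begin // (2) FIFO drained
                    o_train_req <= 1'b1;         // (3) start re-calibration
                    state       <= S_TRAIN;
                end
                S_TRAIN: if (i_train_done) begin
                    o_cl       <= i_cl_trained;  // (4) update CL, then resume
                    o_rd_pause <= 1'b0;
                    state      <= S_RUN;
                end
                default: state <= S_RUN;
            endcase
        end
    end
endmodule
```

In this arrangement o_cl would feed the read path's i_cl input and o_rd_pause would gate the scheduler's READ issue logic.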

Can the read FIFO depth be increased, and at what cost?

Yes — RFIFO_DEPTH is a parameter. Each entry is 128 bits (16 bytes). Depth 4 = 512 bits (64 bytes) of SRAM. Depth 8 doubles this to 128 bytes. For designs with high read bandwidth and a slow host data bus (e.g. PCIe Gen4 downstream from HBM3), a deeper RFIFO prevents backpressure from the host from starving the DRAM pipeline. The pointer width and full/empty logic scale automatically through $clog2.
