The read path captures DQ data from HBM3 DRAM exactly CL=70 cycles after the READ command, assembles four 32-bit beats into a 128-bit word, buffers it in a read FIFO, and signals the host with o_rd_valid. One missed clock ruins the capture.
The read data path is the most latency-sensitive component of the HBM3 controller. When the scheduler fires a READ command on the Command/Address (CA) bus, nothing comes back for a very long time — 70 clock cycles at 2 GHz, or 35 nanoseconds. During those 70 cycles, the DRAM sense amplifiers detect and amplify the row data, the column address selects the right word, the output drivers enable, and the signal propagates through the package and TSVs back to the controller.
The read path module must solve four problems simultaneously:
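1. CL tracking: determine the exact cycle, CL cycles after each READ command, in which data is due.
2. Capture qualification: sample only while the i_dqs_valid signal from the PHY indicates valid data on i_dq_in.
3. Burst assembly: four 32-bit beats on i_dq_in must be assembled into one 128-bit word before the host can consume the data.
4. Buffering: assembled words wait in the read FIFO until the host takes them.

The read path output is simple from the host's perspective: o_rd_data[127:0] with o_rd_valid high when data is available. Everything below that interface is the careful timing machinery described in this module.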
CL (CAS Latency, also called RL for Read Latency in some documents) is the number of clock cycles between the READ command and the first DQ data beat. At 2 GHz, CL=70 equals exactly 35 ns. This is not an arbitrary number — it reflects the physical chain of operations inside the DRAM die:
| Step | Operation | Approx. Time |
|---|---|---|
| 1 | Command decode in DRAM logic | ~1 ns |
| 2 | Column address multiplexer routing | ~2 ns |
| 3 | Sense amplifier sense & amplify (tAA) | ~10 ns |
| 4 | Data path through bitline to output | ~6 ns |
| 5 | Output driver enable (tOE) | ~4 ns |
| 6 | DQ propagation through TSVs + package | ~4 ns |
| 7 | PHY receiver and DQS alignment | ~5 ns |
| 8 | Guard-band for PVT variation | ~3 ns |
| Total | | ~35 ns = CL 70 |
At faster speed grades, CL increases in absolute cycle count but stays in the same range in nanoseconds. HBM3-6400 at 3200 MHz uses CL=140 (43.75 ns): the cycle budget grows because the clock period shrinks faster than the physics can be squeezed.
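The cycle-to-nanosecond conversion used throughout this section can be checked with a few lines of Python. This is only an arithmetic sketch; the function name `cl_to_ns` is made up for illustration and is not part of the controller.

```python
def cl_to_ns(cl_cycles: int, clock_mhz: float) -> float:
    """Convert a CAS latency given in clock cycles to nanoseconds."""
    period_ns = 1000.0 / clock_mhz  # one clock period in ns
    return cl_cycles * period_ns


print(cl_to_ns(70, 2000))   # 35.0
print(cl_to_ns(140, 3200))  # 43.75
```

The same helper makes it easy to see how many cycles a fixed analog delay costs at a different clock rate.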
A naive approach to CL tracking would allocate one counter per in-flight read, decrementing each cycle and signaling capture when zero. This works but requires N counters for N concurrent reads — expensive at high queue depths.
The efficient approach uses a single N-bit shift register where N equals the maximum supported CL value (here 128 bits to cover CL up to 128 cycles). Each cycle a new bit is shifted in at position 0: set to 1 if a read command was issued this cycle, 0 otherwise. The bit at position i_cl - 1 is tapped as the capture_expected signal.
When a read command was issued CL cycles ago, its tracking bit has shifted exactly to position i_cl - 1. capture_expected fires for one cycle, telling the capture logic to expect valid data on i_dq_in this cycle and the next 3 cycles (the BL4 window).
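The shift-register scheme can be modeled behaviorally in a few lines. The sketch below is a Python model, not RTL, and the names `ClPipeline` and `capture_cycles` are hypothetical. It assumes the same convention as the text: a command issued in cycle 0 lands in bit 0 one cycle later, and the tap sits at bit `cl - 1`.

```python
class ClPipeline:
    """Behavioral model of the single shift register tracking in-flight reads."""

    def __init__(self, cl: int, width: int = 128):
        if not 1 <= cl <= width:
            raise ValueError("cl must be in 1..width")
        self.cl = cl
        self.mask = (1 << width) - 1
        self.pipe = 0  # bit k set => a READ was issued k+1 cycles ago

    def capture_expected(self) -> bool:
        """Tap the register at position cl-1."""
        return bool((self.pipe >> (self.cl - 1)) & 1)

    def clock(self, rd_cmd: bool) -> None:
        """One rising edge: shift toward the MSB, insert rd_cmd at bit 0."""
        self.pipe = ((self.pipe << 1) | int(rd_cmd)) & self.mask


def capture_cycles(cmd_cycles, cl, n_cycles):
    """Return the cycle numbers in which capture_expected is high."""
    pipe = ClPipeline(cl)
    hits = []
    for cycle in range(n_cycles):
        if pipe.capture_expected():
            hits.append(cycle)
        pipe.clock(cycle in cmd_cycles)
    return hits


# Two reads issued 4 cycles apart share one register.
print(capture_cycles({0, 4}, cl=70, n_cycles=80))  # [70, 74]
```

Both reads are tracked by the same register, which is the point of the design: storage grows with the maximum CL, not with the number of in-flight reads.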
The tap position is set at runtime through i_cl[7:0], so no recompilation is needed.

In synchronous DDR memory, data is transferred on both the rising and falling edges of the clock. This means a DQ bit can change every 250 ps at 2 GHz. At such speeds, the controller's internal clock has accumulated enough jitter and skew relative to the DRAM's output that sampling DQ on the controller clock is unreliable.
HBM3 solves this by having the DRAM output a differential DQS (Data Strobe) signal alongside the DQ bits. On reads, DQS leaves the DRAM edge-aligned with DQ, so both toggle together. The PHY implements a DLL (Delay-Locked Loop) that locks to DQS and shifts its phase by 90° to position the capture edge at the center of the DQ eye, where setup and hold margins are at their maximum.
The controller module does not implement DQS capture directly — that is done in the hardened PHY. Instead, the PHY presents two signals to the controller:
- i_dq_in[31:0]: the captured 32-bit DQ word, already latched on DQS edges
- i_dqs_valid: a synchronous flag in the controller clock domain, high for exactly 4 cycles (BL4) when DQ data is valid

The controller's capture logic simply samples i_dq_in on every cycle where i_dqs_valid is high, using the beat counter to assemble the 128-bit burst.
Before the first DQS toggle, the DRAM holds DQS low for 1 or 2 tCK (read preamble). This gives the PHY DQS receiver time to enable. The i_dqs_valid signal presented to this controller module is already preamble-compensated by the PHY — it asserts on the first true data beat, not on the preamble.
The burst deserializer is the inverse of the write path serializer. It receives four consecutive 32-bit beats on i_dq_in and assembles them into a single 128-bit word. A 2-bit beat counter (beat_cnt) tracks position within the BL4 window. The mapping is:
| Beat | i_dq_in received | Packed into |
|---|---|---|
| 0 (first) | DQ[31:0] | rd_data[31:0] |
| 1 | DQ[31:0] | rd_data[63:32] |
| 2 | DQ[31:0] | rd_data[95:64] |
| 3 (last) | DQ[31:0] | rd_data[127:96] |
On the rising edge that sees beat 3 (beat_cnt == 2'b11 and i_dqs_valid), the assembled 128-bit word is pushed into the Read FIFO and beat_cnt resets to 0. The FIFO push completes in a single cycle — no stall is needed as long as the read FIFO is not full.
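The packing order in the table can be sanity-checked with a small behavioral model. This is a Python sketch rather than RTL; `assemble_bursts` is a hypothetical helper that takes `(i_dqs_valid, i_dq_in)` pairs, one per clock cycle.

```python
def assemble_bursts(beats, bl=4, dq_w=32):
    """Pack consecutive valid beats into BL*DQ_W-bit words, first beat in the LSBs."""
    words = []
    buf = 0
    beat_cnt = 0
    for valid, dq in beats:
        if not valid:
            continue  # state holds while i_dqs_valid is low
        buf |= (dq & ((1 << dq_w) - 1)) << (beat_cnt * dq_w)
        if beat_cnt == bl - 1:
            words.append(buf)  # last beat: push the assembled word
            buf = 0
            beat_cnt = 0
        else:
            beat_cnt += 1
    return words


beats = [(True, 0x9ABCDEF0), (True, 0x12345678),
         (True, 0xCAFEBABE), (True, 0xDEADBEEF)]
print(hex(assemble_bursts(beats)[0]))
# 0xdeadbeefcafebabe123456789abcdef0
```

Beat 0 ends up in bits [31:0] and beat 3 in bits [127:96], matching the mapping table above.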
Watch for the case where o_rfifo_empty is chronically 0 (indicating the host is consuming slowly).

The read FIFO sits between the burst deserializer and the host data bus. With CL=70 cycles and typical HBM3 pseudo-channel bandwidths supporting 4–8 reads in flight simultaneously, the FIFO must accommodate multiple assembled bursts while the host is busy with other work.
The host reads the head entry on o_rd_data with o_rd_valid high.

A depth of 4 provides adequate buffering for burst read sequences. The scheduler must track the in-flight read count and cap it at the RFIFO depth to prevent overflow. In a full controller implementation, the read data return path includes a request ID tag so that out-of-order responses can be reordered before delivery to the host; this module implements in-order capture only.
o_rd_valid asserts when the read FIFO contains at least one valid entry, and o_rd_data then holds the word taken from the head of the FIFO. A single-cycle pop advances the read pointer. o_rd_valid and o_rd_data are registered for clean hold times at the host interface. o_rfifo_empty is driven combinationally from the pointer comparison, so o_rd_valid trails !o_rfifo_empty by one cycle rather than being its exact complement.
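The full/empty logic relies on pointers that carry one extra wrap bit on top of the index bits. The Python sketch below models that scheme under the assumption of a power-of-two depth; the class name `ReadFifo` is hypothetical and the model ignores the one-cycle output register.

```python
class ReadFifo:
    """Behavioral FIFO using pointers with one extra wrap bit."""

    def __init__(self, depth: int = 4):
        if depth < 2 or depth & (depth - 1):
            raise ValueError("depth must be a power of two >= 2")
        self.depth = depth
        self.mem = [0] * depth
        self.wr_ptr = 0  # counts modulo 2*depth: index bits plus wrap bit
        self.rd_ptr = 0

    @property
    def empty(self) -> bool:
        return self.wr_ptr == self.rd_ptr

    @property
    def full(self) -> bool:
        # Same index bits, different wrap bit.
        return (self.wr_ptr ^ self.rd_ptr) == self.depth

    def push(self, word) -> bool:
        if self.full:
            return False  # word is dropped, as in the guarded RTL write
        self.mem[self.wr_ptr % self.depth] = word
        self.wr_ptr = (self.wr_ptr + 1) % (2 * self.depth)
        return True

    def pop(self):
        if self.empty:
            return None
        word = self.mem[self.rd_ptr % self.depth]
        self.rd_ptr = (self.rd_ptr + 1) % (2 * self.depth)
        return word


fifo = ReadFifo(4)
for w in range(5):
    print(w, fifo.push(w))  # the fifth push returns False
```

Without the wrap bit, equal pointers could mean either full or empty; the extra bit is what lets all `depth` entries be used.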
Complete synthesizable module, ready to paste into Vivado, Quartus, or VCS.
```verilog
// =============================================================
// hbm3_read_path.v
// HBM3 Read Data Path - Module 6
// Phase 2 of the HBM3 Controller Build series
// EcrioniX · https://ecrionix.org/hbm3-controller/read-path/
// =============================================================
// Parameters
//   RFIFO_DEPTH : Read FIFO depth (power of 2, default 4)
//   BL          : Burst Length (fixed 4 for HBM3)
//   DQ_W        : DQ bus width per pseudo-channel (32 bits)
//   CL_TOL      : Latency error tolerance in cycles (default 2)
// =============================================================
module hbm3_read_path #(
    parameter RFIFO_DEPTH = 4,   // must be power of 2 (>= 2)
    parameter BL          = 4,   // burst length, HBM3 fixed BL4
    parameter DQ_W        = 32,  // DQ bus width
    parameter CL_TOL      = 2    // +/- cycles for latency error flag
) (
    // clock / reset
    input  wire         i_clk,
    input  wire         i_rst_n,
    // read command interface
    input  wire         i_rd_cmd,      // read command pulse from scheduler
    input  wire [7:0]   i_cl,          // CAS Latency in cycles (default 70)
    // DQ bus inputs from PHY
    input  wire [31:0]  i_dq_in,       // DQ data from PHY (one beat)
    input  wire         i_dqs_valid,   // PHY: DQ is valid this cycle
    // host read data interface
    output reg  [127:0] o_rd_data,     // assembled 128-bit burst
    output reg          o_rd_valid,    // read data valid to host
    output wire         o_rfifo_empty, // read FIFO has no data
    output reg          o_latency_err  // DQS arrived outside CL window
);

    // ---------------------------------------------
    // Local parameters
    // ---------------------------------------------
    localparam       PIPE_W  = 128;                      // shift register width (max CL)
    localparam [7:0] CL_MAX  = 8'd128;                   // must equal PIPE_W
    localparam       PTR_W   = $clog2(RFIFO_DEPTH) + 1;  // extra bit for full/empty
    localparam       DEPTH_W = $clog2(RFIFO_DEPTH);

    // ---------------------------------------------
    // CL Pipeline: 128-bit shift register
    // ---------------------------------------------
    reg  [PIPE_W-1:0] cl_pipe;
    wire              capture_expected;  // fires CL cycles after each rd_cmd

    always @(posedge i_clk or negedge i_rst_n) begin
        if (!i_rst_n) cl_pipe <= {PIPE_W{1'b0}};
        else          cl_pipe <= {cl_pipe[PIPE_W-2:0], i_rd_cmd};
    end

    // Tap the pipeline at position i_cl-1, clamped to 0..PIPE_W-1
    wire [7:0] cl_eff = (i_cl == 8'd0)  ? 8'd1   :
                        (i_cl > CL_MAX) ? CL_MAX : i_cl;
    wire [7:0] cl_tap = cl_eff - 8'd1;
    assign capture_expected = cl_pipe[cl_tap[6:0]];

    // ---------------------------------------------
    // Latency Error Detection
    // cl_window marks positions cl_tap-CL_TOL .. cl_tap+CL_TOL+BL-1,
    // i.e. where a read's tracking bit sits while its beats arrive
    // within tolerance. A valid beat with no tracking bit inside the
    // window is flagged. Other in-flight reads cannot cause a false
    // flag. Requires i_cl + CL_TOL + BL - 1 <= PIPE_W.
    // ---------------------------------------------
    reg [PIPE_W-1:0] cl_window;
    integer j, idx;
    always @(*) begin
        cl_window = {PIPE_W{1'b0}};
        for (j = 0; j < 2*CL_TOL + BL; j = j + 1) begin
            idx = $signed({1'b0, cl_tap}) - CL_TOL + j;
            if (idx >= 0 && idx < PIPE_W)
                cl_window[idx] = 1'b1;
        end
    end

    always @(posedge i_clk or negedge i_rst_n) begin
        if (!i_rst_n) o_latency_err <= 1'b0;
        else          o_latency_err <= i_dqs_valid && !(|(cl_pipe & cl_window));
    end

    // ---------------------------------------------
    // Read Burst Deserializer
    // ---------------------------------------------
    reg  [127:0] capture_buf;
    reg  [1:0]   beat_cnt;
    reg          capture_active;

    // capture_expected is used combinationally so that beat 0, which
    // arrives in the same cycle as capture_expected, is not missed.
    wire capture_en = i_dqs_valid && (capture_active || capture_expected);
    wire push_fifo  = capture_en && (beat_cnt == 2'd3);  // last beat

    always @(posedge i_clk or negedge i_rst_n) begin
        if (!i_rst_n) begin
            capture_buf    <= 128'b0;
            beat_cnt       <= 2'b0;
            capture_active <= 1'b0;
        end else begin
            if (capture_en) begin
                case (beat_cnt)
                    2'd0: capture_buf[31:0]   <= i_dq_in;
                    2'd1: capture_buf[63:32]  <= i_dq_in;
                    2'd2: capture_buf[95:64]  <= i_dq_in;
                    2'd3: capture_buf[127:96] <= i_dq_in;
                endcase
                if (beat_cnt == 2'd3) begin
                    beat_cnt       <= 2'b0;
                    capture_active <= 1'b0;
                end else begin
                    beat_cnt <= beat_cnt + 1'b1;
                end
            end
            // Arm last so that a new read arming on the final beat of
            // the previous burst is not lost.
            if (capture_expected)
                capture_active <= 1'b1;
        end
    end

    // ---------------------------------------------
    // Read FIFO: depth = RFIFO_DEPTH, width = 128
    // ---------------------------------------------
    reg [127:0]     rfifo_mem [0:RFIFO_DEPTH-1];
    reg [PTR_W-1:0] rfifo_wr_ptr;
    reg [PTR_W-1:0] rfifo_rd_ptr;

    wire rfifo_empty_w = (rfifo_wr_ptr == rfifo_rd_ptr);
    wire rfifo_full_w  = (rfifo_wr_ptr[PTR_W-1]     != rfifo_rd_ptr[PTR_W-1]) &&
                         (rfifo_wr_ptr[DEPTH_W-1:0] == rfifo_rd_ptr[DEPTH_W-1:0]);

    assign o_rfifo_empty = rfifo_empty_w;

    // FIFO memory write on the last beat. Beat 3 is still on i_dq_in,
    // so it goes to bits [127:96] and beats 0..2 come from capture_buf.
    always @(posedge i_clk) begin
        if (push_fifo && !rfifo_full_w)
            rfifo_mem[rfifo_wr_ptr[DEPTH_W-1:0]] <= {i_dq_in, capture_buf[95:0]};
    end

    // Pointers and host output. Both pointers are driven from this one
    // block so that each register has a single driver.
    always @(posedge i_clk or negedge i_rst_n) begin
        if (!i_rst_n) begin
            o_rd_data    <= 128'b0;
            o_rd_valid   <= 1'b0;
            rfifo_wr_ptr <= {PTR_W{1'b0}};
            rfifo_rd_ptr <= {PTR_W{1'b0}};
        end else begin
            if (push_fifo && !rfifo_full_w)
                rfifo_wr_ptr <= rfifo_wr_ptr + 1'b1;

            if (!rfifo_empty_w) begin
                o_rd_data    <= rfifo_mem[rfifo_rd_ptr[DEPTH_W-1:0]];
                o_rd_valid   <= 1'b1;
                rfifo_rd_ptr <= rfifo_rd_ptr + 1'b1;
            end else begin
                o_rd_valid   <= 1'b0;
            end
        end
    end

endmodule // hbm3_read_path
```
The testbench drives READ commands, simulates DRAM by asserting i_dqs_valid and i_dq_in exactly CL cycles later, and uses SVA to verify o_rd_valid timing, FIFO integrity, and latency error detection.
```systemverilog
// ===========================================================
// tb_hbm3_read_path.sv - Self-checking SV testbench
// ===========================================================
`timescale 1ps/1ps

module tb_hbm3_read_path;

    parameter CLK_PERIOD = 500;  // 500 ps = 2 GHz
    parameter CL         = 70;
    parameter BL         = 4;

    logic         i_clk       = 0;
    logic         i_rst_n     = 0;
    logic         i_rd_cmd    = 0;
    logic [7:0]   i_cl        = CL;
    logic [31:0]  i_dq_in     = 0;
    logic         i_dqs_valid = 0;
    wire  [127:0] o_rd_data;
    wire          o_rd_valid;
    wire          o_rfifo_empty;
    wire          o_latency_err;

    hbm3_read_path #(.RFIFO_DEPTH(4), .BL(4), .DQ_W(32), .CL_TOL(2)) dut (.*);

    always #(CLK_PERIOD/2) i_clk = !i_clk;

    // SVA: o_rd_valid is registered, so it must follow a non-empty FIFO
    property p_valid_follows_nonempty;
        @(posedge i_clk) disable iff (!i_rst_n)
            o_rd_valid |-> $past(!o_rfifo_empty);
    endproperty
    assert property (p_valid_follows_nonempty)
        else $error("SVA FAIL: o_rd_valid asserted without prior FIFO data");

    // SVA: o_latency_err must clear once DQS goes idle
    property p_err_clears;
        @(posedge i_clk) disable iff (!i_rst_n)
            !i_dqs_valid |=> !o_latency_err;
    endproperty
    assert property (p_err_clears)
        else $error("SVA FAIL: o_latency_err stuck high");

    // Scoreboard: every word on o_rd_data must match the next expected word
    logic [127:0] exp_q[$];
    logic [127:0] exp_word;
    int           errors = 0;

    always @(posedge i_clk) begin
        if (i_rst_n && o_rd_valid) begin
            if (exp_q.size() == 0) begin
                errors++;
                $error("Unexpected read data %0h", o_rd_data);
            end else begin
                exp_word = exp_q.pop_front();
                if (o_rd_data !== exp_word) begin
                    errors++;
                    $error("o_rd_data mismatch. Got %0h, expected %0h",
                           o_rd_data, exp_word);
                end
            end
        end
    end

    // o_latency_err is a pulse, so latch it for end-of-test checks
    logic err_seen = 0;
    always @(posedge i_clk) if (o_latency_err) err_seen <= 1'b1;

    // Task: issue one READ command and queue its expected data
    task automatic send_cmd(input logic [127:0] data);
        exp_q.push_back(data);
        @(posedge i_clk); i_rd_cmd <= 1'b1;
        @(posedge i_clk); i_rd_cmd <= 1'b0;
    endtask

    // Task: drive BL beats on the DQ bus, first beat = data[31:0]
    task automatic drive_burst(input logic [127:0] data);
        for (int b = 0; b < BL; b++) begin
            i_dqs_valid <= 1'b1;
            i_dq_in     <= data[32*b +: 32];
            @(posedge i_clk);
        end
        i_dqs_valid <= 1'b0;
        i_dq_in     <= '0;
    endtask

    localparam logic [127:0] D1 = 128'hDEAD_BEEF_CAFE_BABE_1234_5678_9ABC_DEF0;
    localparam logic [127:0] D2 = 128'hAAAA_BBBB_CCCC_DDDD_EEEE_FFFF_1111_2222;
    localparam logic [127:0] D3 = 128'h0123_4567_89AB_CDEF_FEDC_BA98_7654_3210;
    localparam logic [127:0] D4 = {4{32'h0BAD_DA7A}};

    initial begin
        $dumpfile("hbm3_read_path.vcd");
        $dumpvars(0, tb_hbm3_read_path);
        repeat (5) @(posedge i_clk);
        i_rst_n <= 1'b1;
        repeat (3) @(posedge i_clk);

        // Test 1: single read, data returns exactly CL cycles after the command
        $display("[T1] Single read transaction");
        send_cmd(D1);
        repeat (CL - 1) @(posedge i_clk);
        drive_burst(D1);
        repeat (5) @(posedge i_clk);

        // Test 2: two reads BL cycles apart, data returns back to back
        $display("[T2] Pipelined reads");
        fork
            begin
                send_cmd(D2);
                repeat (BL - 2) @(posedge i_clk);
                send_cmd(D3);
            end
            begin
                @(posedge i_clk);               // edge of the first command
                repeat (CL) @(posedge i_clk);
                drive_burst(D2);
                drive_burst(D3);
            end
        join
        repeat (10) @(posedge i_clk);
        if (err_seen) begin
            errors++;
            $error("[T1/T2 FAIL] latency error flagged on in-window data");
        end

        // Test 3: latency error (DQS arrives 5 cycles late)
        $display("[T3] Latency error detection, DQS arrives at CL+5");
        send_cmd(D4);
        repeat (CL - 1 + 5) @(posedge i_clk);   // 5 extra cycles
        drive_burst(D4);
        repeat (5) @(posedge i_clk);
        if (err_seen) $display("[T3 PASS] latency error flagged correctly");
        else begin
            errors++;
            $error("[T3 FAIL] latency error not flagged");
        end

        if (errors == 0 && exp_q.size() == 0)
            $display("[PASS] All tests complete");
        else
            $error("[FAIL] %0d error(s), %0d word(s) never returned",
                   errors, exp_q.size());
        $finish;
    end

endmodule
```
| Parameter | Symbol | Value (2 GHz) | Cycles | Description |
|---|---|---|---|---|
| CAS Latency | CL / RL | 35 ns | 70 | READ command to first DQ data valid |
| Burst Length | BL | 4 beats | 4 | Fixed per HBM3 pseudo-channel |
| Read Preamble | tRPRE | 0.5–1 ns | 1–2 | DQS low before first valid strobe |
| Read Postamble | tRPST | 0.5 ns | 1 | DQS low after last valid strobe |
| DQ output hold | tQH | 0.38 tCK | — | DQ held valid after DQS edge |
| DQ output valid | tDQSQ | 70 ps max | — | DQ to DQS skew at DRAM output |
| Read-to-Read (different bank group) | tCCD_S | 4 ns | 8 | Min gap between consecutive READs |
| Row Active Time | tRAS | 32 ns | 64 | Min time row stays open after ACT |
| Port | Dir | Width | Description |
|---|---|---|---|
| i_clk | Input | 1 | Controller clock (2 GHz nominal) |
| i_rst_n | Input | 1 | Active-low asynchronous reset |
| i_rd_cmd | Input | 1 | Read command pulse (1 cycle) from scheduler |
| i_cl | Input | 8 | CAS Latency in cycles (default 70; runtime-configurable) |
| i_dq_in | Input | 32 | DQ bus input from PHY — one 32-bit beat per cycle |
| i_dqs_valid | Input | 1 | PHY asserts high for exactly 4 cycles when DQ data is valid |
| o_rd_data | Output | 128 | Assembled 128-bit BL4 burst word to host |
| o_rd_valid | Output | 1 | Read data valid — host may sample o_rd_data |
| o_rfifo_empty | Output | 1 | Read FIFO contains no valid entries |
| o_latency_err | Output | 1 | DQS arrived outside CL ± CL_TOL window; retrain required |
CL is the total latency in clock cycles from the DRAM receiving a READ command to the controller receiving the first valid data beat. At 2 GHz, CL=70 equals 35 ns. The main contributors are: sense amplifier recovery time (sense amps must detect sub-millivolt differential on bitlines — ~10 ns), column path delay through DRAM array logic (~6 ns), output driver enable (~4 ns), TSV and package propagation (~4 ns), and PHY DQS alignment plus margins (~11 ns). CL is fixed for a given die and speed bin — it cannot be reduced through training.
Sampling DQ on the controller clock alone is not reliable. At 2 GHz, each DQ bit window is only 250 ps. The DRAM's internal clock phase, the package trace delay, and the controller's internal clock may differ by 100–200 ps. Sampling on the controller clock would place the sample point randomly within the DQ eye with no guaranteed margin. DQS is generated by the DRAM die with a known phase relationship to DQ, so the PHY can lock onto DQS using a DLL and derive a sample point with full setup and hold margin.
Each i_rd_cmd pulse sets a '1' at bit 0 of the shift register. This bit shifts one position toward the MSB every clock cycle. After CL cycles it reaches bit position i_cl - 1 and fires capture_expected. Since multiple '1' bits can exist in the register simultaneously (one per in-flight read), the register tracks all in-flight reads without a per-read counter. The only constraint is that reads must be spaced at least BL = 4 cycles apart so that their capture windows do not overlap on the DQ bus, and the scheduler's tCCD_S spacing already guarantees that.
o_latency_err means DQS arrived more than CL_TOL = 2 cycles away from the expected capture cycle. The most common causes are temperature drift shifting the PHY DLL lock point and a training calibration that has gone stale. The recommended response is: (1) pause new READ commands, (2) drain the read FIFO, (3) trigger PHY read leveling re-calibration, (4) update i_cl if the calibrated latency changed, then resume. In some SoC implementations this is handled by a hardware state machine in the initialization controller.
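The tolerance rule itself reduces to one comparison. The Python sketch below states it for the first beat of a burst; `latency_error` is a hypothetical helper that models the intended rule, not the cycle-by-cycle RTL check.

```python
def latency_error(first_beat_cycle: int, cl: int, cl_tol: int = 2) -> bool:
    """True if the first beat lands more than cl_tol cycles from the expected cycle.

    Cycles are counted from the READ command, so the expected cycle is cl.
    """
    return abs(first_beat_cycle - cl) > cl_tol


for offset in (-3, -2, 0, 2, 5):
    print(offset, latency_error(70 + offset, cl=70))
```

With the defaults, offsets of -2 through +2 cycles are accepted, while -3 and +5 are flagged.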
The read FIFO depth can be increased, because RFIFO_DEPTH is a parameter. Each entry is 128 bits (16 bytes), so depth 4 takes 512 bits (64 bytes) of storage and depth 8 doubles this to 128 bytes. For designs with high read bandwidth and a slow host data bus (e.g. PCIe Gen4 downstream from HBM3), a deeper RFIFO keeps host backpressure from starving the DRAM pipeline. The pointer width and full/empty logic scale automatically through $clog2.
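The storage and pointer-width figures follow directly from the depth. A small Python sketch of that arithmetic, assuming a power-of-two depth and the 128-bit entry width used here (`rfifo_sizing` is a hypothetical helper):

```python
def rfifo_sizing(depth: int, word_bits: int = 128) -> dict:
    """Storage and pointer widths for a power-of-two read FIFO depth."""
    if depth < 2 or depth & (depth - 1):
        raise ValueError("depth must be a power of two >= 2")
    index_bits = depth.bit_length() - 1  # equals $clog2(depth) for powers of two
    return {
        "storage_bits": depth * word_bits,
        "storage_bytes": depth * word_bits // 8,
        "index_bits": index_bits,        # DEPTH_W in the module
        "ptr_bits": index_bits + 1,      # PTR_W: one extra wrap bit
    }


print(rfifo_sizing(4))
print(rfifo_sizing(8))
```

Doubling the depth doubles the storage but adds only one bit to each pointer.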