3D-stacked HBM3 dies run hot. This module builds the thermal monitor controller: MR4 temperature readback, JEDEC derating thresholds, multi-level bandwidth throttle, dynamic refresh scaling, and emergency shutdown logic.
HBM3 is a 3D-stacked memory: multiple DRAM dies bonded vertically with through-silicon vias (TSVs). This architecture dramatically increases memory bandwidth density but also concentrates heat in a very small area.
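The key thermal challenges in HBM3 are high power density per unit area, high thermal resistance between the lower dies and the heatsink (heat must travel up through the silicon layers above), and degraded charge retention at elevated temperature, which calls for faster refresh.
The temperature monitor module addresses these by continuously reading the DRAM's internal temperature sensor, decoding the reading, applying JEDEC derating rules, and signaling the scheduler to reduce bandwidth or halt traffic when thresholds are exceeded.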
HBM3 DRAM exposes its die temperature via Mode Register 4 (MR4). The controller reads this register by issuing a Mode Register Read (MRR) command with the address field set to 4. The DRAM returns the MR4 contents on the DQ bus once the read latency has elapsed.
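MR4 bits [7:3] form a 5-bit temperature field (32 possible codes), of which only the lowest few are used in practice. This module receives the reading as the 8-bit i_temp_code and decodes codes 8'h00 through 8'h08; any other value, including 8'hFF, is treated as over-temperature. The bands are not uniform: they are about 5°C wide at the cold end and near the derating thresholds, and up to 20°C wide in the mid-range, as the table below shows.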
JEDEC recommends polling MR4 at least every 50 ms under normal conditions, and more frequently if temperature is near a threshold. The i_temp_valid input tells this module when a fresh MR4 reading is available.
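The polling schedule described above could be driven by a small timer. The following is a minimal sketch under stated assumptions, not part of the original module set: the module name mr4_poll_timer, the CLK_HZ and FAST_POLL_MS parameters, and the i_fast_poll input are all hypothetical and chosen only for illustration.

```verilog
// Hypothetical MR4 poll timer (illustrative sketch only).
// Emits a one-cycle request pulse every POLL_MS milliseconds, or every
// FAST_POLL_MS milliseconds while i_fast_poll is high.
module mr4_poll_timer #(
    parameter integer CLK_HZ       = 500_000_000, // assumed controller clock
    parameter integer POLL_MS      = 50,          // normal polling interval
    parameter integer FAST_POLL_MS = 5            // assumed interval near a threshold
)(
    input  wire i_clk,
    input  wire i_rst_n,
    input  wire i_fast_poll,  // 1 = temperature is near a threshold
    output reg  o_mrr_req     // one-cycle pulse: issue an MRR to MR4
);
    // Clock cycles per polling interval
    localparam integer NORMAL_CYCLES = (CLK_HZ / 1000) * POLL_MS;
    localparam integer FAST_CYCLES   = (CLK_HZ / 1000) * FAST_POLL_MS;

    reg  [31:0] cnt;
    wire [31:0] limit = i_fast_poll ? FAST_CYCLES : NORMAL_CYCLES;

    always @(posedge i_clk or negedge i_rst_n) begin
        if (!i_rst_n) begin
            cnt       <= 32'd0;
            o_mrr_req <= 1'b0;
        end else if (cnt >= limit - 1) begin
            // ">=" also handles a switch from the long to the short interval
            cnt       <= 32'd0;
            o_mrr_req <= 1'b1;
        end else begin
            cnt       <= cnt + 32'd1;
            o_mrr_req <= 1'b0;
        end
    end
endmodule
```

One possible hookup is to tie i_fast_poll to (o_throttle_level != 2'b00), so that polling speeds up as soon as any throttle level is active. The command path that turns o_mrr_req into an actual MRR and later raises i_temp_valid is outside this sketch.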
| temp_code[7:0] | Decoded Temp (°C) | Throttle Level | Refresh Scale | BW Limit | Status |
|---|---|---|---|---|---|
| 8'h00 | 0–5 | 00 Normal | 1x (64ms) | 100% | Normal |
| 8'h01 | 5–10 | 00 Normal | 1x | 100% | Normal |
| 8'h02 | 10–20 | 00 Normal | 1x | 100% | Normal |
| 8'h03 | 20–35 | 00 Normal | 1x | 100% | Normal |
| 8'h04 | 35–55 | 00 Normal | 1x | 100% | Normal |
| 8'h05 | 55–75 | 00 Normal | 1x | 100% | Normal |
| 8'h06 | 75–85 | 01 Mild | 0.5x (32ms) | 75% | Warm |
| 8'h07 | 85–90 | 10 Moderate | 0.25x (16ms) | 50% | Derating1 |
| 8'h08 | 90–95 | 10 Moderate | 0.25x (16ms) | 50% | Derating2 |
| 8'hFF | >95 | 11 Emergency | 0.25x (max rate) | 0% | SHUTDOWN |
The o_refi_scale[1:0] output controls the refresh engine: 2'b00 = 1x (normal tREFI), 2'b01 = 0.5x (halved tREFI = 2x refresh rate), 2'b10 = 0.25x (quarter tREFI = 4x refresh rate).
The throttle controller is a clocked decoder: on each valid reading it maps the decoded temperature to a throttle level, and from that level derives the refresh scale and bandwidth limit. The o_throttle_level[1:0] output goes to the AXI4 interface and scheduler to back-pressure incoming transactions.
A hysteresis counter (5 bits wide in the RTL) prevents rapid toggling near thresholds. When temperature crosses a threshold going up, the new throttle level activates immediately. When temperature drops back down, the reading must stay in the lower band for HYST_COUNT polling intervals (16 by default) before the throttle level de-escalates.
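Normal HBM3 tREFI is 3.9 µs (a 64 ms refresh window spread across 16K rows: 64 ms / 16384 ≈ 3.9 µs). At elevated temperature, the controller must issue refresh commands more frequently. This module halves tREFI at the mild throttle level and quarters it at the moderate and emergency levels.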
The refresh engine (a separate module) reads o_refi_scale and adjusts its refresh timer period accordingly. When refi_scale = 0.5x, the refresh timer fires twice as often, consuming more bus bandwidth but ensuring data integrity.
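The period scaling described above can be sketched as a timer whose terminal count is shifted right by the scale code. This is an illustrative sketch, not the actual refresh engine: the module name refresh_timer_sketch and the TREFI_CYCLES parameter are assumptions, the default of 1953 cycles assumes a 500 MHz clock (3.9 µs / 2 ns), and the unused code 2'b11 is treated as 0.25x.

```verilog
// Hypothetical refresh timer (illustrative sketch only).
// Emits a one-cycle refresh request every scaled tREFI period.
module refresh_timer_sketch #(
    parameter integer TREFI_CYCLES = 1953 // ~3.9 us at an assumed 500 MHz clock
)(
    input  wire       i_clk,
    input  wire       i_rst_n,
    input  wire [1:0] i_refi_scale, // from hbm3_temp_monitor o_refi_scale
    output reg        o_ref_req     // one-cycle pulse: issue a refresh
);
    reg [15:0] period;
    reg [15:0] cnt;

    // Select the timer period from the scale code
    always @(*) begin
        case (i_refi_scale)
            2'b00:   period = TREFI_CYCLES;      // 1x: normal tREFI
            2'b01:   period = TREFI_CYCLES >> 1; // 0.5x: 2x refresh rate
            default: period = TREFI_CYCLES >> 2; // 0.25x: 4x refresh rate
                                                 // (2'b11 handled here by assumption)
        endcase
    end

    always @(posedge i_clk or negedge i_rst_n) begin
        if (!i_rst_n) begin
            cnt       <= 16'd0;
            o_ref_req <= 1'b0;
        end else if (cnt >= period - 16'd1) begin
            // ">=" keeps the timer safe when the period shrinks mid-count
            cnt       <= 16'd0;
            o_ref_req <= 1'b1;
        end else begin
            cnt       <= cnt + 16'd1;
            o_ref_req <= 1'b0;
        end
    end
endmodule
```

With the default parameter, the three periods are 1953, 976, and 488 cycles. Because the shift truncates, the scaled periods come out slightly shorter than exact halves and quarters, which errs on the side of refreshing too often rather than too rarely.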
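When the decoded temperature reaches the emergency threshold (THRESH_EMERGENCY, 95°C by default; in practice temp_code = 8'hFF or any code the decoder does not recognize), the controller enters emergency shutdown immediately, with no hysteresis. It drives o_throttle_level to 2'b11, drops o_bw_limit to 0 so the scheduler halts traffic, holds o_refi_scale at 2'b10 (the fastest refresh setting), and asserts o_tcase_alert.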
The system must intervene, typically by reducing the GPU/SoC clock frequency (which lowers the HBM3 access rate) or increasing fan speed, before the controller re-enables traffic. The alert is sticky: it de-asserts only when the last decoded temperature is below the moderate threshold (THRESH_MODERATE, 85°C by default) AND software issues a clear through i_alert_clr.
// ============================================================
// hbm3_temp_monitor.v — HBM3 Temperature Monitor & Throttle
// EcrioniX · HBM3 Controller Build · Module 14
// Phase 4: Power and Thermal Management
// ============================================================
// Reads DRAM temperature from MR4 (8 temperature bands),
// decodes to degrees C, applies JEDEC derating thresholds,
// and outputs throttle level, refresh scale, and BW limit.
// Synthesizable RTL.
// ============================================================
module hbm3_temp_monitor #(
// Temperature thresholds (in decoded degrees C)
parameter THRESH_MILD = 8'd75, // above -> mild throttle
parameter THRESH_MODERATE = 8'd85, // above -> moderate throttle
parameter THRESH_EMERGENCY= 8'd95, // above -> emergency shutdown
// Hysteresis: polling intervals before de-escalation
parameter HYST_COUNT = 16
)(
input wire i_clk,
input wire i_rst_n,
// Temperature band code from MR4 readback (decoded by decode_temp below)
input wire [7:0] i_temp_code,
input wire i_temp_valid, // new reading available
// Software alert clear (write 1 to de-assert tcase_alert)
input wire i_alert_clr,
// Decoded outputs
output reg [7:0] o_temp_degc, // temperature in degrees C
// Throttle and derating outputs
output reg [1:0] o_throttle_level, // 00=normal,01=mild,10=moderate,11=emergency
output reg [1:0] o_refi_scale, // 00=1x, 01=0.5x, 10=0.25x
output reg o_tcase_alert, // JEDEC limit exceeded
output reg [7:0] o_bw_limit // max BW as % of nominal
);
// ============================================================
// Temp code to degrees C decode table (MR4 encoding)
// Band boundaries per JEDEC JESD238 Table 14
// ============================================================
function automatic [7:0] decode_temp;
input [7:0] code;
begin
case (code)
8'h00: decode_temp = 8'd3;
8'h01: decode_temp = 8'd8;
8'h02: decode_temp = 8'd15;
8'h03: decode_temp = 8'd28;
8'h04: decode_temp = 8'd45;
8'h05: decode_temp = 8'd65;
8'h06: decode_temp = 8'd80;
8'h07: decode_temp = 8'd88;
8'h08: decode_temp = 8'd93;
default: decode_temp = 8'd100; // overtemp / unknown
endcase
end
endfunction
// ============================================================
// Hysteresis counter (count-down before de-escalation)
// ============================================================
reg [4:0] hyst_cnt;
// ============================================================
// Throttle state register (latched to provide hysteresis)
// ============================================================
reg [1:0] throttle_latch;
// ============================================================
// Main monitor logic
// ============================================================
// Decode the incoming code combinationally so the threshold compares
// see the fresh reading rather than the previously registered value.
wire [7:0] temp_now = decode_temp(i_temp_code);

// Next-state values for the throttle latch and hysteresis counter
reg [1:0] throttle_next;
reg [4:0] hyst_next;

always @(*) begin
    throttle_next = throttle_latch;
    hyst_next     = hyst_cnt;
    if (temp_now >= THRESH_EMERGENCY) begin
        // Emergency: assert immediately, no hysteresis
        throttle_next = 2'b11;
        hyst_next     = 5'd0;
    end else if (temp_now >= THRESH_MODERATE) begin
        if (throttle_latch < 2'b10) begin
            throttle_next = 2'b10;
            hyst_next     = 5'd0;
        end
    end else if (temp_now >= THRESH_MILD) begin
        if (throttle_latch < 2'b01) begin
            throttle_next = 2'b01;
            hyst_next     = 5'd0;
        end else if (throttle_latch > 2'b01) begin
            // De-escalation with hysteresis
            if (hyst_cnt == HYST_COUNT - 1) begin
                throttle_next = 2'b01;
                hyst_next     = 5'd0;
            end else hyst_next = hyst_cnt + 5'd1;
        end
    end else begin
        // Below THRESH_MILD: de-escalate with hysteresis
        if (throttle_latch > 2'b00) begin
            if (hyst_cnt == HYST_COUNT - 1) begin
                throttle_next = 2'b00;
                hyst_next     = 5'd0;
            end else hyst_next = hyst_cnt + 5'd1;
        end
    end
end

always @(posedge i_clk or negedge i_rst_n) begin
    if (!i_rst_n) begin
        o_temp_degc      <= 8'd0;
        o_throttle_level <= 2'b00;
        o_refi_scale     <= 2'b00;
        o_tcase_alert    <= 1'b0;
        o_bw_limit       <= 8'd100;
        throttle_latch   <= 2'b00;
        hyst_cnt         <= 5'd0;
    end else begin
        // Handle alert clear (checked against the last registered reading)
        if (i_alert_clr && o_temp_degc < THRESH_MODERATE)
            o_tcase_alert <= 1'b0;
        if (i_temp_valid) begin
            o_temp_degc      <= temp_now;
            throttle_latch   <= throttle_next;
            hyst_cnt         <= hyst_next;
            // An emergency reading overrides a same-cycle clear
            if (temp_now >= THRESH_EMERGENCY)
                o_tcase_alert <= 1'b1;
            // Drive outputs from the next-state value so they update
            // in the same cycle as the latch, not one reading later
            o_throttle_level <= throttle_next;
            case (throttle_next)
                2'b00: begin
                    o_refi_scale <= 2'b00; // 1x normal
                    o_bw_limit   <= 8'd100;
                end
                2'b01: begin
                    o_refi_scale <= 2'b01; // 0.5x (2x refresh)
                    o_bw_limit   <= 8'd75;
                end
                2'b10: begin
                    o_refi_scale <= 2'b10; // 0.25x (4x refresh)
                    o_bw_limit   <= 8'd50;
                end
                default: begin             // 2'b11 emergency
                    o_refi_scale <= 2'b10; // max refresh
                    o_bw_limit   <= 8'd0;
                end
            endcase
        end
    end
end
endmodule
// ============================================================
// tb_hbm3_temp_monitor.sv — Testbench for Temperature Monitor
// EcrioniX · HBM3 Controller Build · Module 14
// ============================================================
`timescale 1ns/1ps
module tb_hbm3_temp_monitor;
// DUT signals
logic clk, rst_n;
logic [7:0] temp_code;
logic temp_valid, alert_clr;
logic [7:0] temp_degc;
logic [1:0] throttle_level, refi_scale;
logic tcase_alert;
logic [7:0] bw_limit;
// Instantiate DUT
hbm3_temp_monitor #(
.THRESH_MILD(8'd75),
.THRESH_MODERATE(8'd85),
.THRESH_EMERGENCY(8'd95),
.HYST_COUNT(4) // short for simulation
) dut (
.i_clk(clk), .i_rst_n(rst_n),
.i_temp_code(temp_code),.i_temp_valid(temp_valid),
.i_alert_clr(alert_clr),
.o_temp_degc(temp_degc),.o_throttle_level(throttle_level),
.o_refi_scale(refi_scale),.o_tcase_alert(tcase_alert),
.o_bw_limit(bw_limit)
);
// 500MHz clock
initial clk = 0;
always #1 clk = ~clk;
integer errors = 0;
task send_temp(input [7:0] code);
// Nonblocking drives avoid racing the DUT, which samples on the same edge
@(posedge clk);
temp_code <= code;
temp_valid <= 1'b1;
@(posedge clk);
temp_valid <= 1'b0;
repeat(4) @(posedge clk); // let outputs settle
endtask
task check_throttle(input [1:0] expected, input string label);
if (throttle_level !== expected) begin
$error("FAIL [%s]: throttle_level=%0b expected=%0b", label, throttle_level, expected);
errors++;
end else $display("[%0t] PASS [%s]: throttle=%0b bw=%0d%%", $time, label, throttle_level, bw_limit);
endtask
initial begin
$dumpfile("tb_temp_monitor.vcd");
$dumpvars(0, tb_hbm3_temp_monitor);
rst_n = 0; temp_code = '0; temp_valid = 0; alert_clr = 0;
repeat(10) @(posedge clk);
rst_n = 1;
repeat(5) @(posedge clk);
// TEST 1: Normal temperature (code=5, ~65°C)
$display("[%0t] TEST1: Normal operating temp", $time);
send_temp(8'h05);
check_throttle(2'b00, "65C-Normal");
// TEST 2: Warm temperature (code=6, ~80°C) — mild throttle
$display("[%0t] TEST2: Warm temp -> mild throttle", $time);
send_temp(8'h06);
check_throttle(2'b01, "80C-Mild");
if (bw_limit !== 8'd75) begin
$error("FAIL: bw_limit=%0d expected 75", bw_limit);
errors++;
end
// TEST 3: Hot temperature (code=7, ~88°C): moderate throttle
$display("[%0t] TEST3: Hot temp -> moderate throttle", $time);
send_temp(8'h07);
check_throttle(2'b10, "88C-Moderate"); // 88 >= THRESH_MODERATE: escalates immediately
if (refi_scale !== 2'b10) begin
$error("FAIL: refi_scale=%0b expected 10", refi_scale);
errors++;
end
// TEST 4: Critical temperature (code=8, ~93°C): still moderate
$display("[%0t] TEST4: Critical temp (93C) -> moderate", $time);
send_temp(8'h08);
check_throttle(2'b10, "93C-Moderate");
// TEST 5: Emergency (code=FF, >95°C)
$display("[%0t] TEST5: Emergency temp >95C -> shutdown", $time);
send_temp(8'hFF);
repeat(5) @(posedge clk);
if (throttle_level !== 2'b11) begin
$error("FAIL: Emergency throttle not set. Got %0b", throttle_level);
errors++;
end else $display("[%0t] PASS: Emergency shutdown activated", $time);
if (!tcase_alert) begin
$error("FAIL: tcase_alert not asserted during emergency");
errors++;
end else $display("[%0t] PASS: tcase_alert asserted", $time);
if (bw_limit !== 8'd0) begin
$error("FAIL: bw_limit should be 0 during emergency, got %0d", bw_limit);
errors++;
end
// TEST 6: Recovery from emergency
$display("[%0t] TEST6: Recovery, temp back to normal", $time);
send_temp(8'h04); // 45°C
// One-cycle clear pulse; nonblocking drive avoids racing the DUT
alert_clr <= 1; @(posedge clk); alert_clr <= 0;
// Need HYST_COUNT polls below threshold before de-escalation
repeat(4) send_temp(8'h04);
check_throttle(2'b00, "Recovery-Normal");
if (tcase_alert) begin
$error("FAIL: tcase_alert still set after clear at 45C");
errors++;
end
// Summary
repeat(20) @(posedge clk);
if (errors == 0)
$display("[%0t] ALL TESTS PASSED", $time);
else
$display("[%0t] %0d TEST(S) FAILED", $time, errors);
$finish;
end
initial begin
#100000;
$error("TIMEOUT");
$finish;
end
endmodule
HBM3 stacks multiple DRAM dies vertically with TSVs, dramatically increasing power density per unit area. Heat from lower dies must travel through silicon layers to reach the heatsink, creating high thermal resistance. At elevated temperatures, DRAM charge retention degrades (requiring faster refresh), and operation above JEDEC limits (85°C standard, 95°C extended) risks data loss or permanent device damage.
The DRAM reports its die temperature via Mode Register 4 (MR4). The controller issues a Mode Register Read (MRR) command with address 4, and the DRAM responds with a temperature band code on the DQ bus, which this module receives as the 8-bit i_temp_code. The bands range from about 5°C wide near the derating thresholds to about 20°C wide in the mid-range. JEDEC recommends polling MR4 at least every 50 ms under normal operation.
At elevated temperatures, DRAM capacitors discharge faster, so the controller must shorten tREFI (the refresh interval). The JEDEC-style rule is that tREFI is halved above 85°C, so the controller doubles its refresh rate, and quartered above 95°C (4x refresh rate). This module applies the same scaling one level earlier: it halves tREFI at the mild level (from 75°C) and quarters it at the moderate level (from 85°C). This is called Self-Refresh Calibration (SRC) or thermal refresh derating per JESD238.
The throttle controller implements four levels: Normal (below 75°C: full bandwidth, 1x refresh), Mild (75–85°C: 25% BW reduction, 2x refresh), Moderate (85–95°C: 50% BW reduction, 4x refresh), and Emergency shutdown (above 95°C: all traffic halted, tcase_alert asserted, zero BW limit). Hysteresis prevents rapid oscillation near thresholds.
The bw_limit output indicates maximum allowable bandwidth as a percentage of nominal. The scheduler reads this and applies a token-bucket rate limiter: it counts commands issued per refresh window and stops accepting new AXI transactions once the count reaches floor(bw_limit/100 × max_cmds_per_window). This provides smooth bandwidth reduction without abrupt traffic halts.
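The per-window counting described above could look like the following on the scheduler side. This is a minimal sketch under stated assumptions, not the actual scheduler: the module name bw_limiter_sketch, the MAX_CMDS_PER_WINDOW parameter, and the i_window_start, i_cmd_valid, and o_cmd_accept ports are hypothetical. It implements the fixed-window count from the text rather than a continuously refilling token bucket.

```verilog
// Hypothetical bandwidth limiter (illustrative sketch only).
// Accepts commands until the per-window budget is used up, where
// budget = floor(bw_limit / 100 * MAX_CMDS_PER_WINDOW).
module bw_limiter_sketch #(
    parameter integer MAX_CMDS_PER_WINDOW = 64 // assumed budget at 100%
)(
    input  wire       i_clk,
    input  wire       i_rst_n,
    input  wire       i_window_start, // pulse at the start of each refresh window
    input  wire       i_cmd_valid,    // a command is offered this cycle
    input  wire [7:0] i_bw_limit,     // from hbm3_temp_monitor o_bw_limit (0..100)
    output wire       o_cmd_accept    // 1 = budget remains, command may issue
);
    // Integer division truncates, which gives the floor() from the text.
    // The divisor is a constant, but a real design might prefer a small
    // lookup table over a divider.
    wire [15:0] budget = (i_bw_limit * MAX_CMDS_PER_WINDOW) / 100;

    reg [15:0] used; // commands accepted in the current window

    assign o_cmd_accept = (used < budget);

    always @(posedge i_clk or negedge i_rst_n) begin
        if (!i_rst_n)
            used <= 16'd0;
        else if (i_window_start)
            // A command accepted in this same cycle is charged to the
            // closing window, so the new window starts from zero.
            used <= 16'd0;
        else if (i_cmd_valid && o_cmd_accept)
            used <= used + 16'd1;
    end
endmodule
```

With the default of 64 commands per window, a bw_limit of 75 gives a budget of 48 commands and a bw_limit of 50 gives 32. A bw_limit of 0 gives a budget of 0, so o_cmd_accept stays low for the whole emergency period.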