Phase 4 · Module 14

HBM3 Temperature Monitor & Throttle

3D-stacked HBM3 dies run hot. This module builds the thermal monitor controller: MR4 temperature readback, JEDEC derating thresholds, multi-level bandwidth throttle, dynamic refresh scaling, and emergency shutdown logic.

hbm3_temp_monitor.v · tb_hbm3_temp_monitor.sv · Synthesizable RTL · JEDEC JESD238 · Phase 4

1. Why Thermal Management Is Critical in HBM3

HBM3 is a 3D-stacked memory: multiple DRAM dies bonded vertically with through-silicon vias (TSVs). This architecture dramatically increases memory bandwidth density but also concentrates heat in a very small area.

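The key thermal challenges in HBM3 are high power density per unit area, high thermal resistance (heat from the lower dies must travel through several silicon layers to reach the heatsink), and charge retention that degrades as temperature rises, which forces faster refresh and, above JEDEC limits, risks data loss or permanent device damage.

The temperature monitor module addresses these by continuously reading the DRAM's internal temperature sensor, decoding the reading, applying JEDEC derating rules, and signaling the scheduler to reduce bandwidth or halt traffic when thresholds are exceeded.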

The DRAM's temperature sensor is an on-die thermal diode with approximately ±3°C accuracy. The controller should apply a 5°C safety margin when evaluating throttle thresholds.
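One way to apply such a margin is to lower every threshold handed to the monitor. The wrapper below is a minimal sketch under that assumption: the wrapper name and the MARGIN_C parameter are hypothetical, and it relies on the hbm3_temp_monitor module and parameter names from the full source listing later in this module.

```verilog
// Illustrative wrapper (hypothetical name): applies a safety margin by
// lowering every threshold passed to hbm3_temp_monitor.
module hbm3_temp_monitor_margined #(
    parameter [7:0] MARGIN_C = 8'd5   // assumed margin in degrees C
)(
    input  wire       i_clk,
    input  wire       i_rst_n,
    input  wire [7:0] i_temp_code,
    input  wire       i_temp_valid,
    input  wire       i_alert_clr,
    output wire [7:0] o_temp_degc,
    output wire [1:0] o_throttle_level,
    output wire [1:0] o_refi_scale,
    output wire       o_tcase_alert,
    output wire [7:0] o_bw_limit
);
    hbm3_temp_monitor #(
        .THRESH_MILD      (8'd75 - MARGIN_C),  // 70 with the default margin
        .THRESH_MODERATE  (8'd85 - MARGIN_C),  // 80
        .THRESH_EMERGENCY (8'd95 - MARGIN_C),  // 90
        .HYST_COUNT       (16)
    ) u_mon (
        .i_clk            (i_clk),
        .i_rst_n          (i_rst_n),
        .i_temp_code      (i_temp_code),
        .i_temp_valid     (i_temp_valid),
        .i_alert_clr      (i_alert_clr),
        .o_temp_degc      (o_temp_degc),
        .o_throttle_level (o_throttle_level),
        .o_refi_scale     (o_refi_scale),
        .o_tcase_alert    (o_tcase_alert),
        .o_bw_limit       (o_bw_limit)
    );
endmodule
```

Because the decode table reports one representative value per band (80°C for 8'h06, 93°C for 8'h08), a 5°C margin moves those two bands up one level: 8'h06 then selects Moderate and 8'h08 selects Emergency.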

2. MR4 Temperature Readback

HBM3 DRAM exposes its die temperature via Mode Register 4 (MR4). The controller reads this register by issuing a Mode Register Read (MRR) command with the address field set to 4, and the DRAM returns the MR4 contents on the DQ bus after the read latency.

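MR4 bits [7:3] encode the temperature as a band code (5 bits = 32 possible codes, though only the lowest few are used in practice). The decode table in Section 3 uses codes 8'h00 through 8'h08, plus 8'hFF for overtemperature. Band widths are not uniform: 5°C at the low end and near the derating limits, and up to 20°C in the middle of the range.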

JEDEC recommends polling MR4 at least every 50 ms under normal conditions, and more frequently if temperature is near a threshold. The i_temp_valid input tells this module when a fresh MR4 reading is available.

MRR commands consume bus cycles. Polling too frequently degrades effective bandwidth. A practical implementation issues MRR during refresh windows when the bus would otherwise be idle.
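The polling policy above can be sketched as a small scheduler. This is only an illustration: the module name, the i_near_thresh and i_refresh_active ports, the 500 MHz clock, and the 5 ms fast interval are all assumptions, not part of Module 14.

```verilog
// Hypothetical MR4 poll scheduler (illustrative; not part of Module 14).
// All names and cycle counts are assumptions.
module mr4_poll_timer #(
    parameter integer POLL_CYCLES      = 25_000_000, // 50 ms at an assumed 500 MHz
    parameter integer POLL_CYCLES_FAST = 2_500_000   // 5 ms near a threshold (assumed)
)(
    input  wire i_clk,
    input  wire i_rst_n,
    input  wire i_near_thresh,     // last reading is close to a throttle threshold
    input  wire i_refresh_active,  // a refresh window is in progress
    output reg  o_mrr_req          // one-cycle pulse: issue MRR to MR4 now
);
    reg [31:0] cnt;
    reg        pending;            // interval elapsed, waiting for a refresh window

    wire [31:0] limit = i_near_thresh ? POLL_CYCLES_FAST : POLL_CYCLES;

    always @(posedge i_clk or negedge i_rst_n) begin
        if (!i_rst_n) begin
            cnt       <= 32'd0;
            pending   <= 1'b0;
            o_mrr_req <= 1'b0;
        end else begin
            o_mrr_req <= 1'b0;
            if (pending) begin
                if (i_refresh_active) begin
                    o_mrr_req <= 1'b1;  // piggyback on the refresh window
                    pending   <= 1'b0;
                    cnt       <= 32'd0;
                end
            end else if (cnt >= limit - 32'd1) begin
                pending <= 1'b1;        // interval elapsed, wait for a window
            end else begin
                cnt <= cnt + 32'd1;
            end
        end
    end
endmodule
```

Holding the request in pending until a refresh window opens keeps the MRR off the data path at the cost of a small, bounded delay in each reading.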

3. Temperature Bands and JEDEC Thresholds

temp_code[7:0] | Decoded Temp (°C) | Throttle Level | Refresh Scale | BW Limit | Status
8'h00 | 0–5   | 00 Normal    | 1x (64ms)    | 100% | Normal
8'h01 | 5–10  | 00 Normal    | 1x           | 100% | Normal
8'h02 | 10–20 | 00 Normal    | 1x           | 100% | Normal
8'h03 | 20–35 | 00 Normal    | 1x           | 100% | Normal
8'h04 | 35–55 | 00 Normal    | 1x           | 100% | Normal
8'h05 | 55–75 | 00 Normal    | 1x           | 100% | Normal
8'h06 | 75–85 | 01 Mild      | 0.5x (32ms)  | 75%  | Warm
8'h07 | 85–90 | 10 Moderate  | 0.25x (16ms) | 50%  | Derating1
8'h08 | 90–95 | 10 Moderate  | 0.25x (16ms) | 50%  | Derating2
8'hFF | >95   | 11 Emergency | 0x (SR only) | 0%   | SHUTDOWN

The o_refi_scale[1:0] output controls the refresh engine: 2'b00 = 1x (normal tREFI), 2'b01 = 0.5x (halved tREFI = 2x refresh rate), 2'b10 = 0.25x (quarter tREFI = 4x refresh rate).

[Figure: bandwidth limit (%) versus temperature (°C), with temperature marks at 0, 55, 75, 85, 90 and 95°C. The limit steps down from 100% (NORMAL) to 75% (MILD), then 50%, then 0% (HALT).]

4. Thermal Throttle Logic

The throttle controller decodes the incoming temperature band code, compares the decoded temperature against the configured thresholds (75°C, 85°C and 95°C by default), and maps it to a throttle level, refresh scale, and bandwidth limit. The outputs are registered and update on each valid reading. The o_throttle_level[1:0] output goes to the AXI4 interface and scheduler to back-pressure incoming transactions.

A hysteresis counter (5 bits wide in the RTL) prevents rapid toggling near thresholds. When temperature crosses a threshold going up, the new throttle level activates immediately. When temperature drops back down, a hysteresis count of 16 polling intervals must expire before the throttle level de-escalates.

Hysteresis is critical. Without it, a temperature reading that oscillates between 84°C and 86°C causes rapid throttle on/off switching, which itself generates heat and hurts throughput. Hysteresis keeps the system stable near boundaries.
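The effect is easiest to see on a single threshold. The sketch below is a hypothetical one-bit filter, not the module's own logic: i_hot stands for "the reading is at or above the threshold", and the module and port names are assumptions.

```verilog
// Hypothetical single-threshold hysteresis filter (illustrative only).
module hyst_filter #(
    parameter integer HYST_COUNT = 16   // cool readings required to release
)(
    input  wire i_clk,
    input  wire i_rst_n,
    input  wire i_sample_valid,  // one pulse per temperature reading
    input  wire i_hot,           // reading is at or above the threshold
    output reg  o_throttle
);
    reg [7:0] cool_cnt;          // consecutive cool readings seen so far

    always @(posedge i_clk or negedge i_rst_n) begin
        if (!i_rst_n) begin
            o_throttle <= 1'b0;
            cool_cnt   <= 8'd0;
        end else if (i_sample_valid) begin
            if (i_hot) begin
                o_throttle <= 1'b1;   // escalate immediately
                cool_cnt   <= 8'd0;   // any hot reading restarts the count
            end else if (o_throttle) begin
                if (cool_cnt == HYST_COUNT - 1) begin
                    o_throttle <= 1'b0;   // enough cool readings: release
                    cool_cnt   <= 8'd0;
                end else begin
                    cool_cnt <= cool_cnt + 8'd1;
                end
            end
        end
    end
endmodule
```

With readings alternating between 84°C and 86°C around an 85°C threshold, every second sample is hot, so cool_cnt never reaches HYST_COUNT - 1 and o_throttle stays asserted instead of toggling.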

5. Refresh Derating — Faster Refresh at High Temperature

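Normal HBM3 tREFI is 3.9 µs (a 64 ms refresh window spread over 16K rows: 64 ms / 16,384 ≈ 3.9 µs). At elevated temperature, the controller must issue refresh commands more frequently: a scale of 0.5x halves tREFI to about 1.95 µs (32 ms window), and a scale of 0.25x quarters it to about 0.98 µs (16 ms window).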

The refresh engine (a separate module) reads o_refi_scale and adjusts its refresh timer period accordingly. When refi_scale = 0.5x, the refresh timer fires twice as often, consuming more bus bandwidth but ensuring data integrity.
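A minimal sketch of how a refresh timer might consume the scale code follows. The real refresh engine is a separate module that is not shown here, so the module name, ports, and the 500 MHz clock behind TREFI_CYCLES are assumptions.

```verilog
// Hypothetical refresh interval timer (illustrative; the real refresh
// engine is a separate module).
module refi_timer #(
    parameter integer TREFI_CYCLES = 1953  // ~3.9 us at an assumed 500 MHz
)(
    input  wire       i_clk,
    input  wire       i_rst_n,
    input  wire [1:0] i_refi_scale,  // 00=1x, 01=0.5x, 10=0.25x
    output reg        o_ref_req      // one-cycle pulse per refresh interval
);
    reg [15:0] cnt;
    reg [15:0] period;

    // Shift right to halve or quarter the nominal interval
    always @(*) begin
        case (i_refi_scale)
            2'b01:   period = TREFI_CYCLES >> 1;
            2'b10:   period = TREFI_CYCLES >> 2;
            default: period = TREFI_CYCLES;
        endcase
    end

    always @(posedge i_clk or negedge i_rst_n) begin
        if (!i_rst_n) begin
            cnt       <= 16'd0;
            o_ref_req <= 1'b0;
        end else if (cnt >= period - 16'd1) begin
            cnt       <= 16'd0;      // interval elapsed: request a refresh
            o_ref_req <= 1'b1;
        end else begin
            cnt       <= cnt + 16'd1;
            o_ref_req <= 1'b0;
        end
    end
endmodule
```

The comparison uses >= rather than ==, so if the scale tightens while the counter is already past the new, shorter period, the next refresh request fires on the following clock instead of waiting for the counter to wrap.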

6. Emergency Thermal Shutdown

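When the decoded temperature reaches the emergency threshold (95°C by default, which with the decode table means temp_code = 8'hFF or any unrecognized code), the controller enters emergency shutdown: o_throttle_level goes to 2'b11, o_bw_limit drops to 0 so all traffic is halted, and o_tcase_alert is asserted.

The system must intervene, typically by reducing the GPU/SoC clock frequency (which lowers the HBM3 access rate) or by increasing fan speed, before the controller re-enables traffic. The alert is sticky: it de-asserts only when the decoded temperature is below THRESH_MODERATE (85°C by default) and software issues a clear on i_alert_clr.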

7. Full Verilog Source

Verilog — hbm3_temp_monitor.v
// ============================================================
// hbm3_temp_monitor.v — HBM3 Temperature Monitor & Throttle
// EcrioniX · HBM3 Controller Build · Module 14
// Phase 4: Power and Thermal Management
// ============================================================
// Reads the DRAM temperature band code from MR4,
// decodes to degrees C, applies JEDEC derating thresholds,
// and outputs throttle level, refresh scale, and BW limit.
// Synthesizable RTL.
// ============================================================

module hbm3_temp_monitor #(
    // Temperature thresholds (in decoded degrees C)
    parameter THRESH_MILD     = 8'd75,  // above -> mild throttle
    parameter THRESH_MODERATE = 8'd85,  // above -> moderate throttle
    parameter THRESH_EMERGENCY= 8'd95,  // above -> emergency shutdown
    // Hysteresis: polling intervals before de-escalation
    parameter HYST_COUNT      = 16
)(
    input  wire        i_clk,
    input  wire        i_rst_n,

    // Temperature band code from MR4 readback (see decode table below)
    input  wire [7:0]  i_temp_code,
    input  wire        i_temp_valid,     // new reading available

    // Software alert clear (write 1 to de-assert tcase_alert)
    input  wire        i_alert_clr,

    // Decoded outputs
    output reg  [7:0]  o_temp_degc,      // temperature in degrees C

    // Throttle and derating outputs
    output reg  [1:0]  o_throttle_level, // 00=normal,01=mild,10=moderate,11=emergency
    output reg  [1:0]  o_refi_scale,     // 00=1x, 01=0.5x, 10=0.25x
    output reg         o_tcase_alert,    // JEDEC limit exceeded
    output reg  [7:0]  o_bw_limit        // max BW as % of nominal
);

// ============================================================
// Temp code to degrees C decode table (MR4 encoding)
// Band boundaries per JEDEC JESD238 Table 14
// ============================================================
function automatic [7:0] decode_temp;
    input [7:0] code;
    begin
        case (code)
            8'h00: decode_temp = 8'd3;
            8'h01: decode_temp = 8'd8;
            8'h02: decode_temp = 8'd15;
            8'h03: decode_temp = 8'd28;
            8'h04: decode_temp = 8'd45;
            8'h05: decode_temp = 8'd65;
            8'h06: decode_temp = 8'd80;
            8'h07: decode_temp = 8'd88;
            8'h08: decode_temp = 8'd93;
            default: decode_temp = 8'd100; // overtemp / unknown
        endcase
    end
endfunction

// ============================================================
// Hysteresis counter (counts cooler polls before de-escalation)
// ============================================================
reg [4:0] hyst_cnt;

// ============================================================
// Throttle state register (latched to provide hysteresis)
// ============================================================
reg [1:0] throttle_latch;

// ============================================================
// Fresh reading -> target level (combinational).
// Decisions use the value decoded this cycle, not the
// previously registered o_temp_degc.
// ============================================================
reg [7:0] temp_now;
reg [1:0] target_level;
always @(*) begin
    temp_now = decode_temp(i_temp_code);
    if      (temp_now >= THRESH_EMERGENCY) target_level = 2'b11;
    else if (temp_now >= THRESH_MODERATE)  target_level = 2'b10;
    else if (temp_now >= THRESH_MILD)      target_level = 2'b01;
    else                                   target_level = 2'b00;
end

// ============================================================
// Next-state logic: escalate at once, de-escalate only after
// HYST_COUNT consecutive polls below the current level
// ============================================================
reg [1:0] throttle_next;
reg [4:0] hyst_next;
always @(*) begin
    throttle_next = throttle_latch;
    hyst_next     = 5'd0;   // restart the count unless a cooler poll extends it
    if (target_level > throttle_latch) begin
        throttle_next = target_level;          // escalate immediately
    end else if (target_level < throttle_latch) begin
        if (hyst_cnt == HYST_COUNT - 1)
            throttle_next = target_level;      // hysteresis expired
        else
            hyst_next = hyst_cnt + 5'd1;
    end
end

// ============================================================
// Main monitor logic
// ============================================================
always @(posedge i_clk or negedge i_rst_n) begin
    if (!i_rst_n) begin
        o_temp_degc      <= 8'd0;
        o_throttle_level <= 2'b00;
        o_refi_scale     <= 2'b00;
        o_tcase_alert    <= 1'b0;
        o_bw_limit       <= 8'd100;
        throttle_latch   <= 2'b00;
        hyst_cnt         <= 5'd0;
    end else begin
        // Software alert clear, honored only while the last
        // reading is below THRESH_MODERATE
        if (i_alert_clr && o_temp_degc < THRESH_MODERATE)
            o_tcase_alert <= 1'b0;

        if (i_temp_valid) begin
            o_temp_degc      <= temp_now;
            throttle_latch   <= throttle_next;
            hyst_cnt         <= hyst_next;
            o_throttle_level <= throttle_next;

            // An emergency reading sets the sticky alert and
            // wins over a clear issued in the same cycle
            if (target_level == 2'b11)
                o_tcase_alert <= 1'b1;

            // Drive derating outputs from the level chosen this poll
            case (throttle_next)
                2'b00: begin
                    o_refi_scale <= 2'b00; // 1x normal
                    o_bw_limit   <= 8'd100;
                end
                2'b01: begin
                    o_refi_scale <= 2'b01; // 0.5x (2x refresh)
                    o_bw_limit   <= 8'd75;
                end
                2'b10: begin
                    o_refi_scale <= 2'b10; // 0.25x (4x refresh)
                    o_bw_limit   <= 8'd50;
                end
                2'b11: begin
                    o_refi_scale <= 2'b10; // max refresh
                    o_bw_limit   <= 8'd0;
                end
                default: begin
                    o_refi_scale <= 2'b00;
                    o_bw_limit   <= 8'd100;
                end
            endcase
        end
    end
end

endmodule

8. SystemVerilog Testbench

SystemVerilog — tb_hbm3_temp_monitor.sv
// ============================================================
// tb_hbm3_temp_monitor.sv — Testbench for Temperature Monitor
// EcrioniX · HBM3 Controller Build · Module 14
// ============================================================
`timescale 1ns/1ps

module tb_hbm3_temp_monitor;

// DUT signals
logic        clk, rst_n;
logic [7:0]  temp_code;
logic        temp_valid, alert_clr;
logic [7:0]  temp_degc;
logic [1:0]  throttle_level, refi_scale;
logic        tcase_alert;
logic [7:0]  bw_limit;

// Instantiate DUT
hbm3_temp_monitor #(
    .THRESH_MILD(8'd75),
    .THRESH_MODERATE(8'd85),
    .THRESH_EMERGENCY(8'd95),
    .HYST_COUNT(4) // short for simulation
) dut (
    .i_clk(clk),           .i_rst_n(rst_n),
    .i_temp_code(temp_code),.i_temp_valid(temp_valid),
    .i_alert_clr(alert_clr),
    .o_temp_degc(temp_degc),.o_throttle_level(throttle_level),
    .o_refi_scale(refi_scale),.o_tcase_alert(tcase_alert),
    .o_bw_limit(bw_limit)
);

// 500MHz clock
initial clk = 0;
always #1 clk = ~clk;

integer errors = 0;

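// Drive one single-cycle valid pulse. Nonblocking assignments keep the
// stimulus from racing the DUT's sampling at the clock edge.
task send_temp(input [7:0] code);
    @(posedge clk);
    temp_code  <= code;
    temp_valid <= 1'b1;
    @(posedge clk);
    temp_valid <= 1'b0;
    repeat(4) @(posedge clk); // let outputs settle
endtask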

task check_throttle(input [1:0] expected, input string label);
    if (throttle_level !== expected) begin
        $error("FAIL [%s]: throttle_level=%0b expected=%0b", label, throttle_level, expected);
        errors++;
    end else $display("[%0t] PASS [%s]: throttle=%0b bw=%0d%%", $time, label, throttle_level, bw_limit);
endtask

initial begin
    $dumpfile("tb_temp_monitor.vcd");
    $dumpvars(0, tb_hbm3_temp_monitor);

    rst_n = 0; temp_code = '0; temp_valid = 0; alert_clr = 0;
    repeat(10) @(posedge clk);
    rst_n = 1;
    repeat(5) @(posedge clk);

    // TEST 1: Normal temperature (code=5, ~65°C)
    $display("[%0t] TEST1: Normal operating temp", $time);
    send_temp(8'h05);
    check_throttle(2'b00, "65C-Normal");

    // TEST 2: Warm temperature (code=6, ~80°C) — mild throttle
    $display("[%0t] TEST2: Warm temp -> mild throttle", $time);
    send_temp(8'h06);
    check_throttle(2'b01, "80C-Mild");
    if (bw_limit !== 8'd75) begin
        $error("FAIL: bw_limit=%0d expected 75", bw_limit);
        errors++;
    end

    // TEST 3: Hot temperature (code=7, ~88°C) -> moderate throttle
    $display("[%0t] TEST3: Hot temp -> moderate throttle", $time);
    send_temp(8'h07);
    check_throttle(2'b10, "88C-Moderate"); // 88 >= THRESH_MODERATE, escalates at once
    if (refi_scale !== 2'b10) begin
        $error("FAIL: refi_scale=%0b expected 10", refi_scale);
        errors++;
    end

    // TEST 4: Critical temperature (code=8, ~93°C) -> still moderate
    $display("[%0t] TEST4: Critical temp (93C) -> moderate", $time);
    send_temp(8'h08);
    check_throttle(2'b10, "93C-Moderate");

    // TEST 5: Emergency (code=FF, >95°C)
    $display("[%0t] TEST5: Emergency temp >95C -> shutdown", $time);
    send_temp(8'hFF);
    repeat(5) @(posedge clk);
    if (throttle_level !== 2'b11) begin
        $error("FAIL: Emergency throttle not set. Got %0b", throttle_level);
        errors++;
    end else $display("[%0t] PASS: Emergency shutdown activated", $time);
    if (!tcase_alert) begin
        $error("FAIL: tcase_alert not asserted during emergency");
        errors++;
    end else $display("[%0t] PASS: tcase_alert asserted", $time);
    if (bw_limit !== 8'd0) begin
        $error("FAIL: bw_limit should be 0 during emergency, got %0d", bw_limit);
        errors++;
    end

    // TEST 6: Recovery from emergency
    $display("[%0t] TEST6: Recovery, temp back to normal", $time);
    send_temp(8'h04); // 45°C
    alert_clr <= 1'b1; @(posedge clk); alert_clr <= 1'b0;
    // HYST_COUNT cool polls are needed before de-escalation
    repeat(4) send_temp(8'h04);
    if (tcase_alert) begin
        $error("FAIL: tcase_alert still set after clear at 45C");
        errors++;
    end
    check_throttle(2'b00, "45C-Recovered");

    // Summary
    repeat(20) @(posedge clk);
    if (errors == 0)
        $display("[%0t] ALL TESTS PASSED", $time);
    else
        $display("[%0t] %0d TEST(S) FAILED", $time, errors);
    $finish;
end

initial begin
    #100000;
    $error("TIMEOUT");
    $finish;
end

endmodule

Frequently Asked Questions

Why does HBM3 need active thermal management?

HBM3 stacks multiple DRAM dies vertically with TSVs, dramatically increasing power density per unit area. Heat from lower dies must travel through silicon layers to reach the heatsink, creating high thermal resistance. At elevated temperatures, DRAM charge retention degrades (requiring faster refresh), and operation above JEDEC limits (85°C standard, 95°C extended) risks data loss or permanent device damage.

How does the controller read DRAM temperature?

The DRAM reports its die temperature via Mode Register 4 (MR4). The controller issues a Mode Register Read (MRR) command with address 4, and the DRAM responds with a temperature band code on the DQ bus, which this module receives as the 8-bit i_temp_code. Band widths in the decode table range from about 5°C to 20°C. JEDEC recommends polling MR4 at least every 50 ms under normal operation.

What is refresh derating (SRC) in HBM3?

At elevated temperatures, DRAM capacitors discharge faster, so the controller must shorten tREFI (the refresh interval). With the default thresholds, this module halves tREFI (2x refresh rate) in the Mild band from 75°C and quarters it (4x refresh rate) in the Moderate band from 85°C. This is called Self-Refresh Calibration (SRC) or thermal refresh derating per JESD238.

What are the HBM3 throttle levels?

The throttle controller implements four levels: Normal (below 75°C — full bandwidth, 1x refresh), Mild (75–85°C — 25% BW reduction, 2x refresh), Moderate (85–95°C — 50% BW reduction, 4x refresh), and Emergency shutdown (above 95°C — all traffic halted, tcase_alert asserted, zero BW limit). Hysteresis prevents rapid oscillation near thresholds.

How does bandwidth limiting actually work in the controller?

The bw_limit output indicates maximum allowable bandwidth as a percentage of nominal. The scheduler reads this and applies a token-bucket rate limiter: it counts commands issued per refresh window and stops accepting new AXI transactions once the count reaches floor(bw_limit/100 × max_cmds_per_window). This provides smooth bandwidth reduction without abrupt traffic halts.
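The counting scheme described above can be sketched as follows. The scheduler itself is not part of this module, so the module name, the i_window_start and i_cmd_issued ports, and the budget of 64 commands per window are assumptions made for illustration.

```verilog
// Hypothetical windowed rate limiter (illustrative sketch only).
module bw_limiter #(
    parameter integer MAX_CMDS_PER_WINDOW = 64  // assumed budget at 100%
)(
    input  wire       i_clk,
    input  wire       i_rst_n,
    input  wire       i_window_start, // pulse at the start of each refresh window
    input  wire       i_cmd_issued,   // pulse for every command sent to the DRAM
    input  wire [7:0] i_bw_limit,     // 0..100, from hbm3_temp_monitor
    output wire       o_accept        // high while new transactions may be accepted
);
    reg [15:0] cmd_cnt;

    // floor(bw_limit * MAX / 100): integer division truncates.
    // Division by a constant is synthesizable but not free.
    wire [15:0] budget = (i_bw_limit * MAX_CMDS_PER_WINDOW) / 100;

    assign o_accept = (cmd_cnt < budget);

    always @(posedge i_clk or negedge i_rst_n) begin
        if (!i_rst_n)
            cmd_cnt <= 16'd0;
        else if (i_window_start)
            cmd_cnt <= i_cmd_issued ? 16'd1 : 16'd0;  // new window, count restarts
        else if (i_cmd_issued && o_accept)
            cmd_cnt <= cmd_cnt + 16'd1;
    end
endmodule
```

With the assumed budget of 64, a bw_limit of 75 allows 48 commands per window, 50 allows 32, and 0 holds o_accept low for the whole window.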