Day 26

Building a MAC Unit

SystemVerilog design for a multiply-accumulate unit. Pipelining, precision, testbenches. From paper to gates.

MAC Unit Specification

| Parameter | Value |
| --- | --- |
| Input precision | INT8 (8-bit signed) |
| Output precision | INT32 (32-bit signed) |
| Pipeline depth | 2 cycles |
| Clock | 1 GHz (1 ns cycle) |
| Area target | < 0.01 mm² (5nm process) |
| Power target | < 1 mW @ 1 GHz |
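The power and clock targets together imply an energy budget per operation. As a quick sanity check, assuming the unit completes one MAC per clock cycle:

```latex
\[
E_{\text{MAC}} = \frac{P}{f}
  = \frac{1\ \text{mW}}{1\ \text{GHz}}
  = \frac{10^{-3}\ \text{W}}{10^{9}\ \text{s}^{-1}}
  = 10^{-12}\ \text{J}
  = 1\ \text{pJ per MAC}
\]
```

The whole unit, registers and clocking included, has to stay under roughly 1 pJ per multiply-accumulate to meet the target.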

Design: Pipelined MAC

Cycle 0: Load A, B
├─ Stage 1: Multiply (A × B) → 16-bit result
├─ Stage 2: Add (result + accumulator) → 32-bit
└─ Output C_out

Each cycle:
- Input: A (8-bit), B (8-bit), C_in (32-bit, from previous MAC)
- Output: C_out (32-bit)
- Latency: 2 cycles (input to output)
- Throughput: 1 result per cycle (after pipeline fills)
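The bit widths above can be checked with a little arithmetic. The first two lines confirm that any signed 8×8 product fits in 16 bits. The third estimates how many worst-case products a 32-bit signed sum can absorb before it can overflow, assuming the sum starts from zero. The symbol N_max is introduced here for illustration only.

```latex
\[
a, b \in [-128,\, 127]
\;\Rightarrow\;
-16{,}256 = (-128)(127) \;\le\; a \cdot b \;\le\; (-128)(-128) = 16{,}384 = 2^{14}
\]
\[
[-16{,}256,\ 16{,}384] \subset [-2^{15},\ 2^{15}-1] = [-32{,}768,\ 32{,}767]
\]
\[
N_{\max} = \left\lfloor \frac{2^{31}-1}{2^{14}} \right\rfloor = 131{,}071
\]
```

A chain of MACs can therefore sum over 131,000 products of the largest possible magnitude before the INT32 output is at risk of wrapping.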

SystemVerilog Implementation

module mac_unit #(
    parameter WIDTH_A = 8,
    parameter WIDTH_B = 8,
    parameter WIDTH_C = 32
) (
    input                    clk,
    input                    reset,      // synchronous, active-high
    input      [WIDTH_A-1:0] a_in,
    input      [WIDTH_B-1:0] b_in,
    input      [WIDTH_C-1:0] c_in,
    output reg [WIDTH_C-1:0] c_out,
    output reg               valid_out
);
    // Stage 1 registers: the product, plus c_in delayed by one cycle so it
    // stays aligned with the product it belongs to
    reg signed [WIDTH_A+WIDTH_B-1:0] mult_s1;   // 8×8 signed → 16-bit
    reg signed [WIDTH_C-1:0]         c_s1;
    reg                              valid_s1;

    always @(posedge clk) begin
        if (reset) begin
            mult_s1   <= 0;
            c_s1      <= 0;
            valid_s1  <= 0;
            c_out     <= 0;
            valid_out <= 0;
        end else begin
            // Stage 1: multiply and register the product
            mult_s1  <= $signed(a_in) * $signed(b_in);
            c_s1     <= $signed(c_in);
            valid_s1 <= 1'b1;

            // Stage 2: add the sign-extended product to the incoming sum
            c_out     <= mult_s1 + c_s1;
            valid_out <= valid_s1;   // high once the first result reaches c_out
        end
    end
endmodule

Area Estimation (5nm)

| Component | Gates (approx) | Area (μm²) |
| --- | --- | --- |
| 8×8 multiplier | 500 | 2 |
| 32-bit adder | 300 | 1.2 |
| Pipeline registers | 100 | 0.4 |
| Logic & routing | 200 | 0.8 |
| Total | 1,100 | 4.4 |
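To relate this estimate to the spec, the totals can be converted to a per-gate area and compared with the 0.01 mm² target. The per-gate figure below is simply what the table's numbers imply, not an independent process datum.

```latex
\[
\frac{4.4\ \mu\text{m}^2}{1{,}100\ \text{gates}} = 0.004\ \mu\text{m}^2 \text{ per gate},
\qquad
0.01\ \text{mm}^2 = 10^{4}\ \mu\text{m}^2,
\qquad
\frac{4.4\ \mu\text{m}^2}{10^{4}\ \mu\text{m}^2} = 0.044\%
\]
```

On these numbers a single MAC uses well under a tenth of a percent of the area budget.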

Testbench

module mac_unit_tb;
    reg clk, reset;
    reg signed [7:0] a_in, b_in;
    reg signed [31:0] c_in;
    wire signed [31:0] c_out;
    wire valid_out;

    mac_unit uut (
        .clk(clk), .reset(reset),
        .a_in(a_in), .b_in(b_in), .c_in(c_in),
        .c_out(c_out), .valid_out(valid_out)
    );

    always #5 clk = ~clk;   // clock period: 10 time units

    initial begin
        clk = 0; reset = 1;
        a_in = 0; b_in = 0; c_in = 0;
        #10 reset = 0;

        // Test: 3 × 4 + 0 = 12
        a_in = 3; b_in = 4; c_in = 0;
        #40;   // hold inputs long enough for the result to reach c_out
        $display("Output: %0d (expect 12)", c_out);

        // Test: 2 × 5 + 12 = 22
        a_in = 2; b_in = 5; c_in = 12;
        #40;
        $display("Output: %0d (expect 22)", c_out);

        $finish;   // stop the simulation; the clock would otherwise run forever
    end
endmodule
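The testbench above relies on reading printed values. A self-checking variant might look like the sketch below, which compares c_out against a small golden model over corner cases and random vectors. It assumes the mac_unit module above is compiled alongside it. The names mac_unit_selfcheck_tb, mac_ref, check, HOLD_CYCLES and NUM_TESTS are illustrative and not part of the original design. Inputs are held steady for HOLD_CYCLES clocks, so the check does not depend on the exact pipeline latency as long as HOLD_CYCLES is at least that large.

```systemverilog
// Self-checking testbench sketch (assumes mac_unit is compiled alongside).
module mac_unit_selfcheck_tb;
    localparam int HOLD_CYCLES = 4;    // assumption: >= pipeline latency
    localparam int NUM_TESTS   = 100;  // number of random vectors

    logic               clk = 0;
    logic               reset;
    logic signed [7:0]  a_in, b_in;
    logic signed [31:0] c_in;
    wire  signed [31:0] c_out;
    wire                valid_out;
    int                 errors = 0;

    mac_unit uut (
        .clk(clk), .reset(reset),
        .a_in(a_in), .b_in(b_in), .c_in(c_in),
        .c_out(c_out), .valid_out(valid_out)
    );

    always #5 clk = ~clk;

    // Golden model: all operands are signed, so they are sign-extended
    // to 32 bits before the multiply and add.
    function automatic logic signed [31:0] mac_ref(
        input logic signed [7:0]  a,
        input logic signed [7:0]  b,
        input logic signed [31:0] c
    );
        return a * b + c;
    endfunction

    // Drive one vector, hold it for HOLD_CYCLES clocks, then compare.
    task automatic check(
        input logic signed [7:0]  a,
        input logic signed [7:0]  b,
        input logic signed [31:0] c
    );
        logic signed [31:0] expected;
        @(negedge clk);                 // change inputs away from the sampling edge
        a_in = a; b_in = b; c_in = c;
        repeat (HOLD_CYCLES) @(negedge clk);
        expected = mac_ref(a, b, c);
        if (c_out !== expected) begin
            errors++;
            $display("FAIL: %0d * %0d + %0d gave %0d, expected %0d",
                     a, b, c, c_out, expected);
        end
    endtask

    initial begin
        reset = 1; a_in = 0; b_in = 0; c_in = 0;
        repeat (2) @(negedge clk);
        reset = 0;

        // Corner cases at the extremes of the signed 8-bit range
        check(-128, -128, 0);
        check(-128,  127, 0);
        check( 127,  127, -1);
        check(   0,    0, 32'sh7FFF_FFFF);

        // Random vectors; the 32-bit random values are truncated to the
        // argument widths and reinterpreted as signed
        repeat (NUM_TESTS)
            check($urandom_range(255), $urandom_range(255), $urandom);

        if (errors == 0) $display("PASS: all vectors matched");
        else             $display("FAIL: %0d mismatches", errors);
        $finish;
    end
endmodule
```

Holding each vector steady trades simulation time for simplicity. A back-to-back test that feeds a new vector every cycle would need a queue of expected values delayed by the pipeline latency.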

Day 27: Scaling to 4×4 systolic array: connecting 16 MACs with dataflow.