MAC Unit Specification
| Parameter | Value |
|---|---|
| Input precision | INT8 (8-bit signed) |
| Output precision | INT32 (32-bit signed) |
| Pipeline depth | 2 cycles |
| Clock | 1 GHz (1 ns cycle) |
| Area target | < 0.01 mm² (5nm process) |
| Power target | < 1 mW @ 1 GHz |
Design: Pipelined MAC
Cycle 0: Load A, B
├─ Stage 1: Multiply (A × B) → 16-bit result
├─ Stage 2: Add (sign-extended result + C_in) → 32-bit
└─ Output C_out
Each cycle:
- Input: A (8-bit), B (8-bit), C_in (32-bit, from previous MAC)
- Output: C_out (32-bit)
- Latency: 2 cycles (input to output)
- Throughput: 1 result per cycle (after pipeline fills)
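
The per-cycle behaviour listed above can be written down as a small reference function, which is handy as a golden model when checking the RTL. This is a minimal sketch: the names `mac_ref_pkg`, `mac_ref`, and `mac_ref_demo` are hypothetical and do not come from the design. It also illustrates why 16 bits are enough for the product: a signed 8×8 multiply ranges from -128 × 127 = -16256 to -128 × -128 = 16384, and both fit in a signed 16-bit value.

```systemverilog
package mac_ref_pkg;
  // Hypothetical golden model of one MAC operation: C_out = A * B + C_in.
  function automatic logic signed [31:0] mac_ref(
    input logic signed [7:0]  a,
    input logic signed [7:0]  b,
    input logic signed [31:0] c_in
  );
    logic signed [15:0] product;
    product = a * b;        // signed 8x8 product always fits in 16 bits
    return c_in + product;  // product is sign-extended to 32 bits by the add
  endfunction
endpackage

module mac_ref_demo;
  import mac_ref_pkg::*;
  initial begin
    // Values from the spec's range, checked with immediate assertions.
    assert (mac_ref(3, 4, 0) == 12);
    assert (mac_ref(-128, -128, 0) == 16384);    // largest possible product
    assert (mac_ref(-128, 127, 100) == -16156);  // most negative product, plus 100
    $display("mac_ref checks done");
    $finish;
  end
endmodule
```

Because every operand is declared `signed`, the add sign-extends the 16-bit product instead of zero-padding it, which is the detail most likely to go wrong in a hand-written MAC.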
SystemVerilog Implementation
module mac_unit #(
  parameter WIDTH_A = 8,
  parameter WIDTH_B = 8,
  parameter WIDTH_C = 32
) (
  input                    clk, reset,
  input      [WIDTH_A-1:0] a_in,
  input      [WIDTH_B-1:0] b_in,
  input      [WIDTH_C-1:0] c_in,
  output reg [WIDTH_C-1:0] c_out,
  output reg               valid_out
);
  // Stage 1 registers: the product, plus c_in delayed by one cycle so it
  // reaches the adder together with the product of the a_in/b_in pair it
  // arrived with (assumes c_in is presented in the same cycle as a_in, b_in).
  reg signed [WIDTH_A+WIDTH_B-1:0] mult_s1; // 8×8 signed → 16-bit
  reg signed [WIDTH_C-1:0]         c_s1;
  reg                              valid_s1;

  always @(posedge clk) begin
    if (reset) begin
      mult_s1   <= 0;
      c_s1      <= 0;
      valid_s1  <= 0;
      c_out     <= 0;
      valid_out <= 0;
    end else begin
      // Stage 1: multiply and register the product
      mult_s1  <= $signed(a_in) * $signed(b_in);
      c_s1     <= $signed(c_in);
      valid_s1 <= 1;
      // Stage 2: add the sign-extended product to the incoming partial sum
      c_out     <= mult_s1 + c_s1;
      valid_out <= valid_s1; // High from the second clock edge after reset is released
    end
  end
endmodule
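
One way to make the specified 2-cycle latency checkable in simulation is a small monitor that delays the expected result by the same number of cycles and compares it with `c_out`. The sketch below is an illustration only: `mac_latency_check` and its `LATENCY` parameter are hypothetical names, and it assumes `c_in` is sampled in the same cycle as `a_in` and `b_in`, and that `valid_out` is low until the first real result appears.

```systemverilog
// Hypothetical checker, not part of the original design.
module mac_latency_check #(
  parameter int LATENCY = 2
) (
  input logic               clk,
  input logic               reset,
  input logic signed [7:0]  a_in,
  input logic signed [7:0]  b_in,
  input logic signed [31:0] c_in,
  input logic signed [31:0] c_out,
  input logic               valid_out
);
  // Expected results, shifted LATENCY times so they line up with c_out.
  logic signed [31:0] expected [LATENCY];

  always_ff @(posedge clk) begin
    if (reset) begin
      for (int i = 0; i < LATENCY; i++) expected[i] <= '0;
    end else begin
      // All operands are signed, so a_in and b_in are sign-extended to 32 bits.
      expected[0] <= a_in * b_in + c_in;
      for (int i = 1; i < LATENCY; i++) expected[i] <= expected[i-1];
      // Compare the value c_out held before this edge with the delayed model.
      if (valid_out && c_out !== expected[LATENCY-1])
        $error("c_out = %0d, expected %0d", c_out, expected[LATENCY-1]);
    end
  end
endmodule
```

Inside a testbench whose signals use the same names, it could be instantiated as `mac_latency_check chk (.*);`. A design whose real latency differs from `LATENCY` will trip the `$error` as soon as the inputs change from cycle to cycle.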
Area Estimation (5nm)
| Component | Gates (approx) | Area (μm²) |
|---|---|---|
| 8×8 multiplier | 500 | 2 |
| 32-bit adder | 300 | 1.2 |
| Pipeline registers | 100 | 0.4 |
| Logic & routing | 200 | 0.8 |
| Total | 1,100 | 4.4 |

The estimated total of 4.4 μm² is 4.4 × 10⁻⁶ mm², far below the 0.01 mm² area target.
Testbench
module mac_unit_tb;
  reg clk, reset;
  reg signed [7:0] a_in, b_in;
  reg signed [31:0] c_in;
  wire signed [31:0] c_out;
  wire valid_out;

  mac_unit uut (
    .clk(clk), .reset(reset),
    .a_in(a_in), .b_in(b_in), .c_in(c_in),
    .c_out(c_out), .valid_out(valid_out)
  );

  always #5 clk = ~clk; // 10-unit period, rising edges at t = 5, 15, 25, ...

  initial begin
    clk = 0;
    a_in = 0; b_in = 0; c_in = 0; // Drive known values during reset
    reset = 1; #10 reset = 0;
    // Test 1: 3 × 4 + 0 = 12
    a_in = 3; b_in = 4; c_in = 0;
    #30; // Hold inputs across three rising edges, covering the 2-cycle latency
    $display("Output: %d (expect 12)", c_out);
    // Test 2: 2 × 5 + 12 = 22
    a_in = 2; b_in = 5; c_in = 12;
    #30;
    $display("Output: %d (expect 22)", c_out);
    $finish; // Stop the free-running clock
  end
endmodule
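
The testbench above holds each input for several cycles, so it does not exercise the "1 result per cycle" throughput claim. A streaming test that changes the operands on every cycle might look like the sketch below. To keep it self-contained it carries its own compact 2-stage MAC, `mac2_sketch`, which is a hypothetical stand-in and not the module from this article; `mac_stream_tb` and the operand pattern are likewise made up for illustration.

```systemverilog
// Hypothetical 2-stage MAC, included only so the example compiles on its own.
module mac2_sketch (
  input  logic               clk, reset,
  input  logic signed [7:0]  a_in, b_in,
  input  logic signed [31:0] c_in,
  output logic signed [31:0] c_out
);
  logic signed [15:0] mult_s1;
  logic signed [31:0] c_s1;
  always_ff @(posedge clk) begin
    if (reset) begin
      mult_s1 <= '0; c_s1 <= '0; c_out <= '0;
    end else begin
      mult_s1 <= a_in * b_in;     // stage 1: registered product
      c_s1    <= c_in;            // keep c_in aligned with its product
      c_out   <= mult_s1 + c_s1;  // stage 2: sign-extending add
    end
  end
endmodule

module mac_stream_tb;
  logic clk = 0, reset = 1;
  logic signed [7:0]  a_in = 0, b_in = 0;
  logic signed [31:0] c_in = 0;
  logic signed [31:0] c_out;
  logic signed [31:0] expected [$]; // results still in flight
  int errors = 0;

  mac2_sketch dut (.*);

  always #5 clk = ~clk;

  // Pop the oldest expected value and compare it with the current output.
  task automatic check_front();
    logic signed [31:0] want;
    want = expected.pop_front();
    if (c_out !== want) begin
      errors++;
      $display("Mismatch: got %0d, want %0d", c_out, want);
    end
  endtask

  initial begin
    repeat (2) @(posedge clk);
    @(negedge clk) reset = 0;
    // New operands every cycle, changed on the falling edge to avoid races.
    for (int i = 0; i < 8; i++) begin
      a_in = i - 4; b_in = 3 * i - 7; c_in = 100 * i;
      expected.push_back(a_in * b_in + c_in);
      @(negedge clk);              // one rising edge has sampled these inputs
      if (i >= 1) check_front();   // result i-1 is on c_out by now
    end
    @(negedge clk); check_front(); // drain the last result
    if (errors == 0) $display("PASS");
    else             $display("FAIL: %0d mismatches", errors);
    $finish;
  end
endmodule
```

The queue acts as a tiny scoreboard: results are checked in the order the inputs were issued, one per cycle, which is the behaviour a systolic array relies on once its pipeline is full.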
Day 27: Scaling to 4×4 systolic array: connecting 16 MACs with dataflow.