Day 27

4×4 Systolic Array RTL

Complete SystemVerilog implementation: PE instantiation, dataflow, memory interfaces. Synthesis-ready code.

Architecture Overview

4×4 Systolic Array:

┌────┬────┬────┬────┐
│ M00│ M01│ M02│ M03│ → output
├────┼────┼────┼────┤
│ M10│ M11│ M12│ M13│ → output
├────┼────┼────┼────┤
│ M20│ M21│ M22│ M23│ → output
├────┼────┼────┼────┤
│ M30│ M31│ M32│ M33│ → output
└────┴────┴────┴────┘

Data flow:
- A values (rows): Enter from left, shift right
- B values (columns): Enter from top, shift down
- Accumulate at each PE
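The data flow above (A shifting right, B shifting down, accumulation in place) can be sketched as a single processing element. This is a hypothetical pe_sketch, not the mac_unit from Day 26: the module name, the a_out/b_out/acc ports, the unsigned arithmetic, and the synchronous active-high reset are all assumptions made for illustration.

```systemverilog
// Hypothetical PE sketch (NOT the Day 26 mac_unit): an output-stationary
// multiply-accumulate cell with registered A/B forwarding.
// Port names, unsigned operands, and synchronous reset are assumptions.
module pe_sketch #(
    parameter WIDTH_A = 8,
    parameter WIDTH_B = 8,
    parameter WIDTH_C = 32
) (
    input  logic               clk,
    input  logic               reset,   // assumed synchronous, active high
    input  logic [WIDTH_A-1:0] a_in,    // from the left neighbour
    input  logic [WIDTH_B-1:0] b_in,    // from the neighbour above
    output logic [WIDTH_A-1:0] a_out,   // to the right neighbour, one cycle later
    output logic [WIDTH_B-1:0] b_out,   // to the neighbour below, one cycle later
    output logic [WIDTH_C-1:0] acc      // running sum held in this PE
);
    always_ff @(posedge clk) begin
        if (reset) begin
            a_out <= '0;
            b_out <= '0;
            acc   <= '0;
        end else begin
            a_out <= a_in;                 // shift A right
            b_out <= b_in;                 // shift B down
            acc   <= acc + a_in * b_in;    // accumulate in place
        end
    end
endmodule
```

Because a_out and b_out are registered, each operand reaches the next PE exactly one cycle later. That per-hop delay is what makes the array systolic rather than a broadcast bus.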

Top-Level Module

module systolic_4x4 #(
    parameter WIDTH_A = 8,
    parameter WIDTH_B = 8,
    parameter WIDTH_C = 32
) (
    input                clk, reset,
    input  [WIDTH_A-1:0] a_in  [0:3],   // A inputs (left edge)
    input  [WIDTH_B-1:0] b_in  [0:3],   // B inputs (top edge)
    output [WIDTH_C-1:0] c_out [0:3],   // C outputs (bottom row)
    input                valid_in,
    output               valid_out
);

    // Routing nets between the 16 MACs
    wire [WIDTH_A-1:0] a_routes [0:3][0:4];   // Columns 0-3 feed PEs, column 4 is the right-edge output
    wire [WIDTH_B-1:0] b_routes [0:4][0:3];   // Rows 0-3 feed PEs, row 4 is the bottom-edge output
    wire [WIDTH_C-1:0] c_routes [0:3][0:3];   // Accumulator value of each PE

    // Edge inputs
    assign a_routes[0][0] = a_in[0];
    assign a_routes[1][0] = a_in[1];
    assign a_routes[2][0] = a_in[2];
    assign a_routes[3][0] = a_in[3];

    assign b_routes[0][0] = b_in[0];
    assign b_routes[0][1] = b_in[1];
    assign b_routes[0][2] = b_in[2];
    assign b_routes[0][3] = b_in[3];

    // Generate 4×4 grid of MACs
    genvar i, j;
    generate
        for (i = 0; i < 4; i++) begin : row
            for (j = 0; j < 4; j++) begin : col
                // c_out is fed back to c_in so the PE accumulates in place.
                // Assumes mac_unit registers c_out; if it does not, this
                // feedback is a combinational loop.
                mac_unit #(.WIDTH_A(WIDTH_A), .WIDTH_B(WIDTH_B), .WIDTH_C(WIDTH_C)) pe (
                    .clk(clk), .reset(reset),
                    .a_in(a_routes[i][j]),
                    .b_in(b_routes[i][j]),
                    .c_in(c_routes[i][j]),
                    .c_out(c_routes[i][j])
                );

                // Pipeline registers between neighbouring PEs. A plain assign
                // here would broadcast one value to the whole row/column in a
                // single cycle instead of shifting it one PE per clock.
                // Synchronous reset is an assumption.
                reg [WIDTH_A-1:0] a_reg;
                reg [WIDTH_B-1:0] b_reg;
                always_ff @(posedge clk) begin
                    if (reset) begin
                        a_reg <= '0;
                        b_reg <= '0;
                    end else begin
                        a_reg <= a_routes[i][j];
                        b_reg <= b_routes[i][j];
                    end
                end

                // A dataflow (rightward)
                assign a_routes[i][j+1] = a_reg;
                // B dataflow (downward)
                assign b_routes[i+1][j] = b_reg;
            end
        end
    endgenerate

    // Output capture (bottom row only)
    assign c_out[0] = c_routes[3][0];
    assign c_out[1] = c_routes[3][1];
    assign c_out[2] = c_routes[3][2];
    assign c_out[3] = c_routes[3][3];

    // Placeholder passthrough: does not model the array's pipeline latency
    assign valid_out = valid_in;

endmodule
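The top-level module passes valid_in straight through to valid_out. When the array has pipeline latency, the valid flag usually needs to be delayed by the same number of cycles so that it lines up with the data. A minimal sketch of such a delay line follows. The module name valid_delay and the LATENCY parameter are hypothetical, and the default of 4 is only a placeholder: the real value depends on the pipeline depth of mac_unit and on how many accumulation steps a result takes.

```systemverilog
// Hypothetical valid-flag delay line (not part of the original design).
// LATENCY = 4 is an assumed placeholder; set it to the array's real latency.
module valid_delay #(
    parameter int LATENCY = 4           // must be >= 1
) (
    input  logic clk,
    input  logic reset,                 // assumed synchronous, active high
    input  logic valid_in,
    output logic valid_out
);
    logic [LATENCY-1:0] pipe;           // bit k holds valid_in from k+1 cycles ago

    always_ff @(posedge clk) begin
        if (reset)
            pipe <= '0;
        else
            pipe <= (pipe << 1) | valid_in;   // shift in the newest flag at bit 0
    end

    // The oldest bit leaves the shift register after LATENCY clock edges
    assign valid_out = pipe[LATENCY-1];
endmodule
```

A shift register is used rather than a counter so that back-to-back valid pulses are each delayed independently.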

Key Design Points

Synthesis Considerations

Timing closure:
- Critical path: MAC (multiplier + adder) = ~0.7 ns
- Routing: A/B buses = ~0.2 ns
- Total: ~0.9 ns, leaving ~0.1 ns of slack against the 1 ns clock period (1 GHz)

Area:
- 4×4 = 16 MACs
- Area per MAC: 4.4 μm² (from Day 26)
- Array area: 16 × 4.4 ≈ 70 μm²
- Routing/interconnect: +20% = ~85 μm²

Power:
- 16 MACs × 1 mW = 16 mW sustained
- Leakage: ~2 mW
- Total: ~18 mW @ 1 GHz

Day 28: Power optimization: clock gating, voltage scaling, precision reduction.