Architecture Overview
4×4 Systolic Array:
┌────┬────┬────┬────┐
│ M00│ M01│ M02│ M03│ → output
├────┼────┼────┼────┤
│ M10│ M11│ M12│ M13│ → output
├────┼────┼────┼────┤
│ M20│ M21│ M22│ M23│ → output
├────┼────┼────┼────┤
│ M30│ M31│ M32│ M33│ → output
└────┴────┴────┴────┘
Data flow:
- A values (rows): Enter from left, shift right
- B values (columns): Enter from top, shift down
- Accumulate at each PE
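The data flow above can be written as a timing relation. This is a sketch under two assumptions that the text does not state explicitly: each hop between neighbouring PEs costs one clock, and the inputs are skewed so that row i of A and column j of B start i and j cycles late. Indices run from 0 to 3, and cycle 0 is when the first operands enter the array.

```latex
\[
\begin{aligned}
C_{ij} &= \sum_{k=0}^{3} A_{ik}\, B_{kj} \\
t_{\text{inject}}(A_{ik}) &= i + k \quad \text{(left edge of row } i\text{)} \\
t_{\text{inject}}(B_{kj}) &= j + k \quad \text{(top edge of column } j\text{)} \\
t_{\text{meet}}(i, j, k) &= i + j + k \quad \text{(both operands at PE } (i,j)\text{)}
\end{aligned}
\]
```

A_ik needs j hops to reach column j and B_kj needs i hops to reach row i, so both arrive at cycle i + j + k. Under these assumptions the last operand pair reaches PE (3,3) at cycle 9, before any MAC pipeline latency is added.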
Top-Level Module
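module systolic_4x4 #(
    parameter WIDTH_A = 8,
    parameter WIDTH_B = 8,
    parameter WIDTH_C = 32
) (
    input                clk, reset,
    input  [WIDTH_A-1:0] a_in  [0:3],  // A inputs (left edge, one per row)
    input  [WIDTH_B-1:0] b_in  [0:3],  // B inputs (top edge, one per column)
    output [WIDTH_C-1:0] c_out [0:3],  // C outputs (bottom row, one per column)
    input                valid_in,
    output               valid_out
);
    // Inter-PE routing; index 4 is the value leaving the far edge of the array
    wire [WIDTH_A-1:0] a_routes [0:3][0:4];  // [row][column stage]
    wire [WIDTH_B-1:0] b_routes [0:4][0:3];  // [row stage][column]
    wire [WIDTH_C-1:0] c_routes [0:3][0:3];  // accumulator output of each PE

    // Edge connections: a_in feeds column 0, b_in feeds row 0
    assign a_routes[0][0] = a_in[0];
    assign a_routes[1][0] = a_in[1];
    assign a_routes[2][0] = a_in[2];
    assign a_routes[3][0] = a_in[3];
    assign b_routes[0][0] = b_in[0];
    assign b_routes[0][1] = b_in[1];
    assign b_routes[0][2] = b_in[2];
    assign b_routes[0][3] = b_in[3];

    // Generate 4×4 grid of MACs
    genvar i, j;
    generate
        for (i = 0; i < 4; i++) begin : row
            for (j = 0; j < 4; j++) begin : col
                // c_out feeds back into c_in, so each PE accumulates in place
                // (this assumes mac_unit registers c_out)
                mac_unit #(.WIDTH_A(WIDTH_A), .WIDTH_B(WIDTH_B), .WIDTH_C(WIDTH_C))
                pe (
                    .clk(clk), .reset(reset),
                    .a_in(a_routes[i][j]),
                    .b_in(b_routes[i][j]),
                    .c_in(c_routes[i][j]),
                    .c_out(c_routes[i][j])
                );

                // One register per hop: A moves one PE right and B one PE down
                // per clock. A plain assign here would broadcast the operands
                // instead of shifting them. Synchronous reset is assumed.
                logic [WIDTH_A-1:0] a_reg;
                logic [WIDTH_B-1:0] b_reg;
                always_ff @(posedge clk) begin
                    if (reset) begin
                        a_reg <= '0;
                        b_reg <= '0;
                    end else begin
                        a_reg <= a_routes[i][j];
                        b_reg <= b_routes[i][j];
                    end
                end
                assign a_routes[i][j+1] = a_reg;  // A dataflow (rightward)
                assign b_routes[i+1][j] = b_reg;  // B dataflow (downward)
            end
        end
    endgenerate

    // Output capture: exposes the bottom-row accumulators (row 3) only
    assign c_out[0] = c_routes[3][0];
    assign c_out[1] = c_routes[3][1];
    assign c_out[2] = c_routes[3][2];
    assign c_out[3] = c_routes[3][3];

    // Passthrough: valid_out is not delayed to match the array latency
    assign valid_out = valid_in;
endmodule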
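The top level instantiates mac_unit without showing it. The following is a minimal sketch of a PE that fits the port list used above, not the actual Day 26 module. It assumes unsigned operands, a synchronous active-high reset, and a two-stage pipeline (multiply, then add); the internal signal name product is hypothetical.

```systemverilog
// Hypothetical mac_unit sketch matching the ports used by systolic_4x4.
// Assumptions: unsigned data, synchronous active-high reset, 2-stage pipeline.
module mac_unit #(
    parameter WIDTH_A = 8,
    parameter WIDTH_B = 8,
    parameter WIDTH_C = 32
) (
    input  logic               clk,
    input  logic               reset,
    input  logic [WIDTH_A-1:0] a_in,
    input  logic [WIDTH_B-1:0] b_in,
    input  logic [WIDTH_C-1:0] c_in,
    output logic [WIDTH_C-1:0] c_out
);
    // Full-width product register, so the multiply itself never truncates
    logic [WIDTH_A+WIDTH_B-1:0] product;

    always_ff @(posedge clk) begin
        if (reset) begin
            product <= '0;
            c_out   <= '0;
        end else begin
            product <= a_in * b_in;     // stage 1: multiply
            c_out   <= c_in + product;  // stage 2: add to the incoming sum
        end
    end
endmodule
```

When c_in is wired to the same PE's c_out, as in the array above, stage 2 becomes c_out <= c_out + product, which is a running accumulator. This sketch has no enable input, so a_in and b_in would need to be held at zero whenever no valid data is being streamed.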
Key Design Points
- Pipelining: Each PE's MAC is pipelined over 2 cycles, so the full result is ready 2 cycles after the last operands arrive (cycle N+2 if they arrive at cycle N)
- Dataflow: A routes right (shift register), B routes down
- Scalability: Can be extended to 16×16 by changing the loop bounds together with the port and routing array dimensions, which are hard-coded for 4×4 here
- Reset: All accumulators clear on reset
- Valid signal: valid_out is a direct passthrough of valid_in; it is not delayed to match the array latency
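In the module above, valid_out is assigned straight from valid_in. One common way to make a valid flag track the data is a shift register whose depth equals the data latency. The sketch below illustrates that idea only: valid_delay and DEPTH are hypothetical names, and the right DEPTH depends on the array's end-to-end latency, which is not stated here.

```systemverilog
// Hypothetical helper: delays valid_in by DEPTH clock cycles (DEPTH >= 1).
module valid_delay #(
    parameter int DEPTH = 2  // assumption: set to the array's data latency
) (
    input  logic clk,
    input  logic reset,      // assumed synchronous, active high
    input  logic valid_in,
    output logic valid_out
);
    logic [DEPTH-1:0] pipe;

    always_ff @(posedge clk) begin
        if (reset)
            pipe <= '0;
        else
            pipe <= (pipe << 1) | valid_in;  // shift in the newest valid bit
    end

    // The oldest bit leaves the pipeline DEPTH cycles after it entered
    assign valid_out = pipe[DEPTH-1];
endmodule
```

Replacing the passthrough assign with an instance of such a module would keep valid_out aligned with c_out, provided DEPTH is chosen to match the measured latency.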
Synthesis Considerations
Timing closure:
- Critical path: MAC (multiplier + adder) = ~0.7 ns
- Routing: A/B buses = ~0.2 ns
- Total: ~0.9 ns (~0.1 ns of slack against the 1 ns clock period at 1 GHz)
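The timing budget can be written out as a short worked calculation. It uses only the estimates listed above and assumes that setup time and clock uncertainty are already folded into those figures.

```latex
\[
\begin{aligned}
t_{\text{path}} &= t_{\text{MAC}} + t_{\text{route}} \approx 0.7 + 0.2 = 0.9\ \text{ns} \\
\text{slack} &= T_{\text{clk}} - t_{\text{path}} \approx 1.0 - 0.9 = 0.1\ \text{ns} \\
f_{\max} &\approx \frac{1}{t_{\text{path}}} = \frac{1}{0.9\ \text{ns}} \approx 1.11\ \text{GHz}
\end{aligned}
\]
```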
Area:
- 4×4 = 16 MACs
- Area per MAC: 4.4 μm² (from Day 26)
- Array area: 16 × 4.4 = 70.4 μm²
- Total with routing/interconnect (+20%): ~84.5 μm²
Power:
- 16 MACs × 1 mW = 16 mW sustained
- Leakage: ~2 mW
- Total: ~18 mW @ 1 GHz
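As a rough projection for the 16×16 case mentioned under scalability, the per-MAC figures above can be scaled with the number of PEs. This is only a back-of-envelope sketch: it assumes area and power grow linearly with N², that the 20% routing overhead stays constant, and that leakage scales with MAC count (2 mW / 16 = 0.125 mW per MAC). None of these assumptions is verified here.

```latex
\[
\begin{aligned}
A(N) &\approx 1.2 \times 4.4\ \mu\text{m}^2 \times N^2,
  & A(4) &\approx 84.5\ \mu\text{m}^2,
  & A(16) &\approx 1352\ \mu\text{m}^2 \\
P(N) &\approx (1 + 0.125)\ \text{mW} \times N^2,
  & P(4) &= 18\ \text{mW},
  & P(16) &= 288\ \text{mW}
\end{aligned}
\]
```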
Next, Day 28: power optimization (clock gating, voltage scaling, precision reduction).