The architecture inside Google's TPU — brought to FPGA. Weight-stationary dataflow, Processing Element design, skewed data feeding, timing analysis, and a complete 4×4 systolic array in Verilog.
A systolic array is a grid of simple Processing Elements (PEs) that pass data to their neighbors every clock cycle — like blood pumped rhythmically through a heart (hence "systolic"). The data flows through the array in a wave, and each PE does one MAC operation as the data passes.
The key difference from the Day 3 GEMM engine: in a systolic array, there is no shared memory bus. Each PE talks only to its immediate neighbors, so only the PEs on the array edges are fed from memory. Edge bandwidth grows with N while compute grows with N², which removes the shared-bus bottleneck inside the array and lets the design scale to thousands of PEs.
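Google's TPU v1 (announced in 2016) used a 256×256 systolic array: 65,536 MACs firing every cycle at 700 MHz, for a peak of about 92 TOPS when each MAC is counted as two operations. Google reported it running inference roughly 15–30× faster than contemporary CPUs and GPUs, with 30–80× better performance-per-watt. Later generations moved to 128×128 arrays: TPU v4 has four of them per core and two cores per chip.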
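The 92 TOPS figure can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch that assumes one MAC counts as two operations (a multiply and an add); the constant names are illustrative.

```python
# Peak throughput of a 256x256 MAC array clocked at 700 MHz.
# Assumption: one MAC = 2 operations (multiply + add).
ROWS, COLS = 256, 256
CLOCK_HZ = 700e6
OPS_PER_MAC = 2

macs_per_cycle = ROWS * COLS                       # 65,536 MACs every cycle
peak_ops = macs_per_cycle * OPS_PER_MAC * CLOCK_HZ # operations per second
peak_tops = peak_ops / 1e12

print(f"{macs_per_cycle} MACs/cycle, {peak_tops:.2f} TOPS peak")
# 65536 MACs/cycle, 91.75 TOPS peak
```

This is a peak number: it assumes every PE does useful work on every cycle.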
The PE is the atomic unit of a systolic array. Each PE receives an activation from the left and a partial sum from above, multiplies the activation by its stored weight, adds the product to the partial sum, and passes the activation right and the updated partial sum down.
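Before looking at RTL, it can help to model a single PE in software. The class below is a minimal behavioral sketch, not the hardware itself: the `PE` class and its `load` and `step` methods are hypothetical names, and each call to `step` stands for one clock edge.

```python
class PE:
    """Behavioral model of one weight-stationary PE (illustrative helper)."""

    def __init__(self):
        self.weight = 0     # stationary weight register
        self.act_out = 0    # activation register, feeds the PE to the right
        self.psum_out = 0   # partial-sum register, feeds the PE below

    def load(self, weight):
        """Store the weight once, before streaming activations."""
        self.weight = weight

    def step(self, act_in, psum_in):
        """One clock edge: register the activation and the updated partial sum."""
        self.act_out = act_in
        self.psum_out = psum_in + act_in * self.weight
        return self.act_out, self.psum_out


pe = PE()
pe.load(3)
print(pe.step(act_in=5, psum_in=10))   # (5, 25)
print(pe.step(act_in=-2, psum_in=0))   # (-2, -6)
```

Python integers do not overflow, so this model ignores the fixed accumulator width that the hardware has to respect.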
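The key insight of weight-stationary dataflow: each PE holds one weight in a register for the entire computation. Activations stream through horizontally. Partial sums accumulate vertically, so column j adds up a[0]·w[0][j] + a[1]·w[1][j] + ... as the sum moves down. Once an input vector has passed through all N rows, the bottom of each column holds a complete dot product.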
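Ignoring timing for a moment, the accumulation down each column can be sketched as a plain loop. The function name `weight_stationary_matvec` is hypothetical, and the model assumes `acts[i]` is the activation entering array row `i`.

```python
def weight_stationary_matvec(weights, acts):
    """Untimed view: acts[i] enters array row i, partial sums flow down columns."""
    n_rows, n_cols = len(weights), len(weights[0])
    psum = [0] * n_cols                          # zeros enter the top row
    for i in range(n_rows):                      # walk down the array row by row
        for j in range(n_cols):
            psum[j] += acts[i] * weights[i][j]   # PE[i][j] adds its product
    return psum                                  # values leaving the bottom row


W = [[1, 2],
     [3, 4]]
print(weight_stationary_matvec(W, [10, 1]))      # [13, 24]
```

The hardware performs the same additions, but spreads them over time so that every PE is busy with a different input vector on each cycle.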
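Without skewing, data alignment fails. The partial sum leaving PE[0][j] takes one clock cycle to reach PE[1][j], so the activation for row 1 must arrive one cycle after the matching activation for row 0, row 2 two cycles after, and so on. If all rows were fed in the same cycle, PE[1][j] would add its product to a partial sum that belongs to a different input vector, which is wrong. The fix is to delay row i by i cycles: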
| Input | Cycle 0 | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 |
|---|---|---|---|---|---|---|---|
| Row 0 (no delay) | a[0][0] | a[0][1] | a[0][2] | a[0][3] | 0 | 0 | 0 |
| Row 1 (+1 cycle) | 0 | a[1][0] | a[1][1] | a[1][2] | a[1][3] | 0 | 0 |
| Row 2 (+2 cycles) | 0 | 0 | a[2][0] | a[2][1] | a[2][2] | a[2][3] | 0 |
| Row 3 (+3 cycles) | 0 | 0 | 0 | a[3][0] | a[3][1] | a[3][2] | a[3][3] |
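The schedule in the table above can be generated and checked with a small cycle-level model. This is a sketch under stated assumptions: `skew` and `simulate` are hypothetical helpers, every PE registers both its activation and its partial sum, and `streams[i][m]` is the m-th value fed to array row i (the same indexing as `a[i][m]` in the table).

```python
def skew(streams):
    """Delay row i by i cycles, padding with zeros."""
    n, m = len(streams), len(streams[0])
    total = m + n - 1
    return [[0] * i + list(row) + [0] * (total - m - i)
            for i, row in enumerate(streams)]


def simulate(weights, streams):
    """Cycle-level model of an N x N weight-stationary array.

    Returns results[m][j] = sum over i of streams[i][m] * weights[i][j].
    """
    n, m_count = len(weights), len(streams[0])
    feed = skew(streams)
    act = [[0] * n for _ in range(n)]     # act_out register of each PE
    psum = [[0] * n for _ in range(n)]    # psum_out register of each PE
    results = [[None] * n for _ in range(m_count)]
    for t in range(m_count + 2 * n - 1):
        # Bottom of column j shows the result for vector m at cycle m + n + j.
        for j in range(n):
            m = t - n - j
            if 0 <= m < m_count:
                results[m][j] = psum[n - 1][j]
        # All PEs update on the same clock edge, using the current registers.
        new_act = [[0] * n for _ in range(n)]
        new_psum = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:
                    a_in = feed[i][t] if t < len(feed[i]) else 0
                else:
                    a_in = act[i][j - 1]
                p_in = 0 if i == 0 else psum[i - 1][j]
                new_act[i][j] = a_in
                new_psum[i][j] = p_in + a_in * weights[i][j]
        act, psum = new_act, new_psum
    return results


W = [[1, 0, 2, 0], [0, 1, 0, 2], [1, 1, 0, 0], [0, 0, 1, 1]]
A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
for row in skew(A):
    print(row)                            # one line per array row, as in the table
got = simulate(W, A)
want = [[sum(A[i][m] * W[i][j] for i in range(4)) for j in range(4)]
        for m in range(4)]
print(got == want)                        # True
```

Note that the results come out skewed as well: in this model column j delivers its value j cycles after column 0, so a de-skew stage or a per-column valid signal is needed on the output side.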
```verilog
// pe.v: Systolic Array Processing Element (Weight-Stationary)
module pe #(parameter DW=8, AW=32)(
    input  wire                 clk, rst_n,
    input  wire                 load_weight, // load weight this cycle
    input  wire signed [DW-1:0] weight_in,   // weight value
    input  wire signed [DW-1:0] act_in,      // activation from left
    input  wire signed [AW-1:0] psum_in,     // partial sum from above
    output reg  signed [DW-1:0] act_out,     // activation to right (registered)
    output reg  signed [AW-1:0] psum_out     // partial sum to below (registered)
);
    reg signed [DW-1:0] weight;              // stationary weight

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            weight   <= 0;
            act_out  <= 0;
            psum_out <= 0;
        end else begin
            if (load_weight) weight <= weight_in; // one-time weight setup
            act_out  <= act_in;                   // pass activation right
            // MAC: both operands are sign-extended to AW bits before multiplying
            psum_out <= psum_in + ({{(AW-DW){act_in[DW-1]}}, act_in}
                                 * {{(AW-DW){weight[DW-1]}}, weight});
        end
    end
endmodule
```

```verilog
// systolic_4x4.v: 4×4 Weight-Stationary Systolic Array
// Note: unpacked array ports require SystemVerilog, so compile this file as SystemVerilog.
module systolic_4x4 #(parameter DW=8, AW=32, N=4)(
    input  wire                 clk, rst_n,
    input  wire                 load_weight,
    input  wire signed [DW-1:0] weights [0:N-1][0:N-1], // weight matrix
    input  wire signed [DW-1:0] act_in  [0:N-1],        // skewed activations
    output wire signed [AW-1:0] psum_out[0:N-1]         // output columns
);
    // Internal wires: act[row][col] needs N+1 columns, psum[row][col] needs N+1 rows
    wire signed [DW-1:0] act  [0:N-1][0:N];
    wire signed [AW-1:0] psum [0:N][0:N-1];

    genvar i, j;
    generate
        for (i = 0; i < N; i = i + 1) begin : row
            assign act[i][0]  = act_in[i]; // connect inputs to left edge
            assign psum[0][i] = 0;         // top row gets zero psum (i is used as the column index; the array is square)
            for (j = 0; j < N; j = j + 1) begin : col
                pe #(.DW(DW), .AW(AW)) u_pe (
                    .clk        (clk),
                    .rst_n      (rst_n),
                    .load_weight(load_weight),
                    .weight_in  (weights[i][j]),
                    .act_in     (act[i][j]),
                    .psum_in    (psum[i][j]),
                    .act_out    (act[i][j+1]),  // pass right
                    .psum_out   (psum[i+1][j])  // pass down
                );
            end
        end
    endgenerate

    // Bottom row outputs = complete dot products (column j lags column 0 by j cycles)
    genvar k;
    generate
        for (k = 0; k < N; k = k + 1) begin : out
            assign psum_out[k] = psum[N][k];
        end
    endgenerate
endmodule
```

| Property | Day 3 Parallel GEMM | Day 4 Systolic Array |
|---|---|---|
| Memory access | Shared BRAM per cycle | PE-to-PE only (no shared bus) |
| Weight storage | BRAM (reloaded per tile) | Registers inside each PE |
| Scalability | Limited by BRAM bandwidth | Scales to 256×256+ easily |
| Latency (K=256) | 256 cycles | 256 + 2N - 1 cycles (input and output skew) |
| Throughput | N² MACs/cycle | N² MACs/cycle (same!) |
| Control complexity | Simple (start/done) | Moderate (skewing logic) |
| Best for | Small arrays, easy debug | Large arrays, production |
| Used in | Research prototypes | Google TPU and similar matrix engines |
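The latency overhead of the systolic array is a fixed pipeline fill and drain cost, so it matters less as more inputs are streamed. The sketch below assumes the registered PE shown earlier and counts every cycle until the last column's result is visible; the function names are hypothetical, and the exact constant depends on what is counted as fill and drain.

```python
def systolic_cycles(num_vectors, n):
    """Cycles until the last result is visible on an n x n array.

    num_vectors streaming cycles + (n - 1) input skew
    + (n - 1) column skew + 1 output register.
    """
    return num_vectors + 2 * n - 1


def utilization(num_vectors, n):
    """Fraction of PE-cycles that perform a useful MAC over the whole run."""
    useful = num_vectors * n * n
    return useful / (systolic_cycles(num_vectors, n) * n * n)


for m in (4, 256, 4096):
    print(m, systolic_cycles(m, 4), round(utilization(m, 4), 3))
```

For a 4×4 array, streaming only 4 vectors takes 11 cycles, while streaming 256 takes 263, so long input streams keep the array close to its N² MACs/cycle peak.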
Next — Day 5: Convolution Engine — building a hardware 2D convolution unit with line buffers, sliding window logic, kernel weight storage, and im2col transformation.