
Systolic Array Architecture on FPGA

The architecture inside Google's TPU — brought to FPGA. Weight-stationary dataflow, Processing Element design, skewed data feeding, timing analysis, and a complete 4×4 systolic array in Verilog.

By EcrioniX Engineering Team · Published June 14, 2026 · ~4,700 words · 15 min read

1. What is a Systolic Array?

A systolic array is a grid of simple Processing Elements (PEs) that pass data to their neighbors every clock cycle — like blood pumped rhythmically through a heart (hence "systolic"). The data flows through the array in a wave, and each PE does one MAC operation as the data passes.

The key difference from the Day 3 GEMM engine: inside a systolic array there is no shared memory bus. Each PE talks only to its immediate neighbors, and only the PEs on the array edges touch memory. Every value fetched is reused by a whole row or column of PEs, which removes the per-PE memory bandwidth bottleneck and lets the array scale to thousands of PEs.

Google TPU — The Proof

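Google's TPU v1 (announced in 2016) used a 256×256 systolic array: 65,536 MACs firing every cycle at 700 MHz, or about 92 TOPS peak when each MAC is counted as two operations. Google reported it running roughly 15–30× faster than contemporary CPUs and GPUs on inference workloads, with 30–80× better performance-per-watt. Later generations such as TPU v4 use smaller 128×128 arrays, with several arrays per core and multiple cores per chip.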
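The 92 TOPS figure follows from simple arithmetic. The sketch below reproduces it, assuming the common convention that one MAC counts as two operations (a multiply and an add). The helper name `peak_tops` is made up for this illustration.

```python
# Peak-throughput arithmetic for a rectangular MAC array.
# Assumption: one multiply-accumulate counts as 2 operations.

def peak_tops(rows: int, cols: int, clock_hz: float, ops_per_mac: int = 2) -> float:
    """Return peak throughput in tera-operations per second."""
    macs_per_cycle = rows * cols
    return macs_per_cycle * ops_per_mac * clock_hz / 1e12


tpu_v1_macs = 256 * 256
tpu_v1_tops = peak_tops(256, 256, 700e6)

print(tpu_v1_macs)            # 65536
print(round(tpu_v1_tops, 2))  # 91.75
```

The same function gives a quick upper bound for any FPGA array size once the achievable clock is known. Real throughput is lower whenever the array is filling, draining, or waiting for data.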

2. Systolic Array vs Parallel MAC Array

Day 3 Parallel MAC Array
  • 📦 All PEs share BRAM for input
  • 📦 Central bus: memory bandwidth limited
  • 📦 Hard to scale beyond 64×64
  • ✅ Simple control logic
  • ✅ Easy to understand
Systolic Array (This Day)
  • ✅ No shared bus — PE-to-PE only
  • ✅ Scales to 256×256+ easily
  • ✅ Near-100% DSP utilization
  • ✅ Weights loaded once, reused fully
  • ⚠️ Data skewing adds latency complexity

3. Processing Element (PE) Design

The PE is the atomic unit of a systolic array. Each PE receives data from the left (activation) and from above (partial sum), multiplies by its stored weight, and passes results right and down.

Figure: Processing Element (PE), weight-stationary
  • Weight register W[i][j]: loaded once via weight_load (one-time setup), then stays fixed
  • act_in: activation from the left; act_out: activation passed to the next PE on the right
  • psum_in: partial sum from above; psum_out: partial sum passed to the PE below
  • Accumulator: psum_out = psum_in + (act_in × W)
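A PE is small enough to model in a few lines of software, which helps when checking the hardware later. The sketch below is a behavioural model only: the class name `PE` and the method `clock` are made up for this example, Python integers do not wrap like fixed-width registers, and it assumes that on a weight-load edge the multiply still uses the previously stored weight.

```python
from dataclasses import dataclass


@dataclass
class PE:
    """Behavioural model of one weight-stationary PE (illustrative helper)."""
    weight: int = 0     # stored weight W[i][j]
    act_out: int = 0    # activation register, drives the PE to the right
    psum_out: int = 0   # partial-sum register, drives the PE below

    def clock(self, act_in: int, psum_in: int,
              load_weight: bool = False, weight_in: int = 0) -> None:
        """Apply one rising clock edge using the inputs seen before the edge."""
        # Compute with the weight held before the edge, then update registers.
        next_psum = psum_in + act_in * self.weight
        if load_weight:
            self.weight = weight_in
        self.act_out = act_in
        self.psum_out = next_psum


pe = PE()
pe.clock(act_in=0, psum_in=0, load_weight=True, weight_in=3)  # one-time setup
pe.clock(act_in=5, psum_in=10)                                # 10 + 5 * 3
print(pe.act_out, pe.psum_out)  # 5 25
```

Both outputs are registered, so a value presented in one cycle is visible to the neighboring PEs only in the next cycle. That one-cycle delay per hop is what creates the diagonal wave of data through the array.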

4. Weight-Stationary Dataflow — How It Works

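The key insight of weight-stationary: each PE stores one weight permanently. Activations stream through horizontally and partial sums accumulate vertically. A partial sum picks up one product in each row on its way down, so a complete dot product over the N array rows emerges from the bottom of column 0 N cycles after its first activation enters the array, and from each further column one cycle later. When K activation vectors are streamed back to back, every column then delivers one finished result per cycle.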

Figure: 4×4 systolic array, weight-stationary data flow
  • PE[i][j] holds weight W[i][j], for rows 0 to 3 and columns 0 to 3; array column j holds W[·][j]
  • Activation rows a0, a1, ... enter at the left edge, each row delayed one cycle more than the row above
  • The top row receives a partial-sum input of 0
  • Output columns C[·][0] to C[·][3] leave the bottom row, valid after K + N - 1 cycles (K = depth, N = array rows)
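The wave of data in the figure can be written down as a formula. The sketch below assumes the one-cycle-per-row input delay described in Section 5 and a PE that registers both of its outputs; `active_cycle` and `result_cycle` are illustrative names, and cycles are counted from 0 at the moment the first activation is applied.

```python
def active_cycle(t: int, i: int, j: int) -> int:
    """Cycle in which PE[i][j] multiplies element i of streamed vector t.

    Row i is fed i cycles late, and the activation needs j more cycles
    to travel through the j PEs to its left.
    """
    return t + i + j


def result_cycle(t: int, j: int, n: int) -> int:
    """First cycle in which column j's finished sum for vector t is visible."""
    # The bottom-row PE works in cycle t + (n - 1) + j and registers its output.
    return active_cycle(t, n - 1, j) + 1


N = 4
wavefront = [[active_cycle(0, i, j) for j in range(N)] for i in range(N)]
for row in wavefront:
    print(row)
# [0, 1, 2, 3]
# [1, 2, 3, 4]
# [2, 3, 4, 5]
# [3, 4, 5, 6]
print([result_cycle(0, j, N) for j in range(N)])  # [4, 5, 6, 7]
```

PEs on the same anti-diagonal work on the same vector in the same cycle, and the results leave the bottom row staggered by one cycle per column.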

5. Data Skewing — Why It's Needed

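Without skewing, data alignment fails. The partial sum computed in PE[0][j] needs one clock cycle to reach PE[1][j], so the row 1 activation that belongs to the same input vector must arrive one cycle after the row 0 activation. If both rows are fed in the same cycle, PE[1][j] adds its product to a partial sum that belongs to a different vector, which is wrong. In general, row i must be delayed by i cycles.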
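The effect of missing skew is easy to see in a small cycle-by-cycle model. The sketch below simulates a 2×2 array with and without the per-row delay. It is a software model under stated assumptions (registered PE outputs, weights already loaded, unbounded integers), and `run_array` is a name invented for this example.

```python
def run_array(weights, vectors, skew=True):
    """Simulate an N x N weight-stationary array, one loop pass per clock edge.

    weights[i][j] is stored in PE[i][j]; vectors[t][i] feeds array row i.
    Returns out[t][j], read at the cycle where a correctly skewed result appears.
    """
    n, k = len(weights), len(vectors)
    act = [[0] * n for _ in range(n)]    # act_out register of every PE
    psum = [[0] * n for _ in range(n)]   # psum_out register of every PE
    bottom = []                          # bottom-row registers after each edge
    for cycle in range(k + 2 * n - 2):
        feed = []
        for i in range(n):
            t = cycle - i if skew else cycle   # row i is delayed by i cycles
            feed.append(vectors[t][i] if 0 <= t < k else 0)
        new_act = [[0] * n for _ in range(n)]
        new_psum = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                a_in = feed[i] if j == 0 else act[i][j - 1]
                p_in = 0 if i == 0 else psum[i - 1][j]
                new_act[i][j] = a_in
                new_psum[i][j] = p_in + a_in * weights[i][j]
        act, psum = new_act, new_psum
        bottom.append(list(psum[n - 1]))
    return [[bottom[t + n - 1 + j][j] for j in range(n)] for t in range(k)]


W = [[1, 2], [3, 4]]
X = [[1, 10], [100, 1000]]
print(run_array(W, X, skew=True))   # [[31, 42], [3100, 4200]]
print(run_array(W, X, skew=False))  # [[3001, 4002], [100, 200]]
```

With skew, the first output is 1×1 + 10×3 = 31, as expected. Without skew it is 1×1 + 1000×3 = 3001: the row 0 product of the first vector has been combined with the row 1 product of the second vector.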

Skewed Input Feeding — Timing Diagram (4-row array, K=4)
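| Input | Cycle 0 | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 |
|---|---|---|---|---|---|---|---|
| Row 0 (no delay) | a[0][0] | a[0][1] | a[0][2] | a[0][3] | 0 | 0 | 0 |
| Row 1 (+1 cycle) | 0 | a[1][0] | a[1][1] | a[1][2] | a[1][3] | 0 | 0 |
| Row 2 (+2 cycles) | 0 | 0 | a[2][0] | a[2][1] | a[2][2] | a[2][3] | 0 |
| Row 3 (+3 cycles) | 0 | 0 | 0 | a[3][0] | a[3][1] | a[3][2] | a[3][3] |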
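Feeding all K activation vectors takes K + N - 1 = 4 + 4 - 1 = 7 cycles (cycles 0 to 6) for a 4×4 array with K=4, and the last result of column 0 is valid after those 7 cycles. Each further column finishes one cycle later, so the last column is done after K + 2N - 2 = 10 cycles. As the comparison in Section 8 shows, the parallel array has a latency of K cycles, so skewing costs a fixed fill-and-drain overhead. That overhead becomes negligible for large K, and the steady-state throughput is the same.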
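The skewed feed pattern is mechanical enough to generate. The sketch below builds it for any N and K, using strings as placeholders so the printed result can be compared with the timing diagram. `skew_schedule` is an illustrative name; in hardware the same effect usually comes from a shift register of depth r in front of row r.

```python
def skew_schedule(a, fill="0"):
    """Delay row r of a[r][t] by r cycles and pad the gaps with `fill`.

    a has N rows (one per array row) and K columns (one per streamed vector).
    Returns N rows of K + N - 1 entries, indexed as schedule[row][cycle].
    """
    n, k = len(a), len(a[0])
    cycles = k + n - 1
    return [[a[r][c - r] if 0 <= c - r < k else fill for c in range(cycles)]
            for r in range(n)]


N, K = 4, 4
a = [[f"a[{r}][{t}]" for t in range(K)] for r in range(N)]
schedule = skew_schedule(a)
for r, row in enumerate(schedule):
    print(f"Row {r}:", " ".join(row))
# Row 0: a[0][0] a[0][1] a[0][2] a[0][3] 0 0 0
# Row 1: 0 a[1][0] a[1][1] a[1][2] a[1][3] 0 0
# Row 2: 0 0 a[2][0] a[2][1] a[2][2] a[2][3] 0
# Row 3: 0 0 0 a[3][0] a[3][1] a[3][2] a[3][3]
```

The last row receives its final element in cycle K + N - 2, which is why the feed occupies K + N - 1 cycles in total.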

6. Processing Element Verilog

```verilog
// pe.v: Systolic Array Processing Element (Weight-Stationary)
module pe #(parameter DW = 8, AW = 32)(
    input  wire                 clk,
    input  wire                 rst_n,
    input  wire                 load_weight,  // load weight this cycle
    input  wire signed [DW-1:0] weight_in,    // weight value
    input  wire signed [DW-1:0] act_in,       // activation from left
    input  wire signed [AW-1:0] psum_in,      // partial sum from above
    output reg  signed [DW-1:0] act_out,      // activation to right (registered)
    output reg  signed [AW-1:0] psum_out      // partial sum to below (registered)
);
    reg  signed [DW-1:0]   weight;
    // Full-precision signed product; sign-extended to AW bits in the add below
    wire signed [2*DW-1:0] product = act_in * weight;

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            weight   <= 0;
            act_out  <= 0;
            psum_out <= 0;
        end else begin
            if (load_weight) weight <= weight_in;  // one-time weight setup
            act_out  <= act_in;                    // pass activation right
            psum_out <= psum_in + product;         // multiply-accumulate
        end
    end
endmodule
```
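The default AW=32 leaves a lot of headroom for DW=8 operands. The sketch below estimates the smallest signed accumulator width that can never overflow, assuming full-range two's-complement inputs and one product added per array row. `signed_bits_needed` and `psum_bits` are names invented for this example.

```python
def signed_bits_needed(min_value: int, max_value: int) -> int:
    """Smallest two's-complement width that can hold both extremes."""
    bits = 1
    while not (-(1 << (bits - 1)) <= min_value
               and max_value <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits


def psum_bits(dw: int, rows: int) -> int:
    """Worst-case partial-sum width after accumulating `rows` products."""
    lo, hi = -(1 << (dw - 1)), (1 << (dw - 1)) - 1
    max_product = lo * lo   # most positive product, e.g. (-128) * (-128)
    min_product = lo * hi   # most negative product, e.g. (-128) * 127
    return signed_bits_needed(rows * min_product, rows * max_product)


print(psum_bits(8, 1))    # 16
print(psum_bits(8, 4))    # 18
print(psum_bits(8, 256))  # 24
```

For the 4×4 array in this tutorial, 18 bits would be enough. Narrower accumulators save registers and routing, though a wider AW may still be convenient if partial sums from several tiles are added together afterwards.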

7. 4×4 Systolic Array Top Level

```verilog
// systolic_4x4.v: 4×4 Weight-Stationary Systolic Array
// Uses unpacked array ports and i++, so compile it as SystemVerilog.
module systolic_4x4 #(parameter DW = 8, AW = 32, N = 4)(
    input  wire                 clk,
    input  wire                 rst_n,
    input  wire                 load_weight,
    input  wire signed [DW-1:0] weights  [0:N-1][0:N-1],  // weight matrix
    input  wire signed [DW-1:0] act_in   [0:N-1],         // skewed activations
    output wire signed [AW-1:0] psum_out [0:N-1]          // output columns
);
    // Internal wires: act[row][col] needs N+1 columns, psum[row][col] needs N+1 rows
    wire signed [DW-1:0] act  [0:N-1][0:N];
    wire signed [AW-1:0] psum [0:N][0:N-1];

    genvar i, j, k;
    generate
        for (i = 0; i < N; i++) begin : row
            assign act[i][0] = act_in[i];        // connect inputs to left edge
            for (j = 0; j < N; j++) begin : col
                pe #(.DW(DW), .AW(AW)) u_pe (
                    .clk        (clk),
                    .rst_n      (rst_n),
                    .load_weight(load_weight),
                    .weight_in  (weights[i][j]),
                    .act_in     (act[i][j]),
                    .psum_in    (psum[i][j]),
                    .act_out    (act[i][j+1]),   // pass right
                    .psum_out   (psum[i+1][j])   // pass down
                );
            end
        end
        for (k = 0; k < N; k++) begin : io_col
            assign psum[0][k]  = 0;              // top row gets zero psum
            assign psum_out[k] = psum[N][k];     // bottom row = complete dot products
        end
    endgenerate
endmodule
```
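A testbench for this module needs skewed stimulus and must know in which cycle each output column is valid, because the results leave the bottom row staggered by one cycle per column. The sketch below generates both from a plain matrix product. It assumes the weights are already loaded and the PE registers its outputs; `make_test_vectors` is a name invented for this example, not part of any tool.

```python
def make_test_vectors(weights, vectors):
    """Build per-cycle stimulus and expected outputs for an N x N array.

    stimulus[c][i]  : value to drive on act_in[i] during cycle c
    expected[c]     : list of (column, value) pairs that should appear on
                      psum_out right after the clock edge that sampled stimulus[c]
    """
    n, k = len(weights), len(vectors)
    cycles = k + 2 * n - 2                      # feed plus drain of all columns
    stimulus = [[vectors[c - i][i] if 0 <= c - i < k else 0 for i in range(n)]
                for c in range(cycles)]
    expected = [[] for _ in range(cycles)]
    for t in range(k):
        for j in range(n):
            value = sum(vectors[t][i] * weights[i][j] for i in range(n))
            expected[t + (n - 1) + j].append((j, value))
    return stimulus, expected


W = [[1, 2], [3, 4]]
X = [[1, 10], [100, 1000]]
stimulus, expected = make_test_vectors(W, X)
print(stimulus)  # [[1, 0], [100, 10], [0, 1000], [0, 0]]
print(expected)  # [[], [(0, 31)], [(1, 42), (0, 3100)], [(1, 4200)]]
```

Writing these lists to a file lets a simulator testbench replay the stimulus and compare only the columns listed for each cycle, ignoring the outputs that are still filling or draining.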

8. Systolic vs GEMM Engine Comparison

| Property | Day 3 Parallel GEMM | Day 4 Systolic Array |
|---|---|---|
| Memory access | Shared BRAM per cycle | PE-to-PE only (no shared bus) |
| Weight storage | BRAM (reloaded per tile) | Registers inside each PE |
| Scalability | Limited by BRAM bandwidth | Scales to 256×256+ easily |
| Latency (K=256) | 256 cycles | 256 + N - 1 cycles for column 0, plus 1 cycle per further column |
| Throughput | N² MACs/cycle | N² MACs/cycle (same!) |
| Control complexity | Simple (start/done) | Moderate (skewing logic) |
| Best for | Small arrays, easy debug | Large arrays, production |
| Used in | Research prototypes | Google TPU, Groq, Graphcore |
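The latency row hides how quickly the fill-and-drain overhead fades as K grows. The sketch below puts numbers on it. It assumes the cycle count runs until the last column has delivered its last result, which for a PE with registered outputs is the 256 + N - 1 figure plus N - 1 more cycles for the remaining columns; the function names are illustrative.

```python
def systolic_cycles(k: int, n: int) -> int:
    """Cycles from the first activation until the last column's last result."""
    return k + 2 * n - 2


def utilization(k: int, n: int) -> float:
    """Useful MACs (k * n * n) divided by MAC slots available (n * n * cycles)."""
    return k / systolic_cycles(k, n)


for k in (4, 256, 4096):
    print(k, systolic_cycles(k, 4), round(utilization(k, 4), 3))
# 4 10 0.4
# 256 262 0.977
# 4096 4102 0.999
```

For short streams the array spends most of its time filling and draining, so batching many activation vectors per weight load is what makes the near-100% utilization claim hold in practice.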

Day 4 — Key Takeaways

Next — Day 5: Convolution Engine — building a hardware 2D convolution unit with line buffers, sliding window logic, kernel weight storage, and im2col transformation.
