How does weight-stationary dataflow work?

In weight-stationary dataflow, each PE loads one weight value and keeps it fixed throughout the computation. Activation values stream horizontally through the array, and partial sums accumulate vertically. This eliminates repeated weight fetches — the weight is read from memory once and reused for every input in the batch.

How is the Google TPU different from a regular GPU?

The Google TPU is built around a large systolic array designed specifically for matrix multiplication. TPU v1 used a 256×256 INT8 array with 65,536 MACs firing every cycle; later generations use 128×128 arrays with BF16 support. A GPU has thousands of smaller CUDA cores with more complex scheduling overhead. For inference and training workloads with predictable dataflow, the TPU achieves much higher efficiency, partly because it spends no hardware on general-purpose control such as branch prediction.

Why is data skewing needed in a systolic array?

In a systolic array, a partial sum takes one clock cycle to move from one row of PEs to the next. The partial sum for a given input therefore reaches row i exactly i cycles after it left row 0, so the activation for row i has to arrive i cycles later as well. Without skewing, every row would receive its activation in the same cycle, and every row below the first would add its product to a partial sum that belongs to a different input. Skewing delays row i by i cycles at the input, so each activation meets its matching partial sum.

Systolic Array Architecture on FPGA — TPU-Style Accelerator

Q: What is a systolic array?

A systolic array is a grid of Processing Elements (PEs) where data flows rhythmically from one PE to the next each clock cycle — like blood pulsing through a heart (hence 'systolic'). Each PE performs a multiply-accumulate and passes data to its neighbor. The result is extremely high compute density with very low memory bandwidth requirements.

1. What is a Systolic Array?

A systolic array is a grid of simple Processing Elements (PEs) that pass data to their neighbors every clock cycle — like blood pumped rhythmically through a heart (hence "systolic"). The data flows through the array in a wave, and each PE does one MAC operation as the data passes.

The key difference from the Day 3 GEMM engine: in a systolic array, there is no shared memory bus. Each PE only talks to its immediate neighbors, and only the PEs on the array edges exchange data with memory. This removes the shared-bus bandwidth bottleneck and allows the array to scale to thousands of PEs.

Google TPU — The Proof

Google's TPU v1 (announced 2016) used a 256×256 systolic array = 65,536 MACs firing every cycle at 700 MHz = 92 TOPS. Google reported it as roughly 15–30× faster than contemporary CPUs and GPUs on inference, with about 30–80× better performance per watt. TPU v4 uses 128×128 matrix units, with several units per core and multiple cores per chip.
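The 92 TOPS figure follows directly from the array size and clock rate. The only assumption is the usual peak-throughput convention of counting each multiply-accumulate as two operations (one multiply, one add):

```latex
\begin{aligned}
\text{MACs per cycle} &= 256 \times 256 = 65{,}536 \\
\text{Peak throughput} &= 65{,}536 \times 2\ \tfrac{\text{ops}}{\text{MAC}} \times 700 \times 10^{6}\ \tfrac{\text{cycles}}{\text{s}} \\
&\approx 9.18 \times 10^{13}\ \text{ops/s} \approx 92\ \text{TOPS}
\end{aligned}
```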

2. Systolic Array vs Parallel MAC Array

Day 3 Parallel MAC Array

📦 All PEs share BRAM for input
📦 Central bus: memory bandwidth limited
📦 Hard to scale beyond 64×64
✅ Simple control logic
✅ Easy to understand

Day 4 Systolic Array

✅ No shared bus — PE-to-PE only
✅ Scales to 256×256+ easily
✅ Near-100% DSP utilization
✅ Weights loaded once, reused fully
⚠️ Data skewing adds latency complexity

3. Processing Element (PE) Design

The PE is the atomic unit of a systolic array. Each PE receives an activation from the left and a partial sum from above, multiplies the activation by its stored weight, adds the product to the partial sum, and then passes the activation to the right and the updated partial sum down.

Processing Element (PE) — Weight-Stationary

4. Weight-Stationary Dataflow — How It Works

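The key insight of weight-stationary: each PE stores one weight for the whole computation. Activations stream through horizontally. Partial sums accumulate vertically. Once a partial sum has passed through all N rows, the bottom of its column emits a complete dot product, and in steady state each column produces one result per cycle.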

4×4 Systolic Array — Weight-Stationary Data Flow
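As a sketch of what each column computes and when, assume the registered PE used in this lesson (both act_out and psum_out are flip-flops), inputs skewed by one cycle per row, and the row 0 activation of the m-th input vector presented at the left edge in cycle m. The symbols x_i[m] (activation fed to row i for input vector m), w_ij (weight stored in PE[i][j]), and y_j[m] (result leaving the bottom of column j) are notation introduced here for illustration:

```latex
\begin{aligned}
y_j[m] &= \sum_{i=0}^{N-1} x_i[m]\, w_{ij} \\
t_{\mathrm{PE}[i][j]}(m) &= m + i + j
  && \text{cycle in which PE}[i][j]\text{ consumes } x_i[m] \\
t_{\mathrm{out},\,j}(m) &= m + j + N
  && \text{cycle in which } y_j[m] \text{ is visible below column } j
\end{aligned}
```

The +j term means the results also leave the array skewed by column, so a consumer that wants whole output vectors has to realign them.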

5. Data Skewing — Why It's Needed

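Without skewing, data alignment fails. The partial sum leaving PE[0][·] needs one cycle to reach PE[1][·], so the row 1 activation that belongs to the same input must arrive one cycle after the row 0 activation. If all rows were fed in the same cycle, row 1 would add its product to a partial sum from a different input, which is wrong.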

Skewed Input Feeding — Timing Diagram (4-row array, K=4)

Input	Cycle 0	Cycle 1	Cycle 2	Cycle 3	Cycle 4	Cycle 5	Cycle 6
Row 0 (no delay)	a[0][0]	a[0][1]	a[0][2]	a[0][3]	0	0	0
Row 1 (+1 cycle)	0	a[1][0]	a[1][1]	a[1][2]	a[1][3]	0	0
Row 2 (+2 cycles)	0	0	a[2][0]	a[2][1]	a[2][2]	a[2][3]	0
Row 3 (+3 cycles)	0	0	0	a[3][0]	a[3][1]	a[3][2]	a[3][3]

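Feeding the skewed inputs takes K + N - 1 = 4 + 4 - 1 = 7 cycles (cycles 0 through 6) for a 4×4 array with K = 4 activations per row. The final results leave the bottom row a few cycles later, once the last partial sums have drained through the array. The skew adds a fixed N - 1 cycles on top of the K cycles an unskewed feed would need, so the overhead becomes negligible for large K, and steady-state throughput matches the parallel array.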
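One common way to generate the skewed feed shown in the timing table is a bank of shift registers, with row i passing through i register stages. The following is a minimal sketch under that assumption. The module name skew_buffer and its ports are hypothetical and not part of the course code; the unpacked array ports require SystemVerilog, chosen here so the output can drive the act_in port of systolic_4x4 directly.

```verilog
// skew_buffer.sv: hypothetical helper (not from the original course code).
// Delays row i of the activation stream by i clock cycles.
module skew_buffer #(parameter DW=8, N=4)(
  input  wire                 clk, rst_n,
  input  wire signed [DW-1:0] row_in  [0:N-1],  // unskewed: all rows arrive together
  output wire signed [DW-1:0] row_out [0:N-1]   // skewed: row i lags row 0 by i cycles
);
  genvar i;
  generate
    for (i = 0; i < N; i = i + 1) begin : row
      if (i == 0) begin : no_delay
        assign row_out[0] = row_in[0];          // row 0 passes straight through
      end else begin : delay
        reg signed [DW-1:0] sr [0:i-1];         // i-stage shift register
        integer s;
        always @(posedge clk or negedge rst_n) begin
          if (!rst_n) begin
            for (s = 0; s < i; s = s + 1) sr[s] <= 0;
          end else begin
            sr[0] <= row_in[i];                 // new sample enters stage 0
            for (s = 1; s < i; s = s + 1) sr[s] <= sr[s-1];
          end
        end
        assign row_out[i] = sr[i-1];            // tap after i stages
      end
    end
  endgenerate
endmodule
```

Because the registers reset to zero, the leading 0 entries in the table come for free. The upstream logic still has to drive zeros (or the downstream logic has to ignore the results) once the real data runs out.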

6. Processing Element Verilog

// pe.v — Systolic Array Processing Element (Weight-Stationary)
module pe #(parameter DW=8, AW=32)(
  input  wire              clk, rst_n,
  input  wire              load_weight,    // load weight this cycle
  input  wire signed [DW-1:0] weight_in,  // weight value
  input  wire signed [DW-1:0] act_in,     // activation from left
  input  wire signed [AW-1:0] psum_in,    // partial sum from above
  output reg  signed [DW-1:0] act_out,    // activation to right (registered)
  output reg  signed [AW-1:0] psum_out    // partial sum to below
);
  reg signed [DW-1:0] weight;

  always @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      weight   <= 0;
      act_out  <= 0;
      psum_out <= 0;
    end else begin
      if (load_weight) weight <= weight_in;  // one-time weight setup
      act_out  <= act_in;                    // pass activation right
      psum_out <= psum_in + ({{(AW-DW){act_in[DW-1]}}, act_in}
                           * {{(AW-DW){weight[DW-1]}}, weight});
    end
  end
endmodule
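A quick way to check the PE in isolation is a small self-checking testbench. The sketch below is an illustration only; the testbench name pe_tb and the chosen values (weight -3, activation 7, incoming partial sum 100) are arbitrary assumptions. It loads a weight, applies one activation and partial sum, and checks that the registered outputs equal 100 + 7 × (-3) = 79 and 7 one clock later.

```verilog
`timescale 1ns/1ps
// pe_tb.v: hypothetical self-checking testbench for the pe module above.
module pe_tb;
  reg                clk = 0, rst_n = 0, load_weight = 0;
  reg  signed [7:0]  weight_in = 0, act_in = 0;
  reg  signed [31:0] psum_in = 0;
  wire signed [7:0]  act_out;
  wire signed [31:0] psum_out;

  pe #(.DW(8), .AW(32)) dut (
    .clk(clk), .rst_n(rst_n), .load_weight(load_weight),
    .weight_in(weight_in), .act_in(act_in), .psum_in(psum_in),
    .act_out(act_out), .psum_out(psum_out)
  );

  always #5 clk = ~clk;              // rising edges at 5, 15, 25, ...

  initial begin
    #12 rst_n = 1;                   // release reset between clock edges
    @(negedge clk);                  // drive inputs on falling edges
    load_weight = 1; weight_in = -3; // captured on the next rising edge
    @(negedge clk);
    load_weight = 0; act_in = 7; psum_in = 100;
    @(negedge clk);                  // one rising edge has sampled act_in and psum_in
    if (psum_out === 32'sd79 && act_out === 8'sd7)
      $display("PASS: psum_out=%0d act_out=%0d", psum_out, act_out);
    else
      $display("FAIL: psum_out=%0d act_out=%0d (expected 79 and 7)",
               psum_out, act_out);
    $finish;
  end
endmodule
```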

7. 4×4 Systolic Array Top Level

// systolic_4x4.v: 4×4 Weight-Stationary Systolic Array
// Note: unpacked array ports and i++ are SystemVerilog; compile in SystemVerilog mode
module systolic_4x4 #(parameter DW=8, AW=32, N=4)(
  input  wire              clk, rst_n,
  input  wire              load_weight,
  input  wire signed [DW-1:0] weights [0:N-1][0:N-1], // weight matrix
  input  wire signed [DW-1:0] act_in  [0:N-1],        // skewed activations
  output wire signed [AW-1:0] psum_out[0:N-1]         // output columns
);

  // Internal wires: act[row][col] needs N+1 columns, psum[row][col] needs N+1 rows
  wire signed [DW-1:0] act  [0:N-1][0:N];
  wire signed [AW-1:0] psum [0:N][0:N-1];

  genvar i, j;
  generate
    for (i = 0; i < N; i++) begin : row
      assign act[i][0] = act_in[i];      // connect inputs to left edge
      assign psum[0][i] = 0;             // top row gets zero psum (array is square, so i also indexes columns here)

      for (j = 0; j < N; j++) begin : col
        pe #(.DW(DW),.AW(AW)) u_pe (
          .clk        (clk),
          .rst_n      (rst_n),
          .load_weight(load_weight),
          .weight_in  (weights[i][j]),
          .act_in     (act[i][j]),
          .psum_in    (psum[i][j]),
          .act_out    (act[i][j+1]),     // pass right
          .psum_out   (psum[i+1][j])     // pass down
        );
      end
    end
  endgenerate

  // Bottom row outputs = complete dot products
  genvar k;
  generate
    for (k = 0; k < N; k++) assign psum_out[k] = psum[N][k];
  endgenerate

endmodule
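A simple smoke test for the array holds every activation constant, so skew timing does not matter, and checks the steady-state outputs. This is a sketch, not a full verification plan: the testbench name systolic_4x4_tb and the test pattern are assumptions made for illustration. With act_in[r] = r+1 and weights[r][c] = c+1, column c should settle to (1+2+3+4) × (c+1), that is 10, 20, 30, 40.

```verilog
`timescale 1ns/1ps
// systolic_4x4_tb.sv: hypothetical smoke test for systolic_4x4 (SystemVerilog).
module systolic_4x4_tb;
  localparam DW = 8, AW = 32, N = 4;

  logic                 clk = 0, rst_n = 0, load_weight = 0;
  logic signed [DW-1:0] weights  [0:N-1][0:N-1];
  logic signed [DW-1:0] act_in   [0:N-1];
  wire  signed [AW-1:0] psum_out [0:N-1];

  systolic_4x4 #(.DW(DW), .AW(AW), .N(N)) dut (
    .clk(clk), .rst_n(rst_n), .load_weight(load_weight),
    .weights(weights), .act_in(act_in), .psum_out(psum_out)
  );

  always #5 clk = ~clk;

  integer r, c, errors;
  initial begin
    errors = 0;
    // Constant test pattern: act_in[r] = r+1, weights[r][c] = c+1
    for (r = 0; r < N; r = r + 1) begin
      act_in[r] = r + 1;
      for (c = 0; c < N; c = c + 1) weights[r][c] = c + 1;
    end
    #12 rst_n = 1;                    // release reset between clock edges
    @(negedge clk) load_weight = 1;   // every PE captures its weight
    @(negedge clk) load_weight = 0;
    repeat (2*N) @(negedge clk);      // let all activation and psum registers settle
    for (c = 0; c < N; c = c + 1)
      if (psum_out[c] !== 10*(c+1)) begin
        errors = errors + 1;
        $display("FAIL: column %0d = %0d, expected %0d",
                 c, psum_out[c], 10*(c+1));
      end
    if (errors == 0) $display("PASS: all %0d columns match", N);
    $finish;
  end
endmodule
```

Testing a real streamed matrix multiply additionally requires skewed inputs and a check that accounts for the cycle in which each column's result appears.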

8. Systolic vs GEMM Engine Comparison

Property	Day 3 Parallel GEMM	Day 4 Systolic Array
Memory access	Shared BRAM per cycle	PE-to-PE only (no shared bus)
Weight storage	BRAM (reloaded per tile)	Registers inside each PE
Scalability	Limited by BRAM bandwidth	Scales to 256×256+ easily
Latency (K=256)	256 cycles	256 + N - 1 cycles
Throughput	N² MACs/cycle	N² MACs/cycle (same!)
Control complexity	Simple (start/done)	Moderate (skewing logic)
Best for	Small arrays, easy debug	Large arrays, production
Used in	Research prototypes	Google TPU and similar matrix engines

Day 4 — Key Takeaways

✅ Systolic arrays eliminate shared bus — PEs only talk to neighbors
✅ Weight-stationary — each PE holds one weight permanently, activations stream through
✅ PE operation: psum_out = psum_in + (act_in × weight), pass act_in right
✅ Data skewing delays row i by i cycles — ensures correct alignment across rows
✅ Skewed feed takes K + N - 1 cycles for an NxN array with stream length K
✅ Google TPU v1 used a 256×256 systolic array = 65,536 MACs per cycle at 700 MHz; later generations use 128×128 arrays
✅ Systolic scales better than parallel MAC because I/O bandwidth grows with the array perimeter (N) while compute grows with its area (N²)
✅ generate/genvar in Verilog makes NxN array definition clean and parameterizable

Next — Day 5: Convolution Engine — building a hardware 2D convolution unit with line buffers, sliding window logic, kernel weight storage, and im2col transformation.
