HomeDay 7

From Single MAC to Systolic Arrays

Scale one multiply-accumulate unit to thousands. The architecture that powers Google TPU and Apple Neural Engine.

The Scaling Problem

A single MAC unit does one multiply-accumulate per clock cycle. For a 256×256 matrix multiply:

Way too slow for real-time AI. You need parallelism.

The Tiling Approach

Instead of multiplying full 256×256 matrices, tile them into smaller blocks and multiply in parallel:

Original problem: C[256,256] = A[256,256] @ B[256,256] Tiled (16 tiles): C[256,256] = A[256,16] @ B[16,256] + A[256,32] @ B[32,256] + ... └─ 16 small ops ─┘ └─ 16 small ops ─┘ Each tile: C[16,16] = A[16,16] @ B[16,16] Cost: 16×16×16 = 4,096 cycles per tile With 16 tiles in parallel: 4,096 cycles total Speedup: 4,096× (instead of 16.7M)

The Systolic Array Architecture

Systolic = "systole" (heartbeat). Data pulses through the array like blood through a heart.

2×2 Systolic Array (Simple)

A inputs → A inputs → ↓ ↓ MAC[0,0] → MAC[0,1] ↓ ↓ MAC[1,0] → MAC[1,1] ↓ ↓ C outputs C outputs Data flow on each cycle: - New A values enter from left - B values flow downward - Partial sums propagate right - One new multiply per MAC per cycle!

Real TPU-Style: 8×8 Systolic Array

Google TPU v2 uses a 256×256 systolic array. Each MAC:

Why Systolic Works

The Key Insight: Data Reuse

For C[i,j] = Σ(A[i,k] × B[k,j]) where k=1..256:

In a 256×256 systolic array: - Each A[i,k] value is used by 256 MACs (row i) - Each B[k,j] value is used by 256 MACs (column j) - One A/B value loaded from memory = 256 multiplies!
Memory bandwidth = problem solved. Traditional CPU/GPU: - Load A[i,k], do 1 multiply, move on - Systolic: Load A[i,k], use it 256 times

Implementation in Verilog

// Simple 2×2 systolic element module systolic_pe ( input clk, input [7:0] a_in, b_in, output reg [7:0] a_out, b_out, output reg [15:0] c_out, input [15:0] c_partial_in ); reg [15:0] accumulator; always @(posedge clk) begin // Multiply-accumulate accumulator <= (a_in * b_in) + c_partial_in; // Pass values to neighbors a_out <= a_in; // Forward A rightward b_out <= b_in; // Forward B downward c_out <= accumulator; // Output C result end endmodule // 2×2 Array wire [7:0] a[0:1], b[0:1]; wire [15:0] c[0:1][0:1]; systolic_pe pe00 (clk, a[0], b[0], a_out[0], b_out[0], c[0][0], 0); systolic_pe pe01 (clk, a_out[0], b[1], a_out[1], b_out[1], c[0][1], 0); systolic_pe pe10 (clk, a[1], b_out[0], a_out[1], b_out[2], c[1][0], 0); systolic_pe pe11 (clk, a_out[1], b_out[1], a_out[2], b_out[2], c[1][1], 0);

Real-World Scaling

Google TPU v4:

  • 256×256 systolic array
  • 65,536 MAC units
  • One 256×256 matrix multiply per ~256 cycles
  • 430 TFLOPS theoretical peak (mostly achieved in practice)

Limitations

  • ❌ All MACs must stay busy (stalls if data unavailable)
  • ❌ Can't do variable-length operations efficiently
  • ❌ Optimized for square matrices (256×256, not 256×1000)

Day 8: Deep dive into a 4×4 systolic array—detailed cycle-by-cycle execution.