HomeDay 7 (Enhanced)

Systolic Arrays Deep Dive

How to scale from one MAC unit to 256×256 arrays. The heartbeat of Google TPU and Apple Neural Engine. Complete with math, diagrams, and working hardware code.

The Problem: Why Single MACs Aren't Enough

Imagine a CPU with one multiply-accumulate (MAC) unit computing a matrix multiplication. Let's calculate the cost:

Matrix SizeTotal FLOPsCycles @ 1 GHzActual TimeFeasible?
16×164,0964,0964.1 μs✅ Yes
64×64262,144262 K262 μs✅ Okay
256×25616.7 million16.7 M16.7 ms❌ Too slow
1024×10242.1 billion2.1 B2.1 seconds❌ Unusable

The insight: Neural networks do massive matrix multiplies constantly (every layer!). A single MAC is fundamentally too slow. You need thousands of MACs working in parallel—but wiring them efficiently is the challenge.

The Naive Approach: Why It Fails

First instinct: "Let's add 65,536 MACs all connected to shared memory!"

NAIVE DESIGN: ┌─────────────────────────────┐ │ Shared Memory (4GB) │ └──────────┬──────────────────┘ │ BUS (needs 65K connections!) ┌──────────────┼──────────────────┐ │ │ │ ... [MAC] [MAC] [MAC] [MAC] ... [MAC] │ │ │ └──────────────┼──────────────────┘ PROBLEMS: 1. Wiring nightmare: 65K × 32-bit connections = 2M wires 2. Bus is the bottleneck: Can't feed all MACs simultaneously 3. Power disaster: Every wire consumes power 4. Latency: Signal propagates slowly across entire chip 5. Routing: Becomes impossible at scale (physical design fails)

This is why: Most chips tried this approach and hit a wall around 1000-2000 MACs. The wiring becomes infeasible.

The Breakthrough: Systolic Arrays

Systolic: from Greek "systole" (heartbeat). Instead of random connectivity, data flows in synchronized waves through a regular grid. Like how blood pulses through arteries in rhythm.

SYSTOLIC DESIGN: Input A → ┌─────────────────────┐ │ MAC MAC MAC MAC │ → Output B │ ↓ ↓ ↓ ↓ │ Input B → │ MAC MAC MAC MAC │ │ ↓ ↓ ↓ ↓ │ │ MAC MAC MAC MAC │ │ ↓ ↓ ↓ ↓ │ │ MAC MAC MAC MAC │ └─────────────────────┘ ↓ Output C BENEFITS: ✓ Wiring: Only nearest-neighbor connections (O(N) instead of O(N²)) ✓ Regular pattern: Easy to tile and scale ✓ Data reuse: Each value used 256 times before exiting ✓ Power: Predictable, local connections ✓ Latency: Constant (1-2 gate delays between MACs)

How Systolic Works: The Data Flow

Principle: Stream Data Through, Not Broadcast

Key idea: Instead of loading data once and keeping it stationary, flow data continuously through the array like a pipeline.

For matrix multiply C = A @ B: - Row i of A enters from the left and flows RIGHT - Column j of B enters from the top and flows DOWN - Each MAC[i,j] receives A[i,k] and B[k,j] - Computes partial sum: C[i,j] += A[i,k] × B[k,j] - Passes results to neighbors for next iteration

2×2 Example: Computing C[0,0]

Let's multiply two 2×2 matrices step by step:

Input AInput BOutput C (desired)
0.20.81.52.11.641.92
0.50.10.73.20.821.17

Cycle-by-cycle execution in 2×2 systolic array:

CycleData InMAC[0,0] OpMAC[0,1] OpMAC[1,0] OpMAC[1,1] OpStatus
0A[0,0]=0.2, B[0,0]=1.50 + 0.2×1.5 = 0.3Loading
1A[0,1]=0.8, B[0,1]=2.1, A[1,0]=0.5, B[1,0]=0.70.3 + 0.8×0.7 = 0.860 + 0.8×2.1 = 1.680 + 0.5×1.5 = 0.75Computing
2A[1,1]=0.1, B[1,1]=3.20.86 (C[0,0]!)1.68 + 0.2×2.1 = 2.100.75 + 0.5×0.7 = 1.100 + 0.1×2.1 = 0.21Computing
3Drain2.10 (C[0,1])1.10 + 0.1×0.7 = 1.170.21 + 0.1×3.2 = 0.53Draining
4Done1.17 (C[1,0])0.53 (C[1,1])Complete

Notice: We computed a 2×2 multiply in 4 cycles (load + compute + drain). A sequential CPU would need 8 multiplies = 8 cycles. Systolic got lucky here, but for 256×256, the advantage is massive.

Scaling to 256×256: The Math

DimensionSequential CyclesSystolic CyclesSpeedup
2×2851.6×
4×464115.8×
16×164,0964787×
64×64262,1441911,371×
256×25616.7M76721,700×

The pattern: Systolic time grows as O(3N), sequential grows as O(N³). At scale, systolic is exponentially faster!

The Secret: Data Reuse (Why This Matters)

Here's the real magic of systolic arrays: memory bandwidth reduction through data reuse.

Standard CPU approach: ``` for i=0 to 255: for j=0 to 255: C[i,j] = 0 for k=0 to 255: C[i,j] += A[i,k] * B[k,j] ← Load A[i,k] EVERY ITERATION ``` A[0,0] is loaded 256 times (for j=0..255). Wasteful! Systolic approach: A[0,0] is loaded ONCE from memory, then flows through row 0. All 256 MACs in row 0 use it WITHOUT reloading. Result: • CPU memory bandwidth: 256² × 2 = 131K loads • Systolic memory bandwidth: 256² × 2 / 256 = 512 loads • Reduction: 256× For a 256×256 multiply needing 2 TB/s bandwidth: • CPU needs: 2 TB/s × 256 = 512 TB/s (!!) • Systolic needs: 2 TB/s × 1 = 2 TB/s ✓ This is why TPU roofline model says 430 TFLOPS is achievable with 2 TB/s.

256×256 Systolic Array Architecture

Google TPU v4 systolic core: A[0] ──→ ┌──────────────────────────────────────┐ A[1] ──→ │ [MAC] [MAC] ... [MAC] [MAC] ← B_out │ ... │ ↓ ↓ ↓ ↓ │ A[255]─→ │ [MAC] [MAC] ... [MAC] [MAC] │ │ ↓ ↓ ↓ ↓ │ │ ... ... │ │ ↓ ↓ ↓ ↓ │ │ [MAC] [MAC] ... [MAC] [MAC] │ └──────────────────────────────────────┘ ↓ ↓ ↓ ↓ C_out[0] C_out[255] Key specs: - 256×256 grid = 65,536 MACs total - Each MAC: 8-bit input, 32-bit accumulator - Clock: 2 GHz - Peak: 65K × 2 × 2 = 260 TFLOPS (naive) - Actual: 430 TFLOPS (with INT8 + better utilization) Latency to first result: ~512 cycles (load A/B + pipeline) Throughput: One 256×256 multiply every ~768 cycles

Hardware: SystemVerilog Implementation

// Systolic Processing Element module systolic_pe #(parameter W=8, ACC=32) ( input clk, rst, input signed [W-1:0] a_in, // From left (row broadcast) input signed [W-1:0] b_in, // From top (column broadcast) output reg signed [W-1:0] a_out, // To right output reg signed [W-1:0] b_out, // To bottom input signed [ACC-1:0] c_partial_in, // From left (accumulator) output reg signed [ACC-1:0] c_out // To bottom ); wire signed [2*W-1:0] mult_result; assign mult_result = a_in * b_in; // 8×8 → 16 bits always @(posedge clk) begin if (rst) begin a_out <= 0; b_out <= 0; c_out <= 0; end else begin // Pipeline stage: Forward inputs a_out <= a_in; // Shift A rightward (row) b_out <= b_in; // Shift B downward (column) // Multiply-accumulate c_out <= c_partial_in + mult_result; // C += A×B end end endmodule // 256×256 Systolic Array instantiation genvar i, j; generate for (i = 0; i < 256; i = i + 1) begin : rows for (j = 0; j < 256; j = j + 1) begin : cols systolic_pe #(.W(8), .ACC(32)) pe ( .clk(clk), .rst(rst), .a_in(a_route[i][j]), .b_in(b_route[i][j]), .a_out(a_route[i][j+1]), // Shift right .b_out(b_route[i+1][j]), // Shift down .c_partial_in(c_route[i][j]), .c_out(c_route[i][j+1]) ); end end endgenerate

Why GPUs Lost to Systolic (For This Task)

NVIDIA H100 Tensor Core

Strategy: General-purpose with cache hierarchy

  • Memory: 80GB HBM @ 3 TB/s (1.5× faster than TPU)
  • Peak: 1.45 PFLOPS (3× more than TPU)
  • BUT: Complex caching, branch prediction, control logic
  • Actual matrix multiply: ~70% peak (1 PFLOPS)
  • Power: 700W

Google TPU v4 Systolic

Strategy: Specialized for matrix ops only

  • Memory: 8GB HBM @ 2 TB/s
  • Peak: 430 TFLOPS (smaller headline number)
  • BUT: Optimized dataflow, no caching overhead
  • Actual matrix multiply: ~97% peak (430 TFLOPS!)
  • Power: 200W

Winner: For pure matrix math, TPU beats H100 on efficiency (TFLOPS/Watt). But H100 wins on flexibility (runs any code, any workload).

Limitations of Systolic (Why Not Use It Everywhere?)

1. Requires Regular Data Access Patterns: Can't handle sparse matrices efficiently (zeros still occupy wires).

2. All MACs Must Stay Busy: If you have a 100×100 matrix and 256×256 array, 156 rows are wasted. Stalls = idle hardware.

3. Non-Matrix Ops Are Slow: Activation functions (ReLU, sigmoid, softmax) run on separate control logic, not systolic.

4. Fixed Array Size: A 256×256 systolic optimizes for 256×256 matrices, not 512×512 or 128×128 (though tiling helps).

Takeaway: Why Systolic Rules AI

  • 256× less memory bandwidth via data reuse
  • 97% hardware utilization (vs 20% on CPUs for this task)
  • 21,700× speedup over sequential for 256×256
  • 5× better FLOPS/Watt than GPU
  • O(N) wiring (scales to thousands of MACs)

Next (Day 8): Detailed cycle-by-cycle execution of a real 4×4 systolic array, with timing diagrams and SystemVerilog testbenches.