AI Chip Design Day 7 — From Single MAC to Systolic Arrays

The Scaling Problem

A single MAC unit does one multiply-accumulate per clock cycle. For a 256×256 matrix multiply:

Cycles needed: 256 × 256 × 256 = 16.7 million
At 1 GHz: 16.7 milliseconds

Way too slow for real-time AI. You need parallelism.

The Tiling Approach

Instead of multiplying full 256×256 matrices, tile them into smaller blocks and multiply in parallel:

Original problem: C[256,256] = A[256,256] @ B[256,256] Tiled (16 tiles): C[256,256] = A[256,16] @ B[16,256] + A[256,32] @ B[32,256] + ... └─ 16 small ops ─┘ └─ 16 small ops ─┘ Each tile: C[16,16] = A[16,16] @ B[16,16] Cost: 16×16×16 = 4,096 cycles per tile With 16 tiles in parallel: 4,096 cycles total Speedup: 4,096× (instead of 16.7M)

The Systolic Array Architecture

Systolic = "systole" (heartbeat). Data pulses through the array like blood through a heart.

2×2 Systolic Array (Simple)

A inputs → A inputs → ↓ ↓ MAC[0,0] → MAC[0,1] ↓ ↓ MAC[1,0] → MAC[1,1] ↓ ↓ C outputs C outputs Data flow on each cycle: - New A values enter from left - B values flow downward - Partial sums propagate right - One new multiply per MAC per cycle!

Real TPU-Style: 8×8 Systolic Array

Google TPU v2 uses a 256×256 systolic array. Each MAC:

Multiplies: A[i,k] × B[k,j]
Accumulates: C[i,j] += result
Passes A rightward, B downward
All happen simultaneously in different rows/cols

Why Systolic Works

✅ No global communication: Each MAC only talks to neighbors
✅ High throughput: All MACs active every cycle
✅ Regular pattern: Easy to synthesize, tile, scale
✅ Data reuse: A and B values propagate through all MACs (reduces memory bandwidth)

The Key Insight: Data Reuse

For C[i,j] = Σ(A[i,k] × B[k,j]) where k=1..256:

In a 256×256 systolic array: - Each A[i,k] value is used by 256 MACs (row i) - Each B[k,j] value is used by 256 MACs (column j) - One A/B value loaded from memory = 256 multiplies!
Memory bandwidth = problem solved. Traditional CPU/GPU: - Load A[i,k], do 1 multiply, move on - Systolic: Load A[i,k], use it 256 times

Implementation in Verilog

// Simple 2×2 systolic element
module systolic_pe (
  input clk,
  input [7:0] a_in, b_in,
  output reg [7:0] a_out, b_out,
  output reg [15:0] c_out,
  input [15:0] c_partial_in
);
  reg [15:0] accumulator;

  always @(posedge clk) begin
    // Multiply-accumulate
    accumulator <= (a_in * b_in) + c_partial_in;

    // Pass values to neighbors
    a_out <= a_in;      // Forward A rightward
    b_out <= b_in;      // Forward B downward
    c_out <= accumulator; // Output C result
  end
endmodule

// 2×2 Array
wire [7:0] a[0:1], b[0:1];
wire [15:0] c[0:1][0:1];

systolic_pe pe00 (clk, a[0], b[0], a_out[0], b_out[0], c[0][0], 0);
systolic_pe pe01 (clk, a_out[0], b[1], a_out[1], b_out[1], c[0][1], 0);
systolic_pe pe10 (clk, a[1], b_out[0], a_out[1], b_out[2], c[1][0], 0);
systolic_pe pe11 (clk, a_out[1], b_out[1], a_out[2], b_out[2], c[1][1], 0);

Real-World Scaling

Google TPU v4:

256×256 systolic array
65,536 MAC units
One 256×256 matrix multiply per ~256 cycles
430 TFLOPS theoretical peak (mostly achieved in practice)

Limitations

❌ All MACs must stay busy (stalls if data unavailable)
❌ Can't do variable-length operations efficiently
❌ Optimized for square matrices (256×256, not 256×1000)

Day 8: Deep dive into a 4×4 systolic array—detailed cycle-by-cycle execution.