AI Chip Design Day 7

The Problem: Why Single MACs Aren't Enough

Imagine a CPU with one multiply-accumulate (MAC) unit computing a matrix multiplication. Let's calculate the cost:

Matrix Size	Total FLOPs	Cycles @ 1 GHz	Actual Time	Feasible?
16×16	4,096	4,096	4.1 μs	✅ Yes
64×64	262,144	262 K	262 μs	✅ Okay
256×256	16.7 million	16.7 M	16.7 ms	❌ Too slow
1024×1024	2.1 billion	2.1 B	2.1 seconds	❌ Unusable

The insight: Neural networks do massive matrix multiplies constantly (every layer!). A single MAC is fundamentally too slow. You need thousands of MACs working in parallel—but wiring them efficiently is the challenge.

The Naive Approach: Why It Fails

First instinct: "Let's add 65,536 MACs all connected to shared memory!"

NAIVE DESIGN: ┌─────────────────────────────┐ │ Shared Memory (4GB) │ └──────────┬──────────────────┘ │ BUS (needs 65K connections!) ┌──────────────┼──────────────────┐ │ │ │ ... [MAC] [MAC] [MAC] [MAC] ... [MAC] │ │ │ └──────────────┼──────────────────┘ PROBLEMS: 1. Wiring nightmare: 65K × 32-bit connections = 2M wires 2. Bus is the bottleneck: Can't feed all MACs simultaneously 3. Power disaster: Every wire consumes power 4. Latency: Signal propagates slowly across entire chip 5. Routing: Becomes impossible at scale (physical design fails)

This is why: Most chips tried this approach and hit a wall around 1000-2000 MACs. The wiring becomes infeasible.

The Breakthrough: Systolic Arrays

Systolic: from Greek "systole" (heartbeat). Instead of random connectivity, data flows in synchronized waves through a regular grid. Like how blood pulses through arteries in rhythm.

SYSTOLIC DESIGN: Input A → ┌─────────────────────┐ │ MAC MAC MAC MAC │ → Output B │ ↓ ↓ ↓ ↓ │ Input B → │ MAC MAC MAC MAC │ │ ↓ ↓ ↓ ↓ │ │ MAC MAC MAC MAC │ │ ↓ ↓ ↓ ↓ │ │ MAC MAC MAC MAC │ └─────────────────────┘ ↓ Output C BENEFITS: ✓ Wiring: Only nearest-neighbor connections (O(N) instead of O(N²)) ✓ Regular pattern: Easy to tile and scale ✓ Data reuse: Each value used 256 times before exiting ✓ Power: Predictable, local connections ✓ Latency: Constant (1-2 gate delays between MACs)

How Systolic Works: The Data Flow

Principle: Stream Data Through, Not Broadcast

Key idea: Instead of loading data once and keeping it stationary, flow data continuously through the array like a pipeline.

For matrix multiply C = A @ B: - Row i of A enters from the left and flows RIGHT - Column j of B enters from the top and flows DOWN - Each MAC[i,j] receives A[i,k] and B[k,j] - Computes partial sum: C[i,j] += A[i,k] × B[k,j] - Passes results to neighbors for next iteration

2×2 Example: Computing C[0,0]

Let's multiply two 2×2 matrices step by step:

Input A		Input B		Output C (desired)
0.2	0.8	1.5	2.1	1.64	1.92
0.5	0.1	0.7	3.2	0.82	1.17

Cycle-by-cycle execution in 2×2 systolic array:

Cycle	Data In	MAC[0,0] Op	MAC[0,1] Op	MAC[1,0] Op	MAC[1,1] Op	Status
0	A[0,0]=0.2, B[0,0]=1.5	0 + 0.2×1.5 = 0.3	—	—	—	Loading
1	A[0,1]=0.8, B[0,1]=2.1, A[1,0]=0.5, B[1,0]=0.7	0.3 + 0.8×0.7 = 0.86	0 + 0.8×2.1 = 1.68	0 + 0.5×1.5 = 0.75	—	Computing
2	A[1,1]=0.1, B[1,1]=3.2	0.86 (C[0,0]!)	1.68 + 0.2×2.1 = 2.10	0.75 + 0.5×0.7 = 1.10	0 + 0.1×2.1 = 0.21	Computing
3	Drain	—	2.10 (C[0,1])	1.10 + 0.1×0.7 = 1.17	0.21 + 0.1×3.2 = 0.53	Draining
4	Done	—	—	1.17 (C[1,0])	0.53 (C[1,1])	Complete

Notice: We computed a 2×2 multiply in 4 cycles (load + compute + drain). A sequential CPU would need 8 multiplies = 8 cycles. Systolic got lucky here, but for 256×256, the advantage is massive.

Scaling to 256×256: The Math

Dimension	Sequential Cycles	Systolic Cycles	Speedup
2×2	8	5	1.6×
4×4	64	11	5.8×
16×16	4,096	47	87×
64×64	262,144	191	1,371×
256×256	16.7M	767	21,700×

The pattern: Systolic time grows as O(3N), sequential grows as O(N³). At scale, systolic is exponentially faster!

The Secret: Data Reuse (Why This Matters)

Here's the real magic of systolic arrays: memory bandwidth reduction through data reuse.

Standard CPU approach: ``` for i=0 to 255: for j=0 to 255: C[i,j] = 0 for k=0 to 255: C[i,j] += A[i,k] * B[k,j] ← Load A[i,k] EVERY ITERATION ``` A[0,0] is loaded 256 times (for j=0..255). Wasteful! Systolic approach: A[0,0] is loaded ONCE from memory, then flows through row 0. All 256 MACs in row 0 use it WITHOUT reloading. Result: • CPU memory bandwidth: 256² × 2 = 131K loads • Systolic memory bandwidth: 256² × 2 / 256 = 512 loads • Reduction: 256× For a 256×256 multiply needing 2 TB/s bandwidth: • CPU needs: 2 TB/s × 256 = 512 TB/s (!!) • Systolic needs: 2 TB/s × 1 = 2 TB/s ✓ This is why TPU roofline model says 430 TFLOPS is achievable with 2 TB/s.

256×256 Systolic Array Architecture

Google TPU v4 systolic core: A[0] ──→ ┌──────────────────────────────────────┐ A[1] ──→ │ [MAC] [MAC] ... [MAC] [MAC] ← B_out │ ... │ ↓ ↓ ↓ ↓ │ A[255]─→ │ [MAC] [MAC] ... [MAC] [MAC] │ │ ↓ ↓ ↓ ↓ │ │ ... ... │ │ ↓ ↓ ↓ ↓ │ │ [MAC] [MAC] ... [MAC] [MAC] │ └──────────────────────────────────────┘ ↓ ↓ ↓ ↓ C_out[0] C_out[255] Key specs: - 256×256 grid = 65,536 MACs total - Each MAC: 8-bit input, 32-bit accumulator - Clock: 2 GHz - Peak: 65K × 2 × 2 = 260 TFLOPS (naive) - Actual: 430 TFLOPS (with INT8 + better utilization) Latency to first result: ~512 cycles (load A/B + pipeline) Throughput: One 256×256 multiply every ~768 cycles

Hardware: SystemVerilog Implementation

// Systolic Processing Element
module systolic_pe #(parameter W=8, ACC=32) (
  input clk, rst,
  input signed [W-1:0] a_in,           // From left (row broadcast)
  input signed [W-1:0] b_in,           // From top (column broadcast)

  output reg signed [W-1:0] a_out,     // To right
  output reg signed [W-1:0] b_out,     // To bottom

  input signed [ACC-1:0] c_partial_in, // From left (accumulator)
  output reg signed [ACC-1:0] c_out    // To bottom
);

  wire signed [2*W-1:0] mult_result;
  assign mult_result = a_in * b_in;    // 8×8 → 16 bits

  always @(posedge clk) begin
    if (rst) begin
      a_out <= 0;
      b_out <= 0;
      c_out <= 0;
    end else begin
      // Pipeline stage: Forward inputs
      a_out <= a_in;           // Shift A rightward (row)
      b_out <= b_in;           // Shift B downward (column)

      // Multiply-accumulate
      c_out <= c_partial_in + mult_result;  // C += A×B
    end
  end
endmodule

// 256×256 Systolic Array instantiation
genvar i, j;
generate
  for (i = 0; i < 256; i = i + 1) begin : rows
    for (j = 0; j < 256; j = j + 1) begin : cols
      systolic_pe #(.W(8), .ACC(32)) pe (
        .clk(clk), .rst(rst),
        .a_in(a_route[i][j]),
        .b_in(b_route[i][j]),
        .a_out(a_route[i][j+1]),   // Shift right
        .b_out(b_route[i+1][j]),   // Shift down
        .c_partial_in(c_route[i][j]),
        .c_out(c_route[i][j+1])
      );
    end
  end
endgenerate

Why GPUs Lost to Systolic (For This Task)

NVIDIA H100 Tensor Core

Strategy: General-purpose with cache hierarchy

Memory: 80GB HBM @ 3 TB/s (1.5× faster than TPU)
Peak: 1.45 PFLOPS (3× more than TPU)
BUT: Complex caching, branch prediction, control logic
Actual matrix multiply: ~70% peak (1 PFLOPS)
Power: 700W

Google TPU v4 Systolic

Strategy: Specialized for matrix ops only

Memory: 8GB HBM @ 2 TB/s
Peak: 430 TFLOPS (smaller headline number)
BUT: Optimized dataflow, no caching overhead
Actual matrix multiply: ~97% peak (430 TFLOPS!)
Power: 200W

Winner: For pure matrix math, TPU beats H100 on efficiency (TFLOPS/Watt). But H100 wins on flexibility (runs any code, any workload).

Limitations of Systolic (Why Not Use It Everywhere?)

1. Requires Regular Data Access Patterns: Can't handle sparse matrices efficiently (zeros still occupy wires).

2. All MACs Must Stay Busy: If you have a 100×100 matrix and 256×256 array, 156 rows are wasted. Stalls = idle hardware.

3. Non-Matrix Ops Are Slow: Activation functions (ReLU, sigmoid, softmax) run on separate control logic, not systolic.

4. Fixed Array Size: A 256×256 systolic optimizes for 256×256 matrices, not 512×512 or 128×128 (though tiling helps).

Takeaway: Why Systolic Rules AI

✅ 256× less memory bandwidth via data reuse
✅ 97% hardware utilization (vs 20% on CPUs for this task)
✅ 21,700× speedup over sequential for 256×256
✅ 5× better FLOPS/Watt than GPU
✅ O(N) wiring (scales to thousands of MACs)

Next (Day 8): Detailed cycle-by-cycle execution of a real 4×4 systolic array, with timing diagrams and SystemVerilog testbenches.

Systolic Arrays Deep Dive