The Problem: Why Single MACs Aren't Enough
Imagine a CPU with one multiply-accumulate (MAC) unit computing a matrix multiplication. Let's calculate the cost:
| Matrix Size | Total FLOPs | Cycles @ 1 GHz | Actual Time | Feasible? |
|---|---|---|---|---|
| 16×16 | 4,096 | 4,096 | 4.1 μs | ✅ Yes |
| 64×64 | 262,144 | 262 K | 262 μs | ✅ Okay |
| 256×256 | 16.7 million | 16.7 M | 16.7 ms | ❌ Too slow |
| 1024×1024 | 2.1 billion | 2.1 B | 2.1 seconds | ❌ Unusable |
The insight: Neural networks do massive matrix multiplies constantly (every layer!). A single MAC is fundamentally too slow. You need thousands of MACs working in parallel—but wiring them efficiently is the challenge.
The Naive Approach: Why It Fails
First instinct: "Let's add 65,536 MACs all connected to shared memory!"
This is why: Most chips tried this approach and hit a wall around 1000-2000 MACs. The wiring becomes infeasible.
The Breakthrough: Systolic Arrays
Systolic: from Greek "systole" (heartbeat). Instead of random connectivity, data flows in synchronized waves through a regular grid. Like how blood pulses through arteries in rhythm.
How Systolic Works: The Data Flow
Principle: Stream Data Through, Not Broadcast
Key idea: Instead of loading data once and keeping it stationary, flow data continuously through the array like a pipeline.
2×2 Example: Computing C[0,0]
Let's multiply two 2×2 matrices step by step:
| Input A | Input B | Output C (desired) | |||
|---|---|---|---|---|---|
| 0.2 | 0.8 | 1.5 | 2.1 | 1.64 | 1.92 |
| 0.5 | 0.1 | 0.7 | 3.2 | 0.82 | 1.17 |
Cycle-by-cycle execution in 2×2 systolic array:
| Cycle | Data In | MAC[0,0] Op | MAC[0,1] Op | MAC[1,0] Op | MAC[1,1] Op | Status |
|---|---|---|---|---|---|---|
| 0 | A[0,0]=0.2, B[0,0]=1.5 | 0 + 0.2×1.5 = 0.3 | — | — | — | Loading |
| 1 | A[0,1]=0.8, B[0,1]=2.1, A[1,0]=0.5, B[1,0]=0.7 | 0.3 + 0.8×0.7 = 0.86 | 0 + 0.8×2.1 = 1.68 | 0 + 0.5×1.5 = 0.75 | — | Computing |
| 2 | A[1,1]=0.1, B[1,1]=3.2 | 0.86 (C[0,0]!) | 1.68 + 0.2×2.1 = 2.10 | 0.75 + 0.5×0.7 = 1.10 | 0 + 0.1×2.1 = 0.21 | Computing |
| 3 | Drain | — | 2.10 (C[0,1]) | 1.10 + 0.1×0.7 = 1.17 | 0.21 + 0.1×3.2 = 0.53 | Draining |
| 4 | Done | — | — | 1.17 (C[1,0]) | 0.53 (C[1,1]) | Complete |
Notice: We computed a 2×2 multiply in 4 cycles (load + compute + drain). A sequential CPU would need 8 multiplies = 8 cycles. Systolic got lucky here, but for 256×256, the advantage is massive.
Scaling to 256×256: The Math
| Dimension | Sequential Cycles | Systolic Cycles | Speedup |
|---|---|---|---|
| 2×2 | 8 | 5 | 1.6× |
| 4×4 | 64 | 11 | 5.8× |
| 16×16 | 4,096 | 47 | 87× |
| 64×64 | 262,144 | 191 | 1,371× |
| 256×256 | 16.7M | 767 | 21,700× |
The pattern: Systolic time grows as O(3N), sequential grows as O(N³). At scale, systolic is exponentially faster!
The Secret: Data Reuse (Why This Matters)
Here's the real magic of systolic arrays: memory bandwidth reduction through data reuse.
256×256 Systolic Array Architecture
Hardware: SystemVerilog Implementation
// Systolic Processing Element
module systolic_pe #(parameter W=8, ACC=32) (
input clk, rst,
input signed [W-1:0] a_in, // From left (row broadcast)
input signed [W-1:0] b_in, // From top (column broadcast)
output reg signed [W-1:0] a_out, // To right
output reg signed [W-1:0] b_out, // To bottom
input signed [ACC-1:0] c_partial_in, // From left (accumulator)
output reg signed [ACC-1:0] c_out // To bottom
);
wire signed [2*W-1:0] mult_result;
assign mult_result = a_in * b_in; // 8×8 → 16 bits
always @(posedge clk) begin
if (rst) begin
a_out <= 0;
b_out <= 0;
c_out <= 0;
end else begin
// Pipeline stage: Forward inputs
a_out <= a_in; // Shift A rightward (row)
b_out <= b_in; // Shift B downward (column)
// Multiply-accumulate
c_out <= c_partial_in + mult_result; // C += A×B
end
end
endmodule
// 256×256 Systolic Array instantiation
genvar i, j;
generate
for (i = 0; i < 256; i = i + 1) begin : rows
for (j = 0; j < 256; j = j + 1) begin : cols
systolic_pe #(.W(8), .ACC(32)) pe (
.clk(clk), .rst(rst),
.a_in(a_route[i][j]),
.b_in(b_route[i][j]),
.a_out(a_route[i][j+1]), // Shift right
.b_out(b_route[i+1][j]), // Shift down
.c_partial_in(c_route[i][j]),
.c_out(c_route[i][j+1])
);
end
end
endgenerate
Why GPUs Lost to Systolic (For This Task)
NVIDIA H100 Tensor Core
Strategy: General-purpose with cache hierarchy
- Memory: 80GB HBM @ 3 TB/s (1.5× faster than TPU)
- Peak: 1.45 PFLOPS (3× more than TPU)
- BUT: Complex caching, branch prediction, control logic
- Actual matrix multiply: ~70% peak (1 PFLOPS)
- Power: 700W
Google TPU v4 Systolic
Strategy: Specialized for matrix ops only
- Memory: 8GB HBM @ 2 TB/s
- Peak: 430 TFLOPS (smaller headline number)
- BUT: Optimized dataflow, no caching overhead
- Actual matrix multiply: ~97% peak (430 TFLOPS!)
- Power: 200W
Winner: For pure matrix math, TPU beats H100 on efficiency (TFLOPS/Watt). But H100 wins on flexibility (runs any code, any workload).
Limitations of Systolic (Why Not Use It Everywhere?)
1. Requires Regular Data Access Patterns: Can't handle sparse matrices efficiently (zeros still occupy wires).
2. All MACs Must Stay Busy: If you have a 100×100 matrix and 256×256 array, 156 rows are wasted. Stalls = idle hardware.
3. Non-Matrix Ops Are Slow: Activation functions (ReLU, sigmoid, softmax) run on separate control logic, not systolic.
4. Fixed Array Size: A 256×256 systolic optimizes for 256×256 matrices, not 512×512 or 128×128 (though tiling helps).
Takeaway: Why Systolic Rules AI
- ✅ 256× less memory bandwidth via data reuse
- ✅ 97% hardware utilization (vs 20% on CPUs for this task)
- ✅ 21,700× speedup over sequential for 256×256
- ✅ 5× better FLOPS/Watt than GPU
- ✅ O(N) wiring (scales to thousands of MACs)
Next (Day 8): Detailed cycle-by-cycle execution of a real 4×4 systolic array, with timing diagrams and SystemVerilog testbenches.