The Scaling Problem
A single MAC unit does one multiply-accumulate per clock cycle. For a 256×256 matrix multiply:
- Cycles needed: 256 × 256 × 256 = 16.7 million
- At 1 GHz: 16.7 milliseconds
Way too slow for real-time AI. You need parallelism.
The Tiling Approach
Instead of multiplying full 256×256 matrices, tile them into smaller blocks and multiply in parallel:
Original problem: C[256,256] = A[256,256] @ B[256,256]
Tiled (16 tiles):
C[256,256] = A[256,16] @ B[16,256] + A[256,32] @ B[32,256] + ...
└─ 16 small ops ─┘ └─ 16 small ops ─┘
Each tile: C[16,16] = A[16,16] @ B[16,16]
Cost: 16×16×16 = 4,096 cycles per tile
With 16 tiles in parallel: 4,096 cycles total
Speedup: 4,096× (instead of 16.7M)
The Systolic Array Architecture
Systolic = "systole" (heartbeat). Data pulses through the array like blood through a heart.
2×2 Systolic Array (Simple)
A inputs → A inputs →
↓ ↓
MAC[0,0] → MAC[0,1]
↓ ↓
MAC[1,0] → MAC[1,1]
↓ ↓
C outputs C outputs
Data flow on each cycle:
- New A values enter from left
- B values flow downward
- Partial sums propagate right
- One new multiply per MAC per cycle!
Real TPU-Style: 8×8 Systolic Array
Google TPU v2 uses a 256×256 systolic array. Each MAC:
- Multiplies: A[i,k] × B[k,j]
- Accumulates: C[i,j] += result
- Passes A rightward, B downward
- All happen simultaneously in different rows/cols
Why Systolic Works
- ✅ No global communication: Each MAC only talks to neighbors
- ✅ High throughput: All MACs active every cycle
- ✅ Regular pattern: Easy to synthesize, tile, scale
- ✅ Data reuse: A and B values propagate through all MACs (reduces memory bandwidth)
The Key Insight: Data Reuse
For C[i,j] = Σ(A[i,k] × B[k,j]) where k=1..256:
In a 256×256 systolic array:
- Each A[i,k] value is used by 256 MACs (row i)
- Each B[k,j] value is used by 256 MACs (column j)
- One A/B value loaded from memory = 256 multiplies!
Memory bandwidth = problem solved. Traditional CPU/GPU: - Load A[i,k], do 1 multiply, move on - Systolic: Load A[i,k], use it 256 times
Memory bandwidth = problem solved. Traditional CPU/GPU: - Load A[i,k], do 1 multiply, move on - Systolic: Load A[i,k], use it 256 times
Implementation in Verilog
// Simple 2×2 systolic element
module systolic_pe (
input clk,
input [7:0] a_in, b_in,
output reg [7:0] a_out, b_out,
output reg [15:0] c_out,
input [15:0] c_partial_in
);
reg [15:0] accumulator;
always @(posedge clk) begin
// Multiply-accumulate
accumulator <= (a_in * b_in) + c_partial_in;
// Pass values to neighbors
a_out <= a_in; // Forward A rightward
b_out <= b_in; // Forward B downward
c_out <= accumulator; // Output C result
end
endmodule
// 2×2 Array
wire [7:0] a[0:1], b[0:1];
wire [15:0] c[0:1][0:1];
systolic_pe pe00 (clk, a[0], b[0], a_out[0], b_out[0], c[0][0], 0);
systolic_pe pe01 (clk, a_out[0], b[1], a_out[1], b_out[1], c[0][1], 0);
systolic_pe pe10 (clk, a[1], b_out[0], a_out[1], b_out[2], c[1][0], 0);
systolic_pe pe11 (clk, a_out[1], b_out[1], a_out[2], b_out[2], c[1][1], 0);Real-World Scaling
Google TPU v4:
- 256×256 systolic array
- 65,536 MAC units
- One 256×256 matrix multiply per ~256 cycles
- 430 TFLOPS theoretical peak (mostly achieved in practice)
Limitations
- ❌ All MACs must stay busy (stalls if data unavailable)
- ❌ Can't do variable-length operations efficiently
- ❌ Optimized for square matrices (256×256, not 256×1000)
Day 8: Deep dive into a 4×4 systolic array—detailed cycle-by-cycle execution.