AI Chip Design Day 9

Three Core Patterns

1. Systolic Pattern (Data Streams)

Used by: Google TPU, Apple Neural Engine

A and B flow through the array; partial sums accumulate and exit.

Input streams:
    A: [a1, a2, a3, ...] →
    B: [b1, b2, b3, ...] ↓
Output: C values flow out sequentially
Latency: O(N) cycles for an NxN matrix
Throughput: One NxN multiply per N cycles (ideal pipelining)
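The streaming behaviour above can be sketched as a small cycle-level model. This is a minimal illustration, not any vendor's design: it assumes an output-stationary variant in which each processing element keeps its own accumulator and the C values are read out after the last cycle, and the names `systolic_matmul`, `a_reg`, `b_reg`, and `acc` are made up for this example. Row i of A is delayed by i cycles and column j of B by j cycles so that matching operands meet in the right cell.

```python
def systolic_matmul(A, B):
    """Cycle-level model of an N x N systolic array (output-stationary sketch).

    A values enter from the left edge and move right one cell per cycle.
    B values enter from the top edge and move down one cell per cycle.
    Returns the result matrix and the number of cycles simulated.
    """
    n = len(A)
    acc = [[0] * n for _ in range(n)]    # partial sum held in each cell
    a_reg = [[0] * n for _ in range(n)]  # A operand currently in each cell
    b_reg = [[0] * n for _ in range(n)]  # B operand currently in each cell

    def feed_a(i, t):
        # Row i of A is skewed by i cycles; zeros pad the stream.
        k = t - i
        return A[i][k] if 0 <= k < n else 0

    def feed_b(j, t):
        # Column j of B is skewed by j cycles; zeros pad the stream.
        k = t - j
        return B[k][j] if 0 <= k < n else 0

    cycles = 3 * n - 2  # the bottom-right cell sees its last operands at t = 3n - 3
    for t in range(cycles):
        # Shift phase: walk from bottom-right so each operand moves one hop only.
        for i in reversed(range(n)):
            for j in reversed(range(n)):
                a_reg[i][j] = a_reg[i][j - 1] if j > 0 else feed_a(i, t)
                b_reg[i][j] = b_reg[i - 1][j] if i > 0 else feed_b(j, t)
        # Compute phase: every cell performs one multiply-accumulate.
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


C, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C, cycles)
```

In this model a single multiply takes 3N - 2 cycles, which is the O(N) latency quoted above. The one-multiply-per-N-cycles throughput only appears when consecutive matrices are streamed back to back, which this sketch does not model.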

Pros: High data reuse, low memory bandwidth, simple control
Cons: Requires streaming data; poor for irregular patterns

2. Reduction Pattern (Tree Aggregation)

Used by: Systolic arrays with reduction trees, some research designs

Partial results combine hierarchically (like MapReduce).

Layer 0:  [R0]  [R1]  [R2]  [R3]    (4 reductions in parallel)
             \  /        \  /
Layer 1:     [R]         [R]        (2 reductions)
                \       /
Layer 2:          [R]               (1 final result)
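The layered combining shown above can be sketched in software as a pairwise tree reduction. This is a minimal sketch, assuming addition as the default combining operation; `tree_reduce` is a hypothetical helper name, and an odd element at any layer is simply passed through to the next one.

```python
def tree_reduce(values, op=lambda x, y: x + y):
    """Combine values pairwise, layer by layer, as a reduction tree would.

    Returns the final result and the number of layers used.
    """
    level = list(values)
    if not level:
        raise ValueError("tree_reduce needs at least one value")
    layers = 0
    while len(level) > 1:
        # All pairs in one layer are independent, so hardware can do them in parallel.
        nxt = [op(level[k], level[k + 1]) for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element passes through unchanged
        level = nxt
        layers += 1
    return level[0], layers


total, layers = tree_reduce([1, 2, 3, 4, 5, 6, 7, 8])
print(total, layers)  # 8 inputs need 4 + 2 + 1 reductions across 3 layers
```

Eight inputs give the 4, 2, 1 layer structure of the diagram, and the layer count grows with the logarithm of the input count.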

Pros: Suits tree-like computations (conv nets); reductions within a layer run in parallel
Cons: More complex routing, higher latency for single result

3. Stationary Pattern (MACs Stay Put)

Used by: Some FPGA designs, mobile NPUs with limited memory

A and B are pre-loaded into registers around each MAC. Data stays local.

MAC[0][0]:
    A[0][0], A[0][1], ... pre-stored locally
    B[0][0], B[1][0], ... pre-stored locally
    Compute C[0][0] using local data
No streaming - all data in registers from start
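The pre-loaded arrangement above can be modelled with one object per MAC unit. This is a sketch under stated assumptions: `StationaryMAC` and `stationary_matmul` are hypothetical names, each unit holds a full row of A and a full column of B in its local registers, and the register count it reports shows why register file size becomes the limit.

```python
class StationaryMAC:
    """One MAC unit with its operands pre-loaded into local registers."""

    def __init__(self, a_row, b_col):
        self.a = list(a_row)  # row of A, stored locally before compute starts
        self.b = list(b_col)  # column of B, stored locally before compute starts

    def compute(self):
        # Uses only local data; nothing is exchanged with neighbouring units.
        acc = 0
        for x, y in zip(self.a, self.b):
            acc += x * y
        return acc


def stationary_matmul(A, B):
    """Build one MAC per output element, pre-load it, then compute C."""
    n, k, m = len(A), len(B), len(B[0])
    macs = [
        [StationaryMAC(A[i], [B[r][j] for r in range(k)]) for j in range(m)]
        for i in range(n)
    ]
    registers_per_mac = 2 * k  # k values from A plus k values from B
    C = [[mac.compute() for mac in row] for row in macs]
    return C, registers_per_mac


C, regs = stationary_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C, regs)
```

Local storage per unit grows linearly with the inner dimension in this model, so large matrices would have to be tiled ahead of time to fit.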

Pros: No data movement, maximum locality
Cons: Limited by register file size, requires offline scheduling

Comparison Table

Pattern	Data Movement	Throughput	Latency	Complexity	Best For
Systolic	Streams	N² MACs/cycle (NxN array)	O(N)	Simple	Matrix ops
Reduction	Tree	1 result/cycle (pipelined)	O(log N)	Medium	Aggregations
Stationary	None	M² MACs/cycle (MxM MACs)	O(1)	Complex	Pre-scheduled
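To make the O(N) and O(log N) latency entries concrete, the sketch below prints rough cycle counts for a few sizes. Both formulas are assumptions made for illustration, not figures from any specific chip: 3N - 2 cycles for one multiply on an NxN array whose inputs are skewed by row and column index, and ceil(log2 N) layers for a binary tree over N inputs. The two numbers measure different jobs (a full matrix multiply versus a single N-input reduction), so only their growth rates are comparable.

```python
def systolic_latency_cycles(n):
    """Cycles for one N x N multiply, assuming skewed inputs (illustrative model)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 3 * n - 2


def tree_depth(n):
    """Layers in a binary reduction tree over n inputs, i.e. ceil(log2(n))."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1).bit_length()


for n in (4, 16, 64, 256):
    print(n, systolic_latency_cycles(n), tree_depth(n))
```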

Choosing a Pattern

Streaming data? → Systolic (TPU choice)
Variable-size problems? → Reduction (tree parallelism)
Fixed known kernels? → Stationary (FPGA choice)
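
The three questions above can be written as a small lookup. This is only a sketch: `choose_pattern` is a hypothetical helper, and checking the questions in the order listed is an assumption, since the notes do not say which one wins when several apply.

```python
def choose_pattern(streaming_data=False, variable_size=False, fixed_kernels=False):
    """Map the checklist questions to a pattern name.

    Questions are checked in the order they are listed (an assumed priority).
    Returns None when none of them applies.
    """
    if streaming_data:
        return "systolic"
    if variable_size:
        return "reduction"
    if fixed_kernels:
        return "stationary"
    return None


print(choose_pattern(streaming_data=True))
print(choose_pattern(fixed_kernels=True))
```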