Day 9

Dataflow Patterns

Systolic, reduction, stationary, and other dataflow architectures for different AI workloads. Choose the right pattern for your problem.

Three Core Patterns

1. Systolic Pattern (Data Streams)

Used by: Google TPU, Apple Neural Engine

A and B flow through the array; partial sums accumulate and exit.

Input streams:
  A: [a1, a2, a3, ...] →
  B: [b1, b2, b3, ...] ↓
Output: C values flow out sequentially
Latency: O(N) cycles for NxN matrix
Throughput: One NxN multiply per N cycles (ideal pipelining)

Pros: High data reuse, low memory bandwidth, simple control
Cons: Requires streaming data; poor for irregular patterns
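
The streaming behaviour described above can be sketched as a small cycle-by-cycle simulation. This is a minimal illustration, not how any particular chip is built. It assumes an output-stationary variant in which A values move one processing element (PE) to the right each cycle, B values move one PE down, and each PE keeps its own partial sum until the results are read out at the end. The function name `systolic_matmul` and the register names are hypothetical.

```python
def systolic_matmul(A, B):
    """Simulate an N x N systolic array multiplying two N x N matrices.

    Assumption: output-stationary dataflow. Row i of A enters from the left
    delayed by i cycles, column j of B enters from the top delayed by j cycles.
    """
    n = len(A)
    a_reg = [[0] * n for _ in range(n)]  # A value held by each PE
    b_reg = [[0] * n for _ in range(n)]  # B value held by each PE
    acc = [[0] * n for _ in range(n)]    # partial sum held by each PE
    cycles = 3 * n - 2                   # last PE finishes at cycle 3N - 3

    for t in range(cycles):
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:
                    k = t - i  # skewed injection at the left edge
                    new_a[i][j] = A[i][k] if 0 <= k < n else 0
                else:
                    new_a[i][j] = a_reg[i][j - 1]  # shift right
                if i == 0:
                    k = t - j  # skewed injection at the top edge
                    new_b[i][j] = B[k][j] if 0 <= k < n else 0
                else:
                    new_b[i][j] = b_reg[i - 1][j]  # shift down
        a_reg, b_reg = new_a, new_b

        # Every PE does one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]

    return acc, cycles


C, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C, cycles)  # [[19, 22], [43, 50]] 4
```

Under these assumptions the cycle count grows linearly with N (3N - 2 here), which matches the O(N) latency stated above.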

2. Reduction Pattern (Tree Aggregation)

Used by: Systolic arrays with reduction trees, some research designs

Partial results combine hierarchically (like MapReduce).

Layer 0: [R0] [R1] [R2] [R3]   (4 reductions in parallel)
            \  /      \  /
Layer 1:    [R]       [R]      (2 reductions)
               \      /
Layer 2:         [R]           (1 final result)

Pros: Good for tree-like computations (conv nets), parallelism
Cons: More complex routing, higher latency for single result
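
The hierarchical combining described above can be sketched in a few lines. This is a minimal software model, assuming pairwise combining at each layer and that an unpaired value is passed through unchanged. The function name `tree_reduce` and its `combine` parameter are hypothetical.

```python
def tree_reduce(values, combine=lambda x, y: x + y):
    """Combine values pairwise, layer by layer; return (result, layer count)."""
    if not values:
        raise ValueError("tree_reduce needs at least one value")
    level = list(values)
    depth = 0
    while len(level) > 1:
        # All pairs in one layer are independent, so hardware could
        # evaluate them in parallel.
        nxt = [combine(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element passes through to the next layer
        level = nxt
        depth += 1
    return level[0], depth


total, layers = tree_reduce([1, 2, 3, 4, 5, 6, 7, 8])
print(total, layers)  # 36 3
```

Eight inputs need three layers (4, then 2, then 1 reduction), the same shape as the diagram above.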

3. Stationary Pattern (MACs Stay Put)

Used by: Some FPGA designs, mobile NPUs with limited memory

A and B are pre-loaded into registers around each MAC. Data stays local.

MAC[0][0]:
  A[0][0], A[0][1], ... pre-stored locally
  B[0][0], B[1][0], ... pre-stored locally
  Compute C[0][0] using local data
No streaming - all data in registers from start

Pros: No data movement, maximum locality
Cons: Limited by register file size, requires offline scheduling
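
A rough software model of the stationary layout is sketched below, assuming each MAC unit receives a full row of A and a full column of B before computation starts. The class `StationaryMAC` and the function `stationary_matmul` are hypothetical names used only for illustration.

```python
class StationaryMAC:
    """One MAC unit with its operands pre-loaded into local registers."""

    def __init__(self, a_row, b_col):
        self.a_row = list(a_row)  # local copy of one row of A
        self.b_col = list(b_col)  # local copy of one column of B

    def compute(self):
        # Uses only local data; nothing is exchanged with neighbours.
        return sum(a * b for a, b in zip(self.a_row, self.b_col))


def stationary_matmul(A, B):
    """Return (C, total number of operand values held in local registers)."""
    inner = len(B)
    macs = [[StationaryMAC(A[i], [B[k][j] for k in range(inner)])
             for j in range(len(B[0]))]
            for i in range(len(A))]
    C = [[mac.compute() for mac in row] for row in macs]
    registers = sum(len(m.a_row) + len(m.b_col) for row in macs for m in row)
    return C, registers


C, registers = stationary_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C, registers)  # [[19, 22], [43, 50]] 16
```

In this model the register count grows as 2N³ for an NxN multiply, because every MAC keeps its own copy of its operands. That is one way to see why register file size is the limiting factor.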

Comparison Table

Pattern    | Data Movement | Throughput   | Latency  | Complexity | Best For
Systolic   | Streams       | N ops/cycle  | O(N)     | Simple     | Matrix ops
Reduction  | Tree          | log(N) ops   | O(log N) | Medium     | Aggregations
Stationary | None          | M² ops/cycle | O(1)     | Complex    | Pre-scheduled

Choosing a Pattern

Day 10: Real implementations - how Apple, Google, NVIDIA chose their patterns.