Three Core Patterns
1. Systolic Pattern (Data Streams)
Used by: Google TPU; reportedly Apple Neural Engine (Apple has not published its microarchitecture)
Operands A and B flow through the array of MAC units; partial sums accumulate as the data passes and then exit the array.
Pros: High data reuse, low memory bandwidth, simple control
Cons: Requires streaming data; poor for irregular patterns
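A minimal cycle-level sketch of the systolic idea, assuming an output-stationary layout (A enters from the left, B from the top, each MAC keeps its own running sum and the sums are read out at the end). Real designs such as the TPU may hold weights in place instead; the function name `systolic_matmul` and the register layout are illustrative only.

```python
def systolic_matmul(A, B):
    """Simulate an n x m grid of MACs computing A (n x k) times B (k x m)."""
    n, k, m = len(A), len(B), len(B[0])
    a_reg = [[0] * m for _ in range(n)]  # A values moving left to right
    b_reg = [[0] * m for _ in range(n)]  # B values moving top to bottom
    acc = [[0] * m for _ in range(n)]    # partial sum held in each MAC
    cycles = k + n + m - 2               # stream length plus fill/drain skew

    for t in range(cycles):
        # Shift A one step right; row i is fed with a delay of i cycles.
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            s = t - i
            a_reg[i][0] = A[i][s] if 0 <= s < k else 0
        # Shift B one step down; column j is fed with a delay of j cycles.
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            s = t - j
            b_reg[0][j] = B[s][j] if 0 <= s < k else 0
        # Every MAC does one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, cycles)
```

The skewed feeding is what makes matching elements of A and B meet at the right MAC; the extra `n + m - 2` cycles are the fill and drain cost behind the O(N) latency.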
2. Reduction Pattern (Tree Aggregation)
Used by: Systolic arrays with reduction trees, some research designs
Partial results combine hierarchically (like MapReduce).
Pros: Good for reduction-heavy computations (e.g. the accumulations in conv nets); independent subtrees run in parallel
Cons: More complex routing; a single result must pass through every tree level (O(log N) latency)
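The hierarchical combine can be sketched in software as a pairwise tree. This is an illustration rather than a hardware model: `tree_reduce` is a hypothetical helper, and it assumes the combine operation is associative (addition by default). It also returns the number of levels, which is the latency in tree stages.

```python
def tree_reduce(values, combine=lambda x, y: x + y):
    """Combine values pairwise, level by level; return (result, levels)."""
    if not values:
        raise ValueError("tree_reduce needs at least one value")
    level = list(values)
    depth = 0
    while len(level) > 1:
        # Each pair at this level could be combined in parallel in hardware.
        nxt = [combine(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element passes through unchanged
        level = nxt
        depth += 1
    return level[0], depth


print(tree_reduce([1, 2, 3, 4, 5, 6, 7, 8]))
```

For 8 inputs the tree has 3 levels, matching the log2(N) depth.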
3. Stationary Pattern (MACs Stay Put)
Used by: Some FPGA designs, mobile NPUs with limited memory
A and B are pre-loaded into registers around each MAC. Data stays local.
Pros: No operand movement during compute, maximum locality
Cons: Limited by register file size, requires offline scheduling
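A rough sketch of the stationary idea, assuming each MAC owns a small register file that must hold every operand it needs before compute starts. `StationaryMAC`, `stationary_matmul` and `reg_capacity` are hypothetical names; the capacity check is there to make the register-size limit visible.

```python
class StationaryMAC:
    """One MAC with a private register file; operands never leave it."""

    def __init__(self, reg_capacity):
        self.reg_capacity = reg_capacity
        self.a_regs = []
        self.b_regs = []

    def preload(self, a_vals, b_vals):
        # The offline schedule must fit all operands into local registers.
        if len(a_vals) + len(b_vals) > self.reg_capacity:
            raise ValueError("operands exceed register file capacity")
        self.a_regs = list(a_vals)
        self.b_regs = list(b_vals)

    def run(self):
        # Compute uses only local data.
        return sum(a * b for a, b in zip(self.a_regs, self.b_regs))


def stationary_matmul(A, B, reg_capacity):
    """Give each output element its own pre-loaded MAC."""
    k, m = len(B), len(B[0])
    grid = [[StationaryMAC(reg_capacity) for _ in range(m)] for _ in A]
    for i, row in enumerate(A):
        for j in range(m):
            grid[i][j].preload(row, [B[s][j] for s in range(k)])
    return [[mac.run() for mac in macs] for macs in grid]


print(stationary_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], reg_capacity=4))
```

Lowering `reg_capacity` below 4 in this example raises an error, which is the software analogue of a kernel that does not fit the register file.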
Comparison Table
| Pattern | Data Movement | Throughput | Latency | Complexity | Best For |
|---|---|---|---|---|---|
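| Systolic | Streams | Up to N² MACs/cycle | O(N) | Simple | Matrix ops |
| Reduction | Tree | N−1 ops per result | O(log N) | Medium | Aggregations |
| Stationary | None during compute | Up to M² MACs/cycle | O(1) | Complex | Pre-scheduled |

Here N is taken as the side length of the systolic array or the number of inputs to the reduction tree, and M as the side length of the stationary MAC grid.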
Choosing a Pattern
- Streaming data? → Systolic (TPU choice)
- Variable-size problems? → Reduction (tree parallelism)
- Fixed known kernels? → Stationary (FPGA choice)
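The checklist above can be written as a small lookup. This is only a sketch: `choose_pattern` is a hypothetical helper, and the order in which the questions are checked is an assumption, since the list does not say what to do when several answers are yes.

```python
def choose_pattern(streaming=False, variable_size=False, fixed_kernels=False):
    """Return the pattern suggested by the first question answered yes."""
    if streaming:
        return "systolic"
    if variable_size:
        return "reduction"
    if fixed_kernels:
        return "stationary"
    return None  # none of the three questions applies


print(choose_pattern(fixed_kernels=True))
```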
Day 10: Real implementations - how Apple, Google, and NVIDIA chose their patterns.