The Problem: Scaling Systolic Arrays
When a single 256×256 systolic array is too large to build on one die, split it into smaller tiles and connect them:
Single chip (256×256):
┌──────────────┐
│   Systolic   │
│   256×256    │ → All MACs on one die
└──────────────┘
Multi-chip (4×4 grid of 64×64):
┌────┬────┬────┬────┐
│ 64 │ 64 │ 64 │ 64 │
├────┼────┼────┼────┤
│ 64 │ 64 │ 64 │ 64 │
├────┼────┼────┼────┤
│ 64 │ 64 │ 64 │ 64 │
├────┼────┼────┼────┤
│ 64 │ 64 │ 64 │ 64 │
└────┴────┴────┴────┘
The tiles now need an interconnect between them: a network-on-chip (NoC) when they share a die, or chip-to-chip links when they do not.
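A minimal sketch of how the tiling above changes addressing, assuming the 4×4 grid of 64×64 tiles from the diagram. The helper names `locate_pe` and `crosses_tile_boundary` are hypothetical and chosen only for illustration:

```python
TILE = 64  # PEs per tile side (from the 4x4 grid of 64x64 tiles above)
GRID = 4   # tiles per grid side


def locate_pe(row, col, tile=TILE, grid=GRID):
    """Map a global PE coordinate to ((tile_row, tile_col), (local_row, local_col))."""
    size = tile * grid
    if not (0 <= row < size and 0 <= col < size):
        raise ValueError(f"PE ({row}, {col}) is outside the {size}x{size} array")
    return (row // tile, col // tile), (row % tile, col % tile)


def crosses_tile_boundary(src, dst):
    """True if data moving from PE src to PE dst must use the inter-tile network."""
    return locate_pe(*src)[0] != locate_pe(*dst)[0]


if __name__ == "__main__":
    print(locate_pe(63, 64))                         # last row of tile (0,0)'s band, first column of tile (0,1)
    print(crosses_tile_boundary((0, 63), (0, 64)))   # neighbouring PEs on different tiles
```

Two PEs that were direct neighbours in the monolithic array can end up on different tiles, which is exactly the traffic the interconnect has to carry.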
NoC Topologies
1. Mesh Network (Google TPU Pod)
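Regular grid: each interior tile connects to 4 neighbors (North, South, East, West); edge and corner tiles have fewer. TPU Pods add wraparound links, which turns the mesh into a torus.
[T]─[T]─[T]─[T]
 │   │   │   │
[T]─[T]─[T]─[T]
 │   │   │   │
[T]─[T]─[T]─[T]
 │   │   │   │
[T]─[T]─[T]─[T]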
Each link: 64 bits wide, 1 GB/s
Latency: grows with hop count, which is the Manhattan distance between tiles
- Adjacent tile: 1 hop
- Opposite corner: 6 hops (the diameter of a 4×4 mesh: 3 + 3)
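The hop counts above can be reproduced with a short sketch. It assumes tiles are addressed as (row, col) and that packets use dimension-ordered (XY) routing, which the notes do not specify; the function names are hypothetical:

```python
def mesh_hops(src, dst):
    """Minimum hop count between two tiles in a 2D mesh (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def mesh_diameter(rows, cols):
    """Worst-case hop count: opposite corners of a rows x cols mesh."""
    return (rows - 1) + (cols - 1)


def xy_route(src, dst):
    """List the tiles visited when moving along the column axis first, then the row axis."""
    r, c = src
    path = [(r, c)]
    step = 1 if dst[1] > c else -1
    while c != dst[1]:      # X phase: fix the column
        c += step
        path.append((r, c))
    step = 1 if dst[0] > r else -1
    while r != dst[0]:      # Y phase: fix the row
        r += step
        path.append((r, c))
    return path


if __name__ == "__main__":
    print(mesh_hops((0, 0), (3, 3)))   # corner to corner in the 4x4 grid
    print(xy_route((0, 0), (1, 2)))
```

The route length minus one always equals `mesh_hops`, so XY routing is minimal in a mesh without wraparound links.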
2. Crossbar (NVIDIA NVSwitch with H100)
Any-to-any connectivity through a central switch:
 [T]  [T]  [T]  [T]
  ↓    ↓    ↓    ↓
┌─────────────────┐
│    Crossbar     │ (N×N switch matrix: 16 tiles → 256 crosspoints)
└─────────────────┘
  ↑    ↑    ↑    ↑
 [T]  [T]  [T]  [T]
Any tile can send to any other tile in 1 hop.
Cost: area and power. The number of crosspoints grows as N² with the tile count, so a crossbar is much more expensive than a mesh at scale.
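To make the cost gap concrete, here is a rough counting sketch. It uses crosspoint count as a stand-in for crossbar area and bidirectional link count as a stand-in for mesh wiring; both are simplifying assumptions, and the function names are hypothetical:

```python
def mesh_links(rows, cols):
    """Bidirectional neighbour links in a rows x cols mesh (no wraparound)."""
    return rows * (cols - 1) + cols * (rows - 1)


def crossbar_crosspoints(n_tiles):
    """Switch points in a full n x n crossbar connecting n_tiles tiles."""
    return n_tiles * n_tiles


if __name__ == "__main__":
    for side in (2, 4, 8):
        n = side * side
        print(f"{n:3d} tiles: mesh links = {mesh_links(side, side):4d}, "
              f"crossbar crosspoints = {crossbar_crosspoints(n):5d}")
```

Mesh wiring grows roughly linearly with the number of tiles while crosspoints grow quadratically, which is the trade-off against the 1-hop latency.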
3. All-Reduce Tree (Distributed Training)
For synchronizing gradients across chips:
Leaf (8 chips):
┌─[C]─[C]─[C]─[C]─┐
│                 │
└─[C]─[C]─[C]─[C]─┘
All-reduce (gradient sync):
[G] [G] [G] [G] [G] [G] [G] [G]
  \ /     \ /     \ /     \ /
 [SUM]   [SUM]   [SUM]   [SUM]
     \   /           \   /
     [SUM]           [SUM]
           \       /
             [SUM]
               |
          (broadcast)
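The tree in the diagram can be simulated in a few lines. This is a single-process sketch, assuming a power-of-two chip count and gradients stored as plain lists; `tree_all_reduce` is a hypothetical name, not a library call:

```python
def tree_all_reduce(grads):
    """Sum per-chip gradient vectors up a binary tree, then broadcast the total.

    Returns (per_chip_results, reduce_levels).
    """
    n = len(grads)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two number of chips")
    level = [list(g) for g in grads]
    levels = 0
    while len(level) > 1:
        # Each [SUM] node adds the vectors of its two children element-wise.
        level = [
            [a + b for a, b in zip(level[i], level[i + 1])]
            for i in range(0, len(level), 2)
        ]
        levels += 1
    total = level[0]
    # Broadcast: every chip receives its own copy of the summed gradient.
    return [list(total) for _ in range(n)], levels


if __name__ == "__main__":
    chip_grads = [[i, 2 * i] for i in range(8)]  # 8 chips, 2 gradient values each
    results, depth = tree_all_reduce(chip_grads)
    print(results[0], depth)
```

With 8 chips the reduction takes log2(8) = 3 levels, matching the three rows of [SUM] nodes in the diagram; the broadcast walks the same tree back down.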
Real Examples
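Google TPU v4 (8-chip slice):
- 3D topology (2×2×2 cube); a full Pod extends this to a 3D torus of 4096 chips
- Links: 1 GB/s per direction in this simplified model (real inter-chip links are considerably faster)
- All-reduce latency: ~1 ms to synchronize across a 512-chip slice (depends on message size)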
NVIDIA H100 (multi-GPU):
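- Single GPU: local Tensor Cores plus NVLink ports (900 GB/s aggregate bandwidth to other GPUs)
- NVLink: 18 fourth-generation links per GPU, 50 GB/s bidirectional each (18 × 50 = 900 GB/s)
- Effective: crossbar-like behavior, because the NVSwitch chips in a DGX system connect every GPU to every other GPU at full NVLink bandwidth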
Day 20: Memory-compute co-design: optimizing dataflow for entire chip.