AI Chip Design Day 19

The Problem: Scaling Systolic Arrays

Can't make one 256×256 systolic array? Split into tiles and connect them:

Single chip (256×256):

┌──────────────┐
│   Systolic   │
│   256×256    │  → All MACs on one die
└──────────────┘

Multi-chip (4×4 grid of 64×64):

┌────┬────┬────┬────┐
│ 64 │ 64 │ 64 │ 64 │
├────┼────┼────┼────┤
│ 64 │ 64 │ 64 │ 64 │
├────┼────┼────┼────┤
│ 64 │ 64 │ 64 │ 64 │
├────┼────┼────┼────┤
│ 64 │ 64 │ 64 │ 64 │
└────┴────┴────┴────┘

Need interconnect (NoC) between tiles
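To make the split concrete, here is a minimal sketch of how a MAC position in the logical 256×256 array could map onto the 4×4 grid of 64×64 tiles. The names `locate` and `crosses_tile_boundary` are hypothetical, and row-major tile numbering is an assumption, not something the diagram specifies.

```python
TILE = 64  # MACs along one edge of a tile
GRID = 4   # tiles along one edge of the grid (4 * 64 = 256)


def locate(row, col, tile=TILE, grid=GRID):
    """Map a global MAC coordinate to (tile_row, tile_col, local_row, local_col)."""
    size = tile * grid
    if not (0 <= row < size and 0 <= col < size):
        raise ValueError(f"({row}, {col}) is outside the {size}x{size} array")
    return row // tile, col // tile, row % tile, col % tile


def crosses_tile_boundary(src, dst, tile=TILE, grid=GRID):
    """Return True if data moving from MAC src to MAC dst must use the NoC."""
    return locate(*src, tile, grid)[:2] != locate(*dst, tile, grid)[:2]


if __name__ == "__main__":
    print(locate(64, 130))                           # tile (1, 2), local (0, 2)
    print(crosses_tile_boundary((63, 0), (64, 0)))   # neighbours in different tiles
```

Two MACs that are adjacent in the logical array can still sit on different tiles, which is exactly the traffic the interconnect has to carry.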

NoC Topologies

1. Mesh Network (Google TPU Pod)

Regular grid; each tile connects to up to 4 neighbors (North, South, East, West), with fewer at the edges and corners:

[T]─[T]─[T]─[T]
 │   │   │   │
[T]─[T]─[T]─[T]
 │   │   │   │
[T]─[T]─[T]─[T]
 │   │   │   │
[T]─[T]─[T]─[T]

Each link: 64 bits wide, 1 GB/s
Latency: proportional to the number of hops (Manhattan distance)
- Adjacent tile: 1 hop
- Opposite corner: 6 hops (grid diameter)
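The hop counts above can be reproduced with a short sketch. It assumes tiles are addressed as (row, col) and that packets use dimension-ordered XY routing, which is a common choice for meshes but is not stated in the text; the function names are hypothetical.

```python
def hops(src, dst):
    """Manhattan distance between two (row, col) tiles in a 2D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def xy_route(src, dst):
    """List the tiles visited when moving along the row first, then the column."""
    row, col = src
    path = [(row, col)]
    step = 1 if dst[1] > col else -1
    while col != dst[1]:          # X phase: walk East or West
        col += step
        path.append((row, col))
    step = 1 if dst[0] > row else -1
    while row != dst[0]:          # Y phase: walk North or South
        row += step
        path.append((row, col))
    return path


def mesh_diameter(rows, cols):
    """Longest shortest path in a rows x cols mesh (corner to opposite corner)."""
    return (rows - 1) + (cols - 1)


if __name__ == "__main__":
    print(hops((0, 0), (0, 1)))       # adjacent tile
    print(hops((0, 0), (3, 3)))       # opposite corner of the 4x4 grid
    print(xy_route((0, 0), (1, 2)))
```

XY routing always takes a shortest path, so the route length in links equals the Manhattan distance.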

2. Crossbar (NVIDIA H100)

Any-to-any connectivity (fully connected):

[T] [T] [T] [T]
 ↓   ↓   ↓   ↓
┌─────────────────┐
│    Crossbar     │  (256×256 switch matrix)
└─────────────────┘
 ↑   ↑   ↑   ↑
[T] [T] [T] [T]

Any tile can send to any other tile in 1 hop.
Cost: area and power, both much larger than a mesh, because the number of crosspoints grows with the square of the port count
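A rough way to see the cost gap is to count switching resources. The sketch below assumes one crossbar port per tile and counts bidirectional links for the mesh; both are simplifications, and real area and power depend on much more than these counts. The function names are hypothetical.

```python
def crossbar_crosspoints(ports):
    """Crosspoints in a full ports x ports crossbar (grows quadratically)."""
    return ports * ports


def mesh_links(rows, cols):
    """Bidirectional neighbor links in a rows x cols 2D mesh."""
    horizontal = rows * (cols - 1)
    vertical = cols * (rows - 1)
    return horizontal + vertical


if __name__ == "__main__":
    for side in (4, 8):
        tiles = side * side
        print(tiles, "tiles:",
              crossbar_crosspoints(tiles), "crosspoints vs",
              mesh_links(side, side), "mesh links")
```

Under these assumptions, going from 16 to 64 tiles multiplies the crosspoint count by 16, while the mesh link count grows by less than a factor of 5.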

3. All-Reduce Tree (Distributed Training)

For synchronizing gradients across chips:

Leaf (8 chips):

┌─[C]─[C]─[C]─[C]┐
│                │
└─[C]─[C]─[C]─[C]┘

All-reduce (gradient sync):

[G] [G] [G] [G] [G] [G] [G] [G]
  \ /     \ /     \ /     \ /
 [SUM]   [SUM]   [SUM]   [SUM]
     \   /           \   /
     [SUM]           [SUM]
          \         /
            \     /
             [SUM]
               |
          (broadcast)
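The reduce-then-broadcast pattern in the diagram can be simulated in a few lines. This is a minimal sketch assuming a power-of-two chip count and one scalar gradient per chip; `tree_all_reduce` is a hypothetical name, and real implementations reduce whole tensors and overlap communication with compute.

```python
def tree_all_reduce(grads):
    """Sum gradients pairwise up a binary tree, then broadcast the total.

    Returns (per_chip_results, reduction_levels).
    """
    n = len(grads)
    if n == 0 or n & (n - 1):
        raise ValueError("chip count must be a power of two")
    level = list(grads)
    levels = 0
    while len(level) > 1:
        # Each [SUM] node adds the values of its two children.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        levels += 1
    total = level[0]
    # Broadcast: every chip ends up with the same summed gradient.
    return [total] * n, levels


if __name__ == "__main__":
    results, levels = tree_all_reduce([1, 2, 3, 4, 5, 6, 7, 8])
    print(results, levels)
```

With 8 chips the reduction takes 3 levels (8 → 4 → 2 → 1), so the number of sequential steps grows with log2 of the chip count rather than with the chip count itself.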

Real Examples

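Google TPU v4 Pod (8-chip example):
- Topology: 3D mesh (2×2×2 cube in 3D space); production v4 Pods wrap the links around to form a 3D torus
- Links: 1 GB/s per direction (simplified figure; production inter-chip links are far faster)
- All-reduce latency: ~1 ms (synchronizing across 512 TPUs; a full v4 Pod scales to 4,096 chips)

NVIDIA H100 (multi-GPU):
- Single GPU: local Tensor Cores plus NVLink (900 GB/s total bandwidth to other GPUs)
- NVLink: 18 fourth-generation links per GPU, 50 GB/s bidirectional each
- Effective: crossbar-like behavior through NVSwitch in a DGX system, with 900 GB/s per GPU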
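The 2×2×2 TPU cube listed above can be explored with a small hop counter. This is a sketch assuming wraparound (torus) links in every dimension; for a dimension of size 2 the wraparound changes nothing, so the result also holds for a plain 3D mesh of that size. `torus_hops` and `torus_diameter` are hypothetical names.

```python
from itertools import product


def torus_hops(src, dst, dims):
    """Shortest hop count between two nodes when each dimension wraps around."""
    total = 0
    for s, d, n in zip(src, dst, dims):
        direct = abs(s - d)
        total += min(direct, n - direct)  # go the short way around the ring
    return total


def torus_diameter(dims):
    """Largest hop count from the origin to any node (a torus is symmetric)."""
    origin = (0,) * len(dims)
    nodes = product(*(range(n) for n in dims))
    return max(torus_hops(origin, node, dims) for node in nodes)


if __name__ == "__main__":
    print(torus_hops((0, 0, 0), (1, 1, 1), (2, 2, 2)))  # far corner of the cube
    print(torus_diameter((2, 2, 2)))
    print(torus_diameter((4, 4, 4)))
```

In the 8-chip cube every chip is at most 3 hops from any other, and a 4×4×4 torus still has a diameter of only 6 hops because the wraparound links halve the worst-case distance in each dimension.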