Real Systolic Implementations

How Apple, Google, and NVIDIA designed their production matrix engines: specs, design choices, and real-world performance.

Google TPU v2/v3 (Data Center)

Systolic Array: 256×256 (the TPU v1 size; v2/v3 use 128×128 MXUs)

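Total MACs: 65,536 (one per PE) in a 256×256 array; each 128×128 MXU in v2/v3 has 16,384
Precision: BF16 multiply with FP32 accumulate (INT8 was the TPU v1 format)
Peak TFLOPS: 180 (v2), 420 (v3), both quoted per 4-chip board
Memory: 16 GB (v2) or 32 GB (v3) HBM per chip
Interconnect: Dedicated chip-to-chip links in a 2D torus, used for all-reduce operations
Layout: Systolic core + on-chip SRAM buffers (24 MB unified buffer on TPU v1)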
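
To make "one MAC per PE" concrete, here is a minimal Python sketch of a small systolic array computing C = A × B. It is an illustration only: the function name `systolic_matmul` is hypothetical, and the output-stationary dataflow (each PE keeps its own accumulator) is an assumption chosen for brevity, not a description of how any of these chips is built. Operands enter skewed from the left and top edges and move one PE per cycle, so each value is fetched from memory once and reused across a whole row or column.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A @ B.

    PE (i, j) holds accumulator C[i][j]. A values flow left to right,
    B values flow top to bottom, one hop per cycle.
    """
    M, K, N = len(A), len(B), len(B[0])
    acc = [[0] * N for _ in range(M)]    # one accumulator per PE
    a_reg = [[0] * N for _ in range(M)]  # A operand currently held in each PE
    b_reg = [[0] * N for _ in range(M)]  # B operand currently held in each PE
    cycles = K + M + N - 2               # time for the skewed wavefront to fill and drain

    for t in range(cycles):
        # Walk from the far corner backward so each PE reads its
        # neighbor's value from the previous cycle.
        for i in reversed(range(M)):
            for j in reversed(range(N)):
                if j > 0:
                    a_reg[i][j] = a_reg[i][j - 1]
                else:
                    k = t - i  # row i is delayed by i cycles
                    a_reg[i][0] = A[i][k] if 0 <= k < K else 0
                if i > 0:
                    b_reg[i][j] = b_reg[i - 1][j]
                else:
                    k = t - j  # column j is delayed by j cycles
                    b_reg[0][j] = B[k][j] if 0 <= k < K else 0
                acc[i][j] += a_reg[i][j] * b_reg[i][j]  # the PE's single MAC
    return acc, cycles


C, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C, cycles)
```

Note that only the edge PEs ever touch memory; interior PEs get their operands from neighbors.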

Design Choices

Why 256×256? 65,536 MACs fit on the die; the size balances silicon area against compute
Why BF16? 16-bit brain float: 1 sign + 8 exponent + 7 mantissa bits. It keeps the FP32 exponent range, so neural-net values rarely overflow
No caches? On-chip SRAM is a software-managed buffer, and operands are reused as they flow through the array, so no L1/L2 hierarchy is needed
All-reduce? Sums gradients across chips during distributed training
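
The BF16 layout (1 sign + 8 exponent + 7 mantissa bits) is simply the top half of an FP32 word. A minimal sketch, assuming plain truncation (hardware typically rounds to nearest even instead) and using hypothetical helper names:

```python
import struct


def fp32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit BF16 pattern for x by truncating the FP32 encoding."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16  # keep sign (1) + exponent (8) + top 7 mantissa bits


def bf16_bits_to_float(bits16: int) -> float:
    """Expand a BF16 pattern back to a float by zero-filling the low 16 bits."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value


def bf16_fields(bits16: int) -> tuple:
    """Split a BF16 pattern into (sign, exponent, mantissa)."""
    sign = bits16 >> 15
    exponent = (bits16 >> 7) & 0xFF
    mantissa = bits16 & 0x7F
    return sign, exponent, mantissa


bits = fp32_to_bf16_bits(3.14159)
print(bf16_fields(bits))         # (0, 128, 73)
print(bf16_bits_to_float(bits))  # 3.140625
```

The exponent field is untouched, so the representable range matches FP32; only precision drops, to roughly two or three decimal digits.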

Apple Neural Engine (Mobile)

Systolic Array: 16×16 (estimate; Apple does not publish ANE internals)

Total MACs: 256 per core (one per PE), 4,096 across the 16-core cluster
Precision: FP16 and INT8, inference only
Peak TOPS: 17 (A16 Bionic, assuming a 2 GHz clock; the A17 Pro is rated at 35)
Memory: Shared with CPU/GPU (no dedicated HBM)
Die area: ~1.5 mm² per neural core
Power: 2W sustained inference
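
The peak figure can be sanity-checked from the numbers listed above. A rough sketch, assuming each MAC counts as two operations (multiply + add) and every PE is busy every cycle; the function name `peak_tops` is hypothetical:

```python
def peak_tops(macs_per_core: int, cores: int, clock_hz: float) -> float:
    """Peak tera-operations per second, counting one MAC as 2 ops."""
    ops_per_cycle = macs_per_core * cores * 2
    return ops_per_cycle * clock_hz / 1e12


ane_tops = peak_tops(macs_per_core=16 * 16, cores=16, clock_hz=2e9)
print(f"{ane_tops:.1f} TOPS")          # 16.4 TOPS
print(f"{ane_tops / 2.0:.1f} TOPS/W")  # 8.2 TOPS/W at 2 W
```

That lands close to the quoted 17 TOPS. Sustained throughput is lower whenever a layer does not fill the array or memory bandwidth becomes the limit.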

Design Choices

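Why 16×16? Small enough for a mobile power budget, and common layer sizes tile evenly (224×224 images: 224 = 14 × 16)
Why low precision only? Inference runs on quantized or half-precision weights; training happens offline on GPUs
Shared memory? Uses LPDDR5 (low-power DDR) shared with the CPU and GPU; bandwidth and latency are acceptable at a 2 W budget
16 cores? Independent cores give task parallelism (the count is not tied to the CPU, which has 6 cores)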

NVIDIA H100 (GPU Alternative)

Tensor Cores: 528 (4 per SM × 132 SMs)

Total MACs: 1.4M (Tensor Cores, not traditional systolic)
Precision: FP64, TF32, BF16, FP16, FP8, INT8 on Tensor Cores (FP32 runs on CUDA cores)
Peak TFLOPS: ~1,979 (FP16/BF16, sparse mode, SXM); ~989 dense
Memory: 80 GB HBM3
Cache hierarchy: L1, L2, shared memory (GPU paradigm)
Flexibility: General-purpose compute (not just AI)

Design Choices

Not pure systolic? Tensor Cores are small matrix-multiply-accumulate units fed from registers and shared memory (more flexible than a strict systolic array)
Why more memory? Must support variable workloads, batching, multi-GPU distributed training
Cache hierarchy? Needed for non-matrix operations (activation functions, layer norm, attention)
Sparse mode? 2:4 structured sparsity (two of every four weights are zero) → up to 2× peak throughput
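
A minimal sketch of what a 2-of-4 sparsity pattern looks like, assuming simple magnitude pruning (zero the two smallest values in every group of four). The function name `prune_2_of_4` is hypothetical, and real toolchains usually fine-tune the model after pruning to recover accuracy:

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in each group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.0]
sparse_row = prune_2_of_4(row)
print(sparse_row)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Because exactly half of each group is zero, the hardware can skip those multiplies on a fixed schedule, which is where the up-to-2× figure comes from.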

Comparison: Size vs Performance

Design	MACs	Peak tera-ops/W	Workloads	Context
Apple 16×16	256	8.5	Inference only	Mobile
TPU 256×256	65.5K	2.9	Train + Infer	Data center
H100 Tensor	1.4M	2.0	All compute	Data center

Key Insight: Scaling Law

Efficiency depends on matching array size to problem size: an array only pays off when the workload keeps it full.

16×16 array: Good for 224×224 images (a typical vision input size; 224 = 14 × 16)
256×256 array: Good for transformer layers (8K→8K projections tile into 32 × 32 blocks, plus all-to-all attention)
Larger arrays hit wire-delay and power limits
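
The fit between array size and problem size can be estimated with a little arithmetic. A rough sketch, assuming a matrix is cut into square tiles the size of the array and that partially filled edge tiles still occupy the whole array; the function name `array_utilization` is hypothetical:

```python
import math


def array_utilization(rows: int, cols: int, array_size: int) -> float:
    """Fraction of PEs doing useful work when tiling a rows x cols matrix."""
    tiles = math.ceil(rows / array_size) * math.ceil(cols / array_size)
    return (rows * cols) / (tiles * array_size ** 2)


print(array_utilization(224, 224, 16))     # 1.0
print(array_utilization(8192, 8192, 256))  # 1.0
print(array_utilization(224, 224, 256))    # 0.765625
```

The last line shows the mismatch case: a 224×224 problem on a 256×256 array leaves almost a quarter of the PEs idle, and smaller layers waste far more.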

Tomorrow (Day 11): Quantization - how FP32 becomes INT8 with minimal accuracy loss.