Day 10

Real Systolic Implementations

How Apple, Google, and NVIDIA designed their production systolic arrays. Specs, design choices, and real-world performance.

Google TPU v2/v3 (Data Center)

Systolic Array: 256×256 (first-generation TPU; v2/v3 cores use 128×128 matrix units)

  • Total MACs: 65,536 (one per PE)
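  • Precision: BF16 multiplies with FP32 accumulation (INT8 on the first-generation TPU)
  • Peak TFLOPS: 180 (v2), 420 (v3), per four-chip board
  • Memory: 8 GB HBM per core (v2), 16 GB HBM per core (v3)
  • Interconnect: 2D torus links between chips, used for all-reduce operations
  • Layout (first-generation TPU): 256×256 systolic core + 24 MiB on-chip unified buffer for activations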
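To make "one MAC per PE" concrete, here is a minimal cycle-by-cycle sketch of a systolic matrix multiply. It is an illustration under assumptions, not a model of the TPU: it uses an output-stationary dataflow (each PE keeps one output element) because that is the simplest to write down, whereas the TPU's matrix unit is weight-stationary. The function name `systolic_matmul` and the skew scheme are hypothetical choices for this example.

```python
def systolic_matmul(a, b):
    """Simulate an output-stationary systolic array computing a @ b.

    PE (i, j) owns output element (i, j). Inputs are skewed so that
    a[i][s] and b[s][j] meet at PE (i, j) on cycle t = s + i + j.
    Each PE performs at most one multiply-accumulate per cycle.
    """
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"

    acc = [[0] * n for _ in range(m)]   # one accumulator per PE
    macs = 0                            # total multiply-accumulates issued
    cycles = m + n + k - 2              # cycles until the last PE finishes

    for t in range(cycles):
        for i in range(m):
            for j in range(n):
                s = t - i - j           # which operand pair reaches this PE now
                if 0 <= s < k:
                    acc[i][j] += a[i][s] * b[s][j]
                    macs += 1
    return acc, cycles, macs


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
result, cycles, macs = systolic_matmul(a, b)
print(result, cycles, macs)  # [[19, 22], [43, 50]] 4 8
```

The same arithmetic happens in every dataflow; what changes between designs is which operand stays put in the PE and which ones move.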

Design Choices

Apple Neural Engine (Mobile)

Systolic Array: 16×16

  • Total MACs: 256 per core (one per PE), 4,096 across the 16-core cluster
  • Precision: INT8 + INT16 for inference only
  • Peak TOPS: 17 at 2 GHz (Apple's figure for the A16 Bionic; the A17 Pro is rated at 35 TOPS)
  • Memory: Shared with CPU/GPU (no dedicated HBM)
  • Die area: ~1.5 mm² per neural core
  • Power: 2 W sustained during inference
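The peak figure above can be roughly reproduced from the array size and clock. The sketch below assumes one MAC per PE per cycle and counts each MAC as two operations (a multiply and an add), which is a common convention but an assumption here; the function name `peak_tops` is hypothetical.

```python
def peak_tops(rows, cols, cores, clock_hz, ops_per_mac=2):
    """Peak tera-operations per second for a cluster of systolic arrays.

    Assumes every PE completes one MAC per cycle and that a MAC counts
    as `ops_per_mac` operations (2 = multiply + add, by convention).
    """
    macs_per_cycle = rows * cols * cores
    return macs_per_cycle * ops_per_mac * clock_hz / 1e12


# 16x16 array, 16 cores, 2 GHz: the figures listed above
apple = peak_tops(rows=16, cols=16, cores=16, clock_hz=2e9)
print(f"{apple:.1f} TOPS")  # 16.4 TOPS
```

Under these assumptions the result is about 16.4 TOPS, close to the 17 TOPS listed. Real chips rarely sustain the peak, since it requires every PE to be busy on every cycle.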

Design Choices

NVIDIA H100 (GPU Alternative)

Tensor Cores: 528 (4 per SM across 132 SMs), with structured-sparsity support

  • Total MACs: 1.4M (Tensor Cores, not traditional systolic)
  • Precision: FP64, TF32, BF16, FP16, FP8, INT8 on Tensor Cores; FP32 on CUDA cores (all workloads)
  • Peak TFLOPS: 1,450 (sparse mode)
  • Memory: 80 GB HBM3
  • Cache hierarchy: L1, L2, shared memory (GPU paradigm)
  • Flexibility: General-purpose compute (not just AI)

Design Choices

Comparison: Size vs Performance

Design        MACs    Tera-ops/W   Workloads              Context
Apple 16×16   256     8.5          Inference only         Mobile
TPU 256×256   65.5K   2.9          Training + inference   Data center
H100 Tensor   1.4M    2.0          All compute            Data center
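The efficiency column appears to be peak throughput divided by power, in tera-operations per watt. As a quick check under that assumption, the Apple row follows from the 17 TOPS and 2 W listed earlier; the function name below is hypothetical, and the other rows are not recomputed because this page does not list their power draw.

```python
def tera_ops_per_watt(peak_tera_ops, power_watts):
    """Efficiency as peak throughput (tera-ops/s) per watt of power."""
    if power_watts <= 0:
        raise ValueError("power must be positive")
    return peak_tera_ops / power_watts


# Apple Neural Engine figures from the spec list: 17 TOPS at 2 W
apple_efficiency = tera_ops_per_watt(17, 2)
print(apple_efficiency)  # 8.5
```

Note that the rows mix INT8 operations (Apple) with floating-point operations (TPU, H100), so the column compares orders of magnitude rather than like-for-like work.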

Key Insight: Scaling Law

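Efficiency scales with problem size: an array reaches its peak only when the matrices are large enough to keep every PE busy, so the right array size depends on the workload.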
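One way to read this claim is in terms of utilization: the fraction of PEs doing useful work when an output matrix is tiled onto the array. The sketch below uses a deliberately simple tiling model that ignores pipeline fill and drain, memory bandwidth, and batching; `array_utilization` is a hypothetical helper, not a vendor formula.

```python
import math


def array_utilization(m, n, size):
    """Fraction of PEs used when an m x n output is tiled onto a
    size x size array. Partial tiles at the edges leave PEs idle.
    """
    tiles = math.ceil(m / size) * math.ceil(n / size)
    return (m * n) / (tiles * size * size)


# A 64x64 output fills a 16x16 array but leaves most of a 256x256 array idle.
for size in (16, 256):
    print(size, array_utilization(64, 64, size))
# 16 1.0
# 256 0.0625
```

In this model a small array stays fully busy on a small problem, while a large array needs a correspondingly large problem before its extra PEs pay off.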

Tomorrow (Day 11): Quantization - how FP32 becomes INT8 without losing accuracy.