Google TPU v2/v3 (Data Center)
Systolic Array: 256×256 (the TPU v1 matrix unit; v2/v3 chips are built from 128×128 MXUs)
- Total MACs: 65,536 (one per PE)
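- Precision: BF16 multiplies with FP32 accumulation (v2/v3); INT8 (v1)
- Peak TFLOPS: 180 (v2), 420 (v3), both per 4-chip board
- Memory: HBM, 16 GB per chip (v2) or 32 GB per chip (v3)
- Interconnect: Dedicated inter-chip links (a 2D torus in TPU pods) that carry all-reduce traffic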
- Layout: 256×256 systolic core + 24 MB SRAM weight cache
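The all-reduce listed above can be sketched independently of the physical wiring. This is a minimal illustration of a binary reduction tree followed by a broadcast, assuming one value per chip; the function name `tree_all_reduce` is hypothetical and the code does not model any real TPU interconnect.

```python
def tree_all_reduce(values):
    """Sum one value per chip with a binary reduction tree, then broadcast.

    Assumes at least one chip. Returns (per-chip results, reduction steps).
    """
    partial = list(values)
    n = len(partial)
    steps = 0
    stride = 1
    # Reduce phase: at each level, chip i absorbs the partial sum held by
    # chip i + stride, so the number of active chips halves every step.
    while stride < n:
        for i in range(0, n - stride, 2 * stride):
            partial[i] += partial[i + stride]
        stride *= 2
        steps += 1
    # Broadcast phase: every chip receives the total now held by chip 0.
    return [partial[0]] * n, steps


if __name__ == "__main__":
    results, steps = tree_all_reduce([1, 2, 3, 4, 5, 6, 7, 8])
    print(results[0], steps)  # 36 3
```

With 8 chips the sum is complete after 3 reduction steps (log2 of 8) instead of 7 sequential additions, which is why reductions are arranged this way.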
Design Choices
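- Why 256×256? 65,536 MACs fit on one die; the size balances silicon area against compute throughput
- Why BF16? 16-bit brain float: 1 sign + 8 exponent + 7 mantissa bits; it keeps FP32's exponent range, which matters more to neural nets than mantissa precision
- No caches? Software-managed on-chip SRAM plays the cache's role; the predictable systolic dataflow doesn't need a hardware L1/L2 hierarchy
- All-reduce tree? Sums gradients across chips during distributed training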
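The BF16 layout above (1 sign + 8 exponent + 7 mantissa bits) is the top half of an FP32 word, so a conversion can be sketched by dropping the low 16 bits. This sketch truncates for simplicity; rounding to nearest is the usual choice in practice, and the helper names are hypothetical.

```python
import struct


def fp32_to_bf16_bits(x):
    """Return the 16-bit BF16 pattern for x by truncating its FP32 encoding."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 16  # keeps sign (1) + exponent (8) + top 7 mantissa bits


def bf16_bits_to_float(b):
    """Expand a BF16 bit pattern back to a Python float (zero-fill low bits)."""
    (x,) = struct.unpack(">f", struct.pack(">I", b << 16))
    return x


def bf16_fields(b):
    """Split a BF16 bit pattern into (sign, exponent, mantissa)."""
    return (b >> 15) & 0x1, (b >> 7) & 0xFF, b & 0x7F


if __name__ == "__main__":
    b = fp32_to_bf16_bits(3.14159)
    print(hex(b), bf16_fields(b), bf16_bits_to_float(b))
    # 0x4049 (0, 128, 73) 3.140625
```

Only about two to three decimal digits survive, but a value as large as 1e38 still converts to a finite number because the 8-bit exponent matches FP32.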
Apple Neural Engine (Mobile)
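Systolic Array: 16×16 per core (assumed for this comparison; Apple does not publish the Neural Engine's microarchitecture)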
- Total MACs: 256 per core (one per PE), 4,096 across the 16-core cluster
- Precision: INT8 + INT16 for inference only
- Peak TOPS: 17 (Apple's figure for the A16 Bionic; the A17 Pro is rated at 35 TOPS), assuming a clock near 2 GHz
- Memory: Shared with CPU/GPU (no dedicated HBM)
- Die area: ~1.5 mm² per neural core
- Power: 2W sustained inference
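The peak-TOPS figure can be sanity-checked from the array size. The sketch below assumes each MAC counts as two operations (a multiply and an add) and that every PE is busy on every cycle; `peak_tops` is a hypothetical helper, and the 700 MHz clock in the second example is the rate published for TPU v1, not a number from this comparison.

```python
def peak_tops(rows, cols, cores, clock_hz):
    """Theoretical peak in tera-operations per second.

    Assumes one MAC per PE per cycle and 2 operations per MAC.
    """
    macs_per_cycle = rows * cols * cores
    ops_per_cycle = 2 * macs_per_cycle
    return ops_per_cycle * clock_hz / 1e12


if __name__ == "__main__":
    # 16 cores of 16x16 PEs at 2 GHz, the figures listed above.
    print(f"{peak_tops(16, 16, 16, 2e9):.1f}")   # 16.4
    # One 256x256 array at an assumed 700 MHz clock.
    print(f"{peak_tops(256, 256, 1, 700e6):.1f}")  # 91.8
```

The first result lands close to the quoted 17 TOPS, which suggests the headline number is a theoretical peak rather than a sustained rate.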
Design Choices
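- Why 16×16? Small enough for a mobile power budget, and common input sizes tile evenly (a 224×224 image splits into 14×14 blocks of 16×16)
- Why integer-only? Inference runs on quantized INT8/INT16 weights; training happens offline on GPUs
- Shared memory? Uses the SoC's LPDDR5 (low-power DDR); the added latency is acceptable at a 2W budget
- 16 cores? Independent cores run separate tiles or layers in parallel; the count is not tied to the CPU core count (the A17 Pro has 6 CPU cores)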
NVIDIA H100 (GPU Alternative)
Tensor Cores: 528 (4 per SM across 132 SMs on the SXM part), with sparsity support
- Total MACs: 1.4M (Tensor Cores, not traditional systolic)
- Precision: FP64, FP32, TF32, FP16, BF16, FP8, INT8 (all workloads)
- Peak TFLOPS: 1,979 (FP16/BF16 in sparse mode, SXM part); about 989 dense
- Memory: 80 GB HBM3
- Cache hierarchy: L1, L2, shared memory (GPU paradigm)
- Flexibility: General-purpose compute (not just AI)
Design Choices
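- Not pure systolic? Tensor Cores are small matrix-multiply-accumulate units fed from registers and shared memory, which is more flexible than a fixed systolic dataflow
- Why more memory? Must hold large models and batches for variable workloads and multi-GPU distributed training
- Cache hierarchy? Needed for non-matrix operations (activation functions, layer norm, the softmax inside attention)
- Sparse mode? 2:4 structured sparsity (two of every four weights are zero) lets the Tensor Cores skip the zeros, for up to 2× throughput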
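The sparse mode relies on a fixed pattern: in every group of four weights, at most two are nonzero. A minimal sketch of producing that pattern by magnitude pruning follows; `prune_2_of_4` is a hypothetical helper for illustration, not NVIDIA's tooling, and real workflows typically fine-tune after pruning to recover accuracy.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in every group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


if __name__ == "__main__":
    w = [0.1, -0.9, 0.5, 0.05, 1.0, 0.2, -0.3, 0.0]
    print(prune_2_of_4(w))
    # [0.0, -0.9, 0.5, 0.0, 1.0, 0.0, -0.3, 0.0]
```

Because exactly half the multiplications are known to involve a zero, the hardware can skip them, which is where the 2× upper bound comes from.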
Comparison: Size vs Performance
| Design | MACs | Peak TFLOPS/W (TOPS/W for integer) | Workloads | Context |
|---|---|---|---|---|
| Apple 16×16 | 256 per core | 8.5 | Inference only | Mobile |
| TPU 256×256 | 65.5K | 2.9 | Train + Infer | Data center |
| H100 Tensor | 1.4M | 2.0 | All compute | Data center |
Key Insight: Scaling Law
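Efficiency depends on matching array size to problem size: an array only pays off when the matrices are large enough to keep every PE busy.
- 16×16 array: Good for small tiles, e.g. 224×224 images (typical for vision models), which split into 14×14 blocks
- 256×256 array: Good for large transformer layers (e.g. 8K×8K projection matrices, all-to-all attention)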
- Larger arrays hit wire-delay and power limits
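The size-matching point can be made concrete with a simple utilization model. This sketch assumes an m×n matrix is cut into square tiles the size of the array and that edge tiles are zero-padded; it ignores pipeline fill and drain, so real utilization is lower. The function name `utilization` is hypothetical.

```python
import math


def utilization(m, n, array_dim):
    """Fraction of PE slots doing useful work when an m x n matrix is tiled
    onto array_dim x array_dim blocks (edge tiles are zero-padded)."""
    tiles = math.ceil(m / array_dim) * math.ceil(n / array_dim)
    return (m * n) / (tiles * array_dim * array_dim)


if __name__ == "__main__":
    print(utilization(224, 224, 16))     # 1.0
    print(utilization(224, 224, 256))    # 0.765625
    print(utilization(8192, 8192, 256))  # 1.0
```

A 224×224 input fills a 16×16 array perfectly but leaves nearly a quarter of a 256×256 array idle, while an 8K×8K layer keeps the large array fully occupied.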
Tomorrow (Day 11): Quantization - how FP32 becomes INT8 with minimal accuracy loss.