The Design Philosophy: Trade-offs at Scale
When designing an AI chip, engineers face fundamental trade-offs:
- Performance vs Power: Adding MACs raises power roughly linearly, while raising the clock (and the voltage needed to sustain it) raises power superlinearly
- Specialization vs Flexibility: Systolic arrays are fast for matrix math but slow at everything else
- Area vs Cost: Larger dies are powerful but expensive (manufacturing yield drops)
- Memory Bandwidth vs Cost: HBM delivers roughly 10× the bandwidth of mobile DRAM but requires expensive 3D stacking
- Training vs Inference: Supporting both requires more control logic and different precisions
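The performance-vs-power trade-off can be made concrete with the usual first-order CMOS dynamic-power approximation, P ≈ activity × C × V² × f. The sketch below is only an illustration: the capacitance, voltage, and activity values are assumptions chosen for round numbers, not figures from any of the chips discussed here.

```python
def dynamic_power_watts(num_macs, cap_farads, voltage, freq_hz, activity=0.5):
    """First-order dynamic power: activity * C * V^2 * f, summed over all MACs."""
    return num_macs * activity * cap_farads * voltage ** 2 * freq_hz


# Assumed, illustrative values (not measured on real hardware).
base = dynamic_power_watts(num_macs=256, cap_farads=1e-12, voltage=0.8, freq_hz=1e9)

# Doubling the MAC count at the same voltage and clock.
more_macs = dynamic_power_watts(512, 1e-12, 0.8, 1e9)

# Raising the clock by 1.5x, assuming voltage must rise by 1.5x as well.
faster = dynamic_power_watts(256, 1e-12, 0.8 * 1.5, 1.5e9)

print(f"2x MACs   -> {more_macs / base:.2f}x power")
print(f"1.5x clock -> {faster / base:.2f}x power")
```

Under these assumptions, doubling the MACs doubles power, while a 1.5× clock increase costs roughly 3.4× the power because voltage enters squared.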
Each company solved these differently based on their market constraints:
Apple Neural Engine (A17 Pro): Inference-Only, Extreme Power Efficiency
Market Constraint: 2W Power Budget
iPhones run on batteries. Total SoC power at peak is ~10W, so the Neural Engine must fit in about 2W, a 20% share of the budget.
Design Decisions
🎯 Architecture: 16×16 Systolic per Core (Not 256×256)
- Why not larger? Power scales with compute. 256×256 = 65,536 MACs; at ~2 mW each that is ~131 W. Infeasible.
- 16×16 = 256 MACs; at ~2 mW each that is ~0.5 W per core. Multiple cores provide parallelism through batching.
- 16 neural cores total, so independent tiles or layers can run in parallel (task parallelism)
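The array-size argument above is simple arithmetic, sketched here under the text's own assumption of about 2 mW per active MAC. The helper name `array_power_watts` is hypothetical and the per-MAC figure is a rough estimate, not a measured value.

```python
MW_PER_MAC = 2.0  # assumed average power per active MAC in milliwatts (from the text)


def array_power_watts(rows, cols, mw_per_mac=MW_PER_MAC):
    """Power of a rows x cols MAC array if every MAC is active."""
    return rows * cols * mw_per_mac / 1000.0


large = array_power_watts(256, 256)  # data-center-sized array
small = array_power_watts(16, 16)    # one phone-sized core

print(f"256x256 array: {large:.1f} W")
print(f"16x16 core:    {small:.3f} W")
print(f"16 cores:      {16 * small:.1f} W if all MACs are active at once")
```

Note that with this per-MAC figure, 16 fully active cores would draw about 8 W, so the 2 mW number is best read as a pessimistic upper bound: staying within 2 W requires either lower per-MAC power or not running every core at full activity.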
🎯 Precision: INT8 Only (No FP32, No Training)
- Integer multiply uses 4× less power than floating-point
- Training happens offline on GPU, not on phone
- FP32 layer norm and bias addition done on CPU (acceptable latency)
🎯 Memory: Shared LPDDR5X (No HBM)
- HBM requires expensive 3D stacking (adds $200+ per chip)
- LPDDR5X @ 120 GB/s is "slow" (vs TPU's 2 TB/s) but acceptable for small batches
- Shared with CPU/GPU (no dedicated bus = lower power)
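To see why the lower bandwidth is acceptable for small models, it helps to estimate how long it takes to stream a model's weights once. This is a rough sketch: it uses the 5.5 MB INT8 size quoted for MobileNetV3 Large in this section and the two bandwidth figures from the text, and it ignores activations, caching, and contention from the CPU and GPU.

```python
def stream_time_us(num_bytes, bandwidth_gb_s):
    """Microseconds needed to read num_bytes once at the given bandwidth (decimal GB/s)."""
    return num_bytes / (bandwidth_gb_s * 1e9) * 1e6


model_bytes = 5.5e6  # MobileNetV3 Large at INT8, as quoted in this section

lpddr_us = stream_time_us(model_bytes, 120)   # shared LPDDR5X figure from the text
hbm_us = stream_time_us(model_bytes, 2000)    # TPU-class HBM figure from the text

print(f"LPDDR5X: {lpddr_us:.1f} us per full weight read")
print(f"HBM:     {hbm_us:.2f} us per full weight read")
```

Even at 120 GB/s, one pass over the weights takes well under 0.1 ms, which is small next to the multi-millisecond latencies reported for these models.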
| Specification | A17 Pro ANE | A17 Pro Total SoC |
|---|---|---|
| Peak TOPS | 17 (all 16 cores) | N/A (GPU: ~2.2 TFLOPS) |
| Cores | 16 | 6 (2P + 4E) |
| Total MACs | 4,096 | N/A |
| Die Area | ~1.5 mm² per core | ~170 mm² total |
| Power | 2W sustained | 10W peak |
| Memory | 8 GB LPDDR5X | Shared |
| Precision | INT8 only | FP32 (CPU) |
| Workload | Inference | Training (GPU) |
Real-World Performance
Test: MobileNetV3 inference on iPhone 15 Pro
| Model | Params | Size (INT8) | Latency | Power | Accuracy |
|---|---|---|---|---|---|
| MobileNetV3 Small | 2.5M | 2.5 MB | 8 ms | 0.5W | 67.4% |
| MobileNetV3 Large | 5.4M | 5.5 MB | 15 ms | 1.2W | 72.2% |
| ResNet-50 (pruned) | 25M | 25 MB | 35 ms | 1.8W | 75% |
Actual throughput: 12 TOPS sustained (70% of peak). Why not 100%? Non-matrix ops (layer norm, activation) run on CPU.
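On a battery-powered device, the figure that matters is energy per inference, which follows from the table as latency × power. The sketch below derives it from the numbers above, along with the sustained-to-peak ratio; the dictionary layout is just one way to hold the table.

```python
# (latency in ms, power in W) taken from the table above
results = {
    "MobileNetV3 Small": (8, 0.5),
    "MobileNetV3 Large": (15, 1.2),
    "ResNet-50 (pruned)": (35, 1.8),
}


def energy_mj(latency_ms, power_w):
    """Energy per inference in millijoules (ms * W = mJ)."""
    return latency_ms * power_w


energy = {name: energy_mj(lat, pwr) for name, (lat, pwr) in results.items()}
utilization = 12 / 17  # sustained TOPS / peak TOPS

for name, mj in energy.items():
    print(f"{name}: {mj:.1f} mJ per inference")
print(f"Sustained utilization: {utilization:.0%}")
```

The larger models cost more energy per inference on both axes at once: they run longer and draw more power while doing so.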

---

Google TPU v4: Data Center Scale, Both Training & Inference
Market Constraint: Unlimited Power, Maximize TFLOPS
Data centers have cooling and power supplies. Optimize for throughput, not wattage.
Design Decisions
🎯 Architecture: 256×256 Systolic (Maximum Scale)
- Why 256×256? Balances area (80 mm²) vs compute (430 TFLOPS). Larger = diminishing returns (wiring complexity).
- Single chip supports: 65,536 multiply-accumulates per clock cycle once the array's pipeline is full (a full 256×256 matrix multiply takes many cycles)
- Pod configuration: 8 chips in a 2×2×2 cube with all-reduce tree for distributed training
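How a systolic array keeps its MACs busy is easier to see in a small simulation. The sketch below models an output-stationary array, where each processing element (PE) owns one output value and operands arrive skewed by one cycle per row and column. This variant is an assumption chosen for simplicity; it is a functional toy model, not a description of how the TPU's hardware is actually organized.

```python
def systolic_matmul(a, b):
    """Simulate an output-stationary systolic array computing a @ b.

    PE(i, j) accumulates c[i][j]. Row i of `a` enters from the left delayed by
    i cycles and column j of `b` enters from the top delayed by j cycles, so
    PE(i, j) sees operand pair k at cycle t = i + j + k.
    Returns (result_matrix, total_cycles).
    """
    m, k_dim, n = len(a), len(b), len(b[0])
    acc = [[0] * n for _ in range(m)]
    total_cycles = k_dim + m + n - 2  # last PE finishes after the full skew

    for t in range(total_cycles):
        for i in range(m):
            for j in range(n):
                k = t - i - j  # which operand pair reaches PE(i, j) this cycle
                if 0 <= k < k_dim:
                    acc[i][j] += a[i][k] * b[k][j]
    return acc, total_cycles


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c, cycles = systolic_matmul(a, b)
print(c)       # [[19, 22], [43, 50]]
print(cycles)  # 4
```

The fill and drain overhead (`m + n - 2` cycles) is fixed, so it becomes negligible when long streams of operands flow through a large array, which is why large matrix multiplies reach high utilization.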
🎯 Precision: BF16 + INT8 (Training + Inference)
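- BF16 for forward/backward (same 8-bit exponent as FP32, so the dynamic range matches and underflow is rare)
- INT8 for inference (2× the throughput of BF16, 4× less memory than FP32)
- FP32 for master weights and accumulation (at 280 GB for 70B parameters, master weights live in HBM, not in the 24 MB SRAM)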
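BF16 is simply the top 16 bits of an FP32 value: the sign bit, all 8 exponent bits, and the 7 highest mantissa bits. The sketch below converts by truncation using only the standard library. Truncation is an assumption made for brevity; real hardware typically rounds to nearest even.

```python
import struct


def to_bf16(x):
    """Convert a float to BF16 by truncation and return it as a Python float."""
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    bits16 = bits32 >> 16  # keep sign, 8 exponent bits, top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]


print(to_bf16(3.0))    # exactly representable
print(to_bf16(1.001))  # precision loss: only 7 mantissa bits survive
print(to_bf16(1e-30))  # tiny value stays nonzero thanks to the FP32-sized exponent
```

The second line shows the cost (coarse precision near 1.0), and the third shows the benefit: values far below FP16's range remain representable, which is why BF16 training usually does not need loss scaling.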
🎯 Memory: 8 GB HBM3 + 24 MB SRAM
- HBM = 2 TB/s (high enough to keep the array fed and reduce memory stalls)
- SRAM = weight cache (holds one layer's parameters)
- No L1/L2/L3 hierarchy needed (dataflow predicts memory access)
| Specification | TPU v4 (Single) | TPU v4 Pod (8 chips) |
|---|---|---|
| Systolic Array | 256×256 | 8× independent |
| Peak TFLOPS | 430 (BF16) | 3,440 |
| Peak INT8 TOPS | 860 | 6,880 |
| Memory (HBM) | 8 GB @ 2 TB/s | 64 GB total |
| SRAM | 24 MB | 192 MB total |
| Die Area | 80 mm² | 640 mm² |
| Power | 200W | 1.6 kW |
| Interconnect | N/A | Mesh topology |
Real-World Performance: LLaMA 70B Training
Setup: 8 TPU v4 chips, mixed-precision (BF16 compute, FP32 master)
| Metric | Value | Analysis |
|---|---|---|
| Model size | 70B params | 280 GB @ FP32, 140 GB @ BF16 |
| Global batch | 512 | 64 per TPU × 8 |
| Sequence length | 2,048 | Standard transformer length |
| Peak TFLOPS | 3,440 | All 8 chips @ 430 TFLOPS |
| Actual throughput | 2,500 TFLOPS | 73% of peak |
| Time per epoch | ~8 hours | ~3.5B tokens at ~120K tokens/sec |
| Cost per hour | ~$7 (at Google internal rates) | Training cost: $56/epoch |
Why 73% and not 100%? Non-matrix ops (attention softmax, layer norm, all-reduce) reduce effective throughput.
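The model-size and utilization figures in the table follow from two one-line formulas, sketched below. The helper names are hypothetical, and the footprint counts weights only, ignoring optimizer state, gradients, and activations.

```python
BYTES_PER_PARAM = {"FP32": 4, "BF16": 2, "INT8": 1}


def weights_gb(num_params, precision):
    """Weight storage in decimal gigabytes at the given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def utilization(actual_tflops, peak_tflops):
    """Fraction of peak throughput actually sustained."""
    return actual_tflops / peak_tflops


fp32_gb = weights_gb(70e9, "FP32")
bf16_gb = weights_gb(70e9, "BF16")
pod_peak = 8 * 430                   # 8 chips at 430 TFLOPS each
util = utilization(2500, pod_peak)

print(f"70B weights: {fp32_gb:.0f} GB @ FP32, {bf16_gb:.0f} GB @ BF16")
print(f"Pod utilization: {util:.0%}")
print(f"Fits in 64 GB of pod HBM at BF16: {bf16_gb <= 64}")
```

By this estimate the BF16 weights alone exceed the 64 GB of HBM listed for an 8-chip pod, so a run like this would in practice need the model sharded across more chips or partly offloaded.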

---

NVIDIA H100: Maximum Flexibility, Highest Peak TFLOPS
Market Constraint: Support Any Workload
Data center customers want one GPU for training, inference, and HPC. Maximum compatibility wins.
Design Decisions
🎯 Architecture: Tensor Cores (Not Pure Systolic)
- 132 SMs with 4 Tensor Cores each, 528 in total (vs TPU's single 256×256 systolic array)
- Each Tensor Core: SIMD-like (warp computes 16×16 matmul)
- Flexibility: Can do variable-size operations, loops, branches
- Cost: 40% more control logic than TPU (area overhead)
🎯 Precision: FP32, TF32, BF16, FP16, FP8, INT8 (All Supported)
- Automatic Mixed Precision (AMP) in the framework selects the precision per operation
- FP8 support (NEW in Hopper) for ultra-low precision
- Sparsity support: 2:4 structured sparsity = 2× effective throughput
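The 2:4 pattern means that in every group of four consecutive weights, at most two are nonzero. A minimal sketch of magnitude-based pruning to that pattern is shown below. The function name `prune_2_of_4` is hypothetical, and this is only the pruning rule; it is not NVIDIA's tooling and does not model the compressed storage format the hardware uses.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in each consecutive group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")

    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
print(prune_2_of_4(row))  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Because exactly half the weights in each group are zero, the hardware can skip them, which is where the 2× effective throughput figure comes from. It applies only if the model tolerates this pruning.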
🎯 Memory: 80 GB HBM3 + L1/L2 Cache Hierarchy
- HBM3 @ 3 TB/s (1.5× the TPU's 2 TB/s)
- L1: 256 KB per SM, split between L1 cache and shared memory (SM = Streaming Multiprocessor)
- L2: 50 MB shared (fast coherent access)
- L3: None (GPU-to-GPU traffic goes over NVLink instead)
| Specification | H100 (Single) | DGX H100 (8 GPUs) |
|---|---|---|
| Tensor Cores | 528 (132 SMs × 4) | 4,224 (1,056 SMs × 4) |
| Peak TFLOPS (Tensor Core, dense) | 1,450 | 11,600 |
| Peak TFLOPS (2:4 sparsity) | 2,900 | 23,200 |
| Peak TFLOPS (FP8) | 1,450 | 11,600 |
| Memory (HBM3) | 80 GB @ 3 TB/s | 640 GB total |
| Die Area | 815 mm² | 6,520 mm² (8 dies) |
| Power | 700W | 5.6 kW |
| Interconnect | N/A | NVLink, 900 GB/s total per GPU (18 links) |
Real-World Performance: GPT-3 175B Training
Setup: 8 H100 GPUs with Hopper sparsity (2:4 pruning)
| Metric | Value | Analysis |
|---|---|---|
| Model size | 175B params | 700 GB @ FP32, 350 GB @ BF16 |
| Global batch | 1,024 | 128 per GPU × 8 |
| Sequence length | 2,048 | Standard |
| Peak TFLOPS (with sparsity) | 23,200 | All 8 GPUs × 2,900 sparse TFLOPS |
| Actual throughput | 8,000 TFLOPS | 34% peak (variability in sparsity pattern) |
| Time per epoch | ~24 hours | Much larger model than TPU example |
| Cost per hour | ~$20 (AWS on-demand) | Training cost: $480/epoch |
Why only 34%? Tensor Cores require careful warp scheduling. Mixed workloads (attention, normalization, activation) can't keep all 132 SMs fed simultaneously.

---

Detailed Comparison: Efficiency & Real-World Numbers
Chips compared: 🍎 Apple A17 ANE, 🔵 Google TPU v4, 🟢 NVIDIA H100
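The headline efficiency numbers can be recomputed from the specification tables in this section. The sketch below uses per-chip peak and power for throughput per watt, and the multi-chip training or inference figures for sustained fraction. It mixes INT8 TOPS (Apple) with BF16 and sparse TFLOPS (TPU, H100), so treat the cross-chip comparison as indicative only.

```python
# All figures are taken from the tables in this section.
chips = {
    "Apple A17 ANE": {"peak": 17, "power_w": 2, "sustained": 12, "system_peak": 17},
    "Google TPU v4": {"peak": 430, "power_w": 200, "sustained": 2500, "system_peak": 3440},
    "NVIDIA H100": {"peak": 1450, "power_w": 700, "sustained": 8000, "system_peak": 23200},
}

peak_per_watt = {name: c["peak"] / c["power_w"] for name, c in chips.items()}
sustained_fraction = {name: c["sustained"] / c["system_peak"] for name, c in chips.items()}

for name in chips:
    print(f"{name}: {peak_per_watt[name]:.2f} peak ops/W (tera), "
          f"{sustained_fraction[name]:.0%} of peak sustained")
```

The H100 fraction is measured against its sparse peak. Against the dense peak of 11,600 TFLOPS, the same 8,000 TFLOPS would be about 69%, which shows how strongly the choice of baseline shapes these comparisons.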
Key Insights: Why These Design Choices?
Apple: Inference-Only Wins on TOPS/Watt
Why 8.5 TOPS/W (INT8) vs ~2.1 TFLOPS/W for the others?
- No training logic (saves power)
- INT8 only (4× less power per multiply)
- Smaller array (lower wiring power)
- Battery-optimized: Designed for low continuous load
Google: Sustained Efficiency Through Systolic Design
Why 73% sustained vs 34% on H100?
- Predictable dataflow (data arrives on schedule)
- Simple control logic (no branch prediction overhead)
- Systolic architecture naturally fills MACs
- Cost: Less flexible (can't do arbitrary algorithms)
NVIDIA: Flexibility at Cost of Efficiency
Why a headline 1,450 TFLOPS (2,900 with sparsity) but only 34% of the sparse peak sustained?
- General-purpose GPU (must support any code)
- Control flow overhead (branches, loops, synchronization)
- Cache misses (unpredictable memory access patterns)
- Warp divergence (some threads idle while others compute)
- Benefit: Runs any workload, not just matrix math
The Verdict: Which Is Best?
It depends on the workload:
| Workload | Winner | Reason |
|---|---|---|
| Mobile inference (power-limited) | Apple ANE | 8.5 TOPS/W + instant availability |
| Pure matrix multiply training | TPU v4 | 73% sustained + lower cost per TFLOP |
| Mixed workloads (inference + NLP) | H100 | Supports all ops, 11.6 PFLOPS peak per DGX (8 GPUs) |
| Sparsity-heavy models | H100 | 2× effective throughput with 2:4 sparsity |
Next (Day 11): How floating-point precision affects these designs.