Day 10 Enhanced

Real AI Chip Implementations

Deep technical analysis of three production designs: Apple Neural Engine, Google TPU v4, NVIDIA H100. Architecture decisions, constraints, and real-world performance data.

The Design Philosophy: Trade-offs at Scale

When designing an AI chip, engineers face fundamental trade-offs. Each company resolved them differently based on its market constraints, as the three designs below show.

Apple Neural Engine (A17 Pro): Inference-Only, Extreme Power Efficiency

Market Constraint: 2W Power Budget

iPhones run on batteries. Peak SoC power is ~10W, and the Neural Engine must fit in 2W, a 20% share of that budget.

Design Decisions

🎯 Architecture: 16×16 Systolic per Core (Not 256×256)

  • Why not larger? Power scales with compute. 256×256 = 65,536 MACs @ ~2 mW each ≈ 131 W. Infeasible within a 2W budget.
  • 16×16 = 256 MACs @ 2 mW each. Multiple cores provide parallelism through batching.
  • 16 neural cores total (task parallelism across cores)
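
The scaling argument in the bullets above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes every MAC draws the ~2 mW quoted in the text; `MAC_POWER_W` and `array_power_watts` are illustrative names, not part of any vendor tool.

```python
# Back-of-envelope power for an n x n systolic array.
# ASSUMPTION: every MAC draws the same ~2 mW quoted in the text.
MAC_POWER_W = 2e-3

def array_power_watts(n: int, mac_power_w: float = MAC_POWER_W) -> float:
    """Power of an n x n MAC array with all units active."""
    return n * n * mac_power_w

big = array_power_watts(256)   # 65,536 MACs
small = array_power_watts(16)  # 256 MACs
print(f"256x256: {big:.1f} W, 16x16: {small:.3f} W")
```

Under the same assumed figure, 16 fully active 16×16 cores would still add up to more than 2W, so the 2 mW number is probably best read as a rough per-MAC ceiling rather than a sustained average.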

🎯 Precision: INT8 Only (No FP32, No Training)

  • Integer multiply uses 4× less power than floating-point
  • Training happens offline on GPU, not on phone
  • FP32 layer norm and bias addition done on CPU (acceptable latency)
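
To make the INT8-only choice concrete, here is a minimal sketch of symmetric per-tensor quantization followed by an integer dot product. The scheme and the helper names (`quantize_int8`, `int8_dot`) are assumptions chosen for illustration; the source does not say which quantization scheme the Neural Engine uses.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: real ~= scale * q, with q in [-127, 127]."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def int8_dot(qa, sa, qb, sb):
    """Integer multiply-accumulate in a wide accumulator, then one float rescale."""
    acc = sum(x * y for x, y in zip(qa, qb))  # Python ints do not overflow
    return acc * sa * sb

weights = [0.50, -1.00, 0.25, 0.75]
activations = [1.0, 2.0, -1.0, 0.5]

qw, sw = quantize_int8(weights)
qa, sa = quantize_int8(activations)

exact = sum(w * a for w, a in zip(weights, activations))
approx = int8_dot(qw, sw, qa, sa)
print(f"float result: {exact:.4f}, int8 result: {approx:.4f}")
```

All multiplies inside `int8_dot` are integer operations; floating point appears only in the final rescale, which is the part the text assigns to the CPU.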

🎯 Memory: Shared LPDDR5X (No HBM)

  • HBM requires expensive 3D stacking (adds $200+ per chip)
  • LPDDR5X @ 120 GB/s is "slow" (vs TPU's 2 TB/s) but acceptable for small batches
  • Shared with CPU/GPU (no dedicated bus = lower power)
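
A rough way to see why 120 GB/s can be "acceptable" is to estimate how long it takes to stream a model's weights once from DRAM. The sketch below assumes a hypothetical 25 MB INT8 model and ignores activations, caching, and bus contention from the CPU and GPU.

```python
def weight_stream_ms(model_mb: float, bandwidth_gb_s: float) -> float:
    """Milliseconds needed to read every weight once from memory."""
    seconds = (model_mb * 1e6) / (bandwidth_gb_s * 1e9)
    return seconds * 1e3

MODEL_MB = 25.0  # ASSUMPTION: illustrative INT8 model size

lpddr_ms = weight_stream_ms(MODEL_MB, 120.0)   # shared LPDDR5X figure from the text
hbm_ms = weight_stream_ms(MODEL_MB, 2000.0)    # 2 TB/s HBM figure from the text
print(f"LPDDR5X: {lpddr_ms:.3f} ms, HBM: {hbm_ms:.4f} ms per pass over the weights")
```

At batch size 1 each weight is read roughly once per inference, so a fraction of a millisecond of streaming time is small next to latencies measured in tens of milliseconds.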

Specification | A17 Pro ANE | A17 Pro Total SoC
--- | --- | ---
Peak TOPS | 17 (all 16 cores) | N/A (CPU: 2.2 TFLOPS)
Cores | 16 | 6 (2P + 4E)
Total MACs | 4,096 | N/A
Die Area | ~1.5 mm² per core | ~170 mm² total
Power | 2W sustained | 10W peak
Memory | 8 GB LPDDR5X | Shared
Precision | INT8 only | FP32 (CPU)
Workload | Inference | Training (GPU)

Real-World Performance

Test: MobileNetV3 inference on iPhone 15 Pro

Model | Params | Size (INT8) | Latency | Power | Accuracy
--- | --- | --- | --- | --- | ---
MobileNetV3 Small | 2.5M | 2.5 MB | 8 ms | 0.5W | 67.4%
MobileNetV3 Large | 5.4M | 5.5 MB | 15 ms | 1.2W | 72.2%
ResNet-50 (pruned) | 25M | 25 MB | 35 ms | 1.8W | 75%

Actual throughput: 12 TOPS sustained (70% of peak). Why not 100%? Non-matrix ops (layer norm, activation) run on CPU.
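
Latency and power together give energy per inference, which is often the more useful number on a battery. The sketch below uses the latency and power figures from the table above and assumes power stays constant for the whole inference; the helper name `energy_mj` is illustrative.

```python
# (latency in ms, power in W) copied from the table above
measurements = {
    "MobileNetV3 Small": (8, 0.5),
    "MobileNetV3 Large": (15, 1.2),
    "ResNet-50 (pruned)": (35, 1.8),
}

def energy_mj(latency_ms: float, power_w: float) -> float:
    """Energy per inference in millijoules (ms x W = mJ)."""
    return latency_ms * power_w

for model, (latency_ms, power_w) in measurements.items():
    e = energy_mj(latency_ms, power_w)
    print(f"{model}: {e:.1f} mJ per inference, ~{1000 / e:.0f} inferences per joule")

sustained_fraction = 12 / 17  # sustained TOPS / peak TOPS
print(f"Sustained fraction of peak: {sustained_fraction:.1%}")
```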

---

Google TPU v4: Data Center Scale, Both Training & Inference

Market Constraint: Unlimited Power, Maximize TFLOPS

Data centers have cooling and power supplies. Optimize for throughput, not wattage.

Design Decisions

🎯 Architecture: 256×256 Systolic (Maximum Scale)

  • Why 256×256? Balances area (80 mm²) vs compute (430 TFLOPS). Larger = diminishing returns (wiring complexity).
  • Single chip supports: one 256×256 array, i.e. up to 65,536 multiply-accumulates per clock cycle
  • Pod configuration: 8 chips in a 2×2×2 cube with all-reduce tree for distributed training
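
For readers who have not seen a systolic array in motion, the following is a small cycle-by-cycle simulation. It is a sketch under stated assumptions: it uses an output-stationary dataflow (each processing element accumulates one output value) because that is the simplest to follow, and the source does not say which dataflow the TPU uses. The function name `systolic_matmul` is illustrative.

```python
def systolic_matmul(A, B):
    """Simulate an n x n output-stationary systolic array computing A @ B.

    A values flow left to right, B values flow top to bottom, and
    PE(i, j) accumulates C[i][j]. Inputs are skewed by one cycle per row/column.
    """
    n = len(A)
    acc = [[0] * n for _ in range(n)]
    a_reg = [[0] * n for _ in range(n)]  # operand currently held, moving right
    b_reg = [[0] * n for _ in range(n)]  # operand currently held, moving down
    cycles = 3 * n - 2                   # last PE sees its final operands at t = 3n - 3

    for t in range(cycles):
        # Shift operands one PE to the right / down (far edge first).
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Inject skewed inputs at the left and top edges.
        for i in range(n):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0
        for j in range(n):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0
        # Every PE performs one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles

result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, "in", cycles, "cycles")
```

Operands are read from memory once at the array edge and then reused as they pass from neighbor to neighbor, which is why this style of design needs comparatively little control logic.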

🎯 Precision: BF16 + INT8 (Training + Inference)

  • BF16 for forward/backward (same 8-bit exponent range as FP32, so far less underflow than FP16)
  • INT8 for inference (2× faster than BF16, 4× less memory than FP32)
  • FP32 for loss scaling and master weights (kept on chip in SRAM)
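
The relationship between BF16 and FP32 is easy to see at the bit level: BF16 keeps the sign bit, the 8 exponent bits, and the top 7 mantissa bits of an FP32 value. The sketch below converts by simple truncation, which is an assumption made for clarity; real hardware typically rounds to nearest. The helper names are illustrative.

```python
import struct

def fp32_bits(x: float) -> int:
    """Raw 32-bit pattern of x stored as an IEEE 754 single."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def to_bf16(x: float) -> float:
    """Truncate FP32 to BF16: keep 1 sign + 8 exponent + 7 mantissa bits."""
    bits = fp32_bits(x) & 0xFFFF0000  # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(to_bf16(3.14159))  # precision is reduced
print(to_bf16(1e-30))    # tiny values keep their magnitude (FP32-sized exponent)
```

Because the exponent field is unchanged, very small gradients that FP32 can represent remain nonzero in BF16; only mantissa precision is given up.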

🎯 Memory: 8 GB HBM3 + 24 MB SRAM

  • HBM = 2 TB/s (256× TPU's needs—reduces memory stalls)
  • SRAM = weight cache (holds one layer's parameters)
  • No L1/L2/L3 hierarchy needed (dataflow predicts memory access)
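
One way to reason about whether memory bandwidth keeps the array fed is the roofline model: a kernel is compute-bound only if it performs enough FLOPs per byte moved. The sketch below applies that model to the 430 TFLOPS and 2 TB/s figures quoted above; the function names and the example intensity of 50 FLOPs/byte are illustrative assumptions.

```python
def ridge_point(peak_tflops: float, bandwidth_tb_s: float) -> float:
    """FLOPs per byte needed before compute, not memory, is the limit."""
    return peak_tflops / bandwidth_tb_s

def attainable_tflops(intensity: float, peak_tflops: float, bandwidth_tb_s: float) -> float:
    """Roofline bound: min(compute roof, bandwidth x arithmetic intensity)."""
    return min(peak_tflops, intensity * bandwidth_tb_s)

tpu_ridge = ridge_point(430, 2.0)
low_intensity = attainable_tflops(50, 430, 2.0)     # e.g. a memory-bound op
high_intensity = attainable_tflops(1000, 430, 2.0)  # e.g. a large matmul
print(f"ridge: {tpu_ridge:.0f} FLOPs/byte, "
      f"bound at 50: {low_intensity:.0f} TFLOPS, at 1000: {high_intensity:.0f} TFLOPS")
```

Keeping a layer's weights in on-chip SRAM raises the effective intensity seen by HBM, which is one reading of why the design pairs a weight cache with a large array.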

Specification | TPU v4 (Single) | TPU v4 Pod (8 chips)
--- | --- | ---
Systolic Array | 256×256 | 8× independent
Peak TFLOPS | 430 (BF16) | 3,440
Peak INT8 TOPS | 860 | 6,880
Memory (HBM) | 8 GB @ 2 TB/s | 64 GB total
SRAM | 24 MB | 192 MB total
Die Area | 80 mm² | 640 mm²
Power | 200W | 1.6 kW
Interconnect | N/A | Mesh topology

Real-World Performance: LLaMA 70B Training

Setup: 8 TPU v4 chips, mixed-precision (BF16 compute, FP32 master)

Metric | Value | Analysis
--- | --- | ---
Model size | 70B params | 280 GB @ FP32, 140 GB @ BF16
Global batch | 512 | 64 per TPU × 8
Sequence length | 2,048 | Standard transformer length
Peak TFLOPS | 3,440 | All 8 chips @ 430 TFLOPS
Actual throughput | 2,500 TFLOPS | 73% of peak
Time per epoch | ~8 hours | ~3.5B tokens at ~121K tokens/sec (3.5B tokens / 28,800 s)
Cost per hour | ~$7 (at Google internal rates) | Training cost: $56/epoch

Why 73% and not 100%? Non-matrix ops (attention softmax, layer norm, all-reduce) reduce effective throughput.
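
The derived figures in this training example follow from a few lines of arithmetic, sketched below with the numbers quoted above as inputs. The variable names are illustrative, and the token rate is simply what an 8-hour epoch over 3.5B tokens would imply, not a measured value.

```python
def model_size_gb(params_billion: float, bytes_per_param: int) -> float:
    """Parameter memory in GB (1e9 params x bytes per param / 1e9)."""
    return params_billion * bytes_per_param

fp32_gb = model_size_gb(70, 4)   # FP32 master weights
bf16_gb = model_size_gb(70, 2)   # BF16 compute copy

peak_tflops = 8 * 430            # 8 chips at 430 TFLOPS each
utilization = 2500 / peak_tflops

epoch_hours = 8
epoch_cost_usd = epoch_hours * 7                       # ~$7 per hour
implied_tokens_per_sec = 3.5e9 / (epoch_hours * 3600)  # rate implied by the epoch time

print(f"{fp32_gb:.0f} GB FP32, {bf16_gb:.0f} GB BF16")
print(f"utilization: {utilization:.1%}, cost per epoch: ${epoch_cost_usd}")
print(f"implied throughput: {implied_tokens_per_sec:,.0f} tokens/sec")
```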

---

NVIDIA H100: Maximum Flexibility, Highest Peak TFLOPS

Market Constraint: Support Any Workload

Data center customers want one GPU for training, inference, and HPC. Maximum compatibility wins.

Design Decisions

🎯 Architecture: Tensor Cores (Not Pure Systolic)

  • 132 Streaming Multiprocessors (SMs), each containing Tensor Cores (vs TPU's single 256×256 systolic array)
  • Each Tensor Core: SIMD-like (warp computes 16×16 matmul)
  • Flexibility: Can do variable-size operations, loops, branches
  • Cost: 40% more control logic than TPU (area overhead)

🎯 Precision: FP32, TF32, FP16, FP8, INT8 (All Supported)

  • Automatic Mixed Precision (AMP) switches precision per operation
  • FP8 support (NEW in Hopper) for ultra-low precision
  • Sparsity support: 2:4 structured sparsity = 2× effective throughput
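
The 2:4 pattern means that in every group of four consecutive weights, at most two are nonzero. A common way to reach that pattern is magnitude pruning, sketched below; the function name `prune_2_of_4` is illustrative, and this shows only the pruning step, not the compressed storage format or the hardware path.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in every group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest magnitudes in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned

w = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
print(prune_2_of_4(w))
```

Since half of the multiplies in each group are known to be zero, the hardware can skip them, which is where the quoted 2× effective throughput comes from.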

🎯 Memory: 80 GB HBM3 + L1/L2 Cache Hierarchy

  • HBM3 @ 3 TB/s (1.5× faster than TPU)
  • L1: 128 KB per SM (SM = Streaming Multiprocessor)
  • L2: 50 MB shared (fast coherent access)
  • L3: Not implemented (GPU-to-GPU coherence via NVLink)

Specification | H100 (Single) | DGX H100 (8 GPUs)
--- | --- | ---
Tensor Cores | 132 clusters | 1,056 clusters
Peak TFLOPS (FP32) | 1,450 | 11,600
Peak TFLOPS (Sparsity) | 2,900 | 23,200
Peak TFLOPS (FP8) | 1,450 | 11,600
Memory (HBM3) | 80 GB @ 3 TB/s | 640 GB total
Die Area | 815 mm² | 6,520 mm² (8 dies)
Power | 700W | 5.6 kW
Interconnect | N/A | NVLink, 900 GB/s per GPU

Real-World Performance: GPT-3 175B Training

Setup: 8 H100 GPUs with Hopper sparsity (2:4 pruning)

Metric | Value | Analysis
--- | --- | ---
Model size | 175B params | 700 GB @ FP32, 350 GB @ BF16
Global batch | 1,024 | 128 per GPU × 8
Sequence length | 2,048 | Standard
Peak TFLOPS (with sparsity) | 23,200 | All 8 GPUs × 2,900 sparse TFLOPS
Actual throughput | 8,000 TFLOPS | 34% of peak (variability in sparsity pattern)
Time per epoch | ~24 hours | Much larger model than TPU example
Cost per hour | ~$20 (AWS on-demand) | Training cost: $480/epoch

Why only 34%? Tensor Cores require careful warp scheduling. Mixed workloads (attention, normalization, activation) can't feed all 132 cores simultaneously.

---

Detailed Comparison: Efficiency & Real-World Numbers

🍎 Apple A17 ANE

Peak TOPS: 17 TOPS
Sustained %: 70%
TOPS/Watt: 8.5
Die Area: 1.5 mm²/core
Cost/TOPS: $5.88

🔵 Google TPU v4

Peak TFLOPS: 430 TFLOPS
Sustained %: 73%
TFLOPS/Watt: 2.15
Die Area: 80 mm²
Cost/TFLOPS: $6.98

🟢 NVIDIA H100

Peak TFLOPS: 1,450 TFLOPS
Sustained %: 34%
TFLOPS/Watt: 2.07
Die Area: 815 mm²
Cost/TFLOPS: $3.45
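
The efficiency figures above can be recomputed from peak throughput and power, and the cost figures look different once sustained utilization is taken into account. The sketch below uses only the numbers from this comparison; the dictionary layout and function names are illustrative, and Apple's INT8 TOPS are not directly comparable to the others' TFLOPS.

```python
chips = {
    "Apple A17 ANE": {"peak": 17, "watts": 2, "sustained": 0.70, "cost_per_peak": 5.88},
    "Google TPU v4": {"peak": 430, "watts": 200, "sustained": 0.73, "cost_per_peak": 6.98},
    "NVIDIA H100": {"peak": 1450, "watts": 700, "sustained": 0.34, "cost_per_peak": 3.45},
}

def perf_per_watt(chip: dict) -> float:
    """Peak throughput divided by power (TOPS/W or TFLOPS/W)."""
    return chip["peak"] / chip["watts"]

def cost_per_sustained_unit(chip: dict) -> float:
    """Dollars per unit of throughput actually sustained, not peak."""
    return chip["cost_per_peak"] / chip["sustained"]

for name, chip in chips.items():
    print(f"{name}: {perf_per_watt(chip):.2f} per watt, "
          f"${cost_per_sustained_unit(chip):.2f} per sustained unit")
```

On these numbers the H100 is cheapest per peak TFLOPS, while the TPU v4 comes out slightly ahead per sustained TFLOPS (about $9.56 against about $10.15).
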
---

Key Insights: Why These Design Choices?

Apple: Inference-Only Wins on TOPS/Watt

Why 8.5 TOPS/W (INT8) vs ~2.1 TFLOPS/W for the others?

Google: Sustained Efficiency Through Systolic Design

Why 73% sustained vs 34% on H100?

NVIDIA: Flexibility at Cost of Efficiency

Why headline 1,450 TFLOPS but only 34% sustained?

---

The Verdict: Which Is Best?

It depends on the workload:

Workload | Winner | Reason
--- | --- | ---
Mobile inference (power-limited) | Apple ANE | 8.5 TOPS/W + instant availability
Pure matrix multiply training | TPU v4 | 73% sustained + lower cost per sustained TFLOPS (~$9.56 vs ~$10.15 for H100)
Mixed workloads (inference + NLP) | H100 | Supports all ops, 11,600 TFLOPS (11.6 PFLOPS) peak per 8-GPU DGX
Sparsity-heavy models | H100 | 2× effective throughput with 2:4 sparsity

Next (Day 11): How floating-point precision affects these designs.