AI Chip Design Day 10 Enhanced

The Design Philosophy: Trade-offs at Scale

When designing an AI chip, engineers face fundamental trade-offs:

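Performance vs Power: More MACs raise power roughly linearly, while higher clock speeds raise it faster than linearly because higher frequencies usually need a higher supply voltage (dynamic power scales with V²·f)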
Specialization vs Flexibility: Systolic arrays are fast for matrix math but slow at everything else
Area vs Cost: Larger dies are powerful but expensive (manufacturing yield drops)
Memory Bandwidth vs Latency: HBM offers roughly 10× the bandwidth of mobile DRAM but requires expensive 3D stacking
Training vs Inference: Supporting both requires more control logic and different precisions
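The area-versus-cost trade-off can be made concrete with a simple Poisson yield model. This is only a sketch: the defect density `D0` is an assumed illustrative value, not a figure from any foundry, and the two die areas are the TPU and H100 areas quoted later in this article.

```python
import math

def poisson_yield(die_area_mm2: float, defects_per_cm2: float) -> float:
    """Fraction of dies expected to have zero defects (simple Poisson model)."""
    area_cm2 = die_area_mm2 / 100.0  # 1 cm^2 = 100 mm^2
    return math.exp(-area_cm2 * defects_per_cm2)

D0 = 0.1  # ASSUMPTION: defects per cm^2, chosen only for illustration

for label, area in [("80 mm² die", 80.0), ("815 mm² die", 815.0)]:
    print(f"{label}: estimated yield {poisson_yield(area, D0):.1%}")
```

Under these assumptions the small die yields above 90% while the large die yields under 50%, which is the sense in which manufacturing yield drops as area grows.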

Each company solved these differently based on their market constraints:

Apple Neural Engine (A17 Pro): Inference-Only, Extreme Power Efficiency

Market Constraint: 2W Power Budget

iPhones run on batteries. Total SoC power at peak is ~10W. The Neural Engine must fit in 2W, about 20% of that budget.

Design Decisions

🎯 Architecture: 16×16 Systolic per Core (Not 256×256)

Why not larger? Power scales with compute. 256×256 = 65,536 MACs @ 2 mW each ≈ 131 W. Infeasible.
16×16 = 256 MACs @ 2 mW each ≈ 0.5 W per core. Multiple cores provide parallelism through batching.
16 neural cores total (task parallelism across cores)
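The array-size argument above is plain arithmetic, sketched below. The 2 mW per MAC figure is the illustrative value used in the text, not a measured number, and `array_power_watts` is a hypothetical helper name.

```python
def array_power_watts(rows: int, cols: int, mw_per_mac: float) -> float:
    """Total power of a rows x cols MAC array if every MAC draws mw_per_mac milliwatts."""
    return rows * cols * mw_per_mac / 1000.0

MW_PER_MAC = 2.0  # illustrative per-MAC power taken from the text

large = array_power_watts(256, 256, MW_PER_MAC)
small = array_power_watts(16, 16, MW_PER_MAC)

print(f"256x256 array: {256 * 256} MACs, about {large:.1f} W")
print(f"16x16 array:   {16 * 16} MACs, about {small:.3f} W")
```

At this per-MAC figure, 16 fully active 16×16 cores would still draw about 8 W, so the 2 mW value is best read as a rough upper-end illustration rather than the real per-MAC power of a chip that sustains 2W.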

🎯 Precision: INT8 Only (No FP32, No Training)

Integer multiply uses 4× less power than floating-point
Training happens offline on GPU, not on phone
FP32 layer norm and bias addition done on CPU (acceptable latency)
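To show what running in INT8 implies for the weights, here is a minimal sketch of symmetric per-tensor quantization. It is a generic textbook scheme, not a description of Apple's actual quantization pipeline, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the INT8 codes."""
    return [q * scale for q in quantized]

weights = [0.8, -1.27, 0.05, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

print("INT8 codes:", codes)
print("scale:", scale)
print("max abs error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

Each weight is stored in one byte plus a single shared scale, and the rounding error per weight is bounded by half the scale.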

🎯 Memory: Shared LPDDR5X (No HBM)

HBM requires expensive 3D stacking (adds $200+ per chip)
LPDDR5X @ 120 GB/s is "slow" (vs TPU's 2 TB/s) but acceptable for small batches
Shared with CPU/GPU (no dedicated bus = lower power)
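Whether 120 GB/s is "acceptable" depends on how many operations a model performs per byte fetched. A roofline-style estimate makes this explicit. The sketch below uses the 17 TOPS and 120 GB/s figures from this article; the arithmetic-intensity values passed in are arbitrary examples.

```python
def attainable_tops(peak_tops: float, bandwidth_gb_s: float, ops_per_byte: float) -> float:
    """Roofline estimate: throughput is capped by compute or by memory traffic."""
    memory_bound_tops = bandwidth_gb_s * ops_per_byte / 1000.0  # Gops/s -> Tops/s
    return min(peak_tops, memory_bound_tops)

PEAK_TOPS = 17.0        # figure quoted in this article
BANDWIDTH_GB_S = 120.0  # figure quoted in this article

# Arithmetic intensity at which the chip stops being memory-bound
ridge_ops_per_byte = PEAK_TOPS * 1000.0 / BANDWIDTH_GB_S
print(f"ridge point: {ridge_ops_per_byte:.1f} ops/byte")

for intensity in (10.0, 100.0, 1000.0):  # example intensities, chosen arbitrarily
    print(f"{intensity:7.1f} ops/byte -> {attainable_tops(PEAK_TOPS, BANDWIDTH_GB_S, intensity):.2f} TOPS")
```

Layers below the ridge point are limited by memory bandwidth, which is why on-chip reuse of weights and activations matters more than raw TOPS for small models.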

Specification	A17 Pro ANE	A17 Pro Total SoC
Peak TOPS	17 (all 16 cores)	N/A (CPU: 2.2 TFLOPS)
Cores	16	6 (2P + 4E)
Total MACs	4,096	N/A
Die Area	~1.5 mm² per core	~170 mm² total
Power	2W sustained	10W peak
Memory	8 GB LPDDR5X	Shared
Precision	INT8 only	FP32 (CPU)
Workload	Inference	Inference (training runs off-device)

Real-World Performance

Test: MobileNetV3 inference on iPhone 15 Pro

Model	Params	Size (INT8)	Latency	Power	Accuracy
MobileNetV3 Small	2.5M	2.5 MB	8 ms	0.5W	67.4%
MobileNetV3 Large	5.4M	5.5 MB	15 ms	1.2W	72.2%
ResNet-50 (pruned)	25M	25 MB	35 ms	1.8W	75%

Actual throughput: 12 TOPS sustained (70% of peak). Why not 100%? Non-matrix ops (layer norm, activation) run on CPU.
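For a battery-powered device, energy per inference is often more telling than latency or power alone. The sketch below derives it from the benchmark table above (latency in ms times power in W gives millijoules); the helper names are hypothetical.

```python
# (latency in ms, power in W) copied from the benchmark table above
BENCHMARKS = {
    "MobileNetV3 Small": (8.0, 0.5),
    "MobileNetV3 Large": (15.0, 1.2),
    "ResNet-50 (pruned)": (35.0, 1.8),
}

def energy_per_inference_mj(latency_ms: float, power_w: float) -> float:
    """Energy in millijoules: milliseconds x watts."""
    return latency_ms * power_w

def max_inferences_per_second(latency_ms: float) -> float:
    """Upper bound on back-to-back inferences per second at batch size 1."""
    return 1000.0 / latency_ms

for name, (latency_ms, power_w) in BENCHMARKS.items():
    energy = energy_per_inference_mj(latency_ms, power_w)
    rate = max_inferences_per_second(latency_ms)
    print(f"{name}: {energy:.1f} mJ per inference, up to {rate:.0f} inferences/s")
```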

---

Google TPU v4: Data Center Scale, Both Training & Inference

Market Constraint: Unlimited Power, Maximize TFLOPS

Data centers have cooling and power supplies. Optimize for throughput, not wattage.

Design Decisions

🎯 Architecture: 256×256 Systolic (Maximum Scale)

Why 256×256? Balances area (80 mm²) vs compute (430 TFLOPS). Larger = diminishing returns (wiring complexity).
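Single chip supports: up to 256×256 = 65,536 multiply-accumulates per clock cycle (one 256-element input vector streamed against a resident 256×256 weight tile), not a full matrix-matrix multiply every cycle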
Pod configuration: 8 chips in a 2×2×2 cube with all-reduce tree for distributed training
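To illustrate how a systolic array keeps its MACs busy, here is a small cycle-by-cycle simulation. It is a sketch of an output-stationary array (each processing element accumulates one output), chosen because it is the simplest to simulate; TPUs are usually described as weight-stationary, so this is not a model of the actual hardware. Operands are injected with a skew at the left and top edges and move one element per cycle.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing A @ B.

    Returns (result, cycles). PE(i, j) accumulates C[i][j]; A values move
    right and B values move down by one PE per cycle.
    """
    M, K, N = len(A), len(B), len(B[0])
    acc = [[0] * N for _ in range(M)]
    a_reg = [[0] * N for _ in range(M)]  # operand travelling rightwards
    b_reg = [[0] * N for _ in range(M)]  # operand travelling downwards
    cycles = M + N + K - 2               # last PE finishes at cycle M+N+K-3
    for t in range(cycles):
        for i in range(M):
            for j in range(N - 1, 0, -1):    # shift right, far column first
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i                        # row i is delayed by i cycles
            a_reg[i][0] = A[i][k] if 0 <= k < K else 0
        for j in range(N):
            for i in range(M - 1, 0, -1):    # shift down, bottom row first
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j                        # column j is delayed by j cycles
            b_reg[0][j] = B[k][j] if 0 <= k < K else 0
        for i in range(M):
            for j in range(N):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]  # one MAC per PE per cycle
    return acc, cycles

result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, "in", cycles, "cycles")
```

The simulation shows the fill-and-drain overhead: a 2×2 product needs 4 cycles rather than 2, and the same M + N + K − 2 pattern means larger arrays only reach high utilization on large matrices.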

🎯 Precision: BF16 + INT8 (Training + Inference)

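BF16 for forward/backward (same 8-bit exponent as FP32, so the same dynamic range and far less underflow risk than FP16)
INT8 for inference (2× faster than BF16, 4× less memory than FP32)
FP32 for accumulation and master weights (master weights for large models far exceed 24 MB of SRAM, so they reside in HBM)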
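The relationship between BF16 and FP32 is easiest to see at the bit level: BF16 is the top 16 bits of an FP32 value. The sketch below rounds FP32 to BF16 with round-to-nearest-even. It is a simplified illustration that does not special-case NaN or infinity, and the function names are hypothetical.

```python
import struct

def to_bf16_bits(x: float) -> int:
    """Round an FP32 value to bfloat16 and return the 16-bit pattern."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def from_bf16_bits(pattern: int) -> float:
    """Expand a bfloat16 bit pattern back to a Python float."""
    return struct.unpack(">f", struct.pack(">I", pattern << 16))[0]

for value in (1.0, 3.14159, 1e-30):
    pattern = to_bf16_bits(value)
    print(f"{value!r}: bits 0x{pattern:04X}, stored as {from_bf16_bits(pattern)!r}")
```

A value such as 1e-30 survives the conversion because the exponent field is unchanged; only mantissa precision is lost (7 stored mantissa bits instead of 23).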

🎯 Memory: 8 GB HBM3 + 24 MB SRAM

HBM = 2 TB/s (ample bandwidth to keep the array fed, which reduces memory stalls)
SRAM = weight cache (holds one layer's parameters)
No L1/L2/L3 hierarchy needed (dataflow predicts memory access)

Specification	TPU v4 (Single)	TPU v4 Pod (8 chips)
Systolic Array	256×256	8× independent
Peak TFLOPS	430 (BF16)	3,440
Peak INT8 TOPS	860	6,880
Memory (HBM)	8 GB @ 2 TB/s	64 GB total
SRAM	24 MB	192 MB total
Die Area	80 mm²	640 mm²
Power	200W	1.6 kW
Interconnect	N/A	Mesh topology

Real-World Performance: LLaMA 70B Training

Setup: 8 TPU v4 chips, mixed-precision (BF16 compute, FP32 master)

Metric	Value	Analysis
Model size	70B params	280 GB @ FP32, 140 GB @ BF16
Global batch	512	64 per TPU × 8
Sequence length	2,048	Standard transformer length
Peak TFLOPS	3,440	All 8 chips @ 430 TFLOPS
Actual throughput	2,500 TFLOPS	73% of peak (2,500 / 3,440)
Time per epoch	~163 hours (~6.8 days)	~3.5B tokens / ~5,950 tokens/sec (assuming ≈ 6 FLOPs per parameter per token at 2,500 TFLOPS)
Cost per hour	~$7 (at Google internal rates)	Training cost: ~$1,140/epoch

Why 73% and not 100%? Non-matrix ops (attention softmax, layer norm, all-reduce) reduce effective throughput.
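The model-size and throughput figures in this example can be sanity-checked with a few lines of arithmetic. The sketch below assumes the common rule of thumb of about 6 FLOPs per parameter per training token; that constant is an assumption introduced here, not a figure from the original setup, and real runs deviate from it.

```python
def footprint_gb(params: float, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9

def tokens_per_second(sustained_tflops: float, params: float,
                      flops_per_param_token: float = 6.0) -> float:
    """Training throughput implied by sustained FLOPS (rule-of-thumb estimate)."""
    return sustained_tflops * 1e12 / (flops_per_param_token * params)

PARAMS = 70e9             # model size from the table above
SUSTAINED_TFLOPS = 2500   # sustained throughput from the table above
EPOCH_TOKENS = 3.5e9      # tokens per epoch from the table above

rate = tokens_per_second(SUSTAINED_TFLOPS, PARAMS)
hours = EPOCH_TOKENS / rate / 3600.0

print(f"FP32 weights: {footprint_gb(PARAMS, 4):.0f} GB")
print(f"BF16 weights: {footprint_gb(PARAMS, 2):.0f} GB")
print(f"estimated throughput: {rate:,.0f} tokens/s")
print(f"estimated time per epoch: {hours:.0f} hours")
```

The weight footprints also show why a model of this size must be sharded: either figure exceeds the HBM capacity of a single chip.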

---

NVIDIA H100: Maximum Flexibility, Highest Peak TFLOPS

Market Constraint: Support Any Workload

Data center customers want one GPU for training, inference, and HPC. Maximum compatibility wins.

Design Decisions

🎯 Architecture: Tensor Cores (Not Pure Systolic)

132 SMs, each with 4 Tensor Cores, 528 in total (vs TPU's single 256×256 systolic)
Each Tensor Core: SIMD-like (warp computes 16×16 matmul)
Flexibility: Can do variable-size operations, loops, branches
Cost: 40% more control logic than TPU (area overhead)

🎯 Precision: FP32, TF32, FP16, FP8, INT8 (All Supported)

Automatic Mixed Precision (AMP) switches precision per operation
FP8 support (NEW in Hopper) for ultra-low precision
Sparsity support: 2:4 structured sparsity = 2× effective throughput
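The 2:4 pattern means that in every group of four consecutive weights, at most two are non-zero. A minimal sketch of magnitude-based 2:4 pruning follows. It shows only the pattern itself; production tools typically fine-tune the model afterwards, and `prune_2_of_4` is a hypothetical name.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in each group of four weights."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # indices of the two largest magnitudes in this group
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned

row = [0.1, -0.9, 0.4, 0.05, 0.7, 0.2, -0.3, 0.6]
print(prune_2_of_4(row))
```

Because the hardware knows exactly two of every four values are zero, it can skip them, which is where the 2× effective throughput comes from.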

🎯 Memory: 80 GB HBM3 + L1/L2 Cache Hierarchy

HBM3 @ 3 TB/s (1.5× faster than TPU)
L1: 256 KB of combined L1 cache and shared memory per SM (SM = Streaming Multiprocessor)
L2: 50 MB shared (fast coherent access)
L3: Not implemented (GPU-to-GPU coherence via NVLink)

Specification	H100 (Single)	DGX H100 (8 GPUs)
Tensor Cores	528 (132 SMs × 4)	4,224
Peak TFLOPS (FP8, dense)	1,450	11,600
Peak TFLOPS (FP8, 2:4 sparsity)	2,900	23,200
Memory (HBM3)	80 GB @ 3 TB/s	640 GB total
Die Area	815 mm²	6,520 mm² (8 dies)
Power	700W	5.6 kW
Interconnect	N/A	NVLink, 900 GB/s total per GPU (18 links)

Real-World Performance: GPT-3 175B Training

Setup: 8 H100 GPUs with Hopper sparsity (2:4 pruning)

Metric	Value	Analysis
Model size	175B params	700 GB @ FP32, 350 GB @ BF16
Global batch	1,024	128 per GPU × 8
Sequence length	2,048	Standard
Peak TFLOPS (with sparsity)	23,200	All 8 GPUs × 2900 sparse TFLOPS
Actual throughput	8,000 TFLOPS	34% of sparse peak (variability in sparsity pattern)
Time per epoch	~24 hours	Much larger model than TPU example
Cost per hour	~$20 (AWS on-demand)	Training cost: $480/epoch

Why only 34%? Tensor Cores require careful warp scheduling. Mixed workloads (attention, normalization, activation) can't keep all 132 SMs fed simultaneously.

---

Detailed Comparison: Efficiency & Real-World Numbers

Metric	🍎 Apple A17 ANE	🔵 Google TPU v4	🟢 NVIDIA H100
Peak	17 TOPS	430 TFLOPS	1,450 TFLOPS
Sustained %	70%	73%	34%
Peak per Watt	8.5 TOPS/W	2.15 TFLOPS/W	2.07 TFLOPS/W
Die Area	1.5 mm²/core	80 mm²	815 mm²
Cost per unit of peak	$5.88/TOPS	$6.98/TFLOPS	$3.45/TFLOPS
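The per-watt figures above follow directly from the peak and power numbers quoted earlier in this article, as the sketch below shows. It also computes a sustained-per-watt figure from the measured throughputs, which is an extra derived metric not in the original comparison. Note that the Apple numbers are INT8 TOPS while the other two are TFLOPS, so the columns are not strictly like-for-like.

```python
# Per-chip figures taken from the sections above; sustained values for the
# TPU and H100 are the 8-chip training throughputs divided by 8.
CHIPS = {
    "Apple A17 ANE": {"peak": 17.0, "sustained": 12.0, "watts": 2.0},
    "Google TPU v4": {"peak": 430.0, "sustained": 2500.0 / 8, "watts": 200.0},
    "NVIDIA H100": {"peak": 1450.0, "sustained": 8000.0 / 8, "watts": 700.0},
}

def per_watt(throughput: float, watts: float) -> float:
    """Throughput (TOPS or TFLOPS) delivered per watt."""
    return throughput / watts

for name, chip in CHIPS.items():
    peak_eff = per_watt(chip["peak"], chip["watts"])
    sustained_eff = per_watt(chip["sustained"], chip["watts"])
    print(f"{name}: {peak_eff:.2f} peak/W, {sustained_eff:.2f} sustained/W")
```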

---

Key Insights: Why These Design Choices?

Apple: Inference-Only Wins on TOPS/Watt

Why 8.5 TOPS/W vs ~2.1 TFLOPS/W for the others? (INT8 TOPS and BF16 TFLOPS are not strictly like-for-like.)

No training logic (saves power)
INT8 only (4× less power per multiply)
Smaller array (lower wiring power)
Battery-optimized: Designed for low continuous load

Google: Sustained Efficiency Through Systolic Design

Why 73% sustained vs 34% on H100?

Predictable dataflow (data arrives on schedule)
Simple control logic (no branch prediction overhead)
Systolic architecture naturally fills MACs
Cost: Less flexible (can't do arbitrary algorithms)

NVIDIA: Flexibility at Cost of Efficiency

Why a headline 1,450 TFLOPS (2,900 with sparsity) but only 34% of the sparse peak sustained?

General-purpose GPU (must support any code)
Control flow overhead (branches, loops, synchronization)
Cache misses (unpredictable memory access patterns)
Warp divergence (some threads idle while others compute)
Benefit: Runs any workload, not just matrix math

---

The Verdict: Which Is Best?

It depends on the workload:

Workload	Winner	Reason
Mobile inference (power-limited)	Apple ANE	8.5 TOPS/W + instant availability
Pure matrix multiply training	TPU v4	73% sustained, the highest utilization of the three
Mixed workloads (inference + NLP)	H100	Supports all ops, 11,600 TFLOPS peak per DGX H100
Sparsity-heavy models	H100	2× effective throughput with 2:4 sparsity

Next (Day 11): How floating-point precision affects these designs.
