AI Chip Design Day 22 Enhanced

The Strategic Vision: Why Google Built TPU

Context (2017-2020): TPU v1-v3 were data-center chips built mainly for Google's own workloads (v2 and v3 were also rentable through Google Cloud). GPUs dominated the market. By 2022, Google faced a problem:

Training LLMs at scale (175B+ parameters) cost $5M+ per run
NVIDIA GPUs (V100, A100) were mature, fast, but expensive
Custom silicon cost $100M+ to design and tape out
But the ROI was clear: if TPU v4 was 2-3× more efficient, Google would save hundreds of millions in training costs

TPU v4's mission: Be the most efficient AI training chip ever built, optimized specifically for transformer models.

The Core Design: 256×256 Systolic Array

🔷 Why 256×256 and Not Smaller or Larger?

Analysis of array size trade-offs:

Array Size	Total MACs	Die Area (mm²)	Peak TFLOPS	TFLOPS/mm²	Wiring Complexity
128×128	16,384	20	107	5.4	Low
256×256	65,536	80	430	5.4	Medium
512×512	262,144	320	1,720	5.4	Extreme

Observation: TFLOPS/mm² is constant! So why pick 256×256?

Power density: Larger array = longer wires = more resistance = more power loss
Synthesis complexity: At 512×512, place-and-route becomes infeasible (routing congestion)
Yield: Larger die = more defects per die, so yield falls roughly exponentially with area and cost per good die climbs
Latency: Signal propagation in a 512×512 array exceeds clock period (timing closure fails)
Sweet spot: 256×256 fits all constraints and provides 430 TFLOPS/chip
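
The area and yield argument above can be sketched numerically. This is a minimal sketch, not the author's model: it takes the array sizes, areas, and peak TFLOPS from the table, and assumes a simple Poisson yield model with a hypothetical defect density (`d0_per_cm2`), which does not appear in the original.

```python
import math

# (array dimension, die area in mm^2, peak TFLOPS) copied from the table above
ARRAYS = [(128, 20.0, 107.0), (256, 80.0, 430.0), (512, 320.0, 1720.0)]

def array_stats(n, area_mm2, peak_tflops, d0_per_cm2=0.1):
    """Return (MAC count, TFLOPS per mm^2, estimated yield).

    Yield uses the Poisson model Y = exp(-area * D0).
    d0_per_cm2 is a hypothetical defect density, chosen only for illustration.
    """
    macs = n * n
    density = peak_tflops / area_mm2
    area_cm2 = area_mm2 / 100.0
    est_yield = math.exp(-area_cm2 * d0_per_cm2)
    return macs, density, est_yield

for n, area, tflops in ARRAYS:
    macs, density, est_yield = array_stats(n, area, tflops)
    print(f"{n}x{n}: {macs} MACs, {density:.2f} TFLOPS/mm^2, yield ~{est_yield:.3f}")
```

Under this assumed model, compute density stays flat while estimated yield drops as the array grows, which is the trade-off the list describes.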

The Memory Subsystem: Why 8 GB HBM3 + 24 MB SRAM?

🎯 Memory Capacity vs Bandwidth Optimization

Problem: Large transformer models need massive memory (LLaMA 70B = 280 GB @ FP32, 140 GB @ BF16). A single TPU with 8 GB HBM can't hold the entire model.

Solution: Stream weights from HBM into SRAM cache. Compute on streamed data.

Layer	Size	Bandwidth	Latency	Purpose
Register File	~50 KB	10 TB/s	1 cycle	Immediate operands
SRAM (weight cache)	24 MB	500 GB/s	5-10 cycles	Hold current layer weights
HBM3	8 GB	2 TB/s	50-100 cycles	Full model storage

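Design choice: 24 MB SRAM holds the working set of the layer currently being computed (weights + activations). A full LLaMA 70B layer is about 1.75 GB at BF16 (140 GB / 80 layers), so large layers pass through SRAM one tile at a time. While one tile or layer is being computed, the next one's weights are prefetched into SRAM.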
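
The benefit of prefetching can be illustrated with a small timing model. This is a rough sketch under stated assumptions: the function name `step_time_ms`, the per-chunk compute times, and the chunk sizes are hypothetical, and only the 2 TB/s HBM figure comes from the table above.

```python
def step_time_ms(chunk_bytes, compute_ms, hbm_bytes_per_s=2e12, overlap=True):
    """Total time to process a sequence of weight chunks.

    chunk_bytes[i] is the size of chunk i streamed from HBM into SRAM.
    compute_ms[i]  is the (hypothetical) compute time on chunk i.
    With overlap=True, chunk i+1 is loaded while chunk i is computed
    (double buffering); otherwise loads and compute run back to back.
    """
    load_ms = [b / hbm_bytes_per_s * 1e3 for b in chunk_bytes]
    if not overlap:
        return sum(load_ms) + sum(compute_ms)
    total = load_ms[0]  # the first load cannot be hidden
    for i, c in enumerate(compute_ms):
        next_load = load_ms[i + 1] if i + 1 < len(load_ms) else 0.0
        total += max(c, next_load)  # the slower of compute and prefetch wins
    return total

# Four 24 MB chunks (one SRAM-full each), 0.05 ms of compute per chunk (assumed)
chunks = [24e6] * 4
compute = [0.05] * 4
serial = step_time_ms(chunks, compute, overlap=False)
pipelined = step_time_ms(chunks, compute, overlap=True)
```

When compute per chunk takes longer than the load, only the first load shows up in the total; when the load takes longer, the array waits on memory.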

Why 2 TB/s HBM Bandwidth Is Critical

Roofline calculation for TPU v4:

Peak compute: 430 TFLOPS
Memory bandwidth: 2 TB/s
Arithmetic intensity (AI) knee: 430 / 2 = 215 FLOPs/byte

For matrix multiply with AI > 215, TPU is compute-bound and achieves peak.
For attention/normalization with AI < 215, TPU is memory-bound.

Benchmark: LLaMA 70B average AI ≈ 150 FLOPs/byte (memory-bound overall). At that intensity the attainable throughput is 150 × 2 = 300 TFLOPS, about 70% of peak.

This is why Google invested heavily in HBM3: in the memory-bound regime, each additional GB/s of bandwidth translates directly into throughput.
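
The roofline rule above fits in a few lines. This is a minimal sketch using the peak and bandwidth figures quoted in this section; the function name `attainable_tflops` is made up for illustration.

```python
def attainable_tflops(ai_flops_per_byte, peak_tflops=430.0, bandwidth_tb_s=2.0):
    """Roofline model: throughput is capped by compute or by memory traffic.

    (FLOPs/byte) * (TB/s) gives TFLOPS, so the units line up directly.
    """
    return min(peak_tflops, ai_flops_per_byte * bandwidth_tb_s)

knee = 430.0 / 2.0                       # 215 FLOPs/byte
llama_estimate = attainable_tflops(150)  # memory-bound: 300 TFLOPS
matmul_estimate = attainable_tflops(500) # compute-bound: capped at 430 TFLOPS
```

Anything left of the knee scales with bandwidth, anything right of it is capped by the systolic array.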

Design Decisions: Training-Specific Choices

Decision 1: BF16 for Forward & Backward (Not FP16)

🎯 Why BF16 Instead of FP16?

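Problem with FP16: With only 5 exponent bits, the largest finite value is about 6.5×10^4 and the smallest positive (subnormal) value is about 6×10^-8. Gradients during backprop go as small as 10^-8, so in FP16 they underflow to zero and training stalls unless loss scaling is used.

BF16 solution: 1 sign + 8 exponent + 7 mantissa. It has the same 8-bit exponent as FP32, so the range is about ±3×10^38. Gradients that underflow in FP16 stay representable.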

Cost: Lower precision (7 vs 10 bits in mantissa), but neural networks don't need that precision anyway.
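
The underflow difference can be checked with the standard library alone. This is a small sketch: FP16 uses Python's `struct` half-precision format, while BF16 is emulated by rounding a float32 to its top 16 bits. The helper names are illustrative, and NaN/infinity handling is left out.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to IEEE half precision and back (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x: float) -> float:
    """Emulate BF16: keep the top 16 bits of the float32 encoding,
    using round-to-nearest-even. NaN and infinity are not handled."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

tiny_gradient = 1e-8
fp16_value = to_fp16(tiny_gradient)  # below the FP16 subnormal range: becomes 0.0
bf16_value = to_bf16(tiny_gradient)  # still close to 1e-8, with ~2-3 significant digits
```

The BF16 result keeps the magnitude but only a few significant digits, which is the precision-for-range trade described above.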

Measurement: LLaMA 70B training

FP16 mixed precision: Would require frequent scaling, ~20% overhead
BF16 mixed precision: Clean implementation, near 100% efficiency

Decision 2: All-Reduce Tree for Distributed Training

Problem: A single TPU is 430 TFLOPS. But LLaMA 70B training needs 2,500+ TFLOPS continuously to finish in weeks.

Solution: 8 TPUs in a Pod, connected with an all-reduce tree.

All-Reduce Tree (8 TPU v4 chips, synchronous gradient updates):

Level 0 (8 chips compute in parallel): [Chip 0: Gradient] [Chip 1: Gradient] ... [Chip 7: Gradient]
↓
Level 1 (4 sums): [Sum01] [Sum23] [Sum45] [Sum67]
↓
Level 2 (2 sums): [Sum0123] [Sum4567]
↓
Level 3 (final sum): [GlobalGradient]
↓
(Broadcast back to all chips)

Latency per iteration:
- Compute: ~100 ms (FLOPs per batch ÷ per-chip throughput)
- All-reduce: ~5 ms
- Total: ~105 ms per step

Training throughput: 8 TPUs × 430 TFLOPS ÷ 1.05× communication overhead ≈ 3.3 PFLOPS (an upper bound that assumes the compute phase itself runs at peak)
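
The reduction pattern in the tree can be sketched in plain Python. This is a toy model, not the interconnect's actual protocol: gradients are lists of floats, one per chip, and the function name `tree_all_reduce` is made up for illustration.

```python
def tree_all_reduce(grads):
    """Sum per-chip gradients pairwise, level by level, then broadcast.

    grads: one equal-length list of floats per chip; the chip count
    must be a power of two. Returns (per-chip results, reduction levels).
    """
    n = len(grads)
    assert n > 0 and n & (n - 1) == 0, "chip count must be a power of two"
    level = [list(g) for g in grads]
    levels = 0
    while len(level) > 1:
        # Each pair of neighbours is combined into one partial sum.
        level = [
            [a + b for a, b in zip(level[i], level[i + 1])]
            for i in range(0, len(level), 2)
        ]
        levels += 1
    total = level[0]
    # Broadcast: every chip receives its own copy of the global sum.
    return [list(total) for _ in range(n)], levels

chip_grads = [[float(i), 1.0] for i in range(8)]  # 8 chips, 2 gradient values each
results, depth = tree_all_reduce(chip_grads)
```

With 8 chips the reduction takes 3 levels (log2 of the chip count), which is why the tree's latency grows slowly with Pod size.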

Decision 3: Unified Memory Layout (Not Separate I/D Cache)

Systolic arrays have fully deterministic memory access patterns, so every fetch can be scheduled ahead of time. Unlike CPUs, there's no cache miss problem. Design choice: Single unified weight SRAM. No instruction cache needed (control is off-chip).

Benefit: 100% of SRAM goes to data. No wasted space on instruction buffering.

Real Training Performance: Detailed Numbers

Scenario: Training LLaMA 70B on 8 TPU v4 Pod

Metric	Value	Breakdown
Model Size	70B params	140 GB @ BF16
Global Batch	512	64 per TPU × 8
Sequence Length	2,048 tokens	Standard context window
Layers	80	Attention + MLP per layer
FLOPs per Batch	~4.3B	2×N×d×L + attention

Phase 1: Forward Pass (All 8 TPUs)

Layer Type	FLOPs	Throughput (TFLOPS)	Time (ms)	Utilization
Attention Q @ K (2048²×64)	268M	50	5.4	12%
Softmax	10M	—	2.0	—
Attention × V	268M	50	5.4	12%
Linear (8192→8192)	1.1B	380	3.0	88%
Linear (8192→32768)	4.3B	420	10.2	98%

Total forward pass: ~27 ms @ ~2,300 TFLOPS (67% of the Pod's 3,440 TFLOPS peak)

Phase 2: Backward Pass (All 8 TPUs)

Backprop is roughly 2× forward pass (gradient computation for weights & activations).

Total backward pass: ~54 ms @ ~2,300 TFLOPS (67% of peak)

Phase 3: All-Reduce Synchronization

After each layer's backward pass, gradients must be synchronized across 8 TPUs.

Time: ~1-2 ms (depending on data size)

Complete Iteration: ~83 ms

Phase	Time (ms)	Utilization
Forward	27	67%
Backward	54	67%
All-reduce	2	100% (specialized)
Total per step	83	~68% avg

Sustained throughput: 8 TPUs × 430 TFLOPS × 0.68 = 2,330 TFLOPS
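
The step total and average utilization can be reproduced from the phase table. This is a short sketch that takes the table's times and utilizations at face value, including its choice to count the all-reduce phase as 100% utilized; the function name `step_summary` is illustrative.

```python
# (time in ms, utilization) per phase, copied from the table above
PHASES = {
    "forward": (27.0, 0.67),
    "backward": (54.0, 0.67),
    "all_reduce": (2.0, 1.00),
}

def step_summary(phases, chips=8, peak_tflops=430.0):
    """Return (step time in ms, time-weighted utilization, sustained TFLOPS)."""
    total_ms = sum(t for t, _ in phases.values())
    avg_util = sum(t * u for t, u in phases.values()) / total_ms
    sustained = chips * peak_tflops * avg_util
    return total_ms, avg_util, sustained

total_ms, avg_util, sustained_tflops = step_summary(PHASES)
print(f"{total_ms:.0f} ms/step, {avg_util:.1%} utilization, {sustained_tflops:.0f} TFLOPS")
```

The average is weighted by time, so the 81 ms of forward and backward work dominate and the 2 ms all-reduce barely moves it.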

Training speed:

Tokens processed per second: 512 batch × 2,048 seq / 0.083 s ≈ 12.6M tokens/s
Time to train on 300B tokens (typical): 300B / 12.6M ≈ 23,800 s ≈ 6.6 hours
Cost (at Google's internal rates): 6.6 hours × $50/hour = $330
Compare to: 8 × H100 on AWS at $20/hour × 6.6 hours = $132 of compute at equal speed, but lower utilization stretches the actual run to ~12 hours, or ~$240 in total.
On these numbers the TPU Pod finishes in roughly half the time, though at a higher total cost ($330 vs ~$240): it is faster, not cheaper, in this scenario.
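
The arithmetic above can be written as a small calculator. This is a minimal sketch that only reproduces the section's own inputs (batch size, sequence length, step time, hourly rate); the function name `training_estimate` is illustrative and the rates are the article's figures, not verified prices.

```python
def training_estimate(global_batch=512, seq_len=2048, step_s=0.083,
                      total_tokens=300e9, usd_per_hour=50.0):
    """Return (tokens per second, training hours, total cost in USD)."""
    tokens_per_s = global_batch * seq_len / step_s
    hours = total_tokens / tokens_per_s / 3600.0
    return tokens_per_s, hours, hours * usd_per_hour

tokens_per_s, hours, cost = training_estimate()
print(f"{tokens_per_s / 1e6:.1f}M tokens/s, {hours:.1f} h, ${cost:.0f}")
```

Changing `step_s` shows how directly the time and cost depend on the step time measured earlier in this section.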

Why This Design Dominates Matrix Math

🔷 The Systolic Advantage Explained

For matrix multiply (core AI operation):

Memory bandwidth bottleneck: Computing C = A @ B for N×N matrices requires loading on the order of N² elements but performing N³ multiply-accumulates, so each loaded element can be reused about N times.
Systolic solution: Stream A rightward, B downward. Each element is loaded once but used by N MACs.
Result: Required memory bandwidth drops by a factor of N (N = 256 here) compared with fetching operands for every MAC.
This enables: 430 TFLOPS sustained for matrix operations despite "only" 2 TB/s HBM.
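
The load-once, reuse-N-times behaviour can be seen in a tiny cycle-level simulation. This is an illustrative sketch of an output-stationary systolic array in pure Python, not a description of the actual TPU datapath: A enters from the left edge, B from the top edge, and the code counts how many operands are read from memory.

```python
def systolic_matmul(A, B):
    """Multiply two n x n matrices on a simulated n x n systolic array.

    Each processing element (i, j) accumulates C[i][j]. Values of A shift
    right and values of B shift down by one element per cycle.
    Returns (C, number of operands read from memory).
    """
    n = len(A)
    acc = [[0] * n for _ in range(n)]
    a_reg = [[0] * n for _ in range(n)]
    b_reg = [[0] * n for _ in range(n)]
    loads = 0
    for t in range(3 * n - 2):  # cycles until the last element drains
        # Shift existing operands one element right / down.
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Feed the edges; row i and column j are skewed by i and j cycles.
        for i in range(n):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0
            loads += 1 if 0 <= k < n else 0
        for j in range(n):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0
            loads += 1 if 0 <= k < n else 0
        # Every element does one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, loads

C, loads = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The array reads 2N² operands in total, whereas fetching both operands for each of the N³ MACs would take 2N³ reads: a factor-of-N reduction.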

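GPUs take a different route: Tensor Cores are SIMD-like units fed through caches and register files rather than a fixed systolic dataflow. Operand reuse depends on how well the kernel tiles data on-chip, so GPUs lean more heavily on raw memory bandwidth (about 3 TB/s for H100).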

Limitations & Trade-offs

1. Systolic Array Can Stall

If data pipeline breaks (network congestion, memory stall), all 65K MACs sit idle. Unlike GPUs, there's no out-of-order execution or warp scheduling to hide latency.

2. Non-Matrix Operations Are Slow

Element-wise ops (ReLU, LayerNorm, Softmax) run on scalar control logic. While matmul gets 98% peak, activation gets 12% peak. Average: 67%.

3. Fixed to BF16

For inference, quantize to INT8. But TPU v4 doesn't have native INT8 compute (v5+ will). So inference still uses BF16 (less efficient than GPU's FP8 support).

4. All-Reduce Tree Synchronization

Distributed training requires gradients synchronized at the end of each batch. This is critical path and limits scaling beyond ~100 TPUs (latency dominates).

Why Companies Choose TPU vs GPU vs Custom Silicon

Choose TPU if...	Choose GPU (H100) if...	Design Custom if...
Doing heavy matrix math (transformers)	Need flexibility (RL, graphs, sparse)	Billion-unit volume (phones, edge)
Training at scale (billions of tokens)	Using cutting-edge frameworks	Single product focus (very high ROI)
Can wait for Google Cloud availability	Need to own hardware (on-prem)	10+ year product lifecycle
Budget optimizes for throughput	Budget optimizes for flexibility	NRE budget > $100M

Takeaway: Why TPU v4 Is a Masterpiece

✅ Every design choice made for one goal: maximize transformer training efficiency
✅ 256×256 systolic is the Goldilocks size (not too big, not too small)
✅ BF16 + FP32 mixed precision is perfect for this workload
✅ 68% sustained utilization on production workloads (vs 34% for GPU)
✅ Cost per TFLOPS is lower than H100 despite headline TFLOPS disparity
❌ Trade-off: Less flexible than GPU (but Google didn't need flexibility)

Next (Day 23): NVIDIA H100—the opposite design philosophy (maximum flexibility vs specialization).

Google TPU v4 Complete Design