AI Chip Design Day 11

The Precision Spectrum

Neural networks don't need 64-bit IEEE double precision. Most run well on 8- to 16-bit numbers. Different formats optimize for different workloads:

FP64 (64-bit): S[1] E[11] M[52] ← Double precision (scientific computing)
FP32 (32-bit): S[1] E[8] M[23] ← Standard neural network precision
TF32 (19-bit, stored in 32): S[1] E[8] M[10] ← NVIDIA TensorFloat-32 (FP32 range, FP16 precision)
BF16 (16-bit): S[1] E[8] M[7] ← Brain Float (TPU choice, good range)
FP16 (16-bit): S[1] E[5] M[10] ← Half precision (older, smaller range)
INT8 (8-bit): signed integer, no exponent ← Integer quantization (mostly inference)
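To make the S/E/M notation concrete, here is a minimal sketch that splits a value into its FP32 bit fields using only Python's standard library. The function name `fp32_fields` is made up for this illustration.

```python
import struct

def fp32_fields(x):
    """Return (sign, biased_exponent, mantissa) of x encoded as IEEE 754 FP32."""
    # Pack as a big-endian float32, then reinterpret the same 4 bytes as a uint32.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                # S[1]
    exponent = (bits >> 23) & 0xFF   # E[8], stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # M[23], the leading 1 is implicit
    return sign, exponent, mantissa

# -6.5 = -1.625 * 2^2, so the biased exponent is 127 + 2 = 129
sign, exponent, mantissa = fp32_fields(-6.5)
print(sign, exponent, hex(mantissa))
```

The other floating-point formats use the same layout with different field widths, so the same shifts and masks apply with different constants.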

FP32: The Gold Standard

32 bits: 1 sign + 8 exponent + 23 mantissa

Range: ~10^-38 to 10^38 (massive)

Precision: ~7 decimal digits

Use: Training baseline (master weights and accumulations that need the extra precision), research

BF16: The TPU Choice

16 bits: 1 sign + 8 exponent + 7 mantissa (same exponent width as FP32!)

Range: Same as FP32 (~10^-38 to 10^38)

Precision: ~2-3 decimal digits (lower, but range unchanged)

Why it works: Neural networks tolerate low precision far better than they tolerate limited range. In practice 7 bits of mantissa is usually enough for weights and activations, as long as sums are accumulated in higher precision.

Key insight: BF16 trades mantissa precision for exponent range preservation. This suits AI workloads because:
- Intermediate values during training can be very small (10^-10) or very large (10^6)
- BF16 represents both without underflow or overflow
- FP16's smallest normal value is ~6×10^-5 (subnormals reach ~6×10^-8) and its largest is 65504, so both examples fall outside its range
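The range argument can be checked with a small emulation. This is a sketch, not a hardware model: `to_bf16` and `to_fp16` are illustrative helper names, BF16 is emulated by rounding the FP32 encoding to its top 16 bits (round-to-nearest-even is an assumption), and FP16 uses the standard library's half-precision `e` format.

```python
import math
import struct

def to_bf16(x):
    """Round x to BF16 by keeping the top 16 bits of its FP32 encoding.

    Assumes round-to-nearest-even; NaN payloads are not handled.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # rounding bias, ties go to even
    bits &= 0xFFFF0000                    # drop the low 16 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def to_fp16(x):
    """Round x to IEEE half precision; values too large become infinity."""
    try:
        return struct.unpack(">e", struct.pack(">e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

for value in (1e-10, 1e6):
    print(value, "BF16:", to_bf16(value), "FP16:", to_fp16(value))
```

Under these assumptions the BF16 results stay within about 0.4% of the inputs, while FP16 flushes the small value to zero and saturates the large one to infinity.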

FP16: The Older Choice

16 bits: 1 sign + 5 exponent + 10 mantissa

Range: ~6×10^-5 (smallest normal) to 65504 (much smaller!)

Precision: ~3 decimal digits

Problem: Small gradients underflow to zero, so the affected weights stop updating

Still used? Yes, but with "mixed precision" (see Day 14)
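One common ingredient of FP16 mixed precision is loss scaling: multiply the loss (and therefore every gradient) by a constant before the FP16 backward pass, then divide it back out in higher precision. The sketch below shows the effect on a single gradient value. `LOSS_SCALE` and the gradient are hypothetical numbers chosen for illustration.

```python
import struct

def to_fp16(x):
    """Round x to IEEE half precision via the struct module's 'e' format."""
    return struct.unpack(">e", struct.pack(">e", x))[0]

LOSS_SCALE = 1024.0   # hypothetical power-of-two scale factor
true_grad = 1e-8      # hypothetical gradient, below FP16's smallest subnormal

naive = to_fp16(true_grad)                 # stored directly: flushes to zero
scaled = to_fp16(true_grad * LOSS_SCALE)   # scaled first: survives in FP16
recovered = scaled / LOSS_SCALE            # unscale in higher precision

print(naive, recovered)
```

A power-of-two scale only shifts the exponent, so the scaling itself adds no rounding error beyond the FP16 conversion.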

TF32: NVIDIA's Compromise

32-bit container, 19 effective bits: S[1] E[8] M[10]

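Speed: Tensor Cores read FP32 inputs, round them to a 10-bit mantissa for the multiply, and accumulate in FP32. Storage stays 32-bit.

Accuracy: FP16-level precision (10-bit mantissa) with FP32 range

Use: Drop-in Tensor Core acceleration of FP32 matrix math on Ampere (A100) and newer GPUs, including Hopper (H100)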
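The TF32 layout can be emulated by clearing the low 13 mantissa bits of an FP32 value. This is a rough sketch: the name `to_tf32` is made up, and the round-to-nearest-even step is an assumption, since the exact rounding used by the hardware is not stated here.

```python
import struct

def to_tf32(x):
    """Emulate TF32 by rounding an FP32 value to a 10-bit mantissa.

    Assumes round-to-nearest-even; NaN payloads are not handled.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0xFFF + ((bits >> 13) & 1)   # rounding bias, ties go to even
    bits &= 0xFFFFE000                   # clear the low 13 of 23 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(to_tf32(1.0 + 2 ** -10))   # representable: spacing near 1.0 is 2^-10
print(to_tf32(1.0 + 2 ** -12))   # too fine for a 10-bit mantissa
print(to_tf32(1e-30))            # tiny values survive: exponent is still 8 bits
```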

Comparison Table

Format	Bits	Exponent	Mantissa	Range	Accuracy	Best For
FP64	64	11	52	±10^308	Very high	Scientific
FP32	32	8	23	±10^38	Good	Training
TF32	32	8	10	±10^38	Fair	Mixed precision
BF16	16	8	7	±10^38	Fair	TPU, Training
FP16	16	5	10	±6.5×10^4	Fair	GPU, with scaling
INT8	8	—	7	[-128,127]	Poor	Inference only
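The Range and Accuracy columns follow directly from the exponent and mantissa widths. The sketch below assumes the usual IEEE-style layout (bias of 2^(E-1) - 1, top exponent code reserved for infinity and NaN, implicit leading 1). The function name `format_stats` is illustrative.

```python
import math

def format_stats(exp_bits, man_bits):
    """Return (max_normal, min_normal, decimal_digits) for an IEEE-style format."""
    bias = 2 ** (exp_bits - 1) - 1
    max_normal = (2 - 2.0 ** -man_bits) * 2.0 ** bias   # all-ones mantissa, top usable exponent
    min_normal = 2.0 ** (1 - bias)                      # smallest value with an implicit leading 1
    digits = (man_bits + 1) * math.log10(2)             # stored bits plus the implicit bit
    return max_normal, min_normal, digits

for name, e, m in [("FP64", 11, 52), ("FP32", 8, 23), ("TF32", 8, 10),
                   ("BF16", 8, 7), ("FP16", 5, 10)]:
    hi, lo, digits = format_stats(e, m)
    print(f"{name}: max={hi:.3g} min_normal={lo:.3g} digits={digits:.1f}")
```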

Storage Cost

A 1-billion-parameter model (weights only):
- FP32: 1B × 4 bytes = 4 GB
- TF32: 1B × 4 bytes = 4 GB (same storage, faster ops)
- BF16: 1B × 2 bytes = 2 GB (50% smaller!)
- INT8: 1B × 1 byte = 1 GB (75% smaller!)

Example: GPT-3 (175B params)
- FP32: 700 GB (!!)
- BF16: 350 GB (fits on TPU Pod)
- INT8: 175 GB (needs at least three 80 GB GPUs for the weights alone)
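The arithmetic above is simple enough to script. This sketch counts weights only (no activations, optimizer state, or KV cache) and uses decimal gigabytes. `BYTES_PER_PARAM` and `weight_gb` are names invented for this example.

```python
# Storage per parameter, in bytes, for each format discussed above.
BYTES_PER_PARAM = {"FP32": 4, "TF32": 4, "BF16": 2, "FP16": 2, "INT8": 1}

def weight_gb(num_params, fmt):
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9

gpt3_params = 175_000_000_000
for fmt in ("FP32", "BF16", "INT8"):
    print(fmt, weight_gb(gpt3_params, fmt), "GB")
```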

Why AI Chips Choose Different Precisions

Google TPU: BF16 First

TPU v2/v3 are optimized for BF16 because:

Exponent width matches FP32 (same range, so overflow and underflow are rare)
16 bits halve the memory traffic per value
Neural networks tolerate a short mantissa
Mixed precision: BF16 inputs to the matrix multiply, FP32 for accumulation

Apple Neural Engine: Low-Precision Inference

A17 Pro Neural Engine runs inference only (training happens offline):

INT8 is 2× smaller than BF16 (1 byte vs 2 bytes) and 4× smaller than FP32
Integer multiply-accumulate units are smaller and cheaper in energy than floating-point units
Models are quantized post-training on GPU, deployed as INT8
Power: ~2W sustained (critical for battery-powered phones)
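As a preview of what "quantized post-training" means, here is a minimal sketch of symmetric per-tensor INT8 quantization. It is one simple scheme among many and is not claimed to be what Apple's toolchain does. The function names and sample weights are illustrative.

```python
def quantize_int8(values):
    """Map floats onto integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0   # avoid dividing by zero
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the INT8 codes."""
    return [qi * scale for qi in q]

weights = [0.5, -1.27, 0.0, 1.0]   # hypothetical weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale, restored)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the error budget any INT8 conversion has to fit within.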

NVIDIA H100: All of the Above

H100 supports FP64, FP32, TF32, BF16, FP16, FP8, and INT8, so it's flexible:

Automatic mixed precision (AMP): FP32 master weights, FP16/BF16 compute (FP8 through the Transformer Engine)
Structured sparsity (carried over from Ampere): up to 2× Tensor Core throughput when two of every four weights are zero (the 2:4 pattern)
Flexible for researchers (training) and enterprises (inference)
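The sparse Tensor Core path expects a 2:4 pattern: in every group of four consecutive weights, two are zero. A simple way to produce that pattern is magnitude pruning, sketched below. Picking the two largest magnitudes per group is an assumption for illustration, and real workflows usually fine-tune afterward to recover accuracy.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in every group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest magnitudes in this group.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return pruned

row = [0.1, -0.9, 0.3, 0.05, 1.0, 2.0, -3.0, 0.5]   # hypothetical weights
print(prune_2_of_4(row))
```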

Design insight: Precision choice is a constrained optimization problem:
- Training: BF16/TF32 for stability; FP16 also works, but needs gradient (loss) scaling
- Inference: INT8/INT4 fine for deployed models (quantized offline)
- Mobile: INT8 dominates (power budget)
- Data center: Mix of FP32 + BF16 + INT8

Next: Quantization Techniques

Now you know what FP32 and BF16 are. Tomorrow (Day 12), we'll learn how to convert a BF16 model to INT8 with minimal accuracy loss, the secret sauce of production AI chips.
