Day 11

Floating Point Basics

FP32, BF16, FP16, TF32 — precision formats that power neural networks. Why AI chips use different precisions for different tasks.

The Precision Spectrum

Neural networks rarely need 64-bit IEEE double precision. Most workloads run well on 8- to 16-bit numbers, with FP32 kept where extra precision matters. Different formats optimize for different workloads:

FP64 (64-bit): S[1] E[11] M[52]  ← Double precision (scientific computing)
FP32 (32-bit): S[1] E[8]  M[23]  ← Standard neural network precision
TF32 (32-bit): S[1] E[8]  M[10]  ← NVIDIA Tensor Float (19 bits used in a 32-bit container; fast, low precision)
BF16 (16-bit): S[1] E[8]  M[7]   ← Brain Float (TPU choice, good range)
FP16 (16-bit): S[1] E[5]  M[10]  ← Half precision (older, smaller range)
INT8 (8-bit):  two's-complement integer, no exponent  ← Integer quantization (mostly inference)
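To make the S/E/M layout concrete, here is a minimal sketch that splits an FP32 value into its three bit fields using only Python's standard `struct` module. The helper name `fp32_fields` is made up for this illustration.

```python
import struct

def fp32_fields(x):
    """Round x to FP32 and return its (sign, exponent, mantissa) bit fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # S[1]
    exponent = (bits >> 23) & 0xFF    # E[8], stored with a bias of 127
    mantissa = bits & 0x7FFFFF        # M[23], implicit leading 1 for normal values
    return sign, exponent, mantissa

# -6.5 = -1.625 x 2^2, so the unbiased exponent is 2 and the fraction is 0.625.
sign, exponent, mantissa = fp32_fields(-6.5)
print(sign, exponent - 127, hex(mantissa))  # 1 2 0x500000
```

The other formats in the list follow the same pattern with different field widths.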

FP32: The Gold Standard

32 bits: 1 sign + 8 exponent + 23 mantissa

Range: ~10^-38 to 10^38 (massive)

Precision: ~7 decimal digits

Use: Baseline for training and research, especially for weight updates and accumulations where small increments must not be rounded away
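The "~7 decimal digits" figure can be checked directly. The sketch below rounds Python floats (which are FP64) to FP32 through the standard `struct` module; `to_fp32` is a hypothetical helper name, not a library function.

```python
import struct

def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# With a 24-bit significand, FP32 stops representing every integer at 2**24.
big = 2.0 ** 24                       # 16,777,216
print(to_fp32(big + 1.0) == big)      # True: the +1 is rounded away

# 0.1 is not exactly representable; the FP32 error shows up around the 8th digit.
print(abs(to_fp32(0.1) - 0.1))
```

This is the kind of lost increment that matters when many small updates are added to a large accumulator.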

BF16: The TPU Choice

16 bits: 1 sign + 8 exponent + 7 mantissa (same exponent width as FP32!)

Range: Same as FP32 (~10^-38 to 10^38)

Precision: ~2-3 decimal digits (lower, but range unchanged)

Why it works: For most neural network training, dynamic range matters more than fine precision. Gradient noise usually dwarfs rounding error, so 7 bits of mantissa is often enough, especially when sums are accumulated in FP32.

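Key insight: BF16 trades mantissa precision for exponent range. This suits AI workloads because:
- Intermediate values during training can be very small (10^-10) or very large (10^6)
- BF16 represents both without underflow or overflow, since it covers the same range as FP32
- FP16 loses precision below ~6×10^-5, flushes to zero below ~6×10^-8, and overflows above 65,504, so such values are lost unless they are rescaled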
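The range argument can be sketched in a few lines. Because BF16 is the top 16 bits of an FP32 encoding, a conversion can be emulated by rounding away the low 16 bits. The helper `to_bf16` below is a hypothetical name; it assumes round-to-nearest-even, does not handle NaN, and raises `OverflowError` for inputs outside the FP32 range.

```python
import struct

def to_bf16(x):
    """Emulate BF16 rounding: keep the top 16 bits of the FP32 encoding of x."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Bias the discarded low 16 bits so truncation becomes round-to-nearest-even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# Both a tiny and a large value survive, each with under 1% relative error.
for value in (1e-10, 1e6):
    approx = to_bf16(value)
    print(f"{value:g} -> {approx:g} (relative error {abs(approx - value) / value:.2%})")

print(to_bf16(3.14159))  # 3.140625: only about 2-3 significant digits remain
```

The values keep their magnitude but lose digits, which is the trade described above.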

FP16: The Older Choice

16 bits: 1 sign + 5 exponent + 10 mantissa

Range: ~6×10^-5 (smallest normal value; subnormals extend to ~6×10^-8) to 65,504 (much smaller!)

Precision: ~3 decimal digits

Problem: Small gradients underflow to zero, so the affected weights stop updating

Still used? Yes, but with "mixed precision" (see Day 14)
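The underflow problem, and the scaling trick that mixed precision relies on, can be illustrated with the standard `struct` module, whose `e` format is IEEE half precision. The gradient value and the scale factor of 1024 are made-up numbers for illustration, and `to_fp16` is a hypothetical helper.

```python
import struct

def to_fp16(x):
    """Round x to IEEE half precision (FP16) and return it as a Python float."""
    return struct.unpack(">e", struct.pack(">e", x))[0]

grad = 1e-8                # hypothetical small gradient
print(to_fp16(grad))       # 0.0: below half of FP16's smallest subnormal (~6e-8)

# Scaling: multiply before the FP16 round trip, divide afterwards in higher precision.
scale = 1024.0             # assumed scale factor, a power of two
scaled = to_fp16(grad * scale)
recovered = scaled / scale
print(scaled > 0.0, recovered)
```

The recovered value is close to the original gradient rather than zero, which is why FP16 training is usually paired with scaling.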

TF32: NVIDIA's Compromise

32-bit container, 19 effective bits: S[1] E[8] M[10]

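Speed: Inputs are stored as ordinary FP32; Tensor Cores round them to a 10-bit mantissa for the multiply and accumulate the products in FP32

Accuracy: FP16-level precision (10-bit mantissa) with FP32's range

Use: Tensor Core matrix multiplies and convolutions on Ampere (A100) and later GPUs, including Hopper (H100)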
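A rough way to see what TF32 does to an input is to drop 13 of FP32's 23 mantissa bits. The sketch below is an emulation in plain Python, not NVIDIA's implementation; `to_tf32` is a hypothetical name, round-to-nearest-even is an assumption about the rounding mode, and NaN is not handled.

```python
import struct

def to_tf32(x):
    """Emulate TF32 input rounding: keep 10 of the 23 FP32 mantissa bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Bias the 13 discarded bits so truncation becomes round-to-nearest-even.
    bits += 0x0FFF + ((bits >> 13) & 1)
    bits &= 0xFFFFE000
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(to_tf32(1.0 + 2.0 ** -10) == 1.0 + 2.0 ** -10)  # True: fits in 10 mantissa bits
print(to_tf32(1.0 + 2.0 ** -12) == 1.0)               # True: too fine, rounded away
print(to_tf32(1e30))                                   # FP32-sized exponent is kept
```

The first two lines show FP16-like precision, and the last shows FP32-like range.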

Comparison Table

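Format  Bits  Exponent  Mantissa  Range        Accuracy                 Best For
FP64    64    11        52        ±10^308      Very high (~16 digits)   Scientific
FP32    32    8         23        ±10^38       Good (~7 digits)         Training
TF32    32    8         10        ±10^38       Fair (~3 digits)         Mixed precision
BF16    16    8         7         ±10^38       Fair (~2-3 digits)       TPU, Training
FP16    16    5         10        ±6.5×10^4    Fair (~3 digits)         GPU, with scaling
INT8    8     -         7         [-128,127]   Poor (256 levels)        Inference only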
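The range and accuracy columns follow from the exponent and mantissa widths alone. As a minimal sketch (assuming IEEE-style encoding with the top exponent code reserved for infinity and NaN, which holds for the floating-point formats listed), the function below derives them; `format_stats` is a hypothetical name.

```python
import math

def format_stats(exp_bits, man_bits):
    """Derive range and precision from exponent and mantissa bit widths."""
    bias = 2 ** (exp_bits - 1) - 1
    return {
        "max_normal": (2 - 2.0 ** -man_bits) * 2.0 ** bias,
        "min_normal": 2.0 ** (1 - bias),
        # The significand has man_bits stored bits plus one implicit leading bit.
        "decimal_digits": (man_bits + 1) * math.log10(2),
    }

formats = {"FP64": (11, 52), "FP32": (8, 23), "TF32": (8, 10),
           "BF16": (8, 7), "FP16": (5, 10)}
for name, (e, m) in formats.items():
    s = format_stats(e, m)
    print(f"{name}: max {s['max_normal']:.3g}, min normal {s['min_normal']:.3g}, "
          f"~{s['decimal_digits']:.1f} digits")
```

Note that TF32 and FP16 come out with the same digit count, and BF16, TF32 and FP32 share the same range, because those properties depend only on the mantissa and exponent widths respectively.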

Storage Cost

A 1-billion-parameter model:
- FP32: 1B × 4 bytes = 4 GB
- TF32: 1B × 4 bytes = 4 GB (same storage, faster ops)
- BF16: 1B × 2 bytes = 2 GB (50% smaller!)
- INT8: 1B × 1 byte = 1 GB (75% smaller!)

Example: GPT-3 (175B params)
- FP32: 700 GB (!!)
- BF16: 350 GB (fits on TPU Pod)
- INT8: 175 GB (fits on 2-4 GPUs)
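The arithmetic above is easy to reproduce for any model size. This sketch counts weight storage only (no activations, gradients, or optimizer state) and uses decimal gigabytes; the names `BYTES_PER_PARAM` and `weight_storage_gb` are made up for the example.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"FP32": 4, "TF32": 4, "BF16": 2, "FP16": 2, "INT8": 1}

def weight_storage_gb(num_params, fmt):
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9

for fmt in ("FP32", "BF16", "INT8"):
    print(f"GPT-3 (175B params) in {fmt}: {weight_storage_gb(175_000_000_000, fmt):.0f} GB")
```

Training needs considerably more memory than these figures, since gradients and optimizer state are stored alongside the weights.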

Why AI Chips Choose Different Precisions

Google TPU: BF16 First

TPU v2/v3 are optimized for BF16 because:

Apple Neural Engine: INT8 Only

A17 Pro Neural Engine runs inference only (training happens offline):

NVIDIA H100: All of the Above

H100 supports FP32, TF32, FP8, FP16, BF16—it's flexible:

Design insight: Precision choice is a constrained optimization problem:
- Training: BF16 or TF32 for stability; FP16 needs loss scaling to keep gradients from underflowing
- Inference: INT8/INT4 is often fine for deployed models (quantized offline)
- Mobile: INT8 dominates (power budget)
- Data center: Mix of FP32 + BF16 + INT8

Next: Quantization Techniques

Now you know what FP32 and BF16 are. Tomorrow (Day 12), we'll learn how to convert a BF16 model to INT8 without losing accuracy—the secret sauce of production AI chips.