The Precision Spectrum
Neural networks rarely need 64-bit IEEE double precision. Most training and inference runs well on 8- to 16-bit numbers, with 32-bit values kept where accuracy matters. Each format trades range, precision, and cost differently:
FP32: The Gold Standard
32 bits: 1 sign + 8 exponent + 23 mantissa
Range: ~10^-38 to 10^38 (massive)
Precision: ~7 decimal digits
Use: Baseline for training and research; also the usual format for master weights and accumulation in mixed-precision training
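To make the 1 + 8 + 23 layout concrete, here is a minimal sketch that splits a number into its FP32 fields using only the standard library. The helper name `fp32_fields` is made up for this example.

```python
import struct

def fp32_fields(x: float):
    """Round x to FP32 and return (sign, biased exponent, mantissa bits)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # 23 explicit fraction bits
    return sign, exponent, mantissa

sign, exponent, mantissa = fp32_fields(-6.5)
# -6.5 = -1.625 * 2**2, so the unbiased exponent is 2
# and the fraction bits encode 0.625 (binary 101 followed by zeros).
print(sign, exponent - 127, hex(mantissa))  # 1 2 0x500000
```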
BF16: The TPU Choice
16 bits: 1 sign + 8 exponent + 7 mantissa (shares exponent with FP32!)
Range: Same as FP32 (~10^-38 to 10^38)
Precision: ~2-3 decimal digits (lower than FP32, but range unchanged)
Why it works: Training is usually more sensitive to range than to precision. Gradients span many orders of magnitude, while noisy gradient updates tolerate rounding error, so 7 bits of mantissa is often enough.
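Because BF16 is the top 16 bits of an FP32 value, the conversion can be sketched as a rounding step on the raw bits. The helper `to_bf16` below is an illustration only (round to nearest, ties to even, NaN not handled), not any particular library's implementation.

```python
import struct

def to_bf16(x: float) -> float:
    """Round x to the nearest BF16 value and return it as a Python float.

    Sketch only: assumes x is finite and within FP32 range.
    """
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    # Add a rounding bias so that dropping the low 16 bits rounds to nearest even.
    rounding = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(to_bf16(3.14159265))  # 3.140625: only 2-3 digits survive
print(to_bf16(1e-30))       # still nonzero: the FP32 exponent range is kept
```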
FP16: The Older Choice
16 bits: 1 sign + 5 exponent + 10 mantissa
Range: ~6×10^-5 to 65504 for normal values (much smaller!); subnormals reach down to ~6×10^-8
Precision: ~3 decimal digits
Problem: Small gradients underflow to zero, so the affected weights stop updating
Still used? Yes, but with "mixed precision" (see Day 14)
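The underflow problem is easy to reproduce with Python's `struct` module, whose `e` format is IEEE half precision. The gradient value and the loss scale of 1024 below are hypothetical numbers chosen for the demonstration; real mixed-precision training picks the scale differently.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

grad = 1e-8                      # hypothetical small gradient
print(to_fp16(grad))             # 0.0: below the smallest FP16 subnormal

scale = 1024.0                   # hypothetical loss scale
scaled = to_fp16(grad * scale)   # now representable in FP16
recovered = scaled / scale       # unscale in higher precision
print(recovered)                 # close to 1e-8 again
```

Scaling the loss (and therefore every gradient) before the FP16 step, then dividing the result in a wider format, is the basic idea behind loss scaling.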
TF32: NVIDIA's Compromise
32-bit container, 19 effective bits: S[1] E[8] M[10]
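Speed: Inputs stay in 32-bit FP32 storage; Tensor Cores round them to a 10-bit mantissa for the multiply and accumulate in FP32
Accuracy: FP16's precision with FP32's range
Use: Ampere (A100) and later, including Hopper (H100); accelerates FP32 matrix multiplies and convolutions on Tensor Cores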
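A rough way to see what TF32 keeps is to round an FP32 value to 10 mantissa bits in software. The sketch below assumes round-to-nearest-even, which is a choice made for illustration; the rounding mode used by actual hardware may differ. `to_tf32` and `to_fp16` are hypothetical helper names.

```python
import struct

def to_tf32(x: float) -> float:
    """Keep the FP32 sign and exponent but only 10 mantissa bits (sketch)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    # Drop the low 13 mantissa bits, rounding to nearest even (assumed mode).
    rounding = 0xFFF + ((bits >> 13) & 1)
    bits = (((bits + rounding) >> 13) << 13) & 0xFFFFFFFF
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def to_fp16(x: float) -> float:
    """Round x to IEEE half precision for comparison."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

print(to_tf32(1 / 3), to_fp16(1 / 3))  # same value: both keep 10 mantissa bits
print(to_tf32(1e-30), to_fp16(1e-30))  # TF32 keeps it, FP16 underflows to 0.0
```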
Comparison Table
| Format | Bits | Exponent | Mantissa | Range | Precision (decimal digits) | Best For |
|---|---|---|---|---|---|---|
| FP64 | 64 | 11 | 52 | ±10^308 | ~16 | Scientific computing |
| FP32 | 32 | 8 | 23 | ±10^38 | ~7 | Training |
| TF32 | 19 (32-bit container) | 8 | 10 | ±10^38 | ~3 | FP32 matmuls on Tensor Cores |
| BF16 | 16 | 8 | 7 | ±10^38 | ~2-3 | TPU, Training |
| FP16 | 16 | 5 | 10 | ±65504 | ~3 | GPU, with loss scaling |
| INT8 | 8 | n/a | n/a | [-128, 127] | 256 integer levels | Inference only |
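The range and precision columns follow directly from the exponent and mantissa widths. The sketch below assumes an IEEE-style encoding (biased exponent, implicit leading 1, top exponent code reserved for infinity and NaN); `format_stats` is a name invented for this example.

```python
import math

def format_stats(exp_bits: int, man_bits: int):
    """Return (largest normal, smallest normal, decimal digits) for a float format."""
    bias = 2 ** (exp_bits - 1) - 1
    max_normal = (2 - 2.0 ** -man_bits) * 2.0 ** bias
    min_normal = 2.0 ** (1 - bias)
    digits = (man_bits + 1) * math.log10(2)  # +1 for the implicit leading bit
    return max_normal, min_normal, digits

formats = [("FP64", 11, 52), ("FP32", 8, 23), ("TF32", 8, 10),
           ("BF16", 8, 7), ("FP16", 5, 10)]
for name, e, m in formats:
    hi, lo, digits = format_stats(e, m)
    print(f"{name}: max={hi:.3g} min_normal={lo:.3g} digits={digits:.1f}")
```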
Storage Cost
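Storage for the weights alone is the parameter count times the bytes per value. The sketch below uses a hypothetical 7-billion-parameter model as an example and ignores activations, optimizer state, and any quantization metadata.

```python
# Bytes needed to store one value in each format.
BYTES_PER_VALUE = {"FP64": 8, "FP32": 4, "TF32": 4, "BF16": 2, "FP16": 2, "INT8": 1}

def weight_storage_gb(num_params: int, fmt: str) -> float:
    """Weight storage in gigabytes (10**9 bytes) for a given format."""
    return num_params * BYTES_PER_VALUE[fmt] / 1e9

params = 7_000_000_000  # hypothetical model size
for fmt in BYTES_PER_VALUE:
    print(f"{fmt}: {weight_storage_gb(params, fmt):.0f} GB")
```

TF32 is listed at 4 bytes because values are stored as ordinary FP32; only the multiply uses fewer bits.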
Why AI Chips Choose Different Precisions
Google TPU: BF16 First
TPU v2/v3 are optimized for BF16 because:
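- Exponent matches FP32 (same dynamic range, so gradients rarely underflow or overflow)
- 16-bit values need half the memory and bandwidth of FP32
- Training tolerates the reduced mantissa precision
- Mixed precision: BF16 for matrix-multiply inputs and activations, FP32 for accumulation and weight updates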
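Why keep a wider accumulator at all? A small software experiment shows what happens when the running sum itself is stored in BF16. The helpers below emulate the formats with bit manipulation and are a sketch for intuition, not a model of any specific chip.

```python
import struct

def to_fp32(x: float) -> float:
    """Round x to FP32."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def to_bf16(x: float) -> float:
    """Round x to BF16 (nearest, ties to even); finite inputs only."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    rounding = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack(">f", struct.pack(">I", bits))[0]

values = [1.0] * 1000
acc_bf16 = 0.0
acc_fp32 = 0.0
for v in values:
    acc_bf16 = to_bf16(acc_bf16 + v)  # accumulator rounded to BF16 each step
    acc_fp32 = to_fp32(acc_fp32 + v)  # accumulator rounded to FP32 each step

print(acc_bf16, acc_fp32)  # 256.0 1000.0
```

With an 8-bit significand, 256 + 1 rounds back to 256, so the BF16 sum stalls. Keeping the accumulator in FP32 avoids this while the inputs stay in BF16.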
Apple Neural Engine: INT8 for Inference
The A17 Pro Neural Engine runs inference only (training happens offline). It also executes FP16 models, but INT8 has clear advantages on a phone:
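- INT8 is 2× smaller than BF16 (1 byte vs 2 bytes) and 4× smaller than FP32
- Integer multiply-accumulate units are smaller and cheaper in energy than floating-point ones; the exact speedup depends on the hardware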
- Models are quantized post-training on GPU, deployed as INT8
- Power: ~2W sustained (critical for battery-powered phones)
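As a preview of what "quantized post-training" means, here is a minimal sketch of symmetric per-tensor INT8 quantization. This is one common textbook scheme, shown under the assumption of a single scale for the whole tensor; it is not a description of Apple's toolchain, and the weight values are made up.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the INT8 codes."""
    return [x * scale for x in q]

weights = [0.02, -0.5, 0.31, 1.27]   # hypothetical weights
q, scale = quantize_int8(weights)
print(q)                             # [2, -50, 31, 127]
print(dequantize(q, scale))          # close to the original weights
```

Each weight now needs one byte plus a shared scale, at the cost of a rounding error of at most half a quantization step.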
NVIDIA H100: All of the Above
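H100 supports FP64, FP32, TF32, BF16, FP16, FP8, and INT8, so it covers most workloads:
- Automatic mixed precision (AMP): FP32 master weights with FP16 or BF16 compute; FP8 is available through the Transformer Engine
- Structured sparsity: up to 2× Tensor Core throughput when weights follow a 2:4 pattern (two zeros in every group of four), a feature carried over from Ampere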
- Flexible for researchers (training) and enterprises (inference)
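The sparsity speedup applies to a structured pattern: NVIDIA's sparse Tensor Cores expect two zeros in every group of four weights (2:4). The sketch below shows one simple way to produce that pattern, by magnitude pruning within each group. The function name and the weights are hypothetical, and real workflows usually fine-tune after pruning.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in each consecutive group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest magnitudes in this group.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return pruned

weights = [0.1, -0.9, 0.4, 0.05, 1.0, 0.2, -0.3, 0.0]  # hypothetical weights
print(prune_2_of_4(weights))
# [0.0, -0.9, 0.4, 0.0, 1.0, 0.0, -0.3, 0.0]
```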
Next: Quantization Techniques
Now you know what FP32 and BF16 are. Tomorrow (Day 12), we'll learn how to convert a BF16 model to INT8 with minimal accuracy loss, the secret sauce of production AI chips.