1. Floating Point Formats
| Format | Bits | Range | Precision | Hardware Cost | Use Case |
|---|---|---|---|---|---|
| FP32 | 32 | ±3.4e38 | ~7 decimal digits | Baseline | Training reference |
| FP16 | 16 | ±6.5e4 | ~3 decimal digits | 2x smaller, faster | Training (with care) |
| BF16 | 16 | ±3.4e38 | ~2 decimal digits | 2x smaller, faster | Training, inference |
| TF32 | 19 used (stored in 32) | ±3.4e38 | ~3 decimal digits | Same storage as FP32, faster Tensor Core math | Training sweet spot |
| INT8 | 8 | -128 to 127 | Quantized | 4x smaller, 4x faster | Inference only |
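The range and precision columns follow from the exponent and mantissa widths of each format. Below is a minimal sketch of that arithmetic, assuming IEEE-754-style encoding (implicit leading bit, all-ones exponent reserved for infinity/NaN). The helper name `format_stats` is hypothetical, introduced only for illustration.

```python
import math

def format_stats(exp_bits, man_bits):
    """Return (largest finite value, approximate decimal digits) for an
    IEEE-754-style format with the given exponent and mantissa widths."""
    bias = 2 ** (exp_bits - 1) - 1
    # The all-ones exponent is reserved, so the largest unbiased exponent equals the bias.
    max_value = (2 - 2.0 ** -man_bits) * 2.0 ** bias
    # +1 accounts for the implicit leading 1 bit of normal numbers.
    digits = (man_bits + 1) * math.log10(2)
    return max_value, digits

for name, e, m in [("FP32", 8, 23), ("FP16", 5, 10), ("BF16", 8, 7), ("TF32", 8, 10)]:
    max_value, digits = format_stats(e, m)
    print(f"{name}: max ~ {max_value:.3g}, ~{digits:.1f} decimal digits")
```

Under these assumptions, every format with 8 exponent bits (FP32, BF16, TF32) reaches roughly the same maximum magnitude, while the digit count depends only on the mantissa width.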
2. FP32 (Full Precision)
Standard for training: 1 sign + 8 exponent + 23 mantissa bits
Advantages:
- Wide range (normal magnitudes from about 1.2e-38 to 3.4e38)
- Sufficient precision for gradients
- No special handling needed
Disadvantages:
- Large memory footprint (4 bytes per number)
- Slower compute (wider multipliers need more transistors)
- More power consumption
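To make the 1 + 8 + 23 bit layout concrete, here is a small sketch that splits a value into its FP32 fields using only the standard `struct` module. The function name `fp32_fields` is a hypothetical helper, not part of any library.

```python
import struct

def fp32_fields(x):
    """Split x (rounded to FP32) into (sign, biased exponent, mantissa) bit fields."""
    # Pack as a big-endian 32-bit float, then reinterpret the bytes as an unsigned int.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF       # 23 bits, implicit leading 1 not stored
    return sign, exponent, mantissa

print(fp32_fields(1.0))
print(fp32_fields(-2.5))
```

For example, -2.5 is -1.25 × 2^1, so the sign bit is 1, the biased exponent is 128, and the mantissa field encodes the fractional 0.25.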
3. FP16 (Half Precision)
Use in training: 1 sign + 5 exponent + 10 mantissa bits
Advantages:
- 2x smaller memory
- Up to 2x faster compute (an FP32-wide datapath can process two FP16 operations)
- Half the memory bandwidth
Disadvantages:
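- Limited range (max ±65504; values below about 6e-5 become subnormal, and values below about 6e-8 flush to zero)
- Precision loss (~3 decimal digits)
- Gradient underflow (small gradients round to zero, which stalls learning)
- Requires loss scaling (multiply the loss by a constant, typically a power of two such as 1024, then divide the gradients by the same constant before the weight update)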
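The underflow problem and the loss-scaling fix can be reproduced with Python's `struct` module, which supports IEEE half precision via the `"e"` format code. This is a minimal sketch: the gradient value and the scale of 1024 are illustrative assumptions, `to_fp16` is a hypothetical helper, and a Python float stands in for the higher-precision unscaling step.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

grad = 1e-8      # illustrative gradient, smaller than FP16 can represent
scale = 1024.0   # illustrative loss scale (a power of two)

# Without scaling, the gradient underflows to zero in FP16.
unscaled = to_fp16(grad)

# With scaling, the gradient survives the FP16 round trip.
# Unscaling happens in higher precision (a Python float here).
scaled = to_fp16(grad * scale)
recovered = scaled / scale

print("FP16 without scaling:", unscaled)
print("Recovered after scaling:", recovered)
```

Scaling the loss scales every gradient by the same factor through the chain rule, which is why multiplying once at the loss is enough.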
4. BF16 (Brain Float)
Google's format: 1 sign + 8 exponent + 7 mantissa bits
Trade-off: Same range as FP32, less precision than FP16
Advantages:
- Full FP32 exponent range (underflow and overflow are far less likely than with FP16)
- 2x smaller than FP32
- Loss scaling usually not needed
- Widely supported (NVIDIA, Google, Intel)
Disadvantages:
- Lower precision than FP16 (2 digits vs 3)
- Final training accuracy can be slightly lower (usually within acceptable limits)
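Because BF16 keeps the FP32 sign and exponent and only shortens the mantissa, an FP32 value can be converted by keeping its upper 16 bits. The sketch below assumes round-to-nearest-even and ignores NaN handling for brevity. `to_bf16` is a hypothetical helper.

```python
import struct

def to_bf16(x):
    """Round x to BF16 precision (upper 16 bits of its FP32 encoding).

    Assumes round-to-nearest-even; NaN inputs are not handled specially.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Add 0x7FFF plus the lowest kept bit so that exact ties round to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    # Mask to 32 bits in case the addition carried out of the top bit.
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFFFFFF))[0]

print(to_bf16(1e-8))     # tiny value stays nonzero thanks to the 8-bit exponent
print(to_bf16(70000.0))  # above the FP16 maximum, still finite but coarsely rounded
```

The two printed values illustrate the trade-off: the range matches FP32, while the 7-bit mantissa gives a coarser result than FP16 would for values both formats can hold.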
5. TF32 (Tensor Float)
NVIDIA format: Uses 32-bit storage, but only 19 bits are significant (1 sign + 8 exponent + 10 mantissa)
Unique: Tensor Core hardware rounds FP32 inputs to TF32 for the matrix multiply and keeps FP32 for accumulation
Advantage: ~3x speedup over FP32 (workload and hardware dependent), with minimal accuracy loss in most training runs
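The "reduced-precision inputs, full-precision accumulation" idea can be emulated in software. This is a rough sketch only: it assumes round-to-nearest-even when dropping the 13 low mantissa bits (actual hardware rounding may differ), and a Python float stands in for the FP32 accumulator. `to_tf32` and `tf32_dot` are hypothetical names.

```python
import struct

def to_tf32(x):
    """Round x to TF32 precision: FP32 sign and exponent, top 10 mantissa bits.

    Assumption: round-to-nearest-even; NaN inputs are not handled specially.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # 23 - 10 = 13 low mantissa bits are dropped.
    rounding_bias = 0x0FFF + ((bits >> 13) & 1)
    bits = ((bits + rounding_bias) >> 13) << 13
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFFFFFF))[0]

def tf32_dot(a, b):
    """Dot product with TF32-rounded inputs and a higher-precision accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += to_tf32(x) * to_tf32(y)
    return acc

print(tf32_dot([1.0, 2.0], [3.0, 4.0]))
print(to_tf32(1 + 2 ** -12))  # below TF32 resolution near 1.0, rounds back to 1.0
```

Only the inputs lose precision in this model. The running sum keeps its full width, which is why the accuracy impact tends to be small.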
6. INT8 Quantization (Inference)
Key insight: Neural networks are often robust to quantization, so a model can be trained in FP32 and then run inference in INT8.
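As an illustration of what "infer in INT8" involves, here is a minimal sketch of symmetric per-tensor quantization, one common scheme among several. The function names and sample weights are assumptions made for this example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-128, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-128, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map INT8 codes back to approximate float values."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.01, 1.27]  # illustrative values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale)
print(restored)
```

The round-trip error per value is at most half the scale, so tensors with a small dynamic range lose little information.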
7. Mixed Precision Training
Best of both worlds: FP32 for master weights and accumulation, FP16/BF16 for matrix multiply
- Forward: FP16 (faster)
- Backward: FP16 (faster)
- Gradient accumulation: FP32 (stability)
- Weight update: FP32 (precision)
Result: up to ~2x speedup, minimal accuracy loss
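The reason the weight update stays in higher precision can be shown with a toy experiment: a small update applied directly to an FP16 weight is rounded away, while a higher-precision master copy accumulates it. This is a sketch with illustrative numbers. `to_fp16` is a hypothetical helper, and a Python float stands in for the FP32 master weight.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

update = 1e-4             # illustrative lr * gradient, below FP16 resolution near 1.0
w_fp16 = to_fp16(1.0)     # weight stored and updated purely in FP16
w_master = 1.0            # master weight kept in higher precision

for _ in range(100):
    w_fp16 = to_fp16(w_fp16 - update)   # each update is rounded away
    w_master -= update                  # updates accumulate

print("FP16-only weight:", w_fp16)
print("Master weight:", w_master)
print("FP16 copy of master:", to_fp16(w_master))
```

In this setup the FP16-only weight never moves, whereas the FP16 copy cast from the master weight reflects the accumulated change.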
8. Hardware Implications
Multiplier Unit Size
| Precision | Multiplier Bits | Area (Relative) | Delay (Relative) |
|---|---|---|---|
| FP32 | 24×24 | 1.0x | 1.0x |
| FP16 | 11×11 | 0.2x | 0.6x |
| INT8 | 8×8 | 0.1x | 0.4x |
Design choice: Build the multiplier for FP16 (11×11), then reuse an 8×8 subset of it for INT8. The extra area for supporting both is small.
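The area column is roughly what a first-order model predicts if multiplier area grows with the square of the operand width. That quadratic model is an assumption here (the notes do not say how the figures were derived), and `relative_area` is a hypothetical helper.

```python
def relative_area(bits, ref_bits=24):
    """First-order estimate: array multiplier area scales with operand width squared."""
    return (bits / ref_bits) ** 2

for name, bits in [("FP32", 24), ("FP16", 11), ("INT8", 8)]:
    print(f"{name}: {bits}x{bits} multiplier, ~{relative_area(bits):.2f}x area")
```

Real designs also spend area on exponent logic, normalization, and rounding, so these ratios are only a rough guide.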
9. Checklist: Precision Selection
- ✅ Training: FP32 (reference), BF16 (recommended), TF32 (sweet spot)
- ✅ Inference: FP32 (high quality), BF16 (medium), INT8 (mobile/edge)
- ✅ Avoid FP16 alone: Needs loss scaling, careful handling
- ✅ INT8 for mobile: roughly 4x smaller than FP32, with lower power and memory bandwidth
- ✅ Quantization-aware training: Simulate INT8 during training for better accuracy
- ✅ Measure accuracy: Test before deploying lower precision
Next (Day 6): Quantization techniques and post-training quantization.