AI Chip Day 5 Enhanced — Floating Point & Precision

1. Floating Point Formats

Format	Bits	Range	Precision	Hardware Cost	Use Case
FP32	32	±3.4e38	~7 decimal digits	Baseline	Training reference
FP16	16	±6.5e4	~3 decimal digits	2x smaller, faster	Training (with care)
BF16	16	±3.4e38	~2 decimal digits	2x smaller, faster	Training, inference
TF32	19 (stored in 32)	±3.4e38	~3 decimal digits	Same storage as FP32, faster matrix multiply	Training sweet spot
INT8	8	-128 to 127	Quantized	4x smaller, 4x faster	Inference only
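The range and precision columns follow from the exponent and mantissa widths. Here is a minimal sketch, assuming IEEE-style encodings with an implicit leading bit and the all-ones exponent reserved for infinity/NaN; the helper name `format_stats` is made up for this illustration.

```python
import math

def format_stats(exp_bits, man_bits):
    """Return (max finite value, smallest normal value, decimal digits)."""
    bias = 2 ** (exp_bits - 1) - 1
    max_value = (2 - 2.0 ** -man_bits) * 2.0 ** bias
    min_normal = 2.0 ** (1 - bias)
    digits = (man_bits + 1) * math.log10(2)  # +1 for the implicit leading bit
    return max_value, min_normal, digits

# (exponent bits, explicit mantissa bits) for each format
FORMATS = {"FP32": (8, 23), "FP16": (5, 10), "BF16": (8, 7), "TF32": (8, 10)}

for name, (e, m) in FORMATS.items():
    max_value, min_normal, digits = format_stats(e, m)
    print(f"{name}: max={max_value:.3g} min_normal={min_normal:.3g} digits={digits:.1f}")
```

Formats that share an exponent width (FP32, BF16, TF32) end up with nearly the same maximum; the mantissa width only changes the digit count.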

2. FP32 (Full Precision)

Standard for training: 1 sign + 8 exponent + 23 mantissa bits

Advantages:

Wide range (normal values from about 1.2e-38 to 3.4e38 in magnitude)
Sufficient precision for gradients
No special handling needed

Disadvantages:

Large memory footprint (4 bytes per number)
Slower compute (wider multipliers and datapaths need more transistors per operation)
More power consumption
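To put the memory footprint in numbers, here is a rough sketch of weight storage per format. The parameter count is a hypothetical example, and the figures cover weights only (no activations, gradients, or optimizer state).

```python
BYTES_PER_VALUE = {"FP32": 4, "FP16": 2, "BF16": 2, "INT8": 1}

def weight_memory_gb(num_params, fmt):
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_VALUE[fmt] / 1e9

num_params = 1_000_000_000  # hypothetical 1-billion-parameter model

for fmt in BYTES_PER_VALUE:
    print(f"{fmt}: {weight_memory_gb(num_params, fmt):.1f} GB")
```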

3. FP16 (Half Precision)

Use in training: 1 sign + 5 exponent + 10 mantissa bits

Advantages:

2x smaller memory
Up to 2x faster compute (a datapath sized for one FP32 op can process two FP16 ops)
2x less bandwidth

Disadvantages:

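Limited range (max ±65504; values below ~6.1e-5 become subnormal, values below ~6e-8 flush to zero)
Precision loss (~3 decimal digits)
Gradient underflow (small gradients round to zero, so the affected weights stop updating)
Requires loss scaling (multiply the loss by a constant such as 1024 before the backward pass, then divide the gradients by the same constant before the weight update; a power of two avoids extra rounding error)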
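Gradient underflow and loss scaling can be reproduced with the standard library alone, since Python's `struct` module supports IEEE half precision through the `e` format character. This is a small sketch; the gradient value and the scale factor of 1024 are illustrative choices, not recommendations.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 (IEEE 754 binary16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

LOSS_SCALE = 1024.0  # illustrative scale factor (a power of two)
true_grad = 1e-8     # illustrative gradient, smaller than FP16 can represent

unscaled = to_fp16(true_grad)             # underflows to zero in FP16
scaled = to_fp16(true_grad * LOSS_SCALE)  # lands in the FP16 subnormal range
recovered = scaled / LOSS_SCALE           # unscale in higher precision

print(unscaled, recovered)
```

The unscaled gradient is lost entirely, while the scaled one survives the trip through FP16 and comes back within about 1% of its true value once divided by the scale in higher precision.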

4. BF16 (Brain Float)

Google's format: 1 sign + 8 exponent + 7 mantissa bits

Trade-off: Same range as FP32, less precision than FP16

Advantages:

Same exponent range as FP32 (underflow and overflow are far less likely than with FP16)
2x smaller than FP32
Loss scaling usually not needed
Widely supported (NVIDIA, Google, Intel)

Disadvantages:

Lower precision than FP16 (~2 decimal digits vs ~3)
Training accuracy can be slightly lower than FP32 (usually acceptable)
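The range-versus-precision trade-off can be seen by rounding the same values to both 16-bit formats. This sketch emulates BF16 by keeping the top 16 bits of the FP32 encoding with round-to-nearest-even; `to_bf16` is a helper written for this illustration and handles finite inputs only (no NaN handling).

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x):
    """Round a finite float to BF16 precision via its FP32 bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Range: a tiny value underflows in FP16 but survives in BF16.
print(to_fp16(1e-8), to_bf16(1e-8))

# Precision: near 1.0, FP16 resolves a step that BF16 cannot.
print(to_fp16(1.001), to_bf16(1.001))
```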

5. TF32 (Tensor Float)

NVIDIA format: Uses 32-bit storage, but only 19 bits are used in the matrix-multiply path (1 sign + 8 exponent + 10 mantissa)

Unique: Hardware reduces FP32 inputs to TF32 precision for matrix multiply, keeps FP32 for accumulation

Advantage: ~3x speedup over FP32 on matrix multiplies (workload dependent), with minimal accuracy loss
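The "reduced inputs, full-precision accumulation" idea can be emulated in software. This is a sketch under assumptions: the inputs are rounded to nearest-even (the exact hardware rounding mode is not stated here), a Python float stands in for the FP32 accumulator, and `to_tf32` and `tf32_dot` are names invented for this example.

```python
import struct

def to_tf32(x):
    """Round a finite float to TF32 precision (10 explicit mantissa bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x0FFF + ((bits >> 13) & 1)  # round to nearest, ties to even (assumed)
    bits &= 0xFFFFE000                   # clear the 13 low mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def tf32_dot(a, b):
    """Dot product with TF32-rounded inputs and a wider accumulator."""
    acc = 0.0  # Python float stands in for the FP32 accumulator
    for x, y in zip(a, b):
        acc += to_tf32(x) * to_tf32(y)
    return acc

a = [1.001, 2.5, -0.333]
b = [3.0, 0.125, 7.0]
exact = sum(x * y for x, y in zip(a, b))
approx = tf32_dot(a, b)
print(exact, approx, abs(approx - exact))
```

Only the inputs lose precision; the running sum does not, which is why the error stays small even over long dot products.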

6. INT8 Quantization (Inference)

Key insight: Neural networks are robust to quantization. A model can be trained in FP32 and then run inference in INT8.

Quantization:

FP32_value = -5.2
Scale = 127 / max_value = 127 / 10 = 12.7
INT8 = round(FP32_value * scale) = round(-5.2 * 12.7) = round(-66.04) = -66

Dequantization:

FP32_restored = INT8 / scale = -66 / 12.7 ≈ -5.197 (error ≈ 0.003)

Accuracy: ResNet-50 FP32 = 76%, ResNet-50 INT8 = 75.8% (about 0.2 percentage points lost)
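The worked example above can be sketched as a pair of functions. This assumes symmetric per-tensor quantization with a known maximum magnitude; the clamp to ±127 is an addition for out-of-range inputs, and the function names are illustrative.

```python
def quantize(value, max_value, qmax=127):
    """Map a float to a signed integer code using a symmetric scale."""
    scale = qmax / max_value
    q = round(value * scale)      # Python's round() sends exact ties to even
    q = max(-qmax, min(qmax, q))  # clamp values outside [-max_value, max_value]
    return q, scale

def dequantize(q, scale):
    """Map an integer code back to an approximate float."""
    return q / scale

q, scale = quantize(-5.2, max_value=10.0)
restored = dequantize(q, scale)
print(q, restored, abs(restored - (-5.2)))
```

The round-trip error is bounded by half a quantization step, which is 0.5 / scale ≈ 0.04 for this scale.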

7. Mixed Precision Training

Best of both worlds: FP16/BF16 for matrix multiply, FP32 for gradient accumulation and weight updates

Forward: FP16 (faster)
Backward: FP16 (faster)
Gradient accumulation: FP32 (stability)
Weight update: FP32 (precision)

Result: up to ~2x speedup, minimal accuracy loss
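To see why the weight update stays in FP32, compare a weight kept only in FP16 with an FP32 master copy of it. This is a toy sketch with a single weight; the learning rate and the constant gradient are hypothetical values chosen to make the effect visible.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

LR = 0.01     # hypothetical learning rate
GRAD = 1e-4   # hypothetical constant gradient
STEPS = 100

fp16_only = to_fp16(1.0)  # weight stored and updated in FP16
master = to_fp32(1.0)     # FP32 master copy of the same weight

for _ in range(STEPS):
    grad = to_fp16(GRAD)                        # backward pass yields an FP16 gradient
    fp16_only = to_fp16(fp16_only - LR * grad)  # update is rounded back to FP16
    master = to_fp32(master - LR * grad)        # update is applied to the FP32 master

print(fp16_only, master)
```

Each update (about 1e-6) is smaller than the FP16 spacing near 1.0, so the FP16-only weight never moves, while the FP32 master weight accumulates all 100 steps.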

8. Hardware Implications

Multiplier Unit Size

Precision	Multiplier Bits	Area (Relative)	Delay (Relative)
FP32	24×24	1.0x	1.0x
FP16	11×11	0.2x	0.6x
INT8	8×8	0.1x	0.4x

Design choice: Build the multiplier for the widest mantissa needed (11×11 for FP16), then reuse a subset of it for INT8 (8×8). The added area cost is small.
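The area column is close to what a first-order model predicts. This sketch assumes the area of an n×n array multiplier grows roughly with n², which is a simplification that ignores exponent logic, rounding, and routing.

```python
def relative_area(mantissa_bits, baseline_bits=24):
    """Area of an n x n multiplier relative to the baseline, assuming area ~ n^2."""
    return (mantissa_bits / baseline_bits) ** 2

for name, bits in [("FP32", 24), ("FP16", 11), ("INT8", 8)]:
    print(f"{name}: {bits}x{bits} -> {relative_area(bits):.2f}x area")
```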

9. Checklist: Precision Selection

✅ Training: FP32 (reference), BF16 (recommended), TF32 (sweet spot)
✅ Inference: FP32 (high quality), BF16 (medium), INT8 (mobile/edge)
✅ Avoid FP16 alone: Needs loss scaling, careful handling
✅ INT8 for mobile: 4x smaller than FP32 in model size and memory bandwidth, with lower power
✅ Quantization-aware training: Simulate INT8 during training for better accuracy
✅ Measure accuracy: Test before deploying lower precision

Next (Day 6): Quantization techniques and post-training quantization.