The Core Problem: FP32 → INT8
A typical transformer weight:
FP32: 0.0234567 (32 bits, floating-point)
INT8: 12 (8 bits, integer: -128 to +127; the exact code depends on the scale factor)
The challenge: store each weight in 4× fewer bits while keeping the model accurate
Fixed-Point Representation
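INT8 is a two's-complement integer. It acts as a fixed-point number once it is paired with a scale factor, which sets the implicit binary point:
INT8 bit layout: S[1] | V[7] (sign bit with weight -128, plus 7 value bits; two's complement, not sign-magnitude)
Range: -128 to +127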
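The -128 to +127 range follows from two's-complement encoding. A minimal sketch, assuming hypothetical helper names `int8_from_bits` and `int8_to_bits` (they are not from any library), shows how the same 8 bits map to signed values:

```python
def int8_from_bits(byte):
    """Interpret an unsigned byte (0..255) as a two's-complement INT8."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in 0..255")
    # The top bit carries a weight of -128 instead of +128.
    return byte - 256 if byte & 0x80 else byte


def int8_to_bits(value):
    """Encode a signed value in -128..127 as its 8-bit pattern."""
    if not -128 <= value <= 127:
        raise ValueError("expected a value in -128..127")
    return value & 0xFF


for bits in (0b00000000, 0b01111111, 0b10000000, 0b11111111):
    print(f"{bits:08b} -> {int8_from_bits(bits)}")
# 00000000 -> 0
# 01111111 -> 127
# 10000000 -> -128
# 11111111 -> -1
```

There is one more negative code than positive, which is why symmetric schemes usually leave -128 unused and clamp to ±127.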
Quantization formula (Symmetric):
INT8_value = round( FP32_value / scale_factor )
scale_factor = max(|FP32_values|) / 127
Example:
FP32 weights: [-0.5, -0.2, 0.1, 0.3, 0.5]
max(abs) = 0.5
scale = 0.5 / 127 = 0.003937
Quantized:
0.5 / 0.003937 = 127.0 → INT8: 127
0.3 / 0.003937 ≈ 76.2 → INT8: 76
0.1 / 0.003937 ≈ 25.4 → INT8: 25
-0.2 / 0.003937 ≈ -50.8 → INT8: -51
-0.5 / 0.003937 = -127.0 → INT8: -127
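The worked example above can be reproduced in a few lines. This is a sketch of per-tensor symmetric quantization under the formula given here; the function name `quantize_symmetric` and the clamping step are illustrative assumptions, not a specific framework's API:

```python
def quantize_symmetric(weights, num_bits=8):
    """Symmetric quantization: q = round(w / scale), scale = max|w| / qmax."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for INT8
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return [0] * len(weights), 1.0        # all-zero tensor: any scale works
    scale = max_abs / qmax
    # Clamp so floating-point noise can never push a code outside [-qmax, qmax].
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


weights = [-0.5, -0.2, 0.1, 0.3, 0.5]
q, scale = quantize_symmetric(weights)
print(q)                 # [-127, -51, 25, 76, 127]
print(round(scale, 6))   # 0.003937
```

Note that Python's `round` rounds exact halves to the nearest even integer; none of the values in this example land on a half, so the choice of tie-breaking rule does not matter here.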
De-quantization (Inference)
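Multiplying two INT8 values produces a wider integer (up to 16 bits), and sums of such products are kept in a 32-bit accumulator. To get back to real values, multiply by both scale factors:
During inference:
y = (INT8_x * INT8_w) * (scale_x * scale_w) + bias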
Example computation:
x = 42 (INT8), w = 38 (INT8)
scale_x = 1/2048, scale_w = 1/4096
y = 42 * 38 * (1/2048) * (1/4096) = 1596 / 8388608 ≈ 0.00019
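The same computation, extended to a dot product, might look like the sketch below. It assumes one scale per tensor and a floating-point bias; `quantized_dot` is a hypothetical name. The point it illustrates is that the inner loop stays in integers and the scales are applied once at the end:

```python
def quantized_dot(xs_q, ws_q, scale_x, scale_w, bias=0.0):
    """Integer dot product followed by a single rescale to real values."""
    acc = 0                      # stands in for the 32-bit hardware accumulator
    for xq, wq in zip(xs_q, ws_q):
        acc += xq * wq           # integer-only multiply-accumulate
    return acc * (scale_x * scale_w) + bias


# Single-element case from the example above: 42 * 38 = 1596.
y = quantized_dot([42], [38], 1 / 2048, 1 / 4096)
print(y)   # approximately 0.00019
```

Applying the combined scale once per output, rather than once per product, is what lets the hardware avoid floating-point work inside the loop.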
INT8 vs INT4
| Format | Bits | Range | Use Case | Typical Accuracy Loss |
|---|---|---|---|---|
| FP32 | 32 | ≈ ±3.4 × 10^38 | Training | baseline |
| INT8 | 8 | [-128, 127] | Inference (standard) | 0.5-1% |
| INT4 | 4 | [-8, 7] | Mobile/Edge | 2-5% |
| INT2 | 2 | [-2, 1] | Ultra-edge (research) | 10%+ |
Why Hardware Loves INT8
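- 8-bit multiply: 8×8→16 bit operation (a small, cheap multiplier in silicon)
- Accumulation: each product is at most 2^14 in magnitude, so a 32-bit accumulator can sum 256 INT8 products (and in fact more than 100,000) without overflow
- Power: an integer multiply uses several times less energy than a floating-point multiply (~4× is a rough rule of thumb; the ratio depends on bit widths and process node)
- Memory: 4× smaller weights (4 GB → 1 GB for a billion-parameter model)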
- Bandwidth: 4× more weights per memory transaction
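The accumulator and memory bullets above can be checked with a little arithmetic. The sketch below uses only the INT8 and INT32 ranges; `weight_bytes` is a hypothetical helper that counts weight storage only (no scales, activations, or metadata):

```python
INT8_MIN, INT8_MAX = -128, 127
INT32_MAX = 2**31 - 1

# Largest possible magnitude of one INT8 x INT8 product.
worst_product = INT8_MIN * INT8_MIN          # 16384 == 2**14

# Number of worst-case products a signed 32-bit accumulator can hold.
safe_terms = INT32_MAX // worst_product


def weight_bytes(num_params, bits_per_weight):
    """Storage needed for the weights alone."""
    return num_params * bits_per_weight // 8


print(worst_product, safe_terms)             # 16384 131071
print(weight_bytes(10**9, 32) / 1e9)         # 4.0  (GB, FP32)
print(weight_bytes(10**9, 8) / 1e9)          # 1.0  (GB, INT8)
```

So a 256-term accumulation sits far below the overflow limit; very long dot products are the case that needs attention.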
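Apple Neural Engine: the ANE in chips such as the A17 Pro is an inference-only accelerator. Why low precision? A smartphone has a power budget of only a few watts, and narrow datapaths cost far less energy and silicon area than FP32. The ANE is not strictly INT8-only, though: Apple's own guidance describes FP16 execution on it, with INT8 offered as a further saving on newer chips.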
NVIDIA H100 (FP8 support): Hopper added FP8 in two variants, E5M2 (1 sign + 5 exponent + 2 mantissa bits) and E4M3 (1 sign + 4 exponent + 3 mantissa bits), for ultra-low precision. Both are still floating-point formats, squeezed into 8 bits.
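To make the 1 + 5 + 2 bit split concrete, here is a sketch of a decoder for that layout. It assumes IEEE-754-style conventions (exponent bias 15, subnormals, and an all-ones exponent reserved for infinity and NaN); `decode_fp8_e5m2` is a hypothetical name, not a vendor API:

```python
import math


def decode_fp8_e5m2(byte):
    """Decode 8 bits laid out as 1 sign | 5 exponent | 2 mantissa (bias 15)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in 0..255")
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 2) & 0x1F
    man = byte & 0x03
    if exp == 0x1F:                               # all-ones exponent
        return sign * math.inf if man == 0 else math.nan
    if exp == 0:                                  # subnormal: no implicit leading 1
        return sign * (man / 4) * 2.0 ** -14
    return sign * (1 + man / 4) * 2.0 ** (exp - 15)


print(decode_fp8_e5m2(0x3C))   # 1.0
print(decode_fp8_e5m2(0xC0))   # -2.0
print(decode_fp8_e5m2(0x7B))   # 57344.0 (largest finite value)
```

With only 2 mantissa bits there are just four representable values per power of two, which is the precision price paid for the wide exponent range.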
Asymmetric Quantization
What if weights don't center around zero?
Weight distribution:
Symmetric: [====|====] (centered on 0, e.g. spanning -50 to 50)
Asymmetric: [ ========] (skewed right, e.g. spanning -10 to 60)
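Asymmetric formula (codes run from 0 to 255, i.e. an unsigned 8-bit range; subtract 128 to store them as signed INT8):
q = round( (FP32 - zero_point) / scale )
scale = (max - min) / 255
Here zero_point is the real-valued minimum of the range. Many frameworks use the closely related form q = round(FP32 / scale) + zero_point, where zero_point is an integer.
Example: weights range from 0.1 to 0.5
zero_point = 0.1 (minimum)
scale = (0.5 - 0.1) / 255 ≈ 0.00157
0.5 → (0.5-0.1) / 0.00157 ≈ 255 (uses the full 8-bit range!)
0.3 → (0.3-0.1) / 0.00157 ≈ 127.5 with the unrounded scale → 127 or 128, depending on the rounding rule
0.1 → (0.1-0.1) / 0.00157 = 0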
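A sketch of this min-offset scheme follows. It produces codes in 0..255 (an unsigned 8-bit range) and assumes per-tensor min and max; the names `quantize_asymmetric`, `dequantize_asymmetric`, and `offset` are illustrative. Because 0.3 falls almost exactly on a rounding boundary (127.5) in the example above, the sketch uses 0.2 and 0.4 as interior values instead:

```python
def quantize_asymmetric(values, num_bits=8):
    """q = round((v - offset) / scale), with offset = min(values)."""
    levels = 2 ** num_bits - 1                 # 255 steps for 8 bits
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values), 1.0, lo      # constant tensor
    scale = (hi - lo) / levels
    q = [max(0, min(levels, round((v - lo) / scale))) for v in values]
    return q, scale, lo                        # lo plays the role of zero_point in the text


def dequantize_asymmetric(q, scale, offset):
    """Map codes back to approximate real values."""
    return [offset + scale * code for code in q]


values = [0.1, 0.2, 0.4, 0.5]
q, scale, offset = quantize_asymmetric(values)
print(q)                                       # [0, 64, 191, 255]
restored = dequantize_asymmetric(q, scale, offset)
print(max(abs(a - b) for a, b in zip(values, restored)) <= scale / 2 + 1e-12)  # True
```

Compared with the symmetric scheme, all 256 codes cover the range the weights actually occupy, at the cost of carrying an extra offset through the arithmetic.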
Hardware Implementation
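INT8 multiplier in hardware (simplified):
module int8_multiplier (
    input  signed [7:0]  a, b,
    output signed [15:0] product
);
    assign product = a * b;  // signed 8-bit × 8-bit = 16-bit result
endmodule
// In a MAC unit (module name and ports are illustrative; the product is accumulated one cycle after it is computed):
module int8_mac (
    input clk, rst,
    input signed [7:0] a, b,
    output reg signed [31:0] accum
);
    reg signed [15:0] mult_result;
    always @(posedge clk) begin
        if (rst) begin
            mult_result <= 0; accum <= 0;       // synchronous reset
        end else begin
            mult_result <= a * b;               // INT8 × INT8 → 16-bit
            accum <= accum + mult_result;       // sign-extended into the 32-bit accumulator
        end
    end
endmodule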
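The two non-blocking assignments in the MAC unit both read the register values from before the clock edge, so each product reaches the accumulator one cycle late. A small behavioral model in Python can make that timing visible. This is a software sketch only, assuming a 16-bit product register and a 32-bit accumulator that both start at zero; `wrap_signed` and `simulate_mac` are hypothetical names:

```python
def wrap_signed(value, bits):
    """Wrap an integer to a signed two's-complement width, as a register would."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def simulate_mac(pairs):
    """Return the accumulator value after each clock edge, plus the pending product."""
    mult_result, accum = 0, 0
    trace = []
    for a, b in pairs:
        # Both next-state values are computed from the current register contents.
        next_mult = wrap_signed(a * b, 16)
        next_accum = wrap_signed(accum + mult_result, 32)
        mult_result, accum = next_mult, next_accum
        trace.append(accum)
    return trace, mult_result


trace, pending = simulate_mac([(3, 4), (-5, 6), (127, 127)])
print(trace)     # [0, 12, -18]
print(pending)   # 16129
```

After three input pairs the accumulator holds only the first two products; the third (16129) is still sitting in the product register and needs one more clock edge to be added.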
Quantization Error
The key metric: how much accuracy is lost?
- Top-1 accuracy (ImageNet): FP32=76.5% → INT8=76.2% (0.3-point drop)
- BERT language model: FP32=92% → INT8=91.7% (0.3-point drop)
- Production tolerance: a 0.5-1 point accuracy loss is typically acceptable; if the drop is larger, retrain (for example with quantization-aware training)
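Task accuracy can only be measured by running the model, but the rounding error on the weights themselves is easy to compute and is a useful first check. The sketch below assumes symmetric per-tensor quantization and a non-zero tensor; `roundtrip_error` is a hypothetical helper. It shows that the worst-case error is bounded by half a quantization step, and that the step grows as the bit width shrinks:

```python
def roundtrip_error(weights, num_bits=8):
    """Largest |w - dequantize(quantize(w))| under symmetric quantization."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax   # assumes at least one non-zero weight
    errors = [abs(w - round(w / scale) * scale) for w in weights]
    return max(errors), scale


weights = [-0.5, -0.2, 0.1, 0.3, 0.5]
err8, scale8 = roundtrip_error(weights, num_bits=8)
err4, scale4 = roundtrip_error(weights, num_bits=4)

print(err8 <= scale8 / 2 + 1e-12)   # True: error is at most half a step
print(err4 <= scale4 / 2 + 1e-12)   # True
print(err4 > err8)                  # True: INT4 steps are coarser
```

Weight error is only a proxy: a small rounding error does not guarantee a small accuracy drop, which is why the figures above are reported on the end task.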
Next, Day 13: how to quantize, comparing post-training quantization with quantization-aware training.