Day 12

Integer Quantization

From floating-point to fixed-point: INT8, INT4, and asymmetric quantization, plus how the Apple Neural Engine and NVIDIA H100 run 8-bit inference.

The Core Problem: FP32 → INT8

A typical transformer weight:

FP32: 0.0234567   (32 bits, floating-point)
INT8: 12          (8 bits, integer: -128 to +127)

The challenge: 4× smaller, but still accurate.

Fixed-Point Representation

An INT8 value acts as a fixed-point number: the stored integer is multiplied by an implicit scale factor that is kept separately.

INT8 bit layout: 8-bit two's complement (1 sign bit + 7 value bits)
Range: -128 to +127

Quantization formula (symmetric):
  INT8_value = round( FP32_value / scale_factor )
  scale_factor = max(|FP32_values|) / 127

Example:
  FP32 weights: [-0.5, -0.2, 0.1, 0.3, 0.5]
  max(abs) = 0.5
  scale = 0.5 / 127 = 0.003937

Quantized:
   0.5 / 0.003937 =  127.0 → INT8:  127
   0.3 / 0.003937 =   76.2 → INT8:   76
   0.1 / 0.003937 =   25.4 → INT8:   25
  -0.2 / 0.003937 =  -50.8 → INT8:  -51
  -0.5 / 0.003937 = -127.0 → INT8: -127

Because the scale divides by 127, symmetric quantization only uses -127 to +127; the code -128 goes unused.
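The symmetric formula above can be sketched in a few lines of Python. This is a minimal illustration, not a production quantizer: the function names (symmetric_scale, quantize_symmetric, dequantize) are hypothetical, and clamping to [-127, 127] is an assumption that matches the divide-by-127 scale.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` onto `qmax`."""
    return max(abs(v) for v in values) / qmax


def quantize_symmetric(values, scale, qmin=-127, qmax=127):
    """Apply round(v / scale) and clamp to the symmetric INT8 range."""
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map integer codes back to approximate real values."""
    return [q * scale for q in q_values]


weights = [-0.5, -0.2, 0.1, 0.3, 0.5]
scale = symmetric_scale(weights)
q = quantize_symmetric(weights, scale)
restored = dequantize(q, scale)

print(q)  # [-127, -51, 25, 76, 127]
print([round(r, 4) for r in restored])
```

Each restored value lands within half a scale step of the original, which is the rounding error the scheme trades for the 4× size reduction.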

De-quantization (Inference)

When you multiply INT8 values, you get larger integers. Scale back:

During inference:
  y = (INT8_x * INT8_w * scale_x * scale_w) + bias

Example computation:
  x = 42 (INT8), w = 38 (INT8)
  scale_x = 1/2048, scale_w = 1/4096
  y = 42 * 38 * (1/2048) * (1/4096)
    = 1596 / 8388608
    ≈ 0.00019
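The same computation extends to a dot product: the integer products are summed first, and the two scales are applied once to the accumulated total. The sketch below assumes one scale per tensor; int8_linear is a hypothetical helper name, and the Python int stands in for a 32-bit hardware accumulator.

```python
def int8_linear(x_q, w_q, scale_x, scale_w, bias=0.0):
    """Integer dot product followed by a single rescale to a real value."""
    acc = 0  # stands in for a 32-bit integer accumulator
    for xi, wi in zip(x_q, w_q):
        acc += xi * wi  # INT8 x INT8 product, summed as an integer
    return acc * (scale_x * scale_w) + bias


# The single-element example from the text: x = 42, w = 38.
y = int8_linear([42], [38], 1 / 2048, 1 / 4096)
print(y)

# A longer vector reuses the same two scales for every element.
y_vec = int8_linear([42, -7, 100], [38, 12, -3], 1 / 2048, 1 / 4096)
print(y_vec)
```

Keeping the scales out of the inner loop is what lets the multiply-accumulate run entirely in integer arithmetic.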

INT8 vs INT4

Format  Bits  Range         Use Case               Accuracy Loss
FP32    32    ±10^38        Training               -
INT8    8     [-128, 127]   Inference (standard)   0.5-1% accuracy drop
INT4    4     [-8, 7]       Mobile/Edge            2-5% accuracy drop
INT2    2     [-2, 1]       Ultra-edge (research)  10%+ accuracy drop
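The integer ranges in the table follow directly from the bit width of a two's-complement number, and the storage savings follow from the bits per weight. A small sketch, with hypothetical helper names and an illustrative one-million-weight layer:

```python
def signed_range(bits):
    """Two's-complement range for a signed integer of the given width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def storage_bytes(num_weights, bits):
    """Bytes needed to store `num_weights` values packed at `bits` each."""
    return (num_weights * bits + 7) // 8


for bits in (8, 4, 2):
    print(bits, signed_range(bits))

num_weights = 1_000_000  # illustrative layer size, not taken from the text
sizes = {bits: storage_bytes(num_weights, bits) for bits in (32, 8, 4, 2)}
print(sizes)
```

The size calculation ignores the scale factors and any packing overhead, which are small compared with the weights themselves.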

Why Hardware Loves INT8

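Apple Neural Engine: the A17 Pro's Neural Engine can run inference workloads in INT8. Why? A smartphone has a power budget of about 2 W, and an INT8 multiply needs far less silicon and energy than a floating-point multiply: there is no exponent alignment or normalization step.

NVIDIA H100 (FP8 support): Hopper added FP8 for ultra-low precision in two layouts, E5M2 (1 sign + 5 exponent + 2 mantissa bits) and E4M3 (1 sign + 4 exponent + 3 mantissa bits). FP8 is still a floating-point format, just squeezed into 8 bits. The H100's Tensor Cores also support INT8.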
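To see what "1 sign + 5 exponent + 2 mantissa" means in practice, the sketch below decodes one such byte into a Python float. It assumes the IEEE-style conventions commonly used for this layout (exponent bias 15, subnormals when the exponent field is 0, infinity and NaN when it is all ones); decode_fp8_e5m2 is a hypothetical name, not a library function.

```python
import math


def decode_fp8_e5m2(byte):
    """Decode an 8-bit pattern laid out as 1 sign, 5 exponent, 2 mantissa bits."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exponent = (byte >> 2) & 0x1F  # 5-bit exponent field
    mantissa = byte & 0x3          # 2-bit mantissa field

    if exponent == 0x1F:           # all ones: infinity or NaN (assumed IEEE-style)
        return sign * math.inf if mantissa == 0 else math.nan
    if exponent == 0:              # subnormal: no implicit leading 1
        return sign * (mantissa / 4) * 2.0 ** -14
    return sign * (1 + mantissa / 4) * 2.0 ** (exponent - 15)


print(decode_fp8_e5m2(0x3C))  # 1.0
print(decode_fp8_e5m2(0xC0))  # -2.0
print(decode_fp8_e5m2(0x7B))  # 57344.0, the largest finite value
```

With only two mantissa bits, each power-of-two interval holds just four representable values, which is the precision cost of keeping a floating-point exponent in 8 bits.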

Asymmetric Quantization

What if weights don't center around zero?

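Weight distribution:

  Symmetric:   [====|====]    (centered on 0)
              -50   0   50

  Asymmetric:   [|========]   (skewed right)
              -10 0       60

Asymmetric formula:
  q = round( (FP32 - zero_point) / scale )

Example: weights range from 0.1 to 0.5
  zero_point = 0.1 (minimum)
  scale = (0.5 - 0.1) / 255 ≈ 0.00157

  0.5 → (0.5 - 0.1) / 0.00157 ≈ 255   (uses the full 8-bit range!)
  0.3 → (0.3 - 0.1) / 0.00157 ≈ 127
  0.1 → (0.1 - 0.1) / 0.00157 = 0

The codes run from 0 to 255, which is the unsigned 8-bit (UINT8) range; subtract 128 to store them as signed INT8 (-128 to +127). Here zero_point is a floating-point offset (the minimum weight). Many frameworks instead store an integer zero point and compute q = round(FP32 / scale) + zero_point.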
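A minimal Python sketch of the asymmetric scheme described above, using the minimum weight as the offset and 255 steps across the range. The function names are hypothetical, and the codes are unsigned (0 to 255); shifting them by 128 would give a signed INT8 encoding.

```python
def asymmetric_params(values, levels=255):
    """Return (offset, scale): offset is the minimum, scale spans the range."""
    lo, hi = min(values), max(values)
    return lo, (hi - lo) / levels


def quantize_asymmetric(values, offset, scale, qmax=255):
    """Apply round((v - offset) / scale) and clamp to 0..qmax."""
    return [max(0, min(qmax, round((v - offset) / scale))) for v in values]


def dequantize_asymmetric(codes, offset, scale):
    """Map unsigned codes back to approximate real values."""
    return [c * scale + offset for c in codes]


weights = [0.1, 0.3, 0.5]
offset, scale = asymmetric_params(weights)
codes = quantize_asymmetric(weights, offset, scale)
restored = dequantize_asymmetric(codes, offset, scale)

print(codes)
print([round(r, 4) for r in restored])
```

The midpoint 0.3 sits almost exactly between two codes, so it may come out as 127 or 128 depending on floating-point rounding; either is within half a step of the true value.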

Hardware Implementation

INT8 multiplier in hardware (simplified):

module int8_multiplier (
    input  signed [7:0]  a, b,
    output signed [15:0] product
);
    assign product = a * b;  // 8-bit × 8-bit = 16-bit result
endmodule

// In a MAC unit:
always @(posedge clk) begin
    mult_result <= a * b;                // INT8 × INT8
    accum       <= accum + mult_result;  // Accumulate (32-bit)
end
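The bit widths in the hardware sketch can be checked with plain integer arithmetic. The Python below is only a back-of-the-envelope model (variable names are illustrative): it confirms that every INT8 × INT8 product fits in 16 bits and estimates how many worst-case products a 32-bit accumulator can absorb.

```python
INT8_MIN, INT8_MAX = -128, 127
INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1
INT32_MAX = (1 << 31) - 1

# A product of two bounded values reaches its extremes at the range corners.
products = [a * b for a in (INT8_MIN, INT8_MAX) for b in (INT8_MIN, INT8_MAX)]
largest_product = max(abs(p) for p in products)  # 16384, from (-128) * (-128)
fits_in_int16 = all(INT16_MIN <= p <= INT16_MAX for p in products)

# Worst case: every product has the largest magnitude and the same sign.
safe_accumulations = INT32_MAX // largest_product

print(largest_product, fits_in_int16, safe_accumulations)  # 16384 True 131071
```

In other words, a 32-bit accumulator has headroom for dot products well over a hundred thousand elements long before overflow becomes possible.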

Quantization Error

The key metric: how much accuracy is lost?
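One way to put a number on this is to quantize, de-quantize, and compare against the original values. The sketch below assumes symmetric per-tensor quantization and reports mean squared error and signal-to-quantization-noise ratio; the helper names are hypothetical. It measures error on the weights only; the accuracy drop of a real model still has to be measured on validation data.

```python
import math


def quantize_dequantize(values, bits):
    """Symmetric round trip: quantize to `bits`, then map back to reals."""
    qmax = (1 << (bits - 1)) - 1
    scale = max(abs(v) for v in values) / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mse(original, restored):
    """Mean squared error between original and restored values."""
    return sum((o - r) ** 2 for o, r in zip(original, restored)) / len(original)


def sqnr_db(original, restored):
    """Signal-to-quantization-noise ratio in decibels (higher is better)."""
    signal = sum(o * o for o in original)
    noise = sum((o - r) ** 2 for o, r in zip(original, restored))
    return math.inf if noise == 0 else 10 * math.log10(signal / noise)


weights = [-0.5, -0.2, 0.1, 0.3, 0.5]
report = {}
for bits in (8, 4):
    restored = quantize_dequantize(weights, bits)
    report[bits] = (mse(weights, restored), sqnr_db(weights, restored))
    print(f"INT{bits}: MSE={report[bits][0]:.2e}  SQNR={report[bits][1]:.1f} dB")
```

On these five weights the INT4 round trip shows a visibly larger error than INT8, in line with the ordering in the comparison table.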

Next, Day 13: how to quantize, comparing post-training quantization with quantization-aware training.