AI Chip Design Day 12

The Core Problem: FP32 → INT8

A typical transformer weight:

FP32: 0.0234567 (32 bits, floating-point)
INT8: 12 (8 bits, integer: -128 to +127)

The challenge: 4× smaller, but still accurate

Fixed-Point Representation

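INT8 works as a fixed-point format: the stored integer is multiplied by an implicit scale factor, which plays the role of the binary point.

INT8 bit layout: 8-bit two's complement (sign bit + 7 value bits)
Range: -128 to +127

Quantization formula (symmetric):
INT8_value = round( FP32_value / scale_factor )
scale_factor = max(|FP32_values|) / 127

Example:
FP32 weights: [-0.5, -0.2, 0.1, 0.3, 0.5]
max(abs) = 0.5
scale = 0.5 / 127 = 0.003937

Quantized:
 0.5 / 0.003937 =  127.0 → INT8:  127
 0.3 / 0.003937 =   76.2 → INT8:   76
 0.1 / 0.003937 =   25.4 → INT8:   25
-0.2 / 0.003937 =  -50.8 → INT8:  -51
-0.5 / 0.003937 = -127.0 → INT8: -127

Because the scale divides by 127, symmetric quantization only produces values in [-127, 127]; the code -128 goes unused.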
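The symmetric scheme above can be sketched in a few lines of Python. This is a minimal illustration, not a production quantizer: the function names `symmetric_scale` and `quantize_symmetric` are made up for this example, and the clamp to [-127, 127] is an assumption that follows from dividing by 127.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` onto qmax."""
    return max(abs(v) for v in values) / qmax


def quantize_symmetric(values, scale, qmax=127):
    """Round each value to the nearest step, then clamp to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


weights = [-0.5, -0.2, 0.1, 0.3, 0.5]   # the example weights from the text
scale = symmetric_scale(weights)         # 0.5 / 127
quantized = quantize_symmetric(weights, scale)
print(quantized)                         # [-127, -51, 25, 76, 127]
```

Computing one scale for the whole list corresponds to per-tensor quantization; a per-channel variant would compute a separate scale for each output channel.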

De-quantization (Inference)

Multiplying two INT8 values produces a wider integer (up to 16 bits). To get back to real-valued units, multiply the integer result by both scale factors:

During inference:
y = (INT8_x * INT8_w * scale_x * scale_w) + bias

Example computation:
x = 42 (INT8), w = 38 (INT8)
scale_x = 1/2048, scale_w = 1/4096
y = 42 * 38 * (1/2048) * (1/4096)
  = 1596 / 8388608
  ≈ 0.00019
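As a rough sketch of the de-quantization step, the snippet below accumulates in integers and applies the combined scale once at the end. The helper name `int8_dot_dequant` is hypothetical, and treating `bias` as a real-valued number added after scaling is an assumption taken from the formula above.

```python
def int8_dot_dequant(x_q, w_q, scale_x, scale_w, bias=0.0):
    """Integer dot product, then one multiply by the combined scale."""
    acc = sum(x * w for x, w in zip(x_q, w_q))   # stays in integer arithmetic
    return acc * (scale_x * scale_w) + bias


# Single-element case from the text: x = 42, w = 38
y = int8_dot_dequant([42], [38], 1 / 2048, 1 / 4096)
print(y)   # about 0.00019
```

Applying the scale once per output, rather than once per product, is what lets the inner loop run entirely on integer hardware.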

INT8 vs INT4

Format	Bits	Range	Use Case	Accuracy Loss
FP32	32	≈ ±3.4 × 10^38	Training	None (baseline)
INT8	8	[-128, 127]	Inference (standard)	0.5-1% accuracy drop
INT4	4	[-8, 7]	Mobile/Edge	2-5% accuracy drop
INT2	2	[-2, 1]	Ultra-edge (research)	10%+ accuracy drop
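The integer ranges in the table follow directly from two's complement: an n-bit signed integer covers -2^(n-1) to 2^(n-1) - 1. A small sketch (the function name `signed_int_range` is just for illustration):

```python
def signed_int_range(bits):
    """Smallest and largest value of a two's-complement integer with `bits` bits."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


for bits in (8, 4, 2):
    lo, hi = signed_int_range(bits)
    print(f"INT{bits}: [{lo}, {hi}]  ({hi - lo + 1} distinct levels)")
```

Each bit removed halves the number of representable levels, which is why the accuracy loss grows so quickly below 8 bits.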

Why Hardware Loves INT8

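8-bit multiply: an 8×8 multiply produces a 16-bit product, so the multiplier array is small and cheap in silicon
Accumulation: a 32-bit accumulator sums 256 INT8 products with ample headroom (each product fits in 16 bits, so 256 of them need only about 24 bits)
Power: an integer multiply uses roughly 4× less energy than a floating-point multiply, or more, depending on the floating-point width and the process node
Memory: 4× smaller weights than FP32 (4 GB → 1 GB for a billion-parameter model)
Bandwidth: 4× more weights per memory transaction than FP32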
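Two of the points above are simple arithmetic and can be checked with a short sketch. The helper names are hypothetical, and the headroom calculation assumes the worst case where every product is (-128) × (-128).

```python
def max_safe_accumulations(acc_bits=32, operand_bits=8):
    """How many worst-case products fit in a signed accumulator without overflow."""
    worst_product = (1 << (operand_bits - 1)) ** 2   # (-128) * (-128) = 16384
    acc_max = (1 << (acc_bits - 1)) - 1              # 2**31 - 1 for 32 bits
    return acc_max // worst_product


def weight_bytes(num_params, bits_per_weight):
    """Storage needed for the weights alone, in bytes."""
    return num_params * bits_per_weight // 8


print(max_safe_accumulations())      # 131071
print(weight_bytes(10**9, 32))       # 4000000000 (4 GB in FP32)
print(weight_bytes(10**9, 8))        # 1000000000 (1 GB in INT8)
```

A dot product of length 256 therefore uses only a small fraction of a 32-bit accumulator's range, which leaves room for much longer reductions.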

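Apple Neural Engine (INT8 inference): the A17 Pro's Neural Engine is an inference-only accelerator that can run workloads in INT8. Why? A smartphone has a power budget of about 2 W, and an INT8 multiply is a plain integer operation with no exponent handling or normalization overhead.

NVIDIA H100 (FP8 support): Hopper added FP8 for ultra-low precision, in two variants: E5M2 (1 sign + 5 exponent + 2 mantissa bits) and E4M3 (1 sign + 4 exponent + 3 mantissa bits). These are still floating-point formats, but squeezed into 8 bits.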
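To make the 1 sign + 5 exponent + 2 mantissa layout concrete, here is a small decoder sketch. It assumes IEEE-754-style conventions that the text does not spell out (exponent bias of 15, subnormals when the exponent field is 0, infinity and NaN when it is all ones); the function name `decode_fp8_e5m2` is made up for this example.

```python
import math


def decode_fp8_e5m2(byte):
    """Decode one byte laid out as 1 sign, 5 exponent, 2 mantissa bits."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exponent = (byte >> 2) & 0x1F
    mantissa = byte & 0x3
    if exponent == 0x1F:                       # all ones: infinity or NaN
        return sign * math.inf if mantissa == 0 else math.nan
    if exponent == 0:                          # subnormal: no implicit leading 1
        return sign * (mantissa / 4) * 2.0 ** -14
    return sign * (1 + mantissa / 4) * 2.0 ** (exponent - 15)


print(decode_fp8_e5m2(0x3C))   # 1.0
print(decode_fp8_e5m2(0x7B))   # 57344.0, the largest finite value
print(decode_fp8_e5m2(0xC0))   # -2.0
```

With only 2 mantissa bits there are just four values per power of two, so this format trades precision for dynamic range, the opposite trade-off from INT8.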

Asymmetric Quantization

What if weights don't center around zero?

Weight distribution:
Symmetric:  [====|====]   (centered on 0)
            -50  0   50
Asymmetric: [  ========]  (skewed right)
            -10 0     60

Asymmetric formula:
Q = round( (FP32 - zero_point) / scale )

Example: weights range from 0.1 to 0.5
zero_point = 0.1 (minimum)
scale = (0.5 - 0.1) / 255 ≈ 0.00157

0.5 → (0.5-0.1) / 0.00157 ≈ 255 (uses the full 8-bit range!)
0.3 → (0.3-0.1) / 0.00157 ≈ 127
0.1 → (0.1-0.1) / 0.00157 = 0

These codes span 0 to 255, which is the unsigned 8-bit range. To store them as signed INT8 (-128 to +127), subtract 128 from each code.
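The asymmetric scheme can be sketched as follows. This follows the formula in the text, where `zero_point` is the real-valued minimum; the function names are hypothetical, and the shift by 128 to reach signed INT8 is one possible storage convention, not the only one.

```python
def asymmetric_params(values, levels=255):
    """zero_point is the minimum value; scale spreads the range over `levels` steps."""
    lo, hi = min(values), max(values)
    return lo, (hi - lo) / levels


def quantize_asymmetric(values, zero_point, scale, levels=255):
    """Map each value to an unsigned code in [0, levels]."""
    return [max(0, min(levels, round((v - zero_point) / scale))) for v in values]


def dequantize_asymmetric(codes, zero_point, scale):
    """Approximate reconstruction of the original values."""
    return [q * scale + zero_point for q in codes]


weights = [0.1, 0.3, 0.5]
zero_point, scale = asymmetric_params(weights)
codes = quantize_asymmetric(weights, zero_point, scale)
signed_codes = [q - 128 for q in codes]        # shift [0, 255] to [-128, 127]
restored = dequantize_asymmetric(codes, zero_point, scale)
print(codes, signed_codes, restored)
```

The middle weight, 0.3, sits almost exactly halfway between two codes, so whether it lands on 127 or 128 depends on floating-point rounding; either choice is within half a step of the true value.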

Hardware Implementation

INT8 multiplier in hardware (simplified):

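module int8_multiplier (
  input  signed [7:0]  a, b,
  output signed [15:0] product
);
  assign product = a * b;  // 8-bit × 8-bit = 16-bit result
endmodule

// In a MAC unit (two-stage pipeline; module and reset names are illustrative):
module int8_mac (
  input                    clk, rst,
  input      signed [7:0]  a, b,
  output reg signed [31:0] accum
);
  reg signed [15:0] mult_result;
  always @(posedge clk) begin
    if (rst) begin
      mult_result <= 16'sd0;
      accum       <= 32'sd0;
    end else begin
      mult_result <= a * b;                // INT8 × INT8 → 16-bit product
      accum       <= accum + mult_result;  // Accumulate (32-bit, sign-extended)
    end
  end
endmodule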
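When verifying a MAC unit like this, it helps to have a software reference model to compare against. The sketch below is one possible model in Python, not part of the original design: `wrap_signed` and `mac_reference` are hypothetical names, and wrapping on overflow (rather than saturating) is an assumption that matches plain Verilog addition.

```python
def wrap_signed(value, bits):
    """Reduce `value` modulo 2**bits and reinterpret it as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac_reference(pairs, acc_bits=32):
    """Multiply-accumulate over (a, b) INT8 pairs with a wrapping accumulator."""
    acc = 0
    for a, b in pairs:
        if not (-128 <= a <= 127 and -128 <= b <= 127):
            raise ValueError("operands must fit in signed 8 bits")
        acc = wrap_signed(acc + a * b, acc_bits)
    return acc


print(mac_reference([(42, 38), (-128, -128), (127, -1)]))   # 17853
```

Feeding the same operand pairs to the hardware simulation and to this model gives a bit-exact check of the accumulator, including its overflow behavior.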

Quantization Error

The key metric: how much accuracy is lost?

Top-1 accuracy (ImageNet): FP32=76.5% → INT8=76.2% (0.3 percentage-point drop)
BERT language model: FP32=92% → INT8=91.7% (0.3 percentage-point drop)
Production tolerance: a loss of 0.5-1 percentage points is usually acceptable; retrain if the drop is larger
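Model accuracy can only be measured on a real evaluation set, but the error introduced on individual weights is easy to inspect directly. The sketch below is an illustration under simple assumptions (symmetric per-tensor quantization, hypothetical function name `roundtrip_errors`): it quantizes, de-quantizes, and reports the absolute error per weight, which should never exceed half a quantization step.

```python
def roundtrip_errors(values, qmax=127):
    """Absolute error per value after symmetric quantize + de-quantize."""
    scale = max(abs(v) for v in values) / qmax
    errors = [abs(v - round(v / scale) * scale) for v in values]
    return errors, scale


weights = [-0.5, -0.2, 0.1, 0.3, 0.5]
errors, scale = roundtrip_errors(weights)
print(f"step size: {scale:.6f}")
print(f"max error: {max(errors):.6f} (bound: {scale / 2:.6f})")
```

A small per-weight error does not guarantee a small accuracy drop, since errors compound across layers, so this check complements an end-to-end evaluation and does not replace it.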

Day 13: How to quantize: post-training quantization vs quantization-aware training.

Integer Quantization