Why use INT8 instead of FP32 for FPGA inference?

INT8 needs a quarter of the memory of FP32, fits two MAC operations into one DSP block per cycle, and uses simpler hardware (no floating-point exponent logic). Published results show many CNNs losing about 1% accuracy or less when quantized from FP32 to INT8, which is why it is the dominant format for FPGA inference.

What is Q-format in fixed-point arithmetic?

Q-format (Qm.n notation) specifies how many bits represent the integer part (m) and the fractional part (n) of a fixed-point number. Counting the sign bit separately, as this article does, an 8-bit value with 1 sign bit, 0 integer bits and 7 fractional bits is Q0.7 (some toolchains count the sign bit inside m and call the same format Q1.7). It represents values from -1.0 to +0.9921875 in steps of 1/128. The binary point is fixed at a specific bit position, unlike floating-point where it moves.

What is quantization error?

Quantization error is the difference between the original FP32 value and the nearest representable fixed-point value. For INT8 with range [-1, 1], the step size is 2/255 ≈ 0.0078. Maximum quantization error is half the step size: ±0.0039. Accumulated over thousands of MAC operations, this error must be carefully managed through calibration and fine-tuning.

How do you handle overflow in fixed-point neural networks?

Overflow occurs when a MAC accumulation exceeds the representable range. Solutions include: (1) saturation arithmetic — clamp to max/min instead of wrapping, (2) wider accumulators — use 32-bit accumulators for INT8 multiply-accumulate, (3) requantization after each layer — scale back to INT8 range, (4) careful scale factor selection during quantization calibration.
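The difference between wrapping and saturating is easiest to see on a single out-of-range value. The sketch below is a plain-Python illustration; `wrap_int8` and `saturate_int8` are hypothetical helper names, not part of any library.

```python
def wrap_int8(v: int) -> int:
    """Two's-complement wraparound to 8 bits (what a plain register does)."""
    return ((v + 128) % 256) - 128


def saturate_int8(v: int) -> int:
    """Clamp to the INT8 range instead of wrapping."""
    return max(-128, min(127, v))


total = 100 + 60              # 160 does not fit in a signed 8-bit value
print(wrap_int8(total))       # -96: the sign flips, a large error
print(saturate_int8(total))   # 127: the closest representable value
```

Saturation still loses information, but the error is bounded, whereas wraparound turns a large positive sum into a negative one.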

Fixed-Point Arithmetic & Quantization for FPGA Neural Networks

1. Why Floating-Point is Wrong for FPGA Inference

FP32 (32-bit floating-point) is the format PyTorch and TensorFlow train in. It gives numerical accuracy and large dynamic range — perfect for training where gradients span many orders of magnitude. But for inference on FPGA, FP32 is a terrible choice:

FP32 on FPGA — The Problems

❌ 1 FP32 multiply needs 3–4 DSP blocks
❌ Exponent + mantissa logic eats 1,500+ LUTs
❌ 4× more memory bandwidth vs INT8
❌ Slower clock due to complex rounding
❌ Only 3,072 FP32 MACs/cycle on Alveo U250

INT8 on FPGA — The Wins

✅ 2 INT8 MACs per DSP block per cycle
✅ Simple shift-and-add hardware
✅ 4× less memory bandwidth
✅ Higher clock frequency (simpler logic)
✅ 24,576 INT8 MACs/cycle on Alveo U250 (8×!)

Research Reality Check

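Google's 2017 work on integer-only inference, and the follow-up studies it prompted, showed that ImageNet CNNs can be quantized from FP32 → INT8 with small accuracy loss. Figures of about 0.5% top-1 for ResNet-50 and about 0.9% for MobileNetV2 are typical of what is reported, though the exact drop depends on the quantization method (PTQ vs QAT) and on calibration. That's an acceptable trade-off for 8× more compute throughput on FPGA.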

2. How Numbers Are Represented

2.1 Floating-Point (FP32) — Review

FP32: 32-bit Floating Point Layout

[ S: 1 sign bit | E: 8 exponent bits | M: 23 mantissa bits ]

Value = (-1)ˢ × 2^(E-127) × (1 + M/2²³) for normalized numbers. The exponent logic is what requires expensive hardware on FPGA.
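To make the layout concrete, the fields can be pulled out of a real FP32 value with Python's standard `struct` module. This is a minimal sketch for normalized numbers only; `decode_fp32` and `rebuild_fp32` are illustrative names.

```python
import struct


def decode_fp32(x: float):
    """Split a value into FP32 sign, biased exponent and mantissa fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa


def rebuild_fp32(sign: int, exponent: int, mantissa: int) -> float:
    """Apply the value formula for normalized numbers (exponent 1..254)."""
    return (-1) ** sign * 2.0 ** (exponent - 127) * (1 + mantissa / 2 ** 23)


s, e, m = decode_fp32(3.25)
print(s, e, m)                 # 0 128 5242880
print(rebuild_fp32(s, e, m))   # 3.25
```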

2.2 Fixed-Point Representation

Fixed-point removes the exponent entirely. The binary point is fixed at a known position. This makes the hardware very simple: just integers with a pre-agreed scaling factor.

Fixed-Point — Q3.4 Format (8-bit signed example)

Example: 0b 0 011 0100

Integer part: 011 = 3
Fraction: 0100 = 4/16 = 0.25
Value = 3.25

Range (Q3.4, signed)

Max: +7.9375
Min: -8.0
Step: 1/16 = 0.0625
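The same decoding can be checked in a few lines of Python. This sketch assumes the 8-bit layout shown above (1 sign bit, 3 integer bits, 4 fractional bits, two's complement); `q3_4_to_float` is a hypothetical helper name.

```python
def q3_4_to_float(byte: int) -> float:
    """Interpret an 8-bit pattern as signed Q3.4."""
    if byte & 0x80:          # sign bit set: two's-complement negative
        byte -= 256
    return byte / 16         # 4 fractional bits -> divide by 2**4


print(q3_4_to_float(0b00110100))  # 3.25
print(q3_4_to_float(0b01111111))  # 7.9375 (largest value)
print(q3_4_to_float(0b10000000))  # -8.0   (smallest value)
```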

3. Q-Format Notation

Q-format (Qm.n) precisely defines how a fixed-point number's bits are split. You will see this notation throughout FPGA AI literature and Verilog codebases.

Q-Format Rules:
  Qm.n = m integer bits + n fractional bits
  Total bits = sign bit + m + n (for signed)

Common formats for neural networks (sign bit counted separately):
  Q0.7 → 0 int bits, 7 frac bits (8-bit)  → range [-1.0, +0.992]   step 0.0078
  Q3.4 → 3 int bits, 4 frac bits (8-bit)  → range [-8.0, +7.9375]  step 0.0625
  Q7.8 → 7 int bits, 8 frac bits (16-bit) → range [-128, +127.996] step 0.0039

Note: some toolchains count the sign bit inside m, so the same 8-bit formats appear as Q1.7 and Q4.4 in their documentation. Check which convention a codebase uses before reusing its constants.

Value formula:
  x_real = x_int × 2^(-n)
  x_int = round(x_real × 2^n)

Example: convert 0.3 to Q0.7 (8-bit):
  x_int = round(0.3 × 2^7) = round(0.3 × 128) = round(38.4) = 38
  Binary: 0b 0_0100110 = 38
  Recovered: 38 / 128 = 0.296875 (error = 0.3 - 0.296875 = 0.003125)
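The value formula translates directly into code. The sketch below handles any signed format by taking the number of fractional bits and the total width as parameters; `to_fixed` and `from_fixed` are illustrative names, and saturation on overflow is an added assumption rather than part of the formula.

```python
def to_fixed(x: float, frac_bits: int, total_bits: int = 8) -> int:
    """Round x to the nearest fixed-point code and saturate to the signed range."""
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    code = round(x * (1 << frac_bits))      # x_int = round(x_real * 2^n)
    return max(lo, min(hi, code))


def from_fixed(code: int, frac_bits: int) -> float:
    """x_real = x_int * 2^(-n)."""
    return code / (1 << frac_bits)


code = to_fixed(0.3, frac_bits=7)           # 8-bit value with 7 fractional bits
print(code, from_fixed(code, 7))            # 38 0.296875
```

Note that +1.0 is not representable with 7 fractional bits in 8 bits, so `to_fixed(1.0, 7)` saturates to 127 (0.9921875).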

4. INT8 Quantization — The Industry Standard

INT8 quantization maps FP32 weights and activations to 8-bit signed integers. This is what TensorFlow Lite, PyTorch Mobile, and Vitis AI all use by default. Two parameters define the mapping: the scale factor S and the zero point Z.

FP32 → INT8 Quantization Mapping

Quantize:       q = clamp(round(x/S + Z), -128, 127)
Dequantize:     x ≈ S × (q - Z)
Scale factor S: S = (max - min) / (2^bits - 1)
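A round trip through the mapping shows how S and Z work together. This is a pure-Python sketch, not the code of any particular framework; the function names are illustrative, and the zero point is chosen so that the minimum of the range lands on -128 (an assumption that matches the signed clamp above).

```python
def choose_params(x_min: float, x_max: float, bits: int = 8):
    """Min-max scale S and zero point Z for a signed integer grid."""
    q_min, q_max = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    scale = (x_max - x_min) / (q_max - q_min)     # (max - min) / (2^bits - 1)
    zero_point = q_min - round(x_min / scale)     # maps x_min onto q_min
    return scale, zero_point


def quantize(x, scale, zero_point, q_min=-128, q_max=127):
    # round(x/S) + Z equals round(x/S + Z) because Z is an integer
    return max(q_min, min(q_max, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)


S, Z = choose_params(-1.0, 3.0)
q = quantize(0.5, S, Z)
print(S, Z, q, dequantize(q, S, Z))
```

For the asymmetric range [-1.0, 3.0] the zero point comes out at -64, so the real value 0.0 maps exactly onto an integer code, which matters for zero padding and ReLU outputs.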

5. Quantization Error Analysis

Every quantized value has an error equal to the rounding distance. Understanding this error is critical for knowing when INT8 is safe vs when you need INT16 or FP16.

Quantization Error — Sawtooth Pattern

For INT8 with range [-2.5, +2.5]:
Step size S = 5.0 / 255 ≈ 0.01961 | Max error = S/2 ≈ ±0.0098 | SNR ≈ 49.9 dB for a full-scale sine input (6.02 × 8 + 1.76), which is excellent for inference
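These figures can be reproduced empirically. The sketch below quantizes a full-scale sine over [-2.5, +2.5] and measures the error; the sample count and test frequency are arbitrary choices, and the zero point is taken as 0 because the range is symmetric.

```python
import math

S = 5.0 / 255                       # step size for the range [-2.5, +2.5]


def quantize_dequantize(x: float) -> float:
    """Round to the INT8 grid and map back to a real value (zero point 0)."""
    q = max(-128, min(127, round(x / S)))
    return q * S


N = 100_000
signal = [2.5 * math.sin(2 * math.pi * 0.01234567 * k) for k in range(N)]
errors = [x - quantize_dequantize(x) for x in signal]

max_err = max(abs(e) for e in errors)
signal_power = sum(x * x for x in signal) / N
noise_power = sum(e * e for e in errors) / N
snr_db = 10 * math.log10(signal_power / noise_power)

print(f"step={S:.5f}  max_err={max_err:.5f}  measured SNR={snr_db:.1f} dB")
print(f"rule of thumb for 8 bits: {6.02 * 8 + 1.76:.2f} dB")
```

The measured SNR should land close to the rule-of-thumb value; real activations are not sine waves, so treat the number as an upper-bound style estimate rather than a guarantee.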

6. Overflow and Saturation Handling

Overflow is the most dangerous bug in fixed-point neural networks. When two INT8 values multiply (8×8=16-bit result) and the products are accumulated hundreds of times, a 16-bit register overflows almost immediately: two worst-case products of (-128)×(-128) = 16,384 already exceed its maximum of 32,767. A 32-bit register holds over 100,000 such products. This is why accumulators must be wider than inputs.
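The required width follows from the worst-case sum. This short sketch computes it for a given number of accumulated products; `accumulator_bits` is an illustrative helper, and it assumes every product takes the worst-case value (-128) × (-128).

```python
def accumulator_bits(n_terms: int, input_bits: int = 8) -> int:
    """Signed accumulator width that cannot overflow for n_terms products."""
    worst_product = (1 << (input_bits - 1)) ** 2   # (-128) * (-128) = 16384
    worst_sum = n_terms * worst_product
    return worst_sum.bit_length() + 1              # +1 for the sign bit


for n in (1, 2, 256, 4096, 131071, 131072):
    print(n, accumulator_bits(n))
```

Under this worst-case assumption, 256 accumulations need 24 bits, and a 32-bit accumulator stays safe up to 131,071 products. 32 bits is the usual choice because it is the next convenient width and leaves headroom for bias addition.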

Accumulator Width: Why 32 Bits?

INT8 × INT8: 8 bits (A) × 8 bits (B) → 16-bit product
Accumulate 256×: 32-bit accumulator (needed!); 16-bit: OVERFLOW ✗
Requantize output: 32-bit accumulator → scale → INT8 output

Rule: Always use 32-bit accumulators for INT8 MACs. After each layer, requantize back to INT8. This is how the 48-bit accumulator of the DSP48E2 is used in practice.
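Requantization is usually done with an integer multiplier and a right shift rather than a floating-point divide. The sketch below shows one way to do it, assuming symmetric quantization (all zero points 0) and a 16-bit shift; the scale values and the `requantize` name are hypothetical.

```python
def requantize(acc: int, s_in: float, s_w: float, s_out: float, shift: int = 16) -> int:
    """Scale a wide accumulator back to INT8 with an integer multiply and shift."""
    # Fixed-point version of the real ratio (s_in * s_w) / s_out, computed offline
    multiplier = round(s_in * s_w / s_out * (1 << shift))
    # Multiply, add half an LSB for rounding, then shift back down
    scaled = (acc * multiplier + (1 << (shift - 1))) >> shift
    return max(-128, min(127, scaled))              # saturate to INT8


acc = 20_000                                        # hypothetical accumulator value
print(requantize(acc, s_in=0.02, s_w=0.005, s_out=0.05))   # 40
```

Here the real ratio is 0.002, so the exact result is 20,000 × 0.002 = 40. In hardware only the multiply, add and shift remain, since the multiplier is a constant per layer.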

7. Scale Factor Calibration

Choosing the right scale factor S is what separates good quantization from bad. If S is too large, the grid is coarse and small values round to 0 (precision loss). If S is too small, values outside the representable range clamp to -128 or +127 (clipping loss).

Calibration Methods:

1. Min-Max (simplest):
   S = (max_val - min_val) / (2^bits - 1)
   Z = q_min - round(min_val / S)   (q_min = -128 for signed INT8, 0 for unsigned)
   Problem: outliers inflate S → poor precision for typical values

2. Percentile (better):
   Use 99.9th percentile instead of absolute max
   Clips 0.1% of values but improves precision for 99.9%

3. KL-Divergence (best; used by TensorRT, Vitis AI):
   Minimize KL divergence between FP32 and INT8 distribution
   Runs on calibration dataset (1000-5000 representative images)
   Automatically finds optimal S per layer

Python example (PyTorch):
   import torch.quantization
   # `model` is your trained FP32 network (with QuantStub/DeQuantStub at its inputs/outputs)
   model.eval()  # post-training static quantization expects eval mode
   model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
   model_prepared = torch.quantization.prepare(model)
   # Run calibration data through model_prepared (`calibration_loader` is assumed to exist)
   for images, _ in calibration_loader:
       model_prepared(images)
   model_int8 = torch.quantization.convert(model_prepared)
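The effect of a single outlier on min-max calibration is easy to demonstrate without any framework. The sketch below compares the two simplest methods on synthetic Gaussian "activations" with one injected outlier; the data, the seed and the helper names are all illustrative assumptions.

```python
import random


def minmax_scale(values, bits=8):
    """S from the absolute min and max."""
    return (max(values) - min(values)) / (2 ** bits - 1)


def percentile_scale(values, pct=99.9, bits=8):
    """S from the central pct% of values; the rest will be clipped."""
    ordered = sorted(values)
    tail = (100.0 - pct) / 100.0 / 2              # fraction clipped on each side
    lo = ordered[int(tail * (len(ordered) - 1))]
    hi = ordered[int((1 - tail) * (len(ordered) - 1))]
    return (hi - lo) / (2 ** bits - 1)


random.seed(0)
activations = [random.gauss(0.0, 1.0) for _ in range(10_000)] + [40.0]  # one outlier

print(f"min-max S    = {minmax_scale(activations):.4f}")      # inflated by the outlier
print(f"percentile S = {percentile_scale(activations):.4f}")  # tracks the bulk of the data
```

With min-max, one extreme value stretches the grid so far that typical activations share only a handful of codes. Percentile calibration gives up that one value in exchange for a much finer step everywhere else.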

8. Building the INT8 MAC Unit in Verilog

Now let's build it. This is a pipelined 3-stage MAC unit using signed INT8 inputs, a 16-bit product register, and a 32-bit accumulator, the same basic structure a real CNN accelerator is built from.

// mac_unit.v: Pipelined INT8 MAC Unit for FPGA Neural Network
// 3-stage pipeline: Stage1=Multiply, Stage2=Sign-extend, Stage3=Accumulate
// Written for DSP48E2 inference: Xilinx synthesis can map the multiply-accumulate to DSP blocks

module mac_unit #(
  parameter DATA_W  = 8,   // INT8 input width
  parameter PROD_W  = 16,  // 8x8 = 16-bit product
  parameter ACCUM_W = 32   // 32-bit accumulator (safe for 256 accumulations)
)(
  input  wire                  clk,
  input  wire                  rst_n,
  input  wire                  clear,     // clear accumulator (new dot product)
  input  wire                  valid_in,  // input data is valid
  input  wire signed [DATA_W-1:0] a,      // weight (INT8 signed)
  input  wire signed [DATA_W-1:0] b,      // activation (INT8 signed)
  output reg  signed [ACCUM_W-1:0] accum, // accumulated result
  output reg                   valid_out  // output valid
);

  // Pipeline Stage 1: Multiply
  reg signed [PROD_W-1:0]  product;
  reg                       valid_s1;
  reg                       clear_s1;  // clear travels through the pipeline with its data

  always @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      product  <= 0;
      valid_s1 <= 0;
      clear_s1 <= 0;
    end else begin
      product  <= a * b;          // 8x8 = 16-bit signed multiply
      valid_s1 <= valid_in;
      clear_s1 <= clear;
    end
  end

  // Pipeline Stage 2: Sign-extend product to 32-bit
  reg signed [ACCUM_W-1:0] product_ext;
  reg                       valid_s2;
  reg                       clear_s2;

  always @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      product_ext <= 0;
      valid_s2    <= 0;
      clear_s2    <= 0;
    end else begin
      product_ext <= {{(ACCUM_W-PROD_W){product[PROD_W-1]}}, product}; // sign extend
      valid_s2    <= valid_s1;
      clear_s2    <= clear_s1;    // two register delays, so it lines up with product_ext
    end
  end

  // Pipeline Stage 3: Accumulate or Clear
  always @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      accum     <= 0;
      valid_out <= 0;
    end else begin
      valid_out <= valid_s2;
      if (clear_s2)
        accum <= product_ext;           // start new accumulation
      else if (valid_s2)
        accum <= accum + product_ext;   // accumulate
    end
  end

endmodule

Testbench — Verify the MAC Unit

// mac_unit_tb.v — Testbench: compute dot product [2,3,4] · [1,2,3] = 2+6+12 = 20
`timescale 1ns/1ps

module mac_unit_tb;
  reg        clk = 0, rst_n = 0, clear = 0, valid_in = 0;
  reg  signed [7:0] a, b;
  wire signed [31:0] accum;
  wire valid_out;

  mac_unit uut(.clk(clk),.rst_n(rst_n),.clear(clear),
               .valid_in(valid_in),.a(a),.b(b),
               .accum(accum),.valid_out(valid_out));

  always #5 clk = ~clk; // 100MHz clock

  initial begin
    rst_n = 0; a = 0; b = 0; #20; rst_n = 1;

    // Dot product: a=[2,3,4], b=[1,2,3] → expected = 20
    // Inputs change on the falling edge so the DUT samples stable values on the rising edge
    @(negedge clk); clear=1; valid_in=1; a=2; b=1; // 2*1=2
    @(negedge clk); clear=0; valid_in=1; a=3; b=2; // 3*2=6
    @(negedge clk); valid_in=1; a=4; b=3;           // 4*3=12
    @(negedge clk); valid_in=0;

    #50;
    $display("Dot product result: %0d (expected 20)", accum);
    if (accum == 20) $display("PASS ✓");
    else             $display("FAIL ✗");
    $finish;
  end
endmodule
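A software reference model is a common way to produce expected values for a testbench like the one above. The sketch below mimics the arithmetic of the MAC unit (signed 8-bit inputs, accumulator that wraps at a configurable width) but not its timing; `to_signed` and `mac_reference` are hypothetical names.

```python
def to_signed(value: int, bits: int) -> int:
    """Wrap an integer to a signed two's-complement value of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac_reference(weights, activations, accum_bits: int = 32) -> int:
    """Accumulate INT8 products the way an accum_bits-wide register would."""
    accum = 0
    for a, b in zip(weights, activations):
        product = to_signed(a, 8) * to_signed(b, 8)      # always fits in 16 bits
        accum = to_signed(accum + product, accum_bits)   # wraps like hardware
    return accum


print(mac_reference([2, 3, 4], [1, 2, 3]))                          # 20
print(mac_reference([-128, -128], [-128, -128], accum_bits=16))     # -32768 (overflow)
```

The second call shows why the accumulator width matters: with only 16 bits, two worst-case products already wrap to a negative value.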

9. INT4 Quantization — Pushing Further

INT4 halves memory bandwidth again and can pack 4 MAC operations into one Xilinx DSP slice. It's used in LLM inference (GPTQ, AWQ) and some mobile CNNs. The challenge is higher quantization error:

Format	Bits	Levels	Max Error (relative for FP; absolute over [-2.5, +2.5] for INT)	Accuracy Drop (ResNet-50)	MACs/DSP/cycle
FP32	32	~4.3B	~1e-7	Baseline	0.25
FP16	16	~65K	~1e-3	<0.1%	0.5
INT8	8	256	±0.010	~0.5%	2
INT4	4	16	±0.167	1–3%	4
INT2	2	4	±0.833	5–15%	8
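How quickly the error grows as bits are removed can be computed directly from the step-size formula. The sketch below uses the [-2.5, +2.5] range from the error analysis in section 5; a different assumed range rescales all of these figures proportionally. `grid_stats` is an illustrative helper.

```python
def grid_stats(bits: int, x_min: float = -2.5, x_max: float = 2.5):
    """Number of levels, step size and worst-case rounding error for a uniform grid."""
    levels = 2 ** bits
    step = (x_max - x_min) / (levels - 1)
    return levels, step, step / 2


for bits in (8, 4, 2):
    levels, step, max_err = grid_stats(bits)
    print(f"INT{bits}: {levels} levels, step {step:.4f}, max error ±{max_err:.4f}")
```

Each bit removed roughly doubles the step, so going from INT8 to INT4 makes the worst-case rounding error about 17 times larger for the same range.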

INT4 Requires Careful Calibration

With only 16 levels, INT4 is sensitive to outliers. Techniques like GPTQ (post-training quantization guided by approximate second-order information) or AWQ (activation-aware weight quantization) are needed to keep accuracy drops under 2% for transformer models.

10. Practical Quantization Workflow

FP32 → INT8 → FPGA Deployment Pipeline

Train FP32 model — PyTorch / TensorFlow, normal training loop

Calibration — Run 1K–5K representative images to collect activation statistics

PTQ or QAT — Post-training quantization (fast) or quantization-aware training (more accurate)

Vitis AI Quantizer — vai_q_pytorch converts to Xilinx-optimized INT8 format with per-layer scale factors

Deploy on FPGA — Vitis AI compiler maps quantized model to DPU or your custom RTL accelerator

11. Day 2 Checklist

Key Takeaways — Fixed-Point & Quantization

✅ FP32 on FPGA is 8× less efficient than INT8 — avoid it for inference
✅ Q-format (Qm.n) defines integer + fractional bit split
✅ INT8 quantization maps FP32 values to 256 discrete levels with scale factor S
✅ Quantization error = ±S/2 per value; SNR ≈ 50 dB for INT8
✅ Always use 32-bit accumulators for INT8 MAC to prevent overflow
✅ KL-divergence calibration finds optimal scale factor per layer
✅ 3-stage pipelined MAC unit in Verilog maps to DSP48E2 on Xilinx
✅ INT4 achieves 4 MACs/DSP/cycle at the cost of ~1–3% accuracy drop
✅ Vitis AI vai_q_pytorch automates the FP32 → INT8 conversion workflow

Next — Day 3: Matrix Multiply Accelerator — building a 4×4 tiled GEMM engine in Verilog with DSP48 packing, input/output buffering, and throughput analysis.
