
Fixed-Point Arithmetic & Quantization

Why FP32 is wasteful on FPGA. INT8/INT4 fixed-point representation, Q-format notation, quantization error analysis, scale factors, overflow handling, and a fully pipelined 8-bit MAC unit in Verilog.

By EcrioniX Engineering Team · Published June 14, 2026 · ~4,700 words · 15 min read

1. Why Floating-Point is Wrong for FPGA Inference

FP32 (32-bit floating-point) is the format PyTorch and TensorFlow train in. It gives numerical accuracy and large dynamic range — perfect for training where gradients span many orders of magnitude. But for inference on FPGA, FP32 is a terrible choice:

FP32 on FPGA — The Problems
  • ❌ 1 FP32 multiply needs 3–4 DSP blocks
  • ❌ Exponent + mantissa logic eats 1,500+ LUTs
  • ❌ 4× more memory bandwidth vs INT8
  • ❌ Slower clock due to complex rounding
  • ❌ Only 3,072 FP32 MACs/cycle on Alveo U250
INT8 on FPGA — The Wins
  • ✅ 2 INT8 MACs per DSP block per cycle
  • ✅ Simple shift-and-add hardware
  • ✅ 4× less memory bandwidth
  • ✅ Higher clock frequency (simpler logic)
  • ✅ 24,576 INT8 MACs/cycle on Alveo U250 (8× the FP32 figure)

Research Reality Check

Google's landmark 2017 paper showed ResNet-50 loses only 0.5% top-1 accuracy when quantized from FP32 → INT8. MobileNetV2 loses ~0.9%. That's a completely acceptable trade-off for 8× more compute throughput on FPGA.

2. How Numbers Are Represented

2.1 Floating-Point (FP32) — Review

FP32 — 32-bit Floating Point Layout
  • Sign (S): 1 bit
  • Exponent (E): 8 bits, stored with a bias of 127; effective range -126 to +127 for normal numbers
  • Mantissa / Fraction (M): 23 bits, about 7 decimal digits of precision

Value = (-1)^S × 2^(E-127) × (1 + M/2^23). The exponent logic is what makes FP32 expensive in FPGA hardware.
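To make the layout concrete, here is a minimal sketch that unpacks an FP32 bit pattern with Python's standard struct module and rebuilds the value from the three fields. The function name decode_fp32 is illustrative, and the sketch assumes a normal number (it does not handle zero, subnormals, infinities, or NaN).

```python
import struct

def decode_fp32(x):
    """Split a float into FP32 sign, exponent, and mantissa fields (normal numbers only)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32-bit pattern
    sign = bits >> 31                    # 1 bit
    exponent = (bits >> 23) & 0xFF       # 8-bit biased exponent E
    mantissa = bits & 0x7FFFFF           # 23-bit fraction M
    value = (-1) ** sign * 2.0 ** (exponent - 127) * (1 + mantissa / 2**23)
    return sign, exponent, mantissa, value

print(decode_fp32(3.25))   # (0, 128, 5242880, 3.25)
```

Every one of these fields needs its own hardware path (exponent compare, mantissa alignment, normalization), which is the cost that fixed-point avoids.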

2.2 Fixed-Point Representation

Fixed-point removes the exponent entirely. The binary point sits at a fixed, pre-agreed position, so the hardware is trivially simple: plain integers with an implied scaling factor of 2^(-n).

Fixed-Point — Q3.4 Format (8-bit signed example)
  • Bit 7: sign
  • Bits 6–4: integer part (3 bits, weights 4, 2, 1)
  • Binary point: between bit 4 and bit 3
  • Bits 3–0: fractional part (4 bits, weights ½, ¼, ⅛, 1/16)
Example: 0b 0 011 0100
Integer part: 011 = 3
Fraction: 0100 = 4/16 = 0.25
Value = 3.25
Range (Q3.4, signed)
Max: +7.9375
Min: -8.0
Step: 1/16 = 0.0625
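The example and range above can be checked with a few lines of Python. This is a sketch that assumes two's-complement encoding (which is what the minimum of -8.0 implies); decode_fixed is an illustrative name, not part of any library.

```python
def decode_fixed(raw, int_bits, frac_bits):
    """Interpret an unsigned bit pattern as a signed two's-complement Qm.n value."""
    total_bits = 1 + int_bits + frac_bits       # sign + integer + fraction
    if raw >= 1 << (total_bits - 1):            # sign bit set: value is negative
        raw -= 1 << total_bits
    return raw / (1 << frac_bits)               # apply the implied 2^(-n) scaling

print(decode_fixed(0b00110100, 3, 4))   # 3.25
print(decode_fixed(0b01111111, 3, 4))   # 7.9375 (maximum)
print(decode_fixed(0b10000000, 3, 4))   # -8.0   (minimum)
```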

3. Q-Format Notation

Q-format (Qm.n) precisely defines how a fixed-point number's bits are split. You will see this notation throughout FPGA AI literature and Verilog codebases.

Q-Format Rules:
  Qm.n = m integer bits + n fractional bits
  Total bits = sign bit + m + n (for signed)

Common formats for neural networks:
  Q0.7 → 0 int bits, 7 frac bits (8 bits total)  → range [-1.0, +0.992]     step 0.0078
  Q1.7 → 1 int bit,  7 frac bits (9 bits total)  → range [-2.0, +1.992]     step 0.0078
  Q4.4 → 4 int bits, 4 frac bits (9 bits total)  → range [-16.0, +15.9375]  step 0.0625
  Q7.8 → 7 int bits, 8 frac bits (16 bits total) → range [-128, +127.996]   step 0.0039

Note: some references count the sign bit inside m, so the 8-bit format with range [-1.0, +0.992] is written Q1.7 there. Always check which convention a codebase uses.

Value formula:
  x_real = x_int × 2^(-n)
  x_int  = round(x_real × 2^n)

Example: convert 0.3 to Q0.7 (8 bits):
  x_int = round(0.3 × 2^7) = round(0.3 × 128) = round(38.4) = 38
  Binary: 0b 0_0100110 = 38
  Recovered: 38 / 128 = 0.296875  (error = 0.3 - 0.296875 = 0.003125)
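The value formula translates directly into code. The sketch below quantizes a real number into an 8-bit signed word with 7 fractional bits and saturates anything out of range; to_fixed and to_real are illustrative names. Note that Python's round() rounds exact halves to the nearest even integer, which may differ from the rounding mode in your hardware.

```python
def to_fixed(x_real, frac_bits, total_bits=8):
    """x_int = round(x_real * 2^n), saturated to the signed total_bits range."""
    x_int = round(x_real * (1 << frac_bits))
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, x_int))

def to_real(x_int, frac_bits):
    """x_real = x_int * 2^(-n)"""
    return x_int / (1 << frac_bits)

x_int = to_fixed(0.3, frac_bits=7)
recovered = to_real(x_int, frac_bits=7)
print(x_int, format(x_int & 0xFF, "08b"), recovered)   # 38 00100110 0.296875
print(round(0.3 - recovered, 6))                       # 0.003125
print(to_fixed(1.5, frac_bits=7))                      # 127 (saturated)
```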

4. INT8 Quantization — The Industry Standard

INT8 quantization maps FP32 weights and activations to 8-bit signed integers. This is what TensorFlow Lite, PyTorch Mobile, and Vitis AI all use by default. Two key parameters define the mapping:

FP32 → INT8 Quantization Mapping
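FP32 range [-2.5, +2.5] maps to INT8 range [-127, +127] with scale = 2.5/127 ≈ 0.01969 and zero_point = 0 (symmetric). Each step of 0.01969 is about 0.4% of the 5.0-wide range, giving 255 discrete levels (the code -128 goes unused in the symmetric scheme).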
Quantize: q = clamp(round(x/S + Z), -128, 127)
Dequantize: x ≈ S × (q - Z)
Scale Factor S: S = (max - min) / (2^bits - 1)
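As a minimal sketch, the quantize and dequantize formulas look like this in plain Python. It uses the symmetric mapping of [-2.5, +2.5] from the figure (scale = 2.5/127, zero point 0); the function names and default arguments are illustrative choices.

```python
def quantize(x, scale, zero_point=0, q_min=-128, q_max=127):
    """q = clamp(round(x / S + Z), q_min, q_max)"""
    return max(q_min, min(q_max, round(x / scale + zero_point)))

def dequantize(q, scale, zero_point=0):
    """x ~= S * (q - Z)"""
    return scale * (q - zero_point)

scale = 2.5 / 127   # symmetric mapping of [-2.5, +2.5]
for x in (-2.5, 0.0, 1.0, 2.5, 4.0):
    q = quantize(x, scale)
    print(x, q, round(dequantize(q, scale), 4))
```

The last input (4.0) lies outside the calibrated range, so it clamps to 127 and dequantizes to 2.5. That is the clipping loss the calibration step tries to keep small.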

5. Quantization Error Analysis

Every quantized value has an error equal to the rounding distance. Understanding this error is critical for knowing when INT8 is safe vs when you need INT16 or FP16.

Quantization Error — Sawtooth Pattern
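[Figure: quantization error plotted against input value x. The error is a sawtooth that swings between -S/2 and +S/2 with period S, so the maximum error is S/2.]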
For INT8 with range [-2.5, +2.5]:
Step size S = 5.0 / 255 ≈ 0.01961  |  Max error = S/2 ≈ ±0.0098  |  SNR ≈ 49.9 dB for a full-scale sine input (6.02 × 8 + 1.76), which is ample for inference
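These figures are easy to reproduce. The sketch below sweeps the range, measures the worst rounding error, and evaluates the textbook SNR estimate for an ideal N-bit quantizer driven by a full-scale sine wave (6.02·N + 1.76 dB). The helper names and the sample count are assumptions made for illustration.

```python
def max_rounding_error(x_min, x_max, bits, samples=10001):
    """Sweep [x_min, x_max] and return (step, largest |x - nearest level|)."""
    step = (x_max - x_min) / ((1 << bits) - 1)
    worst = 0.0
    for i in range(samples):
        x = x_min + (x_max - x_min) * i / (samples - 1)
        level = round((x - x_min) / step)          # index of nearest level
        worst = max(worst, abs(x - (x_min + level * step)))
    return step, worst

def ideal_snr_db(bits):
    """Ideal quantizer SNR for a full-scale sine input."""
    return 6.02 * bits + 1.76

step, worst = max_rounding_error(-2.5, 2.5, bits=8)
print(round(step, 5), worst <= step / 2 + 1e-12)   # 0.01961 True
print(round(ideal_snr_db(8), 1))                   # 49.9
```

Real activations are not sine waves, so treat the SNR figure as a rule of thumb: each bit removed costs roughly 6 dB.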

6. Overflow and Saturation Handling

Overflow is the most dangerous bug in fixed-point neural networks. Each INT8 × INT8 multiply yields a 16-bit product with magnitude up to 16,384 (-128 × -128), so a signed 16-bit register (maximum 32,767) can overflow after just two worst-case products. Summing 256 such products needs 24 bits, which a 32-bit accumulator holds with room to spare. This is why accumulators must be wider than inputs.

Accumulator Width — Why 32 Bits?
  • INT8 × INT8: 8 bits (A) × 8 bits (B) = 16-bit product
  • Accumulate 256×: 32-bit accumulator (needed!); a 16-bit accumulator overflows ✗
  • Requantize output: 32-bit accumulator → scale → INT8 output
Rule: Always use 32-bit accumulators for INT8 MACs. After each layer, requantize back to INT8. In hardware, the 48-bit accumulator of a DSP48E2 slice provides this headroom.
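The accumulator sizing can be worked out numerically. The following sketch computes the minimum signed width for a given number of INT8 products, models a register that wraps on overflow, and shows a simple requantization step. All function names are illustrative, and requantize assumes symmetric quantization (zero point 0) with scale_in standing for the product of the weight and activation scales.

```python
def accumulator_bits(data_bits, num_terms):
    """Smallest signed width that holds num_terms worst-case products."""
    worst_product = (1 << (data_bits - 1)) ** 2    # (-128) * (-128) = 16384
    worst_sum = worst_product * num_terms
    return worst_sum.bit_length() + 1              # +1 for the sign bit

def wrap_signed(value, bits):
    """Model a two's-complement register that silently wraps on overflow."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def requantize(acc, scale_in, scale_out):
    """Rescale a wide accumulator to INT8 with saturation."""
    return max(-128, min(127, round(acc * scale_in / scale_out)))

exact = sum(w * a for w, a in zip([127] * 256, [127] * 256))
print(accumulator_bits(8, 256))                                # 24
print(exact, wrap_signed(exact, 16), wrap_signed(exact, 32))   # 4129024 256 4129024
```

The 16-bit register ends up holding 256 instead of 4,129,024, with no error flag raised. That silent wrap is what makes this bug so hard to spot in simulation.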

7. Scale Factor Calibration

Choosing the right scale factor S is what separates good quantization from bad. If S is too large, the steps are coarse and small values collapse to 0 (resolution loss). If S is too small, values beyond ±127·S saturate at ±127 (clipping loss).

Calibration Methods:

1. Min-Max (simplest):
   S = (max_val - min_val) / (2^bits - 1)
   Z = q_min - round(min_val / S)
   (q_min = -128 for signed INT8; for unsigned UINT8, q_min = 0 and Z = -round(min_val / S))
   Problem: outliers inflate S → poor precision for typical values

2. Percentile (better):
   Use 99.9th percentile instead of absolute max
   Clips 0.1% of values but improves precision for 99.9%

3. KL-Divergence (best; used by TensorRT, Vitis AI):
   Minimize KL divergence between FP32 and INT8 distribution
   Runs on calibration dataset (1000-5000 representative images)
   Automatically finds optimal S per layer

Python example (PyTorch):
   import torch.quantization
   model.eval()  # static post-training quantization expects eval mode
   model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
   model_prepared = torch.quantization.prepare(model)
   # Run calibration data through model_prepared
   model_int8 = torch.quantization.convert(model_prepared)
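To see why outliers matter, here is a small standard-library sketch that compares min-max and percentile calibration on synthetic data with one outlier. For simplicity it assumes symmetric quantization (zero point 0, scale = clip magnitude / 127) rather than the asymmetric formula above; the function names and the synthetic data are illustrative only.

```python
import random

def minmax_scale(values, bits=8):
    """Symmetric min-max: the largest magnitude maps to the top code."""
    return max(abs(v) for v in values) / ((1 << (bits - 1)) - 1)

def percentile_scale(values, pct=99.9, bits=8):
    """Symmetric percentile: ignore the largest (100 - pct)% of magnitudes."""
    mags = sorted(abs(v) for v in values)
    index = min(len(mags) - 1, int(len(mags) * pct / 100))
    return mags[index] / ((1 << (bits - 1)) - 1)

random.seed(0)
activations = [random.gauss(0.0, 1.0) for _ in range(10000)] + [40.0]  # one outlier

s_minmax = minmax_scale(activations)
s_pct = percentile_scale(activations)
print("min-max step:   ", round(s_minmax, 4))
print("percentile step:", round(s_pct, 4))
```

A single outlier at 40.0 stretches the min-max step to about 0.315, so nearly all of the typical values (which sit within a few units of zero) share a handful of codes. The percentile method clips that outlier and keeps a much finer step for everything else.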

8. Building the INT8 MAC Unit in Verilog

Now let's build it. This is a fully pipelined 3-stage MAC unit using signed INT8 inputs, a 16-bit product register, and a 32-bit accumulator — exactly how a real CNN accelerator works.

// mac_unit.v: Pipelined INT8 MAC unit for an FPGA neural network
// 3-stage pipeline: Stage 1 = multiply, Stage 2 = sign-extend, Stage 3 = accumulate
// Written for DSP inference: Xilinx synthesis can map the multiply-accumulate to DSP blocks
module mac_unit #(
    parameter DATA_W  = 8,    // INT8 input width
    parameter PROD_W  = 16,   // 8x8 = 16-bit product
    parameter ACCUM_W = 32    // 32-bit accumulator (safe for 256 accumulations)
)(
    input  wire                      clk,
    input  wire                      rst_n,
    input  wire                      clear,      // marks the first element of a new dot product
    input  wire                      valid_in,   // input data is valid
    input  wire signed [DATA_W-1:0]  a,          // weight (INT8 signed)
    input  wire signed [DATA_W-1:0]  b,          // activation (INT8 signed)
    output reg  signed [ACCUM_W-1:0] accum,      // accumulated result
    output reg                       valid_out   // output valid
);

    // Pipeline Stage 1: multiply
    reg signed [PROD_W-1:0] product;
    reg valid_s1;
    reg clear_s1;   // clear must travel through every stage with its data
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            product  <= 0;
            valid_s1 <= 0;
            clear_s1 <= 0;
        end else begin
            product  <= a * b;      // 8x8 = 16-bit signed multiply
            valid_s1 <= valid_in;
            clear_s1 <= clear;
        end
    end

    // Pipeline Stage 2: sign-extend product to accumulator width
    reg signed [ACCUM_W-1:0] product_ext;
    reg valid_s2;
    reg clear_s2;
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            product_ext <= 0;
            valid_s2    <= 0;
            clear_s2    <= 0;
        end else begin
            product_ext <= {{(ACCUM_W-PROD_W){product[PROD_W-1]}}, product}; // sign extend
            valid_s2    <= valid_s1;
            clear_s2    <= clear_s1;   // was "clear": one stage early, misaligned with product_ext
        end
    end

    // Pipeline Stage 3: accumulate or clear
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            accum     <= 0;
            valid_out <= 0;
        end else begin
            valid_out <= valid_s2;
            if (clear_s2)
                accum <= valid_s2 ? product_ext : 0;   // start new accumulation
            else if (valid_s2)
                accum <= accum + product_ext;          // accumulate
        end
    end

endmodule

Testbench — Verify the MAC Unit

// mac_unit_tb.v: Testbench, computes dot product [2,3,4] · [1,2,3] = 2+6+12 = 20
`timescale 1ns/1ps
module mac_unit_tb;
    reg clk = 0, rst_n = 0, clear = 0, valid_in = 0;
    reg signed [7:0] a = 0, b = 0;
    wire signed [31:0] accum;
    wire valid_out;

    mac_unit uut(.clk(clk), .rst_n(rst_n), .clear(clear),
                 .valid_in(valid_in), .a(a), .b(b),
                 .accum(accum), .valid_out(valid_out));

    always #5 clk = ~clk;   // 100MHz clock

    initial begin
        rst_n = 0; #20; rst_n = 1;
        // Dot product: a=[2,3,4], b=[1,2,3] → expected = 20
        // Inputs change on the falling edge to avoid racing the DUT's posedge sampling
        @(negedge clk); clear=1; valid_in=1; a=2; b=1;   // 2*1=2
        @(negedge clk); clear=0; valid_in=1; a=3; b=2;   // 3*2=6
        @(negedge clk);          valid_in=1; a=4; b=3;   // 4*3=12
        @(negedge clk); valid_in=0;
        #50;
        $display("Dot product result: %0d (expected 20)", accum);
        if (accum == 20) $display("PASS ✓");
        else             $display("FAIL ✗");
        $finish;
    end
endmodule
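For longer test vectors it helps to have a software golden model that produces the expected accumulator value. The sketch below is one way to write it in Python; it models the arithmetic only (signed 8-bit inputs, a wrapping accumulator), not the pipeline timing, and the function names are illustrative.

```python
def to_int8(x):
    """Wrap an integer into the signed 8-bit range, like a signed [7:0] port."""
    x &= 0xFF
    return x - 256 if x & 0x80 else x

def mac_reference(weights, activations, accum_bits=32):
    """Sum of signed 8x8 products held in a wrapping accum_bits-wide register."""
    mask = (1 << accum_bits) - 1
    acc = 0
    for w, a in zip(weights, activations):
        acc = (acc + to_int8(w) * to_int8(a)) & mask
    return acc - (1 << accum_bits) if acc >> (accum_bits - 1) else acc

print(mac_reference([2, 3, 4], [1, 2, 3]))        # 20
print(mac_reference([-128, 127], [127, -128]))    # -32512
```

Feeding the same vectors to the testbench and comparing accum against mac_reference catches sign-extension and alignment bugs that a single hand-picked test case can miss.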

9. INT4 Quantization — Pushing Further

INT4 halves memory bandwidth again and fits 4 MAC operations per DSP58 on Xilinx. It's used in LLM inference (GPTQ, AWQ) and some mobile CNNs. The challenge is higher quantization error:

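| Format | Bits | Levels | Max Error (range [-2.5, +2.5]) | Accuracy Drop (ResNet-50) | MACs/DSP/cycle |
|--------|------|--------|--------------------------------|---------------------------|----------------|
| FP32   | 32   | ~4.3B  | ~1e-7                          | Baseline                  | 0.25           |
| FP16   | 16   | ~65K   | ~1e-3                          | <0.1%                     | 0.5            |
| INT8   | 8    | 256    | ±0.010                         | ~0.5%                     | 2              |
| INT4   | 4    | 16     | ±0.167                         | 1–3%                      | 4              |
| INT2   | 2    | 4      | ±0.833                         | 5–15%                     | 8              |

Integer max errors are S/2 with S = 5.0 / (2^bits - 1), the same formula used in Section 5.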

INT4 Requires Careful Calibration

With only 16 levels, INT4 is sensitive to outliers. Techniques like GPTQ (post-training quantization that uses approximate second-order information to compensate rounding error) or AWQ (activation-aware weight quantization) are needed to keep accuracy drops under 2% for transformer models.
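The bandwidth saving comes from storing two INT4 values in each byte. As a minimal sketch of that packing, the code below puts the first value of each pair in the low nibble. That nibble order is an assumption made here for illustration; real toolchains define their own layout.

```python
def pack_int4(values):
    """Pack signed 4-bit values (-8..7) two per byte, low nibble first."""
    values = list(values)
    if len(values) % 2:
        values.append(0)                           # pad to an even count
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        if not (-8 <= lo <= 7 and -8 <= hi <= 7):
            raise ValueError("INT4 values must be in [-8, 7]")
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)

def unpack_int4(data, count):
    """Recover count signed 4-bit values from packed bytes."""
    values = []
    for byte in data:
        for nibble in (byte & 0xF, byte >> 4):
            values.append(nibble - 16 if nibble & 0x8 else nibble)
    return values[:count]

weights = [-8, 7, 3, -1, 0]
packed = pack_int4(weights)
print(len(weights), "values in", len(packed), "bytes")   # 5 values in 3 bytes
print(unpack_int4(packed, len(weights)))                 # [-8, 7, 3, -1, 0]
```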

10. Practical Quantization Workflow

FP32 → INT8 → FPGA Deployment Pipeline
1. Train FP32 model: PyTorch / TensorFlow, normal training loop
2. Calibration: run 1K–5K representative images to collect activation statistics
3. PTQ or QAT: post-training quantization (fast) or quantization-aware training (more accurate)
4. Vitis AI Quantizer: vai_q_pytorch converts to Xilinx-optimized INT8 format with per-layer scale factors
5. Deploy on FPGA: Vitis AI compiler maps quantized model to DPU or your custom RTL accelerator

11. Day 2 Checklist

Key Takeaways — Fixed-Point & Quantization

Next — Day 3: Matrix Multiply Accelerator — building a 4×4 tiled GEMM engine in Verilog with DSP48 packing, input/output buffering, and throughput analysis.
