Why FP32 is wasteful on FPGA. INT8/INT4 fixed-point representation, Q-format notation, quantization error analysis, scale factors, overflow handling, and a fully pipelined 8-bit MAC unit in Verilog.
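FP32 (32-bit floating-point) is the format PyTorch and TensorFlow train in. It gives numerical accuracy and a large dynamic range, which suits training, where gradients span many orders of magnitude. But for inference on an FPGA, FP32 is a poor fit: every value costs 32 bits of memory bandwidth, and every multiply-accumulate needs exponent alignment and normalization logic that integer arithmetic avoids.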
Google's landmark 2017 paper showed ResNet-50 loses only 0.5% top-1 accuracy when quantized from FP32 → INT8. MobileNetV2 loses ~0.9%. That's a completely acceptable trade-off for 8× more compute throughput on FPGA.
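Value = (-1)ˢ × 2^(E-127) × (1 + M/2²³), where s is the sign bit, E is the 8-bit biased exponent, and M is the 23-bit mantissa field (normalized values only). Handling the exponent (alignment shifts, normalization, rounding) is what makes floating-point hardware expensive on an FPGA.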
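To make the formula concrete, here is a small Python sketch that pulls the three fields out of a float and rebuilds the value from them. The function name `decode_fp32` is made up for this example, and the reconstruction only covers normalized values (it ignores zero, subnormals, infinities, and NaN).

```python
import struct

def decode_fp32(x: float):
    """Split x into IEEE 754 single-precision fields and rebuild the value."""
    # Reinterpret the 4 bytes of the float as an unsigned 32-bit integer
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31            # 1 sign bit
    e = (bits >> 23) & 0xFF   # 8-bit biased exponent
    m = bits & 0x7FFFFF       # 23-bit mantissa field
    # Formula from the text; valid only for normalized values (0 < E < 255)
    value = (-1) ** s * 2.0 ** (e - 127) * (1 + m / 2 ** 23)
    return s, e, m, value

print(decode_fp32(-6.5))  # -6.5 = -1.625 * 2^2, so E = 129
```

Every one of those field manipulations becomes shifters, comparators, and adders in hardware, which is the cost fixed-point avoids.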
Fixed-point removes the exponent entirely. The binary point sits at a fixed, known position, so the hardware is just integer arithmetic with a pre-agreed scaling factor.
Q-format (Qm.n) precisely defines how a fixed-point number's bits are split. You will see this notation throughout FPGA AI literature and Verilog codebases.
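As a rough illustration, the sketch below converts real numbers to and from Qm.n. It assumes the convention where m counts the integer bits including the sign bit and n counts the fractional bits, so Q1.7 is an 8-bit value covering [-1, 127/128]. Other sources count m without the sign bit, so check which convention a given codebase uses. The helper names `to_q` and `from_q` are hypothetical.

```python
def to_q(x: float, m: int, n: int) -> int:
    """Convert a real number to a signed Qm.n integer, saturating at the limits.

    Assumed convention: m integer bits (sign included) + n fractional bits.
    """
    total = m + n
    lo, hi = -(1 << (total - 1)), (1 << (total - 1)) - 1
    raw = round(x * (1 << n))      # scale by 2^n, then round to an integer
    return max(lo, min(hi, raw))   # saturate instead of wrapping

def from_q(raw: int, n: int) -> float:
    """Convert a Qm.n integer back to a real number."""
    return raw / (1 << n)

raw = to_q(0.3, 1, 7)              # 0.3 * 128 = 38.4, rounds to 38
print(raw, from_q(raw, 7))         # stored integer and the value it represents
print(to_q(1.5, 1, 7))             # out of range for Q1.7, saturates to 127
```

The stored integer is all the hardware ever sees; the position of the binary point lives only in the designer's head and in the scaling applied at the edges.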
INT8 quantization maps FP32 weights and activations to 8-bit signed integers. This is what TensorFlow Lite, PyTorch Mobile, and Vitis AI all use by default. Two key parameters define the mapping: the scale S and the zero point Z.
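A minimal sketch of such a mapping, assuming the common affine convention x ≈ S · (q − Z) with a scale S and an integer zero point Z. The function names are illustrative and do not come from any of the frameworks above; the example range [0, 6] stands in for a ReLU6 activation.

```python
def quant_params(x_min: float, x_max: float, qmin: int = -128, qmax: int = 127):
    """Pick scale S and zero point Z so that x ~= S * (q - Z) over [x_min, x_max]."""
    # The range must include 0.0 so that zero is exactly representable
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)   # assumes x_max > x_min
    zero_point = round(qmin - x_min / scale)
    return scale, max(qmin, min(qmax, zero_point))

def quantize(x: float, scale: float, zp: int, qmin: int = -128, qmax: int = 127) -> int:
    """Real value -> INT8 code, clamped to the representable range."""
    return max(qmin, min(qmax, round(x / scale) + zp))

def dequantize(q: int, scale: float, zp: int) -> float:
    """INT8 code -> approximate real value."""
    return scale * (q - zp)

scale, zp = quant_params(0.0, 6.0)
q = quantize(1.0, scale, zp)
print(scale, zp, q, dequantize(q, scale, zp))
```

With symmetric quantization (Z = 0) the zero-point subtraction disappears from the datapath, which is one reason weight tensors are often quantized symmetrically on FPGA.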
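Every in-range quantized value carries a rounding error of at most half a quantization step (S/2); values outside the representable range are clipped and can be off by much more. Understanding this error is critical for knowing when INT8 is safe and when you need INT16 or FP16.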
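The rounding error can be measured directly. The sketch below assumes symmetric quantization (zero point 0) with a step of S = 1/127 and uniformly distributed inputs in [-1, 1], then compares the measured error with the usual uniform-noise model: maximum error S/2 and RMS error S/√12. The data and helper name are made up for illustration.

```python
import math
import random

def quantize_sym(x: float, scale: float, qmax: int = 127) -> int:
    """Symmetric quantization: round to the nearest step, clamp to +/-qmax."""
    return max(-qmax, min(qmax, round(x / scale)))

rng = random.Random(0)                        # fixed seed for repeatability
xs = [rng.uniform(-1.0, 1.0) for _ in range(10000)]
scale = 1.0 / 127                             # step size for a [-1, 1] range

errs = [x - scale * quantize_sym(x, scale) for x in xs]
max_err = max(abs(e) for e in errs)
rms_err = math.sqrt(sum(e * e for e in errs) / len(errs))
signal_rms = math.sqrt(sum(x * x for x in xs) / len(xs))
sqnr_db = 20 * math.log10(signal_rms / rms_err)

print(f"max error  {max_err:.5f}  (bound S/2 = {scale / 2:.5f})")
print(f"rms error  {rms_err:.5f}  (model S/sqrt(12) = {scale / math.sqrt(12):.5f})")
print(f"SQNR       {sqnr_db:.1f} dB")
```

Each bit removed from the word doubles S and costs roughly 6 dB of signal-to-quantization-noise ratio, which is the basic reason INT4 is harder than INT8.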
Overflow is the most dangerous bug in fixed-point neural networks. A single INT8×INT8 product already needs 16 bits, and a dot product accumulates hundreds of them. Summing just two worst-case products, (-128)×(-128) = 16384 each, overflows a signed 16-bit register, while a 32-bit accumulator has room for more than 100,000 of them. This is why accumulators must be wider than the product.
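The required accumulator width follows from the worst case, which a few lines of Python can check. This is a sketch with hypothetical helper names; `wrap` models a plain two's-complement register that silently wraps, and `saturate` models the clamping alternative.

```python
def accum_bits(n_terms: int, data_bits: int = 8) -> int:
    """Smallest signed accumulator width that cannot overflow for n_terms products."""
    max_product = (1 << (data_bits - 1)) ** 2        # (-128) * (-128) = 16384
    return (n_terms * max_product).bit_length() + 1  # +1 for the sign bit

def max_terms(accum_width: int, data_bits: int = 8) -> int:
    """How many worst-case products fit in a signed accumulator of this width."""
    max_product = (1 << (data_bits - 1)) ** 2
    return ((1 << (accum_width - 1)) - 1) // max_product

def wrap(value: int, width: int) -> int:
    """Two's-complement wraparound, as an unprotected register would behave."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value

def saturate(value: int, width: int) -> int:
    """Clamp to the signed range instead of wrapping."""
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    return max(lo, min(hi, value))

print(accum_bits(256))            # bits needed for a 256-element dot product
print(max_terms(16), max_terms(32))
print(wrap(2 * 16384, 16))        # two worst-case products in 16 bits: sign flips
print(saturate(2 * 16384, 16))    # saturation gives the nearest legal value
```

Wraparound turns a large positive sum into a large negative one, which is far more damaging to a network than the small error saturation introduces.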
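Choosing the right scale factor S is what separates good quantization from bad. With the convention x ≈ S·q, an S that is too small pushes large values past the INT8 range, so they saturate at ±127 (clipping error). An S that is too large squeezes most values into a handful of levels near 0 (rounding error).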
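One simple way to pick S is max-abs calibration: set S so the largest observed magnitude maps to 127. The sketch below uses synthetic Gaussian "weights" (an assumption made purely for illustration) and the convention x ≈ S·q to compare the mean squared error at that S against scales 8× smaller and 8× larger.

```python
import random

def fake_quant(x: float, scale: float, qmax: int = 127) -> float:
    """Quantize to INT8 and immediately dequantize, to expose the error."""
    q = max(-qmax, min(qmax, round(x / scale)))
    return q * scale

def mse(xs, scale: float) -> float:
    """Mean squared quantization error over a list of values."""
    return sum((x - fake_quant(x, scale)) ** 2 for x in xs) / len(xs)

rng = random.Random(0)                               # fixed seed for repeatability
weights = [rng.gauss(0.0, 1.0) for _ in range(2000)]

s_cal = max(abs(w) for w in weights) / 127           # max-abs calibration
for label, s in [("S/8 (clips)", s_cal / 8),
                 ("max-abs S", s_cal),
                 ("8*S (coarse)", s_cal * 8)]:
    print(f"{label:14s} scale={s:.5f}  mse={mse(weights, s):.6f}")
```

Max-abs is not always the best choice. When a tensor has a few extreme outliers, a slightly smaller S that clips them can lower the total error, which is what percentile and MSE-based calibration methods exploit.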
Now let's build it. This is a fully pipelined 3-stage MAC unit using signed INT8 inputs, a 16-bit product register, and a 32-bit accumulator, the same basic structure found in the processing elements of CNN accelerators.
```verilog
// mac_unit.v: pipelined INT8 MAC unit for FPGA neural network inference
// 3-stage pipeline: Stage 1 = multiply, Stage 2 = sign-extend, Stage 3 = accumulate
// Written so synthesis can map the multiply to a DSP slice
// (DSP48E2 on UltraScale+, DSP58 on Versal)
module mac_unit #(
    parameter DATA_W  = 8,   // INT8 input width
    parameter PROD_W  = 16,  // 8x8 = 16-bit product
    parameter ACCUM_W = 32   // 32-bit accumulator (holds >100,000 worst-case products)
)(
    input  wire                      clk,
    input  wire                      rst_n,
    input  wire                      clear,     // assert with the first element of a new dot product
    input  wire                      valid_in,  // input data is valid
    input  wire signed [DATA_W-1:0]  a,         // weight (INT8 signed)
    input  wire signed [DATA_W-1:0]  b,         // activation (INT8 signed)
    output reg  signed [ACCUM_W-1:0] accum,     // accumulated result
    output reg                       valid_out  // pulses once per accumulated element
);

    // Pipeline Stage 1: multiply
    reg signed [PROD_W-1:0] product;
    reg valid_s1;
    reg clear_s1;  // clear travels with its data so the two stay aligned

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            product  <= 0;
            valid_s1 <= 0;
            clear_s1 <= 0;
        end else begin
            product  <= a * b;  // 8x8 = 16-bit signed multiply
            valid_s1 <= valid_in;
            clear_s1 <= clear;
        end
    end

    // Pipeline Stage 2: sign-extend the product to accumulator width
    reg signed [ACCUM_W-1:0] product_ext;
    reg valid_s2;
    reg clear_s2;

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            product_ext <= 0;
            valid_s2    <= 0;
            clear_s2    <= 0;
        end else begin
            product_ext <= {{(ACCUM_W-PROD_W){product[PROD_W-1]}}, product};  // sign extend
            valid_s2    <= valid_s1;
            clear_s2    <= clear_s1;  // was fed from `clear` directly, one cycle ahead of its product
        end
    end

    // Pipeline Stage 3: accumulate, or restart from the current product
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            accum     <= 0;
            valid_out <= 0;
        end else begin
            valid_out <= valid_s2;
            if (clear_s2)
                accum <= product_ext;          // start new accumulation
            else if (valid_s2)
                accum <= accum + product_ext;  // accumulate
        end
    end

endmodule
```

```verilog
// mac_unit_tb.v: testbench, computes dot product [2,3,4] · [1,2,3] = 2+6+12 = 20
`timescale 1ns/1ps
module mac_unit_tb;
    reg clk = 0, rst_n = 0, clear = 0, valid_in = 0;
    reg signed [7:0] a = 0, b = 0;  // initialized so no X values enter the pipeline
    wire signed [31:0] accum;
    wire valid_out;

    mac_unit uut(.clk(clk), .rst_n(rst_n), .clear(clear),
                 .valid_in(valid_in), .a(a), .b(b),
                 .accum(accum), .valid_out(valid_out));

    always #5 clk = ~clk;  // 100 MHz clock

    // Inputs change on the falling edge so they are stable when the DUT
    // samples them on the rising edge (avoids a simulation race).
    initial begin
        repeat (2) @(negedge clk);
        rst_n = 1;
        // Dot product: a=[2,3,4], b=[1,2,3] -> expected = 20
        @(negedge clk); clear = 1; valid_in = 1; a = 2; b = 1;  // 2*1 = 2
        @(negedge clk); clear = 0; valid_in = 1; a = 3; b = 2;  // 3*2 = 6
        @(negedge clk);            valid_in = 1; a = 4; b = 3;  // 4*3 = 12
        @(negedge clk);            valid_in = 0;
        repeat (5) @(negedge clk);  // let the 3-stage pipeline drain
        $display("Dot product result: %0d (expected 20)", accum);
        if (accum == 20) $display("PASS ✓");
        else             $display("FAIL ✗");
        $finish;
    end
endmodule
```

INT4 halves memory bandwidth again and fits 4 MAC operations per DSP58 on Xilinx. It's used in LLM inference (GPTQ, AWQ) and some mobile CNNs. The challenge is higher quantization error:
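| Format | Bits | Levels | Max Rounding Error | Accuracy Drop (ResNet-50) | MACs/DSP/cycle |
|---|---|---|---|---|---|
| FP32 | 32 | ~4.3B | ~1e-7 (relative) | Baseline | 0.25 |
| FP16 | 16 | ~65K | ~1e-3 (relative) | <0.1% | 0.5 |
| INT8 | 8 | 256 | ±0.002 | ~0.5% | 2 |
| INT4 | 4 | 16 | ±0.031 | 1–3% | 4 |
| INT2 | 2 | 4 | ±0.125 | 5–15% | 8 |

The integer rounding errors assume values normalized to a range of width 1.0, so the maximum error is half a step: 1/(2 × levels).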
With only 16 levels, INT4 is sensitive to outliers. Techniques like GPTQ (post-training quantization that uses approximate second-order information to compensate for rounding error) or AWQ (activation-aware weight quantization) are needed to keep accuracy drops under 2% for transformer models.
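The outlier problem is easy to reproduce with made-up numbers. The sketch below takes a block of small values plus one large outlier, applies plain max-abs symmetric quantization, and counts how many values collapse to zero at INT4 versus INT8. It is only an illustration of the failure mode and does not implement GPTQ or AWQ; the data and helper names are hypothetical.

```python
def max_abs_scale(xs, qmax: int) -> float:
    """Symmetric scale that maps the largest magnitude to qmax."""
    return max(abs(x) for x in xs) / qmax

def zero_fraction(xs, scale: float, qmax: int) -> float:
    """Fraction of values whose quantized code is exactly 0."""
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in xs]
    return sum(1 for q in codes if q == 0) / len(codes)

bulk = [0.01 * i for i in range(-50, 51)]   # 101 values in [-0.5, 0.5]
with_outlier = bulk + [8.0]                  # one value far outside the bulk

INT4_QMAX, INT8_QMAX = 7, 127                # symmetric signed ranges
for name, qmax in [("INT4", INT4_QMAX), ("INT8", INT8_QMAX)]:
    s_clean = max_abs_scale(bulk, qmax)
    s_outl = max_abs_scale(with_outlier, qmax)
    print(f"{name}: zeros without outlier {zero_fraction(bulk, s_clean, qmax):.2f}, "
          f"with outlier {zero_fraction(with_outlier, s_outl, qmax):.2f}")
```

A single outlier stretches the INT4 step so far that the rest of the block rounds to zero, while INT8 has enough levels to absorb it. Per-group scales and outlier-aware methods exist to avoid exactly this.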
Next — Day 3: Matrix Multiply Accelerator — building a 4×4 tiled GEMM engine in Verilog with DSP48 packing, input/output buffering, and throughput analysis.