Is FPGA faster than GPU for inference?

For single-batch, low-latency inference, an FPGA is typically 5–20× faster than a GPU, because a GPU pays kernel-launch and host-transfer overhead on every request and its fixed execution pipelines are built for large batches rather than single inputs. For large-batch throughput, GPUs win thanks to thousands of CUDA cores running in parallel.

What is the roofline model for FPGA?

The roofline model plots attainable performance (TOPS or FLOPS) against arithmetic intensity (operations per byte of memory traffic). For an FPGA, the two roofs are DSP compute throughput and memory bandwidth. Workloads whose intensity falls left of the ridge point, under the sloped bandwidth line, are memory-bound; those to the right, under the flat DSP roof, are compute-bound.

How much power does an FPGA use for AI inference?

FPGAs typically consume 5–75W for inference, depending on device size. The Xilinx Kria KV260 uses ~5W total for edge AI, and the Alveo U250 uses 75W for datacenter inference. Compare that with an NVIDIA A100 at 400W: the FPGA draws roughly 5–80× less power and, by this article's ResNet-50 figures, delivers about 2–6× more inferences per watt.

FPGA vs GPU vs CPU for AI Inference — Why FPGA Wins

1. The Inference Hardware Landscape

Every neural network inference request runs on physical hardware. That hardware choice — CPU, GPU, FPGA, or custom ASIC — determines latency, throughput, power consumption, and cost. Getting this wrong costs companies millions in cloud bills, missed SLAs, or products that drain batteries in an hour.

The inference hardware ecosystem has four main competitors:

CPU (ARM, x86): General-purpose, flexible, terrible at matrix math
GPU (NVIDIA, AMD): Massively parallel, excellent for large batches, power-hungry
FPGA (Xilinx/AMD, Intel): Reconfigurable, deterministic latency, power-efficient
ASIC (TPU, NPU): Maximum efficiency, zero flexibility, $10M+ to build

The Key Insight

GPUs are optimized for training — massive batch sizes, FP32 precision, throughput over latency. Inference is a different problem: small batches (often batch=1), tight latency SLAs, and power budgets that GPUs simply cannot meet. This is the gap FPGAs fill.

2. Architecture Comparison

🖥️ CPU — ARM Cortex-A78

NEON SIMD — 128-bit, 4× INT32/cycle

Peak: ~8 GOPS per core (4-wide SIMD @ 2GHz)

Power: 1–5W mobile / 15–80W server

✗ Sequential — no matrix parallelism

🎮 GPU — NVIDIA A100

Tensor Core — 16×16×16 matrix/cycle
6912 CUDA + 432 Tensor Cores

Peak: 624 TOPS INT8 (312 TFLOPS FP16)

Power: 400W

✓ Throughput ✗ Latency overhead

⚡ FPGA — Xilinx Alveo U250

Custom datapath — ANY precision
1.7M LUTs · 12,288 DSP48E2 · 54MB on-chip SRAM (BRAM + UltraRAM)

Peak: 33.3 TOPS INT8 (vendor-quoted)

Power: 75W

✓ Latency ✓ Power ✓ Flexible

🏭 ASIC — Google TPU v4

Fixed systolic array — 128×128 MACs
Hardwired for matrix multiply

Peak: 275 TOPS INT8

Power: 170W

✓ Max efficiency ✗ Zero flexibility

3. The Latency Advantage — Where FPGA Wins

GPU's dirty secret: it's terrible at low-latency single-inference. When you send one image through ResNet-50, a GPU spends most of its time on kernel launch overhead, memory transfers (CPU→GPU), and waiting for the CUDA scheduler. The actual compute takes ~1ms, but total latency is 5–15ms.

Latency Breakdown — ResNet-50 Single Inference (batch=1)

GPU — NVIDIA A100 Total: 6.5ms

CPU→GPU transfer: 2.5ms
Kernel launch: 1.0ms
Compute: 1.2ms
GPU→CPU transfer: 1.8ms

⚠️ Actual compute is only 18% of total time — the rest is overhead

FPGA — Xilinx Alveo U250 Total: 1.1ms

PCIe DMA: 0.3ms
Pipeline compute: 0.8ms

✅ No kernel launch overhead — streaming pipeline always running. 6× lower latency than A100

Edge FPGA — Xilinx Kria KV260 ResNet-50: 0.85ms

Pure Pipeline Compute — 0.85ms (ResNet-50) | 0.15ms (MobileNetV2)

✅ Image loaded directly into BRAM — zero transfer overhead

Embedded CPU — ARM Cortex-A78 ×4 ResNet-50: 85ms

Sequential NEON SIMD — 85ms (ResNet-50) | 12ms (MobileNetV2)

❌ No matrix parallelism — NEON SIMD is 4-wide only
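As a quick check on the breakdown above, the stage latencies can be summed in a few lines of Python. This is only a sketch: the dictionary keys are illustrative names, and the numbers are copied from the figures in this section rather than measured.

```python
# Per-stage latencies in milliseconds, copied from the breakdown above.
# The stage names are illustrative labels, not profiler output.
GPU_A100_MS = {
    "cpu_to_gpu": 2.5,
    "kernel_launch": 1.0,
    "compute": 1.2,
    "gpu_to_cpu": 1.8,
}
FPGA_U250_MS = {"pcie_dma": 0.3, "pipeline_compute": 0.8}


def total_latency_ms(stages):
    """End-to-end latency is the sum of the sequential stages."""
    return sum(stages.values())


def fraction(stages, name):
    """Share of end-to-end latency spent in one named stage."""
    return stages[name] / total_latency_ms(stages)


gpu_total = total_latency_ms(GPU_A100_MS)
fpga_total = total_latency_ms(FPGA_U250_MS)

print(f"GPU total:  {gpu_total:.1f} ms")
print(f"FPGA total: {fpga_total:.1f} ms")
print(f"GPU compute share: {fraction(GPU_A100_MS, 'compute'):.0%}")
print(f"FPGA latency advantage: {gpu_total / fpga_total:.1f}x")
```

Replacing the dictionaries with timings from your own profiler shows quickly whether transfers or compute dominate on your system.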

4. Throughput vs Latency — The Fundamental Trade-off

Every inference hardware choice sits somewhere on the throughput-latency curve. GPU is optimized for throughput. CPU is low throughput, moderate latency. FPGA sits uniquely at the sweet spot: low latency AND reasonable throughput.

Key metrics defined:

Latency = time from input_available to output_ready (single request)
Units: milliseconds (ms) or microseconds (µs)
Critical for: autonomous vehicles, robotics, real-time video

Throughput = inferences per second (IPS)
Units: IPS, TOPS (tera-operations per second)
Critical for: data center batch processing, video stream analytics

Power efficiency = throughput / power
Units: IPS/W or TOPS/W
Best metric for battery-powered or thermally constrained devices

Latency-Throughput Product (LTP) = latency × throughput
Lower is better: it approximates how many requests must be in flight to sustain the throughput, so a low value means little batching or queuing is needed.

Real numbers (ResNet-50 INT8):

Hardware	Latency	Throughput	Power	Efficiency
ARM A78 ×4	85ms	12 IPS	8W	1.5 IPS/W
NVIDIA RTX 4090	2.1ms	3,500 IPS	450W	7.8 IPS/W
NVIDIA A100	6.5ms	5,200 IPS	400W	13 IPS/W
Intel VPU	4.2ms	90 IPS	4W	22.5 IPS/W
Xilinx Kria	0.9ms	400 IPS	5W	80 IPS/W (winner, edge)
Xilinx Alveo	1.1ms	2,100 IPS	75W	28 IPS/W

Kria: about 6× more power-efficient than the A100 (80 vs 13 IPS/W) for edge workloads.
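The efficiency column is simply throughput divided by power, so it is easy to recompute. The sketch below uses the throughput and power figures quoted above; the function names are hypothetical helpers, not part of any library.

```python
# (throughput in inferences/s, power in watts) for ResNet-50 INT8,
# copied from the figures quoted above.
BENCHMARKS = {
    "ARM A78": (12, 8),
    "NVIDIA RTX 4090": (3500, 450),
    "NVIDIA A100": (5200, 400),
    "Intel VPU": (90, 4),
    "Xilinx Kria": (400, 5),
    "Xilinx Alveo": (2100, 75),
}


def ips_per_watt(name):
    """Power efficiency = throughput / power."""
    throughput, power = BENCHMARKS[name]
    return throughput / power


def efficiency_ratio(a, b):
    """How many times more inferences per watt device a delivers than device b."""
    return ips_per_watt(a) / ips_per_watt(b)


for name in BENCHMARKS:
    print(f"{name:16s} {ips_per_watt(name):6.1f} IPS/W")

ratio = efficiency_ratio("Xilinx Kria", "NVIDIA A100")
print(f"Kria vs A100: {ratio:.1f}x")
```

With these inputs the Kria-to-A100 ratio comes out a little above 6×.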

5. The Roofline Model Applied to FPGA

The roofline model is the most powerful tool for understanding hardware bottlenecks. It tells you whether your neural network inference is compute-bound or memory-bound — and therefore which optimization matters most.

Roofline Model — Xilinx Alveo U250

Layer Type	Intensity (ops/byte)	Zone	Bottleneck fix
GEMM (large)	50–200	Compute	More DSPs
Conv2D (large)	8–50	Compute	More DSPs
Conv2D (small)	2–8	Memory	Better cache
Attention (seq)	1–4	Memory	High BW
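The roofline rule itself fits in one line: attainable performance is the smaller of the compute roof and bandwidth × intensity. The sketch below assumes placeholder values for the two roofs (8 TOPS and 1 TB/s, chosen only so that the ridge point lands at 8 ops/byte, the boundary implied by the table). They are not Alveo U250 specifications, and the function names are hypothetical.

```python
def attainable_ops_per_s(intensity, peak_ops_per_s, bandwidth_bytes_per_s):
    """Roofline: performance is capped by compute or by memory traffic."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * intensity)


def ridge_point(peak_ops_per_s, bandwidth_bytes_per_s):
    """Arithmetic intensity (ops/byte) where the two roofs meet."""
    return peak_ops_per_s / bandwidth_bytes_per_s


def bound(intensity, peak_ops_per_s, bandwidth_bytes_per_s):
    """Classify a kernel as compute-bound or memory-bound."""
    ridge = ridge_point(peak_ops_per_s, bandwidth_bytes_per_s)
    return "compute" if intensity >= ridge else "memory"


# Illustrative placeholders, NOT measured or datasheet values.
PEAK = 8e12  # 8 TOPS compute roof
BW = 1e12    # 1 TB/s memory bandwidth

# One representative intensity picked from inside each range in the table.
LAYERS = {
    "GEMM (large)": 100,
    "Conv2D (large)": 20,
    "Conv2D (small)": 4,
    "Attention (seq)": 2,
}

for name, intensity in LAYERS.items():
    perf = attainable_ops_per_s(intensity, PEAK, BW)
    print(f"{name:16s} {intensity:4d} ops/byte -> "
          f"{perf / 1e12:.1f} TOPS ({bound(intensity, PEAK, BW)}-bound)")
```

Substituting the real compute roof and the bandwidth of whichever memory feeds the datapath (DDR or on-chip) moves the ridge point, which can change which layers count as memory-bound.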

6. FPGA Internal Architecture for AI

Understanding FPGA primitives is essential before you start designing neural network accelerators. These are the hardware building blocks you compose to create your custom AI engine.

6.1 DSP Blocks (DSP48E2 on Xilinx UltraScale+)

The workhorse of FPGA neural network inference. Each DSP48E2 can perform a multiply-accumulate (MAC) in one clock cycle at up to 891 MHz.

DSP48E2 MAC Unit (Xilinx UltraScale+)

Key Specs

Multiply: 27×18-bit signed
Max clock: 891 MHz (UltraScale+)
Pipeline stages: 3 (full throughput)
Count on Alveo U250: 12,288 DSPs
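A back-of-envelope ceiling follows from these specs: DSP count × clock × operations per DSP per cycle. The sketch below uses the per-DSP rates quoted in this section (2 for INT8, 1 for INT16) and counts each multiply as one operation. It assumes every DSP is busy at the maximum clock, which real designs rarely reach, and published peak figures may use different counting conventions.

```python
def peak_ops_per_s(dsp_count, clock_hz, ops_per_dsp_per_cycle):
    """Upper bound on DSP throughput if every slice is busy every cycle."""
    return dsp_count * clock_hz * ops_per_dsp_per_cycle


U250_DSPS = 12_288    # DSP count quoted above
MAX_CLOCK_HZ = 891e6  # maximum DSP clock quoted above

int8_peak = peak_ops_per_s(U250_DSPS, MAX_CLOCK_HZ, 2)   # 2 INT8 ops/DSP/cycle
int16_peak = peak_ops_per_s(U250_DSPS, MAX_CLOCK_HZ, 1)  # 1 INT16 op/DSP/cycle

print(f"INT8 ceiling:  {int8_peak / 1e12:.1f} TOPS")
print(f"INT16 ceiling: {int16_peak / 1e12:.1f} TOPS")
```

Rerunning it with the clock your design actually closes timing at gives a more realistic ceiling.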

Neural Network Usage

INT8 multiply (8×8): 2 ops/DSP/cycle
INT16 multiply: 1 op/DSP/cycle
FP16 multiply: 3 DSPs needed ✗
→ Always prefer INT8 for FPGA!
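The "2 ops per DSP per cycle" figure for INT8 comes from packing two 8-bit operands into one wide multiplier input so that a single multiplication yields two products. The sketch below models that arithmetic in plain Python. The 18-bit offset and the function names are assumptions chosen for illustration, and the exact mapping onto DSP ports is not shown.

```python
SHIFT = 18  # bit offset of the upper operand (assumed; leaves room for b*c)


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def packed_dual_multiply(a, b, c):
    """Return (a*c, b*c) for INT8 inputs using one wide multiplication."""
    packed = (a << SHIFT) + b        # two INT8 values share one wide operand
    product = packed * c             # equals a*c*2**SHIFT + b*c
    low = to_signed(product, SHIFT)  # low field holds b*c (fits in 18 bits)
    high = (product - low) >> SHIFT  # subtract b*c, then shift to get a*c
    return high, low


# Two activations multiplied by one shared weight.
print(packed_dual_multiply(-128, 127, -3))
print(packed_dual_multiply(5, -7, 100))
```

The trick only works when both products share one operand (here the weight `c`), which is why it suits convolutions where one weight is applied to many activations.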

6.2 BRAM — Block RAM for Weight Storage

BRAM sizing for neural networks:

BRAM36 = 36Kb = 4.5KB per block
Alveo U250: 2,688 BRAMs × 4.5KB = 12,096KB = ~12MB on-chip

Weight storage (ResNet-50, INT8):
Total weights: 25.6M parameters × 1 byte = 25.6MB
Exceeds BRAM capacity → must stream from DDR4

Weight storage (MobileNetV2, INT8):
Total weights: 3.4M parameters × 1 byte = 3.4MB
Fits in BRAM → zero DDR stalls for weights; only activations need DDR bandwidth

Strategy:
Small models (≤12MB weights): fit all weights on-chip → fast
Large models: tile weights, stream layer-by-layer from DDR → slower
Trick: weight compression (pruning + Huffman coding) to fit large models on-chip
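The sizing arithmetic above can be wrapped in a small check. This is a sketch under simplifying assumptions: it uses the BRAM36 count and parameter counts quoted in this section, ignores the BRAM that activations and line buffers also need, and the helper names are hypothetical.

```python
BRAM36_BITS = 36 * 1024  # one BRAM36 block holds 36 Kb


def bram_capacity_bytes(num_blocks):
    """Total BRAM capacity in bytes for a given number of BRAM36 blocks."""
    return num_blocks * BRAM36_BITS // 8


def weight_bytes(num_params, bits_per_weight=8):
    """Storage needed for the weights at a given precision."""
    return num_params * bits_per_weight // 8


def fits_on_chip(num_params, num_blocks, bits_per_weight=8):
    """True if all weights fit in BRAM (activation storage is ignored)."""
    needed = weight_bytes(num_params, bits_per_weight)
    return needed <= bram_capacity_bytes(num_blocks)


U250_BRAM36_BLOCKS = 2688
MODELS = {"ResNet-50": 25_600_000, "MobileNetV2": 3_400_000}

capacity = bram_capacity_bytes(U250_BRAM36_BLOCKS)
print(f"BRAM capacity: {capacity / 1024:.0f} KB")
for name, params in MODELS.items():
    if fits_on_chip(params, U250_BRAM36_BLOCKS):
        verdict = "fits on-chip"
    else:
        verdict = "must stream from DDR"
    print(f"{name}: {weight_bytes(params) / 1e6:.1f} MB -> {verdict}")
```

Passing a smaller `bits_per_weight` shows how much quantization or compression a given model would need before it fits.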

7. Real Benchmark Comparison

Hardware	ResNet-50 Latency	ResNet-50 Throughput	Power	Efficiency	Best Use Case
ARM Cortex-A78 ×4	85ms	12 IPS	8W	1.5 IPS/W	Backup/fallback
Intel Core i9-13900K	8ms	120 IPS	125W	0.96 IPS/W	Development only
NVIDIA RTX 4090	2.1ms	3,500 IPS	450W	7.8 IPS/W	High-throughput DC
NVIDIA A100 SXM4	6.5ms	5,200 IPS	400W	13 IPS/W	Training + inference
Xilinx Kria KV260	0.9ms	400 IPS	5W	80 IPS/W	Edge AI winner
Xilinx Alveo U250	1.1ms	2,100 IPS	75W	28 IPS/W	Datacenter edge
Intel Stratix 10 NX	1.4ms	1,800 IPS	80W	22.5 IPS/W	AI-optimized FPGA
Google TPU v4	0.5ms	8,000 IPS	170W	47 IPS/W	Google-only scale

8. When to Choose FPGA for AI

FPGA is not always the right answer. Here's an honest guide to when FPGA wins vs when GPU or ASIC should be chosen:

When to Choose Each Inference Hardware

⚡ Choose FPGA when

✅ Latency < 2ms required
✅ Power budget < 75W
✅ Custom precision (INT4/INT8)
✅ Streaming data (camera/sensor)
✅ Deterministic timing needed
✅ Edge deployment (no cloud)

🎮 Choose GPU when

✅ Batch size > 32
✅ Throughput > latency goal
✅ Model changes frequently
✅ No hardware expertise in team
✅ Large cloud compute budget
✅ Training + inference on same HW

🏭 Choose ASIC when

✅ Volume > 100K units/year
✅ Model is fixed (won't change)
✅ Extreme efficiency needed
✅ $10M+ NRE budget available
✅ 3–5 year product lifespan

🖥️ Choose CPU when

✅ < 100 inferences/second
✅ Flexibility is paramount
✅ Simple models (regression, trees)
✅ Development / prototyping
✅ No dedicated AI budget

9. Real-World FPGA AI Deployment Examples

Mobileye EyeQ6 — Autonomous Vehicles

Hardware: Custom fixed-function SoCs (EyeQ series); these are ASICs with an FPGA-inspired design, not FPGAs
Use case: Camera-based object detection, lane keeping, traffic sign recognition
Why not GPU: 400W GPU in a car? Impossible. EyeQ6 delivers at 5–10W
Latency requirement: <20ms end-to-end (camera capture → brake command)
Models running: Multiple CNNs in parallel on same silicon

Intel FPGAs in Microsoft Azure

Deployment: Microsoft Azure FPGAs (Project Brainwave) for Bing search AI
Use case: Real-time query expansion, document ranking with neural models
Result: 2ms latency vs 20ms GPU, at 1/5th the power cost
Scale: Millions of inference requests per hour, cost saving: ~$40M/year

Baidu Kunlun (Data Center AI, FPGA-Derived)

Architecture: ASICs built on Baidu's XPU architecture, which grew out of its FPGA-based accelerators, optimized for ERNIE NLP models
Performance: 512 TOPS INT8 per card
Key advantage: Programmable for different model architectures as ERNIE evolves

10. What You'll Build in This Course

FPGA CNN Accelerator Architecture — What You'll Build

📷 Input Image

↓

DMA / AXI4 Memory Interface ← Day 6

↓

Convolution Engine ← Day 5

Line Buffer
(BRAM)

→

Weight Store
(BRAM)

→

MAC Array
(DSP48E2)

↓

Activation + Pooling — ReLU → BatchNorm → MaxPool (fused) ← Days 7–8

↓

Systolic Array (GEMM) — Fully-connected layers & attention ← Day 4

↓

🎯 Output Class Prediction

Tools

Vitis HLS → Vivado → Vitis AI

Board

Kria KV260 / PYNQ-Z2 / Alveo U250

Control

AXI4-Lite register interface

11. Prerequisites & Setup

What you need before Day 2:

Verilog/VHDL basics: You should know how to write flip-flops, FSMs, and simple pipelines. (If not — do our FPGA From Scratch course first.)
Python + NumPy: For model quantization and generating test vectors
Software to install: Vivado 2023.2 (free for Artix-7 / Zynq), Vitis HLS, Vitis AI (optional for Day 11)
Hardware (optional): PYNQ-Z2 ($65), Kria KV260 ($299), or simulate in Vivado

No Hardware? No Problem.

Every exercise in this course can be completed in Vivado simulation. Hardware boards are optional — they let you verify real-world performance numbers, but all core concepts and designs work in simulation first.

12. Course Roadmap Summary

Day	Topic	Key Deliverable
1	FPGA vs GPU vs CPU (this page)	Architecture intuition
2	Fixed-point arithmetic	8-bit MAC unit in Verilog
3	Matrix multiply accelerator	4×4 GEMM engine
4	Systolic array	8×8 systolic array
5	Convolution engine	3×3 conv2D with line buffer
6	Memory architecture	BRAM + AXI4 DDR interface
7	Activation functions	ReLU, sigmoid hardware
8	Pooling + BatchNorm	Fused BN+ReLU+MaxPool
9	Pipelining & parallelism	Full CNN pipeline
10	Vitis HLS	C++ → RTL conv layer
11	Vitis AI DPU	ResNet-50 on Kria
12	Power optimization	Edge AI under 5W
13	Transformer attention	BERT attention engine
14	Benchmarking	TOPS/W measurement
15	Production deployment	Full edge AI system

Day 1 — Key Takeaways

✅ GPU dominates throughput but has 5–15ms latency overhead per request
✅ FPGA delivers 0.9–1.1ms latency — 6–15× lower than GPU for single inference
✅ Power efficiency: Kria KV260 achieves 80 IPS/W vs 13 IPS/W for A100
✅ Roofline model tells you whether to optimize compute (DSPs) or memory
✅ DSP48E2 blocks are the heart of FPGA neural network computation
✅ INT8 precision is optimal for FPGA inference — 2× ops per DSP vs INT16
✅ FPGA wins when: latency <2ms, power <75W, custom precision, streaming data
✅ GPU wins when: large batch, frequently changing models, throughput > latency
✅ Microsoft Project Brainwave saves $40M/year using FPGA over GPU for Bing

Next — Day 2: Fixed-point arithmetic and quantization — why INT8 instead of FP32, Q-format representation, overflow handling, and building your first 8-bit MAC unit in Verilog.
