
FPGA vs GPU vs CPU for AI Inference: Why FPGA?

Architecture deep-dive: latency, throughput, power efficiency, roofline model, and real benchmarks. Understand exactly when — and why — FPGA beats GPU and CPU for neural network inference.

By EcrioniX Engineering Team · Published June 14, 2026 · ~4,600 words · 14 min read

1. The Inference Hardware Landscape

Every neural network inference request runs on physical hardware. That hardware choice — CPU, GPU, FPGA, or custom ASIC — determines latency, throughput, power consumption, and cost. Getting this wrong costs companies millions in cloud bills, missed SLAs, or products that drain batteries in an hour.

The inference hardware ecosystem has three main competitors: CPUs, GPUs, and FPGAs, with custom ASICs as a fourth option for fixed, high-volume designs.

The Key Insight

GPUs are optimized for training — massive batch sizes, FP32 precision, throughput over latency. Inference is a different problem: small batches (often batch=1), tight latency SLAs, and power budgets that GPUs simply cannot meet. This is the gap FPGAs fill.

2. Architecture Comparison

🖥️ CPU — ARM Cortex-A78
NEON SIMD — 128-bit, 4× INT32/cycle
Peak: 2 TOPS (4-wide SIMD @ 2GHz)
Power: 1–5W mobile / 15–80W server
✗ Sequential — no matrix parallelism
🎮 GPU — NVIDIA A100
Tensor Core — 16×16×16 matrix/cycle
6912 CUDA + 432 Tensor Cores
Peak: 312 TOPS INT8
Power: 400W
✓ Throughput   ✗ Latency overhead
⚡ FPGA — Xilinx Alveo U250
Custom datapath — ANY precision
1.7M LUTs · 12,288 DSP48E2 · 54MB on-chip SRAM (BRAM + UltraRAM)
Peak: 145 TOPS INT8
Power: 75W
✓ Latency   ✓ Power   ✓ Flexible
🏭 ASIC — Google TPU v4
Fixed systolic array — 128×128 MACs
Hardwired for matrix multiply
Peak: 275 TOPS INT8
Power: 170W
✓ Max efficiency   ✗ Zero flexibility

3. The Latency Advantage — Where FPGA Wins

GPU's dirty secret: it's terrible at low-latency single-inference. When you send one image through ResNet-50, a GPU spends most of its time on kernel launch overhead, memory transfers (CPU→GPU), and waiting for the CUDA scheduler. The actual compute takes ~1ms, but total latency is 5–15ms.

Latency Breakdown: ResNet-50 Single Inference (batch=1)

GPU: NVIDIA A100 (total: 6.5ms)
  • CPU→GPU transfer: 2.5ms
  • Kernel launch: 1.0ms
  • Compute: 1.2ms
  • GPU→CPU transfer: 1.8ms
⚠️ Actual compute is only 18% of total time; the rest is overhead

FPGA: Xilinx Alveo U250 (total: 1.1ms)
  • PCIe DMA: 0.3ms
  • Pipeline compute: 0.8ms
✅ No kernel launch overhead; the streaming pipeline is always running. 6× lower latency than A100

Edge FPGA: Xilinx Kria KV260 (ResNet-50: 0.85ms)
  • Pure pipeline compute: 0.85ms (ResNet-50) | 0.15ms (MobileNetV2)
✅ Image loaded directly into BRAM, so zero transfer overhead

Embedded CPU: ARM Cortex-A78 ×4 (ResNet-50: 85ms)
  • Sequential NEON SIMD: 85ms (ResNet-50) | 12ms (MobileNetV2)
❌ No matrix parallelism; NEON SIMD is 4-wide only
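The arithmetic behind the GPU and datacenter-FPGA breakdowns can be checked with a few lines of Python. This is only a sketch: the stage names (`cpu_to_gpu`, `kernel_launch`, and so on) are labels chosen here for illustration, and the numbers are the ones quoted in the breakdown, not new measurements.

```python
# Stage latencies in milliseconds, copied from the breakdown in this section.
GPU_A100 = {"cpu_to_gpu": 2.5, "kernel_launch": 1.0, "compute": 1.2, "gpu_to_cpu": 1.8}
FPGA_U250 = {"pcie_dma": 0.3, "compute": 0.8}


def total_latency_ms(stages):
    """End-to-end latency is the sum of the serial stages."""
    return sum(stages.values())


def compute_fraction(stages):
    """Share of end-to-end latency spent in the 'compute' stage."""
    return stages["compute"] / total_latency_ms(stages)


gpu_total = total_latency_ms(GPU_A100)
fpga_total = total_latency_ms(FPGA_U250)
print(f"GPU total:  {gpu_total:.1f} ms, compute share {compute_fraction(GPU_A100):.0%}")
print(f"FPGA total: {fpga_total:.1f} ms, compute share {compute_fraction(FPGA_U250):.0%}")
print(f"GPU / FPGA latency ratio: {gpu_total / fpga_total:.1f}x")
```

The point of separating `compute` from the other stages is that only the non-compute stages shrink when the transfer and launch path changes; the compute time itself is in the same range on both devices.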

4. Throughput vs Latency — The Fundamental Trade-off

Every inference hardware choice sits somewhere on the throughput-latency curve. GPU is optimized for throughput. CPU is low throughput, moderate latency. FPGA sits uniquely at the sweet spot: low latency AND reasonable throughput.

Key Metrics Defined:

Latency = time from input_available to output_ready (single request)
  Units: milliseconds (ms) or microseconds (µs)
  Critical for: autonomous vehicles, robotics, real-time video

Throughput = inferences per second (IPS) or per unit time
  Units: IPS, TOPS (Tera Operations Per Second)
  Critical for: data center batch processing, video stream analytics

Power Efficiency = throughput / power
  Units: TOPS/W or IPS/W
  Best metric for battery-powered or thermally-constrained devices

Latency-Throughput Product (LTP): Lower is better.
  Measures how efficiently hardware uses its resources.

Real numbers (ResNet-50 INT8):
──────────────────────────────────────────────────────────────────────────
Hardware          Latency   Throughput   Power   Efficiency
──────────────────────────────────────────────────────────────────────────
ARM A78 ×4        85ms      12 IPS       8W      1.5 IPS/W
NVIDIA RTX 4090   2.1ms     3,500 IPS    450W    7.8 IPS/W
NVIDIA A100       6.5ms     5,200 IPS    400W    13 IPS/W
Intel VPU         4.2ms     90 IPS       4W      22.5 IPS/W
Xilinx Kria       0.9ms     400 IPS      5W      80 IPS/W    ← Winner (edge)
Xilinx Alveo      1.1ms     2,100 IPS    75W     28 IPS/W
──────────────────────────────────────────────────────────────────────────
Kria: about 6× more power-efficient than the A100 (80 vs 13 IPS/W) and 53× more than the ARM A78 for edge workloads!
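The efficiency column is simply throughput divided by power, which makes the comparisons easy to recompute. The sketch below uses the throughput and power figures quoted in this section; the dictionary and function names are illustrative choices, not part of any tool. Recomputed this way, the Kria comes out at roughly 6× the A100's IPS/W, while a 53× gap is what the same division gives against the ARM A78.

```python
# (throughput in inferences/s, power in watts), copied from the ResNet-50 INT8 figures.
BENCHMARKS = {
    "ARM A78": (12, 8),
    "NVIDIA RTX 4090": (3500, 450),
    "NVIDIA A100": (5200, 400),
    "Intel VPU": (90, 4),
    "Xilinx Kria": (400, 5),
    "Xilinx Alveo": (2100, 75),
}


def efficiency_ips_per_watt(name):
    """Power efficiency = throughput / power."""
    throughput_ips, power_w = BENCHMARKS[name]
    return throughput_ips / power_w


def efficiency_ratio(name_a, name_b):
    """How many times more efficient name_a is than name_b."""
    return efficiency_ips_per_watt(name_a) / efficiency_ips_per_watt(name_b)


for name in BENCHMARKS:
    print(f"{name:16s} {efficiency_ips_per_watt(name):6.1f} IPS/W")
print(f"Kria vs A100:    {efficiency_ratio('Xilinx Kria', 'NVIDIA A100'):.1f}x")
print(f"Kria vs ARM A78: {efficiency_ratio('Xilinx Kria', 'ARM A78'):.1f}x")
```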

5. The Roofline Model Applied to FPGA

The roofline model is the most powerful tool for understanding hardware bottlenecks. It tells you whether your neural network inference is compute-bound or memory-bound — and therefore which optimization matters most.

Roofline Model: Xilinx Alveo U250
[Figure: performance (TOPS) plotted against arithmetic intensity (OPS/byte). Compute roof: 145 TOPS INT8. Memory roof: DDR4 bandwidth, 77 GB/s. Ridge point: ~19 OPS/byte. Left of the ridge is memory-bound (improve data reuse); right of the ridge is compute-bound (add more DSPs).]
Layer Type        Intensity (OPS/byte)   Zone      Bottleneck
GEMM (large)      50–200                 Compute   More DSPs
Conv2D (large)    8–50                   Compute   More DSPs
Conv2D (small)    2–8                    Memory    Better cache
Attention (seq)   1–4                    Memory    High BW
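The roofline itself is one line of arithmetic: attainable performance is the smaller of the compute roof and bandwidth times arithmetic intensity, and the ridge is where the two meet. The sketch below is a generic formulation with illustrative function names. Taking the quoted 145 TOPS and 77 GB/s at face value, the computed ridge lands near 1,900 OPS/byte, about 100× the ~19 OPS/byte drawn in the figure, so the figure presumably assumes a different effective bandwidth or sustained compute rate; that assumption is not stated here. The zone labels in the table match a ridge of about 19, which is what the `zone` calls below use.

```python
def ridge_point(peak_ops_per_s, bandwidth_bytes_per_s):
    """Arithmetic intensity (OPS/byte) at which the two roofs intersect."""
    return peak_ops_per_s / bandwidth_bytes_per_s


def attainable_ops_per_s(intensity, peak_ops_per_s, bandwidth_bytes_per_s):
    """Roofline: performance is capped by compute or by memory traffic."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * intensity)


def zone(intensity, ridge):
    """Classify a layer by which roof limits it."""
    return "compute-bound" if intensity >= ridge else "memory-bound"


PEAK = 145e12      # quoted compute roof, INT8 OPS/s
DDR4_BW = 77e9     # quoted DDR4 bandwidth, bytes/s

print(f"Ridge from quoted roofs: {ridge_point(PEAK, DDR4_BW):.0f} OPS/byte")
print(f"Attainable at 19 OPS/byte: {attainable_ops_per_s(19, PEAK, DDR4_BW) / 1e12:.2f} TOPS")

FIGURE_RIDGE = 19  # ridge as drawn in the figure (assumed here, not derived)
for layer, intensity in (("GEMM (large)", 100), ("Attention (seq)", 3)):
    print(f"{layer}: {zone(intensity, FIGURE_RIDGE)}")
```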

6. FPGA Internal Architecture for AI

Understanding FPGA primitives is essential before you start designing neural network accelerators. These are the hardware building blocks you compose to create your custom AI engine.

6.1 DSP Blocks (DSP48E2 on Xilinx UltraScale+)

The workhorse of FPGA neural network inference. Each DSP48E2 slice can perform one multiply-accumulate (MAC) per clock cycle at up to 891 MHz.

DSP48E2 MAC Unit (Xilinx UltraScale+)
Inputs: A[29:0], D[26:0], B[17:0], C[47:0]
Pre-adder: A ± D
Multiplier: 27×18-bit, feeding the 48-bit accumulate path
Accumulator: P = A×B + C or P = A×B + P
Output: P[47:0]

Key Specs
Multiply: 27×18-bit signed
Max clock: 891 MHz (UltraScale+)
Pipeline stages: 3 (full throughput)
Count on Alveo U250: 12,288 DSPs

Neural Network Usage
INT8 multiply (8×8): 2 ops/DSP/cycle
INT16 multiply: 1 op/DSP/cycle
FP16 multiply: 3 DSPs needed ✗
→ Always prefer INT8 for FPGA!
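The text does not say how two INT8 multiplies fit in one DSP per cycle. One commonly described approach packs two operands into the wide multiplier input so that a single multiply by a shared operand yields both products in separate bit fields. The Python below is a behavioral sketch of that idea under this assumption; it is not RTL, and `pack_two_int8` and `dual_int8_multiply` are hypothetical names chosen for illustration.

```python
FIELD_BITS = 18  # low product occupies an 18-bit field; an 8x8 product needs at most 16 bits


def pack_two_int8(a_hi, a_lo):
    """Place a_hi 18 bits above a_lo; the result fits a 27-bit signed input."""
    return (a_hi << FIELD_BITS) + a_lo


def dual_int8_multiply(a_hi, a_lo, b):
    """Return (a_hi*b, a_lo*b) using one wide multiply by the shared operand b."""
    for value in (a_hi, a_lo, b):
        if not -128 <= value <= 127:
            raise ValueError("operands must be signed 8-bit values")
    product = pack_two_int8(a_hi, a_lo) * b       # the single wide multiply
    low = product & ((1 << FIELD_BITS) - 1)       # raw lower 18-bit field
    if low >= 1 << (FIELD_BITS - 1):              # sign-extend the field
        low -= 1 << FIELD_BITS
    high = (product - low) >> FIELD_BITS          # remove the low field's borrow, then shift
    return high, low


print(dual_int8_multiply(3, -5, 7))
```

The subtraction before the shift matters: when the low product is negative it borrows from the upper field, so simply shifting the packed product right by 18 would give an upper result that is off by one.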

6.2 BRAM — Block RAM for Weight Storage

BRAM Sizing for Neural Networks:

BRAM36 = 36Kb = 4.5KB per block
Alveo U250: 2,688 BRAMs = 12,096 KB = ~12 MB on-chip

Weight storage requirement (ResNet-50, INT8):
  Total weights: 25.6M parameters × 1 byte = 25.6 MB
  Exceeds BRAM capacity → must stream from DDR4!

Weight storage (MobileNetV2, INT8):
  Total weights: 3.4M × 1 byte = 3.4 MB
  Fits in BRAM! → zero DDR stalls for weights
  Only activations need DDR bandwidth

Strategy:
  Small models (≤12MB weights): fit all weights on-chip → fast
  Large models: tile weights, stream layer-by-layer from DDR → slower
  Trick: weight compression (pruning + Huffman) to fit large models on-chip
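The sizing rule above reduces to comparing two byte counts. The following sketch uses the block count and parameter counts quoted in this section; the function names are illustrative, and the check deliberately ignores the BRAM that a real design also spends on activations, line buffers, and FIFOs.

```python
BRAM36_BITS = 36 * 1024      # one BRAM36 block holds 36 Kb
U250_BRAM_BLOCKS = 2688      # block count quoted in the sizing notes


def bram_capacity_bytes(blocks):
    """Total BRAM capacity in bytes for a given number of BRAM36 blocks."""
    return blocks * BRAM36_BITS // 8


def weight_bytes(num_params, bits_per_weight=8):
    """Storage needed for the weights at the given precision."""
    return num_params * bits_per_weight // 8


def fits_on_chip(num_params, blocks=U250_BRAM_BLOCKS, bits_per_weight=8):
    """True if all weights fit in BRAM (ignores activation buffers)."""
    return weight_bytes(num_params, bits_per_weight) <= bram_capacity_bytes(blocks)


for name, params in (("ResNet-50", 25_600_000), ("MobileNetV2", 3_400_000)):
    verdict = "fits in BRAM" if fits_on_chip(params) else "must stream from DDR"
    print(f"{name}: {weight_bytes(params) / 1e6:.1f} MB of INT8 weights, {verdict}")
```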

7. Real Benchmark Comparison

Hardware               ResNet-50 Latency   ResNet-50 Throughput   Power   Efficiency   Best Use Case
ARM Cortex-A78 ×4      85ms                12 IPS                 8W      1.5 IPS/W    Backup/fallback
Intel Core i9-13900K   8ms                 120 IPS                125W    0.96 IPS/W   Development only
NVIDIA RTX 4090        2.1ms               3,500 IPS              450W    7.8 IPS/W    High-throughput DC
NVIDIA A100 SXM4       6.5ms               5,200 IPS              400W    13 IPS/W     Training + inference
Xilinx Kria KV260      0.9ms               400 IPS                5W      80 IPS/W     Edge AI winner
Xilinx Alveo U250      1.1ms               2,100 IPS              75W     28 IPS/W     Datacenter edge
Intel Stratix 10 NX    1.4ms               1,800 IPS              80W     22.5 IPS/W   AI-optimized FP
Google TPU v4          0.5ms               8,000 IPS              170W    47 IPS/W     Google-only scale

8. When to Choose FPGA for AI

FPGA is not always the right answer. Here's an honest guide to when FPGA wins vs when GPU or ASIC should be chosen:

When to Choose Each Inference Hardware
⚡ Choose FPGA when
  • ✅ Latency < 2ms required
  • ✅ Power budget < 75W
  • ✅ Custom precision (INT4/INT8)
  • ✅ Streaming data (camera/sensor)
  • ✅ Deterministic timing needed
  • ✅ Edge deployment (no cloud)
🎮 Choose GPU when
  • ✅ Batch size > 32
  • ✅ Throughput matters more than latency
  • ✅ Model changes frequently
  • ✅ No hardware expertise in team
  • ✅ Large cloud compute budget
  • ✅ Training + inference on same HW
🏭 Choose ASIC when
  • ✅ Volume > 100K units/year
  • ✅ Model is fixed (won't change)
  • ✅ Extreme efficiency needed
  • ✅ $10M+ NRE budget available
  • ✅ 3–5 year product lifespan
🖥️ Choose CPU when
  • ✅ < 100 inferences/second
  • ✅ Flexibility is paramount
  • ✅ Simple models (regression, trees)
  • ✅ Development / prototyping
  • ✅ No dedicated AI budget
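The quantitative items in these lists can be turned into a quick checklist. The sketch below encodes only the numeric thresholds and the fixed-model condition from the lists above and reports which criteria a workload meets for each option; it does not rank them, since the guide gives no precedence. The `Requirements` fields and the `matched_criteria` function are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    latency_ms: float          # required end-to-end latency
    power_budget_w: float      # available power budget
    batch_size: int            # typical inference batch size
    inferences_per_s: float    # required throughput
    units_per_year: int        # expected production volume
    model_is_fixed: bool       # True if the model will not change


def matched_criteria(req):
    """Map each hardware option to the listed criteria this workload satisfies."""
    rules = {
        "FPGA": [("latency < 2ms", req.latency_ms < 2),
                 ("power budget < 75W", req.power_budget_w < 75)],
        "GPU": [("batch size > 32", req.batch_size > 32)],
        "ASIC": [("volume > 100K units/year", req.units_per_year > 100_000),
                 ("model is fixed", req.model_is_fixed)],
        "CPU": [("< 100 inferences/second", req.inferences_per_s < 100)],
    }
    return {hw: [label for label, ok in checks if ok] for hw, checks in rules.items()}


edge_camera = Requirements(latency_ms=1.5, power_budget_w=10, batch_size=1,
                           inferences_per_s=30, units_per_year=5_000,
                           model_is_fixed=False)
for hardware, reasons in matched_criteria(edge_camera).items():
    print(f"{hardware}: {reasons}")
```

The qualitative items (team expertise, budget, product lifespan) are left out on purpose; they need human judgment rather than a threshold.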

9. Real-World FPGA AI Deployment Examples

Mobileye EyeQ6 — Autonomous Vehicles

Xilinx/AMD Alveo in Microsoft Azure

Baidu Kunlun — Data Center AI FPGA

10. What You'll Build in This Course

FPGA CNN Accelerator Architecture: What You'll Build
📷 Input Image
  → DMA / AXI4 Memory Interface (Day 6)
  → Convolution Engine (Day 5): Line Buffer (BRAM), Weight Store (BRAM), MAC Array (DSP48E2)
  → Activation + Pooling: ReLU → BatchNorm → MaxPool, fused (Days 7–8)
  → Systolic Array (GEMM): fully-connected layers & attention (Day 4)
  → 🎯 Output Class Prediction
Tools: Vitis HLS → Vivado → Vitis AI
Board: Kria KV260 / PYNQ-Z2 / Alveo U250
Control: AXI4-Lite register interface
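A small software reference model of the datapath is a convenient way to check hardware results later. The sketch below models the convolution, ReLU, and max-pool stages in plain integer Python; it is an illustration written for this page, not course code. It assumes a 3×3 kernel with stride 1 and no padding, a 2×2 pool with stride 2, and it leaves out BatchNorm and the systolic-array stage. As in most CNN frameworks, "convolution" here means cross-correlation (the kernel is not flipped).

```python
def conv3x3(image, kernel):
    """3x3 convolution, stride 1, no padding, integer arithmetic."""
    height, width = len(image), len(image[0])
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(3) for j in range(3))
             for c in range(width - 2)]
            for r in range(height - 2)]


def relu(feature_map):
    """Clamp negative values to zero."""
    return [[max(0, value) for value in row] for row in feature_map]


def maxpool2x2(feature_map):
    """2x2 max pooling with stride 2 (drops a trailing odd row or column)."""
    return [[max(feature_map[r][c], feature_map[r][c + 1],
                 feature_map[r + 1][c], feature_map[r + 1][c + 1])
             for c in range(0, len(feature_map[0]) - 1, 2)]
            for r in range(0, len(feature_map) - 1, 2)]


def conv_relu_pool(image, kernel):
    """Reference result for the conv -> ReLU -> max-pool chain."""
    return maxpool2x2(relu(conv3x3(image, kernel)))


ones_image = [[1] * 6 for _ in range(6)]
ones_kernel = [[1] * 3 for _ in range(3)]
print(conv_relu_pool(ones_image, ones_kernel))
```

A 6×6 input shrinks to 4×4 after the convolution and to 2×2 after pooling, which is an easy shape check to repeat against simulation output.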

11. Prerequisites & Setup

What you need before Day 2:

No Hardware? No Problem.

Every exercise in this course can be completed in Vivado simulation. Hardware boards are optional — they let you verify real-world performance numbers, but all core concepts and designs work in simulation first.

12. Course Roadmap Summary

Day   Topic                            Key Deliverable
1     FPGA vs GPU vs CPU (this page)   Architecture intuition
2     Fixed-point arithmetic           8-bit MAC unit in Verilog
3     Matrix multiply accelerator      4×4 GEMM engine
4     Systolic array                   8×8 systolic array
5     Convolution engine               3×3 conv2D with line buffer
6     Memory architecture              BRAM + AXI4 DDR interface
7     Activation functions             ReLU, sigmoid hardware
8     Pooling + BatchNorm              Fused BN+ReLU+MaxPool
9     Pipelining & parallelism         Full CNN pipeline
10    Vitis HLS                        C++ → RTL conv layer
11    Vitis AI DPU                     ResNet-50 on Kria
12    Power optimization               Edge AI under 5W
13    Transformer attention            BERT attention engine
14    Benchmarking                     TOPS/W measurement
15    Production deployment            Full edge AI system

Day 1 — Key Takeaways

Next — Day 2: Fixed-point arithmetic and quantization — why INT8 instead of FP32, Q-format representation, overflow handling, and building your first 8-bit MAC unit in Verilog.