Architecture deep-dive: latency, throughput, power efficiency, roofline model, and real benchmarks. Understand exactly when — and why — FPGA beats GPU and CPU for neural network inference.
Every neural network inference request runs on physical hardware. That hardware choice — CPU, GPU, FPGA, or custom ASIC — determines latency, throughput, power consumption, and cost. Getting this wrong costs companies millions in cloud bills, missed SLAs, or products that drain batteries in an hour.
Alongside FPGAs, the inference hardware ecosystem has three main competitors:
CPU (ARM, x86): General-purpose, flexible, terrible at matrix math
GPU (NVIDIA, AMD): Massively parallel, excellent for large batches, power-hungry
ASIC (TPU, NPU): Maximum efficiency, zero flexibility, $10M+ to build
The Key Insight
GPUs are optimized for training — massive batch sizes, FP32 precision, throughput over latency. Inference is a different problem: small batches (often batch=1), tight latency SLAs, and power budgets that GPUs simply cannot meet. This is the gap FPGAs fill.
ASIC (e.g. TPU): fixed systolic array of 128×128 MACs, hardwired for matrix multiply
Peak: 275 TOPS INT8
Power: 170W
✓ Max efficiency ✗ Zero flexibility
3. The Latency Advantage — Where FPGA Wins
GPU's dirty secret: it's terrible at low-latency single-inference. When you send one image through ResNet-50, a GPU spends most of its time on kernel launch overhead, memory transfers (CPU→GPU), and waiting for the CUDA scheduler. The actual compute takes ~1ms, but total latency is 5–15ms.
Latency Breakdown — ResNet-50 Single Inference (batch=1)
GPU (NVIDIA A100), total: 6.5ms
CPU→GPU 2.5ms
Launch 1.0ms
Compute 1.2ms
GPU→CPU 1.8ms
⚠️ Actual compute is only 18% of total time — the rest is overhead
FPGA (Xilinx Alveo U250), total: 1.1ms
PCIe DMA 0.3ms
Pipeline Compute 0.8ms
✅ No kernel launch overhead — streaming pipeline always running. 6× lower latency than A100
Edge FPGA (Xilinx Kria KV260), ResNet-50: 0.85ms
Pure Pipeline Compute — 0.85ms (ResNet-50) | 0.15ms (MobileNetV2)
✅ Image loaded directly into BRAM — zero transfer overhead
CPU (ARM): ❌ No matrix parallelism; NEON SIMD is only 4-wide
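The percentages and ratios quoted in the breakdown can be re-derived from the per-stage timings. The following is a minimal sketch that only re-uses the A100 and Alveo U250 figures listed above; the dictionary keys and helper names are illustrative, not part of any vendor tool.

```python
# Stage timings in milliseconds, copied from the latency breakdown above.
gpu_a100_ms = {
    "cpu_to_gpu": 2.5,
    "kernel_launch": 1.0,
    "compute": 1.2,
    "gpu_to_cpu": 1.8,
}
fpga_u250_ms = {
    "pcie_dma": 0.3,
    "pipeline_compute": 0.8,
}


def total_ms(stages):
    """End-to-end latency is the sum of the sequential stages."""
    return sum(stages.values())


def compute_fraction(stages, compute_key):
    """Share of the end-to-end latency spent doing useful compute."""
    return stages[compute_key] / total_ms(stages)


gpu_total = total_ms(gpu_a100_ms)
fpga_total = total_ms(fpga_u250_ms)
gpu_compute_share = compute_fraction(gpu_a100_ms, "compute")
latency_ratio = gpu_total / fpga_total

print(f"A100 total: {gpu_total:.1f} ms, compute share: {gpu_compute_share:.0%}")
print(f"U250 total: {fpga_total:.1f} ms")
print(f"Latency ratio A100/U250: {latency_ratio:.1f}x")
```

With these inputs the compute share on the A100 comes out at roughly 18%, and the A100-to-U250 latency ratio at just under 6×, which is where the rounded figures in the breakdown come from.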
4. Throughput vs Latency — The Fundamental Trade-off
Every inference hardware choice sits somewhere on the throughput-latency curve. GPU is optimized for throughput. CPU is low throughput, moderate latency. FPGA sits uniquely at the sweet spot: low latency AND reasonable throughput.
Key Metrics Defined:
Latency = time from input_available to output_ready (single request)
Units: milliseconds (ms) or microseconds (µs)
Critical for: autonomous vehicles, robotics, real-time video
Throughput = inferences per second (IPS) or per unit time
Units: IPS, TOPS (Tera Operations Per Second)
Critical for: data center batch processing, video stream analytics
Power Efficiency = throughput / power = TOPS/W
Best metric for battery-powered or thermally-constrained devices
Latency-Throughput Product (LTP):
Lower is better. Measures how efficiently hardware uses its resources.
Real numbers (ResNet-50 INT8):
──────────────────────────────────────────────────────────────────
Hardware           Latency   Throughput   Power   Efficiency
──────────────────────────────────────────────────────────────────
ARM A78 ×8         85ms      12 IPS       8W      1.5 IPS/W
NVIDIA RTX 4090    2.1ms     3,500 IPS    450W    7.8 IPS/W
NVIDIA A100        6.5ms     5,200 IPS    400W    13 IPS/W
Intel VPU          4.2ms     90 IPS       4W      22.5 IPS/W
Xilinx Kria        0.9ms     400 IPS      5W      80 IPS/W   ← Winner (edge)
Xilinx Alveo       1.1ms     2,100 IPS    75W     28 IPS/W
──────────────────────────────────────────────────────────────────
Kria: about 6× more power-efficient than the A100 (80 vs 13 IPS/W), and about 53× more than the ARM A78 ×8 (80 vs 1.5 IPS/W), on this edge workload.
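The efficiency column is simply throughput divided by power, so it can be recomputed from the other two columns. Here is a short sketch that does this for the ResNet-50 INT8 figures in the table and compares every device against the Kria; the `benchmarks` dictionary and the function name are illustrative.

```python
# (throughput in inferences per second, power in watts), copied from the table.
benchmarks = {
    "ARM A78 x8": (12, 8),
    "NVIDIA RTX 4090": (3500, 450),
    "NVIDIA A100": (5200, 400),
    "Intel VPU": (90, 4),
    "Xilinx Kria": (400, 5),
    "Xilinx Alveo": (2100, 75),
}


def ips_per_watt(throughput_ips, power_w):
    """Power efficiency as defined above: throughput / power."""
    return throughput_ips / power_w


efficiency = {name: ips_per_watt(*row) for name, row in benchmarks.items()}
kria = efficiency["Xilinx Kria"]

# Print devices from most to least efficient, with the Kria's relative advantage.
for name, value in sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<16} {value:6.1f} IPS/W   Kria advantage: {kria / value:5.1f}x")
```

Keeping the raw throughput and power numbers as the source of truth, and deriving ratios from them, makes it harder for a headline multiplier to drift away from the table it summarizes.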
5. The Roofline Model Applied to FPGA
The roofline model is the most powerful tool for understanding hardware bottlenecks. It tells you whether your neural network inference is compute-bound or memory-bound — and therefore which optimization matters most.
Roofline Model — Xilinx Alveo U250
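──────────────────────────────────────────────────────
Layer Type        Intensity   Zone      Bottleneck
──────────────────────────────────────────────────────
GEMM (large)      50–200      Compute   More DSPs
Conv2D (large)    8–50        Compute   More DSPs
Conv2D (small)    2–8         Memory    Better cache
Attention (seq)   1–4         Memory    High BW
──────────────────────────────────────────────────────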
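The zone assignment follows from the standard roofline formula: attainable performance is the smaller of peak compute and arithmetic intensity times memory bandwidth. The sketch below assumes intensity is measured in operations per byte, and it uses placeholder peak and bandwidth values chosen only so that the ridge point lands at 8, matching the split in the table. They are not Alveo U250 datasheet numbers, and the per-layer intensities are single sample points picked from inside each range.

```python
def attainable_gops(intensity_ops_per_byte, peak_gops, bandwidth_gb_s):
    """Roofline: performance is capped by compute or by memory traffic."""
    return min(peak_gops, intensity_ops_per_byte * bandwidth_gb_s)


def zone(intensity_ops_per_byte, peak_gops, bandwidth_gb_s):
    """Layers at or right of the ridge point are compute-bound, others memory-bound."""
    ridge_point = peak_gops / bandwidth_gb_s
    return "Compute" if intensity_ops_per_byte >= ridge_point else "Memory"


# ASSUMPTION: placeholder hardware numbers, not U250 specifications.
# They give a ridge point of 800 / 100 = 8 ops/byte.
PEAK_GOPS = 800.0
BANDWIDTH_GB_S = 100.0

# ASSUMPTION: one sample intensity per layer type, taken from inside the table's range.
layers = {
    "GEMM (large)": 100.0,
    "Conv2D (large)": 20.0,
    "Conv2D (small)": 4.0,
    "Attention (seq)": 2.0,
}

for name, intensity in layers.items():
    perf = attainable_gops(intensity, PEAK_GOPS, BANDWIDTH_GB_S)
    print(f"{name:<16} intensity={intensity:6.1f}  "
          f"attainable={perf:6.1f} GOPS  zone={zone(intensity, PEAK_GOPS, BANDWIDTH_GB_S)}")
```

To use this for a real design, replace the two constants with the measured peak throughput of your accelerator and the sustained bandwidth of whichever memory level feeds it.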
6. FPGA Internal Architecture for AI
Understanding FPGA primitives is essential before you start designing neural network accelerators. These are the hardware building blocks you compose to create your custom AI engine.
6.1 DSP Blocks (DSP48E2 on Xilinx UltraScale+)
The workhorse of FPGA neural network inference. Each DSP48E2 slice can perform one multiply-accumulate (MAC) per clock cycle at up to 891 MHz.
DSP48E2: MAC Unit (Xilinx UltraScale+)
Key Specs
Multiply: 27×18-bit signed
Max clock: 891 MHz (UltraScale+)
Pipeline stages: 3 (full throughput)
Count on Alveo U250: 12,288 DSPs
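These specs give a quick upper bound on MAC throughput: DSP count times clock frequency, with each MAC counted as two operations (one multiply, one add). The sketch below computes that ceiling from the numbers listed above. The 300 MHz figure is a hypothetical achieved clock added for contrast, since a full design rarely closes timing at the maximum DSP frequency; it is not a measured value.

```python
DSP_COUNT = 12_288        # DSP blocks on the Alveo U250 (from the specs above)
MAX_CLOCK_HZ = 891e6      # maximum DSP clock (from the specs above)
OPS_PER_MAC = 2           # convention: one multiply + one accumulate


def peak_tops(dsp_count, clock_hz, macs_per_dsp_per_cycle=1):
    """Theoretical peak in tera-operations per second, assuming every DSP is busy every cycle."""
    macs_per_second = dsp_count * clock_hz * macs_per_dsp_per_cycle
    return macs_per_second * OPS_PER_MAC / 1e12


# Ceiling if every DSP ran at the maximum clock with 100% utilization.
datasheet_ceiling = peak_tops(DSP_COUNT, MAX_CLOCK_HZ)

# ASSUMPTION: 300 MHz is an illustrative design clock, not a benchmark result.
at_design_clock = peak_tops(DSP_COUNT, 300e6)

print(f"Ceiling at 891 MHz: {datasheet_ceiling:.1f} TOPS")
print(f"At a 300 MHz design clock: {at_design_clock:.1f} TOPS")
```

The `macs_per_dsp_per_cycle` parameter is left at 1 here; raise it only if your design actually packs more than one narrow multiply into each DSP.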
Verilog/VHDL basics: You should know how to write flip-flops, FSMs, and simple pipelines. (If not, take our FPGA From Scratch course first.)
Python + NumPy: For model quantization and generating test vectors
Software to install: Vivado 2023.2 (free for Artix-7 / Zynq), Vitis HLS, Vitis AI (optional for Day 11)
Hardware (optional): PYNQ-Z2 ($65), Kria KV260 ($299), or simulate in Vivado
No Hardware? No Problem.
Every exercise in this course can be completed in Vivado simulation. Hardware boards are optional — they let you verify real-world performance numbers, but all core concepts and designs work in simulation first.
12. Course Roadmap Summary
──────────────────────────────────────────────────────────────────────
Day   Topic                            Key Deliverable
──────────────────────────────────────────────────────────────────────
1     FPGA vs GPU vs CPU (this page)   Architecture intuition
2     Fixed-point arithmetic           8-bit MAC unit in Verilog
3     Matrix multiply accelerator      4×4 GEMM engine
4     Systolic array                   8×8 systolic array
5     Convolution engine               3×3 conv2D with line buffer
6     Memory architecture              BRAM + AXI4 DDR interface
7     Activation functions             ReLU, sigmoid hardware
8     Pooling + BatchNorm              Fused BN+ReLU+MaxPool
9     Pipelining & parallelism         Full CNN pipeline
10    Vitis HLS                        C++ → RTL conv layer
11    Vitis AI DPU                     ResNet-50 on Kria
12    Power optimization               Edge AI under 5W
13    Transformer attention            BERT attention engine
14    Benchmarking                     TOPS/W measurement
15    Production deployment            Full edge AI system
──────────────────────────────────────────────────────────────────────
Day 1 — Key Takeaways
✅ GPU dominates throughput but has 5–15ms latency overhead per request
✅ FPGA delivers 0.9–1.1ms latency — 6–15× lower than GPU for single inference
✅ Power efficiency: Kria KV260 achieves 80 IPS/W vs 13 IPS/W for A100
✅ Roofline model tells you whether to optimize compute (DSPs) or memory
✅ DSP48E2 blocks are the heart of FPGA neural network computation
✅ INT8 precision is optimal for FPGA inference — 2× ops per DSP vs INT16
✅ FPGA wins when: latency <2ms, power <75W, custom precision, streaming data
✅ Microsoft Project Brainwave saves $40M/year using FPGA over GPU for Bing
Next — Day 2: Fixed-point arithmetic and quantization — why INT8 instead of FP32, Q-format representation, overflow handling, and building your first 8-bit MAC unit in Verilog.