What is an FPGA?
A Field-Programmable Gate Array is a chip you program after manufacturing — load a bitstream and it becomes any digital circuit you design. This guide explains the architecture, how every block works, and when to use FPGA vs ASIC vs CPU.
FPGA Architecture Overview
How an FPGA Works
An FPGA ships as a blank slate. You write RTL in Verilog or VHDL, synthesize it to a netlist, place and route it onto the FPGA's resource grid, then generate a bitstream: a binary file that configures every LUT, routing switch, and I/O pin. Most FPGAs hold this configuration in SRAM, so it is lost at power-off; at power-on the bitstream reloads from flash.
The Configurable Logic Block (CLB)
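A CLB is the fundamental building block. In AMD/Xilinx UltraScale and UltraScale+ devices each CLB holds a single slice containing the resources below (a 7-series slice has half as many LUTs and flip-flops, with two slices per CLB):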
| Resource | Count per Slice | Function |
|---|---|---|
| 6-input LUT | 8 | Implements any 6-variable Boolean function (64 SRAM bits) |
| Flip-Flop | 16 | D-type register for sequential logic, clocked by global clock net |
| Carry Chain | 8-bit | Fast ripple-carry for adders and counters without LUT chaining |
| MUX | Various | F7/F8/F9 muxes to merge LUTs for wider functions (7-, 8-, and 9-input) |
| Distributed RAM | optional | LUTs in SLICEM-type slices configured as small RAM (64×1 bit per LUT) |
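As a rough illustration of how the wide-function muxes get used, consider an 8:1 multiplexer. It has 11 inputs (8 data, 3 select), which is too many for one 6-input LUT. Synthesis tools typically split it into two LUTs, each acting as a 4:1 mux, and join them with an F7 mux. The module name `mux8` is an assumption for this sketch, and the exact mapping depends on the tool and its settings.

```verilog
// 8:1 mux: 11 inputs in total, so it cannot fit in a single 6-input LUT.
// A typical mapping is 2 x LUT6 (each a 4:1 mux) plus one F7 mux.
module mux8 (
  input  wire [7:0] d,    // data inputs
  input  wire [2:0] sel,  // select; sel[2] would drive the F7 mux
  output wire       y
);
  assign y = d[sel];      // variable bit-select describes the mux
endmodule
```

The post-synthesis utilization report is the place to confirm what the tool actually chose.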
Block RAM (BRAM)
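Dedicated true dual-port SRAM blocks arranged in columns within the fabric. Each block is 36Kb (configurable as two independent 18Kb blocks). Both ports can read and write independently on different clocks, which suits async FIFOs, line buffers, and coefficient tables. The synthesizer infers BRAM automatically when you declare a large array in RTL and read it synchronously (the read data is registered on a clock edge); an asynchronous read typically maps to distributed RAM instead.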
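A minimal sketch of RTL that synthesis tools commonly recognize as block RAM is shown here. It describes a simple dual-port memory (one write port, one read port) rather than the full true dual-port mode. The module name `dp_ram`, its parameters, and its port names are assumptions made for illustration.

```verilog
// Simple dual-port RAM: write on wr_clk, registered read on rd_clk.
// The registered (synchronous) read is what allows BRAM inference.
module dp_ram #(
  parameter W     = 8,     // data width
  parameter AW    = 10,    // address width
  parameter DEPTH = 1024   // number of words, at most 2**AW
) (
  input  wire          wr_clk,
  input  wire          wr_en,
  input  wire [AW-1:0] wr_addr,
  input  wire [W-1:0]  wr_data,
  input  wire          rd_clk,
  input  wire          rd_en,
  input  wire [AW-1:0] rd_addr,
  output reg  [W-1:0]  rd_data
);
  reg [W-1:0] mem [0:DEPTH-1];

  always @(posedge wr_clk)
    if (wr_en) mem[wr_addr] <= wr_data;

  always @(posedge rd_clk)
    if (rd_en) rd_data <= mem[rd_addr];  // read data appears one cycle later
endmodule
```

With these defaults the array holds 8Kb, so it fits within one 18Kb block. Very small arrays may still be placed in distributed RAM at the tool's discretion.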
DSP Slices
Hard-wired multiply-accumulate (MAC) units. The DSP48E1 slice in Xilinx 7-series contains a 25×18 signed multiplier feeding a 48-bit accumulator, with a pre-adder in front of the multiplier. Cascading DSP slices lets you build FIR filters, FFTs, and matrix multipliers that run at 500+ MHz while keeping the arithmetic out of the LUT fabric.
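The sketch below shows one way to write a signed multiply-accumulate so that synthesis can map it onto a single DSP slice. The module name `mac`, the port names, and the default widths (chosen to match a 25×18 multiplier and 48-bit accumulator) are assumptions. The clear is synchronous because the registers inside a DSP slice only support synchronous reset.

```verilog
// Pipelined signed MAC: input registers, product register, accumulator.
module mac #(
  parameter AW = 25,  // A operand width (assumed)
  parameter BW = 18,  // B operand width (assumed)
  parameter PW = 48   // accumulator width (assumed)
) (
  input  wire                 clk,
  input  wire                 clr,  // synchronous clear of the accumulator
  input  wire                 en,   // accumulate enable
  input  wire signed [AW-1:0] a,
  input  wire signed [BW-1:0] b,
  output reg  signed [PW-1:0] acc
);
  reg signed [AW-1:0]    a_r;
  reg signed [BW-1:0]    b_r;
  reg signed [AW+BW-1:0] prod_r;

  always @(posedge clk) begin
    a_r    <= a;                 // input registers
    b_r    <= b;
    prod_r <= a_r * b_r;         // product register
    if (clr)
      acc <= {PW{1'b0}};
    else if (en)
      acc <= acc + prod_r;       // prod_r is sign-extended to PW bits
  end
endmodule
```

Note that `en` gates only the accumulator: a product reaches `acc` two clock cycles after its operands appear on `a` and `b`, so the enable has to be delayed to match.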
3-Input LUT Explorer
A 3-input LUT is an 8-entry truth table stored in SRAM, with one output bit for each combination of the inputs A, B, and C. The LUT implements whatever function you program into it: the bitstream sets these 8 bits.
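A behavioral model makes the truth-table idea concrete. In this sketch the 8-bit parameter `INIT` plays the role of the bits the bitstream would load, and the inputs simply select one of them. The module name `lut3`, the parameter name, and the choice of A as the least significant index bit are assumptions for illustration.

```verilog
// Behavioral model of a 3-input LUT: INIT holds the 8 truth-table bits.
module lut3 #(
  parameter [7:0] INIT = 8'h96   // 8'h96 = A ^ B ^ C
) (
  input  wire a, b, c,
  output wire out
);
  wire [7:0] truth_table = INIT;
  assign out = truth_table[{c, b, a}];  // inputs form the table index
endmodule
```

Changing only the parameter changes the function: `8'h96` gives a 3-input XOR, `8'hE8` gives a majority vote, and `8'h80` gives a 3-input AND, all with identical hardware.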
FPGA vs ASIC vs CPU — When to Use Which
FPGA
- Reprogrammable any time
- Parallel hardware execution
- Medium NRE cost (tools only)
- Roughly 10–40× the area of an equivalent ASIC
- Roughly 10× more dynamic power
- Roughly 3–10× lower clock than an ASIC
ASIC
- Fixed function post tape-out
- Maximum performance (GHz)
- Very high NRE ($1M–$50M)
- Smallest die area
- Lowest power
- Best for 1M+ units
CPU / GPU
- Software-defined, flexible
- Sequential with SIMD
- Zero NRE — buy off shelf
- Fastest time to market
- Higher power than ASIC
- General-purpose
| Feature | FPGA | ASIC | CPU |
|---|---|---|---|
| Programmable after fab | Yes (always) | No | Software only |
| Typical clock speed | 200–500 MHz | 1–5 GHz | 3–5 GHz |
| Power efficiency | Medium | Best | Medium |
| Time to working HW | Hours to days | 6–18 months | Days (buy + code) |
| Unit cost @ 1M units | $20–$200 | $1–$10 | $50–$500 |
| Parallelism | Massive (HW) | Massive (HW) | Limited (cores) |
Simple Verilog for FPGA
Any synthesizable Verilog maps to FPGA resources. The synthesizer decides how many LUTs, FFs, and BRAMs are needed.
// 4-bit counter — maps to 4 FFs + carry chain
module counter #(parameter N=4) (
input wire clk, rst_n,
output reg [N-1:0] count
);
// Asynchronous active-low reset. Replication is used because '0 is SystemVerilog-only.
always @(posedge clk or negedge rst_n)
  if (!rst_n) count <= {N{1'b0}};
  else count <= count + 1'b1;
endmodule
// 8-bit adder — maps to LUTs + carry chain (no DSP needed)
module adder8 (
input wire [7:0] a, b,
input wire cin,
output wire [7:0] sum,
output wire cout
);
assign {cout, sum} = a + b + cin;
endmodule
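// 256-deep dual-clock FIFO; the synchronous read lets the synthesizer infer BRAM
module fifo256 #(parameter W=8) (
  input wire wr_clk, rd_clk, wr_en, rd_en,
  input wire [W-1:0] din,
  output reg [W-1:0] dout,
  output wire full, empty
);
  reg [W-1:0] mem [0:255];
  // 9-bit pointers (extra MSB separates full from empty); Gray copies cross domains via 2-FF synchronizers
  reg [8:0] wr_bin=0, rd_bin=0, wr_gray=0, rd_gray=0, wr_g1=0, wr_g2=0, rd_g1=0, rd_g2=0;
  wire [8:0] wr_next = wr_bin + 9'd1, rd_next = rd_bin + 9'd1;
  // Full: Gray pointers match except for the two inverted MSBs. Empty: pointers are equal.
  assign full  = (wr_gray == {~rd_g2[8:7], rd_g2[6:0]});
  assign empty = (rd_gray == wr_g2);
  always @(posedge wr_clk) begin
    {rd_g2, rd_g1} <= {rd_g1, rd_gray};      // bring read pointer into wr_clk domain
    if (wr_en && !full) begin
      mem[wr_bin[7:0]] <= din;
      wr_bin <= wr_next; wr_gray <= wr_next ^ (wr_next >> 1);
    end
  end
  always @(posedge rd_clk) begin
    {wr_g2, wr_g1} <= {wr_g1, wr_gray};      // bring write pointer into rd_clk domain
    if (rd_en && !empty) begin
      dout <= mem[rd_bin[7:0]];
      rd_bin <= rd_next; rd_gray <= rd_next ^ (rd_next >> 1);
    end
  end
endmodule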
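Before synthesizing, it is worth checking a module like this in simulation. The following is a minimal self-checking testbench sketch for the `adder8` module above; the testbench name `adder8_tb` and the variable names are assumptions. It applies every combination of `a`, `b`, and `cin` and counts mismatches against the expected 9-bit result.

```verilog
`timescale 1ns/1ps
// Exhaustive testbench for adder8: 256 * 256 * 2 input combinations.
module adder8_tb;
  reg  [7:0] a, b;
  reg        cin;
  wire [7:0] sum;
  wire       cout;
  integer    i, j, k, errors;

  adder8 dut (.a(a), .b(b), .cin(cin), .sum(sum), .cout(cout));

  initial begin
    errors = 0;
    for (i = 0; i < 256; i = i + 1)
      for (j = 0; j < 256; j = j + 1)
        for (k = 0; k < 2; k = k + 1) begin
          a = i; b = j; cin = k;
          #1;  // let the combinational outputs settle
          if ({cout, sum} !== (i + j + k)) errors = errors + 1;
        end
    $display("adder8: %0d mismatches", errors);
    $finish;
  end
endmodule
```

A correct adder reports zero mismatches. With Icarus Verilog, for example, and assuming the two modules are saved as `adder8.v` and `adder8_tb.v`, this can be run with `iverilog -o adder8_tb adder8.v adder8_tb.v` followed by `vvp adder8_tb`.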
FPGA Use Cases
ASIC Prototyping
Run real RTL on an FPGA before a $10M tape-out. Find functional bugs with real I/O at speeds orders of magnitude faster than simulation.
Software-Defined Radio
Implement modulation, demodulation, filtering in real-time. Change waveform without new hardware.
High-Frequency Trading
Sub-microsecond order execution. The FPGA parses market data and places orders in hardware, without waiting on an operating system or a software network stack.
Video Processing
4K encode/decode, frame synchronization, multi-channel processing — pixel pipelines in hardware.
Network Offload
Line-rate packet parsing, routing table lookup, and encryption at 400 Gbps, rates that are impractical to sustain in software on general-purpose cores.
ML Inference
Low-latency neural network inference with custom bit-widths, a middle ground between GPUs (higher power) and ASICs (high NRE).
Frequently Asked Questions
What does "field-programmable" mean?
"Field" means after leaving the factory — in the field, in your lab, on the production board. The chip arrives blank; you program it yourself by loading a bitstream. This is unlike gate arrays of the 1980s that required a mask revision to change the metal connections.
How many LUTs does a modern FPGA have?
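Entry-level: the Xilinx Spartan-7 family ranges from about 4K to 64K LUTs. Mid-range: the largest Xilinx Artix-7 (XC7A200T) has about 134K LUTs, marketed as 215K logic cells. High-end: Xilinx Virtex UltraScale+ devices exceed 1.7 million LUTs. Each "LUT" in the count is a 6-input LUT that can implement any 6-variable function. Vendor "logic cell" figures are scaled up from the LUT count, so check which number a datasheet is quoting.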
Is Verilog or VHDL better for FPGA?
Both can describe the same hardware, and the synthesizer produces equivalent results from either. Verilog/SystemVerilog is dominant in industry (especially in ASIC design and in the US). VHDL is common in Europe and in defense/aerospace. For learning, pick Verilog: more online resources, shorter syntax, and directly transferable to ASIC work.
Can you run Linux on an FPGA?
Yes. You can instantiate a soft-core CPU (such as a RISC-V core or MicroBlaze) in the FPGA fabric and boot Linux on it; standard Linux kernels need the core configured with an MMU. Zynq-7000 and Zynq UltraScale+ SoC-FPGAs embed hard Arm Cortex-A cores (Cortex-A9 and Cortex-A53 respectively) alongside the FPGA fabric, giving you both worlds: a hard CPU running Linux plus custom FPGA hardware accelerators.