Day 2 showed the flexible "soft" fabric of LUTs and flip-flops. But building everything from LUTs would be wasteful — imagine constructing a megabyte of memory or a fast multiplier out of tiny truth tables. So FPGA vendors sprinkle hard blocks — purpose-built silicon for the things every design needs — across the fabric. Knowing them is the difference between a slow design and a fast one.
The LUT fabric is wonderfully general, but generality has a cost. Common functions like large memories, multipliers and clock management are needed in almost every design, and implementing them in soft logic would burn enormous resources and run slowly. The solution: build those functions once as optimised dedicated silicon ("hard" blocks) and scatter them through the chip, ready to use. You get ASIC-like efficiency for the common cases while keeping the fabric free for your custom logic.
Almost every design needs memory: a video frame buffer, a FIFO between clock domains, a packet buffer, a coefficient table. Block RAM is dedicated memory silicon — typically tens of kilobits per block (e.g. 18 Kb or 36 Kb), with many blocks across the chip that you can combine into larger or wider memories.
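To get a feel for how many blocks a memory consumes, here is a rough sizing sketch. It assumes 36 Kb blocks (36 × 1024 bits) and simply divides total bits by block size, so it gives a lower bound only; real devices restrict the depth/width shapes each block supports, and the tools may use more. The function name `min_bram_blocks` is made up for illustration.

```python
BRAM_BITS = 36 * 1024  # assumed block size: one 36 Kb block


def min_bram_blocks(depth: int, width_bits: int, block_bits: int = BRAM_BITS) -> int:
    """Lower bound on the blocks needed for a depth x width_bits memory."""
    total_bits = depth * width_bits
    # Integer ceiling division: a partly used block still costs a whole block.
    return (total_bits + block_bits - 1) // block_bits


# Example: a 640x480 frame buffer with 8 bits per pixel.
print(min_bram_blocks(depth=640 * 480, width_bits=8))  # 67
```

A small coefficient table of 1024 × 36 bits fits in a single block under the same assumption, which is why tables and FIFOs are cheap while frame buffers eat a noticeable share of the chip.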
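Key BRAM features: each block is dual-port (two ports can access the same memory independently), and blocks can be combined into deeper or wider memories.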
There's a second kind of memory: distributed RAM, which repurposes the LUTs from Day 2 as tiny memories. The tools pick automatically:
| | Block RAM | Distributed RAM (LUT) |
|---|---|---|
| Built from | dedicated memory blocks | logic-cell LUTs |
| Best for | larger, deeper memories | small, shallow memories near logic |
| Cost | uses a BRAM block | consumes LUTs you might need for logic |
Rule of thumb: small register files → distributed RAM; buffers and FIFOs → Block RAM (we'll build one in Day 15).
Multiplication is the bane of soft logic — a single 18×18 multiplier built from LUTs eats hundreds of cells and runs slowly. Yet multiply-accumulate (MAC) is the core of filters, FFTs, image processing and neural networks. So FPGAs include DSP slices: hard blocks built around a fast hardware multiplier plus a pre-adder, an adder/accumulator, and pipeline registers.
A large modern FPGA can have thousands of DSP slices, and their parallel MAC throughput is exactly why FPGAs are used for AI inference and signal processing. When you write a*b+c in HDL, synthesis will normally map it straight onto a DSP slice, provided the operand widths fit the slice's multiplier, so no LUTs are wasted. (This is the same MAC idea behind the systolic array.)
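The multiply-accumulate operation itself is easy to model in software. The sketch below is a behavioural model only, not a description of any particular vendor's slice: the 18-bit signed operands and 48-bit accumulator are assumed widths, and the names `MacUnit`, `wrap_signed` and `dot_product` are made up for illustration. It leaves out the pre-adder and pipeline registers.

```python
def wrap_signed(value: int, bits: int) -> int:
    """Wrap an integer into a signed two's-complement range of `bits` bits."""
    value &= (1 << bits) - 1
    # If the top bit is set, the value is negative in two's complement.
    return value - (1 << bits) if value >> (bits - 1) else value


class MacUnit:
    """Behavioural multiply-accumulate model (widths are assumptions)."""

    def __init__(self, operand_bits: int = 18, acc_bits: int = 48):
        self.operand_bits = operand_bits
        self.acc_bits = acc_bits
        self.acc = 0

    def step(self, a: int, b: int) -> int:
        """One cycle: acc <- acc + a*b, wrapping the way fixed-width hardware would."""
        a = wrap_signed(a, self.operand_bits)
        b = wrap_signed(b, self.operand_bits)
        self.acc = wrap_signed(self.acc + a * b, self.acc_bits)
        return self.acc


def dot_product(samples, coeffs):
    """Feed sample/coefficient pairs through one MAC, as a FIR filter tap loop does."""
    mac = MacUnit()
    for x, c in zip(samples, coeffs):
        mac.step(x, c)
    return mac.acc


print(dot_product([1, 2, 3, 4], [4, 3, 2, 1]))  # 4 + 6 + 6 + 4 = 20
```

In hardware, each `step` would take one clock cycle on one slice, and a design with many slices runs many of these loops side by side.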
A synchronous design's clock must reach thousands of flip-flops at almost the same instant. If it arrived at wildly different times (high skew), flip-flops would sample at the wrong moments and the design would fail. Ordinary routing can't deliver a clock cleanly, so FPGAs have dedicated clocking hardware: global low-skew clock networks driven by clock buffers (BUFG), and PLLs/MMCMs that generate and shape the clocks.
The clock network is like a conductor whose beat must reach every musician (flip-flop) at once. You can't relay the beat person-to-person (ordinary routing) — it'd drift hopelessly. Instead there's a dedicated visual line of sight to all players (the global clock tree), and a PLL is the metronome that sets, multiplies and steadies the tempo.
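The "multiplies the tempo" part of a PLL comes down to a small piece of arithmetic. The sketch below assumes the common structure where the input clock is multiplied by M and divided by D to form an internal VCO frequency, which is then divided by an output divider O. The parameter names are illustrative, and real PLLs also restrict the allowed VCO range and divider values, which this sketch does not check.

```python
from fractions import Fraction


def pll_output_mhz(f_in_mhz, multiply, divide, out_divide):
    """Output frequency in MHz, assuming f_out = f_in * M / (D * O)."""
    vco = Fraction(f_in_mhz) * multiply / divide  # internal VCO frequency
    return vco / out_divide


# Example: a 100 MHz board clock with M = 10, D = 1, O = 4.
print(float(pll_output_mhz(100, 10, 1, 4)))  # 250.0
```

Using `Fraction` keeps the result exact, which makes it easy to see whether a target frequency can be hit precisely with integer settings.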
At the edges sit I/O blocks (IOBs) — configurable pin drivers/receivers supporting dozens of electrical standards (LVCMOS, LVDS, SSTL…), with built-in registers and delay elements for precise timing. High-end FPGAs also include hard gigabit transceivers (SerDes) — the very blocks from our SerDes Lab — for multi-gigabit links like PCIe and Ethernet, far faster than the general fabric could drive.
The ultimate hard block is an entire processor. SoC FPGAs (e.g. AMD Zynq, Intel SoC FPGAs) put hard Arm Cortex-A cores right next to the FPGA fabric on one chip. The CPU runs software (even Linux — recall the MMU/Linux discussion from the ARM course) while the fabric implements custom accelerators, and the two talk over an on-chip AXI bus. It's the best of both worlds: software flexibility plus hardware acceleration.
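From the software side, an accelerator in the fabric typically looks like a handful of memory-mapped registers. The toy model below illustrates that idea only. The register offsets, the `FabricAccelerator` class and the multiply function are all hypothetical, and nothing here models the AXI protocol itself (handshakes, bursts, address decoding).

```python
# Hypothetical register offsets for an imaginary multiply accelerator.
REG_A, REG_B, REG_RESULT = 0x00, 0x04, 0x08
MASK32 = 0xFFFFFFFF  # registers are assumed to be 32 bits wide


class FabricAccelerator:
    """Stands in for custom fabric logic that the CPU sees as registers."""

    def __init__(self):
        self.regs = {REG_A: 0, REG_B: 0, REG_RESULT: 0}

    def write(self, offset: int, value: int) -> None:
        if offset not in (REG_A, REG_B):
            raise ValueError(f"register 0x{offset:02x} is not writable")
        self.regs[offset] = value & MASK32
        # The "hardware" recomputes its result whenever an operand changes.
        self.regs[REG_RESULT] = (self.regs[REG_A] * self.regs[REG_B]) & MASK32

    def read(self, offset: int) -> int:
        return self.regs[offset]


def cpu_multiply(accel: FabricAccelerator, a: int, b: int) -> int:
    """What the software side does: write operands, then read the result."""
    accel.write(REG_A, a)
    accel.write(REG_B, b)
    return accel.read(REG_RESULT)


print(cpu_multiply(FabricAccelerator(), 6, 7))  # 42
```

On a real SoC FPGA the two `write` calls and the `read` would be bus transactions to addresses that the fabric logic decodes.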
Beyond the soft LUT/FF fabric, an FPGA includes hard blocks for what every design needs: Block RAM (efficient on-chip dual-port memory), DSP slices (fast multiply-accumulate for signal/AI math), dedicated clocking (global low-skew networks + PLL/MMCM), flexible I/O and gigabit transceivers, and on SoC FPGAs a hard CPU. Using the right hard block instead of soft logic is how you get speed and capacity.
Block RAM (BRAM): Dedicated on-chip memory hard blocks (tens of Kb each), dual-port and combinable, used for buffers, FIFOs and tables; far more efficient than LUT-based memory for anything beyond small, shallow storage.
DSP slice: A hard block with a fast multiplier plus adder/accumulator for multiply-accumulate, giving FPGAs high DSP and AI throughput.
Dedicated clocking: Clocks need ultra-low skew to thousands of flip-flops; dedicated global networks, BUFG buffers and PLLs/MMCMs provide and shape them.
SoC FPGA: A chip combining hard Arm CPU cores with FPGA fabric (e.g. Zynq), running software and custom hardware together over an AXI bus.