What is Block RAM in an FPGA?

Block RAM, or BRAM, is dedicated on-chip memory built as hard blocks scattered through the FPGA fabric, typically tens of kilobits each (for example 18 Kb or 36 Kb) and combinable into larger memories. It is far more efficient than building memory out of LUTs and flip-flops, supports dual-port access, and is used for buffers, FIFOs, lookup tables and other on-chip data storage.

What is a DSP slice in an FPGA?

A DSP slice is a dedicated hard block built around a fast hardware multiplier plus an adder and accumulator, optimised for multiply-accumulate operations. Because multiplication is expensive to build from LUTs, DSP slices give FPGAs high arithmetic throughput for digital signal processing, filters, and machine-learning workloads while saving general logic.

Why do FPGAs have special clocking resources?

A clock must reach thousands of flip-flops with very low skew, which ordinary routing cannot guarantee. FPGAs provide dedicated global clock networks, clock buffers such as BUFG, and PLLs or MMCMs that can multiply, divide and phase-shift clocks and reduce jitter. The chip is also divided into clock regions so different parts can run on different clocks.

What is the difference between Block RAM and distributed RAM?

Block RAM uses dedicated memory hard blocks and is best for larger, deeper memories. Distributed RAM repurposes the LUTs in logic cells as small memories, which is convenient for tiny, shallow memories close to logic but wasteful for large ones. Tools choose between them based on the size and shape of the memory you describe.

What is an SoC FPGA?

An SoC FPGA combines a hard processor system, such as one or more Arm Cortex-A cores, with FPGA fabric on the same chip. The processor runs software like Linux while the fabric implements custom hardware accelerators, and the two communicate over an on-chip bus. Examples include the AMD Zynq and Intel SoC FPGA families.

DAY 3 · FPGA FOUNDATIONS

Hard Blocks — Block RAM, DSP Slices & Clocking

By EcrioniX · Updated Jun 6, 2026

Day 2 showed the flexible "soft" fabric of LUTs and flip-flops. But building everything from LUTs would be wasteful — imagine constructing a megabyte of memory or a fast multiplier out of tiny truth tables. So FPGA vendors sprinkle hard blocks — purpose-built silicon for the things every design needs — across the fabric. Knowing them is the difference between a slow design and a fast one.

1. Why hard blocks exist

The LUT fabric is wonderfully general, but generality has a cost. Common functions like large memories, multipliers and clock management are needed in almost every design, and implementing them in soft logic would burn enormous resources and run slowly. The solution: build those functions once as optimised dedicated silicon ("hard" blocks) and scatter them through the chip, ready to use. You get ASIC-like efficiency for the common cases while keeping the fabric free for your custom logic.

Figure — Hard blocks (Block RAM, DSP) sit in dedicated columns within the soft CLB fabric, with global clocking and I/O around it.

2. Block RAM (BRAM) — on-chip memory

Almost every design needs memory: a video frame buffer, a FIFO between clock domains, a packet buffer, a coefficient table. Block RAM is dedicated memory silicon — typically tens of kilobits per block (e.g. 18 Kb or 36 Kb), with many blocks across the chip that you can combine into larger or wider memories.

Key BRAM features:

Dual-port: two independent read/write ports, each with its own clock, so two parts of your design (or two clock domains) can access it at once. This is what makes BRAM ideal for asynchronous FIFOs.
Configurable width/depth: trade depth against word width within the same block (e.g. 512×72 or 1024×36, written as depth × width).
Modes: true dual-port, simple dual-port, single-port, and ROM (preloaded from the bitstream).
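
The width/depth trade-off can be made concrete with a little arithmetic. The sketch below is a rough estimate, not how any vendor tool actually packs memories: it assumes a 36 Kb block, takes the 512×72 and 1024×36 shapes from the list above, and adds two further shapes (2048×18 and 4096×9) as illustrative assumptions. The function name `blocks_needed` is made up for this example.

```python
from math import ceil

# Assumed (depth, width) shapes for one 36 Kb block. 512x72 and 1024x36 come
# from the text; 2048x18 and 4096x9 are illustrative assumptions.
ASPECT_RATIOS_36KB = [(512, 72), (1024, 36), (2048, 18), (4096, 9)]


def blocks_needed(depth, width, ratios=ASPECT_RATIOS_36KB):
    """Return (block_count, (block_depth, block_width)) for the cheapest tiling.

    Blocks placed side by side widen the word; blocks stacked add depth.
    Ties are resolved in favour of the shape listed first.
    """
    best = None
    for block_depth, block_width in ratios:
        count = ceil(depth / block_depth) * ceil(width / block_width)
        if best is None or count < best[0]:
            best = (count, (block_depth, block_width))
    return best
```

Under these assumptions a 2048-deep, 32-bit-wide buffer needs two blocks (`blocks_needed(2048, 32)` returns `(2, (1024, 36))`), while a 4096×8 table fits in a single block configured as 4096×9.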

BRAM vs distributed RAM

There's a second kind of memory: distributed RAM, which repurposes the LUTs from Day 2 as tiny memories. Synthesis tools usually choose between the two automatically, based on the size and shape of the memory you describe:

	Block RAM	Distributed RAM (LUT)
Built from	dedicated memory blocks	logic-cell LUTs
Best for	larger, deeper memories	small, shallow memories near logic
Cost	uses a BRAM block	consumes LUTs you might need for logic

Rule of thumb: small register files → distributed RAM; buffers and FIFOs → Block RAM (we'll build one in Day 15).
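
To see why the rule of thumb holds, compare rough costs for both options. This is a back-of-the-envelope sketch under two assumptions that vary by device family: one 6-input LUT stores a 64×1 memory, and one Block RAM holds 36 Kb. It ignores output multiplexers and aspect-ratio limits, and the function names are hypothetical.

```python
from math import ceil

BITS_PER_LUT = 64          # assumption: a 6-input LUT used as a 64 x 1 memory
BITS_PER_BRAM = 36 * 1024  # assumption: one 36 Kb block


def lut_cost(depth, width):
    """LUTs needed for a depth x width memory, ignoring output multiplexers."""
    return ceil(depth / BITS_PER_LUT) * width


def bram_cost(depth, width):
    """Lower bound on 36 Kb blocks, ignoring aspect-ratio limits."""
    return ceil(depth * width / BITS_PER_BRAM)
```

A 32×32 register file costs about 32 LUTs, or one Block RAM that would sit almost entirely empty, so distributed RAM wins. A 4096×32 FIFO costs about 2048 LUTs versus 4 blocks, so Block RAM wins.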

3. DSP slices — fast multiply-accumulate

Multiplication is the bane of soft logic — a single 18×18 multiplier built from LUTs eats hundreds of cells and runs slowly. Yet multiply-accumulate (MAC) is the core of filters, FFTs, image processing and neural networks. So FPGAs include DSP slices: hard blocks built around a fast hardware multiplier plus a pre-adder, an adder/accumulator, and pipeline registers.

// A FIR filter tap: exactly what a DSP slice does in one block
acc = acc + (sample * coeff);   // multiply-accumulate (MAC)
// Hundreds of DSP slices running in parallel = huge DSP/AI throughput.

A modern FPGA can have thousands of DSP slices, and their parallel MAC throughput is exactly why FPGAs are used for AI inference and signal processing. When you write a*b+c in HDL, synthesis can usually map it straight onto a DSP slice, provided the operand widths fit the block, so no LUTs are spent on the multiplier. (This is the same MAC idea behind the systolic array.)
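
As a software illustration of the MAC idea, here is a minimal behavioural model of a FIR filter built from repeated multiply-accumulate steps. It is a sketch in plain Python with integer samples, not HDL, and the function names are made up for this example. The loop runs the taps one after another, whereas hardware would typically give each tap its own DSP slice and run them in parallel.

```python
def mac(acc, sample, coeff):
    """One multiply-accumulate step: the operation a DSP slice performs."""
    return acc + sample * coeff


def fir_output(window, coeffs):
    """One FIR output sample: a chain of MACs over the most recent inputs."""
    acc = 0
    for sample, coeff in zip(window, coeffs):
        acc = mac(acc, sample, coeff)
    return acc


def fir_filter(samples, coeffs):
    """Filter a stream, treating inputs before the first sample as zero."""
    history = [0] * len(coeffs)  # history[0] is the newest sample
    out = []
    for s in samples:
        history = [s] + history[:-1]  # shift the delay line by one
        out.append(fir_output(history, coeffs))
    return out
```

With coefficients `[1, 1, 1]` the filter is a three-sample moving sum, so `fir_filter([1, 2, 3, 4], [1, 1, 1])` gives `[1, 3, 6, 9]`. Feeding in a single impulse returns the coefficients themselves.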

4. Clocking — the most underrated resource

A synchronous design's clock must reach thousands of flip-flops at almost the same instant. If it arrived at wildly different times (high skew), flip-flops would sample at the wrong moments and the design would fail. Ordinary routing can't deliver a clock cleanly, so FPGAs have dedicated clocking hardware:

Global clock networks: special low-skew, high-fanout spines (a tree) that distribute a clock across the whole chip nearly simultaneously.
Clock buffers (e.g. BUFG): the gateways that drive a signal onto a global clock network. You route real clocks through these, never through normal logic.
PLLs / MMCMs: Phase-Locked Loops / Mixed-Mode Clock Managers that multiply, divide and phase-shift an input clock and clean up jitter. Feed in a 100 MHz reference and generate 200 MHz, 25 MHz and a 90°-shifted clock, all from one source (Day 18).
Clock regions: the chip is divided into regions, each with its own local clock resources, so different areas can run on different clocks. Signals that cross between clock domains need careful handling (CDC, Day 17).
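
The multiply/divide/phase-shift arithmetic of a PLL or MMCM can be sketched in a few lines. The relation f_out = f_in × M / (D × O) is the usual one for these blocks; the specific values M = 10, D = 1 and the output dividers below are assumptions chosen to reproduce the 100 MHz example, and real devices also restrict the allowed internal (VCO) frequency range, which this sketch does not check.

```python
from fractions import Fraction


def pll_output_mhz(f_in_mhz, mult, div, out_div):
    """f_out = f_in * M / (D * O), with integer M, D and O."""
    return Fraction(f_in_mhz) * mult / (div * out_div)


def phase_shift_ns(f_mhz, degrees):
    """Time delay in nanoseconds for a phase shift given in whole degrees."""
    period_ns = Fraction(1000) / Fraction(f_mhz)
    return period_ns * Fraction(degrees, 360)
```

With a 100 MHz input, M = 10 and D = 1, the internal frequency is 1000 MHz. An output divider of 5 yields 200 MHz and a divider of 40 yields 25 MHz. A 90° shift on a 100 MHz clock corresponds to a quarter of its 10 ns period, i.e. 2.5 ns.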

💡 The conductor and the orchestra

The clock network is like a conductor whose beat must reach every musician (flip-flop) at once. You can't relay the beat person-to-person (ordinary routing), because it would drift hopelessly. Instead every player has a direct line of sight to the conductor (the global clock tree), and a PLL is the metronome that sets, multiplies and steadies the tempo.

5. I/O blocks & gigabit transceivers

At the edges sit I/O blocks (IOBs) — configurable pin drivers/receivers supporting dozens of electrical standards (LVCMOS, LVDS, SSTL…), with built-in registers and delay elements for precise timing. High-end FPGAs also include hard gigabit transceivers (SerDes) — the very blocks from our SerDes Lab — for multi-gigabit links like PCIe and Ethernet, far faster than the general fabric could drive.

6. SoC FPGAs — a CPU on the same die

The ultimate hard block is an entire processor. SoC FPGAs (e.g. AMD Zynq, Intel SoC FPGAs) put hard Arm Cortex-A cores right next to the FPGA fabric on one chip. The CPU runs software (even Linux — recall the MMU/Linux discussion from the ARM course) while the fabric implements custom accelerators, and the two talk over an on-chip AXI bus. It's the best of both worlds: software flexibility plus hardware acceleration.
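
One common way for the CPU and the fabric to cooperate is through memory-mapped registers: software writes operands to addresses that the fabric logic decodes, then reads the result back. The toy model below is purely illustrative. The class, the register offsets and the multiply behaviour are all hypothetical, and it does not model the AXI protocol itself, only the register-level view that software would see.

```python
class ToyAccelerator:
    """Hypothetical fabric peripheral exposing three 32-bit registers."""

    REG_A = 0x00       # made-up offset: first operand (write)
    REG_B = 0x04       # made-up offset: second operand (write)
    REG_RESULT = 0x08  # made-up offset: product, truncated to 32 bits (read)

    def __init__(self):
        self.regs = {self.REG_A: 0, self.REG_B: 0, self.REG_RESULT: 0}

    def write(self, offset, value):
        """Model a bus write issued by the CPU."""
        if offset not in (self.REG_A, self.REG_B):
            raise ValueError(f"offset {offset:#x} is not writable")
        self.regs[offset] = value & 0xFFFFFFFF
        # The "hardware" recomputes its result whenever an operand changes.
        product = self.regs[self.REG_A] * self.regs[self.REG_B]
        self.regs[self.REG_RESULT] = product & 0xFFFFFFFF

    def read(self, offset):
        """Model a bus read issued by the CPU."""
        return self.regs[offset]
```

From the software side, using the accelerator is just a sequence of writes and a read: write 6 to `REG_A`, write 7 to `REG_B`, then read 42 from `REG_RESULT`.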

✅ The mental model

Beyond the soft LUT/FF fabric, an FPGA includes hard blocks for what every design needs: Block RAM (efficient on-chip dual-port memory), DSP slices (fast multiply-accumulate for signal/AI math), dedicated clocking (global low-skew networks + PLL/MMCM), flexible I/O and gigabit transceivers, and on SoC FPGAs a hard CPU. Using the right hard block instead of soft logic is how you get speed and capacity.

🎯 Day 3 takeaways

Hard blocks = dedicated silicon for common functions; far more efficient than soft LUT logic.
Block RAM — on-chip dual-port memory for buffers/FIFOs; vs distributed RAM (LUT-based) for tiny memories.
DSP slices — hard multiply-accumulate; the engine for DSP and AI throughput.
Clocking — global low-skew networks, BUFG buffers, PLL/MMCM to multiply/divide/phase-shift, and clock regions.
I/O blocks support many standards; gigabit transceivers are hard SerDes.
SoC FPGAs add hard Arm CPUs beside the fabric, linked by AXI.

Quick check

Why not build a large memory or a multiplier from LUTs?
What makes Block RAM ideal for a FIFO between two clock domains?
What can a PLL/MMCM do that ordinary routing cannot?
What does an SoC FPGA combine, and how do the parts communicate?

FAQ

What is Block RAM?

Dedicated on-chip memory hard blocks (tens of Kb each), dual-port and combinable, used for buffers, FIFOs and tables — far more efficient than LUT-based memory.

What is a DSP slice?

A hard block with a fast multiplier plus adder/accumulator for multiply-accumulate, giving FPGAs high DSP and AI throughput.

Why special clocking?

Clocks need ultra-low skew to thousands of flip-flops; dedicated global networks, BUFG buffers and PLLs/MMCMs provide and shape them.

What is an SoC FPGA?

A chip combining hard Arm CPU cores with FPGA fabric (e.g. Zynq), running software and custom hardware together over an AXI bus.
