What is the difference between BRAM and DDR on an FPGA?

BRAM (Block RAM) is on-chip memory: single-cycle access and terabytes/sec of aggregate bandwidth, but capacity is limited to megabytes (about 12 MB on an Alveo U250). DDR4 is off-chip: gigabytes of capacity, but only ~20-80 GB/s of bandwidth and 100+ ns latency. AI accelerators keep weights and active feature maps in BRAM whenever they fit and stream from DDR only what does not fit on-chip.

What is ping-pong buffering?

Ping-pong (double) buffering uses two memory banks: while the compute engine reads from buffer A, the DMA fills buffer B from DDR. When both finish, the buffers swap roles. This hides memory latency behind computation: after the initial fill, the compute engine never stalls waiting for data, as long as the DMA can keep up with consumption.

Why is memory bandwidth the bottleneck for neural networks?

Many neural network layers are memory-bound, not compute-bound: they need more bytes of weights/activations per second than DDR can supply. The roofline model shows that below a certain arithmetic intensity (operations per byte), performance is capped by memory bandwidth, not by the number of DSP MACs. Maximizing on-chip data reuse is the key fix.

FPGA Memory Architecture for AI — BRAM vs DDR4, AXI4, Ping-Pong

1. The Memory Wall

The dirty secret of AI accelerators: compute is rarely the bottleneck — memory is. A modern FPGA has thousands of DSP MACs, but if you can't feed them data fast enough, they sit idle. Understanding the memory hierarchy is what separates a 10% utilized design from a 90% one.

Memory Hierarchy — Capacity vs Bandwidth vs Latency

Level	Capacity	Bandwidth	Latency	Typical use
Registers	KB	~100 TB/s	0 cycles	DSP/window registers
BRAM (on-chip)	10s of MB	~10 TB/s aggregate	1 cycle	Weights, line buffers
HBM2 (in-package)	8–16 GB	~460 GB/s	~100 ns	Large models (Alveo U280)
DDR4 (off-chip)	GBs	~20–80 GB/s	100+ ns	Bulk weight storage

2. BRAM — On-Chip Block RAM

BRAM is your most precious resource for AI. It's fast, single-cycle, and massively parallel — every BRAM block can be accessed independently, giving aggregate bandwidth in the terabytes/sec range.
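To see where a terabytes/sec figure can come from, here is a back-of-the-envelope sketch. The block count is the Alveo U250 figure used in this lesson; the two 36-bit ports per block and the 300 MHz clock are assumptions chosen for illustration, and a real design rarely keeps every port busy on every cycle.

```python
def bram_aggregate_bandwidth(num_blocks, port_width_bits, ports_per_block, clock_hz):
    """Peak bytes/second if every port of every block moves one word per cycle."""
    bytes_per_cycle = num_blocks * ports_per_block * port_width_bits / 8
    return bytes_per_cycle * clock_hz


# Assumptions: 2,688 BRAM36 blocks, each used as two 36-bit ports
# (width includes parity bits), fabric clock of 300 MHz (hypothetical).
peak = bram_aggregate_bandwidth(
    num_blocks=2688, port_width_bits=36, ports_per_block=2, clock_hz=300e6
)
print(f"Peak aggregate BRAM bandwidth: {peak / 1e12:.2f} TB/s")
```

This is an upper bound, not a measured number: it only shows that thousands of independent ports add up to terabytes per second.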

BRAM on Xilinx UltraScale+:
  BRAM36 = 36 Kbit = 4.5 KB per block
  Alveo U250: 2,688 BRAMs = ~12 MB total

Weight residency check (INT8):
  MobileNetV2: 3.4M params × 1B = 3.4 MB → FITS in BRAM ✓
  ResNet-50: 25.6M params × 1B = 25.6 MB → exceeds 12 MB ✗ → must stream from DDR/HBM

Strategy:
  Small model → all weights in BRAM (zero DDR stalls)
  Large model → tile weights, stream layer-by-layer
  Trick: weight pruning + compression to fit more on-chip
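The residency check above is simple enough to script. The sketch below reproduces the same arithmetic; the function names are made up for this example, and it deliberately ignores the BRAM you also need for activations and line buffers, so treat a "fits" result as optimistic.

```python
BRAM36_BYTES = 36 * 1024 // 8  # 36 Kbit per block = 4,608 bytes = 4.5 KB


def bram_capacity_bytes(num_blocks):
    """Total BRAM capacity in bytes for a given number of BRAM36 blocks."""
    return num_blocks * BRAM36_BYTES


def weights_fit(num_params, bytes_per_param, num_blocks):
    """True if the weights alone fit in BRAM (no room reserved for anything else)."""
    return num_params * bytes_per_param <= bram_capacity_bytes(num_blocks)


U250_BRAMS = 2688
capacity = bram_capacity_bytes(U250_BRAMS)
print(f"Alveo U250 BRAM: {capacity / 2**20:.1f} MiB")

# Parameter counts taken from the residency check above, INT8 = 1 byte each.
for name, params in [("MobileNetV2", 3_400_000), ("ResNet-50", 25_600_000)]:
    verdict = "fits" if weights_fit(params, 1, U250_BRAMS) else "must stream"
    print(f"{name}: {params / 1e6:.1f} MB of weights -> {verdict}")
```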

3. The Bandwidth Bottleneck

The roofline model (from Day 1) tells you whether a layer is memory-bound. If a layer's arithmetic intensity (operations per byte moved to or from external memory) is below the ridge point (peak compute throughput divided by memory bandwidth), no amount of extra DSPs will help: you're starved for data.

Memory-Bound vs Compute-Bound Layers

Depthwise and 1×1 convolutions are often memory-bound — they do few operations per byte loaded. Fix: maximize on-chip reuse so data is fetched once and used many times.
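A rough way to check this for a given layer is to compute its arithmetic intensity and compare it with the ridge point. The sketch below assumes stride 1, "same" padding, INT8 data, and ideal reuse (every weight and activation crosses the DDR interface exactly once). The 10 TOPS peak and the 56×56×128 layer shapes are hypothetical values picked for illustration.

```python
def conv_arithmetic_intensity(h, w, c_in, c_out, k, depthwise=False, bytes_per_value=1):
    """Operations per byte for one conv layer, counting 1 MAC as 2 operations."""
    if depthwise:
        macs = h * w * c_in * k * k
        weight_count = c_in * k * k
        out_count = h * w * c_in
    else:
        macs = h * w * c_in * c_out * k * k
        weight_count = c_in * c_out * k * k
        out_count = h * w * c_out
    in_count = h * w * c_in
    bytes_moved = (weight_count + in_count + out_count) * bytes_per_value
    return 2 * macs / bytes_moved


def attainable_ops_per_s(intensity, peak_ops_per_s, bandwidth_bytes_per_s):
    """Roofline: performance is capped by compute or by bandwidth, whichever is lower."""
    return min(peak_ops_per_s, intensity * bandwidth_bytes_per_s)


PEAK = 10e12     # hypothetical peak compute, operations/second
BANDWIDTH = 20e9  # one DDR4 channel, bytes/second
ridge = PEAK / BANDWIDTH  # operations per byte needed to become compute-bound

layers = {
    "3x3 conv": conv_arithmetic_intensity(56, 56, 128, 128, 3),
    "1x1 conv": conv_arithmetic_intensity(56, 56, 128, 128, 1),
    "3x3 depthwise": conv_arithmetic_intensity(56, 56, 128, 128, 3, depthwise=True),
}
for name, ai in layers.items():
    bound = "memory-bound" if ai < ridge else "compute-bound"
    tops = attainable_ops_per_s(ai, PEAK, BANDWIDTH) / 1e12
    print(f"{name}: {ai:.1f} ops/byte -> {bound}, at most {tops:.2f} TOPS")
```

With these assumed numbers the full 3×3 convolution lands above the ridge while the 1×1 and depthwise layers land below it. Change the peak or the bandwidth and the verdicts can flip.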

4. Ping-Pong (Double) Buffering

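The trick that hides memory latency: use two buffers. While the compute engine drains buffer A, the DMA fills buffer B from DDR. When both are done, they swap roles. As long as filling a buffer takes no longer than draining one, the compute engine never waits after the first fill.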

Ping-Pong Double Buffering
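To make the overlap concrete, here is a small timing model, a sketch that assumes every tile takes the same time to load and to compute, with times in arbitrary units. It compares a single buffer (load, then compute, then load again) against two buffers where a load may start as soon as its target buffer has been released.

```python
def serial_time(n_tiles, t_load, t_compute):
    """Single buffer: each tile is loaded, then computed, with no overlap."""
    return n_tiles * (t_load + t_compute)


def pingpong_time(n_tiles, t_load, t_compute):
    """Two buffers: tile i uses buffer i % 2, so loads overlap with compute."""
    if n_tiles == 0:
        return 0
    load_done = [0] * n_tiles
    comp_done = [0] * n_tiles
    for i in range(n_tiles):
        prev_load = load_done[i - 1] if i >= 1 else 0
        buf_free = comp_done[i - 2] if i >= 2 else 0  # same buffer, two tiles ago
        load_done[i] = max(prev_load, buf_free) + t_load
        prev_comp = comp_done[i - 1] if i >= 1 else 0
        comp_done[i] = max(load_done[i], prev_comp) + t_compute
    return comp_done[-1]


for t_load, t_compute in [(2, 3), (5, 3)]:
    s = serial_time(8, t_load, t_compute)
    p = pingpong_time(8, t_load, t_compute)
    print(f"load={t_load}, compute={t_compute}: serial={s}, ping-pong={p}")
```

When the load is faster than the compute, the total approaches the pure compute time plus one initial load. When the load is slower, the design becomes DMA-bound and the compute engine stalls on every tile despite the second buffer.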

5. The AXI4 Interface

AXI4 is the standard bus for moving bulk data between the FPGA fabric and DDR/HBM. Your accelerator acts as an AXI4 master, issuing burst read/write requests.

AXI4 Variant	Use	Key Feature
AXI4 (full)	Bulk DDR/HBM transfers	Bursts up to 256 beats, high throughput
AXI4-Lite	Control registers	Single transfers, simple — start/stop/status
AXI4-Stream	Data streaming between blocks	No address, pure dataflow (conv→pool→...)
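One practical consequence of the burst limit: a large weight transfer has to be cut into several bursts. AXI4 caps an incrementing burst at 256 beats and does not allow a burst to cross a 4 KB address boundary. The sketch below shows the address arithmetic an AXI4 master (or the DMA configured on its behalf) has to perform; the function name is invented for this example and it only handles beat-aligned transfers.

```python
def split_into_bursts(start_addr, num_bytes, data_width_bits=64, max_beats=256):
    """Split a transfer into (address, beats) bursts that obey the AXI4 limits."""
    bytes_per_beat = data_width_bits // 8
    if start_addr % bytes_per_beat or num_bytes % bytes_per_beat:
        raise ValueError("this sketch only handles beat-aligned transfers")
    bursts = []
    addr, remaining = start_addr, num_bytes
    while remaining > 0:
        to_boundary = 4096 - (addr % 4096)  # bytes left before the next 4 KB boundary
        burst_bytes = min(remaining, max_beats * bytes_per_beat, to_boundary)
        beats = burst_bytes // bytes_per_beat
        bursts.append((addr, beats))  # the AxLEN field would carry beats - 1
        addr += burst_bytes
        remaining -= burst_bytes
    return bursts


# Example: 8 KB of weights starting at an address that is not 4 KB aligned.
for addr, beats in split_into_bursts(0x0F00, 8192):
    print(f"addr=0x{addr:05X} beats={beats}")
```

Aligning weight buffers to 4 KB in DDR avoids the short leading burst and keeps every burst at full length.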

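// axi_weight_loader.v: AXI4-Stream sink into ping-pong BRAM
module axi_weight_loader #(
  parameter DW    = 64,     // AXI data width (bits)
  parameter DEPTH = 1024    // words per buffer
)(
  input  wire                     clk,
  input  wire                     rst_n,
  // AXI4-Stream input (from DMA reading DDR)
  input  wire [DW-1:0]            s_axis_tdata,
  input  wire                     s_axis_tvalid,
  output wire                     s_axis_tready,
  input  wire                     s_axis_tlast,
  // compute-side read port (1-cycle read latency)
  input  wire [$clog2(DEPTH)-1:0] rd_addr,
  output wire [DW-1:0]            rd_data,
  // control
  input  wire                     compute_done,  // 1-cycle pulse: compute has finished with active_buf
  output reg                      buffer_ready,  // active_buf holds a full buffer for compute
  output reg                      active_buf     // which buffer compute should read
);

  reg [DW-1:0] bram_a [0:DEPTH-1];
  reg [DW-1:0] bram_b [0:DEPTH-1];
  reg [DW-1:0] rd_a, rd_b;
  reg [$clog2(DEPTH)-1:0] wptr;
  reg fill_buf;   // which buffer DMA is filling
  reg pending;    // fill_buf is full and waiting for compute to release active_buf

  // Back-pressure the DMA while a filled buffer waits for the swap, so the
  // buffer that compute is still reading is never overwritten.
  assign s_axis_tready = !pending;

  wire wr_en     = s_axis_tvalid && s_axis_tready;
  wire fill_last = wr_en && (s_axis_tlast || wptr == DEPTH-1);  // TLAST or buffer full
  wire fill_full = pending || fill_last;           // fill_buf is complete
  wire act_free  = !buffer_ready || compute_done;  // compute can take a new buffer

  // Memory arrays: synchronous write and read, no reset, so they can map to BRAM
  always @(posedge clk) begin
    if (wr_en && !fill_buf) bram_a[wptr] <= s_axis_tdata;
    rd_a <= bram_a[rd_addr];
  end
  always @(posedge clk) begin
    if (wr_en && fill_buf) bram_b[wptr] <= s_axis_tdata;
    rd_b <= bram_b[rd_addr];
  end
  assign rd_data = active_buf ? rd_b : rd_a;

  always @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      wptr <= 0; fill_buf <= 0; active_buf <= 1;
      pending <= 0; buffer_ready <= 0;
    end else begin
      if (fill_last)  wptr <= 0;
      else if (wr_en) wptr <= wptr + 1'b1;

      if (fill_full && act_free) begin
        // buffer complete and compute is free → swap roles (ping-pong)
        active_buf   <= fill_buf;   // compute now reads what we just filled
        fill_buf     <= ~fill_buf;  // DMA fills the other one next
        buffer_ready <= 1'b1;
        pending      <= 1'b0;
      end else begin
        if (fill_last)    pending      <= 1'b1;  // filled, but compute is still busy
        if (compute_done) buffer_ready <= 1'b0;  // released, nothing new to hand over yet
      end
    end
  end
endmodule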

6. Bandwidth Budgeting Example

ResNet-50 inference, batch=1, INT8, target 1000 IPS:
  Weights to load per inference: 25.6 MB
  Activations (intermediate): ~10 MB
  Total per inference: ~35 MB
  Required bandwidth = 35 MB × 1000 IPS = 35 GB/s

  DDR4 (single channel): ~20 GB/s → INSUFFICIENT ✗
  HBM2 (Alveo U280): ~460 GB/s → plenty ✓

OR: keep weights resident, only stream activations
  Activation BW = 10 MB × 1000 = 10 GB/s → DDR4 OK ✓

Lesson: weight reuse across inferences + on-chip residency can turn a memory-bound problem into a compute-bound one.
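The same budget as a short script, so you can plug in your own model and target rate. The sizes and bandwidth figures are the ones used in this lesson; the function and variable names are invented for this sketch, and it treats the quoted bandwidths as fully achievable, which real DDR controllers do not quite deliver.

```python
def required_bandwidth(bytes_per_inference, inferences_per_s):
    """Sustained external-memory bandwidth needed, in bytes/second."""
    return bytes_per_inference * inferences_per_s


WEIGHT_BYTES = 25.6e6      # ResNet-50, INT8
ACTIVATION_BYTES = 10e6    # intermediate activations, approximate
TARGET_IPS = 1000

DDR4_CHANNEL = 20e9        # bytes/second, one DDR4 channel
HBM2 = 460e9               # bytes/second, Alveo U280

stream_all = required_bandwidth(WEIGHT_BYTES + ACTIVATION_BYTES, TARGET_IPS)
weights_resident = required_bandwidth(ACTIVATION_BYTES, TARGET_IPS)

for label, need in [("stream weights + activations", stream_all),
                    ("weights resident on-chip", weights_resident)]:
    for mem, bw in [("DDR4 channel", DDR4_CHANNEL), ("HBM2", HBM2)]:
        verdict = "OK" if need <= bw else "INSUFFICIENT"
        print(f"{label}: need {need / 1e9:.1f} GB/s, {mem} {bw / 1e9:.0f} GB/s -> {verdict}")
```

The script reports 35.6 GB/s for the streaming case, which the budget above rounds to ~35 GB/s.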

HBM Changes the Game

FPGAs like the Alveo U280 integrate HBM2 stacks delivering ~460 GB/s, roughly 6× to 23× the 20–80 GB/s of DDR4. This is why HBM-equipped FPGAs are attractive for datacenter AI inference: they can feed thousands of MACs without starving. For edge devices, the answer is instead to keep the whole (small) model in BRAM.

Day 6 — Key Takeaways

✅ Memory is the bottleneck, not compute, for many NN layers
✅ BRAM: ~12MB on Alveo U250, single-cycle, TB/s aggregate — keep weights here
✅ DDR4: GBs of capacity but only 20–80 GB/s, 100+ ns latency
✅ HBM2: ~460 GB/s in-package — the datacenter inference enabler
✅ Roofline: layers below the ridge are memory-bound — maximize reuse
✅ Ping-pong buffering hides DMA latency: compute never stalls as long as each fill finishes before the other buffer is drained
✅ AXI4 full for bulk, Lite for control, Stream for block-to-block dataflow
✅ Weight residency turns memory-bound into compute-bound

Next — Day 7: Activation Functions in Hardware — ReLU, Leaky ReLU, Sigmoid, Softmax, LUT approximations, and CORDIC for transcendental functions.
