
Memory Architecture
BRAM vs DDR

Where the real bottleneck lives. On-chip BRAM vs off-chip DDR4/HBM, bandwidth analysis, ping-pong double buffering, and the AXI4 master interface for streaming weights and activations.

By EcrioniX Engineering Team · Published June 14, 2026 · ~4,500 words · 14 min read

1. The Memory Wall

The dirty secret of AI accelerators: compute is rarely the bottleneck — memory is. A modern FPGA has thousands of DSP MACs, but if you can't feed them data fast enough, they sit idle. Understanding the memory hierarchy is what separates a 10% utilized design from a 90% one.

Memory Hierarchy — Capacity vs Bandwidth vs Latency
Level | Capacity | Bandwidth | Latency | Typical use
Registers | KB | ~100 TB/s | 0 cycles | DSP/window registers
BRAM (on-chip) | 10s of MB | ~10 TB/s aggregate | 1 cycle | weights, line buffers
HBM2 (in-package) | 8–16 GB | ~460 GB/s | ~100 ns | large models (Alveo U280)
DDR4 (off-chip) | GBs | ~20–80 GB/s | 100+ ns | bulk weight storage

2. BRAM — On-Chip Block RAM

BRAM is your most precious resource for AI. It's fast, single-cycle, and massively parallel — every BRAM block can be accessed independently, giving aggregate bandwidth in the terabytes/sec range.

BRAM on Xilinx UltraScale+:
  BRAM36 = 36 Kbit = 4.5 KB per block
  Alveo U250: 2,688 BRAMs = ~12 MB total

Weight residency check (INT8):
  MobileNetV2: 3.4M params × 1 B = 3.4 MB → FITS in BRAM ✓
  ResNet-50: 25.6M params × 1 B = 25.6 MB → exceeds 12 MB ✗ → must stream from DDR/HBM

Strategy:
  Small model → all weights in BRAM (zero DDR stalls)
  Large model → tile weights, stream layer-by-layer
  Trick: weight pruning + compression to fit more on-chip
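The residency check above is plain arithmetic, so it is easy to script for your own model and device. The sketch below is a minimal Python version; the function names are illustrative and not part of any vendor tool. It optimistically assumes every BRAM36 block is available for weights, which a real design will not manage because line buffers and FIFOs also need BRAM.

```python
# Minimal weight-residency check, assuming all BRAM36 blocks can hold weights.
BRAM36_BYTES = 36 * 1024 // 8  # 36 Kbit per block = 4608 bytes (4.5 KB)


def bram_capacity_bytes(num_blocks: int) -> int:
    """Total raw BRAM capacity in bytes for a device with num_blocks BRAM36s."""
    return num_blocks * BRAM36_BYTES


def fits_on_chip(num_params: int, bytes_per_param: int, num_blocks: int) -> bool:
    """True if all weights fit in BRAM at the given precision."""
    return num_params * bytes_per_param <= bram_capacity_bytes(num_blocks)


U250_BRAMS = 2688  # block count quoted in the text for the Alveo U250
MODELS = {"MobileNetV2": 3_400_000, "ResNet-50": 25_600_000}  # parameter counts

for name, params in MODELS.items():
    verdict = "fits in BRAM" if fits_on_chip(params, 1, U250_BRAMS) else "must stream"
    print(f"{name}: {params / 1e6:.1f} MB of INT8 weights -> {verdict}")
```

Changing `bytes_per_param` to 2 shows how quickly FP16 weights eat the on-chip budget.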

3. The Bandwidth Bottleneck

The roofline model (from Day 1) tells you whether a layer is memory-bound. If a layer's arithmetic intensity (operations per byte) is below the roofline ridge, no amount of extra DSPs will help — you're starved for data.

Memory-Bound vs Compute-Bound Layers
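[Figure: roofline plot. x-axis: arithmetic intensity (ops/byte). The sloped memory-bandwidth roof meets the flat compute roof (DSP limit). Layers from left (memory-bound) to right (compute-bound): DW-conv, 1×1 conv, 3×3 conv, large GEMM.]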
Depthwise and 1×1 convolutions are often memory-bound — they do few operations per byte loaded. Fix: maximize on-chip reuse so data is fetched once and used many times.
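To make the roofline test concrete, here is a small Python sketch that classifies a layer from its arithmetic intensity. The 10 TOPS compute peak and the per-layer intensities are hypothetical numbers chosen for illustration; only the ~20 GB/s DDR4 figure comes from this article.

```python
# Roofline classification sketch. PEAK_OPS and the layer intensities are
# hypothetical; plug in your own device and layer numbers.

def ridge_point(peak_ops_per_s, mem_bw_bytes_per_s):
    """Arithmetic intensity (ops/byte) where the memory roof meets the compute roof."""
    return peak_ops_per_s / mem_bw_bytes_per_s


def attainable_ops_per_s(intensity, peak_ops_per_s, mem_bw_bytes_per_s):
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_ops_per_s, intensity * mem_bw_bytes_per_s)


def is_memory_bound(intensity, peak_ops_per_s, mem_bw_bytes_per_s):
    """A layer left of the ridge point is limited by memory bandwidth."""
    return intensity < ridge_point(peak_ops_per_s, mem_bw_bytes_per_s)


PEAK_OPS = 10 * 10**12  # hypothetical 10 TOPS compute roof
DDR4_BW = 20 * 10**9    # ~20 GB/s, single DDR4 bank

LAYERS = {"depthwise conv": 5, "large GEMM": 1000}  # hypothetical ops/byte

for name, ai in LAYERS.items():
    kind = "memory-bound" if is_memory_bound(ai, PEAK_OPS, DDR4_BW) else "compute-bound"
    tops = attainable_ops_per_s(ai, PEAK_OPS, DDR4_BW) / 1e12
    print(f"{name}: {ai} ops/byte -> {kind}, at most {tops:.2f} TOPS")
```

With these assumed numbers the ridge sits at 500 ops/byte, so a layer at 5 ops/byte can reach only 1% of the compute roof no matter how many DSPs you add.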

4. Ping-Pong (Double) Buffering

The standard trick for hiding memory latency is to use two buffers. While the compute engine drains buffer A, the DMA fills buffer B from DDR. Then they swap. As long as each DMA fill finishes no later than the compute pass it overlaps, the compute engine never waits.

Ping-Pong Double Buffering
[Figure: ping-pong double buffering. Phase 1: compute reads Buffer A while the DMA fills Buffer B. Phase 2 (swap): the DMA fills Buffer A while compute reads Buffer B. Overlapped timeline: each DMA fill runs underneath the compute pass on the other buffer, so compute never stalls and memory latency is hidden.]
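A quick timing model shows how much the overlap buys and when it stops being free. This Python sketch assumes every tile takes the same DMA time and the same compute time; the tile count and the microsecond values are hypothetical.

```python
# Timing model for tiled processing with and without ping-pong buffering.
# Assumes uniform per-tile DMA and compute times (any consistent time unit).

def serial_time(n_tiles, t_dma, t_compute):
    """Single buffer: every tile is loaded, then computed, with no overlap."""
    return n_tiles * (t_dma + t_compute)


def pingpong_time(n_tiles, t_dma, t_compute):
    """Two buffers: the first fill is exposed, then DMA and compute overlap."""
    if n_tiles == 0:
        return 0
    # The first fill cannot overlap anything. After that, each step costs the
    # slower of the two activities, and the last tile still has to be computed.
    return t_dma + (n_tiles - 1) * max(t_dma, t_compute) + t_compute


def compute_stalls(t_dma, t_compute):
    """Compute waits on memory whenever a fill takes longer than a compute pass."""
    return t_dma > t_compute


# Hypothetical example: 8 tiles, 40 us per DMA fill, 100 us per compute pass.
print(serial_time(8, 40, 100), pingpong_time(8, 40, 100), compute_stalls(40, 100))
```

When `t_dma` exceeds `t_compute`, double buffering still beats the serial schedule, but the design is bandwidth-limited and the compute engine idles for part of every step.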

5. The AXI4 Interface

AXI4 is the standard bus for moving bulk data between the FPGA fabric and DDR/HBM. Your accelerator acts as an AXI4 master, issuing burst read/write requests.

AXI4 Variant | Use | Key Feature
AXI4 (full) | Bulk DDR/HBM transfers | Bursts up to 256 beats, high throughput
AXI4-Lite | Control registers | Single transfers, simple: start/stop/status
AXI4-Stream | Data streaming between blocks | No address, pure dataflow (conv→pool→...)
// axi_weight_loader.v: AXI4-Stream sink into ping-pong BRAM
module axi_weight_loader #(
    parameter DW    = 64,    // AXI data width (bits)
    parameter DEPTH = 1024   // words per buffer (assumed to be a power of two)
)(
    input  wire          clk,
    input  wire          rst_n,
    // AXI4-Stream input (from DMA reading DDR)
    input  wire [DW-1:0] s_axis_tdata,
    input  wire          s_axis_tvalid,
    output wire          s_axis_tready,
    input  wire          s_axis_tlast,
    // control
    output reg           buffer_ready,  // high once a full buffer is ready for compute
    output reg           active_buf     // which buffer compute should read
);

    reg [DW-1:0] bram_a [0:DEPTH-1];
    reg [DW-1:0] bram_b [0:DEPTH-1];
    reg [$clog2(DEPTH)-1:0] wptr;       // write address, 0..DEPTH-1
    reg fill_buf;                       // which buffer DMA is filling

    // Always ready: assumes compute drains a buffer before the DMA comes back
    // to it. There is no backpressure, so a slow consumer gets overwritten.
    assign s_axis_tready = 1'b1;

    wire wr_en = s_axis_tvalid && s_axis_tready;

    // Memory writes sit in a reset-free clocked block so tools can infer BRAM.
    always @(posedge clk) begin
        if (wr_en) begin
            // write incoming word to the buffer being filled
            if (fill_buf == 1'b0) bram_a[wptr] <= s_axis_tdata;
            else                  bram_b[wptr] <= s_axis_tdata;
        end
    end

    // Write pointer and ping-pong control
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            wptr         <= 0;
            fill_buf     <= 1'b0;
            active_buf   <= 1'b1;
            buffer_ready <= 1'b0;
        end else if (wr_en) begin
            if (s_axis_tlast) begin
                // buffer complete → swap roles (ping-pong)
                wptr         <= 0;
                active_buf   <= fill_buf;   // compute now reads what we just filled
                fill_buf     <= ~fill_buf;  // DMA fills the other one next
                buffer_ready <= 1'b1;       // stays high; watch active_buf for swaps
            end else begin
                wptr <= wptr + 1'b1;        // wraps if a packet exceeds DEPTH words
            end
        end
    end

    // Compute-side read ports for bram_a/bram_b are omitted from this sketch.
endmodule
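On the memory-mapped side, an AXI4 master has to chop a large weight transfer into legal bursts. The Python sketch below plans those bursts under two AXI4 rules: an incrementing burst carries at most 256 beats, and a burst must not cross a 4 KB address boundary. The function name and the beat-aligned-address requirement are assumptions made for this illustration.

```python
# Burst planner for an AXI4 master issuing INCR bursts.
# Assumes the start address and length are multiples of the beat size.

AXI_BOUNDARY = 4096  # a burst must not cross a 4 KB address boundary


def split_into_bursts(start_addr, num_bytes, data_width_bits=64, max_beats=256):
    """Return a list of (address, beats) pairs covering the whole transfer."""
    bytes_per_beat = data_width_bits // 8
    if start_addr % bytes_per_beat or num_bytes % bytes_per_beat:
        raise ValueError("address and length must be beat-aligned")

    bursts = []
    addr, remaining = start_addr, num_bytes
    while remaining > 0:
        to_boundary = AXI_BOUNDARY - (addr % AXI_BOUNDARY)
        burst_bytes = min(remaining, max_beats * bytes_per_beat, to_boundary)
        # In hardware, the AxLEN field would carry beats - 1.
        bursts.append((addr, burst_bytes // bytes_per_beat))
        addr += burst_bytes
        remaining -= burst_bytes
    return bursts


# Example: 8 KB of weights starting 256 bytes below a 4 KB boundary.
for addr, beats in split_into_bursts(0x0F00, 8192):
    print(f"addr=0x{addr:05X} beats={beats}")
```

The first burst is short because it stops at the 4 KB boundary; after that, a 64-bit bus settles into full 256-beat (2 KB) bursts.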

6. Bandwidth Budgeting Example

ResNet-50 inference, batch=1, INT8, target 1000 IPS:
  Weights to load per inference: 25.6 MB
  Activations (intermediate): ~10 MB
  Total per inference: ~35 MB

Required bandwidth = 35 MB × 1000 IPS = 35 GB/s
  DDR4 (single bank): ~20 GB/s → INSUFFICIENT ✗
  HBM2 (Alveo U280): ~460 GB/s → plenty ✓

OR: keep weights resident, only stream activations
  (25.6 MB exceeds the ~12 MB of BRAM from Section 2, so this relies on pruning + compression)
  Activation BW = 10 MB × 1000 = 10 GB/s → DDR4 OK ✓

Lesson: weight reuse across inferences + on-chip residency turns a memory-bound problem into a compute-bound one.
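The same budget can be scripted so you can vary the target throughput or the memory system. This is a minimal Python sketch using the figures from the example above; the function names are illustrative. It sums 25.6 MB and 10 MB exactly, so it reports 35.6 GB/s where the text rounds to ~35 GB/s.

```python
# Bandwidth budget sketch using decimal units (1 GB = 1000 MB).

def required_gb_per_s(mb_per_inference, inferences_per_s):
    """Sustained bandwidth needed to move mb_per_inference for every inference."""
    return mb_per_inference * inferences_per_s / 1000.0


def is_sufficient(available_gb_per_s, mb_per_inference, inferences_per_s):
    """True if the memory system can sustain the required traffic."""
    return available_gb_per_s >= required_gb_per_s(mb_per_inference, inferences_per_s)


WEIGHTS_MB = 25.6      # ResNet-50, INT8
ACTIVATIONS_MB = 10.0  # approximate intermediate activations
TARGET_IPS = 1000

MEMORIES = {"DDR4 (single bank)": 20.0, "HBM2 (Alveo U280)": 460.0}  # GB/s

for name, bw in MEMORIES.items():
    stream_all = is_sufficient(bw, WEIGHTS_MB + ACTIVATIONS_MB, TARGET_IPS)
    acts_only = is_sufficient(bw, ACTIVATIONS_MB, TARGET_IPS)
    print(f"{name}: stream everything={stream_all}, activations only={acts_only}")
```

These are peak figures; sustained DDR bandwidth is lower in practice, so leave headroom when a result lands close to the limit.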

HBM Changes the Game

FPGAs like the Alveo U280 integrate HBM2 stacks delivering ~460 GB/s, roughly 6–23× the ~20–80 GB/s DDR4 range quoted earlier. This is a large part of why HBM-equipped FPGAs are attractive for datacenter AI inference: they can feed thousands of MACs without starving. For edge devices, the answer is instead to keep the whole (small) model in BRAM.

Day 6 — Key Takeaways

Next — Day 7: Activation Functions in Hardware — ReLU, Leaky ReLU, Sigmoid, Softmax, LUT approximations, and CORDIC for transcendental functions.
