Where the real bottleneck lives. On-chip BRAM vs off-chip DDR4/HBM, bandwidth analysis, ping-pong double buffering, and the AXI4 master interface for streaming weights and activations.
The dirty secret of AI accelerators: compute is rarely the bottleneck — memory is. A modern FPGA has thousands of DSP MACs, but if you can't feed them data fast enough, they sit idle. Understanding the memory hierarchy is what separates a 10% utilized design from a 90% one.
BRAM is your most precious resource for AI. It offers single-cycle access and is massively parallel: every BRAM block can be accessed independently, so on a large device the aggregate bandwidth reaches the terabytes-per-second range.
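To see where a terabytes-per-second figure can come from, here is a back-of-the-envelope estimate. The block count, port width, and clock frequency are illustrative assumptions, not figures for any specific device.

```latex
% Assumed (hypothetical) numbers:
%   N = 1000 BRAM blocks, 2 ports per block,
%   w = 36 bits per port, f = 300 MHz
\[
\begin{aligned}
BW_{\text{BRAM}} &= N \times 2 \times \frac{w}{8} \times f \\
                 &= 1000 \times 2 \times 4.5\,\text{B} \times 300 \times 10^{6}\,\text{s}^{-1} \\
                 &= 2.7 \times 10^{12}\,\text{B/s} \approx 2.7\,\text{TB/s}
\end{aligned}
\]
```

For comparison, a single 64-bit DDR4-2400 channel peaks at 2400 MT/s × 8 B = 19.2 GB/s, so under these assumptions the on-chip figure is roughly 140× higher.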
The roofline model (from Day 1) tells you whether a layer is memory-bound. If a layer's arithmetic intensity (operations per byte moved to or from memory) is below the roofline's ridge point, no amount of extra DSPs will help: you're starved for data.
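A small worked example may help here. The symbols (P for throughput, B for memory bandwidth, I for arithmetic intensity) and the 2 TOPS peak are assumptions chosen for illustration, and may differ from the notation used on Day 1. The bandwidth is the peak of one 64-bit DDR4-2400 channel.

```latex
% Assumed: P_peak = 2 TOPS (hypothetical accelerator),
%          B = 19.2 GB/s (one 64-bit DDR4-2400 channel, peak)
\[
P_{\text{attainable}} = \min\left(P_{\text{peak}},\; I \times B\right),
\qquad
I_{\text{ridge}} = \frac{P_{\text{peak}}}{B}
  = \frac{2 \times 10^{12}}{19.2 \times 10^{9}} \approx 104\ \text{ops/byte}
\]
% Fully connected layer, batch size 1, int8 weights:
% each weight byte is fetched once and used in one MAC (2 ops), so I is about 2 ops/byte.
\[
P_{\text{attainable}} = \min\left(2 \times 10^{12},\; 2 \times 19.2 \times 10^{9}\right)
  = 38.4\ \text{GOPS} \approx 1.9\%\ \text{of peak}
\]
```

Under these assumptions the layer sits far to the left of the ridge point, so the way to speed it up is to raise the intensity (batching, weight reuse from BRAM) or the bandwidth, not to add MACs.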
The trick that hides memory latency: use two buffers. While the compute engine drains buffer A, the DMA fills buffer B from DDR. Then they swap. As long as filling a buffer takes no longer than computing on one, the compute engine never waits.
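A rough steady-state timing model makes the benefit concrete. The symbols and the microsecond figures are illustrative assumptions, not measurements.

```latex
% T_load: time to fill one buffer from DDR
% T_comp: time for the compute engine to consume one buffer
\[
T_{\text{single}} = T_{\text{load}} + T_{\text{comp}},
\qquad
T_{\text{double}} = \max\left(T_{\text{load}},\, T_{\text{comp}}\right)
\]
% Assumed example: T_load = 8 us, T_comp = 10 us
\[
T_{\text{single}} = 8 + 10 = 18\ \mu\text{s},
\qquad
T_{\text{double}} = \max(8,\, 10) = 10\ \mu\text{s}
\quad (1.8\times\ \text{faster per tile})
\]
```

If the load time instead rose to 15 µs, the per-tile time would become 15 µs and the compute engine would idle a third of the time. Double buffering overlaps the two phases, but it cannot make a memory-bound tile compute-bound.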
AXI4 is the standard bus for moving bulk data between the FPGA fabric and DDR/HBM. Your accelerator acts as an AXI4 master, issuing burst read/write requests.
| AXI4 Variant | Use | Key Feature |
|---|---|---|
| AXI4 (full) | Bulk DDR/HBM transfers | Bursts up to 256 beats, high throughput |
| AXI4-Lite | Control registers | Single transfers, simple — start/stop/status |
| AXI4-Stream | Data streaming between blocks | No address, pure dataflow (conv→pool→...) |
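To make the "burst read request" concrete, here is a minimal sketch of the read-address (AR) channel of an AXI4 master. The module name, the `start` and `base_addr` ports, and the parameter defaults are hypothetical. The read-data channel, write channels, and error handling are left out.

```verilog
// axi_burst_read_req.v: hypothetical sketch of the AXI4 AR channel only
module axi_burst_read_req #(
    parameter AW    = 32,   // address width in bits (assumed)
    parameter DW    = 64,   // data width in bits (assumed)
    parameter BEATS = 256   // beats per burst; 256 is the AXI4 INCR maximum
)(
    input  wire          clk,
    input  wire          rst_n,
    input  wire          start,         // one-cycle pulse: request one burst
    input  wire [AW-1:0] base_addr,     // burst start address
    output reg  [AW-1:0] m_axi_araddr,
    output wire [7:0]    m_axi_arlen,   // number of beats minus one
    output wire [2:0]    m_axi_arsize,  // log2(bytes per beat)
    output wire [1:0]    m_axi_arburst, // 2'b01 = INCR
    output reg           m_axi_arvalid,
    input  wire          m_axi_arready
);
    assign m_axi_arlen   = BEATS - 1;
    assign m_axi_arsize  = $clog2(DW/8);
    assign m_axi_arburst = 2'b01;

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            m_axi_arvalid <= 1'b0;
            m_axi_araddr  <= {AW{1'b0}};
        end else if (m_axi_arvalid) begin
            // AXI rule: once ARVALID is high, hold it (and the address)
            // until the slave accepts with ARREADY.
            if (m_axi_arready) m_axi_arvalid <= 1'b0;
        end else if (start) begin
            // Caller must pick base_addr so the burst does not cross a
            // 4 KB boundary (with the defaults, a burst is 2 KB long).
            m_axi_araddr  <= base_addr;
            m_axi_arvalid <= 1'b1;
        end
    end
endmodule
```

The data returned on the read channel (`rdata`, `rvalid`, `rready`, `rlast`) has the same valid/ready/last shape as AXI4-Stream, which is why a DMA can hand it to a stream sink with little extra logic.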
```verilog
// axi_weight_loader.v: AXI4-Stream sink into ping-pong BRAM
module axi_weight_loader #(
    parameter DW    = 64,    // AXI data width (bits)
    parameter DEPTH = 1024   // words per buffer (assumed to be a power of two)
)(
    input  wire          clk,
    input  wire          rst_n,
    // AXI4-Stream input (from DMA reading DDR)
    input  wire [DW-1:0] s_axis_tdata,
    input  wire          s_axis_tvalid,
    output wire          s_axis_tready,
    input  wire          s_axis_tlast,
    // control
    output reg           buffer_ready, // set once the first buffer is full; never cleared here
    output reg           active_buf    // which buffer compute should read
);
    // The compute-side read port is omitted from this listing.
    reg [DW-1:0] bram_a [0:DEPTH-1];
    reg [DW-1:0] bram_b [0:DEPTH-1];
    reg [$clog2(DEPTH)-1:0] wptr;      // write index, 0..DEPTH-1
    reg fill_buf;                      // which buffer DMA is filling

    // Always ready: assumes compute drains one buffer at least as fast as
    // the DMA fills the other, and that each transfer is at most DEPTH words.
    assign s_axis_tready = 1'b1;

    wire wr_en = s_axis_tvalid && s_axis_tready;

    // Memory writes sit in their own clocked block with no reset,
    // which helps synthesis tools infer block RAM.
    always @(posedge clk) begin
        if (wr_en) begin
            if (fill_buf == 1'b0) bram_a[wptr] <= s_axis_tdata;
            else                  bram_b[wptr] <= s_axis_tdata;
        end
    end

    // Write pointer and ping-pong control
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            wptr         <= 0;
            fill_buf     <= 1'b0;
            active_buf   <= 1'b1;
            buffer_ready <= 1'b0;
        end else if (wr_en) begin
            if (s_axis_tlast) begin
                // buffer complete: swap roles (ping-pong)
                wptr         <= 0;
                active_buf   <= fill_buf;   // compute now reads what we just filled
                fill_buf     <= ~fill_buf;  // DMA fills the other one next
                buffer_ready <= 1'b1;
            end else begin
                wptr <= wptr + 1'b1;
            end
        end
    end
endmodule
```

FPGAs like the Alveo U280 integrate HBM2 stacks delivering ~460 GB/s, roughly 10–20× what one or two DDR4 channels provide. This is why HBM-equipped FPGAs are attractive for datacenter AI inference: they can feed thousands of MACs without starving them. For edge devices, the answer is instead to keep the whole (small) model in BRAM.
Next — Day 7: Activation Functions in Hardware — ReLU, Leaky ReLU, Sigmoid, Softmax, LUT approximations, and CORDIC for transcendental functions.