RISC-V + Accelerator · Day 06 of 15

Memory Architecture
Scratchpad, DMA & Cache Coherency

By EcrioniX · Updated June 2026 · ~45 min read
Scratchpad (SPM) · DMA Engine · AXI4 Burst · Cache Coherency · Double-Buffering · TCM · SRAM

Why Memory Architecture Determines Accelerator Performance

The systolic array from Days 4–5 can execute N² MACs per cycle, but it is useless if it is starved for data. Memory bandwidth is almost always the bottleneck in real accelerator deployments. If every MAC fetched both of its operands from memory each cycle, a 128×128 INT8 systolic array would need 128² × 2 = 32,768 bytes of new data per cycle to stay fully utilised. At 1 GHz that is about 32.8 TB/s, far beyond any off-chip memory. The solution is operand reuse inside the array combined with a hierarchy of on-chip memory that feeds it fast enough.
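The arithmetic above can be reproduced in a few lines of C. This is a minimal sketch: the helper names `naive_bytes_per_cycle` and `bandwidth_tb_per_s` are illustrative, and it assumes every MAC fetches two fresh operands from memory each cycle with no reuse.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per cycle if every PE of an n x n array fetched both operands
 * from memory each cycle. Assumption: no operand reuse at all. */
uint64_t naive_bytes_per_cycle(uint64_t n, uint64_t bytes_per_operand) {
    assert(n > 0 && bytes_per_operand > 0);
    return n * n * 2u * bytes_per_operand;
}

/* Convert a per-cycle demand into TB/s (decimal: 1 TB = 1e12 bytes). */
double bandwidth_tb_per_s(uint64_t bytes_per_cycle, double freq_hz) {
    return (double)bytes_per_cycle * freq_hz / 1e12;
}
```

Calling `naive_bytes_per_cycle(128, 1)` gives the 32,768 bytes quoted above, and feeding that into `bandwidth_tb_per_s(..., 1e9)` gives the figure at 1 GHz.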

Roofline Model Rule

Performance is bounded by min(compute_roof, bandwidth × arithmetic_intensity). For an N×N matrix multiply, arithmetic intensity grows linearly with N: 2N³ operations over roughly 3N² bytes of INT8 traffic gives about 2N/3 ops/byte. A scratchpad that fits the working set removes off-chip bandwidth as the bottleneck by keeping data on-chip across multiple reuse passes.

Scratchpad vs Cache

| Property | Cache (L1/L2) | Scratchpad (SPM/TCM) |
|---|---|---|
| Management | Hardware (HW-managed eviction) | Software (explicit DMA load/store) |
| Access latency | Variable (hit = 1–10 cycles, miss = 100+ cycles) | Fixed (1–3 cycles always) |
| Area efficiency | Lower: tag arrays consume 15–20% of the area | Higher: all bits are data |
| Coherency | Hardware coherence protocol (MESI) | Software flush/invalidate |
| Predictability | Low (miss rate varies) | High (deterministic latency) |
| Best for | General-purpose CPU workloads | Accelerators with known access patterns |
| Example | ARM Cortex-A L1D cache | Google TPU v1 Unified Buffer (24 MiB SPM) |

DMA Engine Design

A DMA (Direct Memory Access) engine transfers data between DRAM and the scratchpad without CPU intervention. The accelerator controller configures the DMA (source address, destination, byte count) and the DMA runs autonomously, interrupting or setting a done flag when complete.
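The configure, run and done-flag handshake described above can be modelled in software, for example as a reference model in a testbench. This is a sketch only: `dma_desc_t` and `dma_model_run` are hypothetical names, and the model copies synchronously with `memcpy`, whereas real hardware runs in the background.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor: the three values the controller programs,
 * plus the done flag the engine raises on completion. */
typedef struct {
    const uint8_t *src;   /* source (DRAM) */
    uint8_t       *dst;   /* destination (scratchpad) */
    uint32_t       len;   /* bytes to transfer */
    volatile int   done;  /* set to 1 by the engine when finished */
} dma_desc_t;

/* Stand-in for the hardware: clear done, move the data, raise done. */
void dma_model_run(dma_desc_t *d) {
    assert(d != NULL && d->src != NULL && d->dst != NULL);
    d->done = 0;
    memcpy(d->dst, d->src, d->len);
    d->done = 1;
}
```

On real hardware the controller would poll `done` or wait for the interrupt instead of returning from a function call.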

Verilog — Simple AXI4 read DMA engine
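// Simplifications: arsize/arburst are omitted (assumed fixed at full-width INCR),
// and no burst may cross a 4 KB boundary (e.g. src_addr aligned to 2 KB).
module dma_engine #(parameter AW=32, DW=64, LEN_W=16) (
  input                   clk, rst,
  // Config interface (from accelerator controller)
  input                   start,
  input  [AW-1:0]         src_addr,   // DRAM source (beat-aligned)
  input  [AW-1:0]         dst_addr,   // scratchpad destination
  input  [LEN_W-1:0]      byte_len,   // bytes to transfer (non-zero multiple of DW/8)
  output reg              done,
  // AXI4 Read Address Channel → DRAM
  output reg              arvalid,
  input                   arready,
  output [AW-1:0]         araddr,
  output [7:0]            arlen,      // burst length - 1
  // AXI4 Read Data Channel
  input                   rvalid,
  output                  rready,
  input  [DW-1:0]         rdata,
  input                   rlast,
  // Scratchpad write port
  output reg              spm_wen,
  output reg [AW-1:0]     spm_waddr,
  output reg [DW-1:0]     spm_wdata
);
  localparam IDLE=2'd0, AR=2'd1, RDATA=2'd2, DONE=2'd3;
  localparam BEAT_BYTES = DW/8;       // 8 bytes per beat for DW=64
  reg [1:0]       state;
  reg [AW-1:0]    cur_src, cur_dst;
  reg [LEN_W-1:0] remaining;

  // Beats still to fetch; one AXI4 INCR burst carries at most 256 beats
  wire [LEN_W-1:0] beats_left = remaining / BEAT_BYTES;

  assign araddr = cur_src;
  assign arlen  = (beats_left > 256) ? 8'd255 : beats_left[7:0] - 8'd1;
  assign rready = (state == RDATA);

  always @(posedge clk) begin
    spm_wen <= 1'b0;                  // default: write strobe pulses for one cycle per beat
    if (rst) begin
      state <= IDLE; done <= 1'b0; arvalid <= 1'b0;
    end else case (state)
      IDLE: if (start) begin
        cur_src <= src_addr; cur_dst <= dst_addr;
        remaining <= byte_len; arvalid <= 1'b1; done <= 1'b0; state <= AR;
      end
      AR: if (arready) begin
        arvalid <= 1'b0;
        cur_src <= cur_src + (arlen + 1) * BEAT_BYTES;  // start address of the next burst
        state   <= RDATA;
      end
      RDATA: if (rvalid) begin
        // Register enable, address and data together so they reach the SPM in the same cycle
        spm_wen <= 1'b1; spm_waddr <= cur_dst; spm_wdata <= rdata;
        cur_dst <= cur_dst + BEAT_BYTES; remaining <= remaining - BEAT_BYTES;
        if (rlast) begin
          // 'remaining' still holds the pre-decrement value here
          if (remaining > BEAT_BYTES) begin arvalid <= 1'b1; state <= AR; end
          else state <= DONE;
        end
      end
      DONE: begin done <= 1'b1; state <= IDLE; end   // done stays high until the next start
    endcase
  end
endmodule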

Double-Buffering for Zero-Stall Throughput

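Without double-buffering, DMA and compute take turns: the array idles while a tile loads, and the DMA idles while the array computes. When load time equals compute time, that is 50% utilisation. With double-buffering, two scratchpad banks alternate: while the accelerator reads bank A, the DMA loads the next tile into bank B. When compute finishes, the banks swap roles, which costs only a pointer or select-bit change.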
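A simple timing model makes the utilisation claim concrete. This is a sketch under stated assumptions: every tile takes the same fixed number of cycles to load (`dma`) and to compute (`compute`), bank swaps are free, and the function names are illustrative.

```c
#include <assert.h>

/* Total cycles when each tile is loaded and then computed, one after the other. */
unsigned long serial_cycles(unsigned long tiles, unsigned long dma, unsigned long compute) {
    return tiles * (dma + compute);
}

/* Total cycles with two banks: the first load is exposed, each of the
 * tiles-1 overlapped steps costs max(dma, compute), then the last compute. */
unsigned long double_buffered_cycles(unsigned long tiles, unsigned long dma, unsigned long compute) {
    assert(tiles > 0);
    unsigned long step = dma > compute ? dma : compute;
    return dma + (tiles - 1) * step + compute;
}

/* Fraction of the total time the array spends computing. */
double utilisation(unsigned long tiles, unsigned long compute, unsigned long total) {
    assert(total > 0);
    return (double)(tiles * compute) / (double)total;
}
```

With 100 tiles and 1000 cycles each for load and compute, the serial schedule takes 200,000 cycles (50% utilisation) and the double-buffered one takes 101,000 cycles (about 99%). If the load takes longer than the compute, the `max` term shows the design becoming DMA-bound even with two banks.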

C — Double-buffering control logic
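#include <stddef.h>
#include <stdint.h>

#define SPM_BANK_A  0x50000000UL
#define SPM_BANK_B  0x50080000UL  // second half of scratchpad
#define TILE_BYTES  (128*128)     // one 128×128 INT8 tile

// Platform hooks, assumed to be provided by the accelerator driver:
void dma_start(uintptr_t src, uintptr_t dst, uint32_t bytes);  // returns immediately
void dma_wait(void);                                           // blocks until the DMA is done
void sys_matmul(const int8_t *tile, int32_t *result);          // blocks until compute is done

void double_buffered_matmul(const int8_t *A, int32_t *result_buf, int num_tiles) {
  int8_t *bank[2] = {(int8_t*)SPM_BANK_A, (int8_t*)SPM_BANK_B};
  int cur = 0;

  if (num_tiles <= 0) return;

  // Prefetch first tile into bank[0]
  dma_start((uintptr_t)A, (uintptr_t)bank[0], TILE_BYTES);
  dma_wait();

  for (int t = 0; t < num_tiles; t++) {
    int nxt  = 1 - cur;
    int more = (t + 1 < num_tiles);
    // Prefetch NEXT tile into other bank (overlaps with compute)
    if (more)
      dma_start((uintptr_t)(A + (size_t)(t+1)*TILE_BYTES), (uintptr_t)bank[nxt], TILE_BYTES);

    // Compute on current bank; runs while the DMA loads the next tile
    sys_matmul(bank[cur], result_buf);

    if (more) dma_wait();  // ensure next tile is ready before swap
    cur = nxt;             // swap banks
  }
}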

Cache Coherency Issues

When an accelerator accesses memory directly (via DMA), the CPU's cache may have stale copies of the same data. This causes coherency bugs that are among the hardest to debug in SoC design.

| Scenario | Problem | Fix |
|---|---|---|
| CPU writes A[], accelerator reads A[] | CPU's dirty cache line not yet flushed to DRAM, so the accelerator reads stale data | CPU must flush/clean cache lines for A[] before starting DMA |
| Accelerator writes C[], CPU reads C[] | CPU cache has old C[], so the CPU reads a stale value after the accelerator writes | CPU must invalidate cache lines for C[] after DMA completes |
| Accelerator uses hardware-coherent bus (CCI-500) | None: snooping maintains coherency automatically | No software action needed; area/power cost for coherency |
C — Cache flush/invalidate for coherency (bare-metal RISC-V)
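#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  // assumed D$ line size; buffers should be aligned to it

void sys_matmul(const int8_t *A, int32_t *C);  // accelerator driver call, assumed blocking

// Flush CPU cache lines before DMA reads (D$ → DRAM)
static inline void cache_flush(void *addr, size_t len) {
  uintptr_t a   = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);  // align down to line start
  uintptr_t end = (uintptr_t)addr + len;
  for (; a < end; a += CACHE_LINE)
    __asm__ volatile("cbo.flush (%0)" :: "r"(a) : "memory"); // RISC-V Zicbom extension
  __asm__ volatile("fence" ::: "memory");
}

// Invalidate CPU cache lines so the next CPU read comes from DRAM.
// A partially covered line is discarded whole, so keep buffers line-aligned.
static inline void cache_inval(void *addr, size_t len) {
  uintptr_t a   = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
  uintptr_t end = (uintptr_t)addr + len;
  for (; a < end; a += CACHE_LINE)
    __asm__ volatile("cbo.inval (%0)" :: "r"(a) : "memory");
  __asm__ volatile("fence" ::: "memory");
}

// Correct usage pattern:
void safe_accelerator_run(int8_t *A, int32_t *C, size_t n) {
  size_t c_bytes = n*n*sizeof(int32_t);
  cache_flush(A, n*n);      // ensure A is in DRAM before DMA reads
  cache_inval(C, c_bytes);  // drop dirty C lines so they cannot be written back over the results
  sys_matmul(A, C);         // run accelerator
  cache_inval(C, c_bytes);  // discard any C lines cached or prefetched during the run
  // After return: C in DRAM. CPU will re-fetch from DRAM on next access.
}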
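Because the loops above work on whole cache lines, it helps to check how many lines a buffer touches and whether it shares a line with unrelated data. The helpers below are a host-runnable sketch; the names `cache_lines_touched` and `is_line_aligned` are illustrative, and the 64-byte line size is the same assumption used above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* assumed line size, matching the loops above */

/* Number of cache lines a flush/invalidate loop touches for [addr, addr+len). */
size_t cache_lines_touched(uintptr_t addr, size_t len) {
    if (len == 0) return 0;
    assert(addr + len - 1 >= addr);  /* the range must not wrap around */
    uintptr_t first = addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t last  = (addr + len - 1) & ~(uintptr_t)(CACHE_LINE - 1);
    return (size_t)((last - first) / CACHE_LINE) + 1;
}

/* True when the buffer starts and ends on line boundaries, so invalidating
 * it cannot discard unrelated data that shares a line with the buffer. */
int is_line_aligned(uintptr_t addr, size_t len) {
    return (addr % CACHE_LINE == 0) && (len % CACHE_LINE == 0);
}
```

A 64-byte buffer at offset 0x20 within a line touches two lines, which is why DMA buffers are usually allocated with cache-line alignment and padded to a whole number of lines.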

Day 6 — Interview Questions

Q1. Why do AI accelerators use scratchpad memory instead of caches?
Caches are designed for general-purpose workloads with unpredictable access patterns. AI accelerators have highly structured, predictable access patterns (tiled matrix multiply, convolution) that software can schedule explicitly. A scratchpad gives: (1) deterministic latency (no miss penalty), (2) higher area efficiency (no tag arrays, saving roughly 15–20% area), (3) simpler coherency (software-managed, so no hardware protocol overhead), and (4) higher bandwidth (it can be built as multi-bank SRAM with parallel access). Google's TPU v1 uses a 24 MiB software-managed Unified Buffer, comparable in size to a large L3 cache, as its primary on-chip memory for activations.
Q2. What is double-buffering in an accelerator context and why does it matter?
Double-buffering uses two scratchpad banks (A and B) that alternate roles. While the accelerator computes on data in bank A, the DMA simultaneously loads the next data tile into bank B. When compute finishes, the banks swap roles instantly (just a pointer swap). As long as a tile loads no slower than it is computed, this hides DMA latency behind computation and achieves near-100% compute utilisation, compared to ~50% without double-buffering when load and compute times are equal. The cost is 2× scratchpad area. Triple-buffering (adding a third bank) helps absorb variable DMA latency, but it cannot help when the average load time exceeds compute time, because the design is then bandwidth-bound. Two banks are the standard choice for most accelerators.
Q3. What is a cache coherency bug in accelerator design and how do you prevent it?
A cache coherency bug occurs when the CPU and accelerator have inconsistent views of the same memory. Example: the CPU writes input array A to memory, but the write is held in the CPU's dirty L1 cache. The DMA then reads A from DRAM and sees stale data because the dirty cache line was never written back. Result: the accelerator computes with wrong inputs, and the bug is data-dependent and non-deterministic. Prevention: (1) The CPU must write back dirty cache lines for the input buffer to DRAM before starting DMA (cbo.flush or cbo.clean on RISC-V, DC CIVAC or DC CVAC on AArch64). (2) The CPU must invalidate cache lines for the output buffer after DMA writes, so that it re-fetches from DRAM (cbo.inval). (3) Use a hardware-coherent interconnect (CCI-400/500 with ARM ACE), which handles this automatically via snooping.
Q4. What is AXI4 burst transfer and why is it important for DMA?
An AXI4 burst lets a single address transaction transfer up to 256 data beats (for INCR bursts). Instead of sending one address for each 64-bit data word, the DMA sends one AR (read address) transaction with arlen=N-1, then receives N consecutive rdata beats from the memory controller. This dramatically reduces address-channel traffic: transferring 256 × 64 bits = 2 KB requires 1 arvalid handshake instead of 256. For a DMA moving megabytes of matrix data, burst mode is essential; without it, the address channel becomes the bottleneck. INCR (incrementing) is the natural burst type for linear DMA transfers. A burst must not cross a 4 KB address boundary, so the DMA splits longer transfers into several bursts.
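How a transfer breaks down into bursts can be sketched as a small planner. This is an illustrative model, not part of any AXI library: `plan_bursts` is a hypothetical helper that assumes a 64-bit data bus, beat-aligned addresses and lengths, the 256-beat INCR limit, and the AXI rule that a burst must not cross a 4 KB boundary.

```c
#include <assert.h>
#include <stdint.h>

#define BEAT_BYTES     8u     /* 64-bit data bus */
#define MAX_BEATS      256u   /* AXI4 INCR burst limit */
#define BOUNDARY_BYTES 4096u  /* a burst must not cross a 4 KB boundary */

/* Fills arlen_out[] with the ARLEN value (beats - 1) of each burst and
 * returns the number of bursts, i.e. the number of AR handshakes. */
unsigned plan_bursts(uint64_t addr, uint64_t bytes, uint8_t *arlen_out, unsigned max_bursts) {
    assert(addr % BEAT_BYTES == 0 && bytes % BEAT_BYTES == 0);
    unsigned count = 0;
    while (bytes > 0) {
        uint64_t to_boundary = BOUNDARY_BYTES - (addr % BOUNDARY_BYTES);
        uint64_t chunk = bytes;
        if (chunk > MAX_BEATS * BEAT_BYTES) chunk = MAX_BEATS * BEAT_BYTES;
        if (chunk > to_boundary)            chunk = to_boundary;
        assert(count < max_bursts);      /* caller's array must be large enough */
        arlen_out[count++] = (uint8_t)(chunk / BEAT_BYTES - 1);
        addr  += chunk;
        bytes -= chunk;
    }
    return count;
}
```

A 16 KB transfer from an aligned address needs 8 bursts of 256 beats, against 2048 single-beat address handshakes without bursting. A 512-byte transfer starting at 0xF00 is split in two at the 4 KB boundary.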
Q5. How do you size the scratchpad for an N×N systolic array?
The scratchpad must hold the working set for at least one compute tile: (1) Weight tile: N×N elements (held in the PEs during compute; the scratchpad only stages the next tile), (2) Activation tile: N×K elements (K = inner dimension), (3) Output tile: N×M elements (accumulated partial sums). For double-buffering, multiply the activation and output buffers by 2. A 128×128 INT8 array with K=M=128 needs: activations = 2×128×128×1 B = 32 KB, outputs = 2×128×128×4 B = 128 KB (INT32), plus a weight staging buffer = 128×128×1 B = 16 KB. Total ≈ 176 KB minimum. Practical designs add margin (2–4×) for flexibility: 512 KB to 1 MB of scratchpad for a 128×128 array.
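The sizing breakdown above can be written as a small calculator. This is a sketch that follows the same assumptions: activations and outputs are double-buffered, the weight buffer is single-buffered, and the function name `spm_min_bytes` is illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Minimum scratchpad bytes for an n x n array with inner dimension k and
 * output width m. in_bytes is the operand size (1 for INT8), acc_bytes is
 * the accumulator size (4 for INT32). */
size_t spm_min_bytes(size_t n, size_t k, size_t m, size_t in_bytes, size_t acc_bytes) {
    assert(n > 0 && k > 0 && m > 0);
    size_t activations = 2 * n * k * in_bytes;   /* double-buffered */
    size_t outputs     = 2 * n * m * acc_bytes;  /* double-buffered */
    size_t weights     = n * n * in_bytes;       /* single staging buffer */
    return activations + outputs + weights;
}
```

For n = k = m = 128 with INT8 inputs and INT32 accumulators this returns 180,224 bytes, which is the 176 KB figure in the answer.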
Q6. What is the roofline model and how does it apply to accelerator memory design?
The roofline model sets an upper bound on performance: perf = min(peak_compute, bandwidth × arithmetic_intensity). Arithmetic intensity (ops/byte) is the key metric: it tells you whether the workload is compute-bound or memory-bound. For an N×N matrix multiply, intensity = 2N³ ops per 3N² bytes moved ≈ 2N/3 ops/byte. For N=128, intensity ≈ 85 ops/byte. The compute roof is 128² × 2 × freq; at 1 GHz that is about 32.8 TOPS. With 1 TB/s of on-chip scratchpad bandwidth, the bandwidth roof is 85 × 1 TB/s = 85 TOPS, so the design is compute-bound, which is ideal. Going off-chip (HBM ≈ 1 TB/s), the same calculation still holds for N≥64. The design goal is to keep the working set in the scratchpad so bandwidth stays at the on-chip level.
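The roofline numbers can be checked with a short calculation. This sketch assumes an N×N×N INT8 matmul that performs 2N³ operations and moves 3N² bytes (two operand matrices plus one result at 1 byte per element); the function names are illustrative.

```c
#include <assert.h>

/* Arithmetic intensity in ops/byte: 2n^3 operations over 3n^2 bytes. */
double matmul_intensity(double n) {
    assert(n > 0.0);
    return (2.0 * n * n * n) / (3.0 * n * n);
}

/* Peak ops/s of an n x n MAC array: 2 ops (multiply + add) per PE per cycle. */
double compute_roof(double n, double freq_hz) {
    return n * n * 2.0 * freq_hz;
}

/* Roofline bound: min(compute roof, bandwidth x intensity). */
double roofline(double peak_ops, double bw_bytes_per_s, double intensity) {
    double bw_roof = bw_bytes_per_s * intensity;
    return bw_roof < peak_ops ? bw_roof : peak_ops;
}
```

For N = 128 at 1 GHz with 1 TB/s of bandwidth, `roofline` returns the compute roof, so the design is compute-bound. Dropping the bandwidth to 100 GB/s makes the bandwidth term the smaller one, and the same array becomes memory-bound.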