Why Memory Architecture Determines Accelerator Performance
The systolic array from Days 4–5 can execute N² MACs per cycle, but it is useless if it is starved for data. Memory bandwidth is almost always the bottleneck in real accelerator deployments. With no on-chip operand reuse, a 128×128 INT8 systolic array would need 128² × 2 = 32,768 bytes of new data per cycle to stay fully utilised. At 1 GHz that is about 32.8 TB/s, far beyond any off-chip memory. The solution is a hierarchy of on-chip memory that captures operand reuse and feeds the array fast enough.
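The arithmetic above can be checked with a short helper. This is a minimal sketch of the worst case only: it assumes every MAC fetches two fresh operands each cycle, and the function name `naive_feed_bandwidth` is made up for illustration.

```c
#include <stdint.h>

/* Bytes per second needed to feed an n x n MAC array when every MAC
 * fetches two fresh operands per cycle (assumption: no on-chip reuse). */
static double naive_feed_bandwidth(uint32_t n, uint32_t bytes_per_operand,
                                   double clock_hz)
{
    double bytes_per_cycle = (double)n * (double)n * 2.0 * (double)bytes_per_operand;
    return bytes_per_cycle * clock_hz;
}
```

Calling `naive_feed_bandwidth(128, 1, 1e9)` reproduces the 32,768 bytes per cycle at 1 GHz quoted above. Any reuse inside the array or the scratchpad divides this figure by the reuse factor.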
Performance is bounded by min(compute_roof, bandwidth × arithmetic_intensity). For an N×N matrix multiply, arithmetic intensity grows linearly with N: roughly 2N/3 ops/byte for INT8 operands if each matrix crosses the memory interface only once. A scratchpad that fits the working set removes off-chip bandwidth as the bottleneck by keeping data on-chip across multiple reuse passes.
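The roofline bound can be written down directly. The sketch below is illustrative rather than a model of any particular chip; `matmul_intensity` assumes 1-byte operands and ideal reuse, counting 2N³ operations over 3N² bytes, which grows linearly with N.

```c
/* Attainable throughput in ops/s under the roofline model:
 * the lower of the compute roof and the memory roof. */
static double roofline_ops_per_s(double compute_roof, double bandwidth,
                                 double intensity)
{
    double memory_roof = bandwidth * intensity;   /* bytes/s * ops/byte */
    return memory_roof < compute_roof ? memory_roof : compute_roof;
}

/* Arithmetic intensity (ops/byte) of an n x n matmul with 1-byte operands,
 * assuming each of the three matrices is moved exactly once. */
static double matmul_intensity(double n)
{
    return (2.0 * n * n * n) / (3.0 * n * n);
}
```

With a 100 GB/s memory interface, an intensity of 10 ops/byte caps throughput at 1 Tops/s regardless of array size. Raising the intensity through on-chip reuse is what moves the design onto the compute roof.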
Scratchpad vs Cache
| Property | Cache (L1/L2) | Scratchpad (SPM/TCM) |
|---|---|---|
| Management | Hardware (HW-managed eviction) | Software (explicit DMA load/store) |
| Access latency | Variable (hit=1–10cyc, miss=100+cyc) | Fixed (1–3 cycles always) |
| Area efficiency | Lower — tag arrays consume 15–20% area | Higher — all bits are data |
| Coherency | Hardware coherence protocol (MESI) | Software flush/invalidate |
| Predictability | Low (miss rate varies) | High (deterministic latency) |
| Best for | General-purpose CPU workloads | Accelerators with known access patterns |
| Example | ARM Cortex-A L1D cache | Google TPU v1 Unified Buffer (24 MiB SPM) |
DMA Engine Design
A DMA (Direct Memory Access) engine transfers data between DRAM and the scratchpad without CPU intervention. The accelerator controller configures the DMA (source address, destination, byte count) and the DMA runs autonomously, interrupting or setting a done flag when complete.
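The configure, run, and done-flag handshake described above can be modelled in a few lines of C before committing to RTL. This is a software sketch only: `dma_model_t` and its functions are hypothetical names, and each call to `dma_model_step` stands in for one 64-bit (8-byte) data beat.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical software model of a DMA channel (not a real driver API). */
typedef struct {
    const uint8_t *src;      /* source address (DRAM side) */
    uint8_t       *dst;      /* destination address (scratchpad side) */
    uint32_t       byte_len; /* bytes still to transfer */
    int            done;     /* set once the transfer has finished */
} dma_model_t;

/* Controller side: program source, destination and length, clear done. */
static void dma_model_start(dma_model_t *d, const void *src, void *dst,
                            uint32_t len)
{
    d->src = src;
    d->dst = dst;
    d->byte_len = len;
    d->done = 0;
}

/* Engine side: move one beat of up to 8 bytes.
 * Returns 1 while more beats remain, 0 once the transfer is complete. */
static int dma_model_step(dma_model_t *d)
{
    uint32_t n = d->byte_len < 8 ? d->byte_len : 8;
    memcpy(d->dst, d->src, n);
    d->src += n;
    d->dst += n;
    d->byte_len -= n;
    if (d->byte_len == 0)
        d->done = 1;
    return d->byte_len != 0;
}
```

The controller never touches the data itself: it calls `dma_model_start` once and then polls `done`, which mirrors how the hardware engine runs autonomously after configuration.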
```verilog
// Assumes byte_len is a nonzero multiple of 8 and src_addr is 2 KB-aligned,
// so no burst crosses an AXI 4 KB boundary.
module dma_engine #(parameter AW=32, DW=64, LEN_W=16) (
    input                  clk, rst,
    // Config interface (from accelerator controller)
    input                  start,
    input      [AW-1:0]    src_addr,   // DRAM source
    input      [AW-1:0]    dst_addr,   // scratchpad destination
    input      [LEN_W-1:0] byte_len,   // bytes to transfer
    output reg             done,
    // AXI4 Read Address Channel → DRAM
    output reg             arvalid,
    input                  arready,
    output     [AW-1:0]    araddr,
    output     [7:0]       arlen,      // burst length - 1
    // AXI4 Read Data Channel
    input                  rvalid,
    output                 rready,
    input      [DW-1:0]    rdata,
    input                  rlast,
    // Scratchpad write port
    output reg             spm_wen,
    output reg [AW-1:0]    spm_waddr,
    output reg [DW-1:0]    spm_wdata   // registered so data lines up with spm_wen
);
    localparam IDLE=2'd0, AR=2'd1, RDATA=2'd2, DONE=2'd3;
    reg  [1:0]       state;
    reg  [AW-1:0]    cur_src, cur_dst;
    reg  [LEN_W-1:0] remaining;
    wire [LEN_W-1:0] beats_left = remaining >> 3;   // 8 bytes per 64-bit beat

    assign araddr = cur_src;
    // AXI4 INCR bursts carry at most 256 beats, so clamp ARLEN to 255
    assign arlen  = (beats_left >= 256) ? 8'd255 : beats_left[7:0] - 8'd1;
    assign rready = (state == RDATA);

    always @(posedge clk) begin
        if (rst) begin
            state <= IDLE; done <= 0; arvalid <= 0; spm_wen <= 0;
        end else begin
            spm_wen <= 0;   // default: one-cycle write strobe per accepted beat
            case (state)
            IDLE: if (start) begin
                cur_src   <= src_addr;
                cur_dst   <= dst_addr;
                remaining <= byte_len;
                arvalid   <= 1;
                done      <= 0;
                state     <= AR;
            end
            AR: if (arready) begin arvalid <= 0; state <= RDATA; end
            RDATA: if (rvalid) begin
                spm_wen   <= 1;
                spm_waddr <= cur_dst;
                spm_wdata <= rdata;
                cur_src   <= cur_src + 8;   // advance so the next burst starts where this one ends
                cur_dst   <= cur_dst + 8;
                remaining <= remaining - 8;
                if (rlast) begin
                    if (remaining > 8) begin arvalid <= 1; state <= AR; end
                    else state <= DONE;
                end
            end
            DONE: begin done <= 1; state <= IDLE; end
            endcase
        end
    end
endmodule
```
Double-Buffering for Zero-Stall Throughput
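Without double-buffering, compute and DMA take turns: the array idles during each load and the DMA idles during each compute, so utilisation falls to about 50% when load and compute times are equal. With double-buffering, two scratchpad banks alternate: while the accelerator reads bank A, the DMA loads the next tile into bank B. When compute finishes, the banks swap instantly, and the array never stalls as long as a tile load takes no longer than a tile compute.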
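A simple cycle-count model makes the gain concrete. It is a sketch under idealised assumptions: fixed load and compute times per tile, no DMA setup cost, and a free bank swap. Both function names are made up for illustration.

```c
#include <stdint.h>

/* Single buffer: every tile pays for its load and its compute back to back. */
static uint64_t cycles_single_buffered(uint64_t tiles, uint64_t load,
                                       uint64_t compute)
{
    return tiles * (load + compute);
}

/* Two banks: only the first load and the last compute are exposed.
 * In between, each step costs whichever of load or compute is slower. */
static uint64_t cycles_double_buffered(uint64_t tiles, uint64_t load,
                                       uint64_t compute)
{
    if (tiles == 0)
        return 0;
    uint64_t steady = load > compute ? load : compute;
    return load + (tiles - 1) * steady + compute;
}
```

For 10 tiles with 100-cycle loads and 100-cycle computes, the single-buffered schedule takes 2000 cycles and the double-buffered one 1100. When loads are slower than computes, the steady-state term is set by the load time and the array still stalls, which is why the tile size has to be chosen against the available DMA bandwidth.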
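```c
#include <stddef.h>
#include <stdint.h>

#define SPM_BANK_A 0x50000000UL
#define SPM_BANK_B 0x50080000UL   // second half of scratchpad
#define TILE_BYTES (128*128)      // one 128×128 INT8 tile

// Platform helpers, assumed to be provided elsewhere:
void dma_start(uint64_t src, uint64_t dst, uint32_t bytes);   // non-blocking
void dma_wait(void);                                          // blocks until DMA done
void sys_matmul(const int8_t *tile, int32_t *result);         // blocks until compute done

void double_buffered_matmul(const int8_t *A, int32_t *result_buf, int num_tiles) {
    uint64_t *bank[2] = {(uint64_t*)SPM_BANK_A, (uint64_t*)SPM_BANK_B};
    int cur = 0;
    if (num_tiles <= 0) return;

    // Prefetch first tile into bank[0]
    dma_start((uintptr_t)A, (uintptr_t)bank[0], TILE_BYTES);
    dma_wait();

    for (int t = 0; t < num_tiles; t++) {
        int nxt = 1 - cur;
        int more = (t + 1 < num_tiles);

        // Prefetch NEXT tile into other bank (overlaps with compute)
        if (more)
            dma_start((uintptr_t)(A + (size_t)(t + 1) * TILE_BYTES),
                      (uintptr_t)bank[nxt], TILE_BYTES);

        // Compute on current bank; runs while DMA loads next tile
        sys_matmul((const int8_t*)bank[cur], result_buf);

        if (more) dma_wait();   // ensure next tile is ready before swap
        cur = nxt;              // swap banks
    }
}
```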
Cache Coherency Issues
When an accelerator accesses memory directly (via DMA), the CPU's cache may have stale copies of the same data. This causes coherency bugs that are among the hardest to debug in SoC design.
| Scenario | Problem | Fix |
|---|---|---|
| CPU writes A[], accelerator reads A[] | CPU's dirty cache line not yet flushed to DRAM — accelerator reads stale data | CPU must flush/clean cache lines for A[] before starting DMA |
| Accelerator writes C[], CPU reads C[] | CPU cache has old C[] — reads stale value after accelerator writes | CPU must invalidate cache lines for C[] after DMA completes |
| Accelerator uses hardware-coherent bus (CCI-500) | None — snooping maintains coherency automatically | No software action needed; area/power cost for coherency |
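The first two rows of the table can be reproduced with a toy model of a single write-back cache line. This is purely illustrative: `toy_line_t` and the helper names are invented here, and real caches track many lines with tags and replacement state.

```c
#include <stdint.h>

/* One cache line's worth of state: the DRAM copy the accelerator sees
 * and the private copy held in the CPU's write-back cache. */
typedef struct {
    uint8_t dram;     /* value in DRAM (what DMA reads and writes) */
    uint8_t cached;   /* value in the CPU cache */
    int     valid;    /* CPU cache holds a copy */
    int     dirty;    /* CPU copy is newer than DRAM */
} toy_line_t;

/* CPU store: hits the cache only, DRAM is not updated yet. */
static void cpu_write(toy_line_t *l, uint8_t v)
{
    l->cached = v; l->valid = 1; l->dirty = 1;
}

/* CPU load: served from the cache if valid, otherwise fetched from DRAM. */
static uint8_t cpu_read(toy_line_t *l)
{
    if (!l->valid) { l->cached = l->dram; l->valid = 1; l->dirty = 0; }
    return l->cached;
}

/* Clean: write a dirty line back to DRAM. */
static void cpu_clean(toy_line_t *l)
{
    if (l->valid && l->dirty) { l->dram = l->cached; l->dirty = 0; }
}

/* Invalidate: drop the CPU copy so the next load goes to DRAM. */
static void cpu_invalidate(toy_line_t *l)
{
    l->valid = 0; l->dirty = 0;
}

/* The accelerator bypasses the CPU cache entirely. */
static uint8_t accel_read(const toy_line_t *l) { return l->dram; }
static void    accel_write(toy_line_t *l, uint8_t v) { l->dram = v; }
```

In this model, `accel_read` after `cpu_write` returns the old DRAM value until `cpu_clean` runs, and `cpu_read` after `accel_write` returns the old cached value until `cpu_invalidate` runs. Those are exactly the two failure modes and fixes listed in the table.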
```c
#include <stddef.h>
#include <stdint.h>

void sys_matmul(const int8_t *A, int32_t *C);   // assumed to block until the accelerator is done

// Flush CPU cache lines before DMA reads (D$ → DRAM)
static inline void cache_flush(void *addr, size_t len) {
    uintptr_t a = (uintptr_t)addr & ~(uintptr_t)63;   // align to 64-byte cache line
    for (; a < (uintptr_t)addr + len; a += 64)
        __asm__ volatile("cbo.flush (%0)" :: "r"(a) : "memory");   // RISC-V Zicbom extension
    __asm__ volatile("fence" ::: "memory");
}

// Invalidate CPU cache lines after DMA writes (DRAM → D$)
// Buffers should be 64-byte aligned and padded: invalidating a partial line
// discards any unrelated data that shares it.
static inline void cache_inval(void *addr, size_t len) {
    uintptr_t a = (uintptr_t)addr & ~(uintptr_t)63;
    for (; a < (uintptr_t)addr + len; a += 64)
        __asm__ volatile("cbo.inval (%0)" :: "r"(a) : "memory");
    __asm__ volatile("fence" ::: "memory");
}

// Correct usage pattern:
void safe_accelerator_run(int8_t *A, int32_t *C, size_t n) {
    cache_flush(A, n*n);                    // ensure A is in DRAM before DMA reads
    cache_flush(C, n*n*sizeof(int32_t));    // evict dirty C lines so a later write-back cannot clobber results
    sys_matmul(A, C);                       // run accelerator
    cache_inval(C, n*n*sizeof(int32_t));    // drop stale C lines after the accelerator has written
    // C is now in DRAM only. CPU will re-fetch from DRAM on next access.
}
```