The Optimisation Goal
A naive tiling loop stalls the systolic array every tile while waiting for the DMA to load the next one. The goal of optimisation is to eliminate idle time by overlapping compute and data movement so the array stays busy nearly 100% of cycles.
Without double-buffering:
time_per_tile = T_compute + T_dma (serial: compute then load)
With double-buffering:
time_per_tile = max(T_compute, T_dma) (parallel: compute while loading)
Speedup = (T_compute + T_dma) / max(T_compute, T_dma)
For T_compute ≈ T_dma: Speedup ≈ 2×
C — Double-buffered tiling loop
#define P 16 // tile size = array dimension #define SPM_A ((int8_t*)0x20000000) // scratchpad bank A #define SPM_B ((int8_t*)0x20004000) // scratchpad bank B void tiled_matmul_db(int8_t *A, int8_t *B, int32_t *C, int M, int N, int K) { int8_t *banks[2] = { SPM_A, SPM_B }; int bank = 0; for (int mi = 0; mi < M; mi += P) { for (int ki = 0; ki < K; ki += P) { // Prologue: DMA load first tile into bank[0], no compute yet dma_load_tile(A, mi, 0, banks[0], P, N); dma_wait(); for (int ni = 0; ni < N; ni += P) { int next_ni = ni + P; int next_bank = 1 - bank; // Start DMA for NEXT tile into next bank (async) if (next_ni < N) dma_load_tile_async(A, mi, next_ni, banks[next_bank], P, N); // Compute on CURRENT tile in current bank accel_load_weight(B + ni*K + ki, P); accel_run(banks[bank], C + mi*K + ki, P); accel_wait(); // Wait for DMA to finish before next iteration dma_wait(); bank = next_bank; } } } }
INT8 vs FP32 Comparison
| Metric | FP32 | INT8 | INT4 |
|---|---|---|---|
| Bits per weight | 32 | 8 | 4 |
| Memory bandwidth | 1× | 4× less | 8× less |
| Compute density | 1× | 4× more MACs/cycle | 8× more MACs/cycle |
| Power | 1× | ~3× lower | ~5× lower |
| Accuracy drop | 0% | <1% (with calibration) | 1–3% (QAT needed) |
| Typical use | Training | Edge inference | Extreme edge / IoT |
Day 11 — Interview Questions
Q1What is loop tiling and why does it improve accelerator performance?
Loop tiling (also called loop blocking) reorders the iteration space of a nested loop to work on small sub-matrices (tiles) that fit in fast on-chip memory (scratchpad or cache) instead of streaming through DRAM for every element access. For matrix multiply C = A×B, the naive triple-loop reads each element of B N times from DRAM. With tiling (tile size P): load a P×P tile of B into the scratchpad, reuse it for all P rows of A in the same tile, then move to the next tile. This converts N DRAM reads per B element into 1 DRAM read per B element, reducing memory bandwidth by N/P. The result is more compute per byte fetched — higher arithmetic intensity and better roofline position.
Q2Explain double-buffering and how it eliminates memory stalls.
Double-buffering uses two scratchpad banks alternating between compute and load roles. In each tile iteration: the DMA asynchronously loads the next tile into the idle bank while the systolic array computes on the current bank. At the end of each tile, the banks swap roles. This hides DMA latency behind compute latency — if both take the same time, the systolic array is never idle. The hardware requirement is two independently addressable scratchpad banks and a DMA that can transfer to one bank while the array reads from another (no bank conflicts). In software, the key is issuing the DMA transfer before starting compute (dma_load_tile_async → accel_run → dma_wait), not after (which reintroduces the serial stall).
Q3What is post-training quantisation (PTQ) and how does it enable INT8 inference?
Post-training quantisation converts FP32 model weights and activations to INT8 without retraining. For each tensor, it determines a scale factor (S) and zero point (Z): x_int8 = round(x_fp32 / S) + Z. The scale is calibrated using a representative dataset: run a few batches through the model, record the min/max (or percentile range) of each activation, and compute S = (max - min) / 255. During inference: load INT8 weights (4× smaller), perform INT8 MACs (4× more compute density), dequantise the result back to FP32 (or INT32) before the next layer's scale calibration. Accuracy loss is typically under 1% for CNNs with per-channel weight quantisation. For more aggressive INT4, quantisation-aware training (QAT) is needed during the original training to minimise accuracy loss.
Q4How do you determine the optimal tile size P for a systolic array?
The optimal P balances three constraints: (1) Hardware: P must equal the systolic array dimension (a 16×16 array requires P=16; larger tiles need re-use via the tiling loop), (2) Scratchpad capacity: 2×P×P bytes (double-buffered A tile) + P×P bytes (B weights) must fit in the scratchpad. For P=16 with INT8: 2×256 + 256 = 768 bytes, (3) DMA/compute balance: T_dma(P) ≈ T_compute(P) for maximum double-buffering efficiency. T_compute = 3P−1 cycles; T_dma = P² / bandwidth × freq. Solve for P: P = bandwidth × (3P-1) / freq × P² → iterate numerically. In practice, P = array_size is forced by the hardware; the software tiling loop handles matrices of any size using P×P chunks.
Q5What is the roofline model and how do you use it to guide optimisation?
The roofline model plots achievable performance (GFlops/s, y-axis) against arithmetic intensity (FLOPs/byte, x-axis) for a given platform. Two lines form the "roof": the horizontal line at peak compute (N² MACs/cycle × freq) and the diagonal at peak memory bandwidth × arithmetic intensity. Any kernel's actual performance sits at or below the intersection of both roofs. To use it: (1) Measure the kernel's arithmetic intensity (FLOPs / bytes of DRAM traffic), (2) Plot it on the roofline — if it's to the left of the ridge point, it's memory-bound; to the right, it's compute-bound, (3) For memory-bound: increase data reuse via tiling; for compute-bound: increase array size or clock. The model tells you the maximum possible speedup from each type of optimisation before you implement it.
Q6What is software pipelining and how does it apply to the tiling loop?
Software pipelining reorders loop iterations so that multiple stages of successive iterations overlap in time — similar to a hardware pipeline but managed by software. In the tiling loop: stage 1 = DMA load, stage 2 = accelerator compute. Naive loop executes them serially: load(i) → compute(i) → load(i+1) → compute(i+1). Software-pipelined version: load(i) → [compute(i) + load(i+1)] → [compute(i+1) + load(i+2)] → ... → compute(last). The first iteration (prologue) and last iteration (epilogue) don't overlap, but all middle iterations do. This is exactly what double-buffering implements in the tiling loop. The software pipelining depth equals the number of concurrent stages — with DMA + compute, depth = 2. Adding prefetch + DMA + compute gives depth = 3, hiding both fetch and transfer latency.