From CNNs to the architecture behind ChatGPT. Self-attention in hardware: QKV matmuls, scaled dot-product, hardware softmax, and multi-head parallelism for BERT and Vision Transformer inference.
Transformers power modern AI — BERT for language, ViT for vision, GPT/LLaMA for generation. Their core is the attention mechanism, and the good news for FPGA designers is that attention is mostly matrix multiplication — the exact operation you already accelerate with the GEMM engine (Day 3) and systolic array (Day 4). The new ingredients are a large intermediate score matrix and a softmax.
Three of the four steps in attention are GEMMs. Your Day 4 systolic array runs them directly. The only genuinely new hardware is the softmax (which builds on Day 7's activation work) and the buffering for the N×N score matrix (Day 6 memory). Transformers are far more FPGA-friendly than their reputation suggests.
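Self-attention lets every token look at every other token and decide what's relevant. It's computed in four steps from the input sequence X (N tokens × d dimensions): project X into queries, keys, and values (Q, K, V); compute the scaled scores S = Q·Kᵀ/√d; apply a row-wise softmax to S to get the attention weights A; and form the context A·V.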
The three projections and the two attention matmuls all run on the Day 4 systolic array. You reuse the same MAC fabric, just reloading different weight matrices and routing different operands.
| Step | Operation | Dimensions | Hardware |
|---|---|---|---|
| Q projection | X · Wq | [N×d]·[d×d] | Systolic array (weight-stationary Wq) |
| K projection | X · Wk | [N×d]·[d×d] | Systolic array (weight-stationary Wk) |
| V projection | X · Wv | [N×d]·[d×d] | Systolic array (weight-stationary Wv) |
| Scores | Q · Kᵀ | [N×d]·[d×N] | Systolic array (Kᵀ streamed) |
| Context | A · V | [N×N]·[N×d] | Systolic array (V weight-stationary) |
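As a software reference for the rows above, here is a minimal single-head sketch in plain floating-point C++. It is a golden model to compare hardware output against, not the accelerator itself: the `Mat` type and the `matmul` and `attention` functions are illustrative names introduced here, and the 1/√d scaling assumes a single head of width d.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major matrix in a flat vector (hypothetical helper, not from the series).
struct Mat {
    std::size_t rows, cols;
    std::vector<float> v;
    Mat(std::size_t r, std::size_t c) : rows(r), cols(c), v(r * c, 0.0f) {}
    float& at(std::size_t i, std::size_t j) { return v[i * cols + j]; }
    float at(std::size_t i, std::size_t j) const { return v[i * cols + j]; }
};

// Returns A*B, or A*B^T when transpose_b is true.
Mat matmul(const Mat& a, const Mat& b, bool transpose_b = false) {
    const std::size_t out_cols = transpose_b ? b.rows : b.cols;
    Mat c(a.rows, out_cols);
    for (std::size_t i = 0; i < a.rows; ++i) {
        for (std::size_t j = 0; j < out_cols; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < a.cols; ++k)
                acc += a.at(i, k) * (transpose_b ? b.at(j, k) : b.at(k, j));
            c.at(i, j) = acc;
        }
    }
    return c;
}

// Single-head self-attention: X is N x d, each weight matrix is d x d.
Mat attention(const Mat& x, const Mat& wq, const Mat& wk, const Mat& wv) {
    // Step 1: Q, K, V projections (three GEMMs).
    const Mat q = matmul(x, wq);
    const Mat k = matmul(x, wk);
    const Mat v = matmul(x, wv);

    // Step 2: raw scores Q*K^T (one GEMM); the 1/sqrt(d) scale is applied below.
    Mat s = matmul(q, k, true);
    const float scale = 1.0f / std::sqrt(static_cast<float>(q.cols));

    // Step 3: numerically stable row-wise softmax, turning S into A in place.
    for (std::size_t i = 0; i < s.rows; ++i) {
        float m = s.at(i, 0) * scale;
        for (std::size_t j = 1; j < s.cols; ++j)
            m = std::max(m, s.at(i, j) * scale);
        float sum = 0.0f;
        for (std::size_t j = 0; j < s.cols; ++j) {
            s.at(i, j) = std::exp(s.at(i, j) * scale - m);
            sum += s.at(i, j);
        }
        for (std::size_t j = 0; j < s.cols; ++j)
            s.at(i, j) /= sum;
    }

    // Step 4: context A*V (one GEMM).
    return matmul(s, v);
}
```

A quick sanity check: with `wq` all zeros every score is zero, the softmax weights become uniform, and each output row is the column-wise mean of V.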
Softmax is the one genuinely new piece. It runs row-wise on the N×N score matrix: for each row, subtract the max (stability), exponentiate, sum, and divide. This builds directly on Day 7's exp-LUT and Day 8's accumulation.
```cpp
// Row-wise softmax over one row of the N×N score matrix
// 3 passes: max → exp+sum → normalize (numerically stable)
// data_t, acc_t, N, exp_lut() and reciprocal() are assumed to be defined
// elsewhere in the project (exp_lut is the Day 7 exp LUT).
void softmax_row(data_t s[N], data_t out[N]) {
    data_t e[N];
#pragma HLS ARRAY_PARTITION variable=e cyclic factor=8

    // PIPELINE is applied per loop: at function level it would force
    // all three loops to unroll completely.

    // Pass 1: row max (stability)
    data_t m = s[0];
    for (int j = 1; j < N; j++) {
#pragma HLS PIPELINE II=1
        m = (s[j] > m) ? s[j] : m;
    }

    // Pass 2: exp(s - max) via LUT, accumulate sum
    acc_t sum = 0;
    for (int j = 0; j < N; j++) {
#pragma HLS PIPELINE II=1
        e[j] = exp_lut(s[j] - m); // Day 7 exp LUT
        sum += e[j];
    }

    // Pass 3: normalize by reciprocal (one divide, then multiplies)
    data_t inv = reciprocal(sum); // 1/sum once
    for (int j = 0; j < N; j++) {
#pragma HLS PIPELINE II=1
        out[j] = e[j] * inv;
    }
}
// Tip: "online softmax" (FlashAttention-style) fuses passes 1-2 to
// avoid storing the full row, which is essential for long sequences.
```

The score matrix is N×N. For BERT (N=512) that's 262,144 elements per head (256 KiB at 8 bits per element), large but manageable in BRAM. For long-context LLMs (N=4096+) it grows to 16M+ elements (16 MiB or more at 8 bits), exceeding on-chip memory. The fix is tiling plus online softmax (FlashAttention): process the score matrix in blocks and never materialize the full N×N, which keeps the working set on-chip and bandwidth-efficient.
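To make the tiling idea concrete, here is a minimal floating-point sketch of online softmax fused with the A·V product for a single query row. It follows the general FlashAttention-style recurrence (running max, running sum, rescaled accumulator) rather than any specific published kernel. The function name `online_attention_row` and its parameters are hypothetical. It assumes at least one finite score and `block > 0`; in hardware the scores would arrive tile by tile instead of as one full vector.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// One output row of attention, computed block by block over the keys.
//   scores: the N raw (already scaled) scores for this query row
//   values: V as an N x d row-major array
//   block : number of keys processed per tile (must be > 0)
std::vector<float> online_attention_row(const std::vector<float>& scores,
                                        const std::vector<float>& values,
                                        std::size_t d, std::size_t block) {
    const std::size_t n = scores.size();
    float m = -std::numeric_limits<float>::infinity();  // running max
    float l = 0.0f;                                     // running sum of exp
    std::vector<float> o(d, 0.0f);                      // unnormalized output

    for (std::size_t start = 0; start < n; start += block) {
        const std::size_t end = std::min(start + block, n);

        // Merge this tile's max into the running max.
        float m_new = m;
        for (std::size_t j = start; j < end; ++j)
            m_new = std::max(m_new, scores[j]);

        // Rescale everything accumulated so far to the new max.
        // On the first tile m is -inf, so the correction is exp(-inf) = 0.
        const float correction = std::exp(m - m_new);
        l *= correction;
        for (std::size_t c = 0; c < d; ++c) o[c] *= correction;

        // Accumulate this tile's contribution to the sum and the output.
        for (std::size_t j = start; j < end; ++j) {
            const float p = std::exp(scores[j] - m_new);
            l += p;
            for (std::size_t c = 0; c < d; ++c) o[c] += p * values[j * d + c];
        }
        m = m_new;
    }

    // Single normalization at the end (one reciprocal in hardware).
    for (std::size_t c = 0; c < d; ++c) o[c] /= l;
    return o;
}
```

Only `m`, `l`, and the d-element accumulator persist across tiles, so the on-chip state per query row is independent of N.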
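Transformers run several attention "heads" in parallel, each on a different learned projection of the input, then concatenate the head outputs (standard implementations follow this with one more output-projection GEMM). Because the heads are independent of each other up to that concatenation, they're a natural spatial-parallelism dimension (Day 9): no cross-head dependencies, ideal for FPGA replication.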
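The head independence can be sketched in plain C++ as follows. This assumes the common layout where Q, K, and V are N×d and head h owns the column slice [h·d_k, (h+1)·d_k) with d_k = d / heads. The function name `multi_head_attention` is illustrative, and the output projection is left out to keep the example short.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// q, k, v: N x d row-major. Returns the concatenated head outputs (N x d).
// Assumes d is divisible by heads.
std::vector<float> multi_head_attention(const std::vector<float>& q,
                                        const std::vector<float>& k,
                                        const std::vector<float>& v,
                                        std::size_t n, std::size_t d,
                                        std::size_t heads) {
    const std::size_t dk = d / heads;
    const float scale = 1.0f / std::sqrt(static_cast<float>(dk));
    std::vector<float> out(n * d, 0.0f);
    std::vector<float> p(n);  // one row of scores, reused per query

    // Each iteration reads and writes only its own column slice, so in an
    // HLS design this loop is a candidate for unrolling into parallel engines.
    for (std::size_t h = 0; h < heads; ++h) {
        const std::size_t off = h * dk;
        for (std::size_t i = 0; i < n; ++i) {
            // Scaled scores of query i against every key, within head h.
            float mx = -std::numeric_limits<float>::infinity();
            for (std::size_t j = 0; j < n; ++j) {
                float s = 0.0f;
                for (std::size_t c = 0; c < dk; ++c)
                    s += q[i * d + off + c] * k[j * d + off + c];
                p[j] = s * scale;
                mx = std::max(mx, p[j]);
            }
            // Stable softmax over the row.
            float sum = 0.0f;
            for (std::size_t j = 0; j < n; ++j) {
                p[j] = std::exp(p[j] - mx);
                sum += p[j];
            }
            // Weighted sum of V; writing into the slice is the concatenation.
            for (std::size_t c = 0; c < dk; ++c) {
                float acc = 0.0f;
                for (std::size_t j = 0; j < n; ++j)
                    acc += p[j] * v[j * d + off + c];
                out[i * d + off + c] = acc / sum;
            }
        }
    }
    return out;
}
```

Because no head reads another head's slice, replicating the inner datapath `heads` times needs no arbitration between copies; the cost is area and the bandwidth to feed them.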
Attention is one sub-layer. A complete transformer block adds a feed-forward network (two big GEMMs) and layer normalization — all of which you already know how to build.
| Sub-layer | Operation | Maps to (earlier day) |
|---|---|---|
| Multi-head attention | QKV + scores + softmax + A·V | Systolic array (D4) + softmax (D7) |
| Add & LayerNorm | residual + normalize | Norm folding ideas (D8) |
| Feed-forward (FFN) | 2 GEMMs + GELU | GEMM (D3) + activation LUT (D7) |
| Add & LayerNorm | residual + normalize | Norm folding ideas (D8) |
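The "Add & LayerNorm" rows can be sketched for a single token vector as below. Unlike batch normalization, LayerNorm computes its mean and variance at inference time from each token's own features, so the statistics cannot be precomputed offline. The function name `add_layernorm` and the default `eps` value are illustrative choices, not taken from the series.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// y = LayerNorm(x + sublayer_out) for one token of width d.
// gamma and beta are the learned per-feature scale and shift.
std::vector<float> add_layernorm(const std::vector<float>& x,
                                 const std::vector<float>& sublayer_out,
                                 const std::vector<float>& gamma,
                                 const std::vector<float>& beta,
                                 float eps = 1e-5f) {
    const std::size_t d = x.size();
    std::vector<float> y(d);

    // Residual add, accumulating the mean in the same pass.
    float mean = 0.0f;
    for (std::size_t i = 0; i < d; ++i) {
        y[i] = x[i] + sublayer_out[i];
        mean += y[i];
    }
    mean /= static_cast<float>(d);

    // Variance over the feature dimension (second pass).
    float var = 0.0f;
    for (std::size_t i = 0; i < d; ++i) {
        const float diff = y[i] - mean;
        var += diff * diff;
    }
    var /= static_cast<float>(d);

    // Normalize, then apply the learned scale and shift.
    const float inv_std = 1.0f / std::sqrt(var + eps);
    for (std::size_t i = 0; i < d; ++i)
        y[i] = (y[i] - mean) * inv_std * gamma[i] + beta[i];
    return y;
}
```

In hardware the structure mirrors the softmax row: two reduction passes followed by one elementwise pass, with a single reciprocal square root per token.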
People obsess over attention, but in BERT/GPT the feed-forward network (whose hidden width is typically 4× the model dimension) often needs more FLOPs than attention itself at moderate sequence lengths. Both are GEMMs, so a strong systolic array benefits the whole transformer, not just attention.
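A back-of-the-envelope MAC count makes the comparison concrete. The helpers below are illustrative names. They count only the GEMMs and ignore softmax, normalization, and biases. By default they include the attention output projection, which is an assumption on top of the five GEMMs listed in the attention table.

```cpp
#include <cstdint>

// MACs for one attention sub-layer on N tokens of width d:
//   projections: 3 (Q, K, V) or 4 (plus output projection) GEMMs of N*d*d
//   mixing     : Q*K^T and A*V, each N*N*d summed over all heads
std::uint64_t attention_macs(std::uint64_t n, std::uint64_t d,
                             bool include_out_proj = true) {
    const std::uint64_t n_proj = include_out_proj ? 4 : 3;
    const std::uint64_t proj = n_proj * n * d * d;
    const std::uint64_t mix = 2 * n * n * d;
    return proj + mix;
}

// MACs for the feed-forward sub-layer: d -> expansion*d -> d.
std::uint64_t ffn_macs(std::uint64_t n, std::uint64_t d,
                       std::uint64_t expansion = 4) {
    return 2 * n * d * (expansion * d);
}
```

For a BERT-base-sized layer (d = 768) at N = 512, this gives about 2.4 billion MACs for the FFN against about 1.6 billion for attention. Under this counting the two are equal at N = 2d, and attention dominates for longer sequences.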
Everything that makes a great CNN accelerator — INT8 quantization, a strong GEMM/systolic array, smart memory and pipelining — is exactly what a transformer accelerator needs. The leap from vision to language models is much smaller in hardware than in software. Master the fundamentals and you can accelerate both.
Next — Day 14: Benchmarking & Profiling — measuring real TOPS, latency, and power; MLPerf methodology; and finding the true bottleneck in your accelerator.