What is self-attention in transformers?

Self-attention lets every token in a sequence attend to every other token. It computes three projections — Query (Q), Key (K), Value (V) — then scores = softmax(Q·Kᵀ/√d), and output = scores·V. This is mostly matrix multiplication, which maps beautifully onto the GEMM/systolic hardware from Days 3-4, plus a softmax.

Why is attention hard to accelerate on FPGA?

Attention has two challenges: the Q·Kᵀ score matrix grows as sequence-length squared (O(N²)) in both compute and memory, and softmax requires exponentials plus a row-wise normalization that creates a data dependency. The matmuls are easy (reuse GEMM); the difficulty is managing the large intermediate score matrix and streaming softmax efficiently.

How is multi-head attention parallelized on FPGA?

Multi-head attention runs several independent attention computations (heads) on different projections of the input. On FPGA these heads are fully independent, so they map to parallel hardware: instantiate multiple attention engines that run concurrently, or time-share one engine across heads. Heads are an ideal spatial-parallelism dimension since there are no dependencies between them.

Transformer Attention on FPGA — Self-Attention, QKV, Softmax & Multi-Head

1. Why Transformers, Why FPGA

Transformers power modern AI — BERT for language, ViT for vision, GPT/LLaMA for generation. Their core is the attention mechanism, and the good news for FPGA designers is that attention is mostly matrix multiplication — the exact operation you already accelerate with the GEMM engine (Day 3) and systolic array (Day 4). The new ingredients are a large intermediate score matrix and a softmax.

You Already Have 80% of the Hardware

Three of the four steps in attention are GEMMs. Your Day 4 systolic array runs them directly. The only genuinely new hardware is the softmax (which builds on Day 7's activation work) and the buffering for the N×N score matrix (Day 6 memory). Transformers are far more FPGA-friendly than their reputation suggests.

2. The Attention Mechanism

Self-attention lets every token look at every other token and decide what's relevant. It's computed in four steps from the input sequence X (N tokens × d dimensions).

Scaled Dot-Product Attention:

Step 1: Projections (3 GEMMs)
  Q = X · Wq    Query  [N × d]
  K = X · Wk    Key    [N × d]
  V = X · Wv    Value  [N × d]

Step 2: Scores (GEMM, N×N output)
  S = Q · Kᵀ / √d    [N × N]   ← every token vs every token

Step 3: Softmax (row-wise)
  A = softmax(S)     [N × N]   ← normalize each row to sum=1

Step 4: Weighted sum (GEMM)
  out = A · V        [N × d]

Compute: 3 projection GEMMs + Q·Kᵀ + A·V = mostly matrix multiply
Memory: the N×N score matrix scales as O(N²), the main challenge
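As a software reference to check hardware results against, the four steps can be sketched in plain C++. This is a minimal floating-point model, not the accelerator code: the `Matrix` alias and the function names `matmul` and `attention` are illustrative assumptions, and Q, K, V are taken as already projected (Step 1 would be three `matmul` calls with Wq, Wk, Wv).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;  // row-major [rows][cols]

// C = A * B, with A [n x k] and B [k x m].
Matrix matmul(const Matrix& a, const Matrix& b) {
  assert(!a.empty() && !b.empty() && a[0].size() == b.size());
  Matrix c(a.size(), std::vector<float>(b[0].size(), 0.0f));
  for (std::size_t i = 0; i < a.size(); ++i)
    for (std::size_t p = 0; p < b.size(); ++p)
      for (std::size_t j = 0; j < b[0].size(); ++j)
        c[i][j] += a[i][p] * b[p][j];
  return c;
}

// Steps 2-4: out = softmax(Q * K^T / sqrt(d)) * V for already-projected Q, K, V.
Matrix attention(const Matrix& q, const Matrix& k, const Matrix& v) {
  assert(!q.empty() && !q[0].empty());
  const std::size_t n = q.size(), d = q[0].size();
  assert(k.size() == n && v.size() == n && k[0].size() == d);
  const float scale = 1.0f / std::sqrt(static_cast<float>(d));
  Matrix a(n, std::vector<float>(n));
  for (std::size_t i = 0; i < n; ++i) {
    // Step 2: scaled scores of token i against every token j
    for (std::size_t j = 0; j < n; ++j) {
      float dot = 0.0f;
      for (std::size_t p = 0; p < d; ++p) dot += q[i][p] * k[j][p];
      a[i][j] = dot * scale;
    }
    // Step 3: numerically stable row-wise softmax
    const float m = *std::max_element(a[i].begin(), a[i].end());
    float sum = 0.0f;
    for (float& x : a[i]) { x = std::exp(x - m); sum += x; }
    for (float& x : a[i]) x /= sum;
  }
  return matmul(a, v);  // Step 4: weighted sum of value rows
}
```

With all-zero Q and K every score is equal, so each output row is simply the mean of the rows of V, which makes a convenient first sanity check.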

Attention Dataflow

3. Mapping QKV to the Systolic Array

The three projections and the two attention matmuls all run on the Day 4 systolic array. You reuse the same MAC fabric, just reloading different weight matrices and routing different operands.

Step	Operation	Dimensions	Hardware
Q projection	X · Wq	[N×d]·[d×d]	Systolic array (weight-stationary Wq)
K projection	X · Wk	[N×d]·[d×d]	Systolic array (weight-stationary Wk)
V projection	X · Wv	[N×d]·[d×d]	Systolic array (weight-stationary Wv)
Scores	Q · Kᵀ	[N×d]·[d×N]	Systolic array (Kᵀ streamed)
Context	A · V	[N×N]·[N×d]	Systolic array (V weight-stationary)
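To make the "Kᵀ streamed" row concrete, here is a small software model of an INT8 GEMM that takes its second operand row by row, so no physical transpose is needed for the scores step. It is a sketch under assumptions: the `MatI8` type and the name `gemm_i8_bt` are hypothetical, and the loop nest models only the arithmetic, not the systolic timing or the weight-stationary dataflow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major INT8 matrix with explicit dimensions (hypothetical helper type).
struct MatI8 {
  std::size_t rows, cols;
  std::vector<std::int8_t> data;
  std::int8_t at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// C[i][j] = sum_p A[i][p] * B[j][p], i.e. C = A * B^T, returned row-major.
// For the scores step A = Q and B = K: each K row is streamed against each
// Q row, which is the same multiply-accumulate pattern as the projections.
std::vector<std::int32_t> gemm_i8_bt(const MatI8& a, const MatI8& b) {
  assert(a.cols == b.cols);
  assert(a.data.size() == a.rows * a.cols && b.data.size() == b.rows * b.cols);
  std::vector<std::int32_t> c(a.rows * b.rows, 0);
  for (std::size_t i = 0; i < a.rows; ++i) {
    for (std::size_t j = 0; j < b.rows; ++j) {
      std::int32_t acc = 0;  // wide accumulator: INT8 products overflow 8 bits
      for (std::size_t p = 0; p < a.cols; ++p)
        acc += static_cast<std::int32_t>(a.at(i, p)) *
               static_cast<std::int32_t>(b.at(j, p));
      c[i * b.rows + j] = acc;
    }
  }
  return c;
}
```

The 32-bit accumulator is a deliberate choice in this sketch; the result would still need rescaling back to INT8 before feeding the next stage, which is left out here.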

4. Hardware Softmax for Attention

Softmax is the one genuinely new piece. It runs row-wise on the N×N score matrix: for each row, subtract the max (stability), exponentiate, sum, and divide. This builds directly on Day 7's exp-LUT and Day 8's accumulation.

attention_softmax.cpp (HLS, streaming row-wise)

// Row-wise softmax over one row of the N×N score matrix
// 3 passes: max → exp+sum → normalize (numerically stable)
void softmax_row(data_t s[N], data_t out[N]) {
  // Pipeline each loop on its own. A function-level PIPELINE would force
  // every loop below to unroll fully, which does not scale with N.

  // Pass 1: row max (stability)
  data_t m = s[0];
  for (int j = 1; j < N; j++) {
    #pragma HLS PIPELINE II=1
    m = (s[j] > m) ? s[j] : m;
  }

  // Pass 2: exp(s - max) via LUT, accumulate sum
  acc_t sum = 0;
  data_t e[N];
  for (int j = 0; j < N; j++) {
    #pragma HLS PIPELINE II=1
    e[j] = exp_lut(s[j] - m);     // Day 7 exp LUT; argument is always <= 0
    sum += e[j];
  }

  // Pass 3: normalize by reciprocal (one divide, then multiplies)
  data_t inv = reciprocal(sum);   // 1/sum once; sum > 0 because the max element contributes exp_lut(0)
  for (int j = 0; j < N; j++) {
    #pragma HLS PIPELINE II=1
    out[j] = e[j] * inv;
  }
}
// Tip: "online softmax" (FlashAttention-style) fuses passes 1-2 to
// avoid storing the full row — essential for long sequences.
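The online-softmax idea in the tip can be sketched in software as follows. This is a floating-point illustration rather than HLS code; the function name `online_softmax_row` is an assumption, and `std::exp` stands in for the exp LUT. The running sum is rescaled whenever a larger maximum appears, so the max and the sum come out of a single pass over the scores.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax of one score row using a fused max + sum pass.
// Invariant after visiting s[0..j]: l == sum over t <= j of exp(s[t] - m),
// where m is the maximum seen so far.
std::vector<float> online_softmax_row(const std::vector<float>& s) {
  assert(!s.empty());
  float m = s[0];
  float l = 1.0f;  // exp(s[0] - m) with m == s[0]
  for (std::size_t j = 1; j < s.size(); ++j) {
    const float m_new = (s[j] > m) ? s[j] : m;
    // Rescale the old sum to the new max, then add the new term.
    l = l * std::exp(m - m_new) + std::exp(s[j] - m_new);
    m = m_new;
  }
  // Normalize: one reciprocal, then multiplies.
  const float inv = 1.0f / l;
  std::vector<float> out(s.size());
  for (std::size_t j = 0; j < s.size(); ++j)
    out[j] = std::exp(s[j] - m) * inv;
  return out;
}
```

This version still reads the row twice (once for the statistics, once to normalize). The saving over the three-pass form is one full pass and the intermediate exp buffer.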

The O(N²) Problem

The score matrix is N×N. For BERT (N=512) that's 262,144 elements per head — large but manageable in BRAM. For long-context LLMs (N=4096+) it explodes to 16M+ elements, exceeding on-chip memory. The fix is tiling + online softmax (FlashAttention): process the score matrix in blocks, never materializing the full N×N — keeping it on-chip and bandwidth-efficient.
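A rough software sketch of the tiling idea, for a single query row: K and V are visited in blocks, and a running max, running sum, and running output accumulator are rescaled as each block arrives, so at most `block` scores exist at any time. The function name `attention_row_tiled` and the `block` parameter are illustrative assumptions, and this is a simplified model of the FlashAttention-style recurrence, not a tuned implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Matrix = std::vector<std::vector<float>>;  // row-major [rows][cols]

// One attention output row for query q, visiting K/V `block` rows at a time.
std::vector<float> attention_row_tiled(const std::vector<float>& q,
                                       const Matrix& k, const Matrix& v,
                                       std::size_t block) {
  const std::size_t n = k.size(), d = q.size();
  assert(block > 0 && n > 0 && d > 0 && v.size() == n && k[0].size() == d);
  const float scale = 1.0f / std::sqrt(static_cast<float>(d));
  float m = -std::numeric_limits<float>::infinity();  // running max
  float l = 0.0f;                                      // running sum of exp
  std::vector<float> acc(v[0].size(), 0.0f);           // unnormalized output
  std::vector<float> s(block);                         // scores of one block only
  for (std::size_t start = 0; start < n; start += block) {
    const std::size_t len = std::min(block, n - start);
    float m_new = m;
    for (std::size_t t = 0; t < len; ++t) {
      float dot = 0.0f;
      for (std::size_t p = 0; p < d; ++p) dot += q[p] * k[start + t][p];
      s[t] = dot * scale;
      m_new = std::max(m_new, s[t]);
    }
    // Rescale what was accumulated under the old max (factor is 0 on the first block).
    const float rescale = std::exp(m - m_new);
    l *= rescale;
    for (float& x : acc) x *= rescale;
    for (std::size_t t = 0; t < len; ++t) {
      const float p = std::exp(s[t] - m_new);
      l += p;
      for (std::size_t c = 0; c < acc.size(); ++c) acc[c] += p * v[start + t][c];
    }
    m = m_new;
  }
  for (float& x : acc) x /= l;  // final normalization
  return acc;
}
```

The result should match the untiled computation up to rounding for any block size, which is an easy property to test before moving the recurrence into hardware.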

5. Multi-Head Attention

Transformers run several attention "heads" in parallel, each on a different learned projection, then concatenate. Because heads are completely independent, they're a perfect spatial-parallelism dimension (Day 9) — no dependencies, ideal for FPGA replication.

Multi-Head Attention — Parallel Independent Heads

FPGA strategy: instantiate H attention engines for full parallelism, or time-share one engine across heads to save area. Heads have zero inter-dependency — the cleanest parallelism in the whole network.
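The independence of heads shows up clearly in a software model: each head reads and writes its own slice of columns, so the loop over heads has no carried dependency. The sketch below assumes Q, K, V are already projected to width d and split into H equal slices of width d/H (the usual convention, but an assumption here); the names `attention_head` and `multi_head_attention` are illustrative, and the output projection that normally follows concatenation is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Matrix = std::vector<std::vector<float>>;  // row-major [rows][cols]

// Attention for one head over columns [off, off + dk) of Q, K, V.
// Writes only the same columns of `out`, so heads never touch shared data.
void attention_head(const Matrix& q, const Matrix& k, const Matrix& v,
                    std::size_t off, std::size_t dk, Matrix& out) {
  const std::size_t n = q.size();
  const float scale = 1.0f / std::sqrt(static_cast<float>(dk));
  std::vector<float> p(n);
  for (std::size_t i = 0; i < n; ++i) {
    float m = -std::numeric_limits<float>::infinity();
    for (std::size_t j = 0; j < n; ++j) {
      float dot = 0.0f;
      for (std::size_t c = off; c < off + dk; ++c) dot += q[i][c] * k[j][c];
      p[j] = dot * scale;
      m = std::max(m, p[j]);
    }
    float sum = 0.0f;
    for (std::size_t j = 0; j < n; ++j) { p[j] = std::exp(p[j] - m); sum += p[j]; }
    for (std::size_t c = off; c < off + dk; ++c) {
      float o = 0.0f;
      for (std::size_t j = 0; j < n; ++j) o += p[j] * v[j][c];
      out[i][c] = o / sum;
    }
  }
}

// Runs `heads` independent heads; concatenation is implicit in the column layout.
Matrix multi_head_attention(const Matrix& q, const Matrix& k, const Matrix& v,
                            std::size_t heads) {
  assert(!q.empty() && !q[0].empty());
  const std::size_t n = q.size(), d = q[0].size();
  assert(heads > 0 && d % heads == 0 && k.size() == n && v.size() == n);
  const std::size_t dk = d / heads;
  Matrix out(n, std::vector<float>(d, 0.0f));
  // In hardware this loop is the one to replicate (H engines) or to run
  // sequentially on a single engine (time-sharing).
  for (std::size_t h = 0; h < heads; ++h)
    attention_head(q, k, v, h * dk, dk, out);
  return out;
}
```

Because each iteration of the head loop is self-contained, choosing between replication and time-sharing is purely an area versus latency trade-off and does not change the result.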

6. The Full Transformer Block

Attention is one sub-layer. A complete transformer block adds a feed-forward network (two big GEMMs) and layer normalization — all of which you already know how to build.

Sub-layer	Operation	Maps to (earlier day)
Multi-head attention	QKV + scores + softmax + A·V	Systolic array (D4) + softmax (D7)
Add & LayerNorm	residual + normalize	Norm folding ideas (D8)
Feed-forward (FFN)	2 GEMMs + GELU	GEMM (D3) + activation LUT (D7)
Add & LayerNorm	residual + normalize	(D8)
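For reference, the "residual + normalize" entries in the table correspond to the following per-token computation. This is a floating-point sketch with an assumed function name `add_layernorm` and an assumed default `eps`; it shows the arithmetic only and says nothing about how the mean, variance, and reciprocal square root would be scheduled in hardware.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// out = gamma * normalize(x + sublayer) + beta, normalizing across the d
// features of one token. `eps` guards against division by zero.
std::vector<float> add_layernorm(const std::vector<float>& x,
                                 const std::vector<float>& sublayer,
                                 const std::vector<float>& gamma,
                                 const std::vector<float>& beta,
                                 float eps = 1e-5f) {
  const std::size_t d = x.size();
  assert(d > 0 && sublayer.size() == d && gamma.size() == d && beta.size() == d);
  std::vector<float> y(d);
  float mean = 0.0f;
  for (std::size_t i = 0; i < d; ++i) {
    y[i] = x[i] + sublayer[i];  // residual add
    mean += y[i];
  }
  mean /= static_cast<float>(d);
  float var = 0.0f;
  for (std::size_t i = 0; i < d; ++i) var += (y[i] - mean) * (y[i] - mean);
  var /= static_cast<float>(d);
  const float inv_std = 1.0f / std::sqrt(var + eps);  // one rsqrt per token
  for (std::size_t i = 0; i < d; ++i)
    y[i] = gamma[i] * (y[i] - mean) * inv_std + beta[i];
  return y;
}
```

Unlike the GEMMs, the statistics here depend on the data of each token, so they have to be computed at run time even when gamma and beta are fixed constants.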

FFN Is Often the Bigger Cost

People obsess over attention, but in BERT/GPT the feed-forward network (whose hidden width is typically 4× the model dimension) often has more FLOPs than attention itself for moderate sequence lengths. Both are GEMMs, so a strong systolic array benefits the whole transformer, not just attention.
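The claim can be checked with a quick multiply-accumulate count. The sketch below counts only the GEMMs listed in this lesson (three projections, Q·Kᵀ, A·V, and two FFN GEMMs); the struct and function names are illustrative, and the attention output projection is not included, so treat the crossover point as approximate.

```cpp
#include <cassert>
#include <cstdint>

struct LayerMacs {
  std::uint64_t attention;  // projections + scores + context
  std::uint64_t ffn;        // two feed-forward GEMMs
};

// MAC counts for one transformer layer with sequence length n, model width d,
// and FFN hidden width ffn_mult * d.
LayerMacs layer_macs(std::uint64_t n, std::uint64_t d, std::uint64_t ffn_mult = 4) {
  assert(n > 0 && d > 0 && ffn_mult > 0);
  const std::uint64_t proj = 3 * n * d * d;  // Q, K, V projections
  const std::uint64_t scores = n * n * d;    // Q * K^T, summed over all heads
  const std::uint64_t context = n * n * d;   // A * V, summed over all heads
  const std::uint64_t ffn = 2 * n * d * (ffn_mult * d);
  return {proj + scores + context, ffn};
}
```

For BERT-Base dimensions (N=128, d=768) this gives about 252M MACs for attention against about 604M for the FFN. Under these counts the two are equal at N = 2.5·d (1920 tokens for d=768), and attention dominates beyond that.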

7. BERT / ViT on FPGA — Real Numbers

BERT-Base attention on Alveo U280 (HBM), INT8:
  Config:          12 layers, 12 heads, d=768, sequence N=128
  Per-layer GEMMs: 3 proj + Q·Kᵀ + A·V + 2 FFN
  Systolic array:  32×32 INT8 @ 300 MHz = ~0.6 TOPS/array, ×4 arrays
  HBM bandwidth:   ~460 GB/s feeds the weight streaming
  Latency:         ~2–5 ms / sequence (vs ~8–12 ms on CPU)
  Throughput:      ~200–400 sequences/sec
  Power:           ~45 W → competitive perf/W vs GPU for batch-1 latency

Vision Transformer (ViT-Base, 196 patches):
  Same engine: patches are just the "tokens"
  Real-time image classification at the edge with a smaller config
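The peak-throughput figure follows from simple arithmetic, sketched here with hypothetical helper names. It assumes one MAC per processing element per cycle and counts each MAC as two operations, which is the usual convention for TOPS figures; real utilization will be lower than the peak.

```cpp
#include <cassert>

// Peak throughput in TOPS for `arrays` systolic arrays of rows x cols MACs,
// counting each MAC as 2 operations (multiply + add).
double peak_tops(int rows, int cols, double freq_hz, int arrays) {
  assert(rows > 0 && cols > 0 && freq_hz > 0.0 && arrays > 0);
  return 2.0 * rows * cols * freq_hz * arrays / 1e12;
}

// Sequences per second when one sequence is processed at a time (batch 1).
double seq_per_sec(double latency_ms) {
  assert(latency_ms > 0.0);
  return 1000.0 / latency_ms;
}
```

A 32×32 array at 300 MHz comes out at about 0.61 TOPS, or about 2.46 TOPS for four arrays, and a 2.5 to 5 ms latency corresponds to 400 down to 200 sequences per second at batch 1.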

From CNNs to LLMs — Same Foundation

Everything that makes a great CNN accelerator — INT8 quantization, a strong GEMM/systolic array, smart memory and pipelining — is exactly what a transformer accelerator needs. The leap from vision to language models is much smaller in hardware than in software. Master the fundamentals and you can accelerate both.

Day 13 — Key Takeaways

✅ Attention = mostly GEMM — Q/K/V projections, Q·Kᵀ, and A·V all run on the systolic array
✅ 4 steps: project QKV → scores Q·Kᵀ/√d → softmax → weighted sum A·V
✅ Softmax is the one new piece — row-wise, stable (subtract max), exp-LUT + reciprocal
✅ O(N²) score matrix is the challenge → tiling + online softmax (FlashAttention)
✅ Multi-head = independent heads → ideal spatial parallelism (replicate or time-share)
✅ Full block adds FFN (2 GEMMs + GELU) + LayerNorm — often FFN dominates FLOPs
✅ BERT-Base on Alveo U280: ~2–5 ms/sequence, competitive batch-1 latency vs GPU; ViT reuses the same engine
✅ Same foundation as CNNs — quantization, GEMM, memory, pipelining carry straight over

Next — Day 14: Benchmarking & Profiling — measuring real TOPS, latency, and power; MLPerf methodology; and finding the true bottleneck in your accelerator.
