
Transformer Attention on FPGA

From CNNs to the architecture behind ChatGPT. Self-attention in hardware: QKV matmuls, scaled dot-product, hardware softmax, and multi-head parallelism for BERT and Vision Transformer inference.

By EcrioniX Engineering Team · Published June 16, 2026 · ~4,800 words · 15 min read

1. Why Transformers, Why FPGA

Transformers power modern AI — BERT for language, ViT for vision, GPT/LLaMA for generation. Their core is the attention mechanism, and the good news for FPGA designers is that attention is mostly matrix multiplication — the exact operation you already accelerate with the GEMM engine (Day 3) and systolic array (Day 4). The new ingredients are a large intermediate score matrix and a softmax.

You Already Have 80% of the Hardware

Three of the four steps in attention are GEMMs. Your Day 4 systolic array runs them directly. The only genuinely new hardware is the softmax (which builds on Day 7's activation work) and the buffering for the N×N score matrix (Day 6 memory). Transformers are far more FPGA-friendly than their reputation suggests.

2. The Attention Mechanism

Self-attention lets every token look at every other token and decide what's relevant. It's computed in four steps from the input sequence X (N tokens × d dimensions).

Scaled Dot-Product Attention:

    Step 1: Projections (3 GEMMs)
        Q = X · Wq          Query   [N × d]
        K = X · Wk          Key     [N × d]
        V = X · Wv          Value   [N × d]

    Step 2: Scores (GEMM, N×N output)
        S = Q · Kᵀ / √d     [N × N]   ← every token vs every token

    Step 3: Softmax (row-wise)
        A = softmax(S)      [N × N]   ← normalize each row to sum = 1

    Step 4: Weighted sum (GEMM)
        out = A · V         [N × d]

    Compute: 3 projection GEMMs + Q·Kᵀ + A·V = mostly matrix multiply
    Memory:  the N×N score matrix scales as O(N²), the main challenge
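[Figure: Attention Dataflow. X feeds three projection GEMMs that produce Q, K, and V; Q·Kᵀ/√d gives the N×N scores; softmax yields A; the A·V GEMM produces out. The GEMM stages (purple/cyan) run on your systolic array, the softmax stage (yellow) is new, and V feeds A·V.]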
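Before committing anything to hardware, it can help to have a plain software reference to compare against. The sketch below follows the four steps literally in floating point for a single head. The `Mat` helper, `matmul`, and `attention_ref` are illustrative names, not part of the series' code, and the naive loops are for clarity only.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major matrix in a flat vector (illustrative helper, not from the article).
struct Mat {
    std::size_t rows, cols;
    std::vector<float> v;
    Mat(std::size_t r, std::size_t c) : rows(r), cols(c), v(r * c, 0.0f) {}
    float& at(std::size_t i, std::size_t j) { return v[i * cols + j]; }
    float at(std::size_t i, std::size_t j) const { return v[i * cols + j]; }
};

// C = A * B, or C = A * B^T when transpose_b is true.
Mat matmul(const Mat& a, const Mat& b, bool transpose_b = false) {
    const std::size_t n = transpose_b ? b.rows : b.cols;
    Mat c(a.rows, n);
    for (std::size_t i = 0; i < a.rows; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < a.cols; ++k)
                acc += a.at(i, k) * (transpose_b ? b.at(j, k) : b.at(k, j));
            c.at(i, j) = acc;
        }
    return c;
}

// Single-head scaled dot-product attention, steps 1-4.
Mat attention_ref(const Mat& x, const Mat& wq, const Mat& wk, const Mat& wv) {
    const Mat q = matmul(x, wq);                   // Step 1: projections
    const Mat k = matmul(x, wk);
    const Mat v = matmul(x, wv);
    Mat s = matmul(q, k, /*transpose_b=*/true);    // Step 2: Q * K^T
    const float scale = 1.0f / std::sqrt(static_cast<float>(q.cols));
    for (std::size_t i = 0; i < s.rows; ++i) {     // Step 3: row-wise softmax
        float m = s.at(i, 0) * scale;
        for (std::size_t j = 0; j < s.cols; ++j) {
            s.at(i, j) *= scale;                   // apply the 1/sqrt(d) scaling
            m = std::max(m, s.at(i, j));
        }
        float sum = 0.0f;
        for (std::size_t j = 0; j < s.cols; ++j) {
            s.at(i, j) = std::exp(s.at(i, j) - m); // subtract row max for stability
            sum += s.at(i, j);
        }
        for (std::size_t j = 0; j < s.cols; ++j) s.at(i, j) /= sum;
    }
    return matmul(s, v);                           // Step 4: A * V
}
```

A reference like this is mainly useful as a golden model: feed the same X and weights to the fixed-point design and compare outputs within a tolerance.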

3. Mapping QKV to the Systolic Array

The three projections and the two attention matmuls all run on the Day 4 systolic array. You reuse the same MAC fabric, just reloading different weight matrices and routing different operands.

| Step | Operation | Dimensions | Hardware |
|---|---|---|---|
| Q projection | X · Wq | [N×d]·[d×d] | Systolic array (weight-stationary Wq) |
| K projection | X · Wk | [N×d]·[d×d] | Systolic array (weight-stationary Wk) |
| V projection | X · Wv | [N×d]·[d×d] | Systolic array (weight-stationary Wv) |
| Scores | Q · Kᵀ | [N×d]·[d×N] | Systolic array (Kᵀ streamed) |
| Context | A · V | [N×N]·[N×d] | Systolic array (V weight-stationary) |
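To make "weight-stationary" concrete, here is a small software model of the loop order, offered as a sketch rather than a description of the Day 4 design. It holds one P×P tile of the right-hand matrix fixed while every row of the left-hand matrix streams past it. The tile edge `P = 4` and the names `gemm_weight_stationary` and `weight_tile_loads` are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t P = 4;  // hypothetical systolic array edge (tile size)

// C[M×N] = A[M×K] * B[K×N], all row-major flat vectors.
// The outer loops select a P×P tile of B (the "stationary" weights);
// all M rows of A stream past that tile before the next tile is loaded.
std::vector<float> gemm_weight_stationary(const std::vector<float>& a,
                                          const std::vector<float>& b,
                                          std::size_t M, std::size_t K, std::size_t N) {
    std::vector<float> c(M * N, 0.0f);
    for (std::size_t k0 = 0; k0 < K; k0 += P)
        for (std::size_t n0 = 0; n0 < N; n0 += P) {
            const std::size_t k1 = std::min(k0 + P, K);
            const std::size_t n1 = std::min(n0 + P, N);
            for (std::size_t m = 0; m < M; ++m)          // stream activations
                for (std::size_t n = n0; n < n1; ++n) {
                    float acc = 0.0f;
                    for (std::size_t k = k0; k < k1; ++k)
                        acc += a[m * K + k] * b[k * N + n];
                    c[m * N + n] += acc;                  // accumulate partial sums
                }
        }
    return c;
}

// Number of weight-tile loads for a K×N right-hand matrix.
constexpr std::size_t weight_tile_loads(std::size_t K, std::size_t N) {
    return ((K + P - 1) / P) * ((N + P - 1) / P);
}
```

The same routine covers every row of the table by changing what plays the role of B: Wq, Wk, or Wv for the projections, Kᵀ for the scores, and V for the context GEMM.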

4. Hardware Softmax for Attention

Softmax is the one genuinely new piece. It runs row-wise on the N×N score matrix: for each row, subtract the max (stability), exponentiate, sum, and divide. This builds directly on Day 7's exp-LUT and Day 8's accumulation.

attention_softmax.cpp (HLS, streaming row-wise)

    // Row-wise softmax over one row of the N×N score matrix
    // 3 passes: max → exp+sum → normalize (numerically stable)
    // PIPELINE is applied per loop: at function scope it would ask the
    // tool to fully unroll all three loops.
    void softmax_row(data_t s[N], data_t out[N]) {
        // Pass 1: row max (stability)
        data_t m = s[0];
        for (int j = 1; j < N; j++) {
    #pragma HLS PIPELINE II=1
            m = (s[j] > m) ? s[j] : m;
        }

        // Pass 2: exp(s - max) via LUT, accumulate sum
        acc_t sum = 0;
        data_t e[N];
    #pragma HLS ARRAY_PARTITION variable=e cyclic factor=8
        for (int j = 0; j < N; j++) {
    #pragma HLS PIPELINE II=1
            e[j] = exp_lut(s[j] - m);   // Day 7 exp LUT
            sum += e[j];
        }

        // Pass 3: normalize by reciprocal (one divide, then multiplies)
        data_t inv = reciprocal(sum);   // 1/sum once
        for (int j = 0; j < N; j++) {
    #pragma HLS PIPELINE II=1
            out[j] = e[j] * inv;
        }
    }
    // Tip: "online softmax" (FlashAttention-style) fuses passes 1-2 to
    // avoid storing the full row, which is essential for long sequences.
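The online-softmax tip can be sketched in plain C++ as follows. This is a floating-point illustration, not HLS code: it uses `std::exp` where the hardware version would use the exp LUT, assumes a non-empty row, and the names `RowStats`, `online_max_sum`, and `softmax_online` are hypothetical. The idea is to keep a running max and a running sum together, rescaling the sum whenever a larger max appears.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct RowStats {
    float max;  // running row maximum
    float sum;  // sum of exp(s[j] - max) over the elements seen so far
};

// Fused passes 1 and 2: one sweep yields both the max and the exp-sum.
// Precondition: s is non-empty.
RowStats online_max_sum(const std::vector<float>& s) {
    float m = s[0];
    float l = 1.0f;                              // exp(s[0] - m) == 1
    for (std::size_t j = 1; j < s.size(); ++j) {
        if (s[j] > m) {
            l = l * std::exp(m - s[j]) + 1.0f;   // rescale old sum to the new max
            m = s[j];
        } else {
            l += std::exp(s[j] - m);
        }
    }
    return {m, l};
}

// Normalization pass: recomputes exp instead of buffering it.
std::vector<float> softmax_online(const std::vector<float>& s) {
    const RowStats st = online_max_sum(s);
    const float inv = 1.0f / st.sum;             // one reciprocal, then multiplies
    std::vector<float> out(s.size());
    for (std::size_t j = 0; j < s.size(); ++j)
        out[j] = std::exp(s[j] - st.max) * inv;
    return out;
}
```

Recomputing the exponential in the final pass trades one extra LUT lookup per element for not having to buffer the `e[]` array, which may or may not be the right trade for a given design.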

The O(N²) Problem

The score matrix is N×N. For BERT (N=512) that's 262,144 elements per head — large but manageable in BRAM. For long-context LLMs (N=4096+) it explodes to 16M+ elements, exceeding on-chip memory. The fix is tiling + online softmax (FlashAttention): process the score matrix in blocks, never materializing the full N×N — keeping it on-chip and bandwidth-efficient.
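A minimal sketch of the tiling idea for a single query row is shown below, in floating-point C++ rather than HLS. K and V are visited in blocks of `block` rows, and only a block-sized score buffer plus a d-wide accumulator are kept, so the full N-wide score row never exists. The function name, the `block` parameter, and the flat row-major layout are assumptions for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// out[d] = softmax(q · K^T / sqrt(d)) · V for ONE query row.
// k and v are N×d row-major; block must be > 0.
std::vector<float> attention_row_tiled(const std::vector<float>& q,
                                       const std::vector<float>& k,
                                       const std::vector<float>& v,
                                       std::size_t N, std::size_t d, std::size_t block) {
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));
    float m = -std::numeric_limits<float>::infinity();  // running max
    float l = 0.0f;                                      // running exp-sum
    std::vector<float> acc(d, 0.0f);                     // unnormalized output
    std::vector<float> s(block);                         // scores for one block only

    for (std::size_t j0 = 0; j0 < N; j0 += block) {
        const std::size_t j1 = std::min(j0 + block, N);
        float m_new = m;
        for (std::size_t j = j0; j < j1; ++j) {          // scores for this block
            float dot = 0.0f;
            for (std::size_t c = 0; c < d; ++c) dot += q[c] * k[j * d + c];
            s[j - j0] = dot * scale;
            m_new = std::max(m_new, s[j - j0]);
        }
        const float rescale = std::exp(m - m_new);       // 0 on the first block
        l *= rescale;
        for (std::size_t c = 0; c < d; ++c) acc[c] *= rescale;
        for (std::size_t j = j0; j < j1; ++j) {          // fold block into sum and A·V
            const float p = std::exp(s[j - j0] - m_new);
            l += p;
            for (std::size_t c = 0; c < d; ++c) acc[c] += p * v[j * d + c];
        }
        m = m_new;
    }
    for (std::size_t c = 0; c < d; ++c) acc[c] /= l;     // final normalization
    return acc;
}
```

Working storage here is `block + d` values per query row instead of N, which is why the approach scales to long sequences.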

5. Multi-Head Attention

Transformers run several attention "heads" in parallel, each on a different learned projection, then concatenate. Because heads are completely independent, they're a perfect spatial-parallelism dimension (Day 9) — no dependencies, ideal for FPGA replication.

[Figure: Multi-Head Attention, Parallel Independent Heads. X fans out to Head 1, Head 2, ..., Head h, each an attention unit, all running concurrently with no dependencies. Their outputs pass through Concat and the Wo projection to produce out.]
FPGA strategy: instantiate H attention engines for full parallelism, or time-share one engine across heads to save area. Heads have zero inter-dependency — the cleanest parallelism in the whole network.
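The independence of heads shows up directly in the indexing. In the sketch below, each head reads and writes only its own slice of columns, so writing into disjoint slices of `out` is the concatenation. This is a floating-point illustration with hypothetical names (`attention_head`, `multi_head_attention`). It assumes Q, K, and V have already been projected, that h divides d, and it omits the final Wo projection.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// One head: attends over columns [off, off + dk) of Q, K, V (each N×d,
// row-major) and writes the same column slice of out.
void attention_head(const std::vector<float>& q, const std::vector<float>& k,
                    const std::vector<float>& v, std::vector<float>& out,
                    std::size_t N, std::size_t d, std::size_t off, std::size_t dk) {
    const float scale = 1.0f / std::sqrt(static_cast<float>(dk));
    std::vector<float> p(N);
    for (std::size_t i = 0; i < N; ++i) {
        float m = -std::numeric_limits<float>::infinity();
        for (std::size_t j = 0; j < N; ++j) {            // scores for row i
            float dot = 0.0f;
            for (std::size_t c = 0; c < dk; ++c)
                dot += q[i * d + off + c] * k[j * d + off + c];
            p[j] = dot * scale;
            m = std::max(m, p[j]);
        }
        float sum = 0.0f;
        for (std::size_t j = 0; j < N; ++j) {            // stable softmax
            p[j] = std::exp(p[j] - m);
            sum += p[j];
        }
        for (std::size_t c = 0; c < dk; ++c) {           // weighted sum of V
            float acc = 0.0f;
            for (std::size_t j = 0; j < N; ++j) acc += p[j] * v[j * d + off + c];
            out[i * d + off + c] = acc / sum;
        }
    }
}

// Runs h heads over disjoint column slices; no head reads another's data.
std::vector<float> multi_head_attention(const std::vector<float>& q,
                                        const std::vector<float>& k,
                                        const std::vector<float>& v,
                                        std::size_t N, std::size_t d, std::size_t h) {
    const std::size_t dk = d / h;                        // assumes h divides d
    std::vector<float> out(N * d, 0.0f);
    for (std::size_t head = 0; head < h; ++head)         // replicate or time-share
        attention_head(q, k, v, out, N, d, head * dk, dk);
    return out;
}
```

The loop over `head` is the one a hardware design would either replicate spatially or run sequentially on a shared engine; the result is the same either way because no iteration depends on another.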

6. The Full Transformer Block

Attention is one sub-layer. A complete transformer block adds a feed-forward network (two big GEMMs) and layer normalization — all of which you already know how to build.

| Sub-layer | Operation | Maps to (earlier day) |
|---|---|---|
| Multi-head attention | QKV + scores + softmax + A·V | Systolic array (D4) + softmax (D7) |
| Add & LayerNorm | residual + normalize | Norm folding ideas (D8) |
| Feed-forward (FFN) | 2 GEMMs + GELU | GEMM (D3) + activation LUT (D7) |
| Add & LayerNorm | residual + normalize | (D8) |

FFN Is Often the Bigger Cost

People obsess over attention, but in BERT/GPT the feed-forward network (whose hidden layer is typically 4× the model dimension) often has more FLOPs than attention itself at moderate sequence lengths. Both are GEMMs, so a strong systolic array benefits the whole transformer, not just attention.
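A rough MAC count makes the comparison concrete. The sketch below assumes four d×d projections in attention (Q, K, V, and the Wo output projection), the two N×N-sized matmuls, and an FFN with hidden width 4d. Softmax, LayerNorm, and GELU are ignored. `BlockMacs` and `block_macs` are illustrative names.

```cpp
#include <cstdint>

struct BlockMacs {
    std::uint64_t attention;  // 4 projections + Q·K^T + A·V
    std::uint64_t ffn;        // d -> 4d -> d
};

// MACs per transformer block for sequence length n and model width d.
constexpr BlockMacs block_macs(std::uint64_t n, std::uint64_t d) {
    return BlockMacs{
        4 * n * d * d + 2 * n * n * d,  // attention: 4·n·d² + 2·n²·d
        8 * n * d * d                   // FFN: 2 GEMMs of n × d × 4d
    };
}
```

Under these assumptions, `block_macs(128, 768)` gives 327,155,712 attention MACs against 603,979,776 FFN MACs, and the two only become equal at n = 2d, which is n = 1536 for d = 768.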

7. BERT / ViT on FPGA — Real Numbers

BERT-Base attention on Alveo U280 (HBM), INT8:

    Config:          12 layers, 12 heads, d=768, sequence N=128
    Per-layer GEMMs: 3 proj + Q·Kᵀ + A·V + 2 FFN
    Systolic array:  32×32 INT8 @ 300 MHz = ~0.6 TOPS/array, ×4 arrays
    HBM bandwidth:   ~460 GB/s feeds the weight streaming
    Latency:         ~2–5 ms / sequence (vs ~8–12 ms on CPU)
    Throughput:      ~200–400 sequences/sec
    Power:           ~45 W → competitive perf/W vs GPU for batch-1 latency

Vision Transformer (ViT-Base, 196 patches):

    Same engine: patches are just the "tokens"
    Real-time image classification at the edge with a smaller config
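The per-array TOPS figure follows from simple arithmetic, sketched below with one MAC counted as two operations. The second helper is an added assumption rather than something from the configuration above: it gives a compute-bound floor on latency for a given MAC count, assuming every MAC unit is busy every cycle, which real designs do not reach. Both function names are hypothetical.

```cpp
// Peak throughput of a rows×cols MAC array in TOPS; one MAC = 2 ops.
constexpr double peak_tops(int rows, int cols, double clock_hz) {
    return 2.0 * rows * cols * clock_hz / 1e12;
}

// Lower bound on latency in ms at 100% utilization (an idealization).
constexpr double compute_bound_ms(double macs, int arrays, int rows, int cols,
                                  double clock_hz) {
    return macs / (static_cast<double>(arrays) * rows * cols * clock_hz) * 1e3;
}
```

For the configuration listed, `peak_tops(32, 32, 300e6)` is about 0.61 TOPS per array. Feeding a workload's total MAC count into `compute_bound_ms` gives a floor that measured latency should sit above, which is a useful sanity check on any benchmark figure.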

From CNNs to LLMs — Same Foundation

Everything that makes a great CNN accelerator — INT8 quantization, a strong GEMM/systolic array, smart memory and pipelining — is exactly what a transformer accelerator needs. The leap from vision to language models is much smaller in hardware than in software. Master the fundamentals and you can accelerate both.

Day 13 — Key Takeaways

Next — Day 14: Benchmarking & Profiling — measuring real TOPS, latency, and power; MLPerf methodology; and finding the true bottleneck in your accelerator.
