Why is ReLU the most popular activation for FPGA?

ReLU is just max(0, x): in hardware, a sign-bit check (comparator) and a mux, with no multipliers, no lookup table, and at most one clock cycle. Unlike sigmoid or tanh, it needs no curve approximation. This near-free hardware cost, combined with good training behavior, makes ReLU the default activation in most CNN accelerators.

How are sigmoid and tanh implemented on FPGA?

Sigmoid and tanh are smooth nonlinear curves that can't be computed directly in cheap hardware. They are implemented via: (1) LUT lookup: store precomputed values in BRAM; (2) piecewise-linear approximation: a few line segments; (3) CORDIC: iterative shift-add, with accuracy set by the number of iterations. The LUT is fastest (one cycle); CORDIC reaches high accuracy without BRAM or multipliers, at the cost of latency.

How is softmax computed in hardware?

Softmax requires exponentials and a division: softmax(x_i) = exp(x_i)/Σexp(x_j). In hardware it's done by subtracting the max for numerical stability, computing exp via LUT or CORDIC, accumulating the sum, then dividing (or multiplying by the reciprocal). In a CNN classifier it's applied only at the final classification layer, so its cost is small relative to the convolution layers.

Activation Functions in Hardware — ReLU, Sigmoid, Softmax on FPGA

1. Why Activations Matter

Activation functions are what make neural networks nonlinear — without them, stacking layers would collapse into a single linear transform. After every conv/FC layer, an activation reshapes the output. On FPGA, the design challenge is computing these functions cheaply, because they run on every single output element.

[Figure: Common Activation Curves]

2. ReLU — Nearly Free in Hardware

ReLU = max(0, x). It's the cheapest possible activation: one comparator, one mux, one cycle, zero multipliers. This is a huge part of why ReLU dominates CNN design.

// relu.v — ReLU and Leaky ReLU activation (INT8/INT16)
module relu #(parameter DW = 16, parameter LEAKY = 0)(
  input  wire signed [DW-1:0] x,
  output wire signed [DW-1:0] y
);
  // ReLU: y = (x > 0) ? x : 0
  // Leaky: y = (x > 0) ? x : x >>> 4  (slope 1/16 for negatives)
  generate
    if (LEAKY == 0)
      assign y = x[DW-1] ? {DW{1'b0}} : x;        // sign bit set → negative → 0
    else
      assign y = x[DW-1] ? (x >>> 4) : x;         // leaky negative slope
  endgenerate
endmodule

ReLU Variant	Formula	Hardware Cost
ReLU	max(0, x)	1 comparator + mux
Leaky ReLU	x>0 ? x : αx (fixed α: 0.01 in software, a power of two such as 1/16 in hardware)	+ 1 shifter
ReLU6	min(6, max(0, x))	2 comparators + mux
PReLU	x>0 ? x : αx (learned α)	+ 1 multiplier
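
As a cross-check for the table above, the integer behavior of the multiplier-free variants can be sketched as a small Python reference model. The function names, the default shift of 4 (chosen to match the `x >>> 4` in relu.v) and the Q4.4 scaling used for ReLU6 are assumptions for illustration, not part of the original design.

```python
def relu(x: int) -> int:
    """ReLU on a signed integer: negative inputs become 0."""
    return x if x > 0 else 0


def leaky_relu(x: int, shift: int = 4) -> int:
    """Leaky ReLU with a power-of-two negative slope (1/16 for shift=4).

    Python's >> on a negative int is an arithmetic shift that rounds toward
    minus infinity, the same behavior as Verilog's >>> on a signed value.
    """
    return x if x > 0 else x >> shift


def relu6(x: int, frac_bits: int = 4) -> int:
    """ReLU6 in fixed point: the constant 6 is scaled by 2**frac_bits (96 in Q4.4)."""
    six = 6 << frac_bits
    return min(six, max(0, x))
```

One detail worth noticing in such a model: because the shift rounds toward minus infinity, small negative inputs such as -1 stay at -1 instead of reaching 0.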

3. LUT-Based Approximation (Sigmoid/Tanh)

The simplest way to compute a smooth curve: precompute it and store the values in a BRAM lookup table. Input bits index the table; output is the stored value.

LUT Sizing for Sigmoid (INT8 output):
  Input range:   clamp to [-8, +8] (sigmoid saturates beyond)
  Address bits:  8 → 256 entries
  Each entry:    8-bit output
  Total LUT:     256 × 8 = 2 Kbit → less than 1 BRAM18
  Accuracy:      with 256 entries over [-8, 8], step = 16/256 = 0.0625 input resolution
                 max error ≈ 0.004 (excellent for inference)
  Latency:       1 cycle (single BRAM read)
  Throughput:    1 result/clock (fully pipelined)

// sigmoid_lut.v: LUT-based sigmoid approximation
module sigmoid_lut #(parameter DW = 8)(
  input  wire                 clk,
  input  wire signed [DW-1:0] x,       // input (Q4.4 fixed point, range [-8, 8))
  output reg         [DW-1:0] y        // sigmoid output (Q0.8, 0..255/256)
);
  reg [DW-1:0] lut [0:(1<<DW)-1];      // 256 entries for DW = 8

  // Precomputed at synthesis (generated by Python):
  // entry i holds sigmoid((i - 128)/16)
  initial $readmemh("sigmoid_table.hex", lut);

  // Offset-binary index: x + 128 maps -128..127 to 0..255, which is just
  // an inverted sign bit. An 8-bit Q4.4 input cannot leave [-8, 8), so no
  // clamp is needed here; saturate wider inputs before they reach this module.
  wire [DW-1:0] addr = {~x[DW-1], x[DW-2:0]};

  always @(posedge clk)
    y <= lut[addr];                    // 1-cycle BRAM lookup
endmodule
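
The module reads `sigmoid_table.hex`, but the script that produces it is not shown. A minimal sketch of such a generator follows, assuming the index is the Q4.4 input plus 128 and the output is unsigned Q0.8; the function names are hypothetical, and the scaling by 256 with a clip at 255 is one reasonable choice, not necessarily the one used for the original table.

```python
import math


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def build_sigmoid_table(entries: int = 256, frac_bits: int = 4, out_bits: int = 8):
    """Entry i holds sigmoid((i - entries/2) / 2**frac_bits) as unsigned Q0.out_bits."""
    half = entries // 2
    top = (1 << out_bits) - 1          # 255: 1.0 itself is not representable in Q0.8
    table = []
    for i in range(entries):
        v = (i - half) / (1 << frac_bits)
        code = round(sigmoid(v) * (1 << out_bits))
        table.append(min(code, top))   # clip the few entries that round up to 256
    return table


def to_hex_lines(table, out_bits: int = 8) -> str:
    """One hex word per line, a layout that $readmemh accepts."""
    digits = (out_bits + 3) // 4
    return "\n".join(f"{code:0{digits}x}" for code in table)


table = build_sigmoid_table()

# Worst-case difference between the stored code and the exact sigmoid
# at each table point (output rounding plus the clip at 255/256).
max_err = max(abs(code / 256 - sigmoid((i - 128) / 16))
              for i, code in enumerate(table))
```

Writing `to_hex_lines(table)` to `sigmoid_table.hex` gives the file the module expects. Under these assumptions `max_err` stays below 0.004, and the largest error sits at the top of the range where the table clips to 255/256.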

4. Piecewise-Linear Approximation

An alternative to LUTs: approximate the curve with a few straight-line segments. Cheaper in BRAM, slightly more logic. Common for tanh/sigmoid when BRAM is scarce.

[Figure: Piecewise-Linear Sigmoid (3 segments)]

Hard Sigmoid (piecewise, used in MobileNetV3):
  y = 0            if x < -3
  y = 1            if x > +3
  y = x/6 + 0.5    otherwise
Hardware: 2 comparators + a few shift-adds for the constant 1/6 (for example 1/8 + 1/32 + 1/128 ≈ 0.164). No LUT, no multiplier.
Used in production mobile models → evidence that cheap approximations work.
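
To make the shift-add cost concrete, here is a sketch of the hard sigmoid in integer arithmetic. It assumes a Q4.4 input and an unsigned Q0.8 output, the same formats as the LUT example; the function name and the four-term approximation of 8/3 are illustrative choices, not taken from MobileNetV3 or from the original design.

```python
def hard_sigmoid_q(x: int) -> int:
    """Hard sigmoid on a Q4.4 input (value = x/16); returns unsigned Q0.8 (value = y/256)."""
    if x < -48:                        # x_real < -3
        return 0
    if x > 48:                         # x_real > +3
        return 255                     # closest Q0.8 code to 1.0
    # In output codes: y*256 = x*(256/96) + 128 = x*(8/3) + 128.
    # The constant 8/3 is approximated as 2 + 1/2 + 1/8 + 1/32 = 2.65625,
    # so the slope costs three shifts and three adds, no multiplier.
    scaled = (x << 1) + (x >> 1) + (x >> 3) + (x >> 5)
    return min(255, max(0, scaled + 128))


def hard_sigmoid_ref(x: int) -> float:
    """Floating-point reference for the same Q4.4 input."""
    return min(1.0, max(0.0, (x / 16) / 6 + 0.5))
```

The truncating shifts and the rounded constant cost a few output codes of accuracy; swapping the slope for 1/4 or 1/8, as some hard-sigmoid variants do, reduces the whole linear segment to a single shift.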

5. CORDIC for High Accuracy

CORDIC (COordinate Rotation DIgital Computer) computes transcendental functions (sin, cos, tanh, exp) using only shifts and adds — no multipliers. It's iterative: more iterations = more accuracy.

Method	Accuracy	BRAM	Latency	Best For
LUT	Good (table size)	1+ BRAM	1 cycle	Fast, BRAM available
Piecewise-linear	Moderate	None	1–2 cycles	BRAM-scarce, mobile
CORDIC	High (iterations)	None	N iterations	Accuracy-critical
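
The section gives no listing for CORDIC, so here is a software sketch of the hyperbolic rotation mode, which yields sinh and cosh and from them exp and tanh. The 16 fractional bits, the 14-step schedule and the function names are assumptions for illustration. The repeated iterations at 4 and 13 are the usual requirement for hyperbolic CORDIC to converge, and the input must stay within roughly ±1.1, so real designs add range reduction in front.

```python
import math

FRAC = 16                     # fractional bits of the fixed-point format (assumed)
ONE = 1 << FRAC


def cordic_schedule(n: int):
    """Shift amounts 1..n, with 4 and 13 repeated so the iteration converges."""
    seq = []
    for i in range(1, n + 1):
        seq.append(i)
        if i in (4, 13):
            seq.append(i)
    return seq


def cordic_sinh_cosh(theta: float, n: int = 14):
    """Return (sinh(theta), cosh(theta)) using only shifts and adds in the loop."""
    seq = cordic_schedule(n)
    # Hyperbolic CORDIC scales the vector by K = prod(sqrt(1 - 2^-2i));
    # starting from x = 1/K cancels that gain. In hardware this is a constant.
    gain = 1.0
    for i in seq:
        gain *= math.sqrt(1.0 - 2.0 ** (-2 * i))
    x = round(ONE / gain)
    y = 0
    z = round(theta * ONE)
    for i in seq:
        step = round(math.atanh(2.0 ** -i) * ONE)   # small constant table in hardware
        if z >= 0:
            x, y, z = x + (y >> i), y + (x >> i), z - step
        else:
            x, y, z = x - (y >> i), y - (x >> i), z + step
    return y / ONE, x / ONE


s, c = cordic_sinh_cosh(0.5)
exp_half = s + c              # exp(t) = cosh(t) + sinh(t): no divider needed
tanh_half = s / c             # tanh(t) = sinh(t) / cosh(t): needs one division
```

Each extra iteration adds about one bit of accuracy, which is the "more iterations = more accuracy" trade-off. Note that tanh still needs a final division, whereas exp falls out of a single addition.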

6. Softmax — The Classification Layer

Softmax converts the final layer's scores into probabilities. It needs exponentials and a division — the most expensive activation — but it runs only once at the output, so its cost is negligible vs the conv layers.

Softmax: softmax(x_i) = exp(x_i) / Σ_j exp(x_j)
Numerically stable hardware version:
  1. Find max: m = max(x_0 ... x_n)
  2. Subtract max: x_i' = x_i − m (prevents exp overflow)
  3. Exp (LUT/CORDIC): e_i = exp(x_i')
  4. Accumulate: S = Σ e_i
  5. Divide: out_i = e_i / S (or × reciprocal)
Cost: only for the final ~1000 classes → tiny vs millions of conv MACs
Tip: for top-1 classification you can skip softmax entirely: argmax of the logits gives the same answer!
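
The five steps of the stable version can be sketched in Python with an exp lookup table standing in for the hardware LUT. The Q4.4 logit format, the 256-entry table scaled to 255 and the function names are assumptions for illustration; the final division is left in floating point where hardware would use a divider or a reciprocal multiply.

```python
import math

FRAC_BITS = 4                 # logits in Q4.4 (assumed format)

# exp(-k/16) for k = 0..255, scaled to 8 bits. After the max is subtracted
# every exponent is <= 0, so one table over non-positive inputs is enough.
EXP_LUT = [round(math.exp(-k / (1 << FRAC_BITS)) * 255) for k in range(256)]


def softmax_fixed(logits):
    """logits: list of signed Q4.4 integers. Returns a list of probabilities."""
    m = max(logits)                                  # 1. find max
    shifted = [x - m for x in logits]                # 2. subtract max (all <= 0)
    e = [EXP_LUT[min(-d, 255)] for d in shifted]     # 3. exp via LUT, index = m - x_i
    s = sum(e)                                       # 4. accumulate (the max contributes 255, so s > 0)
    return [ei / s for ei in e]                      # 5. divide (or multiply by 1/S)


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


logits = [32, 16, 0, -16]     # 2.0, 1.0, 0.0, -1.0 in Q4.4
probs = softmax_fixed(logits)
```

Because subtracting the max pins the largest exponent at exp(0), the accumulator can never be zero and the table never has to represent values above 1. `argmax(probs)` and `argmax(logits)` return the same index here.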

The argmax Shortcut

If you only need the predicted class (not the probability), you can skip softmax completely — the largest logit is the same as the largest probability. Many edge inference designs do exactly this, saving the entire exp/divide hardware.

7. Fused Activation

Activations are almost always fused into the previous layer — the conv/GEMM engine writes its output straight through the ReLU before storing it. This saves a full memory round-trip.

// Fused conv output + requantize + ReLU (single pipeline stage)
module conv_requant_relu #(parameter ACC_W=32, parameter OUT_W=8)(
  input  wire                    clk,
  input  wire signed [ACC_W-1:0] acc_in,   // 32-bit accumulator from conv
  input  wire        [7:0]       shift,    // requant shift amount (power-of-two scale)
  output reg  signed [OUT_W-1:0] y         // INT8 activated output (0..127 after ReLU)
);
  localparam signed [ACC_W-1:0] Y_MAX = (1 << (OUT_W-1)) - 1;   // 127 for OUT_W = 8

  wire signed [ACC_W-1:0] scaled = acc_in >>> shift;               // requantize (arithmetic shift, truncates)
  wire signed [ACC_W-1:0] relu   = scaled[ACC_W-1] ? 0 : scaled;   // ReLU
  // After ReLU the value is never negative, so only the upper bound needs saturation
  always @(posedge clk)
    y <= (relu > Y_MAX) ? {1'b0, {(OUT_W-1){1'b1}}} : relu[OUT_W-1:0];
endmodule
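
A bit-accurate software model is a convenient way to generate expected values for a testbench of this stage. The sketch below mirrors the shift, ReLU and saturation order of the module; the function name is hypothetical, and it assumes the accumulator value already fits the hardware width, since Python integers are unbounded.

```python
def requant_relu(acc: int, shift: int, out_bits: int = 8) -> int:
    """Shift-based requantization, then ReLU, then saturation to the signed maximum."""
    scaled = acc >> shift                      # arithmetic shift: rounds toward -infinity
    activated = scaled if scaled > 0 else 0    # ReLU
    y_max = (1 << (out_bits - 1)) - 1          # 127 for INT8
    return min(activated, y_max)               # only the upper bound can be exceeded
```

For example, an accumulator of 1000 with a shift of 4 gives 62, any negative accumulator gives 0, and large positive accumulators saturate at 127.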

Day 7 — Key Takeaways

✅ Activations add nonlinearity — run on every output element, so must be cheap
✅ ReLU = max(0,x) — one comparator + mux, the default for a reason
✅ LUT approximation: store curve in BRAM, 1-cycle lookup, <1 BRAM for sigmoid
✅ Piecewise-linear (hard sigmoid): comparators + shift-add, no BRAM — used in MobileNetV3
✅ CORDIC: shift-add iterations for high accuracy without multipliers
✅ Softmax: stable version subtracts max; only at output, so cost is tiny
✅ argmax shortcut: skip softmax for top-1 classification
✅ Fuse activation into the conv/GEMM output to save a memory round-trip

Next — Day 8: Pooling Layers & Normalization — max/average pooling hardware, batch normalization folding, and fused BN+ReLU+Pool.
