
Activation Functions in Hardware

ReLU, Leaky ReLU, Sigmoid, Tanh, and Softmax in synthesizable hardware — LUT approximation, CORDIC, piecewise-linear methods, and pipelined activation units in Verilog.

By EcrioniX Engineering Team · Published June 14, 2026 · ~4,500 words · 14 min read

1. Why Activations Matter

Activation functions are what make neural networks nonlinear — without them, stacking layers would collapse into a single linear transform. After every conv/FC layer, an activation reshapes the output. On FPGA, the design challenge is computing these functions cheaply, because they run on every single output element.
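The "collapse" claim can be checked with a few integers. The sketch below is only an illustration: the matrices `W1`, `W2` and the input `x` are made-up values, not taken from any real network. It shows that two stacked linear layers equal one layer with the product matrix, and that inserting a ReLU between them breaks that equivalence.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    """Element-wise max(0, x)."""
    return [max(0, e) for e in v]

# Hypothetical 2x2 layers and input, chosen only for the demonstration
W1 = [[1, -2], [3, 1]]
W2 = [[2, 1], [-1, 1]]
x = [1, 2]

stacked = matvec(W2, matvec(W1, x))          # two linear layers back to back
collapsed = matvec(matmul(W2, W1), x)        # one layer with the product matrix
with_relu = matvec(W2, relu(matvec(W1, x)))  # same layers with ReLU in between

print(stacked, collapsed, with_relu)
```

The first two results are identical, so the second layer added no expressive power; the third differs because the ReLU zeroed the negative intermediate value.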

Figure: Common Activation Curves (ReLU, Sigmoid, Tanh). ReLU is piecewise-linear (cheap); sigmoid/tanh are smooth curves (need approximation).

2. ReLU — Nearly Free in Hardware

ReLU = max(0, x). It's the cheapest possible activation: one comparator, one mux, one cycle, zero multipliers. This is a huge part of why ReLU dominates CNN design.

```verilog
// relu.v: ReLU and Leaky ReLU activation (INT8/INT16)
module relu #(parameter DW = 16, parameter LEAKY = 0)(
    input  wire signed [DW-1:0] x,
    output wire signed [DW-1:0] y
);
    // ReLU:  y = (x > 0) ? x : 0
    // Leaky: y = (x > 0) ? x : x >>> 4   (slope 1/16 for negatives)
    generate
        if (LEAKY == 0)
            assign y = x[DW-1] ? {DW{1'b0}} : x;   // sign bit set → negative → 0
        else
            assign y = x[DW-1] ? (x >>> 4) : x;    // leaky negative slope
    endgenerate
endmodule
```
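| ReLU Variant | Formula | Hardware Cost |
|---|---|---|
| ReLU | max(0, x) | 1 comparator + mux |
| Leaky ReLU | x>0 ? x : αx (fixed α, typically 0.01) | + 1 shifter, if α is rounded to a power of two (1/16 in relu.v) |
| ReLU6 | min(6, max(0, x)) | 2 comparators + mux |
| PReLU | x>0 ? x : αx (learned α) | + 1 multiplier |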
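A bit-accurate software model is handy for checking the RTL against. The following is a minimal Python sketch of the four variants on plain integers; the function names, the Q4.4 scaling used for the constant 6, and the 8-bit fixed-point format for the PReLU slope are assumptions made for illustration, not part of the original design.

```python
def relu(x):
    """max(0, x) on a signed integer."""
    return x if x > 0 else 0

def leaky_relu_shift(x, k=4):
    """Leaky ReLU with negative slope 2**-k.
    Python's >> on a negative int floors, like Verilog's >>> on a signed value."""
    return x if x > 0 else x >> k

def relu6(x, frac_bits=4):
    """Clamp to [0, 6]; the constant 6 is scaled to the assumed fixed-point format."""
    six = 6 << frac_bits
    return min(six, max(0, x))

def prelu(x, alpha_q, frac_bits=8):
    """PReLU with a learned slope stored as alpha_q / 2**frac_bits (one multiply)."""
    return x if x > 0 else (x * alpha_q) >> frac_bits

print(relu(-5), leaky_relu_shift(-32), relu6(200), prelu(-64, 64))
```

One detail the model makes visible: an arithmetic right shift rounds toward negative infinity, so `leaky_relu_shift(-1)` stays at -1 instead of reaching 0.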

3. LUT-Based Approximation (Sigmoid/Tanh)

The simplest way to compute a smooth curve: precompute it and store the values in a BRAM lookup table. Input bits index the table; output is the stored value.

LUT Sizing for Sigmoid (INT8 output):
- Input range: clamp to [-8, +8) (sigmoid saturates beyond)
- Address bits: 8 → 256 entries
- Each entry: 8-bit output
- Total LUT: 256 × 8 = 2 Kbit → less than 1 BRAM18
- Accuracy: with 256 entries over [-8, +8), step = 16/256 = 0.0625 input resolution; max error ≈ 0.004, about one output LSB (1/256), for inputs already on that grid (excellent for inference)
- Latency: 1 cycle (single BRAM read)
- Throughput: 1 result/clock (fully pipelined)
```verilog
// sigmoid_lut.v: LUT-based sigmoid approximation
// The 256-entry table and 8-bit index assume DW = 8.
module sigmoid_lut #(parameter DW = 8)(
    input  wire                 clk,
    input  wire signed [DW-1:0] x,   // input (Q4.4 fixed point)
    output reg         [DW-1:0] y    // sigmoid output (Q0.8, 0..1)
);
    reg [DW-1:0] lut [0:255];
    // Loaded at synthesis (generated by Python): lut[i] = sigmoid((i - 128) / 16)
    initial $readmemh("sigmoid_table.hex", lut);

    // An 8-bit Q4.4 input already spans [-8, +8), so comparing it against
    // -128/127 can never trigger; wider values must be saturated upstream.
    // Offset-binary index: inverting the sign bit adds 128 (-128..127 → 0..255).
    wire [7:0] addr = {~x[DW-1], x[DW-2:0]};

    always @(posedge clk)
        y <= lut[addr];   // 1-cycle BRAM lookup
endmodule
```
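The listing loads `sigmoid_table.hex` but the generator itself is not shown. Here is a minimal Python sketch of one way to produce it, assuming the index mapping implied by the `x + 128` addressing (entry `i` holds the sigmoid of `(i - 128) / 16`) and a Q0.8 output clamped to 255. The function names are hypothetical.

```python
import math

def build_sigmoid_table():
    """Return 256 Q0.8 entries; entry i holds sigmoid((i - 128) / 16)."""
    table = []
    for i in range(256):
        x = (i - 128) / 16.0                 # undo the +128 offset and Q4.4 scaling
        s = 1.0 / (1.0 + math.exp(-x))
        table.append(min(255, round(s * 256)))  # clamp: 1.0 does not fit in 8 bits
    return table

def write_hex(path="sigmoid_table.hex"):
    """Write one two-digit hex value per line, the layout $readmemh accepts."""
    with open(path, "w") as f:
        for v in build_sigmoid_table():
            f.write(f"{v:02x}\n")

table = build_sigmoid_table()
print(table[0], table[128], table[255])
```

Calling `write_hex()` once before synthesis produces the file. The midpoint entry should read 0x80 (sigmoid(0) = 0.5), which is a quick sanity check on the index offset.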

4. Piecewise-Linear Approximation

An alternative to LUTs: approximate the curve with a few straight-line segments. Cheaper in BRAM, slightly more logic. Common for tanh/sigmoid when BRAM is scarce.

Figure: Piecewise-Linear Sigmoid (3 segments). Segments: y = 0, y = 0.25x + 0.5, y = 1. Green = 3-segment approximation; yellow = true sigmoid (just a few comparators + 1 mult).
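Hard Sigmoid (piecewise, used in MobileNetV3):
- y = 0 if x < -3
- y = 1 if x > +3
- y = x/6 + 0.5 otherwise

Hardware: 2 comparators plus a constant scale by 1/6. Because 1/6 is not a power of two, the scale takes a few shift-adds (1/6 ≈ 1/8 + 1/32 + 1/128) rather than a single shift. No LUT, no multiplier. Used in production mobile models, which shows that cheap approximations work.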
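To make the hard sigmoid's shift-add cost concrete, here is a minimal integer model in Python. It assumes a Q4.4 input and a Q0.8 output (the same formats as the sigmoid LUT), so the slope becomes 256/(16·6) = 8/3, which the sketch approximates as 2 + 1/2 + 1/8 + 1/32. The function name and the choice of four terms are assumptions for illustration.

```python
def hard_sigmoid_q(x_q):
    """Hard sigmoid on a Q4.4 integer input, returning a Q0.8 integer (0..255)."""
    if x_q < -48:          # x < -3.0 in Q4.4
        return 0
    if x_q > 48:           # x > +3.0 in Q4.4
        return 255
    # x * 8/3 approximated with shifts and adds only (8/3 ≈ 2 + 1/2 + 1/8 + 1/32)
    scaled = (x_q << 1) + (x_q >> 1) + (x_q >> 3) + (x_q >> 5)
    y = scaled + 128       # + 0.5 in Q0.8
    return max(0, min(255, y))

def hard_sigmoid_ref(x):
    """Floating-point reference: clamp(x/6 + 0.5, 0, 1)."""
    return max(0.0, min(1.0, x / 6.0 + 0.5))

print(hard_sigmoid_q(0), hard_sigmoid_q(16), hard_sigmoid_q(-16))
```

With these four terms the model stays within a few output LSBs of the floating-point reference across the whole Q4.4 range; dropping terms trades accuracy for adders.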

5. CORDIC for High Accuracy

CORDIC (COordinate Rotation DIgital Computer) computes transcendental functions (sin, cos, tanh, exp) using only shifts and adds — no multipliers. It's iterative: more iterations = more accuracy.
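As a sketch of what "only shifts and adds" looks like, the Python model below runs hyperbolic CORDIC in rotation mode on integers. The 16 fractional bits, the default of 16 iterations, and all function names are assumptions. Two well-known properties of the hyperbolic variant show up in the code: certain iterations (4, 13, 40, ...) must be repeated for convergence, and the input range is limited to roughly |θ| < 1.1, so larger arguments need range reduction first.

```python
import math

FRAC = 16                 # fractional bits of the fixed-point format (assumed)
ONE = 1 << FRAC

def cordic_schedule(n_iter):
    """Shift amounts 1,2,3,4,4,5,...: iterations 4, 13, 40, ... are repeated."""
    seq, i, next_repeat = [], 1, 4
    while len(seq) < n_iter:
        seq.append(i)
        if i == next_repeat and len(seq) < n_iter:
            seq.append(i)
            next_repeat = 3 * next_repeat + 1
        i += 1
    return seq

def cordic_sinh_cosh(theta, n_iter=16):
    """Return (sinh, cosh) of theta using integer shifts and adds in the loop."""
    seq = cordic_schedule(n_iter)
    gain = 1.0
    for i in seq:
        gain *= math.sqrt(1.0 - 2.0 ** (-2 * i))
    x = round(ONE / gain)     # pre-scaled start value cancels the CORDIC gain
    y = 0
    z = round(theta * ONE)    # residual angle, driven toward zero
    for i in seq:
        atanh_i = round(math.atanh(2.0 ** -i) * ONE)   # a small ROM in hardware
        if z >= 0:
            x, y, z = x + (y >> i), y + (x >> i), z - atanh_i
        else:
            x, y, z = x - (y >> i), y - (x >> i), z + atanh_i
    return y / ONE, x / ONE

def cordic_exp(theta):
    s, c = cordic_sinh_cosh(theta)
    return s + c              # exp = sinh + cosh, one extra add

def cordic_tanh(theta):
    s, c = cordic_sinh_cosh(theta)
    return s / c              # hardware needs a divider (or a second CORDIC pass)

print(cordic_tanh(0.5), math.tanh(0.5))
```

Each loop iteration maps to one pipeline stage or one cycle of an iterative unit, which is where the "N iterations" latency comes from.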

| Method | Accuracy | BRAM | Latency | Best For |
|---|---|---|---|---|
| LUT | Good (table size) | 1+ BRAM | 1 cycle | Fast, BRAM available |
| Piecewise-linear | Moderate | None | 1–2 cycles | BRAM-scarce, mobile |
| CORDIC | High (iterations) | None | N iterations | Accuracy-critical |

6. Softmax — The Classification Layer

Softmax converts the final layer's scores into probabilities. It needs exponentials and a division — the most expensive activation — but it runs only once at the output, so its cost is negligible vs the conv layers.

Softmax: softmax(x_i) = exp(x_i) / Σ_j exp(x_j)

Numerically stable hardware version:
1. Find max: m = max(x_0 ... x_n)
2. Subtract max: x_i' = x_i − m (prevents exp overflow)
3. Exp (LUT/CORDIC): e_i = exp(x_i')
4. Accumulate: S = Σ e_i
5. Divide: out_i = e_i / S (or × reciprocal)

Cost: only for the final ~1000 classes → tiny vs millions of conv MACs.

Tip: for top-1 classification you can skip softmax entirely, because argmax of the logits gives the same answer!
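The five steps can be sketched as an integer model in Python. This is an illustration under stated assumptions, not the article's RTL: logits are taken to be Q4.4 integers, the exponential is a 256-entry LUT indexed by the non-negative difference `m - x_i`, and outputs carry 8 fractional bits. All names are hypothetical.

```python
import math

FRAC_IN = 4      # logits in Q4.4 (assumed)
FRAC_OUT = 8     # exp values and probabilities with 8 fractional bits (assumed)

# EXP_LUT[d] = exp(-d / 16) scaled by 256; entry 0 is 256, so entries need 9 bits
EXP_LUT = [round(math.exp(-d / (1 << FRAC_IN)) * (1 << FRAC_OUT)) for d in range(256)]

def softmax_fixed(logits):
    m = max(logits)                                   # 1. find max
    diffs = [min(m - x, 255) for x in logits]         # 2. subtract max (as m - x_i >= 0)
    exps = [EXP_LUT[d] for d in diffs]                # 3. exp via LUT
    total = sum(exps)                                 # 4. accumulate (>= 256, never zero)
    return [(e << FRAC_OUT) // total for e in exps]   # 5. divide

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

logits = [32, 16, 0]                 # 2.0, 1.0, 0.0 in Q4.4
probs = softmax_fixed(logits)
print(probs, argmax(logits) == argmax(probs))
```

Subtracting the maximum means every LUT index is non-negative and every exponential is at most 1, so the table needs no entries for positive arguments. The final comparison also shows the argmax shortcut: the winning index is the same before and after softmax.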

The argmax Shortcut

If you only need the predicted class (not the probability), you can skip softmax completely — the largest logit is the same as the largest probability. Many edge inference designs do exactly this, saving the entire exp/divide hardware.

7. Fused Activation

Activations are almost always fused into the previous layer — the conv/GEMM engine writes its output straight through the ReLU before storing it. This saves a full memory round-trip.

```verilog
// Fused conv output + requantize + ReLU (single pipeline stage)
module conv_requant_relu #(parameter ACC_W = 32, parameter OUT_W = 8)(
    input  wire                    clk,
    input  wire signed [ACC_W-1:0] acc_in,  // 32-bit accumulator from conv
    input  wire        [7:0]       shift,   // requant shift amount
    output reg  signed [OUT_W-1:0] y        // INT8 activated output
);
    wire signed [ACC_W-1:0] scaled = acc_in >>> shift;            // requantize (arithmetic shift)
    wire signed [ACC_W-1:0] relu   = scaled[ACC_W-1] ? 0 : scaled; // ReLU

    // ReLU output is never negative, so only the upper bound can overflow:
    // saturate to [0, 127]
    always @(posedge clk)
        y <= (relu > 127) ? 8'sd127 : relu[OUT_W-1:0];
endmodule
```
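A reference model of this fused stage is useful for generating test vectors. The following Python sketch mirrors the requantize, ReLU and saturate steps on plain integers; the function name and the `out_max` parameter are assumptions.

```python
def conv_requant_relu(acc, shift, out_max=127):
    """Model of the fused stage: arithmetic shift, ReLU, then saturate to INT8."""
    scaled = acc >> shift                 # Python's >> floors, like signed >>> in Verilog
    relu = scaled if scaled > 0 else 0    # negative results become 0
    return min(relu, out_max)             # only the upper bound can be exceeded

print(conv_requant_relu(1000, 3), conv_requant_relu(2000, 3), conv_requant_relu(-500, 3))
```

Because the ReLU runs before saturation, the result always lies in [0, 127], so the negative half of the INT8 range is never used by this stage.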

Day 7 — Key Takeaways

Next — Day 8: Pooling Layers & Normalization — max/average pooling hardware, batch normalization folding, and fused BN+ReLU+Pool.
