ReLU, Leaky ReLU, Sigmoid, Tanh, and Softmax in synthesizable hardware — LUT approximation, CORDIC, piecewise-linear methods, and pipelined activation units in Verilog.
Activation functions are what make neural networks nonlinear — without them, stacking layers would collapse into a single linear transform. After every conv/FC layer, an activation reshapes the output. On FPGA, the design challenge is computing these functions cheaply, because they run on every single output element.
ReLU = max(0, x). It's the cheapest possible activation: one comparator (in two's complement, just the sign bit), one mux, zero multipliers, and at most one cycle (none at all if left combinational). This is a huge part of why ReLU dominates CNN hardware design.
```verilog
// relu.v - ReLU and Leaky ReLU activation (INT8/INT16)
module relu #(parameter DW = 16, parameter LEAKY = 0)(
    input  wire signed [DW-1:0] x,
    output wire signed [DW-1:0] y
);
    // ReLU:  y = (x > 0) ? x : 0
    // Leaky: y = (x > 0) ? x : x >>> 4   (slope 1/16 for negatives)
    generate
        if (LEAKY == 0)
            assign y = x[DW-1] ? {DW{1'b0}} : x;   // sign bit set → negative → 0
        else
            assign y = x[DW-1] ? (x >>> 4) : x;    // leaky negative slope
    endgenerate
endmodule
```

| ReLU Variant | Formula | Hardware Cost |
|---|---|---|
| ReLU | max(0, x) | 1 comparator + mux |
| Leaky ReLU | x>0 ? x : αx (small α, e.g. 0.01) | + 1 constant shift (α rounded to a power of two, e.g. 1/16) |
| ReLU6 | min(6, max(0, x)) | 2 comparators + mux |
| PReLU | x>0 ? x : αx (learned α) | + 1 multiplier |
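For simulation it helps to have a bit-accurate software reference to compare the RTL against. Below is a minimal Python sketch of the variants in the table, assuming integer (fixed-point) inputs; the function names and the `frac_bits` parameter for ReLU6 are illustrative, not part of `relu.v`. Python's `>>` on a negative integer floors, which matches Verilog's `>>>` on a signed wire, so the leaky branch reproduces the RTL exactly, including `-1 >>> 4 = -1` rather than 0.

```python
def relu_ref(x: int) -> int:
    """ReLU on a two's-complement integer: negative (sign bit set) -> 0."""
    return 0 if x < 0 else x


def leaky_relu_ref(x: int, k: int = 4) -> int:
    """Leaky ReLU with slope 2**-k, like relu.v with LEAKY=1 (k=4 -> 1/16).

    Python's >> floors negative values, matching Verilog >>> on a signed wire.
    """
    return x >> k if x < 0 else x


def relu6_ref(x: int, frac_bits: int) -> int:
    """ReLU6 for a fixed-point input with `frac_bits` fractional bits (hypothetical format)."""
    six = 6 << frac_bits              # the constant 6 expressed in the same format
    return min(max(x, 0), six)        # two compares: clamp to [0, 6]
```

The ReLU6 constant is the only format-dependent part: the "6" must be scaled into the layer's fixed-point format before it reaches the comparator.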
The simplest way to compute a smooth curve: precompute it and store the values in a BRAM lookup table. Input bits index the table; output is the stored value.
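The table itself is computed offline. The module that follows loads `sigmoid_table.hex`, and its original comment says the file comes from a Python script. Here is a minimal sketch of such a generator, assuming the module's formats (Q4.4 input, Q0.8 output) and its address mapping `addr = x + 128`, so entry `i` holds sigmoid of the real input `(i - 128) / 16`. The function names are illustrative.

```python
import math


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def sigmoid_table() -> list[int]:
    """256 Q0.8 entries; entry i is sigmoid of the Q4.4 input x = i - 128."""
    table = []
    for i in range(256):
        x = (i - 128) / 16.0            # Q4.4 integer -> real value
        q = round(256 * sigmoid(x))     # scale to Q0.8
        table.append(min(q, 255))       # 1.0 (=256) is not representable in 8 bits
    return table


def write_hex(path: str, table: list[int]) -> None:
    """One two-digit hex value per line, a layout $readmemh accepts."""
    with open(path, "w") as f:
        for v in table:
            f.write(f"{v:02x}\n")


# Usage: write_hex("sigmoid_table.hex", sigmoid_table())
```

Rounding, rather than truncating, keeps each entry within half an LSB of the true curve; the saturation to 255 only affects the last few entries, where sigmoid is already within one LSB of 1.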
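```verilog
// sigmoid_lut.v - LUT-based sigmoid approximation
module sigmoid_lut (
    input  wire              clk,
    input  wire signed [7:0] x,   // input, Q4.4 fixed point: -8.0 .. +7.9375
    output reg         [7:0] y    // sigmoid output, Q0.8: 0 .. 255/256
);
    // One entry per possible 8-bit input, so no clamping logic is needed here.
    // Q4.4 already spans -8..+7.94, where sigmoid is within 1 LSB of 0 or 1;
    // a wider input would be saturated to this range before indexing.
    // lut[i] = min(255, round(256 * sigmoid((i - 128) / 16))), generated offline
    // (e.g. by a Python script) and loaded into the ROM at synthesis time.
    reg [7:0] lut [0:255];
    initial $readmemh("sigmoid_table.hex", lut);

    // Offset-binary index: addr = x + 128, i.e. invert the sign bit.
    // x = -128 → 0, x = 0 → 128, x = +127 → 255
    wire [7:0] addr = {~x[7], x[6:0]};

    always @(posedge clk)
        y <= lut[addr];   // registered read: 1-cycle latency, lets synthesis infer BRAM
endmodule
```

An alternative to LUTs is to approximate the curve with a few straight-line segments. This uses no BRAM, at the cost of slightly more logic (a few comparators, shifts, and adds), which makes it common for tanh/sigmoid when BRAM is scarce.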
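One commonly cited segment set is the PLAN approximation, whose slopes (1/4, 1/8, 1/32) are all powers of two, so each segment costs a shift and an add. Below is a minimal bit-accurate Python model, assuming the same Q4.4 input and Q0.8 output as the LUT module; the function name is illustrative, and the constants are the PLAN breakpoints and offsets rescaled to those formats (for example, 0.25·|x| + 0.5 becomes `4a + 128`). In this integer form the worst-case error is roughly 5 LSBs (about 0.02).

```python
def pwl_sigmoid_q(x: int) -> int:
    """PLAN-style piecewise-linear sigmoid using only shifts, adds and compares.

    x: Q4.4 input as a signed integer in [-128, 127] (real value x / 16).
    Returns a Q0.8 output in [0, 255] (real value y / 256).
    """
    a = -x if x < 0 else x        # |x| in Q4.4; sigmoid(-x) = 1 - sigmoid(x)
    if a < 16:                    # |x| < 1:            0.25|x|    + 0.5
        y = (a << 2) + 128
    elif a < 38:                  # 1 <= |x| < 2.375:   0.125|x|   + 0.625
        y = (a << 1) + 160
    elif a < 80:                  # 2.375 <= |x| < 5:   0.03125|x| + 0.84375
        y = (a >> 1) + 216
    else:                         # |x| >= 5:           1.0
        y = 256
    if x < 0:
        y = 256 - y               # mirror for negative inputs
    return min(y, 255)            # 1.0 is not representable in Q0.8
```

The same unit can serve tanh through the identity tanh(x) = 2·sigmoid(2x) − 1, which costs one extra shift on the input and a shift plus subtract on the output.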
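CORDIC (COordinate Rotation DIgital Computer) evaluates transcendental functions using only shifts, adds, and a small table of constants, with no multipliers: sin and cos in circular mode, and sinh, cosh, and exp = sinh + cosh in hyperbolic mode. tanh = sinh/cosh needs one extra division, which a linear-mode CORDIC can also perform with shifts and adds. It's iterative: each iteration adds roughly one bit of accuracy.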
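To make the shift-and-add structure concrete, here is a minimal Python model of hyperbolic CORDIC in fixed point. It assumes a Q.16 format (`FRAC = 16`) and the standard hyperbolic iteration schedule, which repeats indices 4, 13, 40, ... so the iterations converge. Rotation mode produces cosh and sinh, and a linear-vectoring pass then divides them to get tanh. All names are illustrative. The model only converges for |x| up to about 1.118, so a complete tanh unit would still need range reduction, or simple saturation to ±1 for large inputs.

```python
import math

FRAC = 16                 # fraction bits of the fixed-point format (assumed Q.16)
ONE = 1 << FRAC           # 1.0 in that format


def hyperbolic_schedule(n: int) -> list[int]:
    """Shift indices 1..n, with 4, 13, 40, ... (k -> 3k + 1) repeated once."""
    seq, repeat = [], 4
    for i in range(1, n + 1):
        seq.append(i)
        if i == repeat:
            seq.append(i)
            repeat = 3 * repeat + 1
    return seq


# Design-time constants (a small ROM or hard-wired values in hardware).
SCHEDULE = hyperbolic_schedule(FRAC)
ATANH = [round(math.atanh(2.0 ** -i) * ONE) for i in SCHEDULE]
GAIN = math.prod(math.sqrt(1.0 - 2.0 ** (-2 * i)) for i in SCHEDULE)
X0 = round(ONE / GAIN)    # pre-scale x so it converges to cosh(z), not GAIN * cosh(z)


def cordic_cosh_sinh(z: int) -> tuple[int, int]:
    """Hyperbolic rotation mode: z in Q.16, |z| <= ~1.118. Returns (cosh, sinh) in Q.16."""
    x, y = X0, 0
    for i, a in zip(SCHEDULE, ATANH):
        if z >= 0:        # rotate so the residual angle z is driven toward 0
            x, y, z = x + (y >> i), y + (x >> i), z - a
        else:
            x, y, z = x - (y >> i), y - (x >> i), z + a
    return x, y


def cordic_divide(num: int, den: int) -> int:
    """Linear vectoring mode: num / den in Q.16 using shifts and adds (needs |num/den| < 2)."""
    y, q = num, 0
    for i in range(FRAC + 1):
        if y >= 0:
            y, q = y - (den >> i), q + (ONE >> i)
        else:
            y, q = y + (den >> i), q - (ONE >> i)
    return q


def cordic_tanh(z: int) -> int:
    """tanh(z) = sinh(z) / cosh(z), with both steps done by CORDIC."""
    c, s = cordic_cosh_sinh(z)
    return cordic_divide(s, c)
```

Each loop iteration maps to one pipeline stage (or one cycle of an iterative unit): two shifters, three adders, and a sign test. Only the `ATANH` constants and `X0` depend on math done at design time.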
| Method | Accuracy | BRAM | Latency | Best For |
|---|---|---|---|---|
| LUT | Good (set by table depth and width) | 1+ BRAM | 1 cycle | Fast, BRAM available |
| Piecewise-linear | Moderate (set by segment count) | None | 1–2 cycles | BRAM-scarce, mobile |
| CORDIC | High (≈1 bit per iteration) | None | N cycles (iterative) or N pipeline stages | Accuracy-critical |
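Softmax converts the final layer's scores (logits) into probabilities: softmax(z)_i = exp(z_i) / Σ_j exp(z_j). It needs one exponential per class plus a normalizing division, making it the most expensive activation per element. However, it runs only once, over the output logits, so its cost is negligible compared with the conv layers.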
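A floating-point reference shows the structure a hardware softmax typically follows. First, subtract the maximum logit. Numerator and denominator both scale by the same factor, so the result is unchanged, but every exp input is now ≤ 0 and an exp LUT or approximation only has to cover outputs in (0, 1]. Then compute one reciprocal of the sum and N multiplies instead of N divides. The function name is illustrative.

```python
import math


def softmax_ref(logits: list[float]) -> list[float]:
    m = max(logits)                            # subtract the max: every exp input is <= 0
    exps = [math.exp(v - m) for v in logits]   # each term in (0, 1]; the largest is exactly 1
    total = sum(exps)                          # the single normalizing sum
    inv = 1.0 / total                          # one reciprocal...
    return [e * inv for e in exps]             # ...then N multiplies instead of N divides
```

Because the largest term is always exactly 1 after the subtraction, the sum lies between 1 and N, which also bounds the range the reciprocal unit must handle.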
If you only need the predicted class (not the probability), you can skip softmax completely. exp is strictly increasing and every class is divided by the same sum, so the largest logit always maps to the largest probability. Many edge inference designs do exactly this, saving the entire exp/divide hardware.
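What replaces softmax in that case is a running argmax: one comparator plus two registers (best index, best value) watching the logits stream past. Below is a minimal cycle-by-cycle Python model, assuming one logit arrives per cycle and that ties keep the earlier index. The tie rule is a design choice here, not something the article specifies, and the function name is illustrative.

```python
def streaming_argmax(logits: list[int]) -> tuple[int, int]:
    """Model of a comparator + (best_idx, best_val) register pair, one logit per cycle."""
    best_idx, best_val = 0, logits[0]       # registers loaded on the first cycle
    for idx in range(1, len(logits)):       # each later cycle: one compare, maybe one update
        if logits[idx] > best_val:          # strict '>' keeps the earliest index on ties
            best_idx, best_val = idx, logits[idx]
    return best_idx, best_val
```

For N classes this takes N cycles and a single comparator, which is far smaller than any exp/reciprocal datapath.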
Activations are almost always fused into the previous layer: the conv/GEMM engine passes its output straight through the ReLU before storing it. This avoids writing the pre-activation tensor to memory and reading it back, saving a full memory round-trip.
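When verifying a fused stage like the module that follows, a bit-accurate software model of its arithmetic makes a convenient golden reference. Below is a minimal Python sketch, assuming shift-only requantization (a power-of-two scale with truncation, no rounding), then ReLU, then saturation to the signed output width. The function name and `out_w` parameter are illustrative.

```python
def conv_requant_relu_ref(acc: int, shift: int, out_w: int = 8) -> int:
    """Bit-accurate model of requantize (arithmetic shift) + ReLU + saturate.

    acc is the signed accumulator value; shift is the requantization shift.
    Python's >> floors negative numbers, like Verilog >>> on a signed operand.
    """
    scaled = acc >> shift                 # requantize (truncates toward -inf)
    act = max(scaled, 0)                  # ReLU
    y_max = (1 << (out_w - 1)) - 1        # 127 for INT8
    return min(act, y_max)                # saturate; result in [0, y_max]
```

Driving the RTL and this function with the same random accumulator/shift pairs and comparing outputs quickly exposes sign-extension and saturation mistakes.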
```verilog
// Fused conv output + requantize + ReLU (single registered pipeline stage)
module conv_requant_relu #(parameter ACC_W = 32, parameter OUT_W = 8)(
    input  wire                    clk,
    input  wire signed [ACC_W-1:0] acc_in,  // 32-bit accumulator from conv
    input  wire        [7:0]       shift,   // requant shift amount
    output reg  signed [OUT_W-1:0] y        // INT8 activated output
);
    localparam [OUT_W-1:0] Y_MAX = {1'b0, {(OUT_W-1){1'b1}}};  // 127 for INT8

    // Requantize: arithmetic shift by a power-of-two scale (truncates, no rounding)
    wire signed [ACC_W-1:0] scaled = acc_in >>> shift;
    // ReLU: sign bit set → 0
    wire signed [ACC_W-1:0] act = scaled[ACC_W-1] ? {ACC_W{1'b0}} : scaled;

    // After ReLU the value is never negative, so only the upper bound needs
    // saturating: clamp to [0, Y_MAX]. (act >= 0 also makes this unsigned compare safe.)
    always @(posedge clk)
        y <= (act > Y_MAX) ? Y_MAX : act[OUT_W-1:0];
endmodule
```

Next up, Day 8: Pooling Layers & Normalization (max/average pooling hardware, batch normalization folding, and fused BN+ReLU+Pool).