Most modern FPGAs contain dedicated DSP blocks: hard arithmetic tiles that implement a multiplier and an accumulator in a single slice and can run at 500+ MHz. Signal processing built only from LUTs works, but it costs far more logic and runs slower. This lesson builds a pipelined Multiply-Accumulate (MAC) unit and shows how to make the synthesis tool map it to a DSP48 slice automatically.
The DSP48E1 (Xilinx 7-series) is a hard arithmetic block containing a pre-adder, a 25×18 signed multiplier, and a 48-bit accumulator (the P register).
The key pattern: register the inputs (a_reg, b_reg), multiply them into a registered product (product <= a_reg * b_reg), then accumulate (acc <= acc + product). Use signed arithmetic and keep both operands within 18 bits so the multiply fits a single DSP48E1. Add (* USE_DSP48 = "yes" *) if synthesis does not infer the DSP on its own.
| Port | Dir | Width | Description |
|---|---|---|---|
| clk | IN | 1 | Clock |
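| rst | IN | 1 | Synchronous reset. Clears the pipeline registers and the accumulator |
| clr | IN | 1 | Starts a new sum (for example a new dot product). Assert it together with valid_in on the first sample: that product is loaded into the accumulator instead of being added to the old sum |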
| valid_in | IN | 1 | Asserted when A and B are valid inputs to accumulate |
| a | IN | 16 | Signed 16-bit multiplicand |
| b | IN | 16 | Signed 16-bit multiplier coefficient |
| acc | OUT | 32 | Accumulated sum of a×b products. An input pair reaches acc 3 clock edges after it is presented (input register, product register, accumulator) |
| valid_out | OUT | 1 | Pipelined valid flag corresponding to acc output |
// mac.v: Pipelined 16-bit signed Multiply-Accumulate
// Intended to map to one DSP48E1 on Xilinx 7-series.
// Pipeline: Stage 1 = register inputs          (DSP48 A/B registers)
//           Stage 2 = registered multiply      (DSP48 M register)
//           Stage 3 = accumulate into 32 bits  (DSP48 P register)
// Total latency: 3 clock cycles from valid_in to the updated acc output
(* USE_DSP48 = "yes" *)
module mac (
    input  wire               clk,
    input  wire               rst,
    input  wire               clr,        // assert with the first sample to start a new sum
    input  wire               valid_in,
    input  wire signed [15:0] a,
    input  wire signed [15:0] b,
    output reg  signed [31:0] acc,
    output reg                valid_out
);

    // ---- Stage 1: input registers ----
    // Registering inputs before the multiply helps DSP48 inference
    reg signed [15:0] a_reg, b_reg;
    reg               s1_valid, s1_clr;

    // ---- Stage 2: registered product ----
    // valid and clr are delayed a second time so they stay aligned with product
    reg signed [31:0] product;
    reg               s2_valid, s2_clr;

    always @(posedge clk) begin
        if (rst) begin
            a_reg    <= 0;
            b_reg    <= 0;
            product  <= 0;
            s1_valid <= 0;
            s1_clr   <= 0;
            s2_valid <= 0;
            s2_clr   <= 0;
        end else begin
            a_reg    <= a;
            b_reg    <= b;
            s1_valid <= valid_in;
            s1_clr   <= clr;

            product  <= a_reg * b_reg;    // 16x16 → 32-bit (DSP48 multiply)
            s2_valid <= s1_valid;
            s2_clr   <= s1_clr;
        end
    end

    // ---- Stage 3: accumulate ----
    always @(posedge clk) begin
        if (rst) begin
            acc       <= 0;
            valid_out <= 0;
        end else begin
            valid_out <= s2_valid;
            if (s2_clr)
                acc <= product;           // load first value (discard old sum)
            else if (s2_valid)
                acc <= acc + product;     // accumulate (DSP48 P register)
        end
    end

endmodule
After synthesis in Vivado, check the Synthesis Report → Resource Utilization. You should see DSPs: 1 for this mac module. If you see DSPs: 0 and high LUT usage instead, inference failed; check that both inputs are declared signed and that their widths are ≤ 18 bits.
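To force DSP use, add (* USE_DSP48 = "yes" *) before the module keyword or on the signal that holds the multiply result. Recent Vivado releases document this attribute under the name USE_DSP, so check the synthesis guide for your tool version.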
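As an illustration of attribute placement, here is a minimal sketch of a registered multiplier with the attribute applied at module level. The module name mult_dsp and its port names are hypothetical and not part of the lesson's files; the attribute spelling follows the text above and may need to be USE_DSP in your Vivado version.

```verilog
// mult_dsp.v: hypothetical example, not part of the lesson's mac.v
// The attribute in front of the module keyword applies to the whole module.
(* USE_DSP48 = "yes" *)
module mult_dsp (
    input  wire               clk,
    input  wire signed [15:0] a,
    input  wire signed [15:0] b,
    output reg  signed [31:0] p
);
    reg signed [15:0] a_reg, b_reg;

    always @(posedge clk) begin
        a_reg <= a;                 // input registers
        b_reg <= b;
        p     <= a_reg * b_reg;     // registered signed product
    end
endmodule
```

Simulators ignore the attribute, so the only way to confirm its effect is the utilization report after synthesis.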
// tb_mac.v: self-checking testbench for mac
// Computes a dot product and checks it against a precomputed reference value
`timescale 1ns/1ps
module tb_mac;
    reg clk = 0;
    reg rst = 1;
    reg clr = 0;
    reg valid_in = 0;
    reg signed [15:0] a = 0, b = 0;
    wire signed [31:0] acc;
    wire valid_out;

    mac dut(.clk(clk), .rst(rst), .clr(clr), .valid_in(valid_in),
            .a(a), .b(b), .acc(acc), .valid_out(valid_out));

    always #5 clk = ~clk;

    integer pass_cnt = 0, fail_cnt = 0;
    integer i;

    // Test vectors: 4-tap dot product
    // [1,2,3,4] · [10,20,30,40] = 10+40+90+160 = 300
    reg signed [15:0] a_vec [0:3];
    reg signed [15:0] b_vec [0:3];
    integer expected_acc;

    initial begin
        a_vec[0] = 1; b_vec[0] = 10;
        a_vec[1] = 2; b_vec[1] = 20;
        a_vec[2] = 3; b_vec[2] = 30;
        a_vec[3] = 4; b_vec[3] = 40;
        expected_acc = 300;
    end

    // All DUT inputs are driven with nonblocking assignments so that control
    // (clr, valid_in) and data (a, b) change together and never race the
    // DUT's own posedge sampling.
    initial begin
        $dumpfile("tb_mac.vcd");
        $dumpvars(0, tb_mac);
        repeat(4) @(posedge clk);
        rst <= 0;

        // --- Test 1: 4-tap dot product ---
        // First input uses clr to load instead of accumulate
        @(posedge clk);
        clr <= 1; valid_in <= 1;
        a <= a_vec[0]; b <= b_vec[0];
        @(posedge clk);
        clr <= 0;
        a <= a_vec[1]; b <= b_vec[1];
        @(posedge clk);
        a <= a_vec[2]; b <= b_vec[2];
        @(posedge clk);
        a <= a_vec[3]; b <= b_vec[3];
        @(posedge clk);
        valid_in <= 0;
        // Wait for the pipeline to drain (4 cycles covers the input,
        // product, and accumulator registers)
        repeat(4) @(posedge clk);
        if (acc === expected_acc) begin
            $display("PASS: dot product = %0d (expected %0d)", acc, expected_acc);
            pass_cnt = pass_cnt + 1;
        end else begin
            $display("FAIL: dot product = %0d (expected %0d)", acc, expected_acc);
            fail_cnt = fail_cnt + 1;
        end

        // --- Test 2: negative values ---
        // Clear and load a single product: [-3] · [7] = -21
        @(posedge clk);
        clr <= 1; valid_in <= 1;
        a <= -3; b <= 7;
        @(posedge clk);
        clr <= 0; valid_in <= 0;
        repeat(4) @(posedge clk);
        if (acc === -21) begin
            $display("PASS: -3*7 = %0d (expected -21)", acc);
            pass_cnt = pass_cnt + 1;
        end else begin
            $display("FAIL: -3*7 = %0d (expected -21)", acc);
            fail_cnt = fail_cnt + 1;
        end

        // --- Test 3: accumulate 5 × (3×4) = 60 ---
        @(posedge clk);
        clr <= 1; valid_in <= 1;
        a <= 3; b <= 4;
        @(posedge clk);
        clr <= 0;
        a <= 3; b <= 4;
        @(posedge clk); a <= 3; b <= 4;
        @(posedge clk); a <= 3; b <= 4;
        @(posedge clk); a <= 3; b <= 4;
        @(posedge clk); valid_in <= 0;
        repeat(4) @(posedge clk);
        if (acc === 60) begin
            $display("PASS: 5x(3*4) = %0d (expected 60)", acc);
            pass_cnt = pass_cnt + 1;
        end else begin
            $display("FAIL: 5x(3*4) = %0d (expected 60)", acc);
            fail_cnt = fail_cnt + 1;
        end

        if (fail_cnt == 0)
            $display("\nALL TESTS PASSED (%0d/%0d)", pass_cnt, pass_cnt+fail_cnt);
        else
            $display("\nFAILED: %0d passed, %0d failed", pass_cnt, fail_cnt);
        $finish;
    end

    initial #5000 begin $display("TIMEOUT"); $finish; end
endmodule
PASS: dot product = 300 (expected 300)
PASS: -3*7 = -21 (expected -21)
PASS: 5x(3*4) = 60 (expected 60)

ALL TESTS PASSED (3/3)
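The testbench hard-codes expected_acc = 300. If the vectors change, a reference computed from a_vec and b_vec keeps the check honest. The following is a minimal sketch of that idea as a standalone module; the module name dot_ref and the variable expected are hypothetical, and in practice the loop would more likely live inside tb_mac itself.

```verilog
// dot_ref.v: hypothetical helper, not part of the lesson's tb_mac.v
// Computes the reference dot product behaviorally, with no pipeline involved.
module dot_ref;
    reg signed [15:0] a_vec [0:3];
    reg signed [15:0] b_vec [0:3];
    integer i;
    integer expected;

    initial begin
        a_vec[0] = 1; b_vec[0] = 10;
        a_vec[1] = 2; b_vec[1] = 20;
        a_vec[2] = 3; b_vec[2] = 30;
        a_vec[3] = 4; b_vec[3] = 40;

        // Plain sum of products; the integer result is 32 bits, like acc
        expected = 0;
        for (i = 0; i < 4; i = i + 1)
            expected = expected + a_vec[i] * b_vec[i];

        $display("reference dot product = %0d", expected);
    end
endmodule
```

For the vectors shown this prints a reference of 300, the same value the testbench compares against.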
A Direct-Form FIR filter with N taps computes: y[n] = Σ h[k] × x[n-k] for k=0..N-1. This is exactly N MAC operations per output sample. On an FPGA you have three implementation strategies:
DSP48E1 (Xilinx 7-series) is a hard arithmetic block combining a pre-adder, a 25×18 multiplier, and a 48-bit accumulator (P register) in a single tile running at 500+ MHz. One DSP48 implements a complete MAC operation while consuming zero LUTs.
To get DSP48 inference, write a pipelined multiply followed by an accumulate, with registered intermediate values. Keep operand widths ≤ 18 bits (signed). Vivado infers a DSP48 when it sees registered A and B feeding a multiplier whose output feeds a registered adder. Add (* USE_DSP48="yes" *) to force it.
Multiply-Accumulate computes Sum = Sum + A×B. It is the core operation in FIR filters, matrix multiplication, correlation, and convolutional neural networks. A single DSP48 performs one MAC per clock cycle at 500+ MHz, which is far faster and more area-efficient than equivalent LUT logic.
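To make the link between the MAC and the FIR sum y[n] = Σ h[k] × x[n-k] concrete, here is a minimal sketch of a 4-tap FIR that reuses a single multiply-accumulate over N clock cycles. Everything in it is an assumption for illustration: the module name fir_serial, the port names, and the coefficients 10, 20, 30, 40. It is written for readability, not speed, so the multiply and add are not pipelined the way mac is.

```verilog
// fir_serial.v: hypothetical single-MAC FIR sketch (illustrative names and coefficients)
module fir_serial (
    input  wire               clk,
    input  wire               rst,
    input  wire               x_valid,   // one-cycle strobe: new sample on x
    input  wire signed [15:0] x,
    output reg  signed [47:0] y,         // 48 bits, same width as the DSP48 accumulator
    output reg                y_valid
);
    localparam N = 4;                    // number of taps

    reg signed [15:0] h     [0:N-1];     // coefficients h[k]
    reg signed [15:0] delay [0:N-1];     // delay[k] holds x[n-k]
    reg signed [47:0] acc;
    reg        [7:0]  k;                 // current tap index
    reg               busy;
    integer i;

    initial begin                        // example coefficients (assumption)
        h[0] = 10; h[1] = 20; h[2] = 30; h[3] = 40;
    end

    always @(posedge clk) begin
        y_valid <= 1'b0;                 // default: pulse only when a result is ready
        if (rst) begin
            busy <= 1'b0;
            k    <= 0;
            acc  <= 0;
            y    <= 0;
            for (i = 0; i < N; i = i + 1) delay[i] <= 0;
        end else if (x_valid && !busy) begin
            // Shift the new sample into the delay line and start a new sum
            delay[0] <= x;
            for (i = 1; i < N; i = i + 1) delay[i] <= delay[i-1];
            acc  <= 0;
            k    <= 0;
            busy <= 1'b1;
        end else if (busy) begin
            // One MAC per clock: acc += h[k] * x[n-k]
            if (k == N-1) begin
                y       <= acc + h[k] * delay[k];
                y_valid <= 1'b1;
                busy    <= 1'b0;
            end else begin
                acc <= acc + h[k] * delay[k];
                k   <= k + 1;
            end
        end
    end
endmodule
```

Feeding an impulse (x = 1 followed by zeros) should return the coefficients 10, 20, 30, 40 on successive outputs. Samples offered while busy is high are dropped, so in this sketch the input rate is limited to one sample every N+1 clocks; that is the price of using one multiplier instead of N.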