DAY 19 · DSP

DSP: Multipliers, MACs & Filters on FPGA

Q: What is a MAC unit and why is it important?

A Multiply-Accumulate unit computes Sum = Sum + A×B repeatedly. MACs are the core operation in FIR filters (dot product of samples and coefficients), matrix multiplication, correlation, and convolution. A single DSP48 can complete one MAC per clock cycle, so a 200-tap FIR filter needs 200 MACs per output sample: either 200 DSP48 slices working in parallel, or fewer slices time-multiplexed over several clock cycles.

By EcrioniX · Updated Jun 11, 2026

Most modern FPGAs contain dedicated DSP blocks: hard arithmetic tiles that implement a multiplier and accumulator in a single slice and, when fully pipelined, can run at around 500 MHz or more depending on device and speed grade. Building signal processing on FPGAs without using DSP blocks is like doing floating-point math on a CPU with no FPU: you can do it, but it is vastly inefficient. This lesson builds a pipelined Multiply-Accumulate (MAC) unit and shows how to make the synthesis tool map it to DSP48 slices automatically.

1. What is a DSP48 slice?

The DSP48E1 (Xilinx 7-series) is a hard arithmetic block containing:

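A pre-adder: computes D ± A before the multiplier (useful for symmetric FIR filters, where two samples share one coefficient)
A 25×18 signed multiplier: produces a 43-bit product in a single clock cycle (the DSP48E2 on UltraScale widens this to 27×18 and a 45-bit product)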
A 48-bit accumulator (P register): adds the multiplier output to the current sum
A cascade path: chains multiple DSP48 blocks without routing through the fabric
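
To illustrate the pre-adder item in the list above, here is a minimal sketch of one symmetric FIR tap pair. The module name sym_tap and its port names are hypothetical and are not part of this lesson's mac design; whether the addition actually lands in the DSP48 pre-adder is up to the synthesis tool.

```verilog
// sym_tap.v: hypothetical example, not part of the lesson's mac design.
// One symmetric FIR tap pair: (x_early + x_late) * coeff.
// The add-then-multiply shape matches the DSP48 pre-adder + multiplier.
module sym_tap (
    input  wire               clk,
    input  wire signed [15:0] x_early,  // first sample of the pair
    input  wire signed [15:0] x_late,   // second sample, same coefficient
    input  wire signed [15:0] coeff,    // shared coefficient
    output reg  signed [32:0] prod      // 17-bit sum x 16-bit coeff = 33 bits
);

reg signed [16:0] pre_sum;    // one extra bit so the sum cannot overflow
reg signed [15:0] coeff_reg;  // delays coeff to stay aligned with pre_sum

always @(posedge clk) begin
    pre_sum   <= x_early + x_late;     // pre-adder stage
    coeff_reg <= coeff;
    prod      <= pre_sum * coeff_reg;  // multiplier stage
end

endmodule
```

Sharing one multiplier between two samples is what lets a symmetric N-tap filter get by with roughly N/2 multipliers.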

DSP48 inference rule

The key pattern: register the inputs (A_reg, B_reg), register their product (P = A_reg * B_reg), then accumulate (acc = acc + P). Use signed arithmetic and keep operand widths within the hard multiplier (25×18 on DSP48E1; 18×18 or narrower always fits) to map cleanly to DSP48. Add (* USE_DSP48="yes" *) if needed.

2. Port table — mac

Port	Dir	Width	Description
clk	IN	1	Clock
rst	IN	1	Synchronous reset — clears accumulator
clr	IN	1	Synchronous accumulator clear. Assert it together with the first sample of a new dot product; that product is loaded instead of added.
valid_in	IN	1	Asserted when A and B are valid inputs to accumulate
a	IN	16	Signed 16-bit multiplicand
b	IN	16	Signed 16-bit multiplier coefficient
acc	OUT	32	Accumulated sum of A×B products. Valid 3 cycles after inputs (pipeline). Wraps on overflow.
valid_out	OUT	1	Pipelined valid flag corresponding to acc output

3. mac.v — pipelined multiply-accumulate

mac.v

// mac.v — Pipelined 16-bit signed Multiply-Accumulate
// Maps to DSP48E1 on Xilinx 7-series.
// Pipeline: Stage 1 = register inputs   (DSP48 A/B registers)
//           Stage 2 = multiply          (DSP48 M register)
//           Stage 3 = accumulate        (DSP48 P register)
// Total latency: 3 clock cycles from valid_in to the matching acc / valid_out
// Note: acc is 32 bits and wraps on overflow; widen it (the P register
// offers 48 bits) if many large products are summed.

(* USE_DSP48 = "yes" *)
module mac (
    input  wire        clk,
    input  wire        rst,
    input  wire        clr,        // clear accumulator (start new sum)
    input  wire        valid_in,
    input  wire signed [15:0] a,
    input  wire signed [15:0] b,
    output reg  signed [31:0] acc,
    output reg                valid_out
);

// ---- Stage 1: input registers ----
// Registering inputs before the multiply helps DSP48 inference
reg signed [15:0] a_reg, b_reg;
reg               s1_valid;
reg               s1_clr;

// ---- Stage 2: registered product ----
// valid/clr get a second register so they stay aligned with product
reg signed [31:0] product;
reg               s2_valid;
reg               s2_clr;

always @(posedge clk) begin
    if (rst) begin
        a_reg    <= 0;
        b_reg    <= 0;
        product  <= 0;
        s1_valid <= 0;
        s1_clr   <= 0;
        s2_valid <= 0;
        s2_clr   <= 0;
    end else begin
        // Stage 1
        a_reg    <= a;
        b_reg    <= b;
        s1_valid <= valid_in;
        s1_clr   <= clr;
        // Stage 2
        product  <= a_reg * b_reg;   // 16x16 → 32-bit (DSP48 multiply)
        s2_valid <= s1_valid;
        s2_clr   <= s1_clr;
    end
end

// ---- Stage 3: Accumulate ----
always @(posedge clk) begin
    if (rst) begin
        acc       <= 0;
        valid_out <= 0;
    end else begin
        valid_out <= s2_valid;
        if (s2_clr)
            acc <= product;                 // load first value (discard old sum)
        else if (s2_valid)
            acc <= acc + product;           // accumulate (DSP48 P register)
    end
end

endmodule
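
One point the listing leaves open is how wide acc has to be. As a rough worst-case bound, assuming 16-bit signed operands and N accumulated products (N is a symbol introduced here, not a parameter of mac.v):

```latex
\[
|a_i b_i| \le 2^{15}\cdot 2^{15} = 2^{30},
\qquad
\Bigl|\sum_{i=1}^{N} a_i b_i\Bigr| \le N\cdot 2^{30} < 2^{\,31+\lceil\log_2 N\rceil}
\]
\[
\text{so a signed accumulator of } W = 32 + \lceil \log_2 N \rceil \text{ bits cannot overflow.}
\]
```

By this bound a 32-bit acc is only guaranteed for a single worst-case product, although it is ample for the small test vectors used in this lesson. The 48-bit P register leaves 16 guard bits, enough for 2^16 = 65,536 worst-case products.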

4. DSP48 mapping verification

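After synthesis in Vivado, open the utilization report (Report Utilization in the GUI, or report_utilization in the Tcl console). You should see DSPs: 1 for this mac module. If you see DSPs: 0 and high LUT usage instead, inference failed: check that both inputs are signed and that the operand widths fit the hard multiplier (25×18 on DSP48E1; 18×18 or narrower always fits).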

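You can also force DSP use with a synthesis attribute: place (* USE_DSP48 = "yes" *) before the module keyword or on the declaration of the signal that holds the product. Recent Vivado releases name this attribute USE_DSP and flag USE_DSP48 as deprecated, so switch to the newer spelling if your version warns about it.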

5. Testbench — tb_mac.v

tb_mac.v

// tb_mac.v — self-checking testbench for mac
// Computes a dot product and verifies against software reference
`timescale 1ns/1ps

module tb_mac;

reg        clk = 0;
reg        rst = 1;
reg        clr = 0;
reg        valid_in = 0;
reg  signed [15:0] a = 0, b = 0;
wire signed [31:0] acc;
wire               valid_out;

mac dut(.clk(clk),.rst(rst),.clr(clr),.valid_in(valid_in),
        .a(a),.b(b),.acc(acc),.valid_out(valid_out));

always #5 clk = ~clk;

integer pass_cnt = 0, fail_cnt = 0;
integer i;

// Test vectors: 4-tap dot product
// [1,2,3,4] · [10,20,30,40] = 10+40+90+160 = 300
reg signed [15:0] a_vec [0:3];
reg signed [15:0] b_vec [0:3];
integer expected_acc;

initial begin
    a_vec[0] = 1;  b_vec[0] = 10;
    a_vec[1] = 2;  b_vec[1] = 20;
    a_vec[2] = 3;  b_vec[2] = 30;
    a_vec[3] = 4;  b_vec[3] = 40;
    expected_acc = 300;
end

initial begin
    $dumpfile("tb_mac.vcd");
    $dumpvars(0, tb_mac);

    repeat(4) @(posedge clk);
    rst <= 0;

    // --- Test 1: 4-tap dot product ---
    // All stimulus uses nonblocking assignments so the DUT never races
    // the testbench at the clock edge.
    // First input uses clr to load instead of accumulate
    @(posedge clk);
    clr <= 1; valid_in <= 1;
    a <= a_vec[0]; b <= b_vec[0];
    @(posedge clk);
    clr <= 0; valid_in <= 1;
    a <= a_vec[1]; b <= b_vec[1];
    @(posedge clk);
    valid_in <= 1;
    a <= a_vec[2]; b <= b_vec[2];
    @(posedge clk);
    valid_in <= 1;
    a <= a_vec[3]; b <= b_vec[3];
    @(posedge clk);
    valid_in <= 0;

    // Wait for the pipeline to flush (3 stages; 4 cycles leaves a margin)
    repeat(4) @(posedge clk);

    if (acc === expected_acc) begin
        $display("PASS: dot product = %0d (expected %0d)", acc, expected_acc);
        pass_cnt = pass_cnt + 1;
    end else begin
        $display("FAIL: dot product = %0d (expected %0d)", acc, expected_acc);
        fail_cnt = fail_cnt + 1;
    end

    // --- Test 2: negative values ---
    // clr discards the old sum and loads a single product: -3 * 7 = -21
    @(posedge clk);
    clr <= 1; valid_in <= 1;
    a <= -3; b <= 7;
    @(posedge clk);
    clr <= 0; valid_in <= 0;
    repeat(4) @(posedge clk);

    if (acc === -21) begin
        $display("PASS: -3*7 = %0d (expected -21)", acc);
        pass_cnt = pass_cnt + 1;
    end else begin
        $display("FAIL: -3*7 = %0d (expected -21)", acc);
        fail_cnt = fail_cnt + 1;
    end

    // --- Test 3: accumulate 5 × (3×4) = 60 ---
    @(posedge clk);
    clr <= 1; valid_in <= 1;
    a <= 3; b <= 4;
    @(posedge clk);
    clr <= 0; valid_in <= 1;
    a <= 3; b <= 4;
    @(posedge clk); a <= 3; b <= 4;
    @(posedge clk); a <= 3; b <= 4;
    @(posedge clk); a <= 3; b <= 4;
    @(posedge clk); valid_in <= 0;
    repeat(4) @(posedge clk);

    if (acc === 60) begin
        $display("PASS: 5x(3*4) = %0d (expected 60)", acc);
        pass_cnt = pass_cnt + 1;
    end else begin
        $display("FAIL: 5x(3*4) = %0d (expected 60)", acc);
        fail_cnt = fail_cnt + 1;
    end

    if (fail_cnt == 0)
        $display("\nALL TESTS PASSED (%0d/%0d)", pass_cnt, pass_cnt+fail_cnt);
    else
        $display("\nFAILED: %0d passed, %0d failed", pass_cnt, fail_cnt);

    $finish;
end

initial #5000 begin $display("TIMEOUT"); $finish; end

endmodule

6. Expected output

PASS: dot product = 300 (expected 300)
PASS: -3*7 = -21 (expected -21)
PASS: 5x(3*4) = 60 (expected 60)

ALL TESTS PASSED (3/3)

7. Building a simple FIR filter

A direct-form FIR filter with N taps computes y[n] = Σ h[k] × x[n-k] for k = 0..N-1, which is exactly N MAC operations per output sample. On an FPGA you have three implementation strategies:

Parallel (N DSP48 blocks): one MAC per tap, all computing simultaneously. Highest throughput (1 output per clock), uses N DSP48 blocks.
Serial (1 DSP48 block): time-multiplex one MAC over N cycles. Lowest area, throughput = Fclk / N samples/s.
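Systolic array (N DSP48 blocks, chained): each slice passes its partial sum to the next over the dedicated cascade path instead of through a fabric adder tree. Throughput stays at 1 output per clock with minimal routing.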
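
As a rough illustration of the chained approach, here is a minimal 4-tap sketch in transposed form, a close relative of the systolic structure rather than the systolic form itself. The module name fir4_transposed and the coefficient parameters H0..H3 are hypothetical, and whether Vivado routes the partial sums over the DSP48 cascade path depends on the tool and device.

```verilog
// fir4_transposed.v: hypothetical 4-tap FIR sketch (transposed form).
// Coefficient values are illustrative, not taken from the lesson.
module fir4_transposed #(
    parameter signed [15:0] H0 = 16'sd10,
    parameter signed [15:0] H1 = 16'sd20,
    parameter signed [15:0] H2 = 16'sd30,
    parameter signed [15:0] H3 = 16'sd40
) (
    input  wire               clk,
    input  wire               rst,  // synchronous reset
    input  wire signed [15:0] x,    // one input sample per clock
    output wire signed [33:0] y     // 32-bit products + 2 guard bits for 4 taps
);

reg signed [33:0] r0, r1, r2, r3;   // partial sums, one register per tap

always @(posedge clk) begin
    if (rst) begin
        r0 <= 0; r1 <= 0; r2 <= 0; r3 <= 0;
    end else begin
        r3 <= H3 * x;         // last tap starts the chain
        r2 <= H2 * x + r3;    // each tap adds its product to the
        r1 <= H1 * x + r2;    // partial sum from the tap behind it
        r0 <= H0 * x + r1;
    end
end

// Registered output, one clock after x is sampled:
// y = H0*x[n] + H1*x[n-1] + H2*x[n-2] + H3*x[n-3]
assign y = r0;

endmodule
```

With the default coefficients, a single input sample of 1 followed by zeros produces 10, 20, 30, 40 on y over the next four clocks, which is the filter's impulse response.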

Key Takeaways

DSP48E1 contains a pre-adder, 25×18 multiplier, and 48-bit accumulator in one hard tile (DSP48E2 widens the multiplier to 27×18)
Infer DSP48 by registering inputs before multiply and accumulating the registered product
Use signed inputs that fit the hard multiplier (18×18 or narrower always fits) for automatic inference; add (* USE_DSP48="yes" *) if needed
The 3-cycle pipeline latency (input register → multiply → accumulate) must be accounted for in control logic
FIR filters are among the most common DSP applications; an N-tap filter needs N MACs per output sample

Frequently Asked Questions

What is a DSP48 slice?

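DSP48E1 (Xilinx 7-series) is a hard arithmetic block combining a pre-adder, a 25×18 multiplier, and a 48-bit accumulator (P register) in a single tile; the DSP48E2 on UltraScale widens the multiplier to 27×18. Fully pipelined, it can run at around 500 MHz or more depending on device and speed grade. One DSP48 implements a complete MAC operation without spending LUTs on the arithmetic.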

How do you make Vivado infer DSP48?

Write a pipelined multiply followed by an accumulate, using registered intermediate values. Keep signed operand widths within the hard multiplier (25×18 on DSP48E1; 18×18 or narrower always fits). Vivado infers DSP48 when it sees registered A and B feeding a multiplier whose registered output feeds a registered adder. Add (* USE_DSP48="yes" *) to force it.

What is a MAC and why is it important?

Multiply-Accumulate computes Sum = Sum + A×B. It is the core operation in FIR filters, matrix multiplication, correlation, and convolutional neural networks. A single fully pipelined DSP48 can perform one MAC per clock cycle at around 500 MHz or more, depending on device and speed grade, which is far faster and more area-efficient than equivalent LUT logic.