What is a Carry Lookahead Adder?

A Carry Lookahead Adder (CLA) eliminates the ripple carry bottleneck by computing all carry bits simultaneously using propagate (P) and generate (G) signals. Instead of waiting for each carry to ripple through each full adder, the CLA uses a 2-level AND-OR logic network to compute every carry in parallel, reducing adder delay from O(n) to O(log n).

What are propagate and generate signals in a CLA?

For bit position i: Generate Gi = Ai AND Bi — this bit position produces a carry regardless of any incoming carry. Propagate Pi = Ai XOR Bi — this bit position passes an incoming carry to the next position. The carry out is: Ci+1 = Gi OR (Pi AND Ci). These two signals allow all carries to be computed in parallel without bit-by-bit rippling.

Why is a CLA faster than a ripple carry adder?

A ripple carry adder has O(n) delay because each carry must wait for the previous carry to resolve. A 32-bit ripple carry adder needs 32 full-adder delays on the carry chain. A CLA computes all carries simultaneously with 2-level logic (one AND, one OR), giving O(1) delay for a 4-bit group. Cascaded block CLAs achieve O(log n) delay overall.

What is a prefix adder?

A prefix adder (Kogge-Stone, Brent-Kung, Han-Carlson) generalizes the CLA concept using a parallel prefix computation tree. The prefix operator combines (G,P) pairs from multiple bit positions. Kogge-Stone achieves O(log n) depth with O(n log n) gates — maximum speed. Brent-Kung achieves O(log n) depth with O(n) gates — better area.

Carry Lookahead Adder (CLA) – How It Works, Equations & Verilog

Section 01

The Ripple Carry Problem

A full adder computes one bit of sum and carry-out. Chain n of them and you get a ripple carry adder. Simple — but the carry has to ripple from bit 0 all the way to bit n−1. Each full adder must wait for the carry-out of the previous one before it can compute its own carry-out and sum.

4-bit ripple carry adder. C4 cannot be known until C3 resolves, which waits for C2, which waits for C1. For a 32-bit adder: 32 FA delays on the critical path.

Real-world impact: At 28nm, one full adder contributes ~50 ps of carry delay. A 64-bit ripple carry adder needs ~3.2 ns just for carry propagation — limiting the clock to ~300 MHz. Modern CPUs run at 3–5 GHz. A CLA-based adder reduces this to 3–4 gate delays regardless of width.

Adder Type	Delay	Gate Count	Used in
Ripple Carry	O(n)	O(n)	Simple designs, small n
Carry Lookahead	O(1) per 4-bit group	O(n)	ALUs, 8–32 bit
Kogge-Stone	O(log n)	O(n log n)	High-speed 64-bit ALUs
Brent-Kung	O(log n)	O(n)	Area-efficient 64-bit

Section 02

Propagate & Generate — The Key Insight

Before computing carries in parallel, define two signals per bit position that describe what that position will do with an incoming carry. These two signals, P and G, are computed purely from A and B — no carry needed.

Generate: Gᵢ = Aᵢ AND Bᵢ
// This bit produces a carry regardless of Cin

Propagate: Pᵢ = Aᵢ XOR Bᵢ
// This bit passes an incoming carry to Cout

Carry out: Cᵢ₊₁ = Gᵢ + (Pᵢ · Cᵢ)
Sum: Sᵢ = Pᵢ XOR Cᵢ

Generate (G = 1) — A=1, B=1

1 + 1 = 10 in binary. This bit generates a carry regardless of any carry-in. Even if Cin = 0, Cout = 1. G is the "certain carry" signal.

Propagate (P = 1) — A⊕B = 1

One of A or B is 1, the other is 0. This bit will propagate a carry-in to carry-out. If Cin = 1, Cout = 1. If Cin = 0, Cout = 0. P passes carry through.

Kill (G=0, P=0) — A=0, B=0

0 + 0 = 0. No carry generated, no carry propagated. This bit kills any incoming carry. Cout = 0 always regardless of Cin.

Sum is always P ⊕ C

Once we know the carry-in Cᵢ for each position, the sum bit is trivially Sᵢ = Pᵢ ⊕ Cᵢ. This is why computing carries fast is the only problem to solve.

A	B	G = A·B	P = A⊕B	Behaviour
0	0	0	0	Kill — Cout always 0
0	1	0	1	Propagate — Cout = Cin
1	0	0	1	Propagate — Cout = Cin
1	1	1	0	Generate — Cout always 1

Section 03

CLA Boolean Equations — All Carries in Parallel

Now substitute the carry recurrence Cᵢ₊₁ = Gᵢ + Pᵢ·Cᵢ into itself repeatedly. Each carry expression expands until it depends only on G, P signals and the original C₀ — never on another carry. All four carries are now computable simultaneously.

C₁ = G₀ + P₀·C₀
C₂ = G₁ + P₁·G₀ + P₁·P₀·C₀
C₃ = G₂ + P₂·G₁ + P₂·P₁·G₀ + P₂·P₁·P₀·C₀
C₄ = G₃ + P₃·G₂ + P₃·P₂·G₁ + P₃·P₂·P₁·G₀ + P₃·P₂·P₁·P₀·C₀

Each equation is a sum-of-products (AND-OR) — exactly 2 levels of logic. No matter how wide the adder, every carry is computed with one AND stage and one OR stage. This is the core of CLA speed.

Reading C₄: It says — "a carry out of bit 3 exists if: bit 3 generates one (G₃), OR bit 2 generated one that bit 3 propagated (P₃·G₂), OR bits 1–2 generated+propagated it up (P₃·P₂·G₁), and so on all the way to the original carry-in rippling through all 4 bits (P₃·P₂·P₁·P₀·C₀)."

4-bit CLA structure — 3 levels total: PG computation (1 level) + CLA AND-OR (2 levels) = fixed delay regardless of n.

Section 04

Block CLA — Scaling to 16-bit and 32-bit

A single 4-bit CLA block has constant carry delay. To build a 16-bit adder, chain four 4-bit CLA blocks. But simply chaining them makes the group carries ripple again. The solution: define Group Generate (GG) and Group Propagate (GP) for each block, then apply CLA logic at the block level too.

Group Generate: GGₖ = G₃ + P₃·G₂ + P₃·P₂·G₁ + P₃·P₂·P₁·G₀
// Block produces a carry out regardless of block carry-in

Group Propagate: GPₖ = P₃·P₂·P₁·P₀
// Block passes a carry-in all the way through to carry-out

Level 1 — Bit-level PG

Compute Gᵢ = Aᵢ·Bᵢ and Pᵢ = Aᵢ⊕Bᵢ for every bit. One gate delay.

Level 2 — 4-bit group CLA

Each 4-bit block computes its internal carries AND its GG/GP outputs using AND-OR logic. Two gate delays. All four blocks work in parallel.

Level 3 — Block-level CLA

A second CLA network uses GG/GP from each block to compute the block carry-ins (C₄, C₈, C₁₂) simultaneously. Two more gate delays.

Level 4 — Final sums

Each block now has its carry-in and can compute its final sum bits. One XOR gate delay. Total: ~6 gate delays for a 16-bit adder regardless of word width.

Total delay for two-level 16-bit CLA: 1 (PG) + 2 (group CLA) + 2 (block CLA) + 1 (sum XOR) = 6 gate delays, compared to 16 × ~2 = 32 gate delays for ripple carry. A 5× speedup.

Section 05

Prefix Adder Trees — Maximum Speed

Prefix adders generalize CLA using a parallel prefix computation. Define the prefix operator ∘ that combines two (G,P) pairs from adjacent bit groups:

(G_high, P_high) ∘ (G_low, P_low) = (G_high + P_high·G_low, P_high·P_low)
// Associative, allows tree-structured parallel computation

Apply this operator in a binary tree structure. After log₂(n) tree levels, every bit position has a prefix result that gives the carry into that position. Two major implementations trade area vs speed:

Kogge-Stone Tree

Depth: O(log₂ n) — minimum possible
Gates: O(n log n) — many wires
Fan-out: 2 at each level
Used in high-performance 64-bit CPUs (Intel, AMD) where speed is paramount over area. Each node connects to the node 2ᵏ positions away at level k.

Brent-Kung Tree

Depth: 2log₂(n)−1 — twice as many levels
Gates: O(n) — minimum gates
Fan-out: 2
Used where area matters more than raw speed. First half builds a complete binary tree upward; second half distributes carries back downward.

In practice: Modern EDA tools (Synopsys DC, Cadence Genus) automatically select the best adder tree during synthesis based on your timing constraints. You write assign S = A + B; in Verilog and the tool infers a Kogge-Stone or Brent-Kung structure to meet timing.

Section 06

Verilog Implementation

Two approaches: explicit gate-level CLA for understanding, and behavioral RTL that lets synthesis infer the optimal structure.

Gate-Level 4-bit CLA

module cla_4bit (
    input  [3:0] A, B,
    input        Cin,
    output [3:0] S,
    output       Cout
);
    wire [3:0] G, P;   // generate, propagate per bit
    wire       C1, C2, C3, C4;

    // PG logic — no carry needed
    assign G = A & B;
    assign P = A ^ B;

    // CLA carry equations (2-level AND-OR)
    assign C1 = G[0] | (P[0] & Cin);
    assign C2 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & Cin);
    assign C3 = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0])
                     | (P[2] & P[1] & P[0] & Cin);
    assign C4 = G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
                     | (P[3] & P[2] & P[1] & G[0])
                     | (P[3] & P[2] & P[1] & P[0] & Cin);

    // Sum: P XOR carry-in to that bit
    assign S[0] = P[0] ^ Cin;
    assign S[1] = P[1] ^ C1;
    assign S[2] = P[2] ^ C2;
    assign S[3] = P[3] ^ C3;
    assign Cout = C4;

endmodule

32-bit Behavioral (Synthesis Infers CLA)

module adder_32bit (
    input  [31:0] A, B,
    input         Cin,
    output [31:0] S,
    output        Cout
);
    assign {Cout, S} = A + B + Cin;
    // Synthesis tool infers fast carry structure based on timing constraints
    // Set timing constraint tight → Kogge-Stone; relaxed → Brent-Kung
endmodule

Group Propagate/Generate — Building Block for Block CLA

module cla_group_pg (
    input  [3:0] G, P,
    input        Cin,
    output       GG, GP, Cout
);
    // Group generate and propagate for cascaded CLA
    assign GG = G[3] | (P[3]&G[2]) | (P[3]&P[2]&G[1]) | (P[3]&P[2]&P[1]&G[0]);
    assign GP = P[3] & P[2] & P[1] & P[0];
    assign Cout = GG | (GP & Cin);
endmodule

Section 07 · Lab

Interactive Lab — Ripple vs CLA Carry Visualizer

Enter two 4-bit numbers (0–15). Watch how the ripple carry propagates bit by bit vs CLA computing all carries simultaneously. The propagation delay difference is shown visually.

A (0–15): B (0–15):

Ripple carry (sequential) CLA carry (parallel) Sum bits