Deep Dive · Digital Electronics

Carry Lookahead Adder —
No More Waiting for Carry

The ripple carry adder forces every bit to wait for the one before it. CLA breaks that chain — computing all carries simultaneously using propagate & generate logic. Here's how, from first principles to Verilog.

~20 min read
Basics → Prefix Trees
Interactive Lab
← Combinational Circuits LFSR →
Section 01
The Ripple Carry Problem

A full adder computes one bit of sum and carry-out. Chain n of them and you get a ripple carry adder. Simple — but the carry has to ripple from bit 0 all the way to bit n−1. Each full adder must wait for the carry-out of the previous one before it can compute its own carry-out and sum.

FA bit 0 FA bit 1 FA bit 2 FA bit 3 C1 C2 C3 C4 A0 B0 C0 ← Critical path: carry must ripple through every FA → O(n) delay →
4-bit ripple carry adder. C4 cannot be known until C3 resolves, which waits for C2, which waits for C1. For a 32-bit adder: 32 FA delays on the critical path.
Real-world impact: At 28nm, one full adder contributes ~50 ps of carry delay. A 64-bit ripple carry adder needs ~3.2 ns just for carry propagation — limiting the clock to ~300 MHz. Modern CPUs run at 3–5 GHz. A CLA-based adder reduces this to 3–4 gate delays regardless of width.
Adder TypeDelayGate CountUsed in
Ripple CarryO(n)O(n)Simple designs, small n
Carry LookaheadO(1) per 4-bit groupO(n)ALUs, 8–32 bit
Kogge-StoneO(log n)O(n log n)High-speed 64-bit ALUs
Brent-KungO(log n)O(n)Area-efficient 64-bit
Section 02
Propagate & Generate — The Key Insight

Before computing carries in parallel, define two signals per bit position that describe what that position will do with an incoming carry. These two signals, P and G, are computed purely from A and B — no carry needed.

Generate: Gᵢ = Aᵢ AND Bᵢ
// This bit produces a carry regardless of Cin

Propagate: Pᵢ = Aᵢ XOR Bᵢ
// This bit passes an incoming carry to Cout

Carry out: Cᵢ₊₁ = Gᵢ + (Pᵢ · Cᵢ)
Sum: Sᵢ = Pᵢ XOR Cᵢ

Generate (G = 1) — A=1, B=1

1 + 1 = 10 in binary. This bit generates a carry regardless of any carry-in. Even if Cin = 0, Cout = 1. G is the "certain carry" signal.

Propagate (P = 1) — A⊕B = 1

One of A or B is 1, the other is 0. This bit will propagate a carry-in to carry-out. If Cin = 1, Cout = 1. If Cin = 0, Cout = 0. P passes carry through.

Kill (G=0, P=0) — A=0, B=0

0 + 0 = 0. No carry generated, no carry propagated. This bit kills any incoming carry. Cout = 0 always regardless of Cin.

Sum is always P ⊕ C

Once we know the carry-in Cᵢ for each position, the sum bit is trivially Sᵢ = Pᵢ ⊕ Cᵢ. This is why computing carries fast is the only problem to solve.

ABG = A·BP = A⊕BBehaviour
0000Kill — Cout always 0
0101Propagate — Cout = Cin
1001Propagate — Cout = Cin
1110Generate — Cout always 1
Section 03
CLA Boolean Equations — All Carries in Parallel

Now substitute the carry recurrence Cᵢ₊₁ = Gᵢ + Pᵢ·Cᵢ into itself repeatedly. Each carry expression expands until it depends only on G, P signals and the original C₀ — never on another carry. All four carries are now computable simultaneously.

C₁ = G₀ + P₀·C₀
C₂ = G₁ + P₁·G₀ + P₁·P₀·C₀
C₃ = G₂ + P₂·G₁ + P₂·P₁·G₀ + P₂·P₁·P₀·C₀
C₄ = G₃ + P₃·G₂ + P₃·P₂·G₁ + P₃·P₂·P₁·G₀ + P₃·P₂·P₁·P₀·C₀

Each equation is a sum-of-products (AND-OR) — exactly 2 levels of logic. No matter how wide the adder, every carry is computed with one AND stage and one OR stage. This is the core of CLA speed.

Reading C₄: It says — "a carry out of bit 3 exists if: bit 3 generates one (G₃), OR bit 2 generated one that bit 3 propagated (P₃·G₂), OR bits 1–2 generated+propagated it up (P₃·P₂·G₁), and so on all the way to the original carry-in rippling through all 4 bits (P₃·P₂·P₁·P₀·C₀)."
PG LOGIC LAYER — computed from A, B only (parallel, no carry needed) G₀=A₀·B₀ P₀=A₀⊕B₀ | G₁=A₁·B₁ P₁=A₁⊕B₁ | G₂=A₂·B₂ P₂=A₂⊕B₂ | G₃=A₃·B₃ P₃=A₃⊕B₃ CLA LOGIC — AND-OR network, 2 gate levels, all carries computed simultaneously C₁ C₂ C₃ C₄ (all ready at the same time) SUM LAYER — Sᵢ = Pᵢ ⊕ Cᵢ (one XOR gate per bit, uses pre-computed Cᵢ) S₀ S₁ S₂ S₃ (final output) Total: 3 gate levels vs n·tFA ripple
4-bit CLA structure — 3 levels total: PG computation (1 level) + CLA AND-OR (2 levels) = fixed delay regardless of n.
Section 04
Block CLA — Scaling to 16-bit and 32-bit

A single 4-bit CLA block has constant carry delay. To build a 16-bit adder, chain four 4-bit CLA blocks. But simply chaining them makes the group carries ripple again. The solution: define Group Generate (GG) and Group Propagate (GP) for each block, then apply CLA logic at the block level too.

Group Generate: GGₖ = G₃ + P₃·G₂ + P₃·P₂·G₁ + P₃·P₂·P₁·G₀
// Block produces a carry out regardless of block carry-in

Group Propagate: GPₖ = P₃·P₂·P₁·P₀
// Block passes a carry-in all the way through to carry-out
1

Level 1 — Bit-level PG

Compute Gᵢ = Aᵢ·Bᵢ and Pᵢ = Aᵢ⊕Bᵢ for every bit. One gate delay.

2

Level 2 — 4-bit group CLA

Each 4-bit block computes its internal carries AND its GG/GP outputs using AND-OR logic. Two gate delays. All four blocks work in parallel.

3

Level 3 — Block-level CLA

A second CLA network uses GG/GP from each block to compute the block carry-ins (C₄, C₈, C₁₂) simultaneously. Two more gate delays.

4

Level 4 — Final sums

Each block now has its carry-in and can compute its final sum bits. One XOR gate delay. Total: ~6 gate delays for a 16-bit adder regardless of word width.

Total delay for two-level 16-bit CLA: 1 (PG) + 2 (group CLA) + 2 (block CLA) + 1 (sum XOR) = 6 gate delays, compared to 16 × ~2 = 32 gate delays for ripple carry. A 5× speedup.
Section 05
Prefix Adder Trees — Maximum Speed

Prefix adders generalize CLA using a parallel prefix computation. Define the prefix operator ∘ that combines two (G,P) pairs from adjacent bit groups:

(G_high, P_high) ∘ (G_low, P_low) = (G_high + P_high·G_low, P_high·P_low)
// Associative, allows tree-structured parallel computation

Apply this operator in a binary tree structure. After log₂(n) tree levels, every bit position has a prefix result that gives the carry into that position. Two major implementations trade area vs speed:

Kogge-Stone Tree

Depth: O(log₂ n) — minimum possible
Gates: O(n log n) — many wires
Fan-out: 2 at each level
Used in high-performance 64-bit CPUs (Intel, AMD) where speed is paramount over area. Each node connects to the node 2ᵏ positions away at level k.

Brent-Kung Tree

Depth: 2log₂(n)−1 — twice as many levels
Gates: O(n) — minimum gates
Fan-out: 2
Used where area matters more than raw speed. First half builds a complete binary tree upward; second half distributes carries back downward.

In practice: Modern EDA tools (Synopsys DC, Cadence Genus) automatically select the best adder tree during synthesis based on your timing constraints. You write assign S = A + B; in Verilog and the tool infers a Kogge-Stone or Brent-Kung structure to meet timing.
Section 06
Verilog Implementation

Two approaches: explicit gate-level CLA for understanding, and behavioral RTL that lets synthesis infer the optimal structure.

Gate-Level 4-bit CLA

module cla_4bit (
    input  [3:0] A, B,
    input        Cin,
    output [3:0] S,
    output       Cout
);
    wire [3:0] G, P;   // generate, propagate per bit
    wire       C1, C2, C3, C4;

    // PG logic — no carry needed
    assign G = A & B;
    assign P = A ^ B;

    // CLA carry equations (2-level AND-OR)
    assign C1 = G[0] | (P[0] & Cin);
    assign C2 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & Cin);
    assign C3 = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0])
                     | (P[2] & P[1] & P[0] & Cin);
    assign C4 = G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
                     | (P[3] & P[2] & P[1] & G[0])
                     | (P[3] & P[2] & P[1] & P[0] & Cin);

    // Sum: P XOR carry-in to that bit
    assign S[0] = P[0] ^ Cin;
    assign S[1] = P[1] ^ C1;
    assign S[2] = P[2] ^ C2;
    assign S[3] = P[3] ^ C3;
    assign Cout = C4;

endmodule

32-bit Behavioral (Synthesis Infers CLA)

module adder_32bit (
    input  [31:0] A, B,
    input         Cin,
    output [31:0] S,
    output        Cout
);
    assign {Cout, S} = A + B + Cin;
    // Synthesis tool infers fast carry structure based on timing constraints
    // Set timing constraint tight → Kogge-Stone; relaxed → Brent-Kung
endmodule

Group Propagate/Generate — Building Block for Block CLA

module cla_group_pg (
    input  [3:0] G, P,
    input        Cin,
    output       GG, GP, Cout
);
    // Group generate and propagate for cascaded CLA
    assign GG = G[3] | (P[3]&G[2]) | (P[3]&P[2]&G[1]) | (P[3]&P[2]&P[1]&G[0]);
    assign GP = P[3] & P[2] & P[1] & P[0];
    assign Cout = GG | (GP & Cin);
endmodule
Section 07 · Lab
Interactive Lab — Ripple vs CLA Carry Visualizer

Enter two 4-bit numbers (0–15). Watch how the ripple carry propagates bit by bit vs CLA computing all carries simultaneously. The propagation delay difference is shown visually.

Ripple carry (sequential) CLA carry (parallel) Sum bits

Related Topics

Go Deeper

Combinational Circuits
Half adder, full adder, MUX, decoder, encoder — the full combinational circuit toolkit including where ripple carry adders are used.
LFSR — Linear Feedback Shift Register
Another classic digital building block — pseudorandom sequences, CRC generation, BIST. XOR-based feedback makes maximal-length sequences.
Critical Path & STA
CLA exists because of critical path constraints. See how STA measures adder delay, identifies the critical carry chain, and how timing closure drives adder selection.