A full adder computes one bit of sum and carry-out. Chain n of them and you get a ripple carry adder. Simple — but the carry has to ripple from bit 0 all the way to bit n−1. Each full adder must wait for the carry-out of the previous one before it can compute its own carry-out and sum.
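The sequential dependency is easy to see in software. Here is a minimal Python model of an n-bit ripple carry adder (function names are illustrative, not from any library); the loop is the ripple, one full adder per iteration:

```python
def full_adder(a, b, cin):
    """One bit of sum and carry-out."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def ripple_carry_add(a, b, n, cin=0):
    """Chain n full adders; each iteration waits on the previous carry."""
    s, c = 0, cin
    for i in range(n):  # O(n) sequential steps: this loop IS the ripple
        bit, c = full_adder((a >> i) & 1, (b >> i) & 1, c)
        s |= bit << i
    return s, c  # (n-bit sum, carry-out)

print(ripple_carry_add(9, 7, 4))  # 9 + 7 = 16 -> (0, 1): sum wraps, carry out set
```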
| Adder Type | Delay | Gate Count | Used in |
|---|---|---|---|
| Ripple Carry | O(n) | O(n) | Simple designs, small n |
| Carry Lookahead | O(1) per 4-bit group | O(n) | ALUs, 8–32 bit |
| Kogge-Stone | O(log n) | O(n log n) | High-speed 64-bit ALUs |
| Brent-Kung | O(log n) | O(n) | Area-efficient 64-bit |
Before computing carries in parallel, define two signals per bit position that describe what that position will do with an incoming carry. These two signals, P and G, are computed purely from A and B — no carry needed.
// This bit produces a carry regardless of Cin
Generate: Gᵢ = Aᵢ · Bᵢ
// This bit passes an incoming carry to Cout
Propagate: Pᵢ = Aᵢ XOR Bᵢ
Carry out: Cᵢ₊₁ = Gᵢ + (Pᵢ · Cᵢ)
Sum: Sᵢ = Pᵢ XOR Cᵢ
Generate (G = 1) — A=1, B=1
1 + 1 = 10 in binary. This bit generates a carry regardless of any carry-in. Even if Cin = 0, Cout = 1. G is the "certain carry" signal.
Propagate (P = 1) — A⊕B = 1
One of A or B is 1, the other is 0. This bit will propagate a carry-in to carry-out. If Cin = 1, Cout = 1. If Cin = 0, Cout = 0. P passes carry through.
Kill (G=0, P=0) — A=0, B=0
0 + 0 = 0. No carry generated, none propagated. This bit kills any incoming carry: Cout = 0 regardless of Cin.
Sum is always P ⊕ C
Once we know the carry-in Cᵢ for each position, the sum bit is trivially Sᵢ = Pᵢ ⊕ Cᵢ. This is why computing carries fast is the only problem to solve.
| A | B | G = A·B | P = A⊕B | Behaviour |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | Kill — Cout always 0 |
| 0 | 1 | 0 | 1 | Propagate — Cout = Cin |
| 1 | 0 | 0 | 1 | Propagate — Cout = Cin |
| 1 | 1 | 1 | 0 | Generate — Cout always 1 |
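The table maps directly onto two bitwise operations. A quick Python check of the generate/propagate/kill classification (the `classify` helper is purely illustrative):

```python
def classify(a, b):
    """Classify one bit position from its inputs alone -- no carry needed."""
    g = a & b   # generate: both inputs 1
    p = a ^ b   # propagate: exactly one input 1
    if g:
        return "generate"   # Cout = 1 regardless of Cin
    if p:
        return "propagate"  # Cout = Cin
    return "kill"           # Cout = 0 regardless of Cin

for a in (0, 1):
    for b in (0, 1):
        print(a, b, classify(a, b))
```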
Now substitute the carry recurrence Cᵢ₊₁ = Gᵢ + Pᵢ·Cᵢ into itself repeatedly. Each carry expression expands until it depends only on the G and P signals and the original C₀, never on another carry, so all four carries are computable simultaneously.
C₁ = G₀ + P₀·C₀
C₂ = G₁ + P₁·G₀ + P₁·P₀·C₀
C₃ = G₂ + P₂·G₁ + P₂·P₁·G₀ + P₂·P₁·P₀·C₀
C₄ = G₃ + P₃·G₂ + P₃·P₂·G₁ + P₃·P₂·P₁·G₀ + P₃·P₂·P₁·P₀·C₀
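The flattening can be checked mechanically. This Python sketch evaluates the expanded equations from bit-level G, P, and C₀ only, then verifies the resulting sum against ordinary addition for every 4-bit input pair:

```python
def cla4_carries(a, b, c0):
    """All carries from G, P and c0 alone -- no carry feeds another carry."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c1, c2, c3, c4], p

# Sum bits are just P XOR carry-in; check against Python's own addition
for a in range(16):
    for b in range(16):
        c, p = cla4_carries(a, b, 0)
        carries = [0] + c  # carry INTO bits 0..3
        s = sum(((p[i] ^ carries[i]) << i) for i in range(4)) | (c[3] << 4)
        assert s == a + b
```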
Each equation is a sum-of-products (AND-OR) — exactly 2 levels of logic. No matter how wide the adder, every carry is computed with one AND stage and one OR stage. This is the core of CLA speed.
A single 4-bit CLA block has constant carry delay. To build a 16-bit adder, chain four 4-bit CLA blocks. But simply chaining them makes the group carries ripple again. The solution: define Group Generate (GG) and Group Propagate (GP) for each block, then apply CLA logic at the block level too.
// Block produces a carry-out regardless of block carry-in
Group Generate: GGₖ = G₃ + P₃·G₂ + P₃·P₂·G₁ + P₃·P₂·P₁·G₀
// Block passes a carry-in all the way through to carry-out
Group Propagate: GPₖ = P₃·P₂·P₁·P₀
Level 1 — Bit-level PG
Compute Gᵢ = Aᵢ·Bᵢ and Pᵢ = Aᵢ⊕Bᵢ for every bit. One gate delay.
Level 2 — 4-bit group CLA
Each 4-bit block computes its internal carries AND its GG/GP outputs using AND-OR logic. Two gate delays. All four blocks work in parallel.
Level 3 — Block-level CLA
A second CLA network uses GG/GP from each block to compute the block carry-ins (C₄, C₈, C₁₂) simultaneously. Two more gate delays.
Level 4 — Final sums
Each block now has its carry-in and can compute its final sum bits. One XOR gate delay. Total: roughly 6 gate delays for a 16-bit adder, versus about 32 for ripple carry; adding another hierarchy level extends the scheme to 32 or 64 bits for only a few more gate delays.
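The four levels can be modeled directly. Below is a hedged Python sketch of a 16-bit adder built from four 4-bit blocks linked by GG/GP (helper names are illustrative); note that where hardware computes the block carry-ins with a second parallel AND-OR network, this model just evaluates the same recurrence in a loop:

```python
def block_pg(a4, b4):
    """Levels 1-2: bit-level PG, then group generate/propagate for one block."""
    g = [(a4 >> i) & (b4 >> i) & 1 for i in range(4)]
    p = [((a4 >> i) ^ (b4 >> i)) & 1 for i in range(4)]
    gg = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    gp = p[3] & p[2] & p[1] & p[0]
    return g, p, gg, gp

def cla16(a, b, c0=0):
    """16-bit add from four 4-bit blocks linked by GG/GP, not a bit-level ripple."""
    pg = [block_pg((a >> (4 * k)) & 0xF, (b >> (4 * k)) & 0xF) for k in range(4)]
    cin, s = c0, 0
    for k, (g, p, gg, gp) in enumerate(pg):
        c = cin
        for i in range(4):               # Level 4: sum bits inside the block
            s |= (p[i] ^ c) << (4 * k + i)
            c = g[i] | (p[i] & c)
        # Level 3: hardware computes C4/C8/C12 in parallel from GG/GP;
        # this loop evaluates the same block-level recurrence sequentially
        cin = gg | (gp & cin)
    return s, cin

assert cla16(40000, 30000) == (4464, 1)  # 70000 wraps to 4464 with carry out
```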
Prefix adders generalize CLA using a parallel prefix computation. Define the prefix operator ∘ that combines two (G, P) pairs from adjacent bit groups, with the left pair covering the more significant bits:
// Associative, allows tree-structured parallel computation
(G, P) ∘ (G′, P′) = (G + P·G′, P·P′)
Apply this operator in a binary tree structure. After log₂(n) tree levels, every bit position has a prefix result that gives the carry into that position. Two major implementations trade area against speed:
Kogge-Stone Tree
Depth: O(log₂ n) — minimum possible
Gates: O(n log n) — many wires
Fan-out: 2 at each level
Used in high-performance 64-bit CPUs (Intel, AMD) where speed is paramount over area. Each node connects to the node 2ᵏ positions away at level k.
Brent-Kung Tree
Depth: 2log₂(n)−1 (roughly twice as many levels as Kogge-Stone)
Gates: O(n) — minimum gates
Fan-out: 2
Used where area matters more than raw speed. First half builds a complete binary tree upward; second half distributes carries back downward.
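The operator and tree structure are compact to express in software. Here is a hedged Python sketch of a Kogge-Stone-style prefix computation over (G, P) pairs (variable names are illustrative); each level combines every position with the one 2ᵏ places below it:

```python
def prefix_op(left, right):
    """(G, P) o (G', P') = (G + P*G', P*P'); associative, so tree order is free."""
    g1, p1 = left
    g0, p0 = right
    return (g1 | (p1 & g0), p1 & p0)

def kogge_stone_add(a, b, n):
    """n-bit add (n a power of two) via a Kogge-Stone prefix tree, carry-in 0."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    gp = list(zip(g, p))
    d = 1
    while d < n:  # log2(n) levels; all positions update in parallel per level
        gp = [prefix_op(gp[i], gp[i - d]) if i >= d else gp[i] for i in range(n)]
        d *= 2
    # gp[i][0] is now the carry OUT of bit i; carry INTO bit i is gp[i-1][0]
    carries = [0] + [gi for gi, _ in gp]
    s = sum(((p[i] ^ carries[i]) << i) for i in range(n))
    return s, carries[n]

assert kogge_stone_add(200, 100, 8) == (44, 1)  # 300 wraps to 44 with carry out
```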
In practice you rarely build these trees by hand: write assign S = A + B; in Verilog and the synthesis tool infers a Kogge-Stone, Brent-Kung, or similar structure to meet timing.
Two approaches: explicit gate-level CLA for understanding, and behavioral RTL that lets synthesis infer the optimal structure.
Gate-Level 4-bit CLA
module cla_4bit (
input [3:0] A, B,
input Cin,
output [3:0] S,
output Cout
);
wire [3:0] G, P; // generate, propagate per bit
wire C1, C2, C3, C4;
// PG logic — no carry needed
assign G = A & B;
assign P = A ^ B;
// CLA carry equations (2-level AND-OR)
assign C1 = G[0] | (P[0] & Cin);
assign C2 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & Cin);
assign C3 = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0])
| (P[2] & P[1] & P[0] & Cin);
assign C4 = G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
| (P[3] & P[2] & P[1] & G[0])
| (P[3] & P[2] & P[1] & P[0] & Cin);
// Sum: P XOR carry-in to that bit
assign S[0] = P[0] ^ Cin;
assign S[1] = P[1] ^ C1;
assign S[2] = P[2] ^ C2;
assign S[3] = P[3] ^ C3;
assign Cout = C4;
endmodule
32-bit Behavioral (Synthesis Infers CLA)
module adder_32bit (
input [31:0] A, B,
input Cin,
output [31:0] S,
output Cout
);
assign {Cout, S} = A + B + Cin;
// Synthesis tool infers fast carry structure based on timing constraints
// Tight timing constraints push the tool toward faster, larger trees
// (Kogge-Stone-like); relaxed constraints allow smaller ones (Brent-Kung-like)
endmodule
Group Propagate/Generate — Building Block for Block CLA
module cla_group_pg (
input [3:0] G, P,
input Cin,
output GG, GP, Cout
);
// Group generate and propagate for cascaded CLA
assign GG = G[3] | (P[3]&G[2]) | (P[3]&P[2]&G[1]) | (P[3]&P[2]&P[1]&G[0]);
assign GP = P[3] & P[2] & P[1] & P[0];
assign Cout = GG | (GP & Cin);
endmodule
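The identity this module relies on, that a block's carry-out equals GG | (GP · Cin), can be brute-force checked in software. A quick Python verification over all 4-bit input combinations (function name is illustrative):

```python
def group_pg(g, p):
    """Same equations as cla_group_pg: GG, GP from four bit-level G, P pairs."""
    gg = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    gp = p[3] & p[2] & p[1] & p[0]
    return gg, gp

ok = True
for a in range(16):
    for b in range(16):
        for cin in (0, 1):
            g = [(a >> i) & (b >> i) & 1 for i in range(4)]
            p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]
            gg, gp = group_pg(g, p)
            real_cout = (a + b + cin) >> 4  # true carry out of a 4-bit add
            ok &= (gg | (gp & cin)) == real_cout
print("GG/GP identity holds:", ok)  # prints: GG/GP identity holds: True
```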