DFT Day 4 — ATPG: D-Algorithm, PODEM, FAN & Fault Simulation

What is ATPG?

In Days 1–3 we established what faults look like (stuck-at, transition, bridging), how scan chains make every flip-flop controllable and observable, and how EDT compression reduces the number of test cycles needed. Now comes the central question: how do we compute the actual test patterns?

Automatic Test Pattern Generation (ATPG) is the process of algorithmically computing a set of binary input patterns such that, when applied to the chip, each pattern detects one or more faults from the fault list. The ATPG tool's goal is to maximize fault coverage (the fraction of modeled faults detected) while minimizing pattern count (because more patterns = more test time = more cost).

ATPG works at the gate level on the synthesized netlist. For each undetected fault in the fault list, the algorithm searches for a primary input assignment (and, via scan chains, a flip-flop state assignment) that:

Activates the fault — drives the faulty net to the opposite of its stuck value
Propagates the resulting error — creates a path from the fault site to an observable output (primary output or scan chain output)
Justifies the assignment — ensures all required gate input values are consistently achievable from primary inputs

ATPG vs Manual Testing

A chip with 100,000 gates has millions of possible fault sites. Computing test patterns manually is impossible. ATPG tools (Tessent, TetraMAX, FastScan) run on the gate-level netlist and generate 99%+ coverage in minutes to hours, depending on design complexity. The resulting patterns are loaded onto an ATE (Automatic Test Equipment) which applies them at-speed to every manufactured die.

D-Notation — The Language of ATPG

Roth's 1966 paper introduced the D-calculus, a five-valued logic system that allows ATPG algorithms to reason simultaneously about fault-free and faulty circuit behavior. The five values are:

Symbol	Fault-Free Value	Faulty Circuit Value	Meaning
0	0	0	Normal logic 0 (same in both circuits)
1	1	1	Normal logic 1 (same in both circuits)
X	unassigned	unassigned	Don't-care / not yet assigned
D	1	0	Error: fault-free=1, faulty=0 — detectable difference
D̄	0	1	Error: fault-free=0, faulty=1 — detectable difference

D and D-bar (D with a bar) represent the error signal — a point in the circuit where the fault-free and faulty circuits produce different values. The ATPG goal is to propagate D or D-bar to a primary output. Once D reaches a primary output, the ATE can observe the difference: the good chip outputs 1 while the bad chip outputs 0 (for D), or vice versa (for D-bar).

D-Calculus Through an AND Gate

Consider a 2-input AND gate. If input A has a stuck-at-0 (SA0) fault, and we want to detect it:

Input A (fault site)	Input B	AND output (fault-free)	D-notation output
D (1→0 due to SA0)	1	1	D — fault detected!
D (1→0 due to SA0)	0	0	0 — error masked by B=0
0	X	0	0 — fault not activated

The key rule for propagating D through an AND gate: all non-D inputs must be 1 (the controlling-value complement for AND). Setting B=1 allows the D on input A to pass through to the output as D. This is called sensitizing the gate to the fault.

The D-Algorithm

The D-Algorithm (Roth, 1966) was the first systematic ATPG procedure. It operates on the gate-level circuit graph and proceeds in three phases:

Fault Activation (D-drive): Assign D (or D-bar) to the faulty net by setting the primary input (or scan FF state) that drives the faulty net to the opposite of the stuck value. For a SA0 fault on net N: set the driver of N to 1 — in the fault-free circuit N=1, in the faulty circuit N=0 (stuck). This creates a D on net N.

D-Propagation (Forward implication): Find a path from the D/D-bar value to a primary output or scan chain output. At each gate on the path, assign the necessary non-D inputs to sensitize the gate (1 for AND/NAND, 0 for OR/NOR). Continue propagating D through the gate until a primary output is reached. If all paths are blocked (outputs already assigned, D cannot propagate), backtrack.

Justification (Backward implication): Work backward from the assigned gate inputs to primary inputs, ensuring all required values are achievable from primary inputs. If an AND gate needs its B input to be 1, trace back through the logic to find primary inputs that drive B=1. If a conflict is found (same net needs to be 0 and 1), backtrack and try another propagation path.

Worked Example — 3-Gate Circuit

Consider this simple circuit: A AND B → G1 → G1 OR C → G2 → output Z. Suppose we want to detect a SA0 fault on the output of G1 (the AND gate).

D-Algorithm Example — SA0 on AND gate output, propagate through OR gate to output Z

Step-by-step trace: (1) Activate: set A=1, B=1 so G1's fault-free output = 1, but SA0 forces it to 0. D is placed on G1's output. (2) Propagate: G1.out → G2.input. For the OR gate, we need C=0 (the non-controlling value for OR) so D passes through. G2 output = D. (3) Justify: A=1 (PI), B=1 (PI), C=0 (PI) — all directly set. No conflict. Test vector: {A=1, B=1, C=0}. Good chip output Z=1; faulty chip Z=0. Fault detected.

PODEM — Path-Oriented Decision Making

The D-algorithm was groundbreaking but had a critical scalability problem: it backtracks at internal circuit nodes. When D-propagation fails at a gate deep in the circuit, the algorithm must try all possible gate input assignments — an exponential search in the worst case. For industrial circuits with hundreds of thousands of gates, this backtracking explosion made D-algorithm impractical.

PODEM (Prabhu Goel, 1981) solved this with a key insight: all backtracking decisions should be made at primary inputs, not internal nodes. This transforms the search space from exponential in circuit depth to linear in number of primary inputs.

PODEM Algorithm Structure

Choose a Primary Input (PI) to assign: Select an unassigned PI that, according to an "objective function," will help either activate the fault or propagate D toward a primary output. The objective function heuristically picks the most promising PI.

Assign a binary value (0 or 1) to that PI: Try 0 first, then 1 if needed. Forward-imply: propagate the PI value through the circuit logic as far as possible using Boolean implication.

Check D-propagation: After implication, check if D or D-bar has reached a primary output. If yes — test found! If the fault is activated and D can still be propagated, return to step 1 for more assignments.

Backtrack if necessary: If a contradiction is found (impossible assignment) or D-propagation is provably blocked, backtrack to the most recent PI assignment and try the other value. If both values have been tried, backtrack further.

Why PODEM Is Faster Than D-Algorithm

PODEM's backtrack function operates only on primary inputs. This means the search tree has depth equal to the number of primary inputs (say 100), not circuit depth (potentially 1000+ gates). The number of nodes in the search tree is at most 2^100 in theory, but PODEM's objective function is effective enough that in practice it finds patterns in very few backtrack steps. Industrial benchmarks show 10–100x speedup over D-algorithm on real circuits.

FAN — Fanout-Oriented ATPG

FAN (Fanout-oriented ATPG, Fujiwara & Shimono, 1983) is the next major improvement after PODEM, adding three key ideas that make it significantly faster for circuits with high fanout:

1. Multiple-Backtrace Procedure

Instead of making one PI assignment at a time (PODEM), FAN propagates requirements through the circuit simultaneously from multiple gates, resolving which PI to set without committing prematurely. This avoids unnecessary backtracks caused by choosing a PI that doesn't help the current goal.

2. Head Lines

FAN identifies head lines — the first line on each unique sensitization path from the fault site to primary outputs. Head lines act as decision points. FAN assigns values at head lines first, then works backward to primary inputs. This focuses the search on the paths most relevant to fault propagation.

3. Unique Sensitization

If there is only one possible path from D to a primary output (unique sensitization), FAN commits to that path immediately without exploration — eliminating backtracking for that branch entirely. This is particularly powerful in deep pipeline designs where most paths are unique.

Algorithm	Year	Backtrack Point	Key Innovation	Relative Speed
D-Algorithm	1966	Internal nodes	D-calculus, first systematic ATPG	1x (baseline)
PODEM	1981	Primary inputs only	PI-level backtracking, objective function	10–100x faster
FAN	1983	Head lines + PIs	Multiple backtrace, unique sensitization	100–1000x faster

Modern commercial ATPG tools (Tessent, TetraMAX) use variants of FAN with additional optimizations: dominance-based pruning, redundancy identification during ATPG, and concurrent fault simulation to detect multiple faults per pattern.

Fault Simulation

Fault simulation is the process of simulating the circuit with each fault injected and determining whether a given test pattern detects that fault. It is distinct from ATPG: ATPG generates patterns targeting specific faults; fault simulation evaluates a pattern set against the entire fault list.

Parallel Fault Simulation

Simulating one fault at a time would be impractical — a circuit with 100,000 nets has 200,000 stuck-at fault sites. Parallel fault simulation exploits the bit-parallel nature of computer arithmetic: simulate N faults simultaneously by using N-bit words. Each bit position represents one fault. For each gate operation, the simulator computes the N-bit word representing the output for all N faults simultaneously using one 64-bit (or 256-bit SIMD) operation.

Fault Coverage Formula: FC = (Detected + ATPG-Redundant) / Total Faults × 100% Where: Detected = faults detected by at least one pattern ATPG-Redundant = structurally untestable faults (excluded from denominator in some tools) Total Faults = all faults in the initial fault list Industry standard: Stuck-at FC ≥ 99.0% Transition FC ≥ 95.0%

Fault Dropping

Fault dropping is the key optimization that makes fault simulation feasible. Once a fault is detected by any pattern, it is removed from the active fault list. Subsequent patterns only simulate against the remaining undetected faults. Since early patterns tend to detect many faults, the active fault list shrinks rapidly. A typical simulation run detects 80% of faults with the first 100 patterns and spends most of its time on the last 1% of hard-to-detect faults.

Fault Dropping vs Fault Collapsing

These are different concepts. Fault dropping removes a fault from simulation once detected — it's a runtime optimization. Fault collapsing reduces the initial fault list by identifying faults that are equivalent (have identical detection conditions) or dominated (detecting one guarantees detecting the other). Fault collapsing is done once before ATPG; fault dropping happens dynamically during simulation.

Untestable Faults

Not all faults in the fault list can be detected by any test pattern. Untestable faults (also called ATPG-redundant or structurally redundant faults) are those for which no test pattern exists because the circuit's logic structure makes it impossible to simultaneously:

Activate the fault (drive the faulty net to the opposite value), AND
Propagate the resulting error to any primary output

This often happens in circuits with intentional redundancy — logic added to prevent hazards or glitches. For example, a glitch-suppression AND gate that duplicates another logic path creates a situation where the fault on the duplicate path can never be observed independently. Intentional redundancy must be documented in the DFT sign-off report.

Fault Classification	Definition	In Coverage Denominator?	Action
Detected	At least one pattern detects it	Yes (numerator)	Count toward coverage
Undetected	No pattern found yet (time limit, tool limit)	Yes (denominator)	Reduce coverage; investigate
ATPG-Redundant	Provably untestable — no pattern possible	Excluded (or both)	Document; review for unintentional redundancy
Aborted	ATPG gave up within time/backtrack limit	Yes (denominator)	Same as undetected; tune ATPG effort

The distinction matters for coverage calculation. Some tools include ATPG-redundant faults in the denominator and separately report them; others exclude them. Industry comparisons between coverage numbers must account for this — "99% coverage (tool A)" may not equal "99% coverage (tool B)" if one excludes redundant faults and the other doesn't.

Tessent / TetraMAX ATPG Flow

A production ATPG run in Siemens Tessent (or Synopsys TetraMAX) follows these steps. The output is a set of test patterns in STIL (Standard Test Interface Language) or WGL format, ready to be loaded onto the ATE.

Tessent TCL — Full ATPG Run (stuck-at + transition)

# ── Step 1: Setup Tessent ATPG context ──────────────────────
set_context patterns -scan

# ── Step 2: Read scan-inserted netlist (from Day 3 flow) ────
read_verilog     top_chip_scan.v
read_cell_library {tsmc7nm_tt.lib tsmc7nm_scan.lib}
read_sdc          top_chip_scan.sdc
set_current_design top_chip

# ── Step 3: Set ATPG constraints ────────────────────────────
add_clock  0 clk -period 2.0
add_scan_enable scan_en
set_fault_type stuck          ;# or: transition, path_delay

# ── Step 4: Build the fault list ────────────────────────────
create_patterns -auto         ;# auto generates + simulates
set_atpg_limits -backtrack_limit 50
set_atpg_limits -coverage_target 99.0

# ── Step 5: Run ATPG ────────────────────────────────────────
run_atpg -auto

# ── Step 6: Report coverage ─────────────────────────────────
report_statistics
report_fault_coverage -class {detected undetected redundant}

# ── Step 7: Write patterns in STIL format for ATE ───────────
write_patterns top_chip_atpg.stil -format stil
write_patterns top_chip_atpg.wgl  -format wgl

# ── Transition fault run (at-speed) ─────────────────────────
set_fault_type transition
run_atpg -auto
write_patterns top_chip_trans.stil -format stil

The report_statistics output will show the number of detected, undetected, and ATPG-redundant faults, along with pattern count and CPU time. A well-tuned ATPG run on a 5M-gate block typically generates 3,000–8,000 stuck-at patterns achieving 99.2–99.8% coverage in 1–4 hours of compute time.

Coverage vs Pattern Count — The Last-Mile Problem

Fault coverage doesn't grow linearly with pattern count. Early patterns detect many faults (easy targets). As coverage increases, the remaining undetected faults are progressively harder — requiring more specific input conditions, longer propagation paths, or more backtracking. This creates the last-mile problem: the final 1% of coverage costs as many patterns as the first 98%.

Fault Coverage Target	Approx. Pattern Count	Incremental Patterns Needed	Marginal DPPM Reduction
90%	~200 patterns	200	High
95%	~500 patterns	300	High
99%	~1,500 patterns	1,000	Moderate
99.5%	~3,000 patterns	1,500	Low
99.9%	~15,000 patterns	12,000	Very Low
99.99%	~80,000+ patterns	65,000	Diminishing

In practice, DFT engineers set a coverage target (e.g., 99.0% stuck-at) and accept the residual DPPM risk associated with the undetected 1%. The ATPG tool's -coverage_target parameter stops generation once the target is met, avoiding unnecessary pattern explosion. Additional techniques for hard-to-detect faults include: increasing the backtrack limit, running a dedicated "top-up" ATPG pass with higher effort, or using dynamic compaction to squeeze multiple fault detections into each pattern.

Dynamic Compaction

A single test pattern often detects dozens or hundreds of faults simultaneously. Dynamic compaction fills the "don't care" (X) bits of a generated pattern — bits not needed for the target fault — with assignments that detect additional faults. This is why the pattern count in the table above is far less than the fault count. Without compaction, you would need one pattern per fault.

Interview Questions & Answers

What is the difference between D-algorithm and PODEM?

The D-algorithm (Roth, 1966) backtracks at internal circuit nodes — when D-propagation fails at a gate, it tries all possible input assignments at that internal gate, leading to exponential search in deep circuits. PODEM (Path-Oriented Decision Making, Goel, 1981) restricts all backtracking decisions to primary inputs only. By only assigning values to primary inputs and forward-implying the effects, PODEM avoids the internal node explosion. In practice, PODEM is 10–100x faster than D-algorithm on industrial circuits. Both algorithms achieve the same theoretical fault coverage — the difference is purely in search efficiency.

What is a redundant fault in ATPG?

A redundant fault (structurally untestable fault) is one for which no input pattern exists that can simultaneously activate the fault AND propagate the resulting error to any observable output. This is a property of the circuit structure, not a tool limitation — even with unlimited compute, no pattern would detect it. Redundant faults arise from intentional redundancy in the design (e.g., glitch-suppression logic, voter circuits), where duplicate logic paths create situations where one path's fault is always masked by the other path's correct output. They are excluded from the coverage denominator and must be documented in DFT sign-off. Their presence should be reviewed to confirm they are intentional — accidental redundancy can indicate design bugs.

What is the difference between fault simulation and ATPG?

ATPG computes new test patterns targeting specific undetected faults — it's a pattern generation algorithm. Fault simulation takes an existing pattern set and measures which faults are detected by those patterns. ATPG uses fault simulation internally to verify each generated pattern and to perform fault dropping. But fault simulation is also run as a standalone step: after ATPG, the full pattern set is fault-simulated to compute the final coverage report, including faults detected by multiple patterns (compaction analysis) and to identify which specific faults remain undetected. Think of ATPG as the writer and fault simulation as the editor: ATPG writes patterns; fault simulation grades them.

Why can't we achieve 100% fault coverage in practice?

Three fundamental reasons: (1) Structural untestability — redundant faults that no pattern can detect due to circuit topology; (2) Practical ATPG limits — some faults require extremely long sensitization paths or many specific input constraints; ATPG tools abort after a backtrack limit is exceeded; (3) Fault model limitations — the stuck-at model doesn't capture all physical defects; some defects (high-resistance opens, coupled noise) only manifest under specific timing or environmental conditions not covered by any static test. Industry practice targets 99%+ stuck-at coverage (excluding ATPG-redundant faults) and accepts the residual DPPM risk, quantified by empirical defect level models.

What is the difference between stuck-at ATPG and transition fault ATPG?

Stuck-at ATPG targets SA0 and SA1 faults — it checks that every net can be driven to both 0 and 1, catching opens and shorts. Patterns can be applied at any (slow) frequency; the test clock can be much slower than the operating frequency. Transition fault ATPG targets slow-to-rise (STR) and slow-to-fall (STF) faults — it verifies that transitions on each net complete within one clock period at the chip's actual operating frequency. This requires at-speed testing: two consecutive clock cycles (launch cycle + capture cycle) must run at full speed. Transition fault ATPG catches timing-related defects — resistive opens, weak driver cells, excessive RC delay — that pass stuck-at tests. Industry minimum targets: 99%+ stuck-at, 95%+ transition fault coverage.

← Day 3: Scan Insertion & EDT Day 5: Boundary Scan →

ATPG — Automatic Test Pattern GenerationD-Algorithm, PODEM, FAN & Fault Simulation