Why Do Chips Need Testing?
A chip that is correctly designed can still be incorrectly manufactured. Semiconductor fabrication involves hundreds of process steps — photolithography, deposition, etching, chemical-mechanical polishing, implantation — each with its own variability. A particle of contamination landing on a wafer, a lithography focus error, a metal via that didn't fill completely — any of these can create a defect that turns a functional design into a broken chip.
The numbers are sobering. Even at mature process nodes with well-controlled fabs, typical die yields are 60–90%. At leading-edge nodes (3nm, 2nm), yields for large dies start significantly lower and improve over time. Every chip that ships with a manufacturing defect is a potential field return, a safety issue, or a recall.
Testing is the last gate before a chip reaches the customer. The Automatic Test Equipment (ATE), a machine costing $1–5 million, applies thousands of test vectors to every chip, measures the outputs, and decides: pass or fail. The set of test vectors is generated by ATPG (Automatic Test Pattern Generation) tools, and the probability that a bad chip slips through depends largely on the fault coverage of those vectors.
Modern chips have billions of transistors. You cannot apply enough random test vectors at the pins to thoroughly exercise all of them in a reasonable time. Design for Testability (DFT) solves this by adding special test structures (scan chains, BIST, JTAG) that make the chip's internal state directly accessible, converting an intractable test problem into a manageable one.
Defect vs Fault vs Error vs Failure
DFT engineers use these four terms precisely. Confusing them is a common interview mistake.
| Term | Definition | Layer | Example |
|---|---|---|---|
| Defect | Physical imperfection from manufacturing | Physical | Metal particle bridging two wires; unfilled via |
| Fault | Logical abstraction of a defect's circuit effect | Logical | Net stuck-at-1 due to short to VDD |
| Error | Incorrect logic value produced at a node | Logical | Output of gate = 1 when it should be 0 |
| Failure | Observable incorrect system behaviour | System | Chip produces wrong output; system crashes |
Within the fault model, a defect is represented as a fault. A fault causes an error only when it is activated, meaning the faulty net is driven to the opposite of its stuck value. An error causes a failure only when it propagates to an observable output. A defect can therefore be present without causing a field failure, for example if it sits on a rarely used path or if the error is masked by other logic. Coverage targets stop short of 100% for a related reason: some faults are structurally impossible to activate or observe (untestable faults), for example in redundant logic.
Fault Models
Because directly modeling billions of physical defect sites is intractable, DFT uses fault models: simplified logical abstractions that correlate well with real manufacturing defects. Different fault models target different classes of physical defects.
Controllability and Observability
These are the two fundamental DFT concepts. Together they determine whether a fault can be tested.
Controllability
Controllability is the ability to set a specific logic value on an internal net by applying signals at the chip's primary inputs (or scan inputs). A net buried deep inside sequential logic has low controllability — you may need to clock through many flip-flops to reach a specific state at that net.
To activate a stuck-at-0 (SA0) fault on net N, you must be able to drive N to logic 1, so that the fault causes an error: N reads 0 when it should be 1. If no input sequence can set N = 1, the fault is untestable.
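As a rough illustration of low controllability, consider a hypothetical 16-bit counter whose terminal-count net `tc` goes high only when every counter bit is 1. The module and signal names are assumptions made for this sketch and do not come from any real design.

```verilog
// Hypothetical example: a net that is hard to control without scan.
module slow_to_control (
    input  clk,
    input  rst,   // synchronous reset, active high
    output tc     // terminal count: 1 only when count == 16'hFFFF
);
    reg [15:0] count;

    always @(posedge clk)
        if (rst) count <= 16'd0;
        else     count <= count + 16'd1;

    // To test SA0 on tc, the tester must first make tc = 1.
    // From reset that takes 65,535 functional clocks, because the
    // only way to change count is to increment it.
    assign tc = &count;
endmodule
```

With the counter bits on a scan chain, the all-ones state could instead be shifted in with 16 shift clocks, which is the kind of saving that motivates scan insertion.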
Observability
Observability is the ability to propagate the value on an internal net to a primary output where it can be measured by the tester. Deep internal logic has low observability — an error must propagate through many gate stages to reach an output.
Scan chains solve both problems simultaneously. In scan mode, the flip-flops in the design are stitched together into one or more shift registers. To set a flip-flop's value, shift the value in directly (100% controllability). To read a flip-flop's value, shift it out to the scan output (100% observability). This is why scan insertion is the most fundamental DFT technique: it turns every scanned register into a directly accessible test point.
Fault Coverage Metrics
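Two ratios commonly reported by ATPG tools can be written as follows. The symbols are introduced here for illustration only, and exact definitions (for example, how possibly-detected faults are credited) vary between tools.

```latex
\[
\text{Fault coverage} = \frac{N_{\text{detected}}}{N_{\text{total}}},
\qquad
\text{Test coverage} = \frac{N_{\text{detected}}}{N_{\text{total}} - N_{\text{untestable}}}
\]
```

As a worked example with assumed numbers: if a design has 1,000,000 modeled faults, of which 985,000 are detected and 10,000 are proven untestable, fault coverage is 98.5% and test coverage is 985,000 / 990,000 ≈ 99.49%.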
Industry Coverage Targets
| Fault Model | Typical Target | Why This Level | Tool |
|---|---|---|---|
| Stuck-at (SA) | > 99.0% | Maps directly to open/short defects. 99% correlates to ~100–300 DPPM at typical defect densities. | Tessent, TetraMAX |
| Transition (TF) | > 95.0% | At-speed; harder to achieve due to launch constraints. 95% is typical, 98%+ for high-reliability. | Tessent LOC/LOS mode |
| Path Delay (PDF) | Top 1,000–10,000 paths | The number of paths grows exponentially with circuit size, so only the critical timing paths identified by STA are targeted. | Tessent (path delay fault type) |
| Bridging | Best effort, layout-aware | Requires layout extraction to identify adjacent nets. Optional but valuable. | Calibre + ATPG |
| MBIST (memory) | 100% (memory cells) | Memory cells are regular → full coverage is achievable with March algorithms. | Tessent MBIST |
DFT Concepts in Verilog
Before scan is inserted, this is a simple combinational block. The problem: if there is an SA1 fault on net and_out, you can detect it only if you can observe and_out at the output, which requires sel to be set so that the mux passes and_out through. Deep inside a large design, this observability chain may span dozens of logic levels.
    // This module has a deep internal node 'and_out'
    // Detecting a SA1 on and_out requires:
    //   1. Controllability: set a=1, b=0 (or a=0, b=1) to activate fault
    //   2. Observability: route the error through mux → y to see it at output
    module deep_logic (
        input  a, b, c, d, sel,
        output y
    );
        wire and_out;  // internal node, low observability
        wire or_out;

        assign and_out = a & b;  // SA1 fault here: and_out stuck at 1
        assign or_out  = c | d;

        // To observe and_out, sel must = 0 AND
        // and_out must differ from the correct value
        assign y = sel ? or_out : and_out;
    endmodule

    // Test vector to ACTIVATE SA1 on and_out:
    //   Set a=0, b=X (or a=X, b=0) → correct and_out = 0
    //   Fault makes and_out = 1 → ERROR
    // Test vector to PROPAGATE error to output y:
    //   Set sel=0 → y = and_out = 1 (faulty) vs 0 (good) → DETECTED
    //   If sel=1 → y = or_out → fault is MASKED (not observable this cycle)
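The activation and propagation conditions can be checked exhaustively with a small simulation-only testbench. This is a minimal sketch of serial fault simulation: it models a good copy and a faulty copy of the same logic inline, with the faulty copy's AND output tied to 1, and counts the input vectors on which the two outputs differ. The testbench name and the `_good` / `_bad` signal names are hypothetical.

```verilog
// Hypothetical testbench: compare good and faulty copies of the logic
// for every input combination and count the detecting vectors.
module tb_sa1_and_out;
    reg a, b, c, d, sel;

    // Good machine
    wire and_good = a & b;
    wire y_good   = sel ? (c | d) : and_good;

    // Faulty machine: AND output stuck at 1
    wire and_bad  = 1'b1;
    wire y_bad    = sel ? (c | d) : and_bad;

    integer i, detected;

    initial begin
        detected = 0;
        for (i = 0; i < 32; i = i + 1) begin
            {a, b, c, d, sel} = i[4:0];  // a is the MSB, sel is the LSB
            #1;                          // let the continuous assigns settle
            if (y_good !== y_bad) begin
                detected = detected + 1;
                $display("detect: a=%b b=%b c=%b d=%b sel=%b good=%b faulty=%b",
                         a, b, c, d, sel, y_good, y_bad);
            end
        end
        $display("%0d of 32 vectors detect SA1 on and_out", detected);
        $finish;
    end
endmodule
```

Only vectors with sel=0 and a&b=0 expose the fault, so the run reports 12 of 32 detecting vectors. An ATPG tool needs just one of them in the pattern set to mark the fault as detected.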
After scan insertion, each flip-flop gains a multiplexed scan input:

    // ATPG views the circuit with scan FFs replacing regular FFs
    // Any flip-flop output can be:
    //   - Controlled: shift a known value into the FF via scan chain
    //   - Observed: capture logic value into FF, then shift out
    module scan_ff (
        input      clk, scan_en,
        input      d,        // functional data
        input      scan_in,  // scan chain input
        output reg q,
        output     scan_out  // connects to next FF's scan_in
    );
        assign scan_out = q;

        always @(posedge clk)
            q <= scan_en ? scan_in  // SHIFT MODE: load from chain
                         : d;       // CAPTURE MODE: normal operation
    endmodule

    // SHIFT MODE (scan_en=1): SI → FF[0].q → FF[1].q → ... → SO
    // CAPTURE MODE (scan_en=0): apply 1 functional clock, capture logic values
    // Then SHIFT MODE again to read out captured values at SO
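To show how individual scan cells form a chain, the sketch below models an N-bit chain behaviorally in a single module rather than instantiating separate cells. The module name, the parameter `N`, and the port names are assumptions chosen for illustration, and the model assumes N is at least 2.

```verilog
// Hypothetical N-bit scan chain, modeled behaviorally (assumes N >= 2).
module scan_chain #(
    parameter N = 4
) (
    input              clk,
    input              scan_en,   // 1 = shift, 0 = capture
    input              scan_in,   // chain input (SI)
    input  [N-1:0]     d,         // functional data for each flop
    output reg [N-1:0] q,         // flop outputs; q[0] is first in the chain
    output             scan_out   // chain output (SO)
);
    assign scan_out = q[N-1];

    always @(posedge clk)
        if (scan_en)
            q <= {q[N-2:0], scan_in};  // shift: SI -> q[0] -> ... -> q[N-1]
        else
            q <= d;                    // capture functional values
endmodule
```

Loading a full pattern takes N shift clocks, followed by one capture clock. The captured response is then shifted out over the next N clocks, usually while the next pattern is shifted in, which is why test time grows with chain length.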
Common Manufacturing Defects & Their Fault Models
| Physical Defect | Fault Model | Process Node Risk | Detection Method |
|---|---|---|---|
| Metal open (broken wire) | SA0 (line floats low) or SA1 | All nodes; vias most vulnerable | Stuck-at ATPG |
| Via void / unfilled via | SA0 or Transition Fault (resistive) | Advanced nodes (7nm and below) | ATPG + at-speed TF |
| Metal-to-metal short | Bridging Fault | Increases at advanced nodes (tight pitch) | Bridging ATPG |
| Gate oxide defect | SA0 / SA1 (transistor always on/off) | All nodes | Stuck-at ATPG |
| Resistive contact | Transition Fault (slow rise/fall) | All nodes | At-speed TF ATPG |
| Particle contamination | Bridging or SA | All nodes; worse at smaller nodes | ATPG |
| CMP over-polishing | SA0 (metal thinned → open) | 7nm and below | Stuck-at ATPG |
From Fault Coverage to DPPM
DPPM (Defective Parts Per Million) is the number of defective chips, per million shipped, that are expected to pass the test and reach customers. It's directly tied to fault coverage. A simple model:
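One commonly cited approximation is the Williams-Brown model, which relates the shipped defect level DL to process yield Y and fault coverage FC, both expressed as fractions. Whether this is the exact model intended here is an assumption, so treat it as an illustrative sketch.

```latex
\[
\mathrm{DL} = 1 - Y^{\,(1 - FC)},
\qquad
\mathrm{DPPM} = \mathrm{DL} \times 10^{6}
\]
```

With an assumed yield of Y = 0.98 and fault coverage FC = 0.99, this gives DL = 1 − 0.98^0.01 ≈ 0.000202, or roughly 200 DPPM. Under the same assumed yield, reaching about 10 DPPM requires fault coverage of roughly 99.95%.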
For automotive and safety-critical applications (ISO 26262), targets are often below 10 DPPM, which requires fault coverage above 99.9% along with additional diagnostic coverage requirements.