1. Common CDC Failure Modes
Failure Mode 1: Data Corruption (Glitches)
Symptom: Multi-bit data arrives corrupted (bits swapped, garbled).
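Root cause: A multi-bit binary value crosses domains without Gray coding. Several bits change at once, and the destination captures some of them before the transition and some after.
Example: The source steps from 0x7F to 0x80 (all eight bits change). The destination may capture any mix of old and new bits, such as 0x00 or 0xFF, neither of which was ever written.
Fix: Gray-code multi-bit values that change by one count at a time (counters, pointers) before synchronizing them.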
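To illustrate why Gray coding helps here, the following is a minimal Python sketch (not RTL) of the standard reflected Gray code. The function names are illustrative. It compares how many bits flip on the 0x7F to 0x80 step in plain binary and in Gray code.

```python
def bin_to_gray(n: int) -> int:
    """Convert a non-negative binary integer to reflected Gray code."""
    return n ^ (n >> 1)


def gray_to_bin(g: int) -> int:
    """Convert a reflected Gray code value back to binary."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def bits_changed(a: int, b: int) -> int:
    """Count the bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


# In binary, 0x7F -> 0x80 flips every one of the 8 bits.
binary_flips = bits_changed(0x7F, 0x80)

# The same step in Gray code flips a single bit, so a capture taken
# mid-transition can only yield the old value or the new value.
gray_flips = bits_changed(bin_to_gray(0x7F), bin_to_gray(0x80))
```

Because only one bit is in flight per increment, a mis-timed capture is off by at most one count. This holds only for values that step by one, which is why the technique suits counters and FIFO pointers rather than arbitrary data buses.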
Failure Mode 2: Intermittent Data Loss
Symptom: Occasionally, data written to FIFO doesn't appear on read side.
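Root cause: The multi-bit FIFO write pointer crosses into the read domain unsynchronized or in plain binary. The read side samples a stale or corrupted pointer value and misjudges the FIFO as empty, or skips entries.
Fix: Gray-encode the FIFO pointers before they cross, and verify that every pointer bit passes through dual-FF stages.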
Failure Mode 3: Deadlock (FIFO stuck)
Symptom: FIFO full flag asserted indefinitely, data can't flow.
Root cause: The read pointer synchronized into the write domain is wrong (the synchronizer failed or is stuck), so the full flag is computed from a bad value.
Investigation: Check that the synchronized read pointer in the write domain matches the actual read pointer once the 2-cycle synchronizer latency is accounted for.
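Failure Modes 2 and 3 both depend on how the flags are derived from a pointer that has crossed domains. The sketch below is a small Python model, assuming the widely used scheme in which each Gray-coded pointer carries one extra wrap bit. The depth (ADDR_BITS) and the function names are hypothetical and not taken from any particular design.

```python
ADDR_BITS = 3                    # hypothetical FIFO depth of 8 entries
PTR_BITS = ADDR_BITS + 1         # extra wrap bit separates full from empty
PTR_MASK = (1 << PTR_BITS) - 1


def bin_to_gray(n: int) -> int:
    """Convert a binary pointer value to reflected Gray code."""
    return n ^ (n >> 1)


def is_empty(rptr_bin: int, wptr_gray_synced: int) -> bool:
    """Read-domain empty flag: both Gray pointers are identical."""
    return bin_to_gray(rptr_bin & PTR_MASK) == wptr_gray_synced


def is_full(wptr_bin: int, rptr_gray_synced: int) -> bool:
    """Write-domain full flag: top two Gray bits inverted, rest equal."""
    top_two = 0b11 << (PTR_BITS - 2)
    return bin_to_gray(wptr_bin & PTR_MASK) == (rptr_gray_synced ^ top_two)


# The writer has pushed 8 entries and the reader has already popped 2,
# but the synchronized read pointer in the write domain still shows 0.
wptr = 8
stale_full = is_full(wptr, bin_to_gray(0))   # lagging pointer: full asserted
fresh_full = is_full(wptr, bin_to_gray(2))   # pointer caught up: full released
```

A lagging pointer only makes the flags pessimistic, which is safe. If the synchronized pointer never updates, the same comparison keeps the full flag asserted indefinitely, which matches the deadlock symptom.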
Failure Mode 4: Random Failures (Metastability)
Symptom: System works 99% of the time, but occasionally fails in specific conditions (specific frequency, temperature, voltage).
Root cause: The CDC synchronizer is insufficient: a single FF, or a dual-FF chain with too little resolution time, so metastability does not fully resolve before the value is used.
Fix: Add FF stages, verify timing closure, and run formal CDC verification.
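The effect of adding FF stages can be estimated with the commonly used synchronizer MTBF model, MTBF = exp(t_r / tau) / (T_w * f_clk * f_data). The Python sketch below applies it. Every numeric value is a hypothetical placeholder; real tau and T_w figures come from the flip-flop characterization for the target process. Treating each extra stage as one additional clock period of resolution time is an approximation that ignores clock-to-Q and setup overhead.

```python
import math


def synchronizer_mtbf(resolve_time_s, tau_s, window_s, f_clk_hz, f_data_hz):
    """Estimate MTBF in seconds: exp(t_r / tau) / (T_w * f_clk * f_data)."""
    return math.exp(resolve_time_s / tau_s) / (window_s * f_clk_hz * f_data_hz)


# Hypothetical example values, for illustration only.
TAU = 20e-12        # metastability resolution time constant (s)
WINDOW = 20e-12     # metastability capture window T_w (s)
F_CLK = 500e6       # destination clock frequency (Hz)
F_DATA = 50e6       # rate of asynchronous input transitions (Hz)
PERIOD = 1.0 / F_CLK

# Assumption: a 2-FF chain leaves about 1.8 ns to resolve, and a third
# FF adds roughly one more full clock period.
mtbf_2ff = synchronizer_mtbf(1.8e-9, TAU, WINDOW, F_CLK, F_DATA)
mtbf_3ff = synchronizer_mtbf(1.8e-9 + PERIOD, TAU, WINDOW, F_CLK, F_DATA)

# Under this model, one extra stage multiplies MTBF by exp(PERIOD / TAU).
improvement = mtbf_3ff / mtbf_2ff
```

The exponential term explains why small changes in resolution time, from either an extra stage or a slower destination clock, move MTBF by many orders of magnitude.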
2. Debugging Flow (Simulation)
Step 1: Identify failing behavior
- Does failure reproduce in simulation or only silicon?
- What conditions trigger it (frequency, clock phase, data pattern)?
- Is it deterministic (always at same point) or random?
Step 2: Isolate the CDC crossing
- Which clock domains are involved?
- Which specific signal crosses?
- What synchronizer is used?
Step 3: Check synchronizer quality
- Is it dual-FF or single-FF? (Rule: always dual-FF minimum)
- Are FF stages instantiated correctly?
- Is data stable 2-3 cycles after crossing?
Step 4: Test under stress
- Inject metastability (random clock skew, setup/hold violations)
- Extreme frequency ratios (100:1, 1:100)
- All PVT corners (especially slow-slow)
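The metastability injection in Step 4 can be prototyped with a simple behavioral model before it is built into the testbench. The Python sketch below covers only the logical outcome, not the analog behavior: when the input changes inside the setup/hold window, the first FF resolves randomly to the old or the new value. The class and function names are hypothetical.

```python
import random


class TwoFlopSync:
    """Behavioral dual-FF synchronizer with optional metastability injection."""

    def __init__(self, init=0):
        self.ff1 = init
        self.ff2 = init
        self._prev_d = init   # input value seen at the previous clock edge

    def clock(self, d, rng, in_window=False):
        """Apply one destination clock edge and return the FF2 output."""
        self.ff2 = self.ff1
        if in_window and d != self._prev_d:
            # Injected violation: FF1 settles to old or new value at random.
            self.ff1 = rng.choice((self._prev_d, d))
        else:
            self.ff1 = d
        self._prev_d = d
        return self.ff2


def edges_until_seen(rng, in_window=True, max_edges=5):
    """Count destination edges until a 0 -> 1 change appears at FF2."""
    sync = TwoFlopSync(init=0)
    for edge in range(1, max_edges + 1):
        out = sync.clock(1, rng, in_window=(in_window and edge == 1))
        if out == 1:
            return edge
    return None


# Sweep seeds so that both resolution outcomes are exercised.
latencies = {edges_until_seen(random.Random(seed)) for seed in range(200)}
clean_latency = edges_until_seen(random.Random(0), in_window=False)
```

In this model a single-bit crossing always settles within two or three destination edges, which is the window that Step 3 asks about. Downstream logic that assumes a fixed latency will fail under this kind of injection, and that is the purpose of the stress test.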
3. Debugging Flow (Post-Silicon)
Step 1: Reproduce the failure
- Identify conditions that trigger bug (freq, temp, voltage)
- Can we trigger it repeatedly or is it random?
Step 2: Add instrumentation
- Debug signals: route CDC intermediate signals to debug port
- On-chip logic analyzer: capture clock domain crossings in real-time
- Check for metastable intermediate values: with analog probing, look for levels near mid-rail (for example 0.4-0.6V on a supply of about 1V) instead of a clean 0/1
Step 3: Compare with simulation
- Reproduce same failure conditions in simulation
- Does simulation match silicon behavior?
- If not, the simulation is probably not modeling metastability (standard RTL simulation does not unless it is explicitly injected)
Step 4: Root cause determination
- Does the synchronizer have too little resolution time (MTBF too low)?
- Did design change during synthesis (compare netlist vs. RTL)?
- Is timing constraint too loose (setup/hold violated in non-metastable regime)?
4. Post-Silicon Metastability Indicators
Signs of metastability issues in silicon:
- Temperature dependent: Fails more at high temp (slow transistors)
- Voltage dependent: Fails at low voltage (weak signal)
- Frequency dependent: Fails at specific clock ratio or phase offset
- Intermittent: Random failures, hard to reproduce
- Correlates with FF depth: Adding more FF stages fixes it → MTBF issue
5. Forensic Analysis: Trace Examination
If you have logic analyzer traces (simulation or real silicon):
- Identify CDC crossing point: Where does source signal change?
- Check sync stages: Does FF1 output transition correctly after input change?
- Check FF2 stability: Is FF2 output stable before downstream logic uses it?
- Look for glitches: Brief spikes (< 1ns) in signals can indicate a metastable state
- Check timing: Between the signal change and the FF capture edge, is setup time violated?
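The last two trace checks are easy to automate once the trace is exported as a list of edges. The Python sketch below assumes a hypothetical export format of time-sorted (time_ns, value) tuples. The sample data and the setup/hold numbers are invented for illustration.

```python
def find_glitches(transitions, min_width_ns=1.0):
    """Return (start, end) for every pulse narrower than min_width_ns.

    transitions: time-sorted list of (time_ns, value), one per value change.
    """
    glitches = []
    for (t0, _), (t1, _) in zip(transitions, transitions[1:]):
        if t1 - t0 < min_width_ns:
            glitches.append((t0, t1))
    return glitches


def find_window_violations(data_edges_ns, clock_edges_ns, setup_ns, hold_ns):
    """Return (data_edge, clock_edge) pairs where data changed inside the
    open interval (clock - setup, clock + hold)."""
    violations = []
    for clk in clock_edges_ns:
        for d in data_edges_ns:
            if clk - setup_ns < d < clk + hold_ns:
                violations.append((d, clk))
    return violations


# Hypothetical trace: a 0.4 ns spike at 10 ns, then a real edge at 25 ns.
trace = [(0.0, 0), (10.0, 1), (10.4, 0), (25.0, 1)]
capture_edges = [5.0, 15.0, 25.1]

glitches = find_glitches(trace, min_width_ns=1.0)
violations = find_window_violations(
    [t for t, _ in trace[1:]], capture_edges, setup_ns=0.2, hold_ns=0.1
)
```

Here the spike at 10.0 ns is flagged as a glitch, and the edge at 25.0 ns is flagged because it lands 0.1 ns before the 25.1 ns capture edge. A flagged pair is a candidate for metastability rather than proof, so cross-check it against the FF1 and FF2 outputs on the following edges.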
6. Common Debugging Mistakes
- ❌ Mistake: Assuming simulation proves metastability safety (it doesn't, unless explicitly injected)
- ✓ Fix: Inject metastability in simulation, use formal verification
- ❌ Mistake: Debugging only at nominal conditions (misses corner cases)
- ✓ Fix: Test at worst-case PVT first (slow-slow), then other corners
- ❌ Mistake: Assuming all CDC violations show up immediately in simulation
- ✓ Fix: Some CDC bugs are probabilistic (MTBF hours), need long sim runs or formal proof
- ❌ Mistake: Not checking synthesis netlist against RTL
- ✓ Fix: Formal equivalence check (Conformal) to ensure synthesis didn't break CDC
7. Emergency Fixes (Band-Aids)
If you're post-silicon and discover a CDC bug:
- Reduce frequency: Slower clocks reduce metastability risk (MTBF improves with time budget)
- Add delay: Insert pipeline stages (buffer time before downstream logic uses data)
- Temperature management: Run at lower temp if possible (reduces metastability)
- Add guard band: Operate at a lower frequency than spec allows and avoid the low end of the voltage range (larger margin)
These are temporary. Real fix requires redesign and respin.
8. Prevention: Design Review Checklist
- ✅ Every async signal synchronized?
- ✅ Dual-FF or better minimum?
- ✅ Gray code on multi-bit monotonic?
- ✅ FIFO pointers synchronized?
- ✅ Reset synchronized?
- ✅ No combinational logic from async input?
- ✅ Timing constraints set (false paths marked)?
- ✅ CDC lint passed?
- ✅ Formal verification on critical paths?
- ✅ Testbench includes metastability injection?
9. Incident Reporting Template
If CDC bug occurs, document it for future reference:
- Title: Brief description
- Symptom: What went wrong (data corruption, deadlock, intermittent)
- Reproduction conditions: Frequency, temperature, voltage, data patterns
- Root cause: Which CDC crossing failed and why
- Fix: RTL change + new constraints + verification plan
- Prevention: What lint/formal checks would have caught this
10. Checklist: CDC Debugging
- ✅ Simulation first: Reproduce in sim before post-silicon
- ✅ Inject metastability: Random clock skew, setup/hold violations
- ✅ Test all PVT corners: Especially slow-slow
- ✅ Check synchronizer stages: Count FFs, verify dual minimum
- ✅ Examine traces: Look for glitches, wrong timing
- ✅ Formal verification: Prove CDC protocol properties; calculate MTBF bounds separately
- ✅ Compare netlist: Ensure synthesis didn't break CDC
- ✅ Document findings: Root cause, fix, prevention
Next (Day 15): Production verification and final checklist.