Interview Prep

VLSI Engineer Interview Questions

Real questions asked at top semiconductor and tech companies — with detailed, interview-ready answers covering RTL, STA, CDC, low power, DFT, and protocols.

54 Questions · 9 Topics · 3 Difficulty Levels · 3 Companies
01
Easy GPU Architecture
What is SIMD execution in a GPU? What is a warp and what happens during warp divergence?

SIMD (Single Instruction, Multiple Data) means one instruction operates on many data elements in parallel. In a GPU's Streaming Multiprocessor (SM), this is implemented as SIMT — Single Instruction, Multiple Threads: 32 threads are grouped into a warp, and all 32 threads execute the same instruction simultaneously, each on its own private data.

The warp is the fundamental scheduling unit. A single SM can hold many warps in flight (e.g., 64 warps × 32 threads = 2048 threads per SM on H100). When one warp stalls on a memory access, the SM's warp scheduler instantly switches to another ready warp — this is how GPUs hide memory latency through latency hiding rather than large caches.

Warp divergence occurs when threads within a warp take different execution paths — for example, in if (threadIdx.x % 2 == 0), even-numbered threads go one way, odd threads another. The GPU must serialize execution:

  • First execute all threads that took the "true" branch (odd threads masked off)
  • Then execute all threads that took the "false" branch (even threads masked off)

A 50/50 divergent branch halves effective throughput. Nested divergence multiplies the penalty. Minimizing divergence is one of the most important GPU kernel optimization principles.
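
A quick way to see the cost: under SIMT serialization, a warp pays one pass per distinct path its threads take. The sketch below is a toy cost model in Python (an illustration, not a hardware simulator):

```python
WARP_SIZE = 32

def divergent_issue_slots(thread_paths):
    """thread_paths: list of 32 path labels, one per thread in the warp.
    Returns the number of serialized passes the warp needs."""
    assert len(thread_paths) == WARP_SIZE
    return len(set(thread_paths))

# if (threadIdx.x % 2 == 0): even threads take path "T", odd threads path "F"
paths = ["T" if t % 2 == 0 else "F" for t in range(WARP_SIZE)]
passes = divergent_issue_slots(paths)
print(passes)        # 2 passes -> the branch serializes into two groups
print(1 / passes)    # effective utilization: 0.5
```

A uniform branch (all 32 threads take the same path) costs a single pass, which is why restructuring conditionals to align with warp boundaries recovers full throughput.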

NVIDIA's term: NVIDIA calls their SIMT execution model "CUDA cores" at the thread level. Each SM has 128 CUDA cores (H100), meaning 128 FP32 operations per clock cycle — 4 warps executing simultaneously.
02
Medium GPU Architecture
Describe the GPU memory hierarchy. What are the latency and bandwidth characteristics at each level?

From fastest/smallest to slowest/largest (H100 as reference):

  • Registers: Per-thread, 65,536 × 32-bit registers per SM. Latency: 0 cycles (operand bypass). Total bandwidth: ~20 TB/s across all SMs. Zero-latency access when available; register pressure causes spill to local memory.
  • Shared Memory / L1 cache: Per-SM, programmer-managed scratchpad, 228 KB per SM (H100), configurable split with L1. Latency: ~4–5 cycles. Bandwidth: ~19 TB/s. Used for inter-thread communication within a thread block.
  • L2 cache: Chip-wide, 50 MB in H100, shared by all 132 SMs. Latency: ~100–200 cycles. Bandwidth: ~12 TB/s. Caches both data and instructions.
  • HBM (device memory): Off-chip stacked DRAM, 80 GB in H100 SXM5. Latency: ~400–800 cycles. Bandwidth: 3.35 TB/s. The primary memory bottleneck for memory-bound kernels.
  • Peer GPU memory (NVLink): Another GPU's HBM accessed via NVLink 4.0. Bandwidth: 900 GB/s bidirectional, cache-coherent. Latency: ~1–2 µs.
  • Host CPU memory: Via PCIe 5.0 x16. Bandwidth: 64 GB/s, Latency: ~5–10 µs. The slowest tier — minimize host↔device transfers.
Key GPU optimization principle: Most AI/HPC workloads are memory-bandwidth-bound at the HBM level. The goal is to maximize arithmetic intensity (FLOPs per byte of HBM access) — this is captured by the Roofline model, which NVIDIA engineers use to analyze and optimize kernel performance.
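
The Roofline principle above reduces to two lines of arithmetic. The HBM bandwidth is taken from the hierarchy table; the FP32 peak below is an assumed illustrative figure, not an official spec:

```python
# Roofline sketch: attainable throughput = min(peak, AI * memory_bandwidth)
PEAK_FP32 = 67e12        # FLOP/s -- assumed illustrative peak
HBM_BW    = 3.35e12      # bytes/s (3.35 TB/s HBM3, from the text)

def roofline(ai_flops_per_byte):
    """Attainable FLOP/s for a kernel with the given arithmetic intensity."""
    return min(PEAK_FP32, ai_flops_per_byte * HBM_BW)

# A memory-bound kernel at 1 FLOP/byte is pinned to HBM bandwidth:
print(roofline(1.0))                    # 3.35e12 FLOP/s
# Ridge point: the AI at which the kernel becomes compute-bound
print(round(PEAK_FP32 / HBM_BW, 1))     # 20.0 FLOPs per byte
```

Kernels below the ridge point gain from reducing HBM traffic (tiling into shared memory, fusion); kernels above it gain only from more math throughput.
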
03
Medium GPU Architecture
What is a shared memory bank conflict in a GPU? How do you detect and avoid it?

GPU shared memory is physically organized into 32 banks, matching the warp width. Each bank can serve exactly one 32-bit access per clock cycle. If multiple threads in the same warp access different addresses within the same bank simultaneously, those accesses are serialized — reducing effective bandwidth proportionally.

Bank mapping rule: For 4-byte words, byte address A maps to bank (A / 4) mod 32. So byte addresses 0, 128, 256… all map to bank 0; addresses 4, 132, 260… all map to bank 1.

Classic bank conflict example: A 32×32 matrix stored in shared memory, accessed column-first by a warp. Thread 0 reads M[0][0] (bank 0), Thread 1 reads M[1][0] (bank 32 % 32 = bank 0 also) → 32-way bank conflict → 32× slowdown.
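
A small Python model of the bank-mapping rule confirms the 32-way conflict and shows how padding each row to 33 elements removes it:

```python
from collections import Counter

BANKS, WORD = 32, 4  # 32 banks, 4-byte words

def max_conflict(addrs):
    """Worst-case serialization factor for one warp's byte addresses."""
    banks = Counter((a // WORD) % BANKS for a in addrs)
    return max(banks.values())

ROW = 32  # floats per row, unpadded: float M[32][32]
col0_unpadded = [(t * ROW + 0) * WORD for t in range(32)]  # thread t reads M[t][0]
print(max_conflict(col0_unpadded))  # 32 -> 32-way conflict

ROWP = 33  # padded row: float M[32][33]
col0_padded = [(t * ROWP + 0) * WORD for t in range(32)]
print(max_conflict(col0_padded))    # 1 -> conflict-free
```

With the pad, row r starts at word offset 33r, so consecutive rows land in consecutive banks (33 mod 32 = 1) and a column read touches all 32 banks once each.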

Solutions:

  • Padding: Declare the array as float M[32][33] instead of [32][32]. The extra column shifts each row's bank alignment, breaking the conflict pattern. One of the most common GPU optimization tricks.
  • Access reordering: Restructure the algorithm so consecutive threads access consecutive 4-byte addresses (consecutive banks).
  • Broadcast: If all threads access the same address in one bank, the memory system broadcasts it — no conflict. This is free.
Detection: NVIDIA Nsight Compute profiler reports "shared memory bank conflicts" per kernel. A non-zero count directly indicates throughput loss that can be fixed at the algorithm level.
04
Hard GPU Architecture
What is HBM (High Bandwidth Memory)? Why does NVIDIA use it in data center GPUs instead of GDDR?

HBM (High Bandwidth Memory) is a stacked DRAM architecture. Multiple DRAM dies are stacked vertically using Through-Silicon Vias (TSVs) and connected to the GPU die via a silicon interposer — a passive silicon layer with very fine-pitch connections. The stack sits physically adjacent to the GPU die on the interposer, connected by thousands of short, dense wires rather than long PCB traces.

Why this matters:

  • Massive bus width: HBM3 provides a 1024-bit-wide bus per stack. An H100 with 5 stacks has a 5120-bit total memory bus. Compare to GDDR6X: 16-bit per chip × 24 chips on a high-end gaming card = 384-bit total. HBM is 13× wider.
  • Bandwidth: H100 achieves 3.35 TB/s of HBM3 bandwidth. An RTX 4090 with GDDR6X achieves ~1 TB/s — roughly 3.3× less, even though the HBM stacks occupy far less package area than a ring of GDDR chips.
  • Power efficiency: Short interconnects (millimeters vs centimeters on PCB) mean much lower capacitance per bit — lower switching energy. HBM typically consumes 50% less power per GB/s than GDDR.
  • Package area: No GDDR chips around the periphery of a large PCB. HBM stacks sit compactly next to the die on the interposer.

Why GDDR wins for gaming GPUs: HBM requires a silicon interposer, which is expensive (2–3× PCB packaging cost). For gaming budgets, GDDR6X offers enough bandwidth at lower cost. HBM's cost is justified only where bandwidth-per-dollar is critical — AI training, HPC, data center.

05
Medium Protocols
What are the key specs of PCIe Gen5? What signal integrity challenges arise at 32 GT/s?

PCIe Gen5 (PCIe 5.0) doubles the per-lane data rate from 16 GT/s (Gen4) to 32 GT/s. Using 128b/130b encoding (2-bit overhead vs 8b/10b's 20% overhead), the effective throughput per lane is ~31 Gbps. A ×16 link delivers 64 GB/s per direction (128 GB/s total bidirectional).
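
The per-lane and per-link figures follow directly from the encoding overhead; a quick check in Python:

```python
GT_S  = 32e9       # raw transfers/s per lane (Gen5)
ENC   = 128 / 130  # 128b/130b encoding efficiency
LANES = 16

lane_bps = GT_S * ENC                    # payload bits/s per lane
print(round(lane_bps / 1e9, 1))          # ~31.5 Gb/s per lane
link_GBps = lane_bps * LANES / 8 / 1e9
print(round(link_GBps))                  # ~63 GB/s per direction (spec sheets round to 64)
```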

Signal integrity challenges at 32 GT/s:

  • Insertion loss: FR4 PCB material absorbs high-frequency signal energy. At 16 GHz Nyquist (32 GT/s), losses across even short traces become severe. Requires low-loss dielectric materials (Megtron 6, Rogers) or very short trace lengths.
  • Crosstalk: Adjacent differential pairs couple more aggressively at higher frequencies. Tighter guard spacing and reference planes are needed.
  • Equalization demands: Both transmitter and receiver require aggressive equalization — CTLE (Continuous Time Linear Equalization) and DFE (Decision Feedback Equalization) at the receiver, and FIR (finite impulse response) transmitter pre-emphasis. Gen5 standardizes more complex equalization than Gen4.
  • Connector and via design: Even PCB connectors and vias introduce resonant stubs at Gen5 frequencies. Via back-drilling (removing unused via stubs) becomes mandatory.
NVIDIA's H100 SXM5 uses PCIe Gen5 for host connectivity but relies on NVLink 4.0 for GPU-to-GPU communication within a node — because even 64 GB/s per direction is insufficient for the collective operations needed in large AI model training.
06
Hard Protocols
What is NVLink? Why can't PCIe serve GPU-to-GPU communication at the scale of modern AI training?

NVLink is NVIDIA's proprietary high-speed, cache-coherent interconnect for direct GPU-to-GPU communication. NVLink 4.0 (H100) provides 900 GB/s of bidirectional bandwidth per GPU across 18 NVLink connections. Compare this to PCIe 5.0 x16 at 128 GB/s bidirectional (64 GB/s per direction) — NVLink offers roughly 7× the bandwidth.

Why PCIe fails for GPU-to-GPU at scale:

  • Topology: PCIe is a CPU-centric tree topology. GPU-to-GPU data must traverse CPU root complex, adding ~2–5 µs latency and halving effective bandwidth (each hop is bidirectional but the shared root complex is a bottleneck).
  • Bandwidth: A GPT-4 scale model training run performs AllReduce across 8+ GPUs every few hundred milliseconds. Each AllReduce requires each GPU to send and receive its full gradient tensor (~10s of GB). PCIe's 64 GB/s would be saturated; NVLink's 900 GB/s handles it comfortably.
  • No cache coherency in PCIe: PCIe lacks hardware cache coherency between GPUs. Memory copies must be explicit and managed by software/driver. NVLink supports hardware cache coherence — GPU A can read GPU B's memory with the same semantics as its own, dramatically simplifying programming models such as NVSHMEM.

NVSwitch: NVIDIA's NVSwitch chip (3.2 TB/s per switch in third-generation NVSwitch) connects 8 GPUs in an all-to-all topology inside a DGX H100 node. Every GPU can communicate with every other GPU at full NVLink bandwidth simultaneously — no head-of-line blocking.
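
Back-of-envelope transfer times show why the bandwidth gap matters. The 10 GB tensor size is a hypothetical example, and NVLink is taken at 450 GB/s per direction (half the 900 GB/s bidirectional figure):

```python
SIZE = 10e9          # bytes -- hypothetical gradient tensor
PCIE_BPS   = 64e9    # PCIe 5.0 x16, per direction (from the text)
NVLINK_BPS = 450e9   # NVLink 4.0, per direction (assumed: 900 GB/s / 2)

print(round(SIZE / PCIE_BPS * 1e3))    # ~156 ms per transfer over PCIe
print(round(SIZE / NVLINK_BPS * 1e3))  # ~22 ms over NVLink
```

Repeated every few hundred milliseconds per training step, the PCIe path would consume most of the step time in communication alone.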

07
Easy RTL Design
What are the three types of pipeline hazards? How is each resolved?

A hazard is a condition that prevents the next instruction from executing in the following clock cycle, threatening to give incorrect results.

1. Structural Hazard: Two instructions need the same hardware resource simultaneously (e.g., both need to write to the register file in the same cycle, but there is only one write port). Resolution: add a second hardware unit (more write ports, more functional units), or stall one instruction.

2. Data Hazard: An instruction depends on the result of a preceding instruction that hasn't yet written its output to the register file.

  • RAW (Read After Write): Most common. Instruction B reads a register before instruction A has written it. Resolution: forwarding/bypassing — route the ALU output directly back to the ALU input without waiting for register writeback. If forwarding can't bridge the gap (e.g., load-use hazard), insert a stall (pipeline bubble).
  • WAW (Write After Write): Two instructions write the same register — later one might arrive first in out-of-order execution. Resolved via register renaming.
  • WAR (Write After Read): Instruction writes a register before an earlier instruction reads it (only in out-of-order). Resolved via register renaming.

3. Control Hazard: A branch changes the program counter, but instructions after the branch have already been fetched. Resolution: branch prediction (speculate the branch outcome, flush on misprediction), delayed branching (execute one instruction after branch regardless — RISC classic), or speculative execution with rollback.
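
The load-use case is the one forwarding cannot fully hide. A toy dependency checker for a classic 5-stage pipeline (the instruction tuples are made-up illustrations):

```python
def load_use_stalls(program):
    """program: list of (op, dest, srcs) tuples in program order.
    Counts the one-cycle bubbles needed when an instruction consumes a
    loaded register immediately -- the data isn't ready until end of MEM,
    so even full forwarding leaves a one-cycle gap."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        op, dest, _ = prev
        if op == "load" and dest in curr[2]:
            stalls += 1
    return stalls

prog = [
    ("load", "r1", ["r2"]),        # lw  r1, 0(r2)
    ("add",  "r3", ["r1", "r4"]),  # uses r1 right away -> 1 bubble
    ("sub",  "r5", ["r6", "r7"]),  # independent -> no stall
]
print(load_use_stalls(prog))  # 1
```

Compilers exploit exactly this model when they hoist an independent instruction into the load's shadow to hide the bubble.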

GPU relevance: GPU warp schedulers avoid control hazards by switching warps on every stall rather than speculatively executing. Branch prediction is rarely used in GPU pipelines because the massive number of warps provides enough instruction-level parallelism without it.
08
Medium RTL Design
How do you efficiently implement a wide datapath (e.g., 512-bit ALU or adder) in RTL? What timing and synthesis concerns arise?

Slicing + generate: Break the 512-bit operation into N independent or semi-independent slices and instantiate them with a generate loop. For operations that are fully parallel (bitwise AND, OR, XOR), this gives linear throughput with the number of slices and synthesis has no trouble.

The carry problem for adders: A naive 512-bit binary adder using ripple carry has a critical path through all 512 stages — completely unacceptable. Use a hierarchical adder structure:

  • Carry Lookahead Adder (CLA): Compute generate (G=A&B) and propagate (P=A^B) for each bit, then compute carries in parallel using a tree. Reduces depth from O(N) to O(log N).
  • Carry-Select Adder: Pre-compute two copies of each upper slice (one assuming carry-in=0, one assuming carry-in=1), then mux between them when the actual carry arrives. Trades area for speed.
  • For synthesis: write the adder behaviorally (assign sum = a + b;) and let the synthesis tool select the adder topology from the technology library. Modern tools (DC, Genus) will choose appropriately based on timing constraints.
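
A behavioral model of the generate/propagate recurrence is easy to check against plain addition. Note that this evaluates the carry chain serially for clarity — real CLA hardware computes the same recurrence c[i+1] = g[i] | (p[i] & c[i]) as an O(log N) parallel prefix tree:

```python
def cla_add(a, b, n=16):
    """Functional model of a carry-lookahead adder for n-bit operands."""
    g = [(a >> i & 1) & (b >> i & 1) for i in range(n)]  # generate: G = A & B
    p = [(a >> i & 1) ^ (b >> i & 1) for i in range(n)]  # propagate: P = A ^ B
    c, carries = 0, []
    for i in range(n):                 # carries depend only on G/P and c_in
        carries.append(c)
        c = g[i] | (p[i] & c)
    s = 0
    for i in range(n):
        s |= (p[i] ^ carries[i]) << i  # sum bit: P ^ C
    return s | (c << n)                # append carry-out

print(cla_add(0xABCD, 0x1234) == 0xABCD + 0x1234)  # True
```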

Additional concerns:

  • Routing congestion: A 512-bit bus creates dense wiring in physical design. The datapath should be placed in a compact rectangular region with proper floorplanning.
  • Retiming: If the datapath is pipelined, enable retiming (set_optimize_registers true in DC) to let the tool rebalance registers across the datapath stages automatically.
  • Operand isolation: Wide datapaths toggle a lot of bits. Gate the inputs when the datapath result is not needed to reduce switching power.
09
Hard Timing / STA
What new timing challenges dominate at 4nm/5nm compared to older nodes? Why is timing closure harder?

At older nodes (28nm+), gate delay dominated wire delay. At 4nm/5nm, the relationship has reversed — wire RC delay now dominates on many paths, making interconnect the primary timing bottleneck.

Key new challenges:

  • Wire resistance explosion: As metal pitch shrinks, wire cross-section shrinks, resistance per unit length increases dramatically. A long metal wire at 5nm can have 5–10× higher resistance than the same wire at 28nm. RC delay scales as R×C — both R and C worsen at smaller nodes.
  • FinFET / GAAFET parasitic capacitance: Gate-to-drain capacitance (Miller capacitance) in FinFET and Gate-All-Around FET structures is proportionally larger, adding input/output loading far beyond what planar bulk CMOS exhibited.
  • Variation (OCV/POCV): Random dopant fluctuation and gate length variation are larger fractions of the total delay at small nodes. Statistical timing (POCV/LVF) is required rather than flat derating — adding complexity to STA flow.
  • Double/triple patterning constraints: Metal layers below M4 require multi-patterning. This forces additional spacing rules, reducing routing freedom and forcing the router to use longer detours → more wire → worse RC delay.
  • IR drop impact on timing: Higher current density at same power with smaller metal → worse static and dynamic IR drop. Cells in IR-drop hot spots run slower and create timing violations not visible in STA at nominal VDD.
  • Crosstalk aggressor coupling: Tighter metal pitch increases coupling capacitance between adjacent wires. A switching aggressor can add delay (or reduce delay) to a victim net, creating timing violations that only appear with specific data patterns.
10
Medium Physical Design
What is electromigration (EM) in VLSI? How does it constrain metal routing and what are the fixes?

Electromigration (EM) is the gradual displacement of metal atoms caused by momentum transfer from electron flow (high current density). At elevated current density, atoms migrate toward the cathode end, creating voids (open circuits) at the anode and hillocks (shorts) elsewhere — a wearout failure that grows over months or years in the field.

Why it matters for NVIDIA: High-TDP GPUs (300–700W) draw hundreds of amperes. The power distribution network carries enormous currents through metal layers. Clock nets and wide datapath buses also carry high current due to their switching activity.

The constraint: PDKs define maximum DC current density (J_DC, mA/µm²) and RMS current density (J_RMS for AC/switching) per metal layer and via. Routers check these limits during sign-off. Violations require fixes before tape-out.

Common EM weak points:

  • Vias: Single-cut vias have much higher current density than the wire they connect. Via-EM is the most common EM failure mechanism. Fix: use minimum 2 vias wherever current exceeds the single-via limit (the router enforces this via "via doubling" rules).
  • Clock buffers: Clock networks drive large loads at full VDD swing every cycle — highest RMS current in the design. Clock routing uses upper metal layers with wide wires by design.
  • Power grid stripes: Size power stripes based on average current drawn by each domain, with margin.

Fixes: Widen the wire, add parallel wires, add redundant vias, move to upper metal layers (lower resistance, higher J_max), spread high-current logic across more stripes.
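
EM sizing is simple arithmetic once the PDK limits are known. The J_max and per-via current below are made-up illustrative numbers, not values from any real PDK:

```python
import math

J_MAX = 1.0  # mA per um of wire width -- assumed per-layer DC limit

def min_width_um(i_ma):
    """Minimum wire width so current density stays under the DC limit."""
    return i_ma / J_MAX

def vias_needed(i_ma, i_per_via_ma=0.5):
    """Parallel via count so no single cut exceeds its current limit."""
    return math.ceil(i_ma / i_per_via_ma)

print(min_width_um(3.0))  # 3.0 um of wire width for a 3 mA net
print(vias_needed(3.0))   # 6 parallel vias
```

This is the same check the sign-off router performs net by net; the fix list above (widen, parallel wires, redundant vias, upper metals) all reduce the current per unit cross-section.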

11
Hard Physical Design
What are the challenges of delivering 400W to a GPU die? How is the power delivery network (PDN) designed?

A 400W GPU at 0.85V core voltage draws approximately 470 amperes. Delivering this cleanly — without voltage droops that cause functional failures or reliability damage — is a major system and chip design challenge.

Voltage droop: When the GPU transitions from light to heavy workload (e.g., a kernel launch), the current demand steps up in nanoseconds. The PDN inductance (L) resists this fast current change: V_droop = L × dI/dt. Just 10 pH of package loop inductance with a 100 A/ns current ramp creates 1 V of instantaneous droop — catastrophic for a 0.85 V rail.

PDN design hierarchy — capacitors at three timescales:

  • Die-level decaps (on-chip, ~1–10 ns): Inserted as standard cells in unused routing areas and power domain boundaries. Handle the fastest transients. Limited by available die area. Capacitance: ~10–100 nF total.
  • Package decaps (~10–100 ns): Capacitors embedded in the package substrate or placed as discrete SMDs on the package interposer. Handle intermediate transients. Capacitance: ~100 nF – 10 µF.
  • Board bulk caps (>100 ns): Large ceramic and electrolytic capacitors on the PCB, close to the VRM (Voltage Regulator Module). Handle slow load steps. Capacitance: 100 µF – 1 mF+.

Target PDN impedance: Design the PDN so its impedance Z(f) = V_droop_budget / I_max is flat across all relevant frequencies (DC to ~1 GHz). Resonant peaks in impedance cause amplified droop at those frequencies and must be damped.

Chip-level: Top metal layers (M8–M14 in advanced nodes) are dedicated entirely to power distribution — wide horizontal and vertical stripes forming a mesh. The mesh resistance across the die must be kept below ~0.1 mΩ to limit static IR drop to under 50 mV.
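
The droop and target-impedance arithmetic from this answer, checked numerically (the 5% droop budget is an assumed figure):

```python
# V = L * dI/dt with the numbers from the text
L_pkg = 10e-12         # 10 pH package loop inductance
dI_dt = 100e9          # 100 A/ns current ramp, in A/s
print(round(L_pkg * dI_dt, 3))   # 1.0 V instantaneous droop

# Target PDN impedance: Z = droop budget / max current
I_max  = 400 / 0.85              # ~470 A at 0.85 V
budget = 0.05 * 0.85             # assumed 5% droop budget = 42.5 mV
Z_target = budget / I_max
print(round(Z_target * 1e3, 2))  # ~0.09 mOhm -- consistent with the ~0.1 mOhm mesh target
```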

12
Medium CDC
A GPU design has thousands of flip-flops across multiple clock domains. How does CDC verification scale, and what role does formal verification play?

Simulation-based CDC verification does not scale for a GPU-class design. There are too many clock phase combinations, too many data patterns, and too many cycles to simulate to ever encounter all CDC metastability scenarios. Missing a single unsafe synchronizer can cause silent data corruption that only appears under specific workloads in the field.

Formal CDC verification is the industry-standard solution. Tools like Synopsys SpyGlass CDC, Mentor Questa CDC, or Cadence JasperGold CDC analyze the entire netlist structurally and mathematically:

  • Topology analysis: Identify every net that crosses a clock domain boundary (source FF in domain A, destination FF in domain B).
  • Synchronizer recognition: Detect whether each crossing has a valid synchronizer structure (2-FF chain, handshake, async FIFO pointers). Compliant structures are "promoted" to safe.
  • Multi-bit coherency: Flag any multi-bit bus where individual bits are synchronized independently (torn-word risk). Require Gray code, handshake, or FIFO.
  • Reconvergence analysis: Detect where two signals from the same CDC crossing reconverge into the same logic — one through a synchronizer, one not. This is the most dangerous CDC pattern and is hard to find manually.

RTL coding guidelines enforce synchronizer templates that formal tools can recognize. Deviations from approved templates are flagged automatically. Waivers document paths that are safe for non-RTL reasons (e.g., a path only crosses during reset when data is irrelevant).
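
The reason Gray-coded FIFO pointers survive a CDC crossing intact is that consecutive values differ in exactly one bit — a pointer sampled mid-transition is either the old or the new value, never a torn word. Easy to verify:

```python
def bin2gray(b):
    """Standard binary-to-Gray conversion: reflect each bit into the next."""
    return b ^ (b >> 1)

for i in range(255):
    diff = bin2gray(i) ^ bin2gray(i + 1)
    assert bin(diff).count("1") == 1   # exactly one bit flips per increment
print("ok")
```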

13
Medium Low Power
What is DVFS (Dynamic Voltage and Frequency Scaling)? How is it implemented in a GPU, and what are the hardware constraints?

DVFS dynamically adjusts both supply voltage (V_DD) and clock frequency (f) based on workload demand and thermal conditions. Since dynamic power scales as P = αCV²f, simultaneously reducing V and f provides cubic power reduction in the ideal case — halving both V and f reduces power by 8×.

The relationship: Maximum safe frequency is approximately proportional to (V_DD − V_th)/V_DD. Lower voltage → lower max frequency. The curve of achievable (V, f) operating points is characterized during silicon bring-up and stored as a VF table (V-F curve).

Hardware implementation in a GPU:

  • On-chip Performance Monitoring Unit (PMU): Monitors SM utilization, power draw, die temperature, and throttle signals every ~1 ms. Decides the target power state (P-state).
  • Multiple P-states: The GPU ships with a defined set of (V, f) operating points — e.g., Base Clock (guaranteed), Boost Clock (sustained under thermal headroom), Max Boost (burst, thermally limited). NVIDIA's GPU Boost algorithm dynamically selects among these.
  • Voltage regulator: An external PMIC or on-package VR changes V_DD on command. Voltage settling takes ~10–50 µs — during this window, frequency must stay within the safe range for the transitioning voltage.
  • PLL reprogramming: The on-chip PLL changes frequency by updating its divider ratios. Must happen after voltage is stable when scaling up, and before voltage drops when scaling down — violating this order can cause timing failures.
Thermal throttling: When die temperature approaches T_junction_max (~90°C for H100), the PMU reduces P-state to stay within thermal limits. This is the primary reason GPU TDP ratings exist — they define the sustainable power the cooling solution must handle to prevent throttling.
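
The cubic-scaling claim from P = αCV²f, checked numerically (the α, C, and f values are arbitrary placeholders):

```python
def p_dyn(alpha, C, V, f):
    """Dynamic switching power: P = alpha * C * V^2 * f."""
    return alpha * C * V * V * f

base = p_dyn(0.2, 1e-9, 1.0, 1.5e9)   # nominal operating point
half = p_dyn(0.2, 1e-9, 0.5, 0.75e9)  # halve both V and f
print(round(base / half))             # 8x power reduction
```

In practice the reduction is smaller than 8× because V cannot scale as far as f (V_th sets a floor) and leakage power does not follow the V²f term.
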
14
Medium Verification / DFT
How does formal verification (model checking / property checking) differ from simulation? When is formal preferred?

Simulation checks specific test cases: apply a sequence of inputs, observe the outputs, compare against expected. Even with millions of random vectors (constrained-random UVM), simulation covers only a tiny fraction of the total state space. It proves the design is correct for those specific scenarios, not in general.

Formal verification (property checking / model checking) mathematically proves that a property holds for all possible input sequences and all reachable states — or produces a concrete counterexample. No test vectors needed. Tools like JasperGold (Cadence), VC Formal (Synopsys), and Questa Formal (Siemens) use SAT/BDD solvers to explore the full state space.

When formal is preferred:

  • Safety-critical properties: "The FIFO never overflows", "The arbiter never grants two requestors simultaneously", "A valid handshake always completes within 16 cycles." These must be proven, not just tested.
  • CDC checking: Formal tools exhaustively verify all synchronizer topologies across all clock domains (as described in the previous question).
  • RTL-to-gate equivalence checking: After synthesis, prove the gate-level netlist is functionally identical to the RTL. Catches synthesis tool bugs. Industry standard for ASIC tape-out.
  • Reset/initialization verification: Prove all state elements reach known values after reset sequences.
  • Protocol compliance: Verify that an AXI4 interface implementation correctly follows the spec for all legal sequences of VALID/READY.

Limitation: State space explosion. Complex datapaths (e.g., floating-point units with 52-bit mantissas) have state spaces too large for formal to solve without heroic abstraction. Formal is most powerful on control paths; datapath is verified by simulation.
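
The flavor of exhaustive state-space exploration can be shown on a toy: a hand-rolled breadth-first search proving mutual exclusion on a tiny round-robin arbiter. This captures the spirit of model checking — every reachable state under every input — but is of course not a real formal tool:

```python
from itertools import product

def arbiter(state, req0, req1):
    """state = last winner (0 or 1); grants at most one requestor."""
    if req0 and req1:
        win = 1 - state            # round-robin tie-break
    elif req0:
        win = 0
    elif req1:
        win = 1
    else:
        return state, (0, 0)       # no requests, no grants
    return win, (1 if win == 0 else 0, 1 if win == 1 else 0)

# BFS over all reachable states, checking the property on every transition
seen, frontier = set(), {0}
while frontier:
    nxt = set()
    for s in frontier - seen:
        seen.add(s)
        for r0, r1 in product((0, 1), repeat=2):
            s2, (g0, g1) = arbiter(s, r0, r1)
            assert not (g0 and g1)   # property: never two grants at once
            nxt.add(s2)
    frontier = nxt - seen
print("mutual exclusion holds on all reachable states")
```

Real tools do the same exploration symbolically (SAT/BDD) so the "states" are sets encoded as formulas, not enumerated one by one.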

15
Easy Verification / DFT
What is hardware emulation? Why is it essential for verifying a GPU-scale design?

Hardware emulation compiles RTL onto a large array of FPGAs (or custom emulation processors like Cadence Palladium or Siemens Veloce). The emulated design runs at 1–10 MHz — 100–10,000× faster than RTL simulation. At 5 MHz emulation speed vs 500 Hz simulation speed, a test that takes 1 year in simulation completes in hours on an emulator.

Why emulation is essential for GPUs:

  • Software bring-up: The GPU driver stack, firmware, CUDA runtime, and OS interactions involve billions of transactions and complex state machines. Running real software stacks on the emulator is the only way to validate pre-silicon software behavior. Simulation is simply too slow.
  • Latent bugs: Some bugs only manifest after millions or billions of transactions under realistic workloads — memory coherency races, power state machine errors, firmware edge cases. Emulation can run full AI model training passes on pre-silicon hardware.
  • System-level integration: Connect the emulated GPU RTL to real PCIe hardware and run real-world benchmarks (ResNet training, CUDA programs) to validate system integration months before silicon is available.
  • DFT validation: Run production test patterns on the emulator to validate scan chain behavior, scan shift/capture at-speed before committing to tape-out.
NVIDIA's pre-silicon flow: RTL simulation handles targeted unit-level verification. Emulation handles system-level software bring-up and regression at scale. Then silicon samples validate both at full speed. Emulation dramatically compresses the software readiness timeline so drivers and frameworks are ready on day 1 of silicon availability.
16
Medium Architecture
What is out-of-order execution? How does a Reorder Buffer (ROB) maintain program-order commit?

Out-of-order (OOO) execution allows a CPU to execute instructions in a different order from program order when earlier instructions stall (e.g., on a cache miss), so that later independent instructions can proceed. This improves instruction-level parallelism (ILP) and hides memory latency.

The challenge: Although instructions execute out of order, they must commit (become architecturally visible — update registers and memory) in program order. If an instruction causes an exception or a branch misprediction is detected, all subsequent instructions must be discarded as if they never executed. This requires the ability to "undo" out-of-order execution.

The Reorder Buffer (ROB) solves this:

  • A circular buffer that tracks all in-flight instructions in program order. Each entry stores the instruction, its destination register, its result (when computed), and its status (executing / done / excepted).
  • Instructions are allocated (in order) at dispatch and deallocated (committed) only from the head — always in program order.
  • An instruction commits when: it is at the ROB head AND its result is ready AND no exception occurred. Commitment writes the result to the architectural register file.
  • On misprediction/exception: The ROB is flushed from the faulting instruction to the tail — all results are discarded, the architectural state is restored to the last committed state. Execution resumes from the correct PC.

Register renaming: Architecturally, there may be 32 registers. The OOO engine maps these to a larger physical register file (hundreds of registers). This eliminates WAW and WAR hazards by giving each instruction its own private physical register for its result — multiple "versions" of the same architectural register can be in-flight simultaneously.
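
A minimal Python model of the commit discipline: completion may arrive in any order, but retirement only ever happens from the head, in program order:

```python
from collections import deque

class ROB:
    """Toy reorder buffer: in-order allocate, out-of-order complete,
    in-order commit from the head only."""
    def __init__(self):
        self.buf = deque()
    def dispatch(self, tag):           # allocate in program order
        self.buf.append({"tag": tag, "done": False})
    def complete(self, tag):           # results may arrive in any order
        for e in self.buf:
            if e["tag"] == tag:
                e["done"] = True
    def commit(self):                  # retire only contiguous ready head entries
        out = []
        while self.buf and self.buf[0]["done"]:
            out.append(self.buf.popleft()["tag"])
        return out

rob = ROB()
for t in ["i0", "i1", "i2"]:
    rob.dispatch(t)
rob.complete("i2")       # finishes first (e.g., a fast ALU op)...
print(rob.commit())      # [] -- i0 not done, nothing may commit yet
rob.complete("i0")
rob.complete("i1")
print(rob.commit())      # ['i0', 'i1', 'i2'] -- strict program order
```

A flush on misprediction corresponds to truncating the deque from the faulting entry to the tail before resuming fetch.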


01
Easy RTL Design
What is the difference between a latch and a flip-flop? When would you use each?

A latch is a level-sensitive storage element. When the enable (or clock) is HIGH, the output follows the input continuously — the latch is "transparent." When the enable goes LOW, the last value is held. A flip-flop is edge-triggered — it samples the input only at the exact rising (or falling) clock edge and ignores the input at all other times.

Why flip-flops dominate RTL design: Their predictable sampling window makes static timing analysis (STA) straightforward — setup and hold times are well-defined relative to one clock edge. Latches create "transparent windows" that make STA far more complex; timing tools must ensure that no combinational path through an open latch violates timing in any cycle.

When to use latches deliberately:

  • Power savings: A latch consumes no clock dynamic power when transparent (no clock-to-output toggling).
  • High-performance pipelines: In "latch-based" designs (common in custom datapath and CPUs), a latch pair (master + slave) forms a pseudo-FF but allows time-borrowing — a slow first half-cycle can steal time from a fast second half-cycle, improving throughput.
  • Specialized cells: Sense amplifiers and certain memory cells use latch-based structures.
Interview tip: Google interviewers often follow up: "What is an enable flip-flop and how is it synthesized?" — Answer: it's a mux feeding the D input: D = en ? data_in : Q. Synthesis implements this as a mux in front of a flip-flop — or, with clock-gating inference enabled, converts it to an ICG clock-gate cell — but never a latch.
02
Medium RTL Design
How do you design a clock divider by 3 with exactly 50% duty cycle?

Dividing by an odd number and achieving 50% duty cycle requires using both clock edges. A single-edge counter can only produce a 33%/67% duty cycle.

The technique: Create two signals derived from the same mod-3 counter — one toggled on the rising edge, one on the falling edge — then OR them together.

  • Use a 2-bit counter clocked on the rising edge counting 0→1→2→0. Generate out_r = HIGH when count == 0, LOW when count == 1.
  • Use the same counter logic clocked on the falling edge. Generate out_f identically.
  • Final output = out_r OR out_f. Because out_r and out_f are offset by half a source clock period, their OR produces exactly 1.5 periods HIGH and 1.5 periods LOW out of every 3 source periods — 50% duty cycle.
Why this works: The rising-edge FF goes HIGH at the source rising edge of count 0, and the falling-edge FF goes HIGH at the source falling edge of count 0 — they overlap for 1.5 source cycles. The combined waveform has transitions at every 1.5 source clock periods.
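
The waveform claim can be checked with a half-cycle-resolution simulation of the two counters and the OR:

```python
def div3_waveform(halves=24):
    """Simulate the divide-by-3 scheme at half-period resolution.
    Even half-cycles model the rising edge (rising-edge counter updates),
    odd half-cycles model the falling edge (falling-edge counter updates)."""
    out, cnt_r, cnt_f = [], 0, 0
    out_r = out_f = 0
    for h in range(halves):
        if h % 2 == 0:                     # rising edge of the source clock
            out_r = 1 if cnt_r == 0 else 0
            cnt_r = (cnt_r + 1) % 3
        else:                              # falling edge
            out_f = 1 if cnt_f == 0 else 0
            cnt_f = (cnt_f + 1) % 3
        out.append(out_r | out_f)          # final output is the OR
    return out

print(div3_waveform(6))   # [1, 1, 1, 0, 0, 0] -- HIGH 1.5 periods, LOW 1.5 periods
```

Each output period spans 6 half-cycles (3 source periods) with exactly half of them HIGH — the divide-by-3, 50% duty behavior claimed above.
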
03
Medium RTL Design
How do you calculate the required depth of an asynchronous FIFO?

The FIFO must hold all data written during a burst before the reader catches up. The minimum depth is:

Depth ≥ (Write rate − Read rate) × Burst duration + Synchronizer latency guard

Breaking this down:

  • Burst excess: If a writer sends at fw words/cycle for T cycles, and the reader drains at fr words/cycle, the net accumulation is (fw − fr) × T words. This is the minimum storage needed.
  • Synchronizer latency: The Gray-coded read pointer takes 2–3 cycles of the write clock to cross into the write domain (and the write pointer takes 2–3 read-clock cycles the other way). During this window each side sees a stale pointer — the writer may see the FIFO as full (and the reader as empty) earlier than reality. Add 2–3 words of margin per side.
  • Round up to power of 2: Async FIFO address arithmetic requires a power-of-2 depth so that the gray code pointer MSB inversion trick for full/empty detection works correctly.
Example: Writer at 400 MHz (1 word/cycle), reader at 200 MHz (1 word/cycle), burst of 16 words. Excess = (1−0.5)×16 = 8 words, plus 4 guard words → depth = 12, round up to 16.
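
The worked example as a small helper (the 4-word guard is the margin suggested above):

```python
import math

def fifo_depth(f_wr, f_rd, burst, guard=4):
    """Minimum async-FIFO depth: burst excess plus synchronizer guard,
    rounded up to the next power of two (required for the Gray-code
    MSB full/empty trick)."""
    excess = math.ceil((1 - f_rd / f_wr) * burst)  # words accumulated per burst
    need = excess + guard
    return 1 << math.ceil(math.log2(need))

print(fifo_depth(400e6, 200e6, 16))  # 16  (excess 8 + guard 4 = 12 -> next pow2)
```
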
04
Easy RTL Design
What is a glitch in combinational logic and how do you prevent it?

A glitch (or hazard) is a spurious, short-duration output pulse that occurs when multiple inputs change simultaneously and race through paths of unequal delay. Even though the steady-state output is correct, the transient produces an unwanted transition.

Classic example: A static-0 hazard in a 2-input AND gate where both inputs come from the same signal A through paths of unequal delay — one direct, one through an inverter. Mathematically A AND NOT(A) = 0, but when A rises, the direct path arrives first and the gate briefly sees 1 AND 1 = 1 before the inverted path catches up.
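
A unit-delay simulation reproduces the transient: when A rises, the stale inverter output keeps the AND gate's second input HIGH for one delay unit:

```python
def simulate(a_trace, inv_delay=1):
    """Discrete-time model of y = A AND NOT(A) where the inverter
    lags the direct path by inv_delay time steps."""
    y = []
    for t, a in enumerate(a_trace):
        inv_out = 1 - a_trace[max(0, t - inv_delay)]  # delayed inverter output
        y.append(a & inv_out)
    return y

A = [0, 0, 1, 1, 1, 1]
print(simulate(A))  # [0, 0, 1, 0, 0, 0] -- one-step glitch at the 0->1 edge
```

The steady-state output is 0 on both sides of the edge, as the Boolean algebra promises; only the race window produces the pulse.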

Why glitches matter:

  • Power waste: Every glitch is a switching event that consumes dynamic power (αCV²f). High-activity buses can waste significant power.
  • Clock path corruption: A glitch on a clock or enable line can clock a flip-flop at the wrong time, causing functional failure.
  • Latch transparency: Glitches on a latch enable propagate directly to the latch output while it is transparent.

Prevention:

  • Register outputs: Sampling glitchy combinational logic in a flip-flop on the clock edge filters all glitches shorter than the setup window.
  • Hazard-free logic: In Karnaugh map minimization, add "consensus" prime implicant terms that cover the transition between any two adjacent groups — eliminates static hazards.
  • Clock gating cells (ICG): Use library clock gate cells that latch the enable on the clock LOW phase — ensures the gated clock output is always a complete pulse or no pulse at all.
05
Hard RTL Design
How do you implement a glitch-free clock multiplexer for two asynchronous clocks?

A naive assign clk_out = sel ? clk1 : clk0 will produce a glitch when sel changes — the output can get a truncated pulse from one clock or a merged pulse from both. This corrupts any flop clocked by clk_out.

The safe design uses an interlocked two-branch structure:

  • Each branch has a flip-flop clocked on the falling edge of its own clock to gate that branch on or off.
  • Branch 0 FF: D = !sel AND !en1_q, clocked on negedge clk0. Branch 1 FF: D = sel AND !en0_q, clocked on negedge clk1.
  • Each branch gates its own clock: clk0_g = clk0 AND en0_q, clk1_g = clk1 AND en1_q.
  • Output: clk_out = clk0_g OR clk1_g.

Why falling-edge clocking? Gating on the falling edge ensures that the enable change is captured while the clock is LOW, so the gated clock output is either a complete HIGH pulse or nothing — never a partial pulse.

Why the cross-interlocking? The !en1_q / !en0_q terms ensure only one branch is ever active at a time. The transition from clk0 to clk1 requires clk0's branch to deassert fully before clk1's branch asserts — preventing both from being active simultaneously.

Caveat: The switch takes 1–2 cycles of the slower clock to complete (synchronizer latency). This is expected and acceptable. Never use this for cycle-accurate switching without understanding the latency.
06
Easy Timing / STA
What are setup time and hold time? What happens when each is violated?

Setup time (t_su) is the minimum time the data input must be stable before the active clock edge for the flip-flop to reliably capture it. Hold time (t_h) is the minimum time the data must remain stable after the clock edge.

Together they define a "forbidden window" around the clock edge where data must not change.

Setup violation (data arrives too late): The flip-flop samples data before it has settled to a valid logic level. The FF may capture the wrong value or enter a metastable state. This is a functional failure at the target frequency — the design either works slowly or not at all. Setup violations are frequency-dependent: slow down the clock enough and they disappear.

Hold violation (data changes too soon after the clock edge): The flip-flop's captured value is overwritten before it is fully stored. This causes the FF to capture a corrupted value — either the new data that hasn't fully arrived, or garbage. Hold violations are frequency-independent — they occur even at 1 Hz and are caused by short combinational paths (fast data propagation relative to clock skew). They are the more dangerous class because no amount of slowing down the clock fixes them.

Key insight: Setup violations → fix by making data arrive earlier or giving more time (slow clock). Hold violations → fix by making data arrive later (insert delays), completely unrelated to clock frequency.
07
Medium Timing / STA
How do you fix a setup timing violation? How about a hold violation? Are the fixes different?

Yes, the fixes are completely different and cannot be mixed up.

Fixing a setup violation (data arrives too late — reduce data path delay):

  • Upsize cells to higher drive strength, and swap high-Vt cells for low-Vt variants (low-Vt switches faster at the cost of leakage)
  • Reduce logic depth — restructure combinational logic to fewer gate stages
  • Use retiming — move registers across combinational logic to balance stages
  • Add pipeline registers to split a long path into two shorter ones
  • Optimize clock skew — use positive skew on the capture FF (delay the capture clock) to give the data path more time
  • Last resort: reduce the clock frequency

Fixing a hold violation (data arrives too early — increase minimum data path delay):

  • Insert delay buffers (DEL cells from the standard cell library) on the short data path
  • Downsize to lower-drive-strength cells — smaller, weaker cells are slower, adding delay without dedicated buffers
  • Add logic stages that cancel each other (insert an even number of inverters)
  • Adjust clock skew: negative skew on the capture FF (advance the capture clock) reduces the hold window
Critical rule: When fixing hold violations by adding buffers, always re-check setup slack — the added delay consumes setup margin. Both must pass simultaneously.
08
Medium Timing / STA
What is the difference between clock skew and clock jitter? How does each affect timing?

Clock skew is a static, deterministic difference in clock arrival time between two flip-flops on the same chip. It is fixed for a given netlist and process corner. Skew arises from different buffer depths or wire lengths in the clock tree.

Clock jitter is a dynamic, cycle-to-cycle variation in the clock edge position. It is random (caused by power supply noise, substrate coupling, PLL VCO noise) and varies every cycle. You cannot predict the sign or magnitude of jitter in any given cycle.

Effect on timing:

  • Skew and setup: Positive skew (capture FF clock arrives later) helps setup — the data has more time to travel. Negative skew (capture clock earlier) hurts setup.
  • Skew and hold: Positive skew hurts hold — the next data launched from the launch FF can race ahead and reach the capture FF before its (delayed) clock edge has safely captured the previous value. Negative skew helps hold.
  • Jitter and both: Jitter degrades both setup and hold because you can't know which direction the edge will shift. STA tools add a "clock uncertainty" (a worst-case jitter margin) that reduces both setup and hold slack. It cannot be recovered through skew optimization.
Rule of thumb: Skew is a tool — you can use it (via clock tree tuning) to deliberately help tight paths. Jitter is noise — you can only characterize it and budget for it.
09
Hard Timing / STA
Walk through a complete flip-flop to flip-flop timing path analysis. What is slack?

For a path from a launch flip-flop (FF1) to a capture flip-flop (FF2):

Setup check — data must arrive before the capture edge:

  • Data arrival time = T_clk_launch + T_cq(FF1) + T_comb
  • Data required time = T_clk_capture − T_setup(FF2)
  • Setup slack = Required − Arrival = (T_clk_capture − T_su) − (T_clk_launch + T_cq + T_comb)

Hold check — data must not arrive too early:

  • Data must arrive after: T_clk_capture + T_hold(FF2)
  • Hold slack = Arrival − Hold_required = (T_clk_launch + T_cq_min + T_comb_min) − (T_clk_capture + T_hold)

Where:

  • T_cq = clock-to-Q propagation delay of the launch FF
  • T_comb = total combinational path delay (sum of gate + wire delays)
  • T_setup / T_hold = FF timing constraints from the cell library
  • T_clk_capture − T_clk_launch = clock skew (positive = capture is later)

Slack = margin above the requirement. Positive slack → timing met. Negative slack → timing violated. The most negative slack in the design = worst negative slack (WNS); summing all negative slacks = total negative slack (TNS).

STA tools use MMMC: Multi-mode, multi-corner analysis — checking setup at slow-corner (slow cells, high temp, low Vdd) and hold at fast-corner (fast cells, low temp, high Vdd) simultaneously, since those are the worst cases for each check.
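The setup and hold checks above can be packaged as a tiny slack calculator. A sketch with all times in ns (the numbers are illustrative, not from any real library):

```python
def setup_slack(t_period, t_skew, t_cq_max, t_comb_max, t_su):
    """Setup: data launched at t=0 must beat the capture edge at
    t_period + t_skew, minus the setup requirement."""
    arrival = t_cq_max + t_comb_max
    required = t_period + t_skew - t_su
    return required - arrival

def hold_slack(t_skew, t_cq_min, t_comb_min, t_hold):
    """Hold: new data launched on the same edge must arrive after the
    capture clock edge (at t_skew) plus the hold requirement."""
    arrival = t_cq_min + t_comb_min
    required = t_skew + t_hold
    return arrival - required

# 500 MHz clock (2 ns period), +0.1 ns skew on the capture clock
s = setup_slack(2.0, 0.1, t_cq_max=0.3, t_comb_max=1.2, t_su=0.2)
h = hold_slack(0.1, t_cq_min=0.2, t_comb_min=0.1, t_hold=0.15)
print(round(s, 3), round(h, 3))  # → 0.4 0.05
```

Note how the +0.1 ns positive skew shows up with opposite signs: it adds to the setup slack but is subtracted from the hold slack — exactly the trade-off described in the skew question above.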
10
Hard Timing / STA
After place-and-route, a flip-flop path has both a setup AND a hold violation. How do you approach this?

A simultaneous setup and hold violation on the same path means the available timing window is narrower than the path's min–max delay spread: the minimum-delay analysis arrives too early for hold while the maximum-delay analysis arrives too late for setup. This typically happens when the logic between the two FFs is very shallow (perhaps only a wire or one gate) and a clock tree imbalance creates large skew.

Diagnosis first:

  • Check the clock skew between launch and capture FFs. Large positive skew is the usual culprit on the hold side, and the extra insertion delay and uncertainty margins that come with a poorly balanced tree squeeze the setup check at the same time.
  • Look at the path's actual combinational depth — very few gates means it's a structurally short path.

Fix strategy:

  • Rebalance the clock tree first: Reduce skew between these two FFs. This is the most targeted fix — less skew directly improves both simultaneously.
  • Insert delay cells: Add buffers on the data path to increase minimum delay (fix hold). Then verify setup is still met — if setup is tight, you may need to also optimize the logic depth.
  • Restructure logic: If setup is violated because the path is in a long combinational chain overall, pipelining it (adding an intermediate register) can help. But this changes the design architecture.
At advanced nodes (7nm, 5nm): This scenario is common because cells are extremely fast and clock skew control is harder with dense routing. PDKs provide dedicated delay cells (e.g., DLY4, BUF_DEL) tuned specifically for hold fixing without wasting area.
11
Medium CDC
What is metastability? Can it be completely eliminated?

Metastability occurs when a flip-flop's setup or hold time is violated — the flip-flop enters a metastable state where its output is neither a valid logic 0 nor a valid logic 1. The internal node of the FF is stuck near the switching threshold (V_DD/2) and takes an unpredictable time to resolve to a valid level.

The physics: A flip-flop is a bistable element with two stable equilibria (0 and 1) and one unstable equilibrium (the metastable point). When forced into the unstable point, it resolves exponentially fast — but how long it takes is governed by thermal noise and is therefore random.

Can it be eliminated? No — not completely. Any time asynchronous data crosses a clock boundary, there is a non-zero probability of violating the setup/hold window. The probability of remaining metastable beyond a time T_r decreases exponentially with T_r, but never reaches exactly zero.

What we do instead: We manage the probability using synchronizers. The key metric is MTBF (Mean Time Between Failures). A 2-flop synchronizer gives the metastable FF one full clock period to resolve — in modern CMOS (τ ≈ 30ps), at 1 GHz this gives MTBF of thousands of years, making failure astronomically unlikely.

Critical: A metastable output propagating into combinational logic is dangerous — the intermediate voltage can cause multiple gates to output contradictory values, causing unpredictable circuit-wide failures. Always ensure metastability resolves before the signal fans out.
12
Medium CDC
What is a 2-flop synchronizer and why exactly 2 flops? Why not 1 or 3?

A 2-flop synchronizer consists of two back-to-back flip-flops, both clocked by the destination domain clock, inserted on a signal crossing from another clock domain.

How it works: The first FF may go metastable when it samples the asynchronous input. It has one full clock period (minus the FF's own propagation delay and the second FF's setup time) to resolve. Because metastability resolution time is exponential, the probability that it remains metastable long enough to corrupt the second FF is extremely small.

Why not 1 flop? With a single flop, its possibly-metastable output feeds functional logic immediately, so the time left to resolve is the clock period minus the downstream combinational delay — potentially close to zero on a long path. A dedicated second flop reserves nearly the full period (T_clk − T_cq − T_setup2) purely for resolution.

Why not 3 flops? Three flops are rarely necessary. With a 1 GHz destination clock and τ ≈ 30ps (modern 7nm), a 2-flop synchronizer gives:

  • Resolution time T_r ≈ 500ps with τ ≈ 30ps → T_r/τ ≈ 16.7
  • MTBF = e^(T_r/τ) / (f_c × f_d × t_w) — and because T_r sits in the exponent, every additional 100ps of resolution time multiplies MTBF by e^(100/30) ≈ 28×. The near-full clock period that the second flop reserves is what lifts MTBF from fractions of a second into years and beyond.

Three flops extend MTBF by another enormous exponential factor that adds no practical benefit. Use 3 flops only in safety-critical applications (automotive ASIL-D, aerospace) where the required failure rates are so low that a 2-flop MTBF analysis cannot demonstrate compliance.

13
Medium CDC
Why are Gray code counters used for asynchronous FIFO read/write pointers?

In an async FIFO, the read pointer lives in the read clock domain and the write pointer lives in the write clock domain. Each pointer must be compared against the other's synchronized version to determine full or empty.

The problem with binary counters: When a binary counter increments, multiple bits change simultaneously. For example, 0111 → 1000 changes all 4 bits. If you sample a binary counter while it's transitioning, you might read any of the 16 possible values — a catastrophic error that could falsely declare the FIFO full or empty, corrupting data.

Why Gray code solves this: A Gray code changes exactly one bit per count. When the pointer transitions from count N to N+1, only one bit flips. If the synchronized copy is sampled mid-transition, the worst case is that it sees either count N or count N+1 — off by at most one.

The one-count error is also always in the safe direction: a synchronized pointer can only lag the true value, never lead it. A stale write pointer makes the read side declare empty slightly early, and a stale read pointer makes the write side declare full slightly early — both pessimistic, so the FIFO can never overflow or underflow because of synchronization delay.

Implementation note: The Gray code conversion is simple: gray = bin XOR (bin >> 1). Store the Gray counter as the pointer, convert back to binary (requires a loop of XORs) only if you need the absolute address for memory indexing.
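The two conversions from the implementation note, written out as plain functions so the single-bit-change property can be checked directly (a sketch; the function names are mine):

```python
def bin2gray(b: int) -> int:
    # One XOR: gray = bin ^ (bin >> 1)
    return b ^ (b >> 1)

def gray2bin(g: int) -> int:
    # The inverse needs a chain of XORs: bin = g ^ (g>>1) ^ (g>>2) ^ ...
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

# Single-bit-change property: adjacent counts differ in exactly one bit
for n in range(15):
    assert bin(bin2gray(n) ^ bin2gray(n + 1)).count("1") == 1

# The conversions are exact inverses
assert all(gray2bin(bin2gray(n)) == n for n in range(256))
```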
14
Hard CDC
What is MTBF in the context of synchronizers and what parameters influence it?

MTBF (Mean Time Between Failures) quantifies how often a synchronizer is expected to allow a metastable signal to propagate into the destination domain. The standard formula is:

MTBF = e^(T_r / τ) / (f_c × f_d × t_w)

Where:

  • T_r — resolution time available for metastability to resolve (≈ T_clk − T_cq_ff1 − T_su_ff2). The more time, the exponentially higher the MTBF.
  • τ — technology metastability time constant. A smaller τ means the FF resolves faster, improving MTBF. Scales with process: ~100ps at 180nm, ~30ps at 7nm.
  • f_c — destination clock frequency. Higher frequency = more sampling opportunities per second = more chances for metastability to cause failure.
  • f_d — data toggle rate. How often does the incoming signal change near the clock edge?
  • t_w — the setup+hold window width. Narrower window = smaller probability of entering metastability per clock cycle.

The exponential dependence on T_r/τ is why adding a second synchronizer flip-flop dramatically improves MTBF — it adds one full clock period to T_r.

Practical target: Design teams typically target MTBF > 10,000 years per synchronizer. At 1 GHz with modern 7nm silicon, a 2-FF synchronizer achieves this comfortably. If not, either add a third FF or investigate if the signal can be designed to toggle only when safely away from clock edges.
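The formula drops straight into code. A sketch with illustrative parameter values (the 100 MHz toggle rate and 30 ps window are assumptions, not measured data) showing the exponential leverage of resolution time:

```python
import math

def synchronizer_mtbf(t_r, tau, f_clk, f_data, t_w):
    """MTBF in seconds; t_r, tau, t_w in seconds, f_clk and f_data in Hz."""
    return math.exp(t_r / tau) / (f_clk * f_data * t_w)

# Illustrative numbers: 1 GHz clock, 100 MHz data toggle rate, tau = 30 ps
base  = synchronizer_mtbf(0.5e-9, 30e-12, 1e9, 1e8, 30e-12)
extra = synchronizer_mtbf(1.5e-9, 30e-12, 1e9, 1e8, 30e-12)

# One extra clock period (1 ns) of resolution time multiplies MTBF
# by e^(1ns / 30ps) ≈ 3×10^14 — the exponential term dominates everything
ratio = extra / base
assert 1e14 < ratio < 1e15
```

This is why the question of "2 flops or 3?" is really a question about T_r/τ: linear changes in the resolution time move the MTBF by many orders of magnitude.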
15
Easy Low Power
What is the difference between dynamic power and static (leakage) power? How do you reduce each?

Dynamic power = α × C_L × V_DD² × f — consumed when a node switches from 0 to 1 (charges the load capacitance) or 1 to 0 (discharges through the pull-down network). α is the activity factor (fraction of clock cycles the node switches). This power is zero when the circuit is idle.

Static (leakage) power = I_leakage × V_DD — consumed even when no gates are switching, due to sub-threshold current, gate oxide tunneling, and junction leakage. It does not depend on frequency and is present whenever power is supplied.

Historical trend: In nodes above 90nm, dynamic power dominated. Below 28nm and especially at 7nm/5nm, leakage has grown dramatically because transistors cannot be switched fully off at low supply voltages. Modern SoCs spend significant area on leakage management.

Reduction techniques:

  • Dynamic: Clock gating (reduce α), operand isolation (prevent toggling of datapath), voltage scaling (V² dependence), frequency scaling, low-swing signaling
  • Static: Power gating (cut V_DD to an entire domain via header/footer cells), multi-Vt design (use High-Vt cells in non-critical paths — slower but much lower leakage), reverse body biasing, state retention during power down
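The two formulas in numbers — a sketch with made-up per-node values (0.2 activity, 2 fF load, 0.75 V, 1 GHz are illustrative, not from any library) that makes the V² leverage of voltage scaling concrete:

```python
def dynamic_power(alpha, c_load, vdd, freq):
    # P_dyn = alpha * C_L * Vdd^2 * f  (watts)
    return alpha * c_load * vdd**2 * freq

def static_power(i_leak, vdd):
    # P_static = I_leakage * Vdd — frequency-independent
    return i_leak * vdd

# Illustrative node: activity 0.2, 2 fF load, 0.75 V supply, 1 GHz clock
p_dyn = dynamic_power(0.2, 2e-15, 0.75, 1e9)
print(f"{p_dyn * 1e9:.1f} nW")  # → 225.0 nW

# The V² term: dropping Vdd from 0.75 V to 0.6 V cuts dynamic power by 36%
assert abs(dynamic_power(0.2, 2e-15, 0.6, 1e9) / p_dyn - (0.6 / 0.75) ** 2) < 1e-9
```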
16
Medium Low Power
How is clock gating implemented correctly? Why can't you just AND the clock with an enable signal?

Clock gating removes the clock signal from a register bank when its value won't change, eliminating the clock-to-Q dynamic power and the switching power of all downstream logic. The clock network can account for 30–40% of total chip dynamic power, making clock gating one of the highest-impact power techniques.

Why you cannot simply write assign gated_clk = clk AND enable:

If enable changes while clk is HIGH, the AND gate output glitches — it produces a truncated clock pulse shorter than a full cycle. This truncated pulse can violate the setup/hold requirements of any flip-flop it clocks, corrupting stored data or causing metastability.

The correct implementation — Integrated Clock Gating (ICG) cell:

  • A latch samples the enable signal on the LOW phase of the clock (when clock = 0)
  • The latched enable is then ANDed with the clock
  • Because the latch captures enable only when the clock is LOW, by the time the clock rises, the latch output is stable — the AND gate sees a stable enable and a clean rising edge → full clock pulse or no pulse, never a partial one

In RTL, you write: if (enable) register <= data; and the synthesis tool infers an ICG cell. Never write clock gating manually at the gate level in RTL — let the tool use the optimized library ICG cell.
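A unit-step simulation makes the difference visible. This is my own toy model (arbitrary half-period of 5 steps, enable rising mid-HIGH-phase to provoke the glitch), not library cell timing:

```python
def gate_clock(enable_change_t, use_icg, cycles=4, half=5):
    """Clock is HIGH for `half` steps then LOW for `half`. Enable goes
    0 -> 1 at enable_change_t. Returns the gated clock trace."""
    latch_q = 0
    naive_out, icg_out = [], []
    for t in range(cycles * 2 * half):
        clk = 1 if (t // half) % 2 == 0 else 0    # starts HIGH
        en = 1 if t >= enable_change_t else 0
        if clk == 0:
            latch_q = en                          # ICG latch: transparent LOW
        naive_out.append(clk & en)                # naive AND gating
        icg_out.append(clk & latch_q)             # ICG: AND with latched enable
    return icg_out if use_icg else naive_out

def pulse_widths(trace):
    widths, run = [], 0
    for v in trace + [0]:
        if v:
            run += 1
        elif run:
            widths.append(run)
            run = 0
    return widths

# Enable rises at t=2, while the clock is HIGH
print(pulse_widths(gate_clock(2, use_icg=False)))  # → [3, 5, 5, 5] truncated!
print(pulse_widths(gate_clock(2, use_icg=True)))   # → [5, 5, 5]   all clean
```

The naive AND emits a 3-step runt pulse the moment enable rises; the ICG holds the change until the next LOW phase, so every emitted pulse is full-width.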

Interview follow-up: "What is operand isolation?" — It prevents data inputs to a gated block from toggling even when the block is clock-gated, saving the switching power of the combinational logic feeding the registers.
17
Easy Verification / DFT
What is a scan chain and why is it used in Design for Test (DFT)?

A scan chain connects the flip-flops in a design into a long shift register that can be controlled and observed from the chip's I/O pins, purely for testing purposes.

How it works: Each flip-flop in the design is replaced with a scan flip-flop — identical to a normal FF but with an extra 2:1 mux at the data input:

  • In functional mode (scan_enable = 0): the mux passes normal D input — the design operates as designed.
  • In scan mode (scan_enable = 1): the mux passes the previous FF's output — all FFs form a shift register. You can shift in a test pattern, capture one functional clock cycle, and shift out the results for comparison.

Why it's essential: Without scan, testing whether a stuck-at fault (wire permanently stuck at 0 or 1) exists deep in the chip requires applying just the right sequence of primary input patterns — combinatorially explosive. With scan, an ATPG (Automatic Test Pattern Generation) tool can directly control any FF's state and observe any FF's captured output, enabling near-100% stuck-at fault coverage with a manageable number of test vectors.

Test flow at production: After packaging, every chip is tested on an ATE (Automated Test Equipment). The scan chain shifts in millions of vectors and compares the shifted-out responses against the fault-free model. Any mismatch → chip fails and is discarded.

DFT also covers: BIST (Built-In Self Test) for memories, boundary scan (JTAG IEEE 1149.1) for board-level interconnect testing, and compression (EDT, WBC) to reduce the number of test vectors while maintaining coverage.
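The shift–capture–shift flow can be sketched behaviorally. Everything here is a toy model: `toy_logic` is an arbitrary stand-in for the design's combinational logic, not any real circuit:

```python
def shift(chain, bits):
    """Scan-mode clocks: each tick, every FF takes the previous FF's
    output and a new bit enters at scan_in; the last FF drives scan_out.
    Returns (new chain state, scan_out bit stream)."""
    out = []
    for b in bits:
        out.append(chain[-1])
        chain = [b] + chain[:-1]
    return chain, out

def toy_logic(state):
    # Stand-in combinational logic feeding the FF D inputs in capture mode
    n = len(state)
    return [state[(i - 1) % n] ^ state[(i + 1) % n] for i in range(n)]

# 1. Shift a 4-bit test pattern into a 4-FF chain
chain, _ = shift([0, 0, 0, 0], [1, 1, 0, 1])
# 2. One functional clock: capture the logic's response into the FFs
chain = toy_logic(chain)
# 3. Shift the response out (in production, the next pattern shifts in
#    simultaneously) and compare it against the fault-free model
_, response = shift(chain, [0, 0, 0, 0])
print(response)
```

A stuck-at fault inside `toy_logic` would change `response` for some pattern — which is exactly the mismatch the ATE detects.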
18
Easy Protocols
Explain the AXI4 VALID/READY handshake mechanism. What rule must never be broken?

Every AXI4 channel (AW, W, B, AR, R) uses a two-signal handshake: VALID (driven by the sender) and READY (driven by the receiver). A transfer occurs on the rising clock edge when both VALID and READY are simultaneously HIGH.

Rules:

  • The sender asserts VALID when it has valid data/address to send and must not deassert VALID until the transfer completes (both signals HIGH on a clock edge).
  • The receiver asserts READY when it can accept data. READY may be HIGH before VALID (pre-ready) — this is fine.
  • If VALID is asserted and READY is LOW, both sides wait. Neither can "cancel" the transaction by deasserting VALID without completing the handshake.

The rule that must never be broken: VALID must not combinatorially depend on READY. If the master only asserts VALID after it sees READY, and the slave only asserts READY after it sees VALID, the result is a deadlock — neither ever fires first. READY is allowed to depend on VALID, but not vice versa.

AXI4 has 5 independent channels, enabling key features: a new write address (AW) can be accepted while write data (W) from a previous burst is still in-flight. Read and write transactions are completely independent, maximizing bus utilization.
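The VALID stability rule lends itself to a simple trace checker. A minimal sketch (generic signal lists per cycle, not a full AXI monitor — data stability is not checked here):

```python
def check_valid_stability(valid, ready):
    """AXI rule: once VALID is asserted it must stay asserted until the
    cycle where VALID and READY are both high (the transfer). Returns
    the list of transfer cycles, or raises on a protocol violation."""
    transfers = []
    pending = False
    for t, (v, r) in enumerate(zip(valid, ready)):
        if pending and not v:
            raise ValueError(f"VALID deasserted before handshake at cycle {t}")
        if v and r:
            transfers.append(t)     # both high on this edge: transfer fires
            pending = False
        elif v:
            pending = True          # sender is committed; must keep waiting
    return transfers

# Legal: sender holds VALID through two stall cycles until READY arrives
print(check_valid_stability([0, 1, 1, 1, 0], [0, 0, 0, 1, 1]))  # → [3]

# Illegal: VALID dropped while still waiting for READY
try:
    check_valid_stability([0, 1, 0, 0], [0, 0, 1, 0])
except ValueError as e:
    print(e)
```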

19
Hard Protocols
How does AXI4 support out-of-order transaction completion? What is the role of transaction IDs?

AXI4 allows a master to issue multiple outstanding read or write transactions before receiving responses. Each transaction is tagged with a Transaction ID (ARID for reads, AWID for writes). The slave and interconnect are free to complete transactions in a different order from how they were issued — a fast SRAM access may return data before a slow DRAM access even if the DRAM request was issued first.

How the master reconciles responses: Read data returns on the R channel with RID matching the original ARID. Write responses return on the B channel with BID matching AWID. The master maintains an outstanding transaction table and uses the ID to match each response to the correct request.

Ordering rule per ID: Transactions with the same ID must complete in order. If a master issues two reads both with ARID=3, the interconnect must return them in order. Transactions with different IDs have no ordering guarantee relative to each other.

Interconnect ID widening: When multiple masters share an interconnect, the fabric appends a master-identifying prefix to each ID (e.g., 2-bit master select + original ARID = extended RID). On the response path, the prefix is used to route the response back to the correct master, which strips the prefix before comparing IDs.

AXI4 vs AXI3: AXI3 allowed interleaved write data (WID specified which burst a W beat belonged to). AXI4 removed WID — write data must always be in the same order as write addresses. This simplification makes interconnects significantly cheaper to implement.
20
Easy Architecture
What is pipelining? What does it improve and what are its trade-offs?

Pipelining divides a long combinational operation into N sequential stages, each separated by flip-flops. Instead of one result every T_total clock period (dictated by the slowest path), you get one result per T_total/N clock period — throughput increases N× once the pipeline is full.

Example: A 5-stage 32-bit multiplier at 500 MHz produces one product every 2 ns. Without pipelining, the same logic would run at 100 MHz (5× slower combinational chain). With pipelining, a new multiply starts every cycle — though each individual result still takes 5 cycles of latency.

What pipelining improves: Throughput (results per unit time) — directly, by allowing clock frequency to be multiplied by the number of stages.

Trade-offs:

  • Latency: Each result takes N cycles to complete instead of 1. This is often acceptable for bulk data, but hurts interactive or latency-sensitive operations.
  • Area: N−1 extra register stages add flip-flop area and routing overhead.
  • Power: More registers switching every cycle; however, pipelining can also let a design hit its throughput target at a lower clock frequency and V_DD, which may offset the register overhead.
  • Hazards: Data hazards (RAW — read after write: an instruction needs a result that an earlier instruction has not yet produced), control hazards (branches), and structural hazards (resource conflicts) require stalls, forwarding, or branch prediction logic — all of which reduce ideal throughput.
  • Balancing: If one stage is slower than others, it bottlenecks the pipeline. All stages must be balanced to the same worst-case delay for the frequency gain to be fully realized.
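The throughput math above, in numbers — an idealized sketch (no hazards, perfectly balanced stages) using the 5-stage multiplier from the example:

```python
def pipelined_cycles(n_stages, n_items):
    """Fill the pipeline (n_stages cycles for the first result), then
    one result per cycle for the rest."""
    return n_stages + (n_items - 1)

n_stages, items = 5, 1000
cycles = pipelined_cycles(n_stages, items)   # 1004 cycles at 500 MHz
unpipelined = items                          # 1000 cycles at 100 MHz

# The pipelined clock is 5x faster (same per-stage logic depth), so the
# wall-clock speedup approaches n_stages as the burst grows
speedup = (unpipelined * n_stages) / cycles
print(f"{speedup:.2f}x")  # → 4.98x
```

For a single operation (`items = 1`) the speedup vanishes — latency is still 5 cycles — which is exactly the latency trade-off listed above.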
01
Easy RTL Design
What is the difference between blocking (=) and non-blocking (<=) assignments in Verilog? When should each be used?

Blocking assignment (=) executes sequentially within an always block — each statement completes before the next begins, exactly like a software assignment. The left-hand side updates immediately.

Non-blocking assignment (<=) evaluates all right-hand sides first (using values from the current time step), then schedules all left-hand side updates to happen simultaneously at the end of the time step. This models the parallel behavior of flip-flops sampling their D inputs on a clock edge.

The golden rules:

  • Use = (blocking) for combinational logic in always @(*) blocks. The sequential evaluation correctly implements the logic function.
  • Use <= (non-blocking) for sequential logic in always @(posedge clk) blocks. The simultaneous update models how FFs all sample their D input on the same clock edge.
  • Never mix both types in the same always block.

Classic bug with the wrong choice: A shift register written with blocking assignments (a = in; b = a; c = b;) immediately propagates the input through all stages in a single clock cycle. With non-blocking (a <= in; b <= a; c <= b;), all three FFs sample their current input simultaneously — correct shift register behavior.
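The two update disciplines can be mimicked in Python (a hand-rolled model of one clock tick, not a Verilog simulator — the dict-of-registers representation is my own):

```python
def clock_blocking(state, d_in):
    """Blocking (=): statements execute in order and each update is
    visible to the next — the input races through every stage at once."""
    state["a"] = d_in
    state["b"] = state["a"]
    state["c"] = state["b"]
    return state

def clock_nonblocking(state, d_in):
    """Non-blocking (<=): all right-hand sides are read first, then all
    updates land together — a true shift register."""
    return {"a": d_in, "b": state["a"], "c": state["b"]}

s_blk = clock_blocking({"a": 0, "b": 0, "c": 0}, 1)
s_nbl = clock_nonblocking({"a": 0, "b": 0, "c": 0}, 1)
print(s_blk)  # → {'a': 1, 'b': 1, 'c': 1}  input shot through all stages
print(s_nbl)  # → {'a': 1, 'b': 0, 'c': 0}  one stage per clock — correct
```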

Synthesis impact: Mixing = and <= in a clocked block can produce simulation–synthesis mismatches — the simulator and the synthesized hardware behave differently. This is one of the most common Verilog bugs in job interviews and real designs.
02
Easy RTL Design
What are the differences between wire, reg, and logic in SystemVerilog? Why was logic introduced?

wire (Verilog): a net type representing a physical connection. It can only be driven by continuous assignments (assign) or module output ports. Multiple drivers are resolved by the net's resolution function — a plain wire resolves conflicting 0/1 drivers to X, while wand/wor net types perform wired-AND/OR. It cannot hold state.

reg (Verilog): a variable that can be driven inside procedural blocks (always, initial). Despite its name, it does NOT necessarily synthesize to a register — a reg inside always @(*) synthesizes to combinational logic. The name is misleading and a common source of confusion.

logic (SystemVerilog): a unified 4-state variable type that replaces both wire and reg for most use cases. It can be driven by both continuous assignments and procedural blocks. The key restriction: logic allows only one driver — the compiler flags multi-driver errors that wire silently allows. This catches accidental bus conflicts at compile time.

Why logic was introduced:

  • Eliminates the confusing reg misnomer — logic communicates data type, not inferred hardware.
  • Provides compile-time multiple-driver checking that wire lacks.
  • Works in both continuous and procedural contexts, reducing declarations.
Modern SV practice: Use logic for almost everything. Use wire only when you explicitly need multiple drivers (e.g., tri-state buses, wired-AND). Avoid reg entirely in new SystemVerilog code.
03
Medium RTL Design
How do you detect full and empty conditions in a synchronous FIFO? What is the "extra bit" trick?

A synchronous FIFO uses a write pointer (wrptr) and a read pointer (rdptr) to track the head and tail. Both pointers start at 0. When the FIFO is empty, both point to the same location — and when it is completely full, both also point to the same location after wrapping around. This ambiguity is the core challenge of FIFO pointer design.

The naive approach fails: If both pointers are N-bit binary counters with range 0 to DEPTH-1, you cannot distinguish full from empty because both conditions result in wrptr == rdptr.

The extra bit trick: Use N+1 bit pointers, where N = log₂(DEPTH). The lower N bits are the actual memory address; the MSB (the "extra bit") acts as an overflow wrap indicator.

  • Empty: wrptr == rdptr (all N+1 bits equal — same wrap count, same address)
  • Full: wrptr[N-1:0] == rdptr[N-1:0] AND wrptr[N] != rdptr[N] (same address, but one extra wrap ahead)

The MSBs differ when the write pointer has wrapped one more time than the read pointer — meaning the FIFO is exactly DEPTH entries deep.

For async FIFOs: The same trick applies, but the pointers are Gray-coded before being synchronized across clock domains. Gray code changes only one bit per count, making the synchronized pointer off by at most one count — safe for full/empty logic.
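The N+1-bit comparison as executable logic — a pointer-only sketch (class name is mine; DEPTH assumed a power of 2, no data array):

```python
class SyncFifoPointers:
    """Pointer model of the extra-bit trick: N address bits plus one
    wrap bit per pointer; just the full/empty logic, no storage."""
    def __init__(self, depth):
        self.depth = depth           # must be a power of 2
        self.wr = 0                  # N+1-bit write pointer
        self.rd = 0                  # N+1-bit read pointer

    def empty(self):
        return self.wr == self.rd    # all N+1 bits equal

    def full(self):
        # Same N address bits, but the wrap (extra) bits differ
        mask = self.depth - 1
        return (self.wr & mask) == (self.rd & mask) and self.wr != self.rd

    def push(self):
        assert not self.full()
        self.wr = (self.wr + 1) % (2 * self.depth)

    def pop(self):
        assert not self.empty()
        self.rd = (self.rd + 1) % (2 * self.depth)

f = SyncFifoPointers(8)
assert f.empty() and not f.full()
for _ in range(8):
    f.push()
assert f.full()        # wr = 8: address bits 0 match rd, wrap bit differs
for _ in range(8):
    f.pop()
assert f.empty()
```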
04
Medium Timing / STA
What is the difference between a false path and a multi-cycle path in SDC constraints? How do you set each?

False path: a timing path that exists in the netlist but will never carry valid data in real operation. STA should completely ignore it — no setup or hold analysis. Examples:

  • Paths between two completely unrelated, never-simultaneously-active clock domains
  • Paths from a test-mode-only mux output that is static during functional operation
  • Reset synchronizer paths where the reset is never timing-critical
  • Paths between scan mode logic not active during functional timing

SDC: set_false_path -from [get_cells launch_ff] -to [get_cells capture_ff]

Multi-cycle path (MCP): a path where data is intentionally designed to take N clock cycles to settle. The designer tells STA to use N×T_clk as the available time for the setup check instead of 1×T_clk.

SDC for a 2-cycle setup path: set_multicycle_path 2 -setup -from ... -to ...

Critical rule for MCP: When you relax setup by N cycles, you MUST also adjust the hold check. By default, STA places the hold check one cycle before the setup capture edge — correct for 1-cycle paths. For a 2-cycle setup, the hold check must also move back one cycle:

SDC: set_multicycle_path 1 -hold -from ... -to ...

Common mistake: Setting set_multicycle_path 2 -setup without the matching -hold exception creates an overly pessimistic hold check one cycle before the new setup capture — often an impossible hold requirement that forces unnecessary delay insertion.
05
Hard Timing / STA
What is OCV (On-Chip Variation)? What is the difference between flat OCV, AOCV, and POCV?

OCV (On-Chip Variation) acknowledges that cells at different locations on the same die do not experience identical conditions. Spatial gradients in temperature, VDD (due to IR drop), and manufacturing process (oxide thickness, doping) cause cells in different parts of the chip to have slightly different delays — even if they are the same cell type running at the same nominal conditions.

This matters for STA because the clock path and data path typically run through physically different areas of the chip. If both paths were derated the same way, the error would cancel. But since one may be faster and the other slower, we must be pessimistic.

Flat OCV (Flat Derating): Applies a single multiplicative derating factor to all cells. The launch data path is made slower (multiply delays by e.g. 1.05) and the capture clock path is made faster (multiply by 0.95) for setup — worst-case pessimism everywhere. Simple but overly conservative.

AOCV (Advanced OCV): Uses a lookup table indexed by path depth (number of logic stages) and distance. Longer paths with more stages average out variation — a 30-stage path has less cell-to-cell variation than a 2-stage path. AOCV assigns less derating to deep paths, reducing pessimism and improving timing convergence without sacrificing accuracy.

POCV (Parametric OCV / LVF): Uses full statistical distributions (mean and sigma) for each cell's delay, propagating uncertainties through the path using statistical addition. This is the most accurate method and is becoming the industry standard at 7nm and below, where AOCV is no longer pessimistic enough.

Qualcomm context: Snapdragon SoCs run at high frequency with tight timing margins. Moving from flat OCV to AOCV/POCV can recover 50–150 ps of setup slack that flat OCV was needlessly consuming — directly enabling higher clock frequencies or lower voltage operation.
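The flat-derating arithmetic can be illustrated with a toy setup check; all delays below are made-up picosecond values, not from any real library:

```python
# Flat-OCV setup check: derate the launch side slow, the capture side fast.
T_CLK = 1000.0   # clock period, ps (illustrative)
SETUP = 50.0     # capture flop setup time, ps

launch_clk, data_path, capture_clk = 300.0, 600.0, 300.0  # nominal delays, ps
LATE, EARLY = 1.05, 0.95                                  # flat derate factors

def setup_slack(derated: bool) -> float:
    late = LATE if derated else 1.0
    early = EARLY if derated else 1.0
    arrival = (launch_clk + data_path) * late       # latest data arrival
    required = T_CLK + capture_clk * early - SETUP  # earliest capture requirement
    return required - arrival

print(setup_slack(False))  # nominal slack: 350 ps
print(setup_slack(True))   # flat-OCV slack: ~290 ps (pessimism applied everywhere)
```

The 60 ps difference is pure margin consumed by flat derating; AOCV/POCV recover part of it by derating deep, averaged-out paths less aggressively.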
06
Hard Timing / STA
What is CRPR (Clock Reconvergence Pessimism Removal)? Why does it matter?

When STA analyzes a flip-flop-to-flip-flop path, the launch clock path (from clock source to FF1) and the capture clock path (from clock source to FF2) often share common clock buffers near the root of the clock tree before they diverge.

With OCV derating, the tool pessimistically applies opposite deratings to the launch and capture paths: the launch clock is made slower (derated up) and the capture clock is made faster (derated down) for setup analysis. But the shared portion of the two paths cannot simultaneously be both slow and fast — it is the same physical cell running at the same moment in time.

CRPR removes this double-counting. For the portion of clock tree that is common to both launch and capture paths, the STA tool calculates how much pessimism was added by applying opposite deratings to the same cells, and adds that amount back as credit. The formula:

CRPR credit = max_delay(common) − min_delay(common)

This credit is added back to the setup slack. Typical CRPR values range from 10 ps to 100 ps depending on how much of the clock tree is shared and how aggressive the OCV derating is.
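A numeric sketch of the credit calculation, using a made-up shared clock-tree delay and the same ±5% flat derates as a typical OCV setup:

```python
# CRPR: the common clock segment was derated both slow (launch view) and
# fast (capture view); credit back the difference, since one physical
# cell cannot be both at once.
LATE, EARLY = 1.05, 0.95
common_nominal = 500.0  # shared clock-tree delay, ps (illustrative)

max_common = common_nominal * LATE    # ~525 ps as seen by the launch path
min_common = common_nominal * EARLY   # ~475 ps as seen by the capture path
crpr_credit = max_common - min_common # ~50 ps added back to setup slack

print(crpr_credit)
```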

CRPR is sometimes called CPPR (Common Path Pessimism Removal) — both terms mean the same thing. Modern STA tools (PrimeTime, Tempus) apply it automatically.

Why it matters: Without CRPR, many paths that physically meet timing are flagged as violations. This causes unnecessary engineering effort to "fix" timing that is already correct. Enabling CRPR can reduce the number of failing paths by 20–40% without any design changes.
07
Hard CDC
How do you safely transfer a multi-bit data bus across clock domains? Why can't you just synchronize each bit independently?

Why per-bit synchronization fails: Each bit of the bus passes through its own 2-FF synchronizer independently. Each synchronizer may sample from a different source clock cycle — bit 3 might capture the value from cycle N while bit 0 captures the value from cycle N+1. The destination domain then reads a "torn" word that never existed in the source domain. For a 32-bit bus, this can produce completely wrong data.

Safe techniques for multi-bit CDC:

  • Gray code (for counters/pointers): If the bus is a counter that increments by one at a time, encode it in Gray code before the crossing. Only one bit changes per count, so a sampled-in-transition value is at most off by one — which FIFO logic tolerates.
  • Handshake (req/ack): Source asserts a request (req) after data has been stable for at least one source cycle. Destination synchronizes req (2-FF), samples the data only after req is asserted, then asserts ack. Source deasserts req after seeing synchronized ack. Both req and ack use separate 2-FF synchronizers. Low throughput (takes ~4–6 destination clock cycles per transfer) but works for any arbitrary data.
  • Asynchronous FIFO: For streaming data, use an async FIFO with Gray-coded pointers. The FIFO internally handles all multi-bit CDC safely.
  • Qualified sampling: Source keeps data stable for at least 3 destination clock cycles, then asserts a single "data valid" signal. Destination synchronizes the valid signal and samples the data on the synchronized valid. Risky — relies on the source holding data long enough.
Qualcomm modem chips have dozens of clock domains (CPU, DSP, RF, power management, audio) all communicating across boundaries. Multi-bit CDC handling is one of the most common bug sources in modem SoC development.
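The Gray encoding itself is just an XOR network in hardware; a quick Python model shows the property that makes it CDC-safe:

```python
def bin_to_gray(b: int) -> int:
    # Gray code: XOR the binary value with itself shifted right by one.
    return b ^ (b >> 1)

def gray_to_bin(g: int) -> int:
    # Decode by XOR-folding all higher bits down.
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

# The CDC-safe property: consecutive counts differ in exactly one bit,
# so a value sampled mid-transition is at most off by one count.
for n in range(255):
    diff = bin_to_gray(n) ^ bin_to_gray(n + 1)
    assert bin(diff).count("1") == 1

assert all(gray_to_bin(bin_to_gray(n)) == n for n in range(256))
```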
08
Medium CDC
What is a pulse synchronizer? When would you use it instead of a 2-FF level synchronizer?

A 2-FF level synchronizer is used when the source signal is a steady level that persists for many source clock cycles. The destination captures it safely after 2 destination clocks.

A pulse synchronizer is needed when the source generates a single-cycle pulse — a signal that is HIGH for exactly one source clock cycle. A 2-FF synchronizer cannot reliably capture this: if the destination clock is slower or at an unfortunate phase, the pulse may be missed entirely.

How a toggle-based pulse synchronizer works:

  • Source domain: A toggle flip-flop converts each incoming pulse into a level change. Every time a pulse arrives, the FF inverts its output. The toggle signal therefore holds its value until the next pulse — making it a persistent level that won't be missed.
  • Clock crossing: The toggle signal crosses the domain via a standard 2-FF synchronizer.
  • Destination domain: An XOR of the synchronized output and its one-cycle-delayed copy detects each edge → generates a clean single-cycle pulse in the destination domain.

Constraint: Source pulses must be spaced at least 3 destination clock cycles apart so the previous toggle has fully propagated through the synchronizer before the next pulse arrives. If pulses can arrive faster, use an async FIFO instead.

Use cases: Interrupt signals, one-shot event notifications, handshake request pulses — any scenario where the source generates a distinct, infrequent event and the destination must detect it exactly once.
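The toggle-synchronizer structure can be sketched as a small behavioral model stepped on the destination clock. This models only the structure (2-FF synchronizer plus one extra stage feeding the XOR); metastability itself is not modeled:

```python
# 'levels' is the source-domain toggle signal as sampled on each
# destination clock edge; returns the destination-domain pulse train.
def pulse_sync(levels):
    ff1 = ff2 = ff3 = 0
    out = []
    for level in levels:
        ff1, ff2, ff3 = level, ff1, ff2  # clock edge: all three FFs shift
        out.append(ff2 ^ ff3)            # XOR edge detect -> one-cycle pulse
    return out

# One source toggle (0 -> 1, held) yields exactly one destination pulse:
assert sum(pulse_sync([0, 0, 1, 1, 1, 1, 1, 1])) == 1
# A second toggle (back to 0) yields a second, separate pulse:
assert sum(pulse_sync([0, 1, 1, 1, 0, 0, 0, 0])) == 2
```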
09
Medium Low Power
What is UPF (Unified Power Format)? What does it define and why is it needed?

UPF (IEEE 1801) is a standard format for capturing the power intent of a chip design in a separate file that accompanies the RTL. As SoCs moved to multiple power domains, it became impossible to express power management purely in RTL — the RTL describes logical functionality, not which block gets what voltage or when a domain shuts off.

What UPF defines:

  • Supply networks: Which voltage rails exist (VDD_CPU, VDD_MODEM, VDD_AON), their nominal voltages, and how they connect to design blocks.
  • Power domains: Which RTL modules belong to which supply rail. Each domain has a defined primary power supply.
  • Power states: Which domains are ON or OFF in each operating mode (e.g., "sleep mode: modem ON, CPU OFF, AON ON").
  • Isolation cells: Specifies where isolation cells must be inserted at the boundary of power-gatable domains, and what value they should clamp to when the domain is off.
  • Retention registers: Which flip-flops need SRPG (State Retention Power Gating) cells to preserve state across a power-off event.
  • Level shifters: Where voltage-level-shifting cells are needed between domains running at different voltages.
  • Power switches: Header (PMOS) or footer (NMOS) transistors that gate the power supply to a domain.
Qualcomm uses UPF extensively in Snapdragon SoCs. A typical Snapdragon has 10+ power domains (CPU clusters, GPU, DSP, modem, camera, display, always-on). Without UPF, manually inserting and verifying isolation cells, retention registers, and level shifters across hundreds of domain boundaries would be error-prone and unmanageable.
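A minimal UPF 2.x-style sketch tying a few of these pieces together; domain, net, and signal names are hypothetical, and a real power-intent file would define supply sets, power states, and level shifters as well:

```tcl
create_power_domain PD_MODEM -elements {u_modem}
create_supply_port VDD_MODEM
create_supply_net  VDD_MODEM -domain PD_MODEM

# Clamp all modem outputs to 0 while the domain is powered off
set_isolation modem_iso -domain PD_MODEM \
    -applies_to outputs -clamp_value 0 \
    -isolation_signal iso_en -isolation_sense high

# Retain register state on the always-on rail across power-off
set_retention modem_ret -domain PD_MODEM \
    -retention_power_net VDD_AON \
    -save_signal {save high} -restore_signal {restore high}
```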
10
Medium Low Power
What are isolation cells and retention registers (SRPG)? When are they required?

Isolation cells are required at the output boundary of any power-gated domain. When a domain's power supply is cut, its flip-flops lose their state and outputs become undefined (float to a random value or X). If an always-on domain receives these floating signals, it may malfunction — latching garbage data, causing spurious state transitions, or drawing excessive short-circuit current.

An isolation cell is inserted on each output net of the power-gated block. It is connected to an always-on supply. When the domain is OFF, the isolation cell clamps the output to a safe known value (typically 0 for AND-based isolation, or 1 for OR-based) as specified in UPF. When the domain is ON, the isolation cell passes the signal through transparently.

Retention registers (SRPG — State Retention Power Gating) are special flip-flop variants with a small "shadow latch" connected to a separate always-on power rail (typically a low-leakage supply). The shadow latch needs only a few transistors, so it consumes a small fraction of the main FF's leakage.

Operation:

  • Before power-off: The power management controller sends a SAVE signal → each SRPG cell captures its current state into its shadow latch.
  • Domain is off: Main supply cut, shadow latch retains state at very low power.
  • After power-on: A RESTORE signal pushes the shadow state back into the main FF.

Without retention, the block must re-initialize from scratch after every power-up, adding latency and requiring software re-programming of registers.

11
Easy Low Power
What is a voltage island in SoC design? What cells are required at the boundaries?

A voltage island is a physically distinct region of the chip that operates at a different supply voltage from surrounding blocks. By running low-activity blocks at a lower V_DD, dynamic power scales as V², giving dramatic savings — dropping from 1.0V to 0.8V reduces dynamic power by 36%.
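The quadratic scaling is easy to sanity-check. Dynamic power is P = αCV²f, so at fixed activity, capacitance, and frequency the ratio reduces to (V_new/V_old)²:

```python
# Dynamic power scales with the square of the supply voltage
# when activity, capacitance, and frequency are held constant.
def dyn_power_ratio(v_new: float, v_old: float) -> float:
    return (v_new / v_old) ** 2

print(dyn_power_ratio(0.8, 1.0))  # ~0.64, i.e. a 36% dynamic power reduction
```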

Why Qualcomm uses voltage islands: A Snapdragon SoC has very different performance and power requirements across blocks. The modem baseband runs continuously but at moderate frequency. The application CPU cores spike to high performance on demand. The always-on sensor hub must run at <0.7V for weeks on battery. A single supply voltage optimized for the fastest block wastes enormous power in slower blocks.

Required boundary cells:

  • Level shifters (LS): Signals crossing between domains at different voltages must be shifted to the receiving domain's logic levels. A signal from a 0.8V domain HIGH (0.8V) is not guaranteed to be a valid HIGH in a 1.1V domain without level shifting. Level shifters are inserted on every signal crossing.
  • Isolation cells: If the lower-voltage island can be powered off completely, isolation cells (see previous question) are needed to clamp its outputs.
  • Level-shifting isolation cells: Combined cells that both shift voltage and isolate — used at boundaries between always-on and power-gatable domains at different voltages.
Design overhead: Level shifters add area, delay (~10–50 ps), and power. Proper floorplanning ensures domain boundaries are short, minimizing the number of crossing signals and therefore level shifters needed.
12
Easy Physical Design
What is Clock Tree Synthesis (CTS)? What does the tool try to achieve and what comes after it?

Clock Tree Synthesis (CTS) is the physical design step that builds the clock distribution network — a buffered tree that delivers the clock signal from the clock source (PLL output or pad) to every flip-flop's clock pin across the entire chip.

Goals of CTS:

  • Minimize clock skew: Every FF should see the clock edge at (nearly) the same time. Unbalanced trees create skew that consumes setup and hold timing margins.
  • Meet insertion delay target: Total latency from clock source to FF clock pins must be within the budgeted range (typically set in SDC via set_clock_latency).
  • Minimize clock power: The clock network toggles every cycle and can consume 30–40% of total chip dynamic power. The tool balances skew reduction against cell count and wire length.
  • Apply non-default routing rules (NDRs): Clock nets typically use special non-default routing rules (wider wires, extra spacing, preferred upper metal layers) for reduced resistance and better electromigration (EM) reliability.

Flow position: CTS runs after placement (cell locations are fixed) but before detailed routing. After CTS, timing analysis uses real clock arrival times instead of ideal clock assumptions — hold violations often emerge here because real clock trees have skew that didn't exist in pre-CTS analysis.

Post-CTS hold fixing: After CTS, the timer switches from ideal clock to propagated clock. Paths that were hold-clean with ideal clocks often fail with real clock skew. Hold fixing (inserting delay buffers) is a major post-CTS activity before proceeding to routing.
13
Medium Physical Design
What is IR drop in a VLSI design? How does it affect timing and how do you fix it?

IR drop is the voltage reduction along the power delivery network from the supply pins to the power pins of individual cells. The metal power grid has resistance (R), and the switching current (I) causes a voltage drop V = I × R. A cell operating at V_nominal − ΔV is slower than a cell at the full supply voltage.
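The first-order arithmetic is just Ohm's law; the numbers below are illustrative, not from any real sign-off:

```python
# Toy static IR-drop check: V_eff = V_nom - I * R along the power grid.
V_NOM = 0.8  # nominal supply, V (illustrative)

def effective_vdd(i_avg: float, r_grid: float) -> float:
    return V_NOM - i_avg * r_grid

v = effective_vdd(2.0, 0.025)  # 2 A through 25 mOhm of grid -> 50 mV drop
print(v)                       # ~0.75 V at the cell's power pin
```

A cell characterized at 0.8 V but actually operating at 0.75 V runs measurably slower, which is exactly the setup risk described below.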

Two types:

  • Static IR drop: Average current × grid resistance, determined by the long-term average switching activity. Used for DC power-integrity sign-off.
  • Dynamic (transient) IR drop: When a large number of cells switch simultaneously (e.g., a wide datapath all clocking at once), the instantaneous current surge far exceeds the average and the grid voltage transiently droops. The depth of the droop depends on package and grid inductance and is mitigated by decoupling capacitance. This "voltage droop" is worse than static IR drop and is the primary concern at high frequencies.

Effect on timing: In a high-IR-drop region, cells are slower than characterized at nominal voltage. A path that passes STA at nominal conditions may violate setup timing in silicon due to IR-induced delay increase. Hold violations are less common (slower cells improve hold margin).

Fixes:

  • Widen power stripes or add more power mesh layers
  • Add decoupling capacitors (decaps) near high-switching density regions
  • Spread high-activity cells during placement to avoid current hot spots
  • Use power gating with controlled wake-up sequences to avoid simultaneous switching
  • In STA: apply voltage derating in high-IR-drop regions for more accurate sign-off
14
Medium Physical Design
What is the antenna effect in VLSI fabrication? How is it detected and fixed?

During VLSI fabrication, metal layers are deposited and patterned one at a time using plasma etching. Plasma charges accumulate on exposed metal wires during etching. If a long metal wire is already connected to a transistor gate oxide but NOT yet connected to a diffusion region (which would discharge the charge safely), the accumulated charges can create a large voltage across the thin gate oxide — sufficient to cause permanent gate oxide damage: threshold voltage shifts, increased leakage, or immediate breakdown.

The antenna ratio = (metal area of the wire connected to the gate) / (gate oxide area). Process Design Kits (PDKs) specify maximum allowable antenna ratios (typically 400–1000 for metal, 200–600 for vias). Exceeding this ratio means the wire can accumulate enough charge to damage the oxide.
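The rule check itself is a simple area ratio against the PDK limit; the limit and areas below are assumed, illustrative values:

```python
# Antenna-rule check: charge-collecting metal area vs. gate-oxide area.
def antenna_ratio(metal_area_um2: float, gate_area_um2: float) -> float:
    return metal_area_um2 / gate_area_um2

MAX_RATIO = 400.0  # assumed PDK limit for this metal layer

assert antenna_ratio(2000.0, 10.0) <= MAX_RATIO       # ratio 200: passes
assert not antenna_ratio(8000.0, 10.0) <= MAX_RATIO   # ratio 800: violation,
                                                      # fix with jumper or diode
```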

How it's detected: The router's DRC (Design Rule Check) engine computes the cumulative antenna ratio for every net using the partial routing built up layer by layer. If it exceeds the limit, an antenna violation is flagged.

Fixes:

  • Metal jumper (layer hopping): Break the long wire by jumping to a higher metal layer and back. This "resets" the antenna accumulation because higher-layer routing is done later, after diffusion connections have been made. Most common fix.
  • Antenna diode: Insert a reverse-biased diode near the gate, connected to the same metal wire. During plasma etching, the diode provides a discharge path to substrate, preventing charge buildup. Small area cost, always effective.
  • Reduce net length: Re-route the net to use shorter wires on lower layers.
15
Easy Verification / DFT
What is the difference between functional coverage and code coverage? Which is more important?

Code coverage measures how much of the RTL source code was exercised by the simulation:

  • Line/statement coverage: Were all lines of RTL executed?
  • Branch coverage: Were both sides of every if/else and every case arm taken?
  • Toggle coverage: Did every signal toggle both 0→1 and 1→0?
  • FSM coverage: Were all states visited and all transitions taken?

Code coverage is automatically collected by the simulator with no extra specification — easy to get, but tells you nothing about what scenarios were verified. You can hit 100% branch coverage while never testing the most critical protocol corner case.

Functional coverage is user-defined. The verification engineer specifies which scenarios, protocol states, and parameter combinations are important to verify — then measures whether simulations actually exercised them:

  • Was an AXI4 burst of ARLEN=255 (256 beats) issued?
  • Did a FIFO simultaneously receive a write and a read when exactly one slot was free?
  • Did a CDC crossing happen with data changing every source cycle?

Which matters more? Both are necessary; neither alone is sufficient. Code coverage ensures no dead code was accidentally left un-exercised. Functional coverage ensures the right scenarios were tested. A mature sign-off process requires both to be above target (typically 95%+ code coverage, 100% defined functional coverpoints).
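In SystemVerilog, functional coverage is written with covergroups; the bin-tracking idea behind them can be sketched in a few lines of Python (bin choices are illustrative):

```python
# Minimal functional-coverage model: user-defined bins, hit tracking, a score.
class Covergroup:
    def __init__(self, bins):
        self.bins = bins  # name -> predicate over sampled values
        self.hits = {name: False for name in bins}

    def sample(self, value):
        for name, pred in self.bins.items():
            if pred(value):
                self.hits[name] = True

    def coverage(self):
        return 100.0 * sum(self.hits.values()) / len(self.hits)

# Cover AXI read burst lengths: single beat, short bursts, and the 256-beat max.
cg = Covergroup({
    "single":    lambda arlen: arlen == 0,
    "short":     lambda arlen: 1 <= arlen <= 15,
    "max_burst": lambda arlen: arlen == 255,
})
for arlen in (0, 4, 8):
    cg.sample(arlen)
print(cg.coverage())  # ~66.7: the max-burst scenario was never exercised
```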

16
Medium Verification / DFT
Describe the key components of a UVM (Universal Verification Methodology) testbench. How does it differ from a traditional directed testbench?

UVM (IEEE 1800.2) is a standardized SystemVerilog methodology for building reusable, scalable verification environments using an object-oriented framework. It replaces brittle, one-off directed testbenches.

Key UVM components:

  • uvm_test: Top-level test class. Selects which scenario/sequence to run and configures the environment. Different tests reuse the same TB infrastructure.
  • uvm_env: Container that instantiates and connects agents, scoreboards, and coverage collectors for one DUT.
  • uvm_agent: Models one protocol interface (e.g., AXI4 master). Contains: Driver (applies stimulus to DUT pins), Monitor (observes DUT pins and creates transaction objects), Sequencer (arbitrates between sequences and feeds items to the driver).
  • uvm_sequence / uvm_sequence_item: Defines the actual stimulus transactions. Sequences can be layered (a higher-level sequence calls lower-level sequences) and constrained-random.
  • uvm_scoreboard: Compares DUT output (from monitor) against a reference model's expected output. Reports pass/fail.
  • TLM ports (uvm_analysis_port): Standardized communication channels between components — no direct references between classes.

Vs. directed testbench: A directed testbench hand-codes every stimulus vector — it only tests what the engineer explicitly wrote. A UVM testbench with constrained-random stimulus explores the full stimulus space automatically within user-specified constraints, finding corner cases no human would write by hand.

Coverage-driven verification: UVM enables "close-the-loop" verification: run simulations → check functional coverage → add constraints to target uncovered scenarios → repeat until all coverpoints hit. This replaces guesswork with a systematic, measurable sign-off process.
17
Medium Verification / DFT
What are the main fault models used in ATPG? What does each model test for?

ATPG (Automatic Test Pattern Generation) tools model physical manufacturing defects as logical faults and generate patterns to detect them. The main fault models are:

  • Stuck-At Fault (SAF): A wire is permanently stuck at logic 0 (SA0) or 1 (SA1), regardless of what drives it. Models open circuits, resistive shorts to VDD/GND, and broken connections. The most widely used model. A stuck-at fault is detected by finding a test that excites the fault (drives the opposite value) and propagates the effect to a primary output or scan chain output. Industry target: 95–99% fault coverage.
  • Transition Delay Fault (TDF): Tests whether a net can make a complete 0→1 or 1→0 transition within one clock cycle. Detects resistive defects that don't prevent correct logic levels but slow transitions — critical at high frequency where even a slightly slow net causes a setup violation. TDF requires two-pattern tests: launch the transition, then capture the response one cycle later.
  • Path Delay Fault (PDF): Tests the end-to-end propagation delay of a specific signal path. More accurate timing characterization than TDF — detects accumulated small delays across many gates. Requires many patterns but provides the most complete timing sign-off.
  • Bridging Fault: Models an unintended short between two adjacent nets, which merges the two signals via wired-AND or wired-OR behavior. Increasingly important at 7nm/5nm, where metal pitch is very tight and shorts between adjacent wires are a common defect.
  • Cell-Aware Fault: Tests for defects inside standard cells at the transistor level (open/short in the cell's internal netlist). Catches defects that SAF, modeled at the cell's logical interface, would miss.
Qualcomm mobile chips use both SA and transition delay fault testing at production. High volume = even a 0.01% defect escape rate means thousands of field failures. Comprehensive fault coverage is non-negotiable.
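The excite-and-propagate idea behind stuck-at testing can be shown on a toy netlist, y = (a AND b) OR c, with the internal AND output stuck at 0. Real ATPG uses path-sensitization algorithms (D-algorithm, PODEM), not the brute-force enumeration sketched here:

```python
from itertools import product

def good(a, b, c):
    return (a & b) | c

def faulty(a, b, c):
    # The AND-gate output node is stuck-at-0 regardless of a and b.
    return 0 | c

# A vector detects the fault iff the good and faulty outputs differ:
tests = [(a, b, c) for a, b, c in product((0, 1), repeat=3)
         if good(a, b, c) != faulty(a, b, c)]
print(tests)  # [(1, 1, 0)]: excite the AND (a=b=1), propagate past the OR (c=0)
```

Only one of the eight input vectors both excites the fault and propagates its effect to the output, which is why coverage for hard-to-sensitize nodes drives pattern count.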
18
Medium Protocols
What is MIPI CSI-2? How is it used in mobile SoCs and what are its key electrical characteristics?

CSI-2 (Camera Serial Interface 2) is a MIPI Alliance standard for connecting image sensors to application processors. It is the dominant camera interface in smartphones — virtually every mobile camera uses CSI-2.

Physical layer (D-PHY): CSI-2 uses MIPI D-PHY, a differential serial interface with two operating modes:

  • High-Speed (HS) mode: Low-swing differential signaling (100–300 mV differential) at 80 Mbps to 4.5 Gbps per lane. Used for pixel data transmission.
  • Low-Power (LP) mode: CMOS-level single-ended signaling. Used for control, synchronization, and lane management. Much lower speed.

Architecture: One clock lane + 1 to 4 data lanes. Each lane is a differential pair (DP/DN). For a quad-lane link at 4.5 Gbps/lane, total raw bandwidth = 4 × 4.5 = 18 Gbps. That is enough for, e.g., a 50 MP sensor streaming 10-bit pixels at 30 fps (~15 Gbps raw).
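A quick sanity check of the lane-bandwidth arithmetic; the sensor numbers are illustrative, and packet/protocol overhead is ignored:

```python
# Raw sensor bandwidth vs. raw CSI-2 link bandwidth.
def sensor_gbps(mpixels: float, fps: float, bits_per_pixel: int) -> float:
    return mpixels * 1e6 * fps * bits_per_pixel / 1e9

def link_gbps(lanes: int, gbps_per_lane: float) -> float:
    return lanes * gbps_per_lane

print(link_gbps(4, 4.5))        # 18.0 Gbps available on a quad-lane D-PHY link
print(sensor_gbps(50, 30, 10))  # 15.0 Gbps needed: 50 MP, 30 fps, 10-bit pixels
```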

Virtual channels: Up to 4 virtual channel IDs allow multiple cameras to share the same physical CSI-2 interface, multiplexed by the sensor or ISP.

C-PHY (newer alternative): Uses 3-wire "trios" with 3-phase encoded signaling that carries ~2.28 bits per symbol, so a trio running at 2.5 Gsym/s delivers ~5.7 Gbps, a higher effective data rate without raising the symbol frequency. Used in high-resolution cameras where D-PHY lane count limits bandwidth.

VLSI implementation: The CSI-2 receiver on a Snapdragon SoC consists of a D-PHY frontend (analog deserializer), a lane merger, a CSI-2 protocol decoder, and an interface to the Image Signal Processor (ISP). It must process pixels faster than they arrive to prevent FIFO overflow — typically 500 MHz+ operating frequency.
