A latch is a level-sensitive storage element. When the enable (or clock) is HIGH, the output follows the input continuously — the latch is "transparent." When the enable goes LOW, the last value is held. A flip-flop is edge-triggered — it samples the input only at the exact rising (or falling) clock edge and ignores the input at all other times.
Why flip-flops dominate RTL design: Their predictable sampling window makes static timing analysis (STA) straightforward — setup and hold times are well-defined relative to one clock edge. Latches create "transparent windows" that make STA far more complex; timing tools must ensure that no combinational path through an open latch violates timing in any cycle.
When to use latches deliberately:
- Power savings: A latch presents roughly half the internal clock load of a master–slave flip-flop (one storage stage instead of two), and while its enable is held steady there is no per-edge clock-to-output toggling — lower clock dynamic power.
- High-performance pipelines: In "latch-based" designs (common in custom datapaths and CPUs), a pair of transparent latches on opposite clock phases (master + slave) forms a pseudo-FF but permits time-borrowing — a slow first half-cycle can steal time from a fast second half-cycle, improving throughput.
- Specialized cells: Sense amplifiers and certain memory cells use latch-based structures.
Note on RTL inference: a latch is described by an incomplete assignment, i.e., D = en ? data_in : Q; when the same enable pattern guards a register inside a clocked block, the synthesis tool maps it to a clock-gate cell, not a latch.

Dividing a clock by an odd number while achieving a 50% duty cycle requires using both clock edges. A single-edge divide-by-3 counter can only produce a 33%/67% duty cycle.
The technique: Create two signals derived from the same mod-3 counter — one toggled on the rising edge, one on the falling edge — then OR them together.
- Use a 2-bit counter clocked on the rising edge, counting 0→1→2→0. Generate out_r = HIGH when count == 0, LOW otherwise.
- Use the same counter logic clocked on the falling edge. Generate out_f identically.
- Final output = out_r OR out_f. Because out_r and out_f are offset by half a source clock period, their OR is HIGH for exactly 1.5 periods and LOW for 1.5 periods out of every 3 source periods — a 50% duty cycle.
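A sketch of this divider in SystemVerilog (module and signal names are mine; an asynchronous active-low reset is assumed):

```systemverilog
// Divide-by-3 with 50% duty: OR of two 1-of-3 pulses offset by half a period.
module div3_50 (
  input  logic clk, rst_n,
  output logic clk_out
);
  logic [1:0] cnt_r, cnt_f;

  // Mod-3 counter on the rising edge.
  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n) cnt_r <= 2'd0;
    else        cnt_r <= (cnt_r == 2'd2) ? 2'd0 : cnt_r + 2'd1;

  // Same counter logic on the falling edge (half a period later).
  always_ff @(negedge clk or negedge rst_n)
    if (!rst_n) cnt_f <= 2'd0;
    else        cnt_f <= (cnt_f == 2'd2) ? 2'd0 : cnt_f + 2'd1;

  // Each pulse is HIGH for 1 of every 3 source periods; the OR covers 1.5.
  assign clk_out = (cnt_r == 2'd0) | (cnt_f == 2'd0);
endmodule
```

In a real flow, clk_out would also need to be declared as a generated clock in the timing constraints.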
The FIFO must hold all data written during a burst before the reader catches up. The minimum depth is:
Depth ≥ (Write rate − Read rate) × Burst duration + Synchronizer latency guard
Breaking this down:
- Burst excess: If a writer sends at fw words/cycle for T cycles, and the reader drains at fr words/cycle, the net accumulation is (fw − fr) × T words. This is the minimum storage needed.
- Synchronizer latency: The Gray-code read pointer takes 2–3 destination-clock cycles to propagate through the synchronizer. During this window, the write side may falsely see the FIFO as full (or the read side sees it as empty). Add 2–3 words of margin per side.
- Round up to power of 2: Async FIFO address arithmetic requires a power-of-2 depth so that the gray code pointer MSB inversion trick for full/empty detection works correctly.
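A worked example of the sizing arithmetic, with invented burst numbers (rates normalized to the write clock):

```systemverilog
// Hypothetical burst: writer pushes 1 word/cycle for 120 cycles; the reader
// drains 0.5 words per write-clock cycle during the burst.
module fifo_depth_calc;
  localparam int BURST_LEN = 120;
  localparam int DRAINED   = BURST_LEN / 2;        // reader removes 60 words
  localparam int EXCESS    = BURST_LEN - DRAINED;  // (fw - fr) * T = 60
  localparam int MARGIN    = 4;                    // sync-latency guard, ~2/side
  localparam int DEPTH     = 64;                   // next power of 2 >= 64
  initial $display("min words = %0d -> depth %0d", EXCESS + MARGIN, DEPTH);
endmodule
```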
A glitch (or hazard) is a spurious, short-duration output pulse that occurs when multiple inputs change simultaneously and race through paths of unequal delay. Even though the steady-state output is correct, the transient produces an unwanted transition.
Classic example: A static-0 hazard at a 2-input AND gate whose inputs both derive from the same signal A through paths of unequal delay — one direct, one through an inverter. Mathematically A AND NOT(A) = 0, but when A rises, the faster direct path presents 1 while the inverter output still holds 1 from the old value, so the gate briefly sees 1 AND 1 = 1 before the inverted path settles.
Why glitches matter:
- Power waste: Every glitch is a switching event that consumes dynamic power (αCV²f). High-activity buses can waste significant power.
- Clock path corruption: A glitch on a clock or enable line can clock a flip-flop at the wrong time, causing functional failure.
- Latch transparency: Glitches on a latch enable propagate directly to the latch output while it is transparent.
Prevention:
- Register outputs: Sampling glitchy combinational logic in a flip-flop means only the value settled at each clock edge propagates; any glitch that dies out before the setup window is filtered entirely.
- Hazard-free logic: In Karnaugh map minimization, add "consensus" prime implicant terms that cover the transition between any two adjacent groups — eliminates static hazards.
- Clock gating cells (ICG): Use library clock gate cells that latch the enable on the clock LOW phase — ensures the gated clock output is always a complete pulse or no pulse at all.
A naive assign clk_out = sel ? clk1 : clk0 will produce a glitch when sel changes — the output can get a truncated pulse from one clock or a merged pulse from both. This corrupts any flop clocked by clk_out.
The safe design uses an interlocked two-branch structure:
- Each branch has a flip-flop clocked on the falling edge of its own clock to gate that branch on or off.
- Branch 0 FF: D = !sel AND !en1_q, clocked on negedge clk0. Branch 1 FF: D = sel AND !en0_q, clocked on negedge clk1.
- Each branch gates its own clock: clk0_g = clk0 AND en0_q, clk1_g = clk1 AND en1_q.
- Output: clk_out = clk0_g OR clk1_g.
Why falling-edge clocking? Gating on the falling edge ensures that the enable change is captured while the clock is LOW, so the gated clock output is either a complete HIGH pulse or nothing — never a partial pulse.
Why the cross-interlocking? The !en1_q / !en0_q terms ensure only one branch is ever active at a time. The transition from clk0 to clk1 requires clk0's branch to deassert fully before clk1's branch asserts — preventing both from being active simultaneously.
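The interlocked structure above can be sketched as follows (module and signal names are mine; a production version would also add 2-FF synchronizers on sel and on each cross-coupled enable, since those signals cross between the two clock domains):

```systemverilog
// Glitch-free 2:1 clock mux sketch: falling-edge enables, cross-interlocked.
module clk_mux (
  input  logic clk0, clk1, sel, rst_n,
  output logic clk_out
);
  logic en0_q, en1_q;

  // Branch 0 enable: asserted only when sel selects clk0 AND branch 1 is off.
  always_ff @(negedge clk0 or negedge rst_n)
    if (!rst_n) en0_q <= 1'b0;
    else        en0_q <= ~sel & ~en1_q;

  // Branch 1 enable: asserted only when sel selects clk1 AND branch 0 is off.
  always_ff @(negedge clk1 or negedge rst_n)
    if (!rst_n) en1_q <= 1'b0;
    else        en1_q <= sel & ~en0_q;

  assign clk_out = (clk0 & en0_q) | (clk1 & en1_q);
endmodule
```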
Setup time (t_su) is the minimum time the data input must be stable before the active clock edge for the flip-flop to reliably capture it. Hold time (t_h) is the minimum time the data must remain stable after the clock edge.
Together they define a "forbidden window" around the clock edge where data must not change.
Setup violation (data arrives too late): The flip-flop samples data before it has settled to a valid logic level. The FF may capture the wrong value or enter a metastable state. This is a functional failure at the target frequency — the design either works slowly or not at all. Setup violations are frequency-dependent: slow down the clock enough and they disappear.
Hold violation (data changes too soon after the clock edge): The flip-flop's captured value is overwritten before it is fully stored. This causes the FF to capture a corrupted value — either the new data that hasn't fully arrived, or garbage. Hold violations are frequency-independent — they occur even at 1 Hz and are caused by short combinational paths (fast data propagation relative to clock skew). They are the more dangerous class because no amount of slowing down the clock fixes them.
Yes — the fixes are fundamentally different and must not be confused.
Fixing a setup violation (data arrives too late — reduce data path delay):
- Swap in faster cell variants — higher drive strength, or lower Vt (high-Vt cells are slower but leak less; move critical-path cells to low-Vt)
- Reduce logic depth — restructure combinational logic to fewer gate stages
- Use retiming — move registers across combinational logic to balance stages
- Add pipeline registers to split a long path into two shorter ones
- Optimize clock skew — use positive skew on the capture FF (delay the capture clock) to give the data path more time
- Last resort: reduce the clock frequency
Fixing a hold violation (data arrives too early — increase minimum data path delay):
- Insert delay buffers (DEL cells from the standard cell library) on the short data path
- Swap to slower cell variants — lower drive strength or higher Vt cells have more intrinsic delay
- Add logic stages that cancel each other (insert an even number of inverters)
- Adjust clock skew: negative skew on the capture FF (advance the capture clock) reduces the hold window
Clock skew is a static, deterministic difference in clock arrival time between two flip-flops on the same chip. It is fixed for a given netlist and process corner. Skew arises from different buffer depths or wire lengths in the clock tree.
Clock jitter is a dynamic, cycle-to-cycle variation in the clock edge position. It is random (caused by power supply noise, substrate coupling, PLL VCO noise) and varies every cycle. You cannot predict the sign or magnitude of jitter in any given cycle.
Effect on timing:
- Skew and setup: Positive skew (capture FF clock arrives later) helps setup — the data has more time to travel. Negative skew (capture clock earlier) hurts setup.
- Skew and hold: Positive skew hurts hold — data launched early from the launch FF might arrive at the capture FF before its (delayed) clock edge. Negative skew helps hold.
- Jitter and both: Jitter degrades both setup and hold because you can't know which direction the edge will shift. STA tools add a "clock uncertainty" (a worst-case jitter margin) that reduces both setup and hold slack. It cannot be recovered through skew optimization.
For a path from a launch flip-flop (FF1) to a capture flip-flop (FF2):
Setup check — data must arrive before the capture edge:
- Data arrival time = T_clk_launch + T_cq(FF1) + T_comb
- Data required time = T_clk_capture − T_setup(FF2), where T_clk_capture is the capture edge one period after the launch edge, adjusted by skew
- Setup slack = Required − Arrival = (T_clk_capture − T_su) − (T_clk_launch + T_cq + T_comb)
Hold check — data must not arrive too early:
- Data must arrive after T_clk_capture + T_hold(FF2), where T_clk_capture here is the same edge as the launch edge, adjusted by skew
- Hold slack = Arrival − Required = (T_clk_launch + T_cq_min + T_comb_min) − (T_clk_capture + T_hold)
Where:
- T_cq = clock-to-Q propagation delay of the launch FF
- T_comb = total combinational path delay (sum of gate + wire delays)
- T_setup / T_hold = FF timing constraints from the cell library
- T_clk_capture − T_clk_launch = clock skew (positive = capture is later)
Slack = margin above the requirement. Positive slack → timing met. Negative slack → timing violated. The most negative slack in the design = worst negative slack (WNS); summing all negative slacks = total negative slack (TNS).
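Plugging hypothetical numbers into these equations (zero skew assumed; all values invented for illustration):

```systemverilog
// Worked setup/hold slack check for a 1 ns clock, zero skew.
module slack_calc;
  localparam real T_PERIOD = 1.00;  // ns, so T_clk_capture - T_clk_launch = 1.0
  localparam real T_CQ     = 0.10;  // clock-to-Q of launch FF
  localparam real T_COMB   = 0.65;  // combinational path delay
  localparam real T_SU     = 0.08;  // setup requirement of capture FF
  localparam real T_HOLD   = 0.05;  // hold requirement of capture FF
  initial begin
    real setup_slack = (T_PERIOD - T_SU) - (T_CQ + T_COMB); // 0.17 ns, met
    real hold_slack  = (T_CQ + T_COMB) - T_HOLD;            // 0.70 ns, met
    $display("setup slack = %.2f ns, hold slack = %.2f ns",
             setup_slack, hold_slack);
  end
endmodule
```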
A simultaneous setup and hold violation on the same path means the combinational delay window is too narrow — the path is fast enough to threaten hold, but not fast enough to meet setup. This typically means the logic between two FFs is very shallow (perhaps only a wire or one gate), but there is also a clock tree imbalance creating large skew.
Diagnosis first:
- Check the clock skew between launch and capture FFs. Large positive skew worsens hold (the late capture clock extends the hold requirement), while large negative skew worsens setup — with a borderline data path, excessive skew can push the same path into trouble at both the min-delay and max-delay corners.
- Look at the path's actual combinational depth — very few gates means it's a structurally short path.
Fix strategy:
- Rebalance the clock tree first: Reduce skew between these two FFs. This is the most targeted fix — less skew directly improves both simultaneously.
- Insert delay cells: Add buffers on the data path to increase minimum delay (fix hold). Then verify setup is still met — if setup is tight, you may need to also optimize the logic depth.
- Restructure logic: If setup is violated because the path is in a long combinational chain overall, pipelining it (adding an intermediate register) can help. But this changes the design architecture.
Metastability occurs when a flip-flop's setup or hold time is violated — the flip-flop enters a metastable state where its output is neither a valid logic 0 nor a valid logic 1. The internal node of the FF is stuck near the switching threshold (V_DD/2) and takes an unpredictable time to resolve to a valid level.
The physics: A flip-flop is a bistable element with two stable equilibria (0 and 1) and one unstable equilibrium (the metastable point). When forced near the unstable point, the output diverges toward a valid level exponentially with time constant τ — but the initial displacement is set by thermal noise, so the resolution time is random.
Can it be eliminated? No — not completely. Any time asynchronous data crosses a clock boundary, there is a non-zero probability of violating the setup/hold window. The probability of remaining metastable beyond a time T_r decreases exponentially with T_r, but never reaches exactly zero.
What we do instead: We manage the probability using synchronizers. The key metric is MTBF (Mean Time Between Failures). A 2-flop synchronizer gives the metastable FF one full clock period to resolve — in modern CMOS (τ ≈ 30ps), at 1 GHz this gives MTBF of thousands of years, making failure astronomically unlikely.
A 2-flop synchronizer consists of two back-to-back flip-flops, both clocked by the destination domain clock, inserted on a signal crossing from another clock domain.
How it works: The first FF may go metastable when it samples the asynchronous input. It has one full clock period (minus the FF's own propagation delay and the second FF's setup time) to resolve. Because metastability resolution time is exponential, the probability that it remains metastable long enough to corrupt the second FF is extremely small.
Why not 1 flop? One flop doesn't give enough resolution time: its possibly-metastable output feeds functional logic directly, so it must resolve within roughly T_clk − T_cq − T_setup of the receiving stage — a few hundred ps at 1 GHz. The probability of remaining metastable that long is non-trivial in some technologies.
Why not 3 flops? Three flops are rarely necessary. With a 1 GHz destination clock and τ ≈ 20–30 ps (modern nodes), a 2-flop synchronizer gives:
- Resolution time T_r ≈ T_clk − T_cq − T_su ≈ 0.8–0.9 ns, so T_r/τ ≈ 30–45
- MTBF ≈ e^(T_r/τ) / (t_w × f_c × f_d); the exponential factor (e^30 ≈ 10^13 and beyond) puts the MTBF at years to millions of years for realistic window widths and data rates
Three flops extend MTBF to an astronomically larger number that adds no practical benefit. Use 3 flops only in safety-critical applications (automotive ASIL-D, aerospace) where the MTBF achievable with 2 flops cannot be demonstrated to meet the requirement.
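The 2-flop structure itself is two lines of RTL; a sketch (module name and the vendor attribute are illustrative):

```systemverilog
// Standard 2-flop level synchronizer for a single-bit CDC.
module sync_2ff (
  input  logic clk_dst, rst_n, async_in,
  output logic sync_out
);
  // Optional vendor-specific hint (e.g., Xilinx) to keep the flops adjacent.
  (* ASYNC_REG = "TRUE" *) logic meta_q;  // first flop: may go metastable

  always_ff @(posedge clk_dst or negedge rst_n)
    if (!rst_n) {sync_out, meta_q} <= 2'b00;
    else        {sync_out, meta_q} <= {meta_q, async_in};
endmodule
```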
In an async FIFO, the read pointer lives in the read clock domain and the write pointer lives in the write clock domain. Each pointer must be compared against the other's synchronized version to determine full or empty.
The problem with binary counters: When a binary counter increments, multiple bits change simultaneously. For example, 0111 → 1000 changes all 4 bits. If you sample a binary counter while it's transitioning, you might read any of the 16 possible values — a catastrophic error that could falsely declare the FIFO full or empty, corrupting data.
Why Gray code solves this: A Gray code changes exactly one bit per count. When the pointer transitions from count N to N+1, only one bit flips. If the synchronized copy is sampled mid-transition, the worst case is that it sees either count N or count N+1 — off by at most one.
The FIFO full/empty logic tolerates this one-count error gracefully: a pointer sampled one count behind reality can only make the FIFO look full slightly early or empty slightly late — both errors are pessimistic, never unsafe — so no data is ever lost or overwritten.
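The pointer conversions are short in SystemVerilog; a sketch for 5-bit (N+1 = 5) pointers, with function names of my choosing:

```systemverilog
// Binary -> Gray: each output bit is the XOR of adjacent binary bits.
function automatic logic [4:0] bin2gray(input logic [4:0] b);
  return b ^ (b >> 1);
endfunction

// Gray -> binary: prefix XOR from the MSB down (needed only when the
// absolute binary address is required, e.g., for memory indexing).
function automatic logic [4:0] gray2bin(input logic [4:0] g);
  logic [4:0] b;
  b[4] = g[4];
  for (int i = 3; i >= 0; i--)
    b[i] = b[i+1] ^ g[i];
  return b;
endfunction
```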
gray = bin XOR (bin >> 1). Store the Gray counter as the pointer; convert back to binary (a chain of XORs) only if you need the absolute address for memory indexing.

MTBF (Mean Time Between Failures) quantifies how often a synchronizer is expected to allow a metastable signal to propagate into the destination domain. The standard formula is:
MTBF = e^(T_r / τ) / (f_c × f_d × t_w)
Where:
- T_r — resolution time available for metastability to resolve (≈ T_clk − T_cq_ff1 − T_su_ff2). The more time, the exponentially higher the MTBF.
- τ — technology metastability time constant. A smaller τ means the FF resolves faster, improving MTBF. Scales with process: ~100ps at 180nm, ~30ps at 7nm.
- f_c — destination clock frequency. Higher frequency = more sampling opportunities per second = more chances for metastability to cause failure.
- f_d — data toggle rate. How often does the incoming signal change near the clock edge?
- t_w — the setup+hold window width. Narrower window = smaller probability of entering metastability per clock cycle.
The exponential dependence on T_r/τ is why adding a second synchronizer flip-flop dramatically improves MTBF — it adds one full clock period to T_r.
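To see the exponential sensitivity concretely, a worked evaluation with assumed values (T_r, τ, t_w, and f_d are all illustrative):

```systemverilog
// MTBF = e^(T_r/tau) / (t_w * f_c * f_d), evaluated with assumed numbers.
module mtbf_calc;
  localparam real T_R = 900.0e-12;  // resolution time ~ Tclk - Tcq - Tsu
  localparam real TAU =  20.0e-12;  // metastability time constant (assumed)
  localparam real T_W =  20.0e-12;  // setup+hold capture window (assumed)
  localparam real F_C =   1.0e9;    // 1 GHz destination clock
  localparam real F_D =  10.0e6;    // data toggles at 10 MHz
  initial begin
    real mtbf = $exp(T_R / TAU) / (T_W * F_C * F_D);
    // T_R/TAU = 45, so e^45 ~ 3.5e19 dominates: MTBF ~ 1.7e14 s,
    // i.e., millions of years. Halving T_R would slash this exponentially.
    $display("MTBF = %e seconds", mtbf);
  end
endmodule
```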
Dynamic power = α × C_L × V_DD² × f — consumed when a node switches from 0 to 1 (charges the load capacitance) or 1 to 0 (discharges through the pull-down network). α is the activity factor (fraction of clock cycles the node switches). This power is zero when the circuit is idle.
Static (leakage) power = I_leakage × V_DD — consumed even when no gates are switching, due to sub-threshold current, gate oxide tunneling, and junction leakage. It does not depend on frequency and is present whenever power is supplied.
Historical trend: In nodes above 90nm, dynamic power dominated. Below 28nm and especially at 7nm/5nm, leakage has grown dramatically because sub-threshold current rises exponentially as threshold voltages are scaled down — transistors no longer switch fully off. Modern SoCs spend significant area and design effort on leakage management.
Reduction techniques:
- Dynamic: Clock gating (reduce α), operand isolation (prevent toggling of datapath), voltage scaling (V² dependence), frequency scaling, low-swing signaling
- Static: Power gating (cut V_DD to an entire domain via header/footer cells), multi-Vt design (use High-Vt cells in non-critical paths — slower but much lower leakage), reverse body biasing, state retention during power down
Clock gating removes the clock signal from a register bank when its value won't change, eliminating the clock-to-Q dynamic power and the switching power of all downstream logic. The clock network can account for 30–40% of total chip dynamic power, making clock gating one of the highest-impact power techniques.
Why you cannot simply write assign gated_clk = clk AND enable:
If enable changes while clk is HIGH, the AND gate output glitches — it produces a truncated clock pulse shorter than a full cycle. This truncated pulse can violate the setup/hold requirements of any flip-flop it clocks, corrupting stored data or causing metastability.
The correct implementation — Integrated Clock Gating (ICG) cell:
- A latch samples the enable signal on the LOW phase of the clock (when clock = 0)
- The latched enable is then ANDed with the clock
- Because the latch captures enable only when the clock is LOW, the latch output is stable by the time the clock rises — the AND gate sees a stable enable and a clean rising edge → full clock pulse or no pulse, never a partial one
In RTL, you write: if (enable) register <= data; and the synthesis tool infers an ICG cell. Never write clock gating manually at the gate level in RTL — let the tool use the optimized library ICG cell.
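For reference, the ICG cell's behavior (not something to hand-code in a real flow, as the text notes) can be modeled as:

```systemverilog
// Behavioral model of an ICG cell: low-phase-transparent latch + AND.
module icg (
  input  logic clk, enable,
  output logic gclk
);
  logic en_latched;

  always_latch
    if (!clk) en_latched = enable;  // transparent only while clk is LOW

  assign gclk = clk & en_latched;   // clean full pulses, never truncated
endmodule
```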
A scan chain connects the flip-flops in a design into a long shift register that can be controlled and observed from the chip's I/O pins, purely for testing purposes.
How it works: Each flip-flop in the design is replaced with a scan flip-flop — identical to a normal FF but with an extra 2:1 mux at the data input:
- In functional mode (scan_enable = 0): the mux passes normal D input — the design operates as designed.
- In scan mode (scan_enable = 1): the mux passes the previous FF's output — all FFs form a shift register. You can shift in a test pattern, capture one functional clock cycle, and shift out the results for comparison.
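The mux-in-front-of-D structure is simply:

```systemverilog
// Scan flip-flop sketch: a normal FF with a 2:1 mux on its data input.
module scan_ff (
  input  logic clk, scan_enable, scan_in, d,
  output logic q
);
  always_ff @(posedge clk)
    q <= scan_enable ? scan_in : d;  // scan mode shifts, functional mode loads D
endmodule
```

Chaining is done by wiring each FF's q to the next FF's scan_in.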
Why it's essential: Without scan, testing whether a stuck-at fault (wire permanently stuck at 0 or 1) exists deep in the chip requires applying just the right sequence of primary input patterns — combinatorially explosive. With scan, an ATPG (Automatic Test Pattern Generation) tool can directly control any FF's state and observe any FF's captured output, enabling near-100% stuck-at fault coverage with a manageable number of test vectors.
Test flow at production: After packaging, every chip is tested on an ATE (Automated Test Equipment). The scan chain shifts in millions of vectors and compares the shifted-out responses against the fault-free model. Any mismatch → chip fails and is discarded.
Every AXI4 channel (AW, W, B, AR, R) uses a two-signal handshake: VALID (driven by the sender) and READY (driven by the receiver). A transfer occurs on the rising clock edge when both VALID and READY are simultaneously HIGH.
Rules:
- The sender asserts VALID when it has valid data/address to send and must not deassert VALID until the transfer completes (both signals HIGH on a clock edge).
- The receiver asserts READY when it can accept data. READY may be HIGH before VALID (pre-ready) — this is fine.
- If VALID is asserted and READY is LOW, both sides wait. Neither can "cancel" the transaction by deasserting VALID without completing the handshake.
The rule that must never be broken: VALID must not combinatorially depend on READY. If the master only asserts VALID after it sees READY, and the slave only asserts READY after it sees VALID, the result is a deadlock — neither ever fires first. READY is allowed to depend on VALID, but not vice versa.
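A sender obeying these rules can be sketched as a small state update (module and signal names are mine; note VALID is raised without looking at READY, and dropped only after the handshake):

```systemverilog
// AXI-style sender: VALID held stable until VALID && READY completes.
module axi_sender (
  input  logic clk, rst_n, have_data, ready,
  output logic valid
);
  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n)      valid <= 1'b0;
    else if (!valid) valid <= have_data;  // assert independent of READY
    else if (ready)  valid <= 1'b0;       // deassert only after transfer
endmodule
```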
AXI4 has 5 independent channels, enabling key features: a new write address (AW) can be accepted while write data (W) from a previous burst is still in-flight. Read and write transactions are completely independent, maximizing bus utilization.
AXI4 allows a master to issue multiple outstanding read or write transactions before receiving responses. Each transaction is tagged with a Transaction ID (ARID for reads, AWID for writes). The slave and interconnect are free to complete transactions in a different order from how they were issued — a fast SRAM access may return data before a slow DRAM access even if the DRAM request was issued first.
How the master reconciles responses: Read data returns on the R channel with RID matching the original ARID. Write responses return on the B channel with BID matching AWID. The master maintains an outstanding transaction table and uses the ID to match each response to the correct request.
Ordering rule per ID: Transactions with the same ID must complete in order. If a master issues two reads both with ARID=3, the interconnect must return them in order. Transactions with different IDs have no ordering guarantee relative to each other.
Interconnect ID widening: When multiple masters share an interconnect, the fabric appends a master-identifying prefix to each ID (e.g., 2-bit master select + original ARID = extended RID). On the response path, the prefix is used to route the response back to the correct master, which strips the prefix before comparing IDs.
Pipelining divides a long combinational operation into N sequential stages, each separated by flip-flops. Instead of one result every T_total clock period (dictated by the slowest path), you get one result per T_total/N clock period — throughput increases N× once the pipeline is full.
Example: A 5-stage 32-bit multiplier at 500 MHz produces one product every 2 ns. Without pipelining, the same logic would run at 100 MHz (5× slower combinational chain). With pipelining, a new multiply starts every cycle — though each individual result still takes 5 cycles of latency.
What pipelining improves: Throughput (results per unit time) — directly, by allowing clock frequency to be multiplied by the number of stages.
Trade-offs:
- Latency: Each result takes N cycles to complete instead of 1. This is often acceptable for bulk data, but hurts interactive or latency-sensitive operations.
- Area: N−1 extra register stages add flip-flop area and routing overhead.
- Power: More registers switching every cycle; however the lower V_DD enabled by higher-frequency operation may offset this.
- Hazards: Data hazards (RAW — read after write: an instruction needs a result that an earlier instruction, still in the pipeline, has not yet produced), control hazards (branches), and structural hazards (resource conflicts) require stalls, forwarding, or branch prediction logic — all of which reduce ideal throughput.
- Balancing: If one stage is slower than others, it bottlenecks the pipeline. All stages must be balanced to the same worst-case delay for the frequency gain to be fully realized.
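A minimal pipeline sketch (names and operand widths are mine) showing the registers that split one long operation into two balanced stages:

```systemverilog
// 2-stage pipelined multiply-add: stage register splits the critical path.
module pipe2 (
  input  logic        clk,
  input  logic [15:0] a, b, c,
  output logic [31:0] y
);
  logic [31:0] prod_q;  // stage-1 register
  logic [15:0] c_q;     // carry the add operand alongside stage 1

  always_ff @(posedge clk) begin
    prod_q <= a * b;         // stage 1: multiply
    c_q    <= c;
    y      <= prod_q + c_q;  // stage 2: add (total latency = 2 cycles)
  end
endmodule
```

A new (a, b, c) can be accepted every cycle; each result appears two cycles later.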
Blocking assignment (=) executes sequentially within an always block — each statement completes before the next begins, exactly like a software assignment. The left-hand side updates immediately.
Non-blocking assignment (<=) evaluates all right-hand sides first (using values from the current time step), then schedules all left-hand side updates to happen simultaneously at the end of the time step. This models the parallel behavior of flip-flops sampling their D inputs on a clock edge.
The golden rules:
- Use = (blocking) for combinational logic in always @(*) blocks. The sequential evaluation correctly implements the logic function.
- Use <= (non-blocking) for sequential logic in always @(posedge clk) blocks. The simultaneous update models how FFs all sample their D input on the same clock edge.
- Never mix both types in the same always block.
Classic bug with the wrong choice: A shift register written with blocking assignments (a = in; b = a; c = b;) immediately propagates the input through all stages in a single clock cycle. With non-blocking (a <= in; b <= a; c <= b;), all three FFs sample their current input simultaneously — correct shift register behavior.
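The correct version of that shift register, written out:

```systemverilog
// 3-stage shift register: with non-blocking assignments, every FF samples
// the value its neighbor held *before* the edge, so data moves one stage
// per cycle and c lags in by three cycles.
module shift3 (
  input  logic clk, in,
  output logic c
);
  logic a, b;
  always_ff @(posedge clk) begin
    a <= in;
    b <= a;
    c <= b;
  end
endmodule
// With blocking assignments (a = in; b = a; c = b;) the same block
// collapses: 'in' ripples through a, b, c within one edge, and synthesis
// produces a single register instead of three.
```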
What distinguishes wire, reg, and logic in SystemVerilog, and why was logic introduced?

wire (Verilog): a net type representing a physical connection. It can only be driven by continuous assignments (assign) or module output ports. Multiple drivers are resolved by the net's resolution function (conflicting values on a plain wire resolve to x; wand/wor resolve as wired-AND/OR). It cannot hold state.
reg (Verilog): a variable that can be driven inside procedural blocks (always, initial). Despite its name, it does NOT necessarily synthesize to a register — a reg inside always @(*) synthesizes to combinational logic. The name is misleading and a common source of confusion.
logic (SystemVerilog): a unified 4-state variable type that replaces both wire and reg for most use cases. It can be driven by both continuous assignments and procedural blocks. The key restriction: logic allows only one driver — the compiler flags multi-driver errors that wire silently allows. This catches accidental bus conflicts at compile time.
Why logic was introduced:
- Eliminates the confusing reg misnomer — logic communicates data type, not inferred hardware.
- Provides compile-time multiple-driver checking that wire lacks.
- Works in both continuous and procedural contexts, reducing declaration churn.
Rule of thumb: use logic for almost everything. Use wire only when you explicitly need multiple drivers (e.g., tri-state buses, wired-AND). Avoid reg entirely in new SystemVerilog code.

A synchronous FIFO uses a write pointer (wrptr) and a read pointer (rdptr) to track the head and tail. Both pointers start at 0. When the FIFO is empty, both point to the same location — and when it is completely full, both also point to the same location after wrapping around. This ambiguity is the core challenge of FIFO pointer design.
The naive approach fails: If both pointers are N-bit binary counters with range 0 to DEPTH-1, you cannot distinguish full from empty because both conditions result in wrptr == rdptr.
The extra bit trick: Use N+1 bit pointers, where N = log₂(DEPTH). The lower N bits are the actual memory address; the MSB (the "extra bit") acts as an overflow wrap indicator.
- Empty: wrptr == rdptr (all N+1 bits equal — same wrap count, same address)
- Full: wrptr[N-1:0] == rdptr[N-1:0] AND wrptr[N] != rdptr[N] (same address, but one extra wrap ahead)
The MSBs differ when the write pointer has wrapped one more time than the read pointer — meaning the FIFO is exactly DEPTH entries deep.
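The two comparisons reduce to a couple of assigns (parameterization is mine):

```systemverilog
// Extra-bit full/empty detection for a DEPTH = 2**N synchronous FIFO.
module fifo_flags #(
  parameter int N = 4                  // e.g., 16-deep FIFO for N = 4
) (
  input  logic [N:0] wrptr, rdptr,     // N+1-bit pointers; [N-1:0] = address
  output logic full, empty
);
  assign empty = (wrptr == rdptr);                 // same address, same wrap
  assign full  = (wrptr[N-1:0] == rdptr[N-1:0])    // same address...
              && (wrptr[N]     != rdptr[N]);       // ...one wrap apart
endmodule
```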
False path: a timing path that exists in the netlist but will never carry valid data in the real operating system. STA should completely ignore it — no setup or hold analysis. Examples:
- Paths between two completely unrelated, never-simultaneously-active clock domains
- Paths from a test-mode-only mux output that is static during functional operation
- Reset synchronizer paths where the reset is never timing-critical
- Paths between scan mode logic not active during functional timing
SDC: set_false_path -from [get_cells launch_ff] -to [get_cells capture_ff]
Multi-cycle path (MCP): a path where data is intentionally designed to take N clock cycles to settle. The designer tells STA to use N×T_clk as the available time for the setup check instead of 1×T_clk.
SDC for a 2-cycle setup path: set_multicycle_path 2 -setup -from ... -to ...
Critical rule for MCP: When you relax setup by N cycles, you MUST also adjust the hold check. By default, STA places the hold check one cycle before the setup capture edge — correct for 1-cycle paths. For a 2-cycle setup, the hold check must also move back one cycle:
SDC: set_multicycle_path 1 -hold -from ... -to ...
Forgetting this is a classic bug: set_multicycle_path 2 -setup without the matching -hold exception leaves an overly pessimistic hold check one cycle before the new setup capture edge — often an impossible hold requirement that forces unnecessary delay insertion.

OCV (On-Chip Variation) acknowledges that cells at different locations on the same die do not experience identical conditions. Spatial gradients in temperature, VDD (due to IR drop), and manufacturing process (oxide thickness, doping) cause cells in different parts of the chip to have slightly different delays — even if they are the same cell type running at the same nominal conditions.
This matters for STA because the clock path and data path typically run through physically different areas of the chip. If both paths were derated the same way, the error would cancel. But since one may be faster and the other slower, we must be pessimistic.
Flat OCV (Flat Derating): Applies a single multiplicative derating factor to all cells. The launch data path is made slower (multiply delays by e.g. 1.05) and the capture clock path is made faster (multiply by 0.95) for setup — worst-case pessimism everywhere. Simple but overly conservative.
AOCV (Advanced OCV): Uses a lookup table indexed by path depth (number of logic stages) and distance. Longer paths with more stages average out variation — a 30-stage path has less cell-to-cell variation than a 2-stage path. AOCV assigns less derating to deep paths, reducing pessimism and improving timing convergence without sacrificing accuracy.
POCV (Parametric OCV / LVF): Uses full statistical distributions (mean and sigma) for each cell's delay, propagating uncertainties through the path using statistical addition. This is the most accurate method and is becoming the industry standard at 7nm and below, where AOCV is no longer pessimistic enough.
When STA analyzes a flip-flop-to-flip-flop path, the launch clock path (from clock source to FF1) and the capture clock path (from clock source to FF2) often share common clock buffers near the root of the clock tree before they diverge.
With OCV derating, the tool pessimistically applies opposite deratings to the launch and capture paths: the launch clock is made slower (derated up) and the capture clock is made faster (derated down) for setup analysis. But the shared portion of the two paths cannot simultaneously be both slow and fast — it is the same physical cell running at the same moment in time.
CRPR removes this double-counting. For the portion of clock tree that is common to both launch and capture paths, the STA tool calculates how much pessimism was added by applying opposite deratings to the same cells, and adds that amount back as credit. The formula:
CRPR credit = max_delay(common) − min_delay(common)
This credit is added back to the setup slack. Typical CRPR values range from 10 ps to 100 ps depending on how much of the clock tree is shared and how aggressive the OCV derating is.
CRPR is sometimes called CPPR (Common Path Pessimism Removal) — both terms mean the same thing. Modern STA tools (PrimeTime, Tempus) apply it automatically.
Why per-bit synchronization fails: Each bit of the bus passes through its own 2-FF synchronizer independently. Each synchronizer may sample from a different source clock cycle — bit 3 might capture the value from cycle N while bit 0 captures the value from cycle N+1. The destination domain then reads a "torn" word that never existed in the source domain. For a 32-bit bus, this can produce completely wrong data.
Safe techniques for multi-bit CDC:
- Gray code (for counters/pointers): If the bus is a counter that increments by one at a time, encode it in Gray code before the crossing. Only one bit changes per count, so a sampled-in-transition value is at most off by one — which FIFO logic tolerates.
- Handshake (req/ack): Source asserts a request (req) after data has been stable for at least one source cycle. Destination synchronizes req (2-FF), samples the data only after req is asserted, then asserts ack. Source deasserts req after seeing synchronized ack. Both req and ack use separate 2-FF synchronizers. Low throughput (takes ~4–6 destination clock cycles per transfer) but works for any arbitrary data.
- Asynchronous FIFO: For streaming data, use an async FIFO with Gray-coded pointers. The FIFO internally handles all multi-bit CDC safely.
- Qualified sampling: Source keeps data stable for at least 3 destination clock cycles, then asserts a single "data valid" signal. Destination synchronizes the valid signal and samples the data on the synchronized valid. Risky — relies on the source holding data long enough.
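The Gray-code property that makes the first technique safe is easy to demonstrate. A minimal encode/decode pair, verifying that adjacent counts (including the wrap) differ in exactly one bit:

```python
# Gray-code encode/decode, as used for async-FIFO pointers: adjacent
# binary values differ in exactly one Gray bit, so a value sampled
# mid-transition is at most off by one count.

def bin_to_gray(b: int) -> int:
    return b ^ (b >> 1)

def gray_to_bin(g: int) -> int:
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

# Verify the single-bit-change property for a 4-bit counter (wraps 15 -> 0).
for i in range(16):
    diff = bin_to_gray(i) ^ bin_to_gray((i + 1) % 16)
    assert bin(diff).count("1") == 1   # exactly one bit flips per increment
    assert gray_to_bin(bin_to_gray(i)) == i
print("Gray property holds for all 16 counts")
```

Because only one bit can be mid-transition at the crossing, the destination either sees the old pointer or the new one, never a torn word.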
A 2-FF level synchronizer is used when the source signal is a steady level that persists for many source clock cycles. The destination captures it safely after 2 destination clocks.
A pulse synchronizer is needed when the source generates a single-cycle pulse — a signal that is HIGH for exactly one source clock cycle. A 2-FF synchronizer cannot reliably capture this: if the destination clock is slower or at an unfortunate phase, the pulse may be missed entirely.
How a toggle-based pulse synchronizer works:
- Source domain: A toggle flip-flop converts each incoming pulse into a level change. Every time a pulse arrives, the FF inverts its output. The toggle signal therefore holds its value until the next pulse — making it a persistent level that won't be missed.
- Clock crossing: The toggle signal crosses the domain via a standard 2-FF synchronizer.
- Destination domain: An XOR of the synchronized output and its one-cycle-delayed copy detects each edge → generates a clean single-cycle pulse in the destination domain.
Constraint: Source pulses must be spaced at least 3 destination clock cycles apart so the previous toggle has fully propagated through the synchronizer before the next pulse arrives. If pulses can arrive faster, use an async FIFO instead.
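The three steps above can be modeled behaviorally. This sketch steps the destination-domain logic one clock at a time; `toggle` is the level produced by the source-domain toggle FF (it flips once per source pulse, then holds), and the class names are illustrative:

```python
# Behavioral sketch of a toggle-based pulse synchronizer, stepped on the
# destination clock. ff1/ff2 form the 2-FF synchronizer; ff3 is the
# one-cycle delay used by the XOR edge detector.

class PulseSync:
    def __init__(self):
        self.ff1 = self.ff2 = self.ff3 = 0

    def step(self, toggle: int) -> int:
        # All flops update on the same destination clock edge.
        self.ff3, self.ff2, self.ff1 = self.ff2, self.ff1, toggle
        # XOR of sync output and its delayed copy: 1 for exactly one cycle
        # after each level change of 'toggle'.
        return self.ff2 ^ self.ff3

s = PulseSync()
# Source pulse flips 'toggle' 0 -> 1; it then holds. One dest pulse emerges.
trace = [s.step(t) for t in [0, 0, 1, 1, 1, 1, 1]]
print(trace)   # exactly one 1: the regenerated single-cycle pulse
```

A second flip of `toggle` (back to 0) would produce a second destination pulse, which is why pulses must be spaced far enough apart for the level to propagate.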
UPF (IEEE 1801) is a standard format for capturing the power intent of a chip design in a separate file that accompanies the RTL. As SoCs moved to multiple power domains, it became impossible to express power management purely in RTL — the RTL describes logical functionality, not which block gets what voltage or when a domain shuts off.
What UPF defines:
- Supply networks: Which voltage rails exist (`VDD_CPU`, `VDD_MODEM`, `VDD_AON`), their nominal voltages, and how they connect to design blocks.
- Power domains: Which RTL modules belong to which supply rail. Each domain has a defined primary power supply.
- Power states: Which domains are ON or OFF in each operating mode (e.g., "sleep mode: modem ON, CPU OFF, AON ON").
- Isolation cells: Specifies where isolation cells must be inserted at the boundary of power-gatable domains, and what value they should clamp to when the domain is off.
- Retention registers: Which flip-flops need SRPG (State Retention Power Gating) cells to preserve state across a power-off event.
- Level shifters: Where voltage-level-shifting cells are needed between domains running at different voltages.
- Power switches: Header (PMOS) or footer (NMOS) transistors that gate the power supply to a domain.
Isolation cells are required at the output boundary of any power-gated domain. When a domain's power supply is cut, its flip-flops lose their state and outputs become undefined (float to a random value or X). If an always-on domain receives these floating signals, it may malfunction — latching garbage data, causing spurious state transitions, or drawing excessive short-circuit current.
An isolation cell is inserted on each output net of the power-gated block. It is connected to an always-on supply. When the domain is OFF, the isolation cell clamps the output to a safe known value (typically 0 for AND-based isolation, or 1 for OR-based) as specified in UPF. When the domain is ON, the isolation cell passes the signal through transparently.
Retention registers (SRPG — State Retention Power Gating) are special flip-flop variants with a small "shadow latch" connected to a separate always-on power rail (typically a low-leakage supply). The shadow latch holds only a few transistors, consuming a fraction of the normal FF's leakage.
Operation:
- Before power-off: The power management controller sends a SAVE signal → each SRPG cell captures its current state into its shadow latch.
- Domain is off: Main supply cut, shadow latch retains state at very low power.
- After power-on: A RESTORE signal pushes the shadow state back into the main FF.
Without retention, the block must re-initialize from scratch after every power-up, adding latency and requiring software re-programming of registers.
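The SAVE/RESTORE sequence can be captured in a minimal behavioral model (signal and method names assumed, not from any UPF standard API):

```python
# Minimal behavioral model of an SRPG retention flop: a main FF on the
# switched rail plus an always-on shadow latch that survives power-off.

class RetentionFF:
    def __init__(self):
        self.q = 0          # main FF (switched rail)
        self.shadow = 0     # shadow latch (always-on rail)

    def clock(self, d):     # normal operation: main FF samples D
        self.q = d

    def save(self):         # SAVE asserted before power-off
        self.shadow = self.q

    def power_off(self):    # main rail collapses: state becomes unknown (X)
        self.q = None

    def restore(self):      # RESTORE asserted after power-on
        self.q = self.shadow

ff = RetentionFF()
ff.clock(1)
ff.save()
ff.power_off()      # ff.q is now None (X)
ff.restore()
print("state after power cycle:", ff.q)   # 1: state preserved
```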
A voltage island is a physically distinct region of the chip that operates at a different supply voltage from surrounding blocks. By running low-activity blocks at a lower V_DD, dynamic power scales as V², giving dramatic savings — dropping from 1.0V to 0.8V reduces dynamic power by 36%.
Why Qualcomm uses voltage islands: A Snapdragon SoC has very different performance and power requirements across blocks. The modem baseband runs continuously but at moderate frequency. The application CPU cores spike to high performance on demand. The always-on sensor hub must run at <0.7V for weeks on battery. A single supply voltage optimized for the fastest block wastes enormous power in slower blocks.
Required boundary cells:
- Level shifters (LS): Signals crossing between domains at different voltages must be shifted to the receiving domain's logic levels. A signal from a 0.8V domain HIGH (0.8V) is not guaranteed to be a valid HIGH in a 1.1V domain without level shifting. Level shifters are inserted on every signal crossing.
- Isolation cells: If the lower-voltage island can be powered off completely, isolation cells (see previous question) are needed to clamp its outputs.
- Level-shifting isolation cells: Combined cells that both shift voltage and isolate — used at boundaries between always-on and power-gatable domains at different voltages.
Clock Tree Synthesis (CTS) is the physical design step that builds the clock distribution network — a buffered tree that delivers the clock signal from the clock source (PLL output or pad) to every flip-flop's clock pin across the entire chip.
Goals of CTS:
- Minimize clock skew: Every FF should see the clock edge at (nearly) the same time. Unbalanced trees create skew that consumes setup and hold timing margins.
- Meet insertion delay target: Total latency from clock source to FF clock pins must be within the budgeted range (typically set in SDC via `set_clock_latency`).
- Minimize clock power: The clock network toggles every cycle and can consume 30–40% of total chip dynamic power. The tool balances skew reduction against cell count and wire length.
- Respect non-default routing (NDR) rules: Clock nets typically use special Non-Default Rules (wider wires, extra spacing, preferred upper metal layers) for reduced resistance and better EM reliability.
Flow position: CTS runs after placement (cell locations are fixed) but before detailed routing. After CTS, timing analysis uses real clock arrival times instead of ideal clock assumptions — hold violations often emerge here because real clock trees have skew that didn't exist in pre-CTS analysis.
IR drop is the voltage reduction along the power delivery network from the supply pins to the power pins of individual cells. The metal power grid has resistance (R), and the switching current (I) causes a voltage drop V = I × R. A cell operating at V_nominal − ΔV is slower than a cell at the full supply voltage.
Two types:
- Static IR drop: Average current × grid resistance. Determined by the long-term average switching activity. Used for power integrity sign-off of DC operating point.
- Dynamic (transient) IR drop: When a large number of cells switch simultaneously (e.g., a wide datapath all clocking at once), the instantaneous current surge exceeds the average. The power grid voltage transiently collapses by a larger amount, limited by the inductance and decoupling capacitance. This "voltage droop" is worse than static IR and is the primary concern at high frequencies.
Effect on timing: In a high-IR-drop region, cells are slower than characterized at nominal voltage. A path that passes STA at nominal conditions may violate setup timing in silicon due to IR-induced delay increase. Hold violations are less common (slower cells improve hold margin).
Fixes:
- Widen power stripes or add more power mesh layers
- Add decoupling capacitors (decaps) near high-switching density regions
- Spread high-activity cells during placement to avoid current hot spots
- Use power gating with controlled wake-up sequences to avoid simultaneous switching
- In STA: apply voltage derating in high-IR-drop regions for more accurate sign-off
During VLSI fabrication, metal layers are deposited and patterned one at a time using plasma etching. Plasma charges accumulate on exposed metal wires during etching. If a long metal wire is already connected to a transistor gate oxide but NOT yet connected to a diffusion region (which would discharge the charge safely), the accumulated charges can create a large voltage across the thin gate oxide — sufficient to cause permanent gate oxide damage: threshold voltage shifts, increased leakage, or immediate breakdown.
The antenna ratio = (metal area of the wire connected to the gate) / (gate oxide area). Process Design Kits (PDKs) specify maximum allowable antenna ratios (typically 400–1000 for metal, 200–600 for vias). Exceeding this ratio means the wire can accumulate enough charge to damage the oxide.
How it's detected: The router's DRC (Design Rule Check) engine computes the cumulative antenna ratio for every net using the partial routing built up layer by layer. If it exceeds the limit, an antenna violation is flagged.
Fixes:
- Metal jumper (layer hopping): Break the long wire by jumping to a higher metal layer and back. This "resets" the antenna accumulation because higher-layer routing is done later, after diffusion connections have been made. Most common fix.
- Antenna diode: Insert a reverse-biased diode near the gate, connected to the same metal wire. During plasma etching, the diode provides a discharge path to substrate, preventing charge buildup. Small area cost, always effective.
- Reduce net length: Re-route the net to use shorter wires on lower layers.
Code coverage measures how much of the RTL source code was exercised by the simulation:
- Line/statement coverage: Were all lines of RTL executed?
- Branch coverage: Were both sides of every `if`/`else` and every `case` arm taken?
- Toggle coverage: Did every signal toggle both 0→1 and 1→0?
- FSM coverage: Were all states visited and all transitions taken?
Code coverage is automatically collected by the simulator with no extra specification — easy to get, but tells you nothing about what scenarios were verified. You can hit 100% branch coverage while never testing the most critical protocol corner case.
Functional coverage is user-defined. The verification engineer specifies which scenarios, protocol states, and parameter combinations are important to verify — then measures whether simulations actually exercised them:
- Was an AXI4 burst of ARLEN=255 (256 beats) issued?
- Did a FIFO simultaneously receive a write and a read when exactly one slot was free?
- Did a CDC crossing happen with data changing every source cycle?
Which matters more? Both are necessary; neither alone is sufficient. Code coverage ensures no dead code was accidentally left un-exercised. Functional coverage ensures the right scenarios were tested. A mature sign-off process requires both to be above target (typically 95%+ code coverage, 100% defined functional coverpoints).
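The idea behind user-defined functional coverage can be sketched in a toy collector, in the spirit of a SystemVerilog covergroup (bin names and the AXI fields are illustrative):

```python
# Toy functional-coverage collector: user-defined bins with predicates,
# sampled per transaction, reporting the fraction of bins hit.

class Covergroup:
    def __init__(self, bins):
        self.bins = bins                          # name -> predicate
        self.hits = {name: 0 for name in bins}

    def sample(self, txn):
        for name, pred in self.bins.items():
            if pred(txn):
                self.hits[name] += 1

    def coverage(self):
        hit = sum(h > 0 for h in self.hits.values())
        return 100.0 * hit / len(self.hits)

cg = Covergroup({
    "max_burst": lambda t: t["arlen"] == 255,   # 256-beat AXI burst seen?
    "min_burst": lambda t: t["arlen"] == 0,     # single-beat burst seen?
})
cg.sample({"arlen": 255})
print(f"functional coverage: {cg.coverage():.0f}%")   # 50% until min_burst hits
```

The key contrast with code coverage: these bins only exist because an engineer decided the scenarios matter, so an unhit bin is a concrete, reviewable verification gap.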
UVM (IEEE 1800.2) is a standardized SystemVerilog methodology for building reusable, scalable verification environments using an object-oriented framework. It replaces brittle, one-off directed testbenches.
Key UVM components:
- uvm_test: Top-level test class. Selects which scenario/sequence to run and configures the environment. Different tests reuse the same TB infrastructure.
- uvm_env: Container that instantiates and connects agents, scoreboards, and coverage collectors for one DUT.
- uvm_agent: Models one protocol interface (e.g., AXI4 master). Contains: Driver (applies stimulus to DUT pins), Monitor (observes DUT pins and creates transaction objects), Sequencer (arbitrates between sequences and feeds items to the driver).
- uvm_sequence / uvm_sequence_item: Defines the actual stimulus transactions. Sequences can be layered (a higher-level sequence calls lower-level sequences) and constrained-random.
- uvm_scoreboard: Compares DUT output (from monitor) against a reference model's expected output. Reports pass/fail.
- TLM ports (uvm_analysis_port): Standardized communication channels between components — no direct references between classes.
Vs. directed testbench: A directed testbench hand-codes every stimulus vector — it only tests what the engineer explicitly wrote. A UVM testbench with constrained-random stimulus explores the full stimulus space automatically within user-specified constraints, finding corner cases no human would write by hand.
ATPG (Automatic Test Pattern Generation) tools model physical manufacturing defects as logical faults and generate patterns to detect them. The main fault models are:
- Stuck-At Fault (SAF): A wire is permanently stuck at logic 0 (SA0) or 1 (SA1), regardless of what drives it. Models open circuits, resistive shorts to VDD/GND, and broken connections. The most widely used model. A stuck-at fault is detected by finding a test that excites the fault (drives the opposite value) and propagates the effect to a primary output or scan chain output. Industry target: 95–99% fault coverage.
- Transition Delay Fault (TDF): Tests whether a net can make a complete 0→1 or 1→0 transition within one clock cycle. Detects resistive defects that don't prevent correct logic levels but slow transitions — critical at high frequency where even a slightly slow net causes a setup violation. TDF requires two-pattern tests: launch the transition, then capture the response one cycle later.
- Path Delay Fault (PDF): Tests the end-to-end propagation delay of a specific signal path. More accurate timing characterization than TDF — detects accumulated small delays across many gates. Requires many patterns but provides the most complete timing sign-off.
- Bridging Fault: Models an unintended short between two adjacent nets. A short that combines two signals via wired-AND or wired-OR logic. Increasingly important at 7nm/5nm where metal pitch is very tight and coupling between adjacent wires is a common defect.
- Cell-Aware Fault: Tests for defects inside standard cells at the transistor level (open/short in the cell's internal netlist). Catches defects that SAF, modeled at the cell's logical interface, would miss.
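The excite-and-propagate idea behind stuck-at testing can be shown on a two-gate circuit. This fault-simulation sketch (net names and the circuit are invented for illustration) finds, for each injected fault, the input vectors whose good and faulty outputs differ:

```python
# Tiny stuck-at fault simulation on y = (a AND b) OR c. A vector detects
# a fault iff the fault-free and faulty circuits produce different outputs.

from itertools import product

def circuit(a, b, c, fault=None):
    # fault = (net, stuck_value); nets: 'a', 'b', 'c', 'n1' (the AND output)
    if fault and fault[0] == 'a': a = fault[1]
    if fault and fault[0] == 'b': b = fault[1]
    if fault and fault[0] == 'c': c = fault[1]
    n1 = a & b
    if fault and fault[0] == 'n1': n1 = fault[1]
    return n1 | c

for fault in [('a', 0), ('n1', 1), ('c', 1)]:
    detecting = [v for v in product((0, 1), repeat=3)
                 if circuit(*v) != circuit(*v, fault=fault)]
    print(fault, "detected by", detecting)
```

Note that `a` stuck-at-0 is detected only by (a,b,c)=(1,1,0): the fault must be excited (a driven to 1) and its effect propagated (b=1 so the AND is sensitized, c=0 so the OR does not mask it) — exactly the two conditions described above.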
CSI-2 (Camera Serial Interface 2) is a MIPI Alliance standard for connecting image sensors to application processors. It is the dominant camera interface in smartphones — virtually every mobile camera uses CSI-2.
Physical layer (D-PHY): CSI-2 uses MIPI D-PHY, a differential serial interface with two operating modes:
- High-Speed (HS) mode: Low-swing differential signaling (100–300 mV differential) at 80 Mbps to 4.5 Gbps per lane. Used for pixel data transmission.
- Low-Power (LP) mode: CMOS-level single-ended signaling. Used for control, synchronization, and lane management. Much lower speed.
Architecture: One clock lane + 1 to 4 data lanes. Each lane is a differential pair (DP/DN). For a quad-lane sensor at 4.5 Gbps/lane: total bandwidth = 4 × 4.5 = 18 Gbps — enough for today's highest-resolution sensors, though a full-resolution 200 MP readout at that bandwidth still implies a modest frame rate (such sensors typically use pixel binning for video).
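A quick sanity check on the aggregate-bandwidth arithmetic, assuming 10 bits/pixel raw data and ignoring protocol overhead (both assumptions for illustration only):

```python
# Aggregate CSI-2 bandwidth for a quad-lane D-PHY link, and the raw frame
# rate that bandwidth supports for a 200 MP sensor at 10 bits/pixel.

lanes, rate_gbps = 4, 4.5
total_gbps = lanes * rate_gbps                    # 18 Gbps aggregate

pixels, bits_per_px = 200e6, 10                   # 200 MP, 10-bit raw
fps = total_gbps * 1e9 / (pixels * bits_per_px)
print(f"{total_gbps:.0f} Gbps -> {fps:.0f} fps at full 200 MP resolution")
```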
Virtual channels: Up to 4 virtual channel IDs allow multiple cameras to share the same physical CSI-2 interface, multiplexed by the sensor or ISP.
C-PHY (newer alternative): Uses 3-wire "trios" with 3-phase encoded signaling that carries ~2.28 bits per symbol, so a trio running at several Gsymbols/s delivers a higher effective data rate than a D-PHY lane at the same toggle rate. Used in high-resolution cameras where D-PHY lane count limits bandwidth.
VLSI implementation: The CSI-2 receiver on a Snapdragon SoC consists of a D-PHY frontend (analog deserializer), a lane merger, a CSI-2 protocol decoder, and an interface to the Image Signal Processor (ISP). It must process pixels faster than they arrive to prevent FIFO overflow — typically 500 MHz+ operating frequency.