Interview Prep

VLSI Engineer Interview Questions

Real questions asked at top semiconductor and tech companies — with detailed, interview-ready answers covering RTL, STA, CDC, low power, DFT, and protocols.

146
Questions
9
Topics
3
Difficulty Levels
13
Companies
Select a Company

Showing 20 of 20 questions

01
Easy RTL Design
What is the difference between a latch and a flip-flop? When would you use each?

A latch is a level-sensitive storage element. When the enable (or clock) is HIGH, the output follows the input continuously — the latch is "transparent." When the enable goes LOW, the last value is held. A flip-flop is edge-triggered — it samples the input only at the exact rising (or falling) clock edge and ignores the input at all other times.

Why flip-flops dominate RTL design: Their predictable sampling window makes static timing analysis (STA) straightforward — setup and hold times are well-defined relative to one clock edge. Latches create "transparent windows" that make STA far more complex; timing tools must ensure that no combinational path through an open latch violates timing in any cycle.

When to use latches deliberately:

  • Power savings: A latch consumes no clock dynamic power when transparent (no clock-to-output toggling).
  • High-performance pipelines: In "latch-based" designs (common in custom datapath and CPUs), a latch pair (master + slave) forms a pseudo-FF but allows time-borrowing — a slow first half-cycle can steal time from a fast second half-cycle, improving throughput.
  • Specialized cells: Sense amplifiers and certain memory cells use latch-based structures.
Interview tip: Google interviewers often follow up: "What is an enable flip-flop and how is it synthesized?" — Answer: it's a mux feeding the D input: D = en ? data_in : Q. The synthesis tool maps this to a clock-gate cell, not a latch.
02
Medium RTL Design
How do you design a clock divider by 3 with exactly 50% duty cycle?

Dividing by an odd number and achieving 50% duty cycle requires using both clock edges. A single-edge counter can only produce a 33%/67% duty cycle.

The technique: Create two signals derived from the same mod-3 counter — one toggled on the rising edge, one on the falling edge — then OR them together.

  • Use a 2-bit counter clocked on the rising edge counting 0→1→2→0. Generate out_r = HIGH when count == 0, LOW when count == 1.
  • Use the same counter logic clocked on the falling edge. Generate out_f identically.
  • Final output = out_r OR out_f. Because out_r and out_f are offset by half a source clock period, their OR produces exactly 1.5 periods HIGH and 1.5 periods LOW out of every 3 source periods — 50% duty cycle.
Why this works: The rising-edge FF goes HIGH at the source rising edge of count 0, and the falling-edge FF goes HIGH at the source falling edge of count 0 — they overlap for 1.5 source cycles. The combined waveform has transitions at every 1.5 source clock periods.
03
Medium RTL Design
How do you calculate the required depth of an asynchronous FIFO?

The FIFO must hold all data written during a burst before the reader catches up. The minimum depth is:

Depth ≥ (Write rate − Read rate) × Burst duration + Synchronizer latency guard

Breaking this down:

  • Burst excess: If a writer sends at fw words/cycle for T cycles, and the reader drains at fr words/cycle, the net accumulation is (fw − fr) × T words. This is the minimum storage needed.
  • Synchronizer latency: The gray-code read pointer takes 2–3 destination-clock cycles to propagate through the synchronizer. During this window, the write side may falsely see the FIFO as full (or the read side sees it as empty). Add 2–3 words of margin per side.
  • Round up to power of 2: Async FIFO address arithmetic requires a power-of-2 depth so that the gray code pointer MSB inversion trick for full/empty detection works correctly.
Example: Writer at 400 MHz (1 word/cycle), reader at 200 MHz (1 word/cycle), burst of 16 words. Excess = (1−0.5)×16 = 8 words, plus 4 guard words → depth = 12, round up to 16.
04
Easy RTL Design
What is a glitch in combinational logic and how do you prevent it?

A glitch (or hazard) is a spurious, short-duration output pulse that occurs when multiple inputs change simultaneously and race through paths of unequal delay. Even though the steady-state output is correct, the transient produces an unwanted transition.

Classic example: A static-1 hazard in a 2-input AND gate where both inputs come from the same signal A through paths with different delays — one direct, one through an inverter. Mathematically A AND NOT(A) = 0, but if the direct path is faster, the gate briefly sees 1 AND 1 = 1 before the inverted path arrives.

Why glitches matter:

  • Power waste: Every glitch is a switching event that consumes dynamic power (αCV²f). High-activity buses can waste significant power.
  • Clock path corruption: A glitch on a clock or enable line can clock a flip-flop at the wrong time, causing functional failure.
  • Latch transparency: Glitches on a latch enable propagate directly to the latch output while it is transparent.

Prevention:

  • Register outputs: Sampling glitchy combinational logic in a flip-flop on the clock edge filters all glitches shorter than the setup window.
  • Hazard-free logic: In Karnaugh map minimization, add "consensus" prime implicant terms that cover the transition between any two adjacent groups — eliminates static hazards.
  • Clock gating cells (ICG): Use library clock gate cells that latch the enable on the clock LOW phase — ensures the gated clock output is always a complete pulse or no pulse at all.
05
Hard RTL Design
How do you implement a glitch-free clock multiplexer for two asynchronous clocks?

A naive assign clk_out = sel ? clk1 : clk0 will produce a glitch when sel changes — the output can get a truncated pulse from one clock or a merged pulse from both. This corrupts any flop clocked by clk_out.

The safe design uses an interlocked two-branch structure:

  • Each branch has a flip-flop clocked on the falling edge of its own clock to gate that branch on or off.
  • Branch 0 FF: D = !sel AND !en1_q, clocked on negedge clk0. Branch 1 FF: D = sel AND !en0_q, clocked on negedge clk1.
  • Each branch gates its own clock: clk0_g = clk0 AND en0_q, clk1_g = clk1 AND en1_q.
  • Output: clk_out = clk0_g OR clk1_g.

Why falling-edge clocking? Gating on the falling edge ensures that the enable change is captured while the clock is LOW, so the gated clock output is either a complete HIGH pulse or nothing — never a partial pulse.

Why the cross-interlocking? The !en1_q / !en0_q terms ensure only one branch is ever active at a time. The transition from clk0 to clk1 requires clk0's branch to deassert fully before clk1's branch asserts — preventing both from being active simultaneously.

Caveat: The switch takes 1–2 cycles of the slower clock to complete (synchronizer latency). This is expected and acceptable. Never use this for cycle-accurate switching without understanding the latency.
06
Easy Timing / STA
What are setup time and hold time? What happens when each is violated?

Setup time (t_su) is the minimum time the data input must be stable before the active clock edge for the flip-flop to reliably capture it. Hold time (t_h) is the minimum time the data must remain stable after the clock edge.

Together they define a "forbidden window" around the clock edge where data must not change.

Setup violation (data arrives too late): The flip-flop samples data before it has settled to a valid logic level. The FF may capture the wrong value or enter a metastable state. This is a functional failure at the target frequency — the design either works slowly or not at all. Setup violations are frequency-dependent: slow down the clock enough and they disappear.

Hold violation (data changes too soon after the clock edge): The flip-flop's captured value is overwritten before it is fully stored. This causes the FF to capture a corrupted value — either the new data that hasn't fully arrived, or garbage. Hold violations are frequency-independent — they occur even at 1 Hz and are caused by short combinational paths (fast data propagation relative to clock skew). They are the more dangerous class because no amount of slowing down the clock fixes them.

Key insight: Setup violations → fix by making data arrive earlier or giving more time (slow clock). Hold violations → fix by making data arrive later (insert delays), completely unrelated to clock frequency.
07
Medium Timing / STA
How do you fix a setup timing violation? How about a hold violation? Are the fixes different?

Yes, the fixes are completely different and cannot be mixed up.

Fixing a setup violation (data arrives too late — reduce data path delay):

  • Replace high-drive-strength cells with faster variants (higher Vt cells are slower; swap to lower Vt)
  • Reduce logic depth — restructure combinational logic to fewer gate stages
  • Use retiming — move registers across combinational logic to balance stages
  • Add pipeline registers to split a long path into two shorter ones
  • Optimize clock skew — use positive skew on the capture FF (delay the capture clock) to give the data path more time
  • Last resort: reduce the clock frequency

Fixing a hold violation (data arrives too early — increase minimum data path delay):

  • Insert delay buffers (DEL cells from the standard cell library) on the short data path
  • Use higher-drive-strength cells — paradoxically, some cells have more internal delay than smaller ones
  • Add logic stages that cancel each other (insert an even number of inverters)
  • Adjust clock skew: negative skew on the capture FF (advance the capture clock) reduces the hold window
Critical rule: When fixing hold violations by adding buffers, always re-check setup slack — the added delay consumes setup margin. Both must pass simultaneously.
08
Medium Timing / STA
What is the difference between clock skew and clock jitter? How does each affect timing?

Clock skew is a static, deterministic difference in clock arrival time between two flip-flops on the same chip. It is fixed for a given netlist and process corner. Skew arises from different buffer depths or wire lengths in the clock tree.

Clock jitter is a dynamic, cycle-to-cycle variation in the clock edge position. It is random (caused by power supply noise, substrate coupling, PLL VCO noise) and varies every cycle. You cannot predict the sign or magnitude of jitter in any given cycle.

Effect on timing:

  • Skew and setup: Positive skew (capture FF clock arrives later) helps setup — the data has more time to travel. Negative skew (capture clock earlier) hurts setup.
  • Skew and hold: Positive skew hurts hold — data launched early from the launch FF might arrive at the capture FF before its (delayed) clock edge. Negative skew helps hold.
  • Jitter and both: Jitter degrades both setup and hold because you can't know which direction the edge will shift. STA tools add a "clock uncertainty" (a worst-case jitter margin) that reduces both setup and hold slack. It cannot be recovered through skew optimization.
Rule of thumb: Skew is a tool — you can use it (via clock tree tuning) to deliberately help tight paths. Jitter is noise — you can only characterize it and budget for it.
09
Hard Timing / STA
Walk through a complete flip-flop to flip-flop timing path analysis. What is slack?

For a path from a launch flip-flop (FF1) to a capture flip-flop (FF2):

Setup check — data must arrive before the capture edge:

  • Data arrival time = T_clk_launch + T_cq(FF1) + T_comb
  • Data required time = T_clk_capture − T_setup(FF2)
  • Setup slack = Required − Arrival = (T_clk_capture − T_su) − (T_clk_launch + T_cq + T_comb)

Hold check — data must not arrive too early:

  • Data must arrive after: T_clk_capture + T_hold(FF2)
  • Hold slack = Arrival − Hold_required = (T_clk_launch + T_cq_min + T_comb_min) − (T_clk_capture + T_hold)

Where:

  • T_cq = clock-to-Q propagation delay of the launch FF
  • T_comb = total combinational path delay (sum of gate + wire delays)
  • T_setup / T_hold = FF timing constraints from the cell library
  • T_clk_capture − T_clk_launch = clock skew (positive = capture is later)

Slack = margin above the requirement. Positive slack → timing met. Negative slack → timing violated. The most negative slack in the design = worst negative slack (WNS); summing all negative slacks = total negative slack (TNS).

STA tools use MMMC: Multi-mode, multi-corner analysis — checking setup at slow-corner (slow cells, high temp, low Vdd) and hold at fast-corner (fast cells, low temp, high Vdd) simultaneously, since those are the worst cases for each check.
10
Hard Timing / STA
After place-and-route, a flip-flop path has both a setup AND a hold violation. How do you approach this?

A simultaneous setup and hold violation on the same path means the combinational delay window is too narrow — the path is fast enough to threaten hold, but not fast enough to meet setup. This typically means the logic between two FFs is very shallow (perhaps only a wire or one gate), but there is also a clock tree imbalance creating large skew.

Diagnosis first:

  • Check the clock skew between launch and capture FFs. Excessive positive skew can simultaneously worsen hold (by making the capture clock late) while hurting setup if the data path is borderline.
  • Look at the path's actual combinational depth — very few gates means it's a structurally short path.

Fix strategy:

  • Rebalance the clock tree first: Reduce skew between these two FFs. This is the most targeted fix — less skew directly improves both simultaneously.
  • Insert delay cells: Add buffers on the data path to increase minimum delay (fix hold). Then verify setup is still met — if setup is tight, you may need to also optimize the logic depth.
  • Restructure logic: If setup is violated because the path is in a long combinational chain overall, pipelining it (adding an intermediate register) can help. But this changes the design architecture.
At advanced nodes (7nm, 5nm): This scenario is common because cells are extremely fast and clock skew control is harder with dense routing. PDKs provide dedicated delay cells (e.g., DLY4, BUF_DEL) tuned specifically for hold fixing without wasting area.
11
Medium CDC
What is metastability? Can it be completely eliminated?

Metastability occurs when a flip-flop's setup or hold time is violated — the flip-flop enters a metastable state where its output is neither a valid logic 0 nor a valid logic 1. The internal node of the FF is stuck near the switching threshold (V_DD/2) and takes an unpredictable time to resolve to a valid level.

The physics: A flip-flop is a bistable element with two stable equilibria (0 and 1) and one unstable equilibrium (the metastable point). When forced into the unstable point, it resolves exponentially fast — but how long it takes is governed by thermal noise and is therefore random.

Can it be eliminated? No — not completely. Any time asynchronous data crosses a clock boundary, there is a non-zero probability of violating the setup/hold window. The probability of remaining metastable beyond a time T_r decreases exponentially with T_r, but never reaches exactly zero.

What we do instead: We manage the probability using synchronizers. The key metric is MTBF (Mean Time Between Failures). A 2-flop synchronizer gives the metastable FF one full clock period to resolve — in modern CMOS (τ ≈ 30ps), at 1 GHz this gives MTBF of thousands of years, making failure astronomically unlikely.

Critical: A metastable output propagating into combinational logic is dangerous — the intermediate voltage can cause multiple gates to output contradictory values, causing unpredictable circuit-wide failures. Always ensure metastability resolves before the signal fans out.
12
Medium CDC
What is a 2-flop synchronizer and why exactly 2 flops? Why not 1 or 3?

A 2-flop synchronizer consists of two back-to-back flip-flops, both clocked by the destination domain clock, inserted on a signal crossing from another clock domain.

How it works: The first FF may go metastable when it samples the asynchronous input. It has one full clock period (minus the FF's own propagation delay and the second FF's setup time) to resolve. Because metastability resolution time is exponential, the probability that it remains metastable long enough to corrupt the second FF is extremely small.

Why not 1 flop? One flop doesn't give enough resolution time. The metastable FF must resolve within roughly T_clk − T_cq − T_setup2 — for a 1 GHz clock, that's ~500ps. The probability of remaining metastable that long is non-trivial in some technologies.

Why not 3 flops? Three flops are rarely necessary. With a 1 GHz destination clock and τ ≈ 30ps (modern 7nm), a 2-flop synchronizer gives:

  • Resolution time T_r ≈ 500ps, τ ≈ 30ps
  • MTBF ≈ e^(T_r/τ) / (f_c × f_d) ≈ e^16.7 / 10^18 → millions of years

Three flops extend MTBF to an astronomically larger number that adds no practical benefit. Use 3 flops only in safety-critical applications (automotive ASIL-D, aerospace) where even million-year MTBF is required to be proven insufficient with 2.

13
Medium CDC
Why are Gray code counters used for asynchronous FIFO read/write pointers?

In an async FIFO, the read pointer lives in the read clock domain and the write pointer lives in the write clock domain. Each pointer must be compared against the other's synchronized version to determine full or empty.

The problem with binary counters: When a binary counter increments, multiple bits change simultaneously. For example, 0111 → 1000 changes all 4 bits. If you sample a binary counter while it's transitioning, you might read any of the 16 possible values — a catastrophic error that could falsely declare the FIFO full or empty, corrupting data.

Why Gray code solves this: A Gray code changes exactly one bit per count. When the pointer transitions from count N to N+1, only one bit flips. If the synchronized copy is sampled mid-transition, the worst case is that it sees either count N or count N+1 — off by at most one.

The FIFO full/empty logic deliberately has one count of tolerance: full is declared when the pointers are exactly DEPTH apart, not DEPTH−1. This one-count margin absorbs the maximum one-count error introduced by Gray code sampling, making the detection robust.

Implementation note: The Gray code conversion is simple: gray = bin XOR (bin >> 1). Store the Gray counter as the pointer, convert back to binary (requires a loop of XORs) only if you need the absolute address for memory indexing.
14
Hard CDC
What is MTBF in the context of synchronizers and what parameters influence it?

MTBF (Mean Time Between Failures) quantifies how often a synchronizer is expected to allow a metastable signal to propagate into the destination domain. The standard formula is:

MTBF = e^(T_r / τ) / (f_c × f_d × t_w)

Where:

  • T_r — resolution time available for metastability to resolve (≈ T_clk − T_cq_ff1 − T_su_ff2). The more time, the exponentially higher the MTBF.
  • τ — technology metastability time constant. A smaller τ means the FF resolves faster, improving MTBF. Scales with process: ~100ps at 180nm, ~30ps at 7nm.
  • f_c — destination clock frequency. Higher frequency = more sampling opportunities per second = more chances for metastability to cause failure.
  • f_d — data toggle rate. How often does the incoming signal change near the clock edge?
  • t_w — the setup+hold window width. Narrower window = smaller probability of entering metastability per clock cycle.

The exponential dependence on T_r/τ is why adding a second synchronizer flip-flop dramatically improves MTBF — it adds one full clock period to T_r.

Practical target: Design teams typically target MTBF > 10,000 years per synchronizer. At 1 GHz with modern 7nm silicon, a 2-FF synchronizer achieves this comfortably. If not, either add a third FF or investigate if the signal can be designed to toggle only when safely away from clock edges.
15
Easy Low Power
What is the difference between dynamic power and static (leakage) power? How do you reduce each?

Dynamic power = α × C_L × V_DD² × f — consumed when a node switches from 0 to 1 (charges the load capacitance) or 1 to 0 (discharges through the pull-down network). α is the activity factor (fraction of clock cycles the node switches). This power is zero when the circuit is idle.

Static (leakage) power = I_leakage × V_DD — consumed even when no gates are switching, due to sub-threshold current, gate oxide tunneling, and junction leakage. It does not depend on frequency and is present whenever power is supplied.

Historical trend: In nodes above 90nm, dynamic power dominated. Below 28nm and especially at 7nm/5nm, leakage has grown dramatically because transistors cannot be switched fully off at low supply voltages. Modern SoCs spend significant area on leakage management.

Reduction techniques:

  • Dynamic: Clock gating (reduce α), operand isolation (prevent toggling of datapath), voltage scaling (V² dependence), frequency scaling, low-swing signaling
  • Static: Power gating (cut V_DD to an entire domain via header/footer cells), multi-Vt design (use High-Vt cells in non-critical paths — slower but much lower leakage), reverse body biasing, state retention during power down
16
Medium Low Power
How is clock gating implemented correctly? Why can't you just AND the clock with an enable signal?

Clock gating removes the clock signal from a register bank when its value won't change, eliminating the clock-to-Q dynamic power and the switching power of all downstream logic. The clock network can account for 30–40% of total chip dynamic power, making clock gating one of the highest-impact power techniques.

Why you cannot simply write assign gated_clk = clk AND enable:

If enable changes while clk is HIGH, the AND gate output glitches — it produces a truncated clock pulse shorter than a full cycle. This truncated pulse can violate the setup/hold requirements of any flip-flop it clocks, corrupting stored data or causing metastability.

The correct implementation — Integrated Clock Gating (ICG) cell:

  • A latch samples the enable signal on the LOW phase of the clock (when clock = 0)
  • The latched enable is then ANDed with the clock
  • Because the latch captures enable only when the clock is LOW, by the time the clock rises, the latch output is stable — the AND gate sees a stable enable and a clean rising edge → full clock pulse or no pulse, never a partial one

In RTL, you write: if (enable) register <= data; and the synthesis tool infers an ICG cell. Never write clock gating manually at the gate level in RTL — let the tool use the optimized library ICG cell.

Interview follow-up: "What is operand isolation?" — It prevents data inputs to a gated block from toggling even when the block is clock-gated, saving the switching power of the combinational logic feeding the registers.
17
Easy Verification / DFT
What is a scan chain and why is it used in Design for Test (DFT)?

A scan chain connects the flip-flops in a design into a long shift register that can be controlled and observed from the chip's I/O pins, purely for testing purposes.

How it works: Each flip-flop in the design is replaced with a scan flip-flop — identical to a normal FF but with an extra 2:1 mux at the data input:

  • In functional mode (scan_enable = 0): the mux passes normal D input — the design operates as designed.
  • In scan mode (scan_enable = 1): the mux passes the previous FF's output — all FFs form a shift register. You can shift in a test pattern, capture one functional clock cycle, and shift out the results for comparison.

Why it's essential: Without scan, testing whether a stuck-at fault (wire permanently stuck at 0 or 1) exists deep in the chip requires applying just the right sequence of primary input patterns — combinatorially explosive. With scan, an ATPG (Automatic Test Pattern Generation) tool can directly control any FF's state and observe any FF's captured output, enabling near-100% stuck-at fault coverage with a manageable number of test vectors.

Test flow at production: After packaging, every chip is tested on an ATE (Automated Test Equipment). The scan chain shifts in millions of vectors and compares the shifted-out responses against the fault-free model. Any mismatch → chip fails and is discarded.

DFT also covers: BIST (Built-In Self Test) for memories, boundary scan (JTAG IEEE 1149.1) for board-level interconnect testing, and compression (EDT, WBC) to reduce the number of test vectors while maintaining coverage.
18
Easy Protocols
Explain the AXI4 VALID/READY handshake mechanism. What rule must never be broken?

Every AXI4 channel (AW, W, B, AR, R) uses a two-signal handshake: VALID (driven by the sender) and READY (driven by the receiver). A transfer occurs on the rising clock edge when both VALID and READY are simultaneously HIGH.

Rules:

  • The sender asserts VALID when it has valid data/address to send and must not deassert VALID until the transfer completes (both signals HIGH on a clock edge).
  • The receiver asserts READY when it can accept data. READY may be HIGH before VALID (pre-ready) — this is fine.
  • If VALID is asserted and READY is LOW, both sides wait. Neither can "cancel" the transaction by deasserting VALID without completing the handshake.

The rule that must never be broken: VALID must not combinatorially depend on READY. If the master only asserts VALID after it sees READY, and the slave only asserts READY after it sees VALID, the result is a deadlock — neither ever fires first. READY is allowed to depend on VALID, but not vice versa.

AXI4 has 5 independent channels, enabling key features: a new write address (AW) can be accepted while write data (W) from a previous burst is still in-flight. Read and write transactions are completely independent, maximizing bus utilization.

19
Hard Protocols
How does AXI4 support out-of-order transaction completion? What is the role of transaction IDs?

AXI4 allows a master to issue multiple outstanding read or write transactions before receiving responses. Each transaction is tagged with a Transaction ID (ARID for reads, AWID for writes). The slave and interconnect are free to complete transactions in a different order from how they were issued — a fast SRAM access may return data before a slow DRAM access even if the DRAM request was issued first.

How the master reconciles responses: Read data returns on the R channel with RID matching the original ARID. Write responses return on the B channel with BID matching AWID. The master maintains an outstanding transaction table and uses the ID to match each response to the correct request.

Ordering rule per ID: Transactions with the same ID must complete in order. If a master issues two reads both with ARID=3, the interconnect must return them in order. Transactions with different IDs have no ordering guarantee relative to each other.

Interconnect ID widening: When multiple masters share an interconnect, the fabric appends a master-identifying prefix to each ID (e.g., 2-bit master select + original ARID = extended RID). On the response path, the prefix is used to route the response back to the correct master, which strips the prefix before comparing IDs.

AXI4 vs AXI3: AXI3 allowed interleaved write data (WID specified which burst a W beat belonged to). AXI4 removed WID — write data must always be in the same order as write addresses. This simplification makes interconnects significantly cheaper to implement.
20
Easy Architecture
What is pipelining? What does it improve and what are its trade-offs?

Pipelining divides a long combinational operation into N sequential stages, each separated by flip-flops. Instead of one result every T_total clock period (dictated by the slowest path), you get one result per T_total/N clock period — throughput increases N× once the pipeline is full.

Example: A 5-stage 32-bit multiplier at 500 MHz produces one product every 2 ns. Without pipelining, the same logic would run at 100 MHz (5× slower combinational chain). With pipelining, a new multiply starts every cycle — though each individual result still takes 5 cycles of latency.

What pipelining improves: Throughput (results per unit time) — directly, by allowing clock frequency to be multiplied by the number of stages.

Trade-offs:

  • Latency: Each result takes N cycles to complete instead of 1. This is often acceptable for bulk data, but hurts interactive or latency-sensitive operations.
  • Area: N−1 extra register stages add flip-flop area and routing overhead.
  • Power: More registers switching every cycle; however the lower V_DD enabled by higher-frequency operation may offset this.
  • Hazards: Data hazards (RAW — read after write: a stage needs a result not yet produced by a later stage), control hazards (branches), and structural hazards (resource conflicts) require stalls, forwarding, or branch prediction logic — all of which reduce ideal throughput.
  • Balancing: If one stage is slower than others, it bottlenecks the pipeline. All stages must be balanced to the same worst-case delay for the frequency gain to be fully realized.
01
Easy RTL Design
What is the difference between blocking (=) and non-blocking (<=) assignments in Verilog? When should each be used?

Blocking assignment (=) executes sequentially within an always block — each statement completes before the next begins, exactly like a software assignment. The left-hand side updates immediately.

Non-blocking assignment (<=) evaluates all right-hand sides first (using values from the current time step), then schedules all left-hand side updates to happen simultaneously at the end of the time step. This models the parallel behavior of flip-flops sampling their D inputs on a clock edge.

The golden rules:

  • Use = (blocking) for combinational logic in always @(*) blocks. The sequential evaluation correctly implements the logic function.
  • Use <= (non-blocking) for sequential logic in always @(posedge clk) blocks. The simultaneous update models how FFs all sample their D input on the same clock edge.
  • Never mix both types in the same always block.

Classic bug with the wrong choice: A shift register written with blocking assignments (a = in; b = a; c = b;) immediately propagates the input through all stages in a single clock cycle. With non-blocking (a <= in; b <= a; c <= b;), all three FFs sample their current input simultaneously — correct shift register behavior.

Synthesis impact: Mixing = and <= in a clocked block can produce simulation–synthesis mismatches — the simulator and the synthesized hardware behave differently. This is one of the most common Verilog bugs in job interviews and real designs.
02
Easy RTL Design
What are the differences between wire, reg, and logic in SystemVerilog? Why was logic introduced?

wire (Verilog): a net type representing a physical connection. It can only be driven by continuous assignments (assign) or module output ports. Multiple drivers resolve via wired-AND/OR logic depending on the net type. It cannot hold state.

reg (Verilog): a variable that can be driven inside procedural blocks (always, initial). Despite its name, it does NOT necessarily synthesize to a register — a reg inside always @(*) synthesizes to combinational logic. The name is misleading and a common source of confusion.

logic (SystemVerilog): a unified 4-state variable type that replaces both wire and reg for most use cases. It can be driven by both continuous assignments and procedural blocks. The key restriction: logic allows only one driver — the compiler flags multi-driver errors that wire silently allows. This catches accidental bus conflicts at compile time.

Why logic was introduced:

  • Eliminates the confusing reg misnomer — logic communicates data type, not inferred hardware.
  • Provides compile-time multiple-driver checking that wire lacks.
  • Works in both continuous and procedural contexts, reducing declarations.
Modern SV practice: Use logic for almost everything. Use wire only when you explicitly need multiple drivers (e.g., tri-state buses, wired-AND). Avoid reg entirely in new SystemVerilog code.
03
Medium RTL Design
How do you detect full and empty conditions in a synchronous FIFO? What is the "extra bit" trick?

A synchronous FIFO uses a write pointer (wrptr) and a read pointer (rdptr) to track the head and tail. Both pointers start at 0. When the FIFO is empty, both point to the same location — and when it is completely full, both also point to the same location after wrapping around. This ambiguity is the core challenge of FIFO pointer design.

The naive approach fails: If both pointers are N-bit binary counters with range 0 to DEPTH-1, you cannot distinguish full from empty because both conditions result in wrptr == rdptr.

The extra bit trick: Use N+1 bit pointers, where N = log₂(DEPTH). The lower N bits are the actual memory address; the MSB (the "extra bit") acts as an overflow wrap indicator.

  • Empty: wrptr == rdptr (all N+1 bits equal — same wrap count, same address)
  • Full: wrptr[N-1:0] == rdptr[N-1:0] AND wrptr[N] != rdptr[N] (same address, but one extra wrap ahead)

The MSBs differ when the write pointer has wrapped one more time than the read pointer — meaning the FIFO is exactly DEPTH entries deep.

For async FIFOs: The same trick applies, but the pointers are Gray-coded before being synchronized across clock domains. Gray code changes only one bit per count, making the synchronized pointer off by at most one count — safe for full/empty logic.
04
Medium Timing / STA
What is the difference between a false path and a multi-cycle path in SDC constraints? How do you set each?

False path: a timing path that exists in the netlist but will never carry valid data in the real operating system. STA should completely ignore it — no setup or hold analysis. Examples:

  • Paths between two completely unrelated, never-simultaneously-active clock domains
  • Paths from a test-mode-only mux output that is static during functional operation
  • Reset synchronizer paths where the reset is never timing-critical
  • Paths between scan mode logic not active during functional timing

SDC: set_false_path -from [get_cells launch_ff] -to [get_cells capture_ff]

Multi-cycle path (MCP): a path where data is intentionally designed to take N clock cycles to settle. The designer tells STA to use N×T_clk as the available time for the setup check instead of 1×T_clk.

SDC for a 2-cycle setup path: set_multicycle_path 2 -setup -from ... -to ...

Critical rule for MCP: When you relax setup by N cycles, you MUST also adjust the hold check. By default, STA places the hold check one cycle before the setup capture edge — correct for 1-cycle paths. For a 2-cycle setup, the hold check must also move back one cycle:

SDC: set_multicycle_path 1 -hold -from ... -to ...

Common mistake: Setting set_multicycle_path 2 -setup without the matching -hold exception creates an overly pessimistic hold check one cycle before the new setup capture — often an impossible hold requirement that forces unnecessary delay insertion.
05
Hard Timing / STA
What is OCV (On-Chip Variation)? What is the difference between flat OCV, AOCV, and POCV?

OCV (On-Chip Variation) acknowledges that cells at different locations on the same die do not experience identical conditions. Spatial gradients in temperature, VDD (due to IR drop), and manufacturing process (oxide thickness, doping) cause cells in different parts of the chip to have slightly different delays — even if they are the same cell type running at the same nominal conditions.

This matters for STA because the clock path and data path typically run through physically different areas of the chip. If both paths were derated the same way, the error would cancel. But since one may be faster and the other slower, we must be pessimistic.

Flat OCV (Flat Derating): Applies a single multiplicative derating factor to all cells. The launch data path is made slower (multiply delays by e.g. 1.05) and the capture clock path is made faster (multiply by 0.95) for setup — worst-case pessimism everywhere. Simple but overly conservative.

AOCV (Advanced OCV): Uses a lookup table indexed by path depth (number of logic stages) and distance. Longer paths with more stages average out variation — a 30-stage path has less cell-to-cell variation than a 2-stage path. AOCV assigns less derating to deep paths, reducing pessimism and improving timing convergence without sacrificing accuracy.

POCV (Parametric OCV / LVF): Uses full statistical distributions (mean and sigma) for each cell's delay, propagating uncertainties through the path using statistical addition. This is the most accurate method and is becoming the industry standard at 7nm and below, where AOCV is no longer pessimistic enough.

Qualcomm context: Snapdragon SoCs run at high frequency with tight timing margins. Moving from flat OCV to AOCV/POCV can recover 50–150 ps of setup slack that flat OCV was needlessly consuming — directly enabling higher clock frequencies or lower voltage operation.
06
Hard Timing / STA
What is CRPR (Clock Reconvergence Pessimism Removal)? Why does it matter?

When STA analyzes a flip-flop-to-flip-flop path, the launch clock path (from clock source to FF1) and the capture clock path (from clock source to FF2) often share common clock buffers near the root of the clock tree before they diverge.

With OCV derating, the tool pessimistically applies opposite deratings to the launch and capture paths: the launch clock is made slower (derated up) and the capture clock is made faster (derated down) for setup analysis. But the shared portion of the two paths cannot simultaneously be both slow and fast — it is the same physical cell running at the same moment in time.

CRPR removes this double-counting. For the portion of clock tree that is common to both launch and capture paths, the STA tool calculates how much pessimism was added by applying opposite deratings to the same cells, and adds that amount back as credit. The formula:

CRPR credit = max_delay(common) − min_delay(common)

This credit is added back to the setup slack. Typical CRPR values range from 10 ps to 100 ps depending on how much of the clock tree is shared and how aggressive the OCV derating is.

CRPR is sometimes called CPPR (Common Path Pessimism Removal) — both terms mean the same thing. Modern STA tools (PrimeTime, Tempus) apply it automatically.

Why it matters: Without CRPR, many paths that physically meet timing are flagged as violations. This causes unnecessary engineering effort to "fix" timing that is already correct. Enabling CRPR can reduce the number of failing paths by 20–40% without any design changes.
07
Hard CDC
How do you safely transfer a multi-bit data bus across clock domains? Why can't you just synchronize each bit independently?

Why per-bit synchronization fails: Each bit of the bus passes through its own 2-FF synchronizer independently. Each synchronizer may sample from a different source clock cycle — bit 3 might capture the value from cycle N while bit 0 captures the value from cycle N+1. The destination domain then reads a "torn" word that never existed in the source domain. For a 32-bit bus, this can produce completely wrong data.

Safe techniques for multi-bit CDC:

  • Gray code (for counters/pointers): If the bus is a counter that increments by one at a time, encode it in Gray code before the crossing. Only one bit changes per count, so a sampled-in-transition value is at most off by one — which FIFO logic tolerates.
  • Handshake (req/ack): Source asserts a request (req) after data has been stable for at least one source cycle. Destination synchronizes req (2-FF), samples the data only after req is asserted, then asserts ack. Source deasserts req after seeing synchronized ack. Both req and ack use separate 2-FF synchronizers. Low throughput (takes ~4–6 destination clock cycles per transfer) but works for any arbitrary data.
  • Asynchronous FIFO: For streaming data, use an async FIFO with Gray-coded pointers. The FIFO internally handles all multi-bit CDC safely.
  • Qualified sampling: Source keeps data stable for at least 3 destination clock cycles, then asserts a single "data valid" signal. Destination synchronizes the valid signal and samples the data on the synchronized valid. Risky — relies on the source holding data long enough.
Qualcomm modem chips have dozens of clock domains (CPU, DSP, RF, power management, audio) all communicating across boundaries. Multi-bit CDC handling is one of the most common bug sources in modem SoC development.
08
Medium CDC
What is a pulse synchronizer? When would you use it instead of a 2-FF level synchronizer?

A 2-FF level synchronizer is used when the source signal is a steady level that persists for many source clock cycles. The destination captures it safely after 2 destination clocks.

A pulse synchronizer is needed when the source generates a single-cycle pulse — a signal that is HIGH for exactly one source clock cycle. A 2-FF synchronizer cannot reliably capture this: if the destination clock is slower or at an unfortunate phase, the pulse may be missed entirely.

How a toggle-based pulse synchronizer works:

  • Source domain: A toggle flip-flop converts each incoming pulse into a level change. Every time a pulse arrives, the FF inverts its output. The toggle signal therefore holds its value until the next pulse — making it a persistent level that won't be missed.
  • Clock crossing: The toggle signal crosses the domain via a standard 2-FF synchronizer.
  • Destination domain: An XOR of the synchronized output and its one-cycle-delayed copy detects each edge → generates a clean single-cycle pulse in the destination domain.

Constraint: Source pulses must be spaced at least 3 destination clock cycles apart so the previous toggle has fully propagated through the synchronizer before the next pulse arrives. If pulses can arrive faster, use an async FIFO instead.

Use cases: Interrupt signals, one-shot event notifications, handshake request pulses — any scenario where the source generates a distinct, infrequent event and the destination must detect it exactly once.
09
Medium Low Power
What is UPF (Unified Power Format)? What does it define and why is it needed?

UPF (IEEE 1801) is a standard format for capturing the power intent of a chip design in a separate file that accompanies the RTL. As SoCs moved to multiple power domains, it became impossible to express power management purely in RTL — the RTL describes logical functionality, not which block gets what voltage or when a domain shuts off.

What UPF defines:

  • Supply networks: Which voltage rails exist (VDD_CPU, VDD_MODEM, VDD_AON), their nominal voltages, and how they connect to design blocks.
  • Power domains: Which RTL modules belong to which supply rail. Each domain has a defined primary power supply.
  • Power states: Which domains are ON or OFF in each operating mode (e.g., "sleep mode: modem ON, CPU OFF, AON ON").
  • Isolation cells: Specifies where isolation cells must be inserted at the boundary of power-gatable domains, and what value they should clamp to when the domain is off.
  • Retention registers: Which flip-flops need SRPG (State Retention Power Gating) cells to preserve state across a power-off event.
  • Level shifters: Where voltage-level-shifting cells are needed between domains running at different voltages.
  • Power switches: Header (PMOS) or footer (NMOS) transistors that gate the power supply to a domain.
Qualcomm uses UPF extensively in Snapdragon SoCs. A typical Snapdragon has 10+ power domains (CPU clusters, GPU, DSP, modem, camera, display, always-on). Without UPF, manually inserting and verifying isolation cells, retention registers, and level shifters across hundreds of domain boundaries would be error-prone and unmanageable.
10
Medium Low Power
What are isolation cells and retention registers (SRPG)? When are they required?

Isolation cells are required at the output boundary of any power-gated domain. When a domain's power supply is cut, its flip-flops lose their state and outputs become undefined (float to a random value or X). If an always-on domain receives these floating signals, it may malfunction — latching garbage data, causing spurious state transitions, or drawing excessive short-circuit current.

An isolation cell is inserted on each output net of the power-gated block. It is connected to an always-on supply. When the domain is OFF, the isolation cell clamps the output to a safe known value (typically 0 for AND-based isolation, or 1 for OR-based) as specified in UPF. When the domain is ON, the isolation cell passes the signal through transparently.

Retention registers (SRPG — State Retention Power Gating) are special flip-flop variants with a small "shadow latch" connected to a separate always-on power rail (typically a low-leakage supply). The shadow latch holds only a few transistors, consuming a fraction of the normal FF's leakage.

Operation:

  • Before power-off: The power management controller sends a SAVE signal → each SRPG cell captures its current state into its shadow latch.
  • Domain is off: Main supply cut, shadow latch retains state at very low power.
  • After power-on: A RESTORE signal pushes the shadow state back into the main FF.

Without retention, the block must re-initialize from scratch after every power-up, adding latency and requiring software re-programming of registers.

11
Easy Low Power
What is a voltage island in SoC design? What cells are required at the boundaries?

A voltage island is a physically distinct region of the chip that operates at a different supply voltage from surrounding blocks. By running low-activity blocks at a lower V_DD, dynamic power scales as V², giving dramatic savings — dropping from 1.0V to 0.8V reduces dynamic power by 36%.

Why Qualcomm uses voltage islands: A Snapdragon SoC has very different performance and power requirements across blocks. The modem baseband runs continuously but at moderate frequency. The application CPU cores spike to high performance on demand. The always-on sensor hub must run at <0.7V for weeks on battery. A single supply voltage optimized for the fastest block wastes enormous power in slower blocks.

Required boundary cells:

  • Level shifters (LS): Signals crossing between domains at different voltages must be shifted to the receiving domain's logic levels. A signal from a 0.8V domain HIGH (0.8V) is not guaranteed to be a valid HIGH in a 1.1V domain without level shifting. Level shifters are inserted on every signal crossing.
  • Isolation cells: If the lower-voltage island can be powered off completely, isolation cells (see previous question) are needed to clamp its outputs.
  • Level-shifting isolation cells: Combined cells that both shift voltage and isolate — used at boundaries between always-on and power-gatable domains at different voltages.
Design overhead: Level shifters add area, delay (~10–50 ps), and power. Proper floorplanning ensures domain boundaries are short, minimizing the number of crossing signals and therefore level shifters needed.
12
Easy Physical Design
What is Clock Tree Synthesis (CTS)? What does the tool try to achieve and what comes after it?

Clock Tree Synthesis (CTS) is the physical design step that builds the clock distribution network — a buffered tree that delivers the clock signal from the clock source (PLL output or pad) to every flip-flop's clock pin across the entire chip.

Goals of CTS:

  • Minimize clock skew: Every FF should see the clock edge at (nearly) the same time. Unbalanced trees create skew that consumes setup and hold timing margins.
  • Meet insertion delay target: Total latency from clock source to FF clock pins must be within the budgeted range (typically set in SDC via set_clock_latency).
  • Minimize clock power: The clock network toggles every cycle and can consume 30–40% of total chip dynamic power. The tool balances skew reduction against cell count and wire length.
  • Respect no-touch (NDR) routing rules: Clock nets typically use special Non-Default Routing Rules (wider wires, more spacing, preferred upper metal layers) for reduced resistance and better EM reliability.

Flow position: CTS runs after placement (cell locations are fixed) but before detailed routing. After CTS, timing analysis uses real clock arrival times instead of ideal clock assumptions — hold violations often emerge here because real clock trees have skew that didn't exist in pre-CTS analysis.

Post-CTS hold fixing: After CTS, the timer switches from ideal clock to propagated clock. Paths that were hold-clean with ideal clocks often fail with real clock skew. Hold fixing (inserting delay buffers) is a major post-CTS activity before proceeding to routing.
13
Medium Physical Design
What is IR drop in a VLSI design? How does it affect timing and how do you fix it?

IR drop is the voltage reduction along the power delivery network from the supply pins to the power pins of individual cells. The metal power grid has resistance (R), and the switching current (I) causes a voltage drop V = I × R. A cell operating at V_nominal − ΔV is slower than a cell at the full supply voltage.

Two types:

  • Static IR drop: Average current × grid resistance. Determined by the long-term average switching activity. Used for power integrity sign-off of DC operating point.
  • Dynamic (transient) IR drop: When a large number of cells switch simultaneously (e.g., a wide datapath all clocking at once), the instantaneous current surge exceeds the average. The power grid voltage transiently collapses by a larger amount, limited by the inductance and decoupling capacitance. This "voltage droop" is worse than static IR and is the primary concern at high frequencies.

Effect on timing: In a high-IR-drop region, cells are slower than characterized at nominal voltage. A path that passes STA at nominal conditions may violate setup timing in silicon due to IR-induced delay increase. Hold violations are less common (slower cells improve hold margin).

Fixes:

  • Widen power stripes or add more power mesh layers
  • Add decoupling capacitors (decaps) near high-switching density regions
  • Spread high-activity cells during placement to avoid current hot spots
  • Use power gating with controlled wake-up sequences to avoid simultaneous switching
  • In STA: apply voltage derating in high-IR-drop regions for more accurate sign-off
14
Medium Physical Design
What is the antenna effect in VLSI fabrication? How is it detected and fixed?

During VLSI fabrication, metal layers are deposited and patterned one at a time using plasma etching. Plasma charges accumulate on exposed metal wires during etching. If a long metal wire is already connected to a transistor gate oxide but NOT yet connected to a diffusion region (which would discharge the charge safely), the accumulated charges can create a large voltage across the thin gate oxide — sufficient to cause permanent gate oxide damage: threshold voltage shifts, increased leakage, or immediate breakdown.

The antenna ratio = (metal area of the wire connected to the gate) / (gate oxide area). Process Design Kits (PDKs) specify maximum allowable antenna ratios (typically 400–1000 for metal, 200–600 for vias). Exceeding this ratio means the wire can accumulate enough charge to damage the oxide.

How it's detected: The router's DRC (Design Rule Check) engine computes the cumulative antenna ratio for every net using the partial routing built up layer by layer. If it exceeds the limit, an antenna violation is flagged.

Fixes:

  • Metal jumper (layer hopping): Break the long wire by jumping to a higher metal layer and back. This "resets" the antenna accumulation because higher-layer routing is done later, after diffusion connections have been made. Most common fix.
  • Antenna diode: Insert a reverse-biased diode near the gate, connected to the same metal wire. During plasma etching, the diode provides a discharge path to substrate, preventing charge buildup. Small area cost, always effective.
  • Reduce net length: Re-route the net to use shorter wires on lower layers.
15
Easy Verification / DFT
What is the difference between functional coverage and code coverage? Which is more important?

Code coverage measures how much of the RTL source code was exercised by the simulation:

  • Line/statement coverage: Were all lines of RTL executed?
  • Branch coverage: Were both sides of every if/else and every case arm taken?
  • Toggle coverage: Did every signal toggle both 0→1 and 1→0?
  • FSM coverage: Were all states visited and all transitions taken?

Code coverage is automatically collected by the simulator with no extra specification — easy to get, but tells you nothing about what scenarios were verified. You can hit 100% branch coverage while never testing the most critical protocol corner case.

Functional coverage is user-defined. The verification engineer specifies which scenarios, protocol states, and parameter combinations are important to verify — then measures whether simulations actually exercised them:

  • Was an AXI4 burst of ARLEN=255 (256 beats) issued?
  • Did a FIFO simultaneously receive a write and a read when exactly one slot was free?
  • Did a CDC crossing happen with data changing every source cycle?

Which matters more? Both are necessary; neither alone is sufficient. Code coverage ensures no dead code was accidentally left un-exercised. Functional coverage ensures the right scenarios were tested. A mature sign-off process requires both to be above target (typically 95%+ code coverage, 100% defined functional coverpoints).

16
Medium Verification / DFT
Describe the key components of a UVM (Universal Verification Methodology) testbench. How does it differ from a traditional directed testbench?

UVM (IEEE 1800.2) is a standardized SystemVerilog methodology for building reusable, scalable verification environments using an object-oriented framework. It replaces brittle, one-off directed testbenches.

Key UVM components:

  • uvm_test: Top-level test class. Selects which scenario/sequence to run and configures the environment. Different tests reuse the same TB infrastructure.
  • uvm_env: Container that instantiates and connects agents, scoreboards, and coverage collectors for one DUT.
  • uvm_agent: Models one protocol interface (e.g., AXI4 master). Contains: Driver (applies stimulus to DUT pins), Monitor (observes DUT pins and creates transaction objects), Sequencer (arbitrates between sequences and feeds items to the driver).
  • uvm_sequence / uvm_sequence_item: Defines the actual stimulus transactions. Sequences can be layered (a higher-level sequence calls lower-level sequences) and constrained-random.
  • uvm_scoreboard: Compares DUT output (from monitor) against a reference model's expected output. Reports pass/fail.
  • TLM ports (uvm_analysis_port): Standardized communication channels between components — no direct references between classes.

Vs. directed testbench: A directed testbench hand-codes every stimulus vector — it only tests what the engineer explicitly wrote. A UVM testbench with constrained-random stimulus explores the full stimulus space automatically within user-specified constraints, finding corner cases no human would write by hand.

Coverage-driven verification: UVM enables "close-the-loop" verification: run simulations → check functional coverage → add constraints to target uncovered scenarios → repeat until all coverpoints hit. This replaces guesswork with a systematic, measurable sign-off process.
17
Medium Verification / DFT
What are the main fault models used in ATPG? What does each model test for?

ATPG (Automatic Test Pattern Generation) tools model physical manufacturing defects as logical faults and generate patterns to detect them. The main fault models are:

  • Stuck-At Fault (SAF): A wire is permanently stuck at logic 0 (SA0) or 1 (SA1), regardless of what drives it. Models open circuits, resistive shorts to VDD/GND, and broken connections. The most widely used model. A stuck-at fault is detected by finding a test that excites the fault (drives the opposite value) and propagates the effect to a primary output or scan chain output. Industry target: 95–99% fault coverage.
  • Transition Delay Fault (TDF): Tests whether a net can make a complete 0→1 or 1→0 transition within one clock cycle. Detects resistive defects that don't prevent correct logic levels but slow transitions — critical at high frequency where even a slightly slow net causes a setup violation. TDF requires two-pattern tests: launch the transition, then capture the response one cycle later.
  • Path Delay Fault (PDF): Tests the end-to-end propagation delay of a specific signal path. More accurate timing characterization than TDF — detects accumulated small delays across many gates. Requires many patterns but provides the most complete timing sign-off.
  • Bridging Fault: Models an unintended short between two adjacent nets. A short that combines two signals via wired-AND or wired-OR logic. Increasingly important at 7nm/5nm where metal pitch is very tight and coupling between adjacent wires is a common defect.
  • Cell-Aware Fault: Tests for defects inside standard cells at the transistor level (open/short in the cell's internal netlist). Catches defects that SAF, modeled at the cell's logical interface, would miss.
Qualcomm mobile chips use both SA and transition delay fault testing at production. High volume = even a 0.01% defect escape rate means thousands of field failures. Comprehensive fault coverage is non-negotiable.
18
Medium Protocols
What is MIPI CSI-2? How is it used in mobile SoCs and what are its key electrical characteristics?

CSI-2 (Camera Serial Interface 2) is a MIPI Alliance standard for connecting image sensors to application processors. It is the dominant camera interface in smartphones — virtually every mobile camera uses CSI-2.

Physical layer (D-PHY): CSI-2 uses MIPI D-PHY, a differential serial interface with two operating modes:

  • High-Speed (HS) mode: Low-swing differential signaling (100–300 mV differential) at 80 Mbps to 4.5 Gbps per lane. Used for pixel data transmission.
  • Low-Power (LP) mode: CMOS-level single-ended signaling. Used for control, synchronization, and lane management. Much lower speed.

Architecture: One clock lane + 1 to 4 data lanes. Each lane is a differential pair (DP/DN). For a quad-lane sensor at 4.5 Gbps/lane: total bandwidth = 4 × 4.5 = 18 Gbps — sufficient for 200 MP sensors at full frame rate.

Virtual channels: Up to 4 virtual channel IDs allow multiple cameras to share the same physical CSI-2 interface, multiplexed by the sensor or ISP.

C-PHY (newer alternative): Uses 3-wire "trios" with encoded 3-symbol signaling, achieving 5.7 Gsymbols/s per trio = ~2.28 bits per symbol → higher effective data rate without increasing frequency. Used in high-resolution cameras where D-PHY lane count limits bandwidth.

VLSI implementation: The CSI-2 receiver on a Snapdragon SoC consists of a D-PHY frontend (analog deserializer), a lane merger, a CSI-2 protocol decoder, and an interface to the Image Signal Processor (ISP). It must process pixels faster than they arrive to prevent FIFO overflow — typically 500 MHz+ operating frequency.

01
Easy GPU Architecture
What is SIMD execution in a GPU? What is a warp and what happens during warp divergence?

SIMD (Single Instruction, Multiple Data) means one instruction operates on many data elements in parallel. In a GPU's Streaming Multiprocessor (SM), this is implemented as SIMT — Single Instruction, Multiple Threads: 32 threads are grouped into a warp, and all 32 threads execute the same instruction simultaneously, each on its own private data.

The warp is the fundamental scheduling unit. A single SM can hold many warps in flight (e.g., 64 warps × 32 threads = 2048 threads per SM on H100). When one warp stalls on a memory access, the SM's warp scheduler instantly switches to another ready warp — this is how GPUs hide memory latency through latency hiding rather than large caches.

Warp divergence occurs when threads within a warp take different execution paths — for example, in if (threadIdx.x % 2 == 0), even-numbered threads go one way, odd threads another. The GPU must serialize execution:

  • First execute all threads that took the "true" branch (odd threads masked off)
  • Then execute all threads that took the "false" branch (even threads masked off)

A 50/50 divergent branch halves effective throughput. Nested divergence multiplies the penalty. Minimizing divergence is one of the most important GPU kernel optimization principles.

NVIDIA's term: NVIDIA calls their SIMT execution model "CUDA cores" at the thread level. Each SM has 128 CUDA cores (H100), meaning 128 FP32 operations per clock cycle — 4 warps executing simultaneously.
02
Medium GPU Architecture
Describe the GPU memory hierarchy. What are the latency and bandwidth characteristics at each level?

From fastest/smallest to slowest/largest (H100 as reference):

  • Registers: Per-thread, 65,536 × 32-bit registers per SM. Latency: 0 cycles (operand bypass). Total bandwidth: ~20 TB/s across all SMs. Zero-latency access when available; register pressure causes spill to local memory.
  • Shared Memory / L1 cache: Per-SM, programmer-managed scratchpad, 228 KB per SM (H100), configurable split with L1. Latency: ~4–5 cycles. Bandwidth: ~19 TB/s. Used for inter-thread communication within a thread block.
  • L2 cache: Chip-wide, 50 MB in H100, shared by all 132 SMs. Latency: ~100–200 cycles. Bandwidth: ~12 TB/s. Caches both data and instructions.
  • HBM (device memory): Off-chip stacked DRAM, 80 GB in H100 SXM5. Latency: ~400–800 cycles. Bandwidth: 3.35 TB/s. The primary memory bottleneck for memory-bound kernels.
  • Peer GPU memory (NVLink): Another GPU's HBM accessed via NVLink 4.0. Bandwidth: 900 GB/s bidirectional, cache-coherent. Latency: ~1–2 µs.
  • Host CPU memory: Via PCIe 5.0 x16. Bandwidth: 64 GB/s, Latency: ~5–10 µs. The slowest tier — minimize host↔device transfers.
Key GPU optimization principle: Most AI/HPC workloads are memory-bandwidth-bound at the HBM level. The goal is to maximize arithmetic intensity (FLOPs per byte of HBM access) — this is captured by the Roofline model, which NVIDIA engineers use to analyze and optimize kernel performance.
03
Medium GPU Architecture
What is a shared memory bank conflict in a GPU? How do you detect and avoid it?

GPU shared memory is physically organized into 32 banks, matching the warp width. Each bank can serve exactly one 32-bit access per clock cycle. If multiple threads in the same warp access different addresses within the same bank simultaneously, those accesses are serialized — reducing effective bandwidth proportionally.

Bank mapping rule: Address A maps to bank A % 32. So addresses 0, 128, 256… all map to bank 0; addresses 4, 132, 260… all map to bank 1 (for 4-byte elements).

Classic bank conflict example: A 32×32 matrix stored in shared memory, accessed column-first by a warp. Thread 0 reads M[0][0] (bank 0), Thread 1 reads M[1][0] (bank 32 % 32 = bank 0 also) → 32-way bank conflict → 32× slowdown.

Solutions:

  • Padding: Declare the array as float M[32][33] instead of [32][32]. The extra column shifts each row's bank alignment, breaking the conflict pattern. One of the most common GPU optimization tricks.
  • Access reordering: Restructure the algorithm so consecutive threads access consecutive 4-byte addresses (consecutive banks).
  • Broadcast: If all threads access the same address in one bank, the memory system broadcasts it — no conflict. This is free.
Detection: NVIDIA Nsight Compute profiler reports "shared memory bank conflicts" per kernel. A non-zero count directly indicates throughput loss that can be fixed at the algorithm level.
04
Hard GPU Architecture
What is HBM (High Bandwidth Memory)? Why does NVIDIA use it in data center GPUs instead of GDDR?

HBM (High Bandwidth Memory) is a stacked DRAM architecture. Multiple DRAM dies are stacked vertically using Through-Silicon Vias (TSVs) and connected to the GPU die via a silicon interposer — a passive silicon layer with very fine-pitch connections. The stack sits physically adjacent to the GPU die on the interposer, connected by thousands of short, dense wires rather than long PCB traces.

Why this matters:

  • Massive bus width: HBM3 provides a 1024-bit-wide bus per stack. An H100 with 5 stacks has a 5120-bit total memory bus. Compare to GDDR6X: 16-bit per chip × 24 chips on a high-end gaming card = 384-bit total. HBM is 13× wider.
  • Bandwidth: H100 achieves 3.35 TB/s of HBM3 bandwidth. An RTX 4090 with GDDR6X achieves ~1 TB/s. The H100 is 3.35× faster despite being a smaller number of dies.
  • Power efficiency: Short interconnects (millimeters vs centimeters on PCB) mean much lower capacitance per bit — lower switching energy. HBM typically consumes 50% less power per GB/s than GDDR.
  • Package area: No GDDR chips around the periphery of a large PCB. HBM stacks sit compactly next to the die on the interposer.

Why GDDR wins for gaming GPUs: HBM requires a silicon interposer, which is expensive (2–3× PCB packaging cost). For gaming budgets, GDDR6X offers enough bandwidth at lower cost. HBM's cost is justified only where bandwidth-per-dollar is critical — AI training, HPC, data center.

05
Medium Protocols
What are the key specs of PCIe Gen5? What signal integrity challenges arise at 32 GT/s?

PCIe Gen5 (PCIe 5.0) doubles the per-lane data rate from 16 GT/s (Gen4) to 32 GT/s. Using 128b/130b encoding (2-bit overhead vs 8b/10b's 20% overhead), the effective throughput per lane is ~31 Gbps. A ×16 link delivers 64 GB/s per direction (128 GB/s total bidirectional).

Signal integrity challenges at 32 GT/s:

  • Insertion loss: FR4 PCB material absorbs high-frequency signal energy. At 16 GHz Nyquist (32 GT/s), losses across even short traces become severe. Requires low-loss dielectric materials (Megtron 6, Rogers) or very short trace lengths.
  • Crosstalk: Adjacent differential pairs couple more aggressively at higher frequencies. Tighter guard spacing and reference planes are needed.
  • Equalization demands: Both transmitter and receiver require aggressive equalization — CTLE (Continuous Time Linear Equalization) and DFE (Decision Feedback Equalization) at the receiver, and FIR (finite impulse response) transmitter pre-emphasis. Gen5 standardizes more complex equalization than Gen4.
  • Connector and via design: Even PCB connectors and vias introduce resonant stubs at Gen5 frequencies. Via back-drilling (removing unused via stubs) becomes mandatory.
NVIDIA's H100 SXM5 uses PCIe Gen5 for host connectivity but relies on NVLink 4.0 for GPU-to-GPU communication within a node — because even 64 GB/s per direction is insufficient for the collective operations needed in large AI model training.
06
Hard Protocols
What is NVLink? Why can't PCIe serve GPU-to-GPU communication at the scale of modern AI training?

NVLink is NVIDIA's proprietary high-speed, cache-coherent interconnect for direct GPU-to-GPU communication. NVLink 4.0 (H100) provides 900 GB/s bidirectional bandwidth per GPU across 18 NVLink connections. Compare this to PCIe 5.0 x16: 64 GB/s per direction — NVLink is 14× higher bandwidth.

Why PCIe fails for GPU-to-GPU at scale:

  • Topology: PCIe is a CPU-centric tree topology. GPU-to-GPU data must traverse CPU root complex, adding ~2–5 µs latency and halving effective bandwidth (each hop is bidirectional but the shared root complex is a bottleneck).
  • Bandwidth: A GPT-4 scale model training run performs AllReduce across 8+ GPUs every few hundred milliseconds. Each AllReduce requires each GPU to send and receive its full gradient tensor (~10s of GB). PCIe's 64 GB/s would be saturated; NVLink's 900 GB/s handles it comfortably.
  • No cache coherency in PCIe: PCIe lacks hardware cache coherency between GPUs. Memory copies must be explicit and managed by software/driver. NVLink supports hardware cache coherence — GPU A can read GPU B's memory with the same semantics as its own, dramatically simplifying programming models like NVLink's NVSHM.

NVSwitch: NVIDIA's NVSwitch chip (3.2 Tbps per switch in NVSwitch 3.0) connects 8 GPUs in an all-to-all topology inside a DGX H100 node. Every GPU can communicate with every other GPU at full NVLink bandwidth simultaneously — no head-of-line blocking.

07
Easy RTL Design
What are the three types of pipeline hazards? How is each resolved?

A hazard is a condition that prevents the next instruction from executing in the following clock cycle, threatening to give incorrect results.

1. Structural Hazard: Two instructions need the same hardware resource simultaneously (e.g., both need to write to the register file in the same cycle, but there is only one write port). Resolution: add a second hardware unit (more write ports, more functional units), or stall one instruction.

2. Data Hazard: An instruction depends on the result of a preceding instruction that hasn't yet written its output to the register file.

  • RAW (Read After Write): Most common. Instruction B reads a register before instruction A has written it. Resolution: forwarding/bypassing — route the ALU output directly back to the ALU input without waiting for register writeback. If forwarding can't bridge the gap (e.g., load-use hazard), insert a stall (pipeline bubble).
  • WAW (Write After Write): Two instructions write the same register — later one might arrive first in out-of-order execution. Resolved via register renaming.
  • WAR (Write After Read): Instruction writes a register before an earlier instruction reads it (only in out-of-order). Resolved via register renaming.

3. Control Hazard: A branch changes the program counter, but instructions after the branch have already been fetched. Resolution: branch prediction (speculate the branch outcome, flush on misprediction), delayed branching (execute one instruction after branch regardless — RISC classic), or speculative execution with rollback.

GPU relevance: GPU warp schedulers avoid control hazards by switching warps on every stall rather than speculatively executing. Branch prediction is rarely used in GPU pipelines because the massive number of warps provides enough instruction-level parallelism without it.
08
Medium RTL Design
How do you efficiently implement a wide datapath (e.g., 512-bit ALU or adder) in RTL? What timing and synthesis concerns arise?

Slicing + generate: Break the 512-bit operation into N independent or semi-independent slices and instantiate them with a generate loop. For operations that are fully parallel (bitwise AND, OR, XOR), this gives linear throughput with the number of slices and synthesis has no trouble.

The carry problem for adders: A naive 512-bit binary adder using ripple carry has a critical path through all 512 stages — completely unacceptable. Use a hierarchical adder structure:

  • Carry Lookahead Adder (CLA): Compute generate (G=A&B) and propagate (P=A^B) for each bit, then compute carries in parallel using a tree. Reduces depth from O(N) to O(log N).
  • Carry-Select Adder: Pre-compute two copies of each upper slice (one assuming carry-in=0, one assuming carry-in=1), then mux between them when the actual carry arrives. Trades area for speed.
  • For synthesis: write the adder behaviorally (assign sum = a + b;) and let the synthesis tool select the adder topology from the technology library. Modern tools (DC, Genus) will choose appropriately based on timing constraints.

Additional concerns:

  • Routing congestion: A 512-bit bus creates dense wiring in physical design. The datapath should be placed in a compact rectangular region with proper floorplanning.
  • Retiming: If the datapath is pipelined, enable retiming (set_optimize_registers true in DC) to let the tool rebalance registers across the datapath stages automatically.
  • Operand isolation: Wide datapaths toggle a lot of bits. Gate the inputs when the datapath result is not needed to reduce switching power.
09
Hard Timing / STA
What new timing challenges dominate at 4nm/5nm compared to older nodes? Why is timing closure harder?

At older nodes (28nm+), gate delay dominated wire delay. At 4nm/5nm, the relationship has reversed — wire RC delay now dominates on many paths, making interconnect the primary timing bottleneck.

Key new challenges:

  • Wire resistance explosion: As metal pitch shrinks, wire cross-section shrinks, resistance per unit length increases dramatically. A long metal wire at 5nm can have 5–10× higher resistance than the same wire at 28nm. RC delay scales as R×C — both R and C worsen at smaller nodes.
  • FinFET / GAAFET parasitic capacitance: Gate-to-drain capacitance (Miller capacitance) in FinFET and Gate-All-Around FET structures is proportionally larger, adding input/output loading that didn't exist at planar bulk CMOS.
  • Variation (OCV/POCV): Random dopant fluctuation and gate length variation are larger fractions of the total delay at small nodes. Statistical timing (POCV/LVF) is required rather than flat derating — adding complexity to STA flow.
  • Double/triple patterning constraints: Metal layers below M4 require multi-patterning. This forces additional spacing rules, reducing routing freedom and forcing the router to use longer detours → more wire → worse RC delay.
  • IR drop impact on timing: Higher current density at same power with smaller metal → worse static and dynamic IR drop. Cells in IR-drop hot spots run slower and create timing violations not visible in STA at nominal VDD.
  • Crosstalk aggressor coupling: Tighter metal pitch increases coupling capacitance between adjacent wires. A switching aggressor can add delay (or reduce delay) to a victim net, creating timing violations that only appear with specific data patterns.
10
Medium Physical Design
What is electromigration (EM) in VLSI? How does it constrain metal routing and what are the fixes?

Electromigration (EM) is the gradual displacement of metal atoms caused by momentum transfer from electron flow (high current density). At elevated current density, atoms migrate toward the cathode end, creating voids (open circuits) at the anode and hillocks (shorts) elsewhere — a wearout failure that grows over months or years in the field.

Why it matters for NVIDIA: High-TDP GPUs (300–700W) draw hundreds of amperes. The power distribution network carries enormous currents through metal layers. Clock nets and wide datapath buses also carry high current due to their switching activity.

The constraint: PDKs define maximum DC current density (J_DC, mA/µm²) and RMS current density (J_RMS for AC/switching) per metal layer and via. Routers check these limits during sign-off. Violations require fixes before tape-out.

Common EM weak points:

  • Vias: Single-cut vias have much higher current density than the wire they connect. Via-EM is the most common EM failure mechanism. Fix: use minimum 2 vias wherever current exceeds the single-via limit (the router enforces this via "via doubling" rules).
  • Clock buffers: Clock networks drive large loads at full VDD swing every cycle — highest RMS current in the design. Clock routing uses upper metal layers with wide wires by design.
  • Power grid stripes: Size power stripes based on average current drawn by each domain, with margin.

Fixes: Widen the wire, add parallel wires, add redundant vias, move to upper metal layers (lower resistance, higher J_max), spread high-current logic across more stripes.

11
Hard Physical Design
What are the challenges of delivering 400W to a GPU die? How is the power delivery network (PDN) designed?

A 400W GPU at 0.85V core voltage draws approximately 470 amperes. Delivering this cleanly — without voltage droops that cause functional failures or reliability damage — is a major system and chip design challenge.

Voltage droop: When the GPU transitions from light to heavy workload (e.g., a kernel launch), the current demand steps up in nanoseconds. The PDN inductance (L) resists this fast current change: V_droop = L × dI/dt. A 10 nH package inductance with a 100 A/ns current ramp creates 1 V of instantaneous droop — catastrophic for a 0.85V rail.

PDN design hierarchy — capacitors at three timescales:

  • Die-level decaps (on-chip, ~1–10 ns): Inserted as standard cells in unused routing areas and power domain boundaries. Handle the fastest transients. Limited by available die area. Capacitance: ~10–100 nF total.
  • Package decaps (~10–100 ns): Capacitors embedded in the package substrate or placed as discrete SMDs on the package interposer. Handle intermediate transients. Capacitance: ~100 nF – 10 µF.
  • Board bulk caps (>100 ns): Large ceramic and electrolytic capacitors on the PCB, close to the VRM (Voltage Regulator Module). Handle slow load steps. Capacitance: 100 µF – 1 mF+.

Target PDN impedance: Design the PDN so its impedance Z(f) = V_droop_budget / I_max is flat across all relevant frequencies (DC to ~1 GHz). Resonant peaks in impedance cause amplified droop at those frequencies and must be damped.

Chip-level: Top metal layers (M8–M14 in advanced nodes) are dedicated entirely to power distribution — wide horizontal and vertical stripes forming a mesh. The mesh resistance across the die must be kept below ~0.1 mΩ to limit static IR drop to under 50 mV.

12
Medium CDC
A GPU design has thousands of flip-flops across multiple clock domains. How does CDC verification scale, and what role does formal verification play?

Simulation-based CDC verification does not scale for a GPU-class design. There are too many clock phase combinations, too many data patterns, and too many cycles to simulate to ever encounter all CDC metastability scenarios. Missing a single unsafe synchronizer can cause silent data corruption that only appears under specific workloads in the field.

Formal CDC verification is the industry-standard solution. Tools like Synopsys SpyGlass CDC, Mentor Questa CDC, or Cadence JasperGold CDC analyze the entire netlist structurally and mathematically:

  • Topology analysis: Identify every net that crosses a clock domain boundary (source FF in domain A, destination FF in domain B).
  • Synchronizer recognition: Detect whether each crossing has a valid synchronizer structure (2-FF chain, handshake, async FIFO pointers). Compliant structures are "promoted" to safe.
  • Multi-bit coherency: Flag any multi-bit bus where individual bits are synchronized independently (torn-word risk). Require Gray code, handshake, or FIFO.
  • Reconvergence analysis: Detect where two signals from the same CDC crossing reconverge into the same logic — one through a synchronizer, one not. This is the most dangerous CDC pattern and is hard to find manually.

RTL coding guidelines enforce synchronizer templates that formal tools can recognize. Deviations from approved templates are flagged automatically. Waivers document paths that are safe for non-RTL reasons (e.g., a path only crosses during reset when data is irrelevant).

13
Medium Low Power
What is DVFS (Dynamic Voltage and Frequency Scaling)? How is it implemented in a GPU, and what are the hardware constraints?

DVFS dynamically adjusts both supply voltage (V_DD) and clock frequency (f) based on workload demand and thermal conditions. Since dynamic power scales as P = αCV²f, simultaneously reducing V and f provides cubic power reduction in the ideal case — halving both V and f reduces power by 8×.

The relationship: Maximum safe frequency is approximately proportional to (V_DD − V_th)/V_DD. Lower voltage → lower max frequency. The curve of achievable (V, f) operating points is characterized during silicon bring-up and stored as a VF table (V-F curve).

Hardware implementation in a GPU:

  • On-chip Performance Monitoring Unit (PMU): Monitors SM utilization, power draw, die temperature, and throttle signals every ~1 ms. Decides the target power state (P-state).
  • Multiple P-states: The GPU ships with a defined set of (V, f) operating points — e.g., Base Clock (guaranteed), Boost Clock (sustained under thermal headroom), Max Boost (burst, thermally limited). NVIDIA's GPU Boost algorithm dynamically selects among these.
  • Voltage regulator: An external PMIC or on-package VR changes V_DD on command. Voltage settling takes ~10–50 µs — during this window, frequency must stay within the safe range for the transitioning voltage.
  • PLL reprogramming: The on-chip PLL changes frequency by updating its divider ratios. Must happen after voltage is stable when scaling up, and before voltage drops when scaling down — violating this order can cause timing failures.
Thermal throttling: When die temperature approaches T_junction_max (~90°C for H100), the PMU reduces P-state to stay within thermal limits. This is the primary reason GPU TDP ratings exist — they define the sustainable power the cooling solution must handle to prevent throttling.
14
Medium Verification / DFT
How does formal verification (model checking / property checking) differ from simulation? When is formal preferred?

Simulation checks specific test cases: apply a sequence of inputs, observe the outputs, compare against expected. Even with millions of random vectors (constrained-random UVM), simulation covers only a tiny fraction of the total state space. It proves the design is correct for those specific scenarios, not in general.

Formal verification (property checking / model checking) mathematically proves that a property holds for all possible input sequences and all reachable states — or produces a concrete counterexample. No test vectors needed. Tools like JasperGold (Cadence), VC Formal (Synopsys), and Questa Formal (Siemens) use SAT/BDD solvers to explore the full state space.

When formal is preferred:

  • Safety-critical properties: "The FIFO never overflows", "The arbiter never grants two requestors simultaneously", "A valid handshake always completes within 16 cycles." These must be proven, not just tested.
  • CDC checking: Formal tools exhaustively verify all synchronizer topologies across all clock domains (as described in the previous question).
  • RTL-to-gate equivalence checking: After synthesis, prove the gate-level netlist is functionally identical to the RTL. Catches synthesis tool bugs. Industry standard for ASIC tape-out.
  • Reset/initialization verification: Prove all state elements reach known values after reset sequences.
  • Protocol compliance: Verify that an AXI4 interface implementation correctly follows the spec for all legal sequences of VALID/READY.

Limitation: State space explosion. Complex datapaths (e.g., floating-point units with 52-bit mantissas) have state spaces too large for formal to solve without heroic abstraction. Formal is most powerful on control paths; datapath is verified by simulation.

15
Easy Verification / DFT
What is hardware emulation? Why is it essential for verifying a GPU-scale design?

Hardware emulation compiles RTL onto a large array of FPGAs (or custom emulation processors like Cadence Palladium or Siemens Veloce). The emulated design runs at 1–10 MHz — 100–10,000× faster than RTL simulation. At 5 MHz emulation speed vs 500 Hz simulation speed, a test that takes 1 year in simulation completes in hours on an emulator.

Why emulation is essential for GPUs:

  • Software bring-up: The GPU driver stack, firmware, CUDA runtime, and OS interactions involve billions of transactions and complex state machines. Running real software stacks on the emulator is the only way to validate pre-silicon software behavior. Simulation is simply too slow.
  • Latent bugs: Some bugs only manifest after millions or billions of transactions under realistic workloads — memory coherency races, power state machine errors, firmware edge cases. Emulation can run full AI model training passes on pre-silicon hardware.
  • System-level integration: Connect the emulated GPU RTL to real PCIe hardware and run real-world benchmarks (ResNet training, CUDA programs) to validate system integration months before silicon is available.
  • DFT validation: Run production test patterns on the emulator to validate scan chain behavior, scan shift/capture at-speed before committing to tape-out.
NVIDIA's pre-silicon flow: RTL simulation handles targeted unit-level verification. Emulation handles system-level software bring-up and regression at scale. Then silicon samples validate both at full speed. Emulation dramatically compresses the software readiness timeline so drivers and frameworks are ready on day 1 of silicon availability.
16
Medium Architecture
What is out-of-order execution? How does a Reorder Buffer (ROB) maintain program-order commit?

Out-of-order (OOO) execution allows a CPU to execute instructions in a different order from program order when earlier instructions stall (e.g., on a cache miss), so that later independent instructions can proceed. This improves instruction-level parallelism (ILP) and hides memory latency.

The challenge: Although instructions execute out of order, they must commit (become architecturally visible — update registers and memory) in program order. If an instruction causes an exception or a branch misprediction is detected, all subsequent instructions must be discarded as if they never executed. This requires the ability to "undo" out-of-order execution.

The Reorder Buffer (ROB) solves this:

  • A circular buffer that tracks all in-flight instructions in program order. Each entry stores the instruction, its destination register, its result (when computed), and its status (executing / done / excepted).
  • Instructions are allocated (in order) at dispatch and deallocated (committed) only from the head — always in program order.
  • An instruction commits when: it is at the ROB head AND its result is ready AND no exception occurred. Commitment writes the result to the architectural register file.
  • On misprediction/exception: The ROB is flushed from the faulting instruction to the tail — all results are discarded, the architectural state is restored to the last committed state. Execution resumes from the correct PC.

Register renaming: Architecturally, there may be 32 registers. The OOO engine maps these to a larger physical register file (hundreds of registers). This eliminates WAW and WAR hazards by giving each instruction its own private physical register for its result — multiple "versions" of the same architectural register can be in-flight simultaneously.

01
Easy Protocols
What are the three main AMBA bus standards — APB, AHB, and AXI4? When is each used in an SoC?

APB (Advanced Peripheral Bus) is the simplest — non-pipelined, low-speed, no burst support. Designed for low-bandwidth peripherals: UARTs, timers, GPIOs, I2C controllers. Transfers take minimum two cycles. Minimal area overhead makes it ideal for the "leaf" nodes at the edge of the SoC.

AHB (Advanced High-performance Bus) is pipelined with burst support. Address is presented one cycle before data. Used for mid-tier peripherals: DMA controllers, USB, internal SRAM, Ethernet MACs. AHB-Lite (single master) is the standard form used in Cortex-M-based systems.

AXI4 (Advanced eXtensible Interface 4) is the highest-performance AMBA bus. Separate read/write address and data channels support out-of-order transactions and up to 256-beat bursts. Used for high-bandwidth masters: CPU cores, GPU, DRAM controllers, DMA engines. AXI4-Lite (no bursts) and AXI4-Stream (no addressing) are simplified variants.

  • Rule of thumb: AXI for processors and high-BW masters; AHB for mid-BW SoC subsystems; APB for peripheral registers needing under 1 MB/s.
  • Standard ARM bridges (AXI-to-AHB, AHB-to-APB) connect the protocol layers. A typical SoC has one AXI fabric at the center, with AHB and APB subtrees hanging off it.
ARM interview context: If asked to design an SoC interconnect, always answer with AXI at the core, AHB for mid-tier, APB for peripherals — this mirrors the CoreLink NIC-400 / NIC-450 architecture ARM ships as interconnect IP.
02
Medium Protocols
Explain the AXI4 five-channel handshake. What enables outstanding transactions and how is per-ID ordering maintained?

AXI4 has five independent channels, each with its own VALID/READY handshake:

  • AW (Write Address): Master presents AWADDR, AWID, AWLEN, AWSIZE, AWBURST. Handshake completes when both AWVALID and AWREADY are high.
  • W (Write Data): Master sends data beats with WDATA, WSTRB, WLAST. Independent of AW — data can be issued before or after the address.
  • B (Write Response): Slave returns BRESP and BID after accepting all write data. Master asserts BREADY.
  • AR (Read Address): Master presents ARADDR, ARID, ARLEN. Slave asserts ARREADY.
  • R (Read Data): Slave returns RDATA, RRESP, RLAST, RID. Master asserts RREADY.

Outstanding transactions: Because address and data channels are decoupled, a master can issue multiple addresses before receiving responses. The ARID/AWID tags identify which response belongs to which transaction. An interconnect can have N simultaneous in-flight transactions (N = ID space depth).

Ordering rule: Transactions with the same ID must complete in order. Transactions with different IDs can complete out of order, allowing the interconnect to service faster-responding slaves first without corrupting per-source ordering.

Common mistake: Confusing AXI4 (full, 256-beat bursts, ID-based ordering) with AXI4-Lite (no bursts, single beat, no interleaving) and AXI4-Stream (unidirectional, no addressing). ARM uses all three variants in different IP blocks.
03
Hard Protocols
What is AMBA ACE? How does it extend AXI4 to maintain cache coherency across Cortex-A cores?

ACE (AXI Coherency Extensions) extends AXI4 with three additional channels and coherent transaction types to maintain hardware cache coherency between CPU clusters connected to a shared interconnect (CCI-400, CCI-500).

Three additional ACE channels:

  • AC (Snoop Address): Interconnect sends to a cache master — "do you have address X cached, and is it dirty?"
  • CR (Snoop Response): Cache master replies — found/not-found, clean/dirty, passed/invalidated.
  • CD (Snoop Data): If the snooped cache has a dirty line, it sends data back on CD. The interconnect forwards it to the requester or writes it back to memory.

Coherent transaction types: ACE adds new transaction types (ReadShared, ReadUnique, CleanUnique, MakeUnique, WriteBack) specifying what cache state the requester needs — enabling the interconnect to orchestrate the correct snoop sequence.

AxDOMAIN field: Each transaction specifies its shareability domain (Non-shareable, Inner-shareable, Outer-shareable, System). The interconnect only snoops agents within the specified domain, avoiding unnecessary broadcast snoops.

ACE-Lite: For non-caching masters (DMA engines) that need coherent reads without holding cache lines, ARM defines ACE-Lite — adds coherent read transaction types on the AR/R channels but omits the AC/CR/CD snoop channels (non-caching masters can't be snooped).
04
Hard Protocols
What is AMBA CHI (Coherent Hub Interface)? How does it differ from ACE and what makes it more scalable?

CHI (Coherent Hub Interface), introduced in AMBA 5, is ARM's next-generation coherent protocol for high core-count SoCs (8–64+ cores). Used in CoreLink CMN-600/CMN-700 and Neoverse N1/N2 server-class CPUs.

Key differences from ACE:

  • Packet-based vs. channel-based: ACE extends AXI4 channels on a shared-bus/crossbar topology. CHI uses a packet-based, link-layer protocol over a point-to-point mesh network with credit-based flow control — no shared bus bottleneck.
  • Distributed Home Node (HN): Manages coherency per address range and holds a snoop filter (which nodes have each line). Snoops go only to nodes that actually have the line — no broadcast snooping.
  • Peer-to-peer data: When a snoop hits a dirty line, the snooped node returns data directly to the requester — bypassing the HN for data, reducing latency compared to ACE's centralized data path.
  • Node types: RN-F (fully coherent, CPU cluster), RN-I (I/O coherent, GPU/DMA), SN-F (slave, DRAM controller), HN-F (home node with snoop filter).

ACE scales well to 4–8 cores with a central CCI. CHI's mesh topology scales to 64+ cores — critical for ARM's Neoverse infrastructure CPUs targeting 64–96 core server chips.

05
Easy Architecture
What are the key RISC design principles? How does the ARM ISA embody them and what are the tradeoffs vs. CISC?

RISC principles as embodied in ARM:

  • Load/Store architecture: Only LOAD and STORE access memory. All arithmetic operates only on registers — no "add [memory], reg" like x86. Simplifies the pipeline: execution stage always operates on register values, never needing a memory read mid-instruction.
  • Fixed instruction width: Classic ARM uses 32-bit fixed-width instructions (Thumb: 16-bit). Fixed width simplifies fetch/decode — the pipeline always knows where the next instruction starts.
  • Large register file: 16 GPRs in AArch32, 31 in AArch64. More registers reduce memory spills — data stays in fast registers longer, improving compiler output.
  • Simple addressing modes: Fewer, more regular modes than x86 — simpler decode logic, lower decode energy.
  • Single-cycle execution (originally): Most instructions complete in one clock cycle, enabling simple in-order pipelines and accurate power modeling.

vs. CISC: CISC (x86) has complex multi-cycle instructions — smaller code size per program. Modern x86 CPUs decode CISC to micro-ops internally (recovering RISC pipeline efficiency). The difference is now mainly at the ISA level: ARM is simpler to verify, lower decode energy, historically better power efficiency — why ARM dominates mobile and is winning in servers.

AArch64 (ARMv8+) changes: Removed conditional execution on most instructions (a classic ARM quirk), expanded to 31 GPRs, dropped Thumb complexity. Cleaner ISA that's easier to build high-performance OOO pipelines on without the architectural legacy baggage.
06
Medium Architecture
What is ARM TrustZone? How does the NS bit propagate through the memory system and what hardware enforces it?

TrustZone partitions the SoC into two worlds: Secure World (trusted OS, cryptographic keys, secure boot) and Normal World (Android/Linux, untrusted apps). The CPU operates in either world; transitions go through the Secure Monitor (EL3 in AArch64) via the SMC instruction.

The NS bit: Every AMBA transaction carries NS via ARPROT[1] / AWPROT[1]. Set by the CPU based on current security world:

  • NS=0 (Secure): can access both Secure and Non-Secure resources.
  • NS=1 (Normal): can only access Non-Secure resources.

Hardware enforcement:

  • TZASC: Between interconnect and DRAM. Secure software programs it to mark address ranges as Secure or Non-Secure. Blocks NS=1 transactions from Secure DRAM — returns a bus error.
  • TZPC: Controls per-peripheral access. Marks peripherals as Secure-only; Normal world accesses are rejected at the bus level.
  • Caches: Lines carry the NS bit. NS=1 accesses cannot hit NS=0 cache lines — treated as a miss, preventing secure data leakage to Normal world.
Cortex-M TrustZone (ARMv8-M): Uses SAU (Security Attribution Unit) instead of TZASC. Defines Secure, Non-Secure, and NSC (Non-Secure Callable) regions. NSC regions are the controlled entry points that allow Normal world code to safely call Secure world functions via ARM's veneer mechanism.
07
Medium Architecture
What is NEON and what is SVE in ARM? What are the hardware design implications of supporting each?

NEON is ARM's fixed-width 128-bit SIMD extension, mandatory in AArch64. Processes 4×32-bit floats, 8×16-bit integers, etc. in parallel. Uses V0–V31 (128-bit vector registers, shared with the FP register file). Accelerates media codecs, image processing, signal processing, and early ML inference.

SVE (Scalable Vector Extension), introduced in ARMv8.2, solves NEON's fixed-width limitation. SVE vectors can be any multiple of 128 bits. The ISA is width-agnostic — the same binary runs on a 128-bit SVE core and a 512-bit SVE core without recompilation. The CPU exposes its vector length via a system register; loops query it at runtime to set strides correctly.

Hardware design implications:

  • Datapath width: A 512-bit SVE unit requires 512-bit-wide ALUs and register file ports — 4× the area and power of a 128-bit NEON unit. Significant die area and power budget impact.
  • Predicate registers: SVE adds 16 predicate registers (P0–P15), each (VL/8) bits wide — 1 bit per byte. At 512-bit SVE: 64-bit predicates × 16 = 128 bytes of predicate register file. Enables per-element masking without branch-based loop tails.
  • SVE2 (ARMv9): Adds gather/scatter, complex arithmetic, feeds SME (Scalable Matrix Extension) for native matrix multiply — targeting AI workloads directly in the CPU pipeline.
Fujitsu A64FX: First SVE implementation with 512-bit vectors. Powers the Fugaku supercomputer (Top500 #1 in 2020). Proves SVE scales to HPC workloads without rewriting software for each vector width — the key architectural advantage over fixed-width SIMD.
08
Easy RTL Design
What are the ARM memory barrier instructions — DMB, DSB, and ISB? What hardware mechanism does each control?

Out-of-order CPUs and write-combining memory systems do not guarantee memory operations complete in program order. Memory barriers force ordering constraints required by concurrent software or hardware interaction.

DMB (Data Memory Barrier): Ensures all memory accesses before DMB are visible to the memory system before any accesses after it. Does not stall instruction execution — subsequent non-memory instructions proceed. Use case: write data to a shared buffer, then DMB, then set the "data ready" flag.

DSB (Data Synchronization Barrier): Stronger. Stalls instruction fetch until all outstanding memory transactions — including cache maintenance operations and TLB invalidates — complete. Required before modifying page tables, enabling caches, or switching address spaces. Always precedes an ISB when system registers change.

ISB (Instruction Synchronization Barrier): Flushes the instruction pipeline. Guarantees instructions fetched after ISB use updated processor state — new exception vectors, MMU configuration, system register values. Always follows a DSB when writing to system registers that affect fetch behavior (SCTLR_EL1, VBAR_EL1, TTBR0_EL1).

Hardware cost: The CPU's load/store unit drains outstanding memory transactions — write buffer empty, store queue drained, ACE/CHI coherency acknowledgments received. On a loaded memory system, DSB can stall the pipeline hundreds of cycles — use sparingly at actual synchronization points only.

09
Medium RTL Design
How do you implement a glitch-free clock multiplexer in RTL? What specific problem does it solve in ARM Cortex-M SoCs?

A combinational mux on a clock signal produces glitches — partial pulses when select changes while the clock is active. Even a single glitch can corrupt flip-flop state. A glitch-free mux must switch between two clock sources without producing any pulse shorter than a full clock period.

Standard 4-FF implementation:

  • Two parallel flip-flop paths — one clocked by CLK_A, one by CLK_B.
  • The select signal is ANDed with a feedback term (inverted grant of the other source) before being clocked by each source. Creates a mutual-exclusion handshake: one source deasserts before the other asserts.
  • Outputs drive a simple OR gate. Only one input can be HIGH at a time — no overlap, no glitch.
  • Switch latency: 2–3 cycles of each clock domain for the handshake to complete.

In Cortex-M SoCs: DVFS requires switching between a PLL (high-frequency, active mode) and an RC oscillator (low-frequency, sleep mode) without glitches. During DFT scan, switching to the test clock must happen glitch-free — a partial pulse on the scan clock corrupts the shift chain state. ARM's Reference Methodology documents the exact RTL template and the STA false-path annotations for the switching window.

STA treatment: The mux output is asynchronous to both inputs during the switch. Use set_false_path -from clk_sel_ff or set_clock_groups -asynchronous to suppress STA violations on the select synchronization path — the bounded switch latency is correct by design.
10
Hard RTL Design
Describe the Cortex-M exception entry sequence. What hardware state is saved automatically and why exactly those registers?

When a Cortex-M CPU takes an exception, hardware automatically stacks 8 registers before jumping to the handler — allowing handlers to be written as plain C functions with no special prologue/epilogue.

Automatically saved (pushed to stack):

  • R0, R1, R2, R3: ARM AAPCS caller-saved argument/scratch registers. Exception handlers call sub-functions that may clobber these — saving them lets the interrupted code resume correctly.
  • R12 (IP): Intra-procedure scratch register — also caller-saved, freely clobbered by function calls.
  • LR (R14): Link register — the interrupted function's return address. Must be saved so the handler can make function calls without corrupting the interrupted context's return path.
  • PC (R15): Address of the instruction that would have executed next — restored on exception return to resume interrupted code.
  • xPSR: Program Status Register — contains APSR flags (N/Z/C/V), thumb state bit, and active exception number. Must be restored to return to the same execution state.

Why exactly these 8? The AAPCS defines R0–R3 and R12 as caller-saved. By saving exactly caller-saved registers + PC/LR/xPSR, hardware guarantees the handler can use R0–R3/R12 freely and make sub-function calls normally. R4–R11 are callee-saved — if the handler uses them, the C compiler saves/restores them automatically in the function prologue/epilogue.

FPU lazy stacking: If FPU is enabled, hardware reserves stack space for S0–S15 + FPSCR immediately but defers actually pushing them until the handler first executes an FP instruction. This eliminates FP save overhead when the handler doesn't use the FPU — the common case.
11
Medium Timing / STA
What is a multi-cycle path exception in STA? When should you apply one and what are the risks?

By default, STA assumes every combinational path between flip-flops has one clock cycle. A multi-cycle path (MCP) exception tells the tool a specific path legitimately has more cycles:

  • set_multicycle_path 2 -setup -from FF_A -to FF_B — setup check uses 2 cycles. Data launched at cycle N must be valid by N+2.
  • A matching hold adjustment is almost always required: set_multicycle_path 1 -hold on the same path — shifts the hold check back to prevent impossibly tight hold requirements.

When to apply:

  • Paths that physically cannot meet single-cycle timing (deep combinational trees: large muxes, iterative dividers) in designs where increasing frequency isn't required.
  • Datapath stages deliberately using 2 cycles — pipeline registers omitted for area savings when throughput allows.
  • Interfaces that present valid data only every N cycles by protocol definition — the downstream FF only samples every N cycles by design.

Risks:

  • Wrong application: An MCP on a path that runs every cycle silently hides a real timing violation — a serious tape-out risk.
  • RTL/SDC divergence: Design intent must be consistent between RTL and SDC. Divergence is a common cause of post-silicon failures that pass timing sign-off.
ARM IP SDC files: ARM's IP blocks ship with pre-characterized SDC including MCP exceptions for their internal pipeline stages. Never modify these without ARM support — they reflect internal microarchitectural constraints not visible in the encrypted netlist.
12
Hard Timing / STA
How does ARM deliver timing constraints with its hard IP? What are the typical SDC exceptions needed to integrate a Cortex-A core?

ARM delivers hard IP (Cortex-A78, X2, Neoverse V1, etc.) as encrypted DEF/GDSII macros with characterized Liberty (.lib) timing models — the macro is a timing black box with accurate I/O pin arcs.

What the integrator receives:

  • Liberty files for each PVT corner (SS/TT/FF × -40/25/125°C) — model all interface pin timing arcs.
  • SDC constraints filecreate_clock for macro-boundary clocks, MCP exceptions for AXI/APB interfaces not running every cycle, false paths for async reset trees and internal CDC crossings.
  • MCMM constraints — separate SDC for functional, scan, and MBIST operating modes.

Typical SoC-level exceptions needed:

  • Async reset false path: set_false_path -to [get_pins ...nreset...] — NRESET is asynchronous, no setup check applicable.
  • AXI interface CDC false paths: AXI between core clock and SoC interconnect clock is synchronized inside the macro. STA sees apparent CDC crossing paths — must be false-pathed at SoC level to avoid thousands of bogus violations.
  • ACLKEN multi-cycle paths: If AXI clock enable throttles the interface to every N cycles, matching MCPs must be set on all AXI data paths.
  • Scan mode false paths: set_false_path -from scan_enable suppresses violations during scan shift analysis.
Critical rule: Never run STA on an ARM hard macro with only a create_clock. Without the ARM-provided SDC, you'll see thousands of false violations — real violations become invisible in the noise. Always source the ARM SDC before adding SoC-level constraints.
13
Hard CDC
In a quad-core Cortex-A cluster, how does the MESI cache coherency protocol work? Walk through a dirty-line snoop scenario.

Each L1 cache line is in one of four MESI states: Modified (dirty, owned exclusively), Exclusive (clean, owned exclusively), Shared (clean, may be in other caches), or Invalid. The CCI manages coherency across cores.

Dirty-line snoop scenario:

  • Core 0 writes address 0x4000 → Core 0's L1 line is Modified (dirty). DRAM is stale.
  • Core 1 reads 0x4000. Issues a ReadShared request to the CCI.
  • CCI's snoop filter knows Core 0 has 0x4000 Modified. CCI sends a snoop (CleanShared) on AC channel to Core 0.
  • Core 0 responds on CR: "found, dirty." Sends dirty data on CD channel back to CCI (or L2). Transitions its copy Modified → Shared.
  • CCI forwards clean data to Core 1. Core 1's L1 → Shared. DRAM is now up-to-date.
  • Both cores have clean Shared copies. A subsequent write by either core requires another coherency transaction (MakeUnique) before writing.

CDC angle: In multi-frequency ARM clusters, the AC/CR/CD snoop channels cross clock domain boundaries between the CCI clock and each core's private clock domain. ARM's ACE protocol uses VALID/READY handshaking on each channel to handle these crossings safely — snoop response timing is bounded by the protocol, not by a fixed cycle count.

14
Medium Low Power
What is ARM big.LITTLE? How does task migration between efficiency and performance cores reduce power consumption?

big.LITTLE (2011, Cortex-A15/A7) pairs a high-performance "big" core (e.g., Cortex-A78) with a power-efficient "LITTLE" core (e.g., Cortex-A55) on the same die. Both implement the same ISA — software runs unchanged on either. The OS scheduler migrates tasks based on workload demand.

Power reduction mechanism:

  • A background task on Cortex-A78 at 2.8 GHz might consume 1.5 W. The same task on Cortex-A55 at 1.0 GHz might consume 0.15 W — 10× less power — with only 2–3× lower performance. Excellent tradeoff for background work.
  • Since P ∝ CV²f, the LITTLE core has lower V (lower supply voltage), lower f, and smaller C (simpler microarchitecture = fewer transistors switching). Savings multiply across all three terms.
  • Cache coherency requirement: Migration requires the task to see the same memory state on the new core. ACE/CCI ensures L1 cache state is visible across clusters — no software cache flushes needed on migration.

Scheduler: Linux EAS (Energy Aware Scheduler) models energy cost of each task at each frequency on each core type and assigns to minimize energy while meeting performance deadlines.

Limitation: Original big.LITTLE ran tasks on either big OR LITTLE — not both simultaneously in one cluster. DynamIQ (2017) fixed this by placing all core types in one cluster with a shared L3, eliminating the CCI bridge overhead and enabling true per-core DVFS.
15
Hard Low Power
What is ARM DynamIQ? How does it improve on big.LITTLE and what does it enable for heterogeneous SoC design?

DynamIQ (2017, starting with Cortex-A75/A55) places all CPU cores — big and LITTLE — in a single cluster sharing one L3 cache via the DSU (DynamIQ Shared Unit), rather than separate clusters with a CCI bridge.

Key improvements over big.LITTLE:

  • Unified L3 cache: All cores share the same L3 directly. Coherency no longer crosses a CCI bridge — uses the simpler, lower-latency DSU fabric. Inter-core communication latency drops ~30–40%.
  • Per-core power domains: In big.LITTLE, all cores in a cluster shared one voltage rail. DynamIQ allows each core its own power domain — LITTLE core at 0.6V/0.8 GHz simultaneously with a big core at 0.9V/2.8 GHz. True per-core DVFS.
  • Flexible core mix: Up to 8 cores in any combination — 1×X2 + 3×A78 + 4×A55 in one cluster. How Qualcomm Snapdragon's Prime+Performance+Efficiency topology is built.
  • NPU integration: DynamIQ clusters connect to ARM Ethos NPU via the DSU fabric — low-latency data sharing between CPU and NPU for ML inference workloads.

DSU hardware: ARM delivers the DSU as parameterizable RTL — configurable for core count, L3 size, power domain count, and PMU event counters. The DSU also hosts the GIC (interrupt controller) interface and the CoreSight external debug interface. SoC integrators configure it at synthesis time; the OS reads DSU registers at runtime to understand cluster topology.

16
Medium Verification / DFT
What is ARM CoreSight? What debug components does it define and how is it used during SoC bring-up?

CoreSight is ARM's standardized debug and trace architecture — hardware IP components providing non-invasive visibility into a running SoC. Built into every Cortex-A/M/R core, it is the foundation for all JTAG-based debugging of ARM chips.

Key CoreSight components:

  • DAP (Debug Access Port): Entry point for an external debugger. Connects JTAG or SWD (Serial Wire Debug — ARM's 2-pin alternative to 5-pin JTAG) to the internal APB debug bus. Every ARM chip exposes at least one DAP.
  • CTI (Cross Trigger Interface): Routes trigger signals between debug components. A breakpoint on Core 0 can simultaneously trigger trace capture on Core 1 — enabling synchronized multi-core debugging and cross-trigger halt.
  • ETM (Embedded Trace Macrocell): Per-core instruction trace — records every executed instruction and its address. Streams to the trace bus for offline analysis. Used for profiling, coverage measurement, and reproducing timing-dependent bugs.
  • STM (System Trace Macrocell): Software-generated trace via memory-mapped registers. Firmware writes messages to STM ports; they appear in the trace stream alongside ETM hardware traces — mixing printf-style output with hardware instruction flow.
  • ETF / TPIU: ETF buffers trace on-chip for post-mortem capture after a crash. TPIU streams trace off-chip to a host analyzer via a parallel trace port.

SoC bring-up use: On first silicon, engineers connect an ARM DSTREAM probe to the JTAG/SWD port. CoreSight allows setting breakpoints in the ROM bootloader, reading/writing memory and registers, single-stepping through reset sequences, and verifying memory-map initialization — all before the silicon can boot. CoreSight is the first debug tool used after power-on, before even the first successful boot.

1
MediumRTL Design
What is the difference between blocking (=) and non-blocking (<=) assignments in Verilog? Why should you never mix them in the same always block?

Blocking (=): Executes sequentially within the current time step — each line completes before the next begins, like C code. Used in combinational always_comb blocks.

Non-blocking (<=): Schedules the RHS evaluation immediately but defers the LHS update until the end of the current time step. All flops in the block sample their old values simultaneously — matching real flip-flop behavior where all Q outputs change after the clock edge.

Why never mix: Mixing creates order-dependent, simulation-synthesis mismatches. A shift register written with blocking in a clocked block simulates as a wire (each stage immediately sees the updated value of the previous), while synthesis infers the correct flip-flop chain. The simulation passes but silicon is wrong.

Golden rules: always_ff @(posedge clk) → always use <=. always_comb → always use =. Never mix in one block.

2
HardRTL Design
How do you design a parameterized round-robin arbiter in SystemVerilog? What fairness guarantee does it provide?

A round-robin arbiter uses a rotating priority mask. After granting requestor N, the mask disables requestors 0..N for the next cycle, giving priority to N+1..last. When no masked request exists, it falls back to unmasked priority (wrap-around).

module rr_arb #(parameter N=4)(
  input  logic        clk, rst_n,
  input  logic [N-1:0] req,
  output logic [N-1:0] gnt
);
  logic [N-1:0] mask;
  logic [N-1:0] masked_req;

  assign masked_req = req & mask;

  // Grant lowest active bit within mask; fallback to unmasked
  assign gnt = masked_req  ? (masked_req  & (~masked_req  + 1))
                           : (req         & (~req          + 1));

  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n) mask <= '1;
    else if (|gnt) mask <= ~((gnt << 1) - 1); // rotate past granted bit
endmodule

Fairness guarantee: Every active requestor is served within N consecutive grant cycles — no requestor can be starved as long as it holds its request. Under fully loaded conditions all N requestors share bandwidth equally.

3
MediumRTL Design
Name five common RTL coding mistakes that cause simulation-synthesis mismatch. How does AMD's RTL style guide prevent them?
  • Incomplete sensitivity list: Missing signals in always @(...) make simulation miss updates. Fix: use always_comb — tool auto-completes the list.
  • Unintentional latch inference: An if/case without a complete else/default infers a latch in synthesis, but simulation sees the last assigned value. Fix: always assign every output in every branch; use always_comb which flags latches as errors.
  • Blocking in clocked blocks: A shift register with = simulates as a wire, synthesizes as flops. Fix: always use <= in always_ff.
  • Uninitialized registers: Verilog initializes regs to X; synthesis assumes 0 or 1. RTL simulation passes; gate-level simulation reveals X-propagation failures. Fix: always include a synchronous or asynchronous reset.
  • Implicit truncation: Assigning a wide bus to a narrow reg silently drops MSBs in both sim and synth — but the intent is wrong. Fix: use explicit width casts and enable lint truncation warnings.

AMD RTL style: mandate always_comb/always_ff, run SpyGlass/Jasper lint gates as a pre-synthesis check, and require sign-off on zero lint errors before RTL freeze.

4
MediumStatic Timing
Walk through the setup slack calculation for a flip-flop to flip-flop path. What clock and data path terms are involved?

Setup Slack = Data Required Time − Data Arrival Time

Data Arrival Time (how late the data gets to the capture flop's D pin):

= Launch clock edge + Clock source latency + Launch clock network delay + Tco (launch flop clock-to-Q) + Combinational logic delay + Net delay to capture D

Data Required Time (latest the data can arrive and still be captured):

= Capture clock edge + Clock source latency + Capture clock network delay − Tsetup (capture flop setup time)

Clock skew = capture clock arrival − launch clock arrival. Positive skew (capture arrives later) helps setup; negative skew hurts it.

Positive slack → timing met. Negative slack → setup violation — the critical path. STA tools report the worst negative slack (WNS) and total negative slack (TNS) per corner. AMD Zen sign-off runs SS (slow-slow) corner at worst-case voltage/temperature for setup, and FF (fast-fast) for hold.

5
HardStatic Timing
What is OCV (On-Chip Variation) and how does POCV improve on it? Why did AMD move to POCV for 7nm and below?

OCV (On-Chip Variation): At advanced nodes, identical cells at different physical locations experience different threshold voltage (Vt), temperature, and IR drop — making their delays vary. OCV models this by applying a flat derate factor to the entire launch path (slow it) and capture path (speed it), creating a pessimistic worst-case margin.

Problem with flat OCV: A short path with 2 gates gets the same derate as a 50-gate critical path. Statistically, variation averages out over many stages — a 50-gate path has far less total variation than a 2-gate path. Flat OCV is massively over-pessimistic for long paths, forcing unnecessary ECO iterations and area/power overhead.

POCV (Parametric OCV): Applies a statistical model where variation reduces with path length (central limit theorem — more stages → more averaging). Each cell gets a sensitivity coefficient derived from Monte Carlo silicon characterization. Long paths get less derate; short paths get more. POCV is accurate rather than merely pessimistic.

AOCV (Advanced OCV): A simpler intermediate — uses lookup tables indexed by path depth and physical distance. Less accurate than full POCV but simpler to apply.

At TSMC 7nm/5nm (Zen 3/4), POCV recovers 5–15% timing margin vs flat OCV — directly translating to lower voltage guardbands, enabling higher frequency or lower power at the same frequency. AMD uses Synopsys PrimeTime with POCV (CRPR + POCV derates) for Zen 4 sign-off.

6
MediumCDC
A 4-bit data bus crosses from a 1 GHz domain to a 200 MHz domain. Why can't you use four independent 2-FF synchronizers? What is the correct approach?

A 2-FF synchronizer handles metastability for a single bit. When applied to a 4-bit bus independently, each bit settles to its synchronized value in a different cycle — the capture domain may sample 2 bits from the new value and 2 from the old, forming a spurious intermediate that never existed in the source domain.

Example: Bus transitions from 0b0111 (7) to 0b1000 (8). Bit[3] synchronizes one cycle early; bits[2:0] synchronize one cycle later. Destination briefly sees 0b1111 (15) — a value that never occurred in the source.

Correct approaches:

  • Gray coding: If the bus is a counter, encode it as Gray code (only 1 bit changes per step). Pass through a 2-FF sync. Decode on receive side. This is the basis of async FIFO pointer crossing.
  • Handshake: Assert a request signal, wait for acknowledge synchronized back. Only then is the data sampled. Safe for any data, but one transaction per round-trip latency.
  • Async FIFO: For streaming data, use an async FIFO with Gray-coded read/write pointers. Standard solution for a continuous data stream crossing clock domains.

At 1 GHz → 200 MHz (5:1 ratio), the receive domain has 5 source cycles per receive cycle — ample metastability resolution time, but data integrity across the boundary still requires one of the above protocols.

7
HardCDC
What is a reconvergence fanout hazard in CDC analysis? Describe a scenario where individual synchronizers pass but the system behavior is wrong.

Reconvergence hazard: A single source signal fans out to two registers (R1, R2) in the source domain. Both cross to the destination via separate 2-FF synchronizers. In the destination domain, logic combines the synchronized R1' and R2' signals.

Because R1 and R2 may have different routing delays, they can be captured in different source clock cycles. After synchronization, R1' and R2' may represent different "generations" of the originating signal — a combination that never simultaneously existed in the source domain.

Concrete example: Source generates a 2-bit state: valid=1, data=0xAB. Valid crosses through sync path A (2 cycles). Data crosses through sync path B (3 cycles). Destination sees valid=1 while data still holds the old value — it processes stale data with a valid flag, causing a silent data corruption.

Fix:

  • Synchronize a single control signal (valid/enable) and use it to capture all associated data simultaneously in the destination domain using a load register.
  • Move reconvergent logic entirely into the source domain before crossing.
  • Use an async FIFO which safely bundles all associated bits into one crossing.

CDC tools (Questa CDC, Meridian, Jasper CDC) specifically flag reconvergence paths because 2-FF synchronizers alone cannot fix this class of bug.

8
MediumLow Power
What is the difference between clock gating and power gating? When is each used, and what RTL pattern enables automatic clock gate insertion?

Clock gating: Stops the clock to a register bank when its data won't change. An Integrated Clock Gating (ICG) cell ANDs the enable signal with CLK on the falling edge (latch-based to prevent glitches). Supply rail remains on — state is preserved. Wake-up is instantaneous (next clock edge). Eliminates dynamic power (P = C·V²·f·α — α drops to 0 for gated registers). Used for microsecond-to-millisecond idle periods.

RTL pattern automatically recognized by synthesis:

always_ff @(posedge clk)
  if (enable) q <= d;  // synthesis tool inserts ICG automatically

Power gating: Cuts the VDD supply rail entirely to a block using a PMOS header or NMOS footer switch cell. Eliminates both dynamic and leakage power. State is lost unless retention flip-flops (with separate always-on supply) are used. Wake-up requires power rail ramp + state restore — typically 1–100 µs penalty. Used for blocks idle for long durations.

When to use each: At 5nm, leakage can exceed dynamic power for idle blocks. For fine-grained, frequent idle (display pipeline idle for one frame): clock gate. For coarse-grained, infrequent idle (USB controller during airplane mode): power gate. AMD Zen uses both — clock gating per execution unit, power gating per core and per CCD.

9
HardLow Power
AMD's Zen 4 achieves high IPC with competitive power efficiency. What are three RTL micro-architectural techniques that reduce power without sacrificing performance?
  • Operand isolation: Gate the inputs to a functional unit (multiplier, FP adder) when its output won't be consumed. Even with the clock running, propagating switching activity through a 64-bit multiplier dissipates significant power. A simple AND gate on inputs with an enable signal prevents this. Zen's FP units are operand-isolated when the scheduler has no ready FP micro-ops.
  • Fine-grained per-register clock gating: Rather than gating an entire register file, gate individual rows or entries. Zen uses per-physical-register clock gating so that inactive rename table entries and retirement slots consume zero dynamic power — critical in a 320-entry ROB where most entries are idle each cycle.
  • Bus-invert encoding: Wide internal buses (128-bit operand buses, 256-bit cache read data) toggle frequently. Bus-invert encoding checks if the Hamming distance between the current and previous bus value exceeds N/2 — if so, invert the bus and set an invert bit. The receiver reinverts. This halves the worst-case toggle count and reduces average bus switching by 20–35%. AMD applies this on the L1↔L2 data bus in Zen cores.
10
MediumVerification / DFT
What is scan compression? How does it reduce test application time, and what are its trade-offs?

Standard scan chains connect all flip-flops end-to-end. Test time = (flops/chains) × patterns × 2 (shift in + shift out). For a 50M-flop AMD Zen CCD with 200 scan chains and 100K patterns — this is weeks of tester time at high cost per second.

Scan compression inserts a decompressor/compressor network between the external tester and the internal scan chains. A small number of external channels (e.g., 16) drive many more internal chains (e.g., 3200) through an LFSR-based decompressor. Responses are compressed through a MISR (Multiple Input Signature Register) before going to the tester.

Compression ratio = internal chains / external channels. Typical: 50:1 to 200:1. A 200:1 ratio reduces test time and tester data by ~200×. AMD uses Synopsys DFT Compiler with compression ratios of 100–150:1 across Zen designs.

Trade-offs:

  • Fault coverage slightly reduces (compressed stimulus can't fully control every flop independently — called "care bits" limitation)
  • X-sources (unknowns from uninitialized memories, analog blocks) corrupt the MISR — must be masked or controlled during DFT planning
  • Area overhead: 2–3% for compressor/decompressor logic — worthwhile given the 100–200× test time savings
11
MediumVerification / DFT
Explain the IEEE 1149.1 JTAG standard. What mandatory registers must every compliant chip implement, and how does AMD use JTAG in silicon bring-up?

JTAG defines a 4-pin serial interface: TCK (clock), TMS (mode select), TDI (data in), TDO (data out). A 16-state TAP (Test Access Port) controller state machine manages all operations.

Mandatory registers:

  • Instruction Register (IR): Selects the active data register. Typically 4–8 bits wide. Loaded via the Shift-IR state.
  • Bypass Register: Single bit. When selected, TDI connects directly to TDO — allows a chip in a daisy-chain to be bypassed without shifting through its full scan chain.
  • Boundary Scan Register (BSR): One cell per I/O pin. EXTEST mode drives/captures pin values for board-level continuity testing. INTEST mode tests internal logic.
  • IDCODE Register: 32-bit device identification (JEDEC manufacturer ID, part number, version). Read automatically after TAP reset — the first thing a bring-up engineer checks.

AMD bring-up use: On first silicon of a new Zen or RDNA die, engineers connect a JTAG probe before the chip can boot. JTAG allows loading bootcode into internal SRAM, reading/writing CSRs (control/status registers), single-stepping through the reset sequence, and verifying memory-map initialization — all before the first successful boot. It is the primary debug entry point on day one of silicon validation.

12
HardProtocols
AMD's Infinity Fabric connects compute chiplets to the I/O die. What physical and protocol-level challenges arise in die-to-die interconnect design?

Physical challenges:

  • Bump pitch bandwidth limit: Micro-bumps (~40 µm pitch on EMIB, ~9 µm on TSMC SoIC) limit how many signal wires cross the die boundary. Monolithic on-chip metals can be <1 µm pitch — die-to-die bandwidth per mm of perimeter is severely constrained.
  • Signal integrity: Package traces have higher resistance, capacitance, and inductance than on-chip metals. Multi-Gbps Infinity Fabric links require equalization (CTLE/DFE) and careful impedance matching.
  • Latency: A cross-chiplet cache miss on Infinity Fabric adds ~50–100 ns versus ~1–5 ns for on-die L3 access. This affects cache architecture decisions (EPYC's large L3 per CCD reduces cross-chiplet traffic).
  • Power per bit: Die-to-die PHYs consume ~1–2 pJ/bit versus ~0.05 pJ/bit for on-chip wires. Total fabric power is a key design constraint.

Protocol challenges:

  • Cache coherence across die boundary: MOESI protocol messages (probes, responses, data) must traverse Infinity Fabric. The I/O die acts as the coherence home node — every cross-CCD cache miss touches it.
  • Credit-based flow control: High, non-deterministic fabric latency requires credit-based flow control so the transmitter never overflows the receiver's buffers without per-cycle handshaking.
  • CRC per flit: On-chip wires are effectively error-free. Die-to-die links need per-flit CRC for error detection, with link-level retry for corrections.

AMD addressed these with XGMI protocol layered over a UCIe-compatible PHY, and a directory-based coherence home in the I/O die for Zen 4 / EPYC Genoa.

13
MediumProtocols
Explain the PCIe TLP model. What is the difference between a Memory Write, Memory Read, and Completion transaction? How does the credit system prevent overflow?

PCIe is a packet-based, full-duplex serial bus. The Transaction Layer creates TLPs (Transaction Layer Packets) for all host-device communication. TLPs carry a 12 or 16-byte header plus optional data payload.

  • Memory Write (MWr) — posted: The initiator sends the TLP (address + data) and immediately moves on. No completion is returned. Highest throughput — used for DMA streaming from GPU to host. "Posted" means the initiator trusts delivery without acknowledgment.
  • Memory Read (MRd) — non-posted: Initiator sends a request TLP (address + length). The target responds with a Completion with Data (CplD) TLP. Latency-bound — initiator waits. Outstanding requests tracked by a tag field (up to 256 tags in PCIe 4.0). Used for register reads, configuration access.
  • Completion (Cpl/CplD): Response to a non-posted request. Contains the data and a status field indicating success or error (CA/UR).

Credit system: Each receiver advertises its buffer capacity in credits (1 credit = 1 header or 4 DW of data). Three independent credit pools: Posted (P), Non-Posted (NP), Completion (Cpl). The transmitter tracks available credits and never sends a TLP that would overflow the receiver. Credits are returned as the receiver's buffers drain. This removes per-cycle handshaking — the link runs at full bandwidth as long as credits are available.

14
MediumArchitecture
AMD's Zen uses separate compute chiplets (CCD) and an I/O die (IOD). What are the design advantages over a monolithic die, and what new verification challenges does it introduce?

Advantages:

  • Yield: A 5nm CCD is ~80 mm². A monolithic 16-core 5nm die would be ~500+ mm² with catastrophically low yield. Multiple small, high-yield CCDs combined via packaging achieve equivalent compute with far better economics.
  • Process node mixing: Compute logic (CCDs) uses 5nm — fast, low leakage. I/O, memory PHY, and analog (IOD) use 6nm/12nm — cheaper and more reliable for analog. Monolithic designs must compromise.
  • Reuse and modularity: The same CCD is used across Ryzen, EPYC, and Threadripper. Scale core count by adding CCDs. NRE cost is amortized across millions of units.
  • Competitive time-to-market: CCD and IOD can be taped out in parallel and validated independently before integration.

Verification challenges:

  • Interface complexity: Infinity Fabric D2D must be verified at block level (UVM) and as a full system. Latency modeling across the fabric must be cycle-accurate.
  • Cross-die cache coherence: MOESI correctness across the die boundary, including simultaneous probe + local write corner cases.
  • Power-on sequencing: CCD and IOD power up separately; link training and initialization order must be verified and fault-tolerant.
  • Full-system simulation cost: Cycle-accurate RTL simulation of 8 CCDs + IOD is intractable — AMD uses FPGA emulation and architectural simulators for system-level validation.
15
HardArchitecture
Explain the MOESI cache coherence protocol. How does the Owned (O) state improve on MESI, and how does AMD apply MOESI across Zen's multi-core hierarchy?

MESI states: Modified (dirty, exclusive) → Exclusive (clean, exclusive) → Shared (clean, multiple copies) → Invalid.

MESI problem: When a line is in Modified state and another core requests a read, the owner must write back to DRAM (slow — full DRAM latency), then supply the data. Two memory transactions per shared read.

MOESI adds the Owned (O) state: A dirty line that is being shared. The Owner holds the authoritative (dirty) copy without writing back to DRAM. Other sharers hold Shared (S) copies. The Owner supplies data directly to new requestors cache-to-cache — L3 speed, not DRAM speed. The Owner is responsible for eventual write-back on eviction.

MOESI advantage: Eliminates the mandatory write-back on a shared read, replacing DRAM-latency write-back with a fast cache-to-cache transfer. In producer-consumer workloads (one core writes, several read), DRAM traffic drops dramatically.

AMD Zen implementation: MOESI operates within a CCD's L1/L2/L3 hierarchy — cores on the same CCD share a 32 MB L3 (Zen 4), so intra-CCD coherence is handled by the L3 snoop filter. Cross-CCD coherence over Infinity Fabric uses a directory-based extension of MOESI, with the I/O die's system management unit acting as the coherence home node. Cross-CCD MOESI adds ~50–100 ns to the miss penalty versus <5 ns for intra-CCD L3 hit.

16
MediumArchitecture
What is out-of-order (OoO) execution? What is the Reorder Buffer (ROB), and how does it enable precise exception handling?

Out-of-order execution allows a processor to execute instructions in data-ready order rather than program order, hiding latency. A multiply waiting on a cache miss is bypassed by later independent instructions — the pipeline stays busy.

Key OoO structures:

  • Frontend: Fetch → Decode → Rename (map architectural registers to a larger physical register file, eliminating WAR/WAW hazards)
  • Scheduler (Reservation Station): Holds micro-ops waiting for source operands. Dispatches to execution units the moment all sources are ready — this is where "out-of-order" happens.
  • Execution units: ALU, FP, load/store — execute out of program order based on readiness.
  • ROB (Reorder Buffer): Circular buffer holding all in-flight micro-ops in program order. Instructions enter at dispatch and retire in order from the ROB head.

Precise exceptions via ROB: When an instruction causes an exception (page fault, divide-by-zero), the processor must appear as if all prior instructions completed and no later instructions executed. The ROB makes this precise: at the exception point, the ROB head identifies the faulting instruction in program order. All ROB entries before it are committed (register file and memory updated in order). All entries after it are flushed (results discarded, physical registers freed). The machine state exactly matches the program-order point of the fault — the OS can handle it and restart cleanly.

Zen 4 ROB: 320 entries deep, supporting up to 320 micro-ops in flight simultaneously — a key contributor to high IPC on memory-latency-bound server workloads in EPYC Genoa.

1
HardPhysical Design
What are the key considerations in floorplanning? How do you decide macro placement?

Floorplanning establishes the physical die outline and positions macros (memories, IP blocks) before standard cell placement. Key considerations:

Aspect ratio: Determined by cell area estimate plus 15–25% whitespace for routing. A square die minimises wire length.

Macro placement: Memories and hard IPs are pushed to die edges/corners so the remaining core area is a contiguous rectangle for standard cells. Critical-path macros are placed close together. Timing-driven floorplanning uses estimated wire lengths from the netlist to guide placement.

Pin placement: I/O pins are aligned to the data-flow direction to avoid internal feedthroughs that congest routing.

Power planning: VDD/VSS stripes are planned during floorplanning — horizontal stripes on one metal layer, vertical on the next — running over and around macros.

Halos and blockages: A halo (keep-out margin) around each macro prevents standard cells from being placed in congested macro boundaries. Hard blockages prevent routing in sensitive regions.

2
MediumDFT
Explain scan chain insertion. What changes occur between functional mode and shift mode?

Scan chain insertion replaces standard flip-flops with scan-equivalent cells that add a scan-data input (SI) and a scan-enable (SE) control pin.

Functional mode (SE=0): The scan MUX selects the normal data input D. The circuit operates identically to the original design.

Shift mode (SE=1): The scan MUX selects SI. All flip-flops in a chain form a serial shift register — test vectors are shifted in from scan-in (SCI) and responses shifted out from scan-out (SCO).

Insertion flow: (1) Identify scan-eligible flip-flops (excludes clock-domain boundary FFs, async cells). (2) Replace with scan cells from the tech library. (3) Stitch into balanced chains to equalise shift time. (4) Add dedicated scan-in/scan-out ports (or reuse functional I/Os). DFT compilers (Synopsys DFT Compiler, Cadence Modus) automate this. Chain balance is critical — a long chain increases test application time linearly.

3
MediumSTA
What causes a hold violation? How is it fixed at the physical design stage?

A hold violation means data changes too quickly — it arrives at the capture flop's D input before the hold window closes after the capture clock edge.

Condition: Data_arrival_time < Capture_clock_arrival + t_hold

Root cause post-CTS: Before CTS, ideal clocks assume zero skew. After CTS, if the launch clock arrives later than the capture clock (positive skew toward launch), data launches later and arrives at capture earlier relative to the capture edge — tightening the hold margin.

Fixes:

  • Insert delay buffers (DEL cells or BUF chains) on the data path between the launch and capture flops to slow data arrival.
  • Use higher-drive-strength buffers that inherently have longer propagation delay.
  • Unlike setup violations, hold violations cannot be fixed by slowing the clock — they require actual path delay increase.

ECO insertion of delay cells is the standard physical design fix.

4
HardPhysical Design
What is the difference between static and dynamic IR drop? How are violations fixed?

IR drop is the voltage loss along power grid resistance (V = I × R), reducing supply voltage at standard cells and increasing their delay.

Static IR drop: Computed using average (DC) current draw. Identifies chronic hotspots where the power grid is permanently stressed during normal operation. Analysis is fast and runs early in the design flow. Fix: add wider or additional metal stripes, add vias between layers to reduce resistance.

Dynamic IR drop: Peak instantaneous current from simultaneous switching events (SSO — Simultaneous Switching Output). Much higher than static but short in duration. Requires transient simulation using VCD/SAIF activity files to generate current waveforms. Fix: add decoupling capacitors (decaps) near switching hotspots to supply local charge during peaks; redistribute clock domain switching to spread activity in time.

Signoff: Tools like Cadence Voltus and Synopsys RedHawk annotate per-cell voltages back into STA (voltage-aware timing) so IR-induced timing violations are caught at signoff.

5
HardDFT
How does ATPG work? What is the difference between stuck-at and transition fault models?

ATPG (Automatic Test Pattern Generation) creates test vectors that expose manufacturing defects by driving a distinguishable response on an observable output.

Stuck-at fault model (SAF): Assumes a signal net is permanently stuck at logic 0 (SA0) or 1 (SA1). To detect SA1 on net N: (1) Activate — force net N to logic 0 by choosing input values. (2) Propagate — sensitise a path from N to a primary output so the wrong value is observable. Covers 95–98% of real defects at ≥90nm. Fast to generate and simulate.

Transition fault model (TF): Assumes a gate output transitions too slowly (slow-to-rise or slow-to-fall). Critical at sub-28nm where timing defects dominate over stuck-at. Tests are applied using launch-on-shift (LOS) or launch-on-capture (LOC) protocol — the pattern is launched and captured within a functional clock cycle, checking that the transition completes on time.

Path-delay faults: Test entire combinational paths end-to-end for timing violations. Most comprehensive but pattern count is large. Used in automotive-grade (AEC-Q100) and safety-critical (ISO 26262) parts.

6
MediumPhysical Design
What are the goals of Clock Tree Synthesis? Define skew and insertion delay.

CTS builds the clock distribution network from the clock source to all sequential elements (flip-flops, latches, macros).

Clock skew: The difference in arrival time between the earliest-arriving and latest-arriving clock at any two registers. High skew consumes timing margin on both setup (capture arrives too late) and hold (capture arrives too early). Post-CTS skew target is typically <50–100 ps.

Insertion delay: Total delay from the clock source to the clock pin of a leaf register. Large insertion delay adds to the cycle time overhead seen by the PLL. Minimised by using clock buffers with high drive strength and short wire lengths.

CTS goals:

  • Minimise skew across all clock sinks of a domain.
  • Minimise insertion delay to preserve timing budget.
  • Maintain fast slew (transition time) at clock pins — slow slew increases dynamic power and setup uncertainty.
  • Stay within power budget — clock network is typically 30–40% of chip dynamic power.

Tools (Cadence Innovus, Synopsys ICC2) use H-tree or mesh topologies with cells from a dedicated clock cell library.

7
MediumDFT
What is BIST? Compare LBIST and MBIST and explain when each is used.

BIST (Built-In Self-Test) embeds test circuitry on-chip so the device can test itself without an external tester.

LBIST (Logic BIST): Tests combinational and sequential logic. A PRPG/LFSR (pseudo-random pattern generator) produces stimulus internally. Responses are compressed by a MISR (Multiple-Input Signature Register) into a single signature, compared against a golden value. No external tester is needed after manufacturing — ideal for in-system test during field operation. Used in automotive (ISO 26262) and aerospace for periodic self-check. Limitation: lower fault coverage than ATPG because patterns are pseudo-random.

MBIST (Memory BIST): Tests embedded SRAMs, register files, and ROM arrays. Implements standard memory test algorithms — MARCH C−, MATS+, or Checkerboard — that write and read specific data patterns to detect stuck-at, transition, coupling, and data-retention faults. ATPG cannot model the internal fault models of array structures, making MBIST mandatory for high-density embedded memories.

In practice, most production chips include both: LBIST for logic and MBIST for each embedded memory instance, all controllable via JTAG.

8
HardSTA
What is OCV (On-Chip Variation)? How is it modeled using derating factors in STA?

OCV (On-Chip Variation) accounts for spatial variation in process, voltage, and temperature across a single die. Two transistors at opposite corners of the chip can differ in speed by 10–20% even from the same wafer.

Flat OCV modeling: Apply a derating factor to all cells on each path:

  • Setup: launch clock path × 1.1 (late), capture clock path × 0.9 (early), data path × 1.1 (late).
  • Hold: launch × 0.9 (early), capture × 1.1 (late).

This is pessimistic because spatially adjacent cells on the same path are correlated — they will not simultaneously see worst-case and best-case conditions.

AOCV (Advanced OCV): Uses depth/distance lookup tables. Short paths (few logic levels, cells close together) are highly correlated → lower derating. Long paths with many independent stages → higher derating.

POCV (Parametric OCV): Characterises each cell's delay as a statistical distribution (mean + σ). STA combines path delays using σ_total = √(Σσᵢ²) (independent stages) or σ_total = Σσᵢ (fully correlated). Recovers 10–20% of timing margin versus flat OCV while remaining statistically justified. Required at 7nm and below.

9
HardPhysical Design
What is crosstalk? How does it cause delay violations, and how is it mitigated?

Crosstalk occurs when a switching net (aggressor) induces a voltage disturbance on an adjacent net (victim) via parasitic coupling capacitance (Cc) between parallel wires.

Crosstalk noise: Glitch on a static victim net. If the glitch crosses the switching threshold it can cause functional failure in combinational logic.

Crosstalk delay: On a transitioning victim, an aggressor switching in the same direction speeds the victim up (beneficial). An aggressor switching in the opposite direction slows the victim down — this is the dangerous case for setup timing (increases data path delay) and can also worsen hold (decreases data path delay if victim speeds up).

Mitigation:

  • Shield critical nets by inserting VDD/VSS wires between aggressor and victim.
  • Increase wire spacing (reduces Cc ∝ 1/spacing).
  • Downsize the aggressor driver to reduce dV/dt.
  • Insert buffers to break long victim nets into shorter segments.
  • Change routing layer or direction — wires crossing at 90° have minimal coupling.

Tools: Cadence Voltus-Fi, Synopsys PrimeTime SI perform crosstalk analysis using extracted parasitics (SPEF).

10
HardDFT
What is scan compression (EDT)? How does it reduce test time and data volume?

Full-scan test data volume grows as O(N × P) — N scan cells × P patterns. A modern 28nm chip with 10M scan cells and 100K patterns generates terabytes of test data — impractical for ATE (Automatic Test Equipment) storage and tester time.

EDT (Embedded Deterministic Test / Synopsys): Achieves 50–100× compression.

Architecture:

  • Decompressor: An LFSR-based network takes a narrow external scan channel (e.g., 32 ATE pins) and expands it into many wide internal scan chains (e.g., 3200 chains). Care bits (fully specified bits needed for a specific fault) are encoded into LFSR seeds; the rest are pseudo-random fill.
  • Compactor: An XOR network (space compactor) compresses all internal chain outputs back to a narrow external scan-out channel.

Result: 100× fewer ATE channels and 100× less test application time while maintaining equivalent stuck-at coverage, because the LFSR seed encodes all deterministic care bits. The tool solves for the LFSR seed that simultaneously satisfies all care-bit constraints for a given pattern.

11
MediumRTL Design
What is clock gating? How is it implemented in RTL and why does it save power?

Clock gating inserts a gating cell between the clock source and a register's clock pin. When the gate enable is low, the clock is suppressed — the flip-flop holds state without toggling, eliminating its dynamic switching power.

Power savings: Dynamic power P = α × C × V² × f. Clock gating reduces α (activity factor) for idle registers to zero. Since the clock network consumes 30–40% of dynamic power, even partial gating yields large savings.

ICG cell (Integrated Clock Gate): A latch-based gate that samples the enable on the low phase of the clock and produces a glitch-free gated output. The latch prevents enable glitches from propagating to the clock pin (which would cause spurious register captures).

RTL inference:

always @(posedge clk)
  if (en) q <= d;  // synthesis tool infers ICG

Avoid assign clk_g = clk & en; — this creates a glitchy gated clock. Let the synthesis tool infer the ICG from the conditional register enable. Modern tools insert ICGs automatically when the clock_gating_style attribute is set.

12
MediumDFT
Explain boundary scan (JTAG IEEE 1149.1). What are the key TAP controller states?

Boundary scan (IEEE 1149.1) provides access to chip I/O pins without physical probing — essential for board-level testing of BGA packages where pins are inaccessible.

Each I/O pin has a Boundary Scan Cell (BSC): a MUX that passes functional data in normal mode or routes scan-shift data during test.

TAP (Test Access Port): A 4-pin interface — TCK (test clock), TMS (test mode select), TDI (data in), TDO (data out). The TAP controller is a 16-state FSM driven by TMS. Key states:

  • Test-Logic-Reset: All test logic is inactive; chip operates normally.
  • Shift-DR / Shift-IR: Data or instruction register is shifted serially via TDI→TDO.
  • Capture-DR: The boundary scan register captures the live pin state.
  • Update-DR: Latches the shifted data to drive pins or internal logic.

Key instructions: BYPASS (1-bit pass-through to shorten chain), SAMPLE/PRELOAD (capture I/O without disturbing function), EXTEST (drive/observe boundary cells for board-level test), INTEST (test internal logic).

JTAG is also the transport for ARM CoreSight debug, RISC-V debug, and IEEE 1687 (iJTAG) for embedded instrument access.

13
MediumPhysical Design
What is electromigration (EM)? How does it affect reliability and what triggers violations?

Electromigration is the gradual displacement of metal atoms in a conductor caused by momentum transfer from electrons to lattice ions at high current density. Over time, this creates voids (open circuits) and hillocks (short circuits) in metal interconnect.

Black's equation: MTTF = A × j−n × e(Ea/kT), where j is current density, Ea is activation energy (~0.9 eV for Cu), T is temperature, and n ≈ 2. EM lifetime degrades rapidly with increasing current density and temperature.

Violation trigger: A wire's RMS or peak current density exceeds the PDK-specified limit for its width and length. Power/ground stripes and clock nets are most susceptible due to high sustained current.

Fixes:

  • Widen the metal wire (reduces current density for the same current).
  • Add parallel metal stripes to distribute current.
  • Add vias (stacked vias lower via EM risk which is often the bottleneck).
  • Reduce switching activity on the violating signal net.

EM sign-off is performed by tools like Synopsys PrimeRail or Cadence Voltus using extracted parasitics and activity-based current profiles.

14
MediumSTA
What is a set_multicycle_path constraint? Give a concrete example with correct -hold pairing.

A multicycle path (MCP) constraint relaxes the setup check so data may propagate across N clock cycles rather than 1. Used when a combinational path is too slow to meet a single-cycle constraint and the design intentionally uses multiple cycles (qualified by a valid/enable signal in RTL).

Example: A 32-bit multiplier takes 3 ns in a 2 ns clock design:

set_multicycle_path -setup 2 \
  -from [get_cells mult_reg_A] \
  -to   [get_cells mult_result_reg]

set_multicycle_path -hold 1 \
  -from [get_cells mult_reg_A] \
  -to   [get_cells mult_result_reg]

What happens: -setup 2 moves the setup check from cycle 1 to cycle 2 (opens a full extra cycle of margin). Without the paired -hold 1, STA moves the hold check to cycle 1 (default), which becomes extremely tight. The -hold 1 keeps the hold check at 1 cycle before the new setup edge, preventing over-constraining hold. Always pair -setup N with -hold N−1 for correct MCP analysis.

15
MediumSTA
What is a false path? How does it differ from a multicycle path?

A false path is a topologically valid timing path that can never be sensitised during actual circuit operation. STA would otherwise pessimistically flag it as a violation.

Common cases:

  • Asynchronous reset MUX output to flop data pin — reset is asserted asynchronously; the combinational path from the reset MUX is never a functional data path.
  • Paths between unrelated clock domains where a CDC synchronizer already handles the crossing.
  • Scan data paths during functional mode.
  • One-time configuration registers only written at power-on initialisation.

SDC: set_false_path -from [get_clocks clk_a] -to [get_clocks clk_b]

Key difference from MCP:

  • set_multicycle_path: relaxes the timing check by N cycles but still performs the check. The path is real; it just needs more time.
  • set_false_path: removes the path from analysis entirely. Over-using it masks real timing problems — apply conservatively and always document the reason in the constraints file.
16
HardPower
What is the purpose of a power grid? How do IR drop analysis tools identify and fix violations?

The power grid is the network of metal stripes (VDD and VSS) that distributes supply voltage from chip pads to every standard cell and macro with minimal resistive loss.

Analysis flow:

  1. Extract parasitic resistance of all power grid metals (SPEF or internal extraction).
  2. Assign current sources at each cell using activity factor × Imax from the liberty model (or VCD/SAIF for dynamic).
  3. Solve the resistive network using a matrix solver (SPICE-accurate or faster linear solver in RedHawk/Voltus).
  4. Compute voltage at every grid node; flag nodes where V < VDD − ΔV_budget (typically 5–10% of VDD).

Fixes:

  • Add wider or additional metal stripes in congested current-density regions.
  • Add vias between metal layers to reduce via resistance (often the bottleneck).
  • Insert decoupling capacitors (decaps) near switching hotspots for dynamic IR.
  • Add a power ring around macros or add additional power rails through the macro halo.

Post-fix, voltage annotations are fed back to PrimeTime for voltage-aware timing signoff, ensuring IR-induced timing degradation is correctly modelled.

1
HardSTA
How do you analyse and optimise a critical setup-time path?

Setup check: Data must arrive before the capture clock edge minus setup margin: T_launch + T_comb ≤ T_capture − t_setup

Identification: Use report_timing -max_paths 10 -path_type full in PrimeTime. Examine: data arrival time, required time, negative slack (WNS/TNS), and the logic levels on the critical path.

Optimisation techniques:

  • Cell sizing: Replace HVT (slow, low-leakage) cells with SVT or LVT on the critical path — LVT is typically 20–30% faster but has higher leakage. Apply selectively only to failing cells to control leakage budget.
  • Buffering: Insert buffers to break long wires with high RC delay. A single long wire is slower than two shorter wired segments with a buffer.
  • Retiming: Move registers across combinational logic to rebalance stage depths — shift a register forward to absorb logic from the critical stage into a lighter adjacent stage.
  • Placement: Reduce physical distance between critical-path cells (re-place closer together to shorten wire RC).
  • Logic restructuring: Replace multi-level AND/OR trees with fewer levels (e.g., 3-level NAND tree instead of 4-level AND tree).
  • Architecture: Pipeline — insert a register mid-path, accepting extra latency but doubling frequency.
2
HardPower
What is UPF? How do you define power domains and isolation rules?

UPF (Unified Power Format, IEEE 1801) specifies power intent separately from RTL, allowing power-aware verification and implementation without modifying the design source.

Key constructs:

  • create_power_domain PD_CPU -include_scope — defines a set of logic sharing a common supply.
  • create_supply_net VDD_CPU -domain PD_CPU — declares the supply net for that domain.
  • set_domain_supply_net PD_CPU -primary_power_net VDD_CPU — connects domain to supply.

Isolation: When a domain powers down, its outputs can float or be indeterminate. Isolation cells clamp outputs to a known safe value (0 or 1) before the domain is cut:

set_isolation ISO_CPU \
  -domain PD_CPU \
  -isolation_signal cpu_iso_en \
  -isolation_sense high \
  -clamp_value 0 \
  -applies_to outputs

The isolation cell must be in an always-on domain, powered continuously.

Level shifters: Required when two domains operate at different voltages (e.g., 0.8 V core → 1.8 V I/O). set_level_shifter in UPF directs the tool to insert level-shifter cells at domain crossings. EDA tools (Synopsys Verdi, Cadence Voltus) use UPF to verify correct isolation, level-shifting, and retention implementation.

3
MediumSTA
Why do hold violations typically appear after CTS? How are they corrected?

Hold check: Data_arrival ≥ Capture_clock_arrival + t_hold. A hold violation means data changes too quickly relative to the capture edge.

Why post-CTS: In pre-CTS STA, ideal clocks assume zero insertion delay and zero skew — the launch and capture clocks appear to arrive simultaneously. After CTS, real clock buffers introduce real delays. If the launch flop's clock arrives noticeably later than the capture flop's clock (positive skew toward launch), the launched data arrives later at the capture D-pin — reducing the margin above the hold requirement. Paths where launch clock delay > capture clock delay become hold-critical.

Fixes:

  • Delay cell insertion (ECO): Add delay buffers (DEL/BUF chains) on the data path to slow data arrival. These ECO cells are placed in available whitespace and routed. This is the standard post-CTS hold fix.
  • Clock skew adjustment: Re-balance the CTS buffer tree to equalise arrival times — but this risks disturbing setup closure elsewhere.

Hold violations must be fixed before tapeout — unlike setup failures, they cannot be addressed by slowing the operating clock frequency in production.

4
MediumRTL Design
Compare Moore vs Mealy FSMs. When is one-hot encoding preferred over binary?

Moore FSM: Outputs depend only on the current state (not inputs). Outputs are registered — glitch-free. One extra clock cycle of latency since the state must transition before the output changes. Simpler to verify.

Mealy FSM: Outputs depend on both the current state and current inputs. Responds one cycle faster (output changes immediately when input changes, without waiting for a state transition). Risk of combinational glitches on outputs if inputs are noisy.

One-hot encoding: One flip-flop per state; exactly one bit is high at any time (e.g., 8 states → 8 FFs). Next-state logic reduces to simple OR/AND of predecessor state bits — fewer logic levels, faster transitions.

When to prefer one-hot:

  • State machine is timing-critical and needs minimum propagation delay through next-state logic.
  • Available flip-flops are plentiful (FPGA always; ASIC when state count is small).
  • Glitch immunity is important — one-hot decoding avoids the decoding hazards of binary encoding during state transitions.

Binary encoding uses log₂(N) flip-flops — fewer FFs, but more complex next-state logic. Preferred for large state machines where FF count dominates area.

5
HardSTA
What is POCV? How does it differ from OCV and AOCV?

All three model on-chip process variation (PVT) but with increasing physical accuracy and decreasing pessimism:

OCV (flat): Applies a single worst-case derating multiplier to all cells on the launch/data path and a best-case multiplier on the capture/clock path. Simple but extremely conservative — it assumes every cell on a long path simultaneously hits its worst-case delay, which is physically impossible for correlated cells.

AOCV (Advanced OCV): Replaces the flat factor with a lookup table indexed by path depth (number of stages) and spatial distance between cells. Short paths (few stages, cells close together) are strongly correlated → smaller derate. Long paths with many independent stages → larger derate. Recovers 5–10% timing margin versus flat OCV while remaining deterministic.

POCV (Parametric OCV): Characterises each cell's delay as a mean μ plus a standard deviation σ from silicon measurements or Monte Carlo simulation. STA combines path delays statistically:

  • Independent stages: σ_total = √(Σσᵢ²)
  • Correlated (common source) stages: σ_total = Σσᵢ

The timing budget is set at μ ± k×σ (typically k = 3 for 99.87% yield). POCV recovers 10–20% more margin versus AOCV, using a physically justified statistical model. Required at 7 nm and below where OCV pessimism would make closure impossible.

6
HardPower
How do retention flip-flops work? Describe the save and restore sequence.

Retention flip-flops (also called balloon or shadow-latch FFs) allow a power domain to be shut down while preserving register state, avoiding a full re-initialisation on power-up.

Structure: Two storage elements in one cell:

  • Primary latch: Connected to the normal domain supply (VDD). Operates during functional mode. Powers down when the domain is switched off.
  • Shadow latch: Connected to a separate always-on retention supply (VRET, typically lower voltage, e.g., 0.5 V). Retains data through domain power-down.

Save/restore sequence:

  1. Save: Assert SAVE signal while VDD is still active. Data propagates from primary latch → shadow latch.
  2. Power-down: VDD is removed. Primary latch loses state. Shadow latch retains data on VRET.
  3. Power-up: VDD is restored.
  4. Restore: Assert RESTORE signal. Data propagates from shadow latch → primary latch. Domain resumes normal operation.

Critical: SAVE must complete before VDD is cut. RESTORE must complete after VDD is stable. The PMU (Power Management Unit) enforces this sequencing. UPF: set_retention -retention_condition {save_en} -save_signal {save_net rising}.

7
MediumCDC
What is metastability? How do you calculate the MTBF of a two-flop synchronizer?

Metastability occurs when a flip-flop's D input changes within its setup or hold window. The flip-flop enters an indeterminate analog state — neither a valid 0 nor 1 — and resolves to one after a random resolution time τ that follows an exponential distribution. If τ exceeds the available resolution window, the output propagates metastability to downstream logic, causing functional failures.

Probability of failure for one synchronizer:

P_fail = f_src × f_dst × T_met × e^(−Tr/τ)

  • f_src, f_dst — source and destination clock frequencies
  • T_met — metastability window width (~t_setup + t_hold of receiving FF, typically 10–50 ps at 28 nm)
  • Tr — available resolution time (destination clock period minus timing margin)
  • τ — flip-flop characterisation constant from silicon (technology-dependent)

MTBF = 1 / P_fail

Adding synchronizer stages: Each additional stage increases Tr by one destination clock period, reducing P_fail exponentially. At 1 GHz with 28 nm FFs, two stages typically yield MTBF > 10¹⁰ years. Three stages are used for safety-critical CDC crossings (ISO 26262 ASIL-D).

8
MediumSTA
What is clock uncertainty? How is it decomposed into jitter and skew components?

Clock uncertainty is the total uncertainty in when a clock edge arrives at any flip-flop. It is added as a margin to both setup and hold checks, consuming timing budget. It decomposes into:

1. Jitter: Cycle-to-cycle variation in the PLL/oscillator output period. Types:

  • Period jitter: Variation in a single cycle's period from the ideal. Adds directly to setup uncertainty.
  • Phase jitter (RMS): Long-term phase deviation, specified in ps RMS in the PLL datasheet.

2. Clock skew: Systematic difference in clock arrival time between two registers due to unequal path lengths in the clock distribution tree. Ideal CTS targets <50 ps post-route.

3. Crosstalk on clock nets: Aggressor switching events perturb clock edge timing — adds a small random component.

SDC modeling:

set_clock_uncertainty -setup 0.15 [get_clocks clk]
set_clock_uncertainty -hold  0.05 [get_clocks clk]

Setup uncertainty is larger (typically 100–200 ps) because it must account for worst-case PLL jitter across both launch and capture cycles. Hold uncertainty is smaller since jitter on adjacent cycles is correlated. POCV-based flows partially absorb jitter into the statistical model.

9
MediumPower
What is the difference between fine-grain and coarse-grain clock gating?

Clock gating reduces dynamic power by stopping the clock to idle registers (P = α × C × V² × f — reducing α to 0 for gated sections).

Fine-grain clock gating: ICG cells are placed close to individual registers or small groups (8–16 bits). Each idle register (or small cluster) is gated independently. Maximum power savings because the gate exactly tracks which registers are idle at any given cycle. High area overhead — one ICG per register or small group. Synthesis tools infer fine-grain CG from RTL conditional enables: if (en) q <= d.

Coarse-grain clock gating: ICG cells are placed higher in the clock tree, gating an entire functional block (e.g., an entire ALU, cache bank, or peripheral unit). The entire block must be idle to benefit — less granular. Minimal area overhead (one ICG per block). Implemented as a block-level enable signal at RTL module boundaries.

Trade-off:

  • Fine-grain: more ICG area, better power granularity, synthesised automatically.
  • Coarse-grain: less area, simpler power control, requires architectural power management.

Best practice: use coarse-grain gating at the power-domain level (via UPF) and supplement with fine-grain gating inferred by synthesis within each always-on block.

10
MediumRTL Design
Why is async-assert sync-deassert the correct reset strategy? Show the RTL pattern.

Why async assert: Reset must take effect immediately regardless of the clock state — at power-on the clock may not yet be running. Async assert guarantees every flip-flop reaches a known state even before the clock stabilises.

Why sync deassert: If reset deasserts asynchronously, different flip-flops may exit reset on different clock edges (due to reset net skew), causing part of the design to start operating while other parts are still held in reset — creating illegal intermediate states. Synchronous deassert ensures all registers exit reset together on the same rising clock edge.

Implementation: The external reset input passes through a 2-FF synchroniser (in the destination clock domain) before distributing to all registers:

// 2-FF reset synchroniser
always @(posedge clk or negedge rst_n_ext)
  if (!rst_n_ext) {sync1, rst_n} <= 2'b00;  // async assert
  else            {sync1, rst_n} <= {1'b1, sync1};  // sync deassert

// Design flop using synchronised rst_n
always @(posedge clk or negedge rst_n)
  if (!rst_n) q <= 0;
  else        q <= d;

PCIe hot-reset, power-on reset (POR), and domain resets all follow this pattern. The synchroniser itself uses async assert to ensure the reset state is captured even without a running clock.

11
MediumSTA
What is a timing ECO? How do you fix setup violations post-route without a full re-run?

An ECO (Engineering Change Order) fixes timing violations at the end of the physical design cycle without re-running full synthesis or place-and-route — preserving the existing floorplan, routing, and overall timing closure.

Setup ECO flow:

  1. Identify failing paths from PrimeTime signoff (report_timing -slack_lesser_than 0).
  2. Select fix: upsize critical-path cells to higher drive strength; replace HVT with SVT/LVT; remove unnecessary buffer stages on the data path; downsize high-fanout buffers to reduce load on critical nets.
  3. ECO insertion in PD tool (Innovus eco_place, ICC2 place_eco_cells): place new/resized cells in available filler cell gaps.
  4. eco_route re-routes only the affected nets; existing routing is preserved.
  5. Re-run signoff STA to verify closure.

Constraints: Filler/decap cells must be removed to make room for new ECO cells. Large ECOs risk congestion if many cells are moved. ECO changes must not violate DRC/LVS.

Metal ECO (post-tapeout): Only changes metal layers — no base-layer changes. Faster and cheaper for a re-spin. Used to fix bugs or timing paths discovered after mask generation using spare cells (ECO cells pre-placed in the original design).

12
HardPhysical Design
How does IR drop derate cell speed? How is it modelled in timing signoff?

Cell delay depends on supply voltage — when VDD drops due to IR drop, the transistor drive current decreases (I_ds ∝ (V_gs − V_t)^n), increasing propagation delay. Approximately: a 5% VDD reduction causes a 5–10% increase in cell delay, depending on the process and cell type.

Voltage-aware STA (signoff flow):

  1. Run power grid analysis (Cadence Voltus or Synopsys RedHawk) to compute per-cell supply voltage under realistic activity (from VCD/SAIF).
  2. Export per-cell voltage as a SPEF-format annotation or a voltage map file.
  3. Import into PrimeTime: read_parasitics -format SPEF voltages.spef or use set_voltage per cell.
  4. STA re-queries the liberty model at the annotated voltage for each cell, computing voltage-derated delay.
  5. Paths through high-IR-drop regions show additional setup slack erosion.

Fixing IR-induced timing failures:

  • Widen power stripes in the hotspot region to reduce resistance.
  • Add power vias to improve connectivity between layers.
  • Insert decoupling capacitors near the switching activity hotspot.
  • As a last resort, downsize the switching logic to reduce current draw.

IR-aware timing is mandatory for advanced nodes where VDD is low (0.7–0.8 V) and 50 mV of drop is a significant fraction of the supply.

13
MediumPower
What are isolation cells? When are they required and what does ISO_ENABLE control?

When a power domain is shut down, its output signals can float to an intermediate voltage or go to an unknown state (X). If these outputs drive logic in an always-on domain, they can cause X-propagation — corrupting state — or create a short-circuit DC current path through CMOS gates (one transistor on, one off but with floating input).

Isolation cells clamp the domain outputs to a known logic value before the domain powers down:

  • AND-type isolator: output = data AND ISO_ENABLE. When ISO_ENABLE = 0 (isolate), output is forced to 0.
  • OR-type isolator: output = data OR ISO_ENABLE. When ISO_ENABLE = 1 (isolate), output is forced to 1.

ISO_ENABLE is driven by the always-on Power Management Unit (PMU). It must be asserted (domain isolated) before the domain supply is cut, and de-asserted after the domain supply is restored and stable.

Placement: Isolation cells must be placed in the always-on power domain, not in the domain being shut down — otherwise they lose power along with the domain they are supposed to isolate.

UPF: set_isolation ISO_PD -domain PD_CPU -applies_to outputs -clamp_value 0 -isolation_signal iso_ctrl -isolation_sense high

14
MediumRTL Design
How do you pipeline a design? What determines the optimal number of pipeline stages?

Pipelining divides a combinational logic block into N stages separated by registers, so each stage can start on a new input every clock cycle while the previous input is still in flight.

Throughput vs latency: A non-pipelined block with T_comb delay produces one result per T_comb. An N-stage pipeline produces one result per T_comb/N (ideally), with latency of N cycles.

Optimal stage count: T_stage = T_comb / N must satisfy T_stage ≥ t_setup + t_flop + t_skew for the target clock period T_clk. Ideal N = floor(T_comb / (T_clk − t_overhead)). Adding stages beyond this yields diminishing returns — register overhead (t_setup, t_hold, t_skew) begins to dominate.

RTL considerations:

  • Retiming: move register boundaries to equalise stage delays — synthesis tools do this automatically with the retime attribute.
  • Pipeline hazards: data hazards (result not ready for dependent instruction) → stall logic or data forwarding. Control hazards (branches) → pipeline flush or prediction.
  • Valid/ready signals: propagate through each stage alongside data so partial-valid inputs are handled correctly.

For arithmetic (multiplier, FIR filter), 3–8 stages is typical at 1–2 GHz in 16 nm. Apple's custom cores use deep pipelines (12–15 stages) to achieve 3+ GHz operation.

15
HardCDC
How do FIFO-based CDC synchronizers work? How do you calculate the required FIFO depth?

A dual-clock asynchronous FIFO transfers a multi-bit data stream between two unrelated clock domains without metastability, using separate write (wclk) and read (rclk) pointers.

Gray-code pointers: Write and read pointers are binary counters converted to Gray code before synchronisation. Gray code ensures only one bit changes per increment — critical because if multiple bits changed simultaneously, a 2-FF synchroniser capturing mid-transition could produce an illegal intermediate pointer value, falsely triggering full/empty detection.

Synchronisation: The write-domain Gray pointer is synchronised into the read domain (to detect full), and the read-domain Gray pointer is synchronised into the write domain (to detect empty), each through a 2-FF synchroniser.

FIFO depth calculation:

Minimum depth must absorb the maximum write burst while accounting for the synchronisation latency before the read side can start draining:

depth ≥ burst_size + sync_latency_in_write_cycles

Where sync_latency_in_write_cycles = ceil(2 / f_read × f_write) + pipeline registers.

Example: 1 GHz write, 500 MHz read, 16-word burst, 2-cycle sync latency on read side = 2 / 0.5 ns × 1 ns = 4 write cycles latency. Depth ≥ 16 + 4 = 20 → round up to 32 (power of 2).

Formal verification (JasperGold CDC, Questa CDC) confirms Gray-code conversion, pointer synchronisation, and absence of overflow/underflow.

16
MediumPhysical Design
What is the antenna effect in VLSI fabrication? How is it fixed during physical design?

The antenna effect (plasma-induced gate oxide damage) occurs during the metal etch steps of fabrication. The plasma used for reactive-ion etching charges the metal being patterned. A long floating metal wire accumulates charge proportional to its area. If that wire is directly connected to a gate oxide, the voltage can build up and tunnel through the thin oxide — causing permanent degradation (increased leakage, Vt shift) or hard breakdown (short circuit).

Antenna ratio: Metal area of the wire / connected gate oxide area. PDKs specify a maximum ratio per metal layer (typically 200–400 for metal-1, larger for higher layers etched later).

Fixes during physical design:

  • Antenna diode: Place a reverse-biased diode (P+/N-well) at the gate input. During fabrication the diode conducts and bleeds accumulated charge to the substrate safely. Most common fix — inserted automatically by the router or added as an ECO.
  • Wire jumping (layer hopping): Break the long metal segment by routing up to a higher metal layer for a short segment and routing back down. Higher layers are etched later, so charge has less time to accumulate on the gate before it is connected to a driver. The jump resets the antenna ratio.
  • Re-route: Use shorter wire segments that stay below the antenna ratio limit.

DRC tools flag antenna violations using the PDK antenna rules; PD tools like Innovus can auto-insert diodes post-route.

01
MediumArchitecture
Explain the 1T1C DRAM cell operation and why it requires periodic refresh.

A 1T1C DRAM cell uses one access transistor and one capacitor. The capacitor stores charge representing a bit (charged = 1, discharged = 0). Because the capacitor leaks charge over time (retention ~64 ms typical), all rows must be refreshed periodically using a RAS-only cycle (RAS-only Refresh or Auto-Refresh). During refresh, the sense amplifier reads and rewrites each row before the charge decays below the sensing threshold.

Refresh overhead impacts bandwidth — at 64ms/8192 rows, one row is refreshed roughly every 7.8 µs. LPDDR5 adds per-bank refresh (PBR) to reduce latency penalty.

Interview tip: Mention charge retention, sense amplifier operation, and the JEDEC refresh interval (64 ms, 8192 rows) to show depth.
02
HardArchitecture
How does an SRAM sense amplifier work, and what limits its minimum supply voltage?

A cross-coupled latch (two inverters back to back) is pre-charged to VDD/2, then the bit lines are slightly separated by the cell current. The sense amp fires, amplifying the small differential into full swing. Speed depends on the gain-bandwidth and the initial ΔV developed.

Minimum Vmin is constrained by the 6T cell's read stability (read static noise margin, RSNM) and the sense amp's offset. At low VDD, transistor threshold variation (σVt) can exceed the developed ΔV, causing read failures. Assist techniques (negative word-line, bit-line bias) extend Vmin.

Interview tip: Samsung interviewers want to hear about RSNM, σVt mismatch, and assist circuits — not just "latch detects small voltage."
03
HardArchitecture
What is HBM (High Bandwidth Memory) and how does 3D stacking achieve its bandwidth?

HBM stacks multiple DRAM dies vertically connected by Through-Silicon Vias (TSVs). A logic base die interfaces to the host via a 2.5D silicon interposer. Each HBM2E stack has up to 8 DRAM dies, 1024-bit-wide interface per stack, achieving ~460 GB/s per stack vs. GDDR6's 64-bit × 16 Gbps = 64 GB/s.

The wide parallel bus (1024 b) at lower per-pin frequency (2 Gbps) dramatically cuts power per bit transferred. TSV pitch is ~55 µm; the interposer carries thousands of TSV connections between GPU/AI die and HBM stack.

Interview tip: Contrast with GDDR6: HBM wins on bandwidth density and power/bit; GDDR6 wins on cost for graphics cards.
04
MediumPower
Describe how DVFS is implemented in an Exynos SoC to balance performance and power.

DVFS (Dynamic Voltage and Frequency Scaling) works by selecting an operating point (OPP) from a pre-characterized table of {frequency, voltage} pairs. The PMIC adjusts VDD, then the PLL is reprogrammed to the new frequency. The CPUFreq governor (schedutil, interactive) monitors CPU load and selects the OPP accordingly.

In Exynos: big.LITTLE cores have separate voltage domains. The Mali GPU has its own DVFS loop. A Power Management Unit (PMU) coordinates domain isolation before power gating idle cores. Transition latency (voltage ramp + PLL lock) is typically 50–200 µs.

Interview tip: Mention that voltage must settle before frequency ramps up (avoid timing violations) and PLL lock time dominates latency.
05
MediumPower
What are the key improvements in LPDDR5 over LPDDR4X in mobile SoCs?

LPDDR5 improvements: (1) Speed: 6400 Mbps vs LPDDR4X's 4266 Mbps; (2) Per-bank refresh (PBR) reduces refresh stall latency; (3) Write X (WRX) allows partial write masking; (4) Decision Feedback Equalization (DFE) for signal integrity at high speed; (5) Lower VDDQ (0.5V) reduces I/O power; (6) Link ECC for in-line error correction.

Power savings: ~20% lower power per transfer at equivalent bandwidth due to lower VDDQ and better termination control.

Interview tip: For Samsung interviews, know PBR, DFE, and Link ECC — these are technologies Samsung pioneered in their LPDDR5 products.
06
HardRTL Design
How does 3D NAND store multiple bits per cell (MLC, TLC, QLC) and what is the reliability trade-off?

Each NAND flash cell stores charge in a floating gate or charge-trap layer. The threshold voltage (Vt) is programmed to one of 2^n levels representing n bits. MLC=2 bits (4 levels), TLC=3 bits (8 levels), QLC=4 bits (16 levels). Vt distributions must not overlap — tighter margins at higher bit density.

Reliability trade-off: More bits per cell → narrower Vt windows → more program/erase (P/E) cycle wear → lower endurance. SLC endures 100K cycles; TLC ~3000 cycles; QLC ~1000 cycles. ECC strength (LDPC) must scale up to correct wider error rates, adding latency.

Interview tip: Mention Vt distribution tightening, LDPC ECC, and that Samsung's V-NAND uses charge-trap flash (CTF) instead of floating gate for better reliability.
07
MediumRTL Design
Compare FinFET vs planar MOSFET in terms of electrostatic control and leakage.

Planar MOSFET: gate controls channel from one side only → short-channel effects (SCE) dominate below ~28nm (drain-induced barrier lowering, punchthrough, subthreshold leakage). FinFET: gate wraps three sides of a thin silicon fin → superior electrostatic control, effectively gates from three sides → dramatically reduced SCE, lower Vt variation, steeper subthreshold slope (~65 mV/dec vs. ~80+ for planar).

Leakage: FinFET reduces Ioff by 10–100× at same Ion vs. planar at same node. Fin height/width is quantized, so Vt tuning is limited to gate metal work function and fin width rather than channel doping.

Interview tip: Samsung introduced GAA (Gate-All-Around) nanosheet transistors at 3nm — mention this as the successor to FinFET for even better electrostatics.
08
MediumRTL Design
What is the difference between scan power faults and transition faults in DFT?

Transition fault model: tests whether a node can change from 0→1 or 1→0 within one clock cycle. It targets delay defects where a path is slow but functional at low speed. Two patterns are applied: initialization + launch pattern on the same clock edge.

Scan power: during scan shift, half the flip-flops toggle every clock (high switching activity → IR drop spikes, potential timing violations, overheating). Techniques: scan segmentation (multiple shift enables), X-filling (reduce transitions), enhanced scan (OCC-based hold), low-power scan cells. Power faults are not a separate fault model — they are a concern about structural test conditions damaging the chip or causing false failures due to excessive IR drop.

Interview tip: Samsung DFT interviews focus on low-power scan and X-masking — be ready to explain how ATPG X-fill reduces switching activity.
09
HardSTA
What is electromigration and how is it signed off in a modern SoC flow?

Electromigration (EM): atomic migration of metal ions driven by electron wind in current-carrying wires. Results in void or hillock formation, eventually causing open or short circuits. Failure follows Black's equation: MTTF = A · J^−n · e^(Ea/kT), where J is current density and Ea is activation energy.

Sign-off flow: (1) Extract average and peak current from switching activity (ECSM/CCS models + dynamic simulation); (2) Compare per-segment J vs. PDK EM limits; (3) Fix violations by widening wires (lower J), adding parallel straps, or reducing frequency. Tools: Voltus (Cadence), RedHawk (Ansys). Via EM and power rail EM are checked separately.

Interview tip: Mention that Samsung's EM limits are process-specific and temperature-derated — sign-off is done at junction temperature (Tj = 125°C typically).
10
HardPHY/Analog
Explain 2.5D vs 3D chiplet packaging and the trade-offs for a memory-compute SoC.

2.5D: dies are placed side-by-side on a silicon interposer (or organic RDL). Interconnects are dense (interposer bump pitch ~10–55 µm) but dies are not stacked. HBM + GPU on interposer is classic 2.5D. Good thermal dissipation since both dies are cooled from top. Bandwidth: limited by interposer wire density.

3D: dies are stacked vertically via TSVs or hybrid bonding (Cu-Cu, pitch <1 µm). Extremely short interconnects → high bandwidth, low latency, low power per bit. Thermal challenge: top die blocks heat from bottom die. Logic-on-DRAM (Samsung X-Cube) or SRAM-on-Logic (Apple M-series) uses 3D.

Interview tip: For Samsung, know X-Cube (SRAM stacking on logic via TSV) and FOWLP (Fan-Out Wafer Level Packaging) as proprietary packaging technologies.
01
HardArchitecture
Compare NoC (Network-on-Chip) vs shared bus for a mobile SoC with 8+ masters.

Shared bus: all masters time-share one bus. Simple arbitration, low area, but bandwidth is O(1) — adding masters degrades per-master bandwidth linearly. Bus stalls when any master holds the bus. Works for ≤4 masters.

NoC: packet-switched mesh/ring/crossbar of routers. Bandwidth scales with topology — multiple transactions in flight simultaneously. Latency is higher (router hops) but throughput is O(N) for properly designed topology. MediaTek Dimensity SoCs use a ring-bus NoC interconnecting Cortex-A cores, GPU, APU (AI), modem, camera ISP, and display subsystem.

Interview tip: MediaTek focuses on QoS — mention how NoC supports per-flow bandwidth guarantees (rate limiters, latency budgets) critical for real-time camera/display pipelines.
02
MediumPower
Walk through the end-to-end DVFS flow on a mobile SoC from OS scheduler to PMIC.

Flow: (1) OS kernel (schedutil governor) measures CPU utilization every scheduler tick; (2) Selects target OPP from cpufreq table; (3) Issues voltage change request to PMIC via I²C/SPMI — voltage ramps up; (4) After voltage settled (PMIC PGOOD signal or fixed delay), PLL reprogrammed to new frequency; (5) CPUFreq driver updates clk_rate. For frequency decrease: lower frequency first, then voltage.

Thermal governor (step_wise) can further cap OPP if Tj exceeds threshold. MediaTek's SVS (Smart Voltage Scaling) characterizes each chip at test to find minimum voltage per OPP, reducing average Vdd by 30–50 mV.

Interview tip: Know the order: up-transition = voltage first, frequency second; down-transition = frequency first, voltage second. Getting this wrong causes timing violations.
03
MediumArchitecture
How is LPDDR5 bandwidth shared between CPU, GPU, and APU in a Dimensity SoC?

A central memory controller (EMC) arbitrates DRAM access. Each subsystem (CPU, GPU, APU, camera, display) has a QoS priority and bandwidth limiter in the NoC. The EMC uses weighted round-robin (WRR) scheduling with urgency preemption for latency-sensitive paths (display must refill frame buffer before vsync).

LPDDR5 at 6400 Mbps × 128-bit bus = ~100 GB/s theoretical peak. Real sustained BW is ~60–70% of peak. MediaTek's BWM (bandwidth management) unit monitors per-master usage and dynamically adjusts QoS weights to prevent one subsystem (GPU) from starving others.

Interview tip: Emphasize display subsystem latency sensitivity — missing a DRAM refill deadline causes visible tearing, so display gets highest urgency in the arbiter.
04
HardProtocols
Compare MIPI C-PHY vs D-PHY for camera sensor interfaces.

D-PHY: differential pairs (lane + clock lane), NRZ encoding. Each lane: 80 Mbps (LP mode) to 2.5 Gbps (HS mode). Simple, widely adopted for camera and display (DSI). Separate clock lane adds overhead. Max ~6 Gbps with 4 data lanes.

C-PHY: 3-wire trio (no dedicated clock — embedded in symbol transitions). Uses 3-phase symbol encoding (5-wire equivalent efficiency: 2.28 bits per symbol). Achieves higher data density per wire count: ~2.5 Gsps per trio ≈ 5.7 Gbps. No clock lane means fewer pins. Trade-off: complex receiver CDR, higher power in PHY, less ecosystem support.

Interview tip: C-PHY is preferred for very high-resolution sensors (200MP+) where pin count is constrained. D-PHY dominates in mid-range cameras.
05
HardRTL Design
Describe the pipeline stages of an H.265 hardware video encoder and key RTL design challenges.

H.265 (HEVC) encoder pipeline: (1) Input fetch + preprocessing (color space, padding); (2) CTU partitioning (Coding Tree Unit, up to 64×64); (3) Intra/inter mode decision (IME — integer motion estimation, FME — fractional ME); (4) Transform (DCT/DST on residuals); (5) Quantization; (6) CABAC entropy coding; (7) Deblocking + SAO filter; (8) Reconstructed frame buffer write-back.

RTL challenges: IME requires searching a large reference frame region (hundreds of candidates) — needs highly parallel SAD computation (SIMD-style array of adders). CABAC is inherently sequential (context-adaptive arithmetic coding) — limits throughput and is difficult to pipeline. Memory bandwidth for reference frames is large — requires DMA engines and on-chip line buffers.

Interview tip: CABAC serialization is a classic bottleneck — MediaTek's encoder uses look-ahead symbol buffering to hide CABAC latency.
06
MediumPower
When do you choose clock gating over power gating for power reduction?

Clock gating: disables clock to flip-flops in an idle block using ICG cells. Eliminates dynamic switching power (C·V²·f·α). Saves 20–40% of active power. State is retained. Wake-up latency: 0 cycles (clock gating cell re-enables immediately). No level shifters or isolation cells needed.

Power gating: shuts off VDD to an entire block via header/footer switches. Eliminates both dynamic AND static (leakage) power. State is lost unless retention FFs are used. Wake-up latency: several µs (power-up sequence, state restore). Requires isolation cells and power switches. Use power gating for blocks idle for >100 µs (e.g., video encoder idle between sessions).

Interview tip: Rule of thumb: clock gating for short idle periods (cycles to µs); power gating for long idle periods (>100 µs). The break-even point depends on leakage current and power switch overhead.
07
HardPower
How does a PLL generate a low-jitter clock, and what are the main jitter sources?

PLL components: Phase Frequency Detector (PFD) compares reference vs. feedback phase → Charge Pump (CP) generates correction current → Loop Filter (LF) integrates to VCO control voltage → VCO oscillates at target frequency → Frequency Divider (/N) feeds back. Closed loop phase-locks so fout = N·fref.

Jitter sources: (1) PFD/CP noise: mismatched charge pump currents → phase error; (2) VCO phase noise: flicker noise dominates close-in, thermal noise at high offset; (3) Supply/substrate noise coupling into VCO (solved with differential LC-VCO, regulated supply); (4) Reference clock jitter; (5) Divider quantization noise (for fractional-N: Σ-Δ modulator adds high-frequency noise shaped out of band).

Interview tip: Fractional-N PLLs (used in 5G RF) add Σ-Δ modulator jitter — MediaTek's 5G modem PLLs use multi-stage noise shaping to minimize this.
08
MediumArchitecture
How does an SoC route a workload between DSP, ARM cores, and NPU for AI inference?

Workload partitioning depends on operation type: (1) Control flow, low-throughput tasks → ARM CPU; (2) Signal processing (FFT, FIR, audio) → DSP (VLIW architecture, MAC arrays, low power); (3) Tensor operations (matrix multiply, convolution) → NPU (systolic array or PE array, optimized for 8-bit/16-bit fixed-point).

MediaTek's APU (AI Processing Unit) uses a multi-core design: MDLA (deep learning accelerator, INT8 systolic array) + VPU (vector processor for pre/post-processing). The scheduler in the APU driver partitions a neural network graph into subgraphs and assigns to MDLA or VPU based on operator support and latency targets.

Interview tip: Key decision: operators not supported by MDLA (e.g., custom activations, control flow) fall back to CPU, adding latency. Good model optimization minimizes CPU fallback.
09
MediumPower
How does thermal throttling work in a mobile SoC and what are the trade-offs?

Thermal throttling: when Tj approaches TJ_MAX (typically 85–95°C), the thermal framework reduces OPP to lower power dissipation. Steps: (1) Thermal sensors (on-die BJT or resistive) report temperature; (2) Linux thermal framework compares to trip points; (3) Passive cooling: reduce CPU/GPU frequency; (4) Active cooling: fan on premium phones; (5) Emergency shutdown at TJ_CRIT.

Trade-off: throttling causes user-visible performance drops (game frame rate drops, benchmark inconsistency). MediaTek's AI-based thermal prediction (in Dimensity 9000+) pre-emptively throttles before reaching the thermal limit, maintaining a more consistent temperature and smoother performance curve vs. reactive throttling.

Interview tip: Sustained performance (SPT) vs. peak performance is a key metric — proactive thermal management improves SPT, which is what MediaTek emphasizes.
10
HardProtocols
Explain massive MIMO beamforming in 5G NR and the hardware challenges for a modem SoC.

Massive MIMO: base station uses 64–256 antennas to form narrow beams toward individual UEs using spatial multiplexing. Precoding: each antenna applies a complex weight (phase + amplitude) so signals add constructively at the target UE and destructively elsewhere. Precoding matrix computed from CSI feedback.

UE modem challenges: (1) CSI-RS measurement across hundreds of beams → huge compute for beam management; (2) Channel estimation for massive MIMO reference signals requires large matrix operations; (3) PDSCH decoding with LDPC at 1+ Gbps requires highly parallel decoders; (4) mm-Wave beamforming (FR2) at UE side requires phased array RF front-end with per-element phase control — separate analog beamforming IC or integrated in modem SoC.

Interview tip: MediaTek's M80 modem integrates mm-Wave and Sub-6GHz. Emphasize beam management latency (L1/L3 handoff) as the key challenge for mobility scenarios.
01
HardRTL Design
Walk through the step-by-step operation of a SAR ADC converting one sample.

Successive Approximation Register ADC: (1) Sample: sample-and-hold circuit captures Vin; (2) MSB trial: DAC outputs Vref/2, comparator checks if Vin > Vref/2 → if yes, bit=1 and keep; (3) Next bit: DAC outputs current output ± Vref/4, repeat binary search; (4) After N cycles, N-bit digital code in SAR register = approximation of Vin/Vref.

Key specs: conversion time = N clock cycles. No latency pipeline — result available after N cycles. Power scales with sample rate. Accuracy limited by DAC linearity (INL/DNL), comparator offset, and kT/C noise in sample-and-hold. TI's ADS8588 achieves 16-bit, 500 kSPS with <1 LSB INL.

Interview tip: TI interviews often ask about DNL (worst-case code width deviation) and INL (accumulated error) — make sure you distinguish them clearly.
02
HardRTL Design
How does a Sigma-Delta ADC achieve high resolution and what is noise shaping?

Sigma-Delta (ΣΔ) ADC: oversamples the input at many times Nyquist (OSR = 64–512×). A 1-bit quantizer generates a high-frequency bitstream. Noise shaping: the feedback loop pushes quantization noise to high frequencies (out-of-band). A decimation filter (sinc or FIR) downsamples and low-pass filters, removing shaped noise, yielding high-resolution output at lower rate.

Resolution gain: each doubling of OSR adds 0.5 bits; first-order ΣΔ adds 1.5 bits per octave; higher-order modulators add more. A 2nd-order modulator at OSR=256 achieves ~16-bit ENOB. Used in audio (TI PCM1794A) and precision measurement (ADS124S08, 24-bit).

Interview tip: Explain that noise shaping does NOT eliminate quantization noise — it moves it out of the band of interest. The decimation filter then rejects it.
03
MediumProtocols
Explain CAN bus arbitration and why it is non-destructive.

CAN uses a wired-AND bus: dominant bit (0) overrides recessive bit (1). During arbitration, all nodes simultaneously transmit their 11-bit (or 29-bit) identifier. Each node monitors the bus while transmitting. When a node transmits a recessive bit (1) but reads a dominant bit (0), it lost arbitration and backs off immediately — no collision, no data corruption.

The node with the lowest numerical ID wins (more 0s in MSBs). Non-destructive: the winning frame continues uninterrupted; losing nodes retry after the winning frame completes. Bit rate limited to ~1 Mbps at 40m to ensure all nodes see a stable level within one bit time (propagation constraint). CAN FD extends data phase to 8 Mbps.

Interview tip: Emphasize that arbitration is deterministic and priority-based — critical for real-time automotive systems (e.g., airbag messages always win).
04
MediumRTL Design
Why is dead-time insertion necessary in a half-bridge PWM driver and how is it implemented?

In a half-bridge, high-side and low-side switches must never be on simultaneously (shoot-through → destructive cross-conduction short). Dead-time is a forced delay between turning off one switch and turning on the other. During dead-time, body diode of the off switch conducts the inductor current (freewheeling).

Implementation in RTL/hardware: two independent PWM outputs with asymmetric edge delays. A dead-time generator inserts N clock cycles of both-off state on the falling edge of each output. TI's C2000 ePWM module has programmable dead-band (DBRED, DBFED registers). Minimum dead-time = gate driver propagation delay + worst-case VGS fall time of the turning-off device.

Interview tip: TI focuses on power electronics — know that too much dead-time causes voltage distortion in motor drive waveforms, too little causes shoot-through.
05
HardArchitecture
Describe the pipeline of a TI C66x DSP MAC unit and how it achieves high throughput.

TI C66x is a VLIW (Very Long Instruction Word) DSP. Each core has 8 functional units: 2× L units (logical), 2× S units (shift/branch), 2× M units (multiply — each does 4× 16-bit MACs/cycle = 8 MACs per core), 2× D units (load/store). VLIW executes up to 8 operations per cycle in parallel with no hardware interlocking — compiler schedules all hazards statically.

Throughput: at 1.25 GHz, each core delivers 40 GMACS (16-bit). A 8-core C6678 reaches 320 GMACS. Deep pipeline (10–12 stages) maximizes clock rate. Load-use latency: 4 cycles → compiler fills with independent instructions. Loop buffer (SPLOOP) executes tight loops without fetch overhead.

Interview tip: TI DSP interviews always ask about VLIW scheduling — emphasize that the compiler (not hardware) resolves data hazards via NOPs or instruction reordering.
06
MediumPower
When would you choose a buck converter over an LDO for a processor core supply?

Buck converter (switching): efficiency = Vout/Vin (ideally); 85–95% typical. Use when Vin/Vout ratio is large (e.g., 5V→1V — LDO would dissipate 80% as heat). Requires LC filter, inductor, controller — larger PCB area and cost. Best for high-current loads (0.5A+).

LDO (linear): efficiency = Vout/Vin — poor for large drop. But: no switching noise, small PCB area, fast transient response, no EMI. Use for sensitive analog supplies (ADC AVDD, PLL supply) where switching noise would degrade PSRR. Also acceptable for small dropout (e.g., 1.8V→1.5V at 100mA — only 30mW heat).

Interview tip: TI makes both (TPS62xxx buck, TPS7Axxx LDO) — they want to hear efficiency vs. noise trade-off and that you understand PSRR limitations of LDOs at high frequency.
07
HardRTL Design
What is ENOB and how does aperture jitter degrade ADC effective resolution?

ENOB (Effective Number of Bits) = (SINAD − 1.76) / 6.02. SINAD = Signal-to-Noise And Distortion ratio. An ideal N-bit ADC has SINAD = 6.02N + 1.76 dB. ENOB accounts for all noise and distortion sources: quantization noise, thermal noise, DNL/INL nonlinearity, and clock jitter.

Aperture jitter: uncertainty in the sampling instant (Δt). At input frequency fin, voltage uncertainty ΔV = 2π·fin·Vin·σt. This appears as additive noise. For a 16-bit ADC at 1 MHz input, required jitter σt < 1/(2π·fin·2^16) ≈ 2.4 ps RMS. High-speed ADCs use low-jitter LC oscillator or external clock reference to meet this. Jitter-limited ENOB = −20·log10(2π·fin·σt) / 6.02.

Interview tip: TI ADC interviews expect you to calculate jitter-limited ENOB from σt and fin — practice this formula.
08
MediumDFT
How is an ADC tested in production for linearity (INL/DNL) efficiently?

Full code histogram method: apply a slowly ramping or sinusoidal input, collect a large sample (10K–100K points), count hits per output code. Ideal uniform distribution → each code gets equal hits. DNL = (actual hits − ideal hits) / ideal hits. INL = cumulative sum of DNL. Accurate but slow.

Fast alternatives: (1) Sine-wave FFT test (SINAD, THD, SFDR in one measurement — covers linearity and noise together); (2) Beat frequency histogram (two tones near Nyquist create slow beat, sweeping all codes quickly); (3) Digital post-correction using factory trim: measure INL, store correction coefficients in OTP, apply digitally. TI uses ATE (Automatic Test Equipment) with precision signal sources (better than ADC under test).

Interview tip: Production test time is cost — TI uses fast FFT-based tests (SNR, THD, SFDR) rather than full histograms wherever possible.
09
MediumProtocols
Design an RTL block for SPI master supporting multiple chip-select lines.

Key RTL blocks: (1) Prescaler counter: divide system clock to generate SCLK at configurable baud rate; (2) Shift register (8/16/32-bit): parallel-load tx_data, shift out MSB/LSB first on SCLK edge, shift in MISO; (3) CS mux: decode cs_sel to assert the correct CS_N line low during transfer; (4) State machine: IDLE → CS_ASSERT → TRANSFER (N bits) → CS_DEASSERT → IDLE; (5) CPOL/CPHA configuration: control idle clock level and which edge is capture/launch.

Interface: AXI-lite register map for tx_data, rx_data, control (baud, mode, cs_sel, start). Interrupt on transfer complete. For multiple slaves: one MOSI, one MISO (tri-stated when CS deasserted), N CS lines. Each slave may need different CPOL/CPHA — store per-CS config.

Interview tip: TI expects Verilog RTL for this. Draw the FSM first, then code the shift register. Make sure CS de-assert has configurable hold time to satisfy slave setup requirements.
10
HardArchitecture
Explain the CORDIC algorithm and why it is preferred in fixed-point DSP hardware.

CORDIC (COordinate Rotation DIgital Computer): computes trigonometric functions (sin, cos, arctan, magnitude) using only shifts and adds — no multipliers. Algorithm: iteratively rotate a vector by pre-defined angles (arctan(2^−i)) using shift-and-add until the rotation error is minimized. Two modes: rotation mode (given angle, find sin/cos) and vectoring mode (given vector, find angle/magnitude).

Why preferred in fixed-point hardware: no multipliers needed (shifts are free in hardware), predictable latency (N iterations = N-bit precision), low gate count, no LUT tables. Used in TI C2000 for FOC (Field-Oriented Control) motor drives where real-time sin/cos calculation at fixed-point is needed without floating-point overhead.

Interview tip: CORDIC gain: each iteration multiplies the vector by sqrt(1 + 2^−2i). Total gain ≈ 1.6468 — must be compensated by a scaling factor or pre-scaling the input.
01
HardPHY/Analog
Explain 112G PAM4 SerDes architecture and the key challenges vs NRZ.

PAM4 (4-level Pulse Amplitude Modulation) encodes 2 bits per symbol. At 56 Gbaud, PAM4 achieves 112 Gbps vs. NRZ's 56 Gbps at the same baud rate. Trade-off: PAM4 voltage levels are 1/3 the spacing of NRZ → 9.5 dB worse SNR margin → much stricter EQ requirements.

SerDes blocks: (1) TX: DSP pre-emphasis (FFE — Feed Forward Equalization, 3–7 taps) to compensate channel loss; (2) RX: CTLE (Continuous Time Linear Equalizer) + DFE (Decision Feedback Equalizer, 6–10 taps) + CDR (Clock Data Recovery, Mueller-Müller or Alexander PD); (3) FEC: Reed-Solomon FEC (RS-528,514 for 112G-LR) mandatory due to higher raw BER. JTOL (jitter tolerance) spec: must tolerate 0.15 UI random jitter at 112 Gbaud.

Interview tip: Marvell's Alaska SerDes are industry-leading 112G PAM4 — know that RS-FEC reduces BER from 10^-5 to 10^-15, and that DFE tap adaptation uses LMS algorithm.
02
MediumProtocols
Why is 8b10b encoding used in SerDes links and what are its trade-offs?

8b10b maps each 8-bit data byte to a 10-bit code word ensuring: (1) DC balance (equal 0s and 1s → no low-frequency content → AC-coupled links work); (2) Sufficient transitions for CDR clock recovery (running disparity control); (3) K-codes (special characters) for comma, idle, and flow control.

Trade-off: 20% bandwidth overhead (8 bits → 10 bits). At 10 Gbps, effective data rate is 8 Gbps. Used in PCIe Gen1/2, SATA, USB 3.0, Gigabit Ethernet. Replaced by 128b130b in PCIe Gen3+ and USB 3.1+ for lower overhead (1.5%). 128b130b does not guarantee DC balance — requires scrambler for sufficient transitions.

Interview tip: Marvell PHY interviews often ask about running disparity — a 10-bit symbol has two valid encodings (+ and − disparity); the encoder alternates to maintain DC balance.
03
HardArchitecture
How does a TCAM implement longest-prefix match for IP routing at line rate?

TCAM (Ternary CAM) stores entries with don't-care bits (X) in addition to 0 and 1. Each cell has two storage nodes: one for data, one for mask. A lookup word is broadcast to all rows simultaneously; each row compares bit-by-bit with its mask applied. All matching rows assert a match signal. A priority encoder selects the highest-priority (longest prefix = most specific) match in O(1) time.

Implementation: TCAM arrays operate in parallel — the lookup takes 1–2 clock cycles regardless of table size (vs. O(log N) for SRAM binary search). Drawback: 4–10× area and power vs. SRAM. Marvell Prestera ASICs pair TCAM with SRAM: TCAM finds the match index, SRAM retrieves the action/next-hop. Power saving: narrow search (TCAM search key = 80/160/320 bits selectable) reduces active cells.

Interview tip: Know that TCAM power is dominated by the broadcast compare operation — Marvell uses pipelined TCAM with selective power-down of unused segments.
04
HardProtocols
Describe PCIe Gen4/5 link training sequence (LTSSM) and what can cause link training failures.

LTSSM (Link Training and Status State Machine): Detect → Polling → Configuration → L0. Key phases: (1) Detect: both sides detect electrical idle exit (EIEOS); (2) Polling: exchange TS1/TS2 ordered sets to establish bit lock and symbol lock; (3) Configuration: negotiate lane width, link number, speed; (4) Recovery (Gen2→Gen5 speed transition): retrain at new speed with equalization.

Gen4/5 Equalization: mandatory 3-phase EQ handshake (Preset exchange, coefficient request, coefficient set). Each direction negotiates TX pre-set and coefficients via TS2s. Failure causes: (1) Signal integrity issues (insufficient EQ at high speed); (2) Compliance pattern errors; (3) PLL lock failure at high baud rate; (4) Receiver detection failure (termination mismatch). Debug via LTSSM state capture and eye diagram.

Interview tip: Gen5 (32 GT/s) requires PAM4 on some implementations — same EQ challenges as 112G Ethernet. Know that PCIe uses 128b130b encoding (not 8b10b) from Gen3 onwards.
05
MediumProtocols
Compare NVMe vs SATA from a storage controller ASIC perspective.

SATA: 6 Gbps max (SATA III), single command queue depth (NCQ: 32 commands), HBA-based arbitration, legacy ATA command set. Simple controller, low CPU overhead for HDDs. Bottleneck at ~600 MB/s for SSDs.

NVMe over PCIe: PCIe Gen4 ×4 = 8 GB/s effective; up to 64K queues × 64K depth — parallelism matches NAND flash internal parallelism. NVMe command set is streamlined (2 doorbell register writes to submit/complete a command vs. SATA's 7 register writes). Controller ASIC: needs PCIe MAC, NVMe controller (queue management, completion), FTL (Flash Translation Layer) for NAND, ECC engine (LDPC). Marvell's 88SS9187 is a mainstream NVMe SSD controller.

Interview tip: NVMe's deep queue parallelism is what enables high IOPS for SSDs — SATA's 32-command NCQ was designed for rotational HDD latency characteristics.
06
HardArchitecture
How does a hardware hash table handle collisions in a networking ASIC?

Hardware hash tables are used for MAC address lookup, flow tables, and ARP caches in switch ASICs. Collision handling: (1) Open addressing (linear or quadratic probing): deterministic, bad worst-case; (2) Chaining: overflow list in a separate SRAM — requires pointer following (extra latency); (3) Cuckoo hashing: two hash functions, two tables — if primary bucket full, displace existing entry to alternate bucket. Lookup always checks both locations in parallel → O(1) guaranteed lookup, insertion O(1) amortized; (4) Perfect hashing: pre-compute collision-free table offline — feasible for static tables (e.g., VLAN tables).

Marvell uses cuckoo hashing in Prestera switches for the FDB (Forwarding Database). Two 512K-entry SRAM banks, each checked in parallel per lookup cycle. Guaranteed 2-cycle lookup at 400 MHz → under 5 ns, fast enough for 400GbE line rate.

Interview tip: Cuckoo hashing is the networking ASIC standard — mention that worst-case insertion requires a chain of displacements (bounded to ~500 operations before rehash).
07
MediumArchitecture
Explain weighted round-robin (WRR) scheduling in a switch ASIC output port.

WRR assigns a weight Wi to each priority queue i. The scheduler services queues in round-robin order but proportional to weight: queue i gets Wi/(sum of weights) fraction of bandwidth. Example: 3 queues, weights 4:2:1 → high priority gets 4/7 = 57% bandwidth.

Implementation: credit counter per queue. Each queue starts with Wi credits per round. Dequeue one packet per credit; when credits reach 0, skip to next queue. After full round, refill all credits. Strict priority queues (SP) for latency-sensitive flows (e.g., RoCE, voice) preempt WRR queues. Marvell Prestera uses SP+WRR hierarchy: top 2 queues strict priority, lower 6 queues WRR.

Interview tip: Deficit WRR (DWRR) handles variable-size packets: deficit counter tracks unused bytes from previous round, preventing large-packet queues from stealing bandwidth.
08
HardProtocols
How does Reed-Solomon FEC (RS-FEC) work in 100G/400G Ethernet and what does it correct?

RS-FEC (RS(544,514) for 100G-LR): systematic block code — 514 data symbols + 30 parity symbols per codeword (each symbol = 10 bits). Can correct up to 15 symbol errors per codeword (t = 15). Key property: corrects burst errors within a codeword window — ideal for SerDes where PAM4 lane errors tend to cluster.

Hardware implementation: RS encoder is a shift register with GF(2^10) polynomial multiplications. Decoder: syndrome computation → Berlekamp-Massey algorithm (find error locator polynomial) → Chien search (find error locations) → Forney algorithm (find error values) → correct. Throughput: must process one FEC codeword every 80 ns at 100Gbps. Requires highly parallel pipelined GF arithmetic. Pre-FEC BER target: 2.4×10^-4; post-FEC BER target: 10^-15.

Interview tip: Marvell implements RS-FEC in both the SerDes PHY and the MAC layer — know that FEC latency (~100 ns) is part of the link budget for 400G optical specs.
09
MediumDFT
How is a high-speed SerDes PHY tested using BIST and loopback modes?

SerDes BIST (Built-In Self-Test): (1) PRBS generator (PRBS-7, PRBS-15, PRBS-31) generates pseudo-random bit stream in TX; RX checks against locally generated PRBS to count bit errors — BER measurement without external tester; (2) Loopback modes: (a) near-end analog loopback (TX analog out → RX input on same die — tests TX+RX path); (b) far-end loopback (loop at remote end — tests full channel); (c) digital loopback (TX digital data → RX digital input — tests only digital portion); (3) Eye diagram capture: built-in eye monitor samples RX at different phase/voltage offsets to map eye opening — reports horizontal/vertical eye margins.

Production test: ATE uses SerDes BIST + loopback for rapid go/no-go without expensive external high-speed stimulus. At-speed PRBS test catches SerDes-specific defects (CDR failure, EQ misadaptation, PLL phase noise).

Interview tip: Know PRBS-31 is preferred for long-haul links (longer period → better randomness) and that PRBS requires sync before BER counting starts.
10
HardSTA
How do you allocate a jitter budget for a 112G SerDes link across TX, channel, and RX?

Total jitter budget for 112G (56 Gbaud): IEEE 802.3ck specifies JTOL (jitter tolerance) and Jitter Generation limits. Budget allocation: Total Jitter (TJ) = Deterministic Jitter (DJ) + Random Jitter (RJ) × Q_factor. At BER=10^-12, Q≈7 → TJ = DJ + 14·σRJ.

Typical 112G-CR4 budget (0.5m copper cable): TX jitter generation ≤ 0.2 UI DJ + 0.03 UI RJ; channel ISI + crosstalk adds 0.15 UI DJ; RX must tolerate 0.5 UI DJ + 0.07 UI RJ at its CDR input. RX JTOL spec (from IEEE): 0.65 UI total at low frequency. Equalization (FFE+CTLE+DFE) reduces ISI-induced DJ at RX. CDR bandwidth determines RJ filtering. FEC reclaims 0.1 UI effective margin by correcting residual errors.

Interview tip: Marvell SerDes validation teams use this budget to allocate which component (TX EQ, FEC, DFE taps) to optimize when a link misses margin.
01
HardArchitecture
How does Tesla's FSD (Full Self-Driving) ASIC differ architecturally from a GPU for neural network inference?

GPU: general-purpose SIMT (Single Instruction Multiple Threads) — thousands of small cores with shared memory and flexible compute. Good for training (irregular sparsity, variable batch sizes). High memory bandwidth required; DRAM bandwidth often the bottleneck.

Tesla FSD (HW3/HW4): two dedicated Neural Processing Units (NPUs), each a systolic array optimized for INT8/INT16 matrix multiply. Fixed dataflow (weight-stationary or output-stationary) eliminates scheduling overhead. On-chip SRAM sized for weight reuse across a batch of frames. Camera pipeline feeds directly into NPU memory — no DMA round-trips. Result: 36 TOPS at 72W (HW3) — better TOPS/W than GPU for this specific workload. Deterministic latency for safety-critical ADAS.

Interview tip: Tesla explicitly chose NPU over GPU for latency determinism and power efficiency. Emphasize that GPUs have non-deterministic scheduling (preemption), which is unacceptable for safety-critical inference.
02
HardArchitecture
Describe the systolic array architecture used in AI accelerators for matrix multiplication.

A systolic array is a 2D grid of Processing Elements (PEs). Each PE performs a multiply-accumulate (MAC). Data flows through the array rhythmically: activations flow left-to-right, weights flow top-to-bottom (or are pre-loaded). Each PE computes one partial sum, passes activations to the right and weights downward.

For an N×N array computing C = A×B: each output element C[i][j] = Σ A[i][k]·B[k][j] accumulates over K cycles. No global interconnect — each PE only communicates with neighbors → very regular layout, high utilization, low power. Data reuse: once a weight tile is loaded into the array, many input vectors flow through, amortizing DRAM bandwidth. Google TPU, Tesla FSD NPU, and Arm Ethos all use systolic arrays.

Interview tip: Key insight: systolic arrays achieve high MAC utilization only when batch size / matrix dimensions match the array shape. Padding and tiling strategy in the compiler is critical.
03
MediumArchitecture
What is INT8 quantization in neural networks and how does it affect accuracy and hardware efficiency?

Quantization maps floating-point weights and activations to 8-bit integers: Q = round(FP / scale + zero_point). Scale and zero_point are per-layer or per-channel calibration parameters. Benefits: (1) 4× smaller model size vs. FP32; (2) 4× faster matrix multiply (INT8 SIMD 4× throughput); (3) Lower memory bandwidth; (4) Lower power.

Accuracy impact: post-training quantization (PTQ) typically loses <1% accuracy for CNN classification but can degrade more for detection/segmentation heads and transformers. Quantization-aware training (QAT) simulates INT8 during training — recovers most accuracy. Per-channel weight quantization (different scale per output channel) significantly reduces accuracy loss vs. per-tensor.

Interview tip: Tesla uses INT8 (and partial INT16 for sensitive layers) in FSD. Know that activations must be calibrated using representative data (calibration dataset) to find optimal scale factors.
04
HardDFT
What does ISO 26262 ASIL-D require at the hardware level for a safety-critical SoC?

ISO 26262 ASIL-D (highest automotive safety integrity level): hardware architectural metrics — SPFM (Single-Point Fault Metric) ≥ 99%, LFM (Latent Fault Metric) ≥ 90%, PMHF (Probabilistic Metric for random Hardware Failures) < 1×10^-8 /h. Requires hardware redundancy and diagnostic coverage to meet these targets.

Implementation at SoC level: (1) Lock-step CPU cores (two identical cores execute same instructions, outputs compared cycle-by-cycle — any divergence = fault); (2) ECC on all on-chip SRAMs and caches; (3) Parity on buses and register files; (4) Watchdog timers; (5) BIST for logic (LBIST) and memory (MBIST) at boot and periodically during operation; (6) Fault injection testing to verify diagnostic coverage. Tesla FSD ASIC has redundant NPU cores with output comparison for safety-critical perception paths.

Interview tip: Tesla interview expects you to know lock-step vs. split-lock (different tasks, compared via software), and that ASIL-D requires hardware redundancy — software monitoring alone is insufficient.
05
MediumDFT
What are the DFT challenges specific to an AI accelerator with large on-chip SRAM arrays?

Large SRAM challenges for DFT: (1) MBIST (Memory BIST): testing large SRAM arrays at-speed requires MBIST controllers per array. March algorithms (March C-, March LR) detect stuck-at and transition faults. At high density, coupling faults between adjacent cells become significant; (2) Repair: use redundant rows/columns (column repair common) with programmable fuses (eFuse, polyfuse); (3) Compiler-generated SRAM: foundry memory compilers provide BIST interfaces — verify timing across all PVT corners; (4) Power: during BIST, all SRAM arrays switching simultaneously → huge IR drop — must stagger MBIST enable signals; (5) Test time: multi-megabyte on-chip SRAMs can take minutes at-speed — hierarchical BIST with concurrent testing of independent banks reduces test time.

Interview tip: For Tesla's large weight-buffer SRAMs, mention that MBIST must run in automotive temperature range (-40°C to 125°C) with sufficient margin for Vmin characterization.
06
HardArchitecture
Why is memory bandwidth often the bottleneck for large transformer model inference?

Transformer inference (auto-regressive generation): each token generation requires loading all model weights from memory (for each attention head and FFN layer). For GPT-3 (175B parameters, FP16): 350 GB of weights must transit the memory bus per token step. At HBM bandwidth of 3.35 TB/s (A100), this takes 350/3350 ≈ 105 ms/token — severely limits throughput.

The compute-to-memory ratio (arithmetic intensity) for token generation is very low: few MACs per byte loaded (weights are loaded once, multiply against one token vector). Prefill (prompt processing) is compute-bound (large batch); decode (generation) is memory-bandwidth-bound (batch=1 typically). Solutions: weight quantization (INT4 → 4× bandwidth reduction), speculative decoding, KV-cache compression, and compute-in-memory (CIM) architectures.

Interview tip: Tesla's Dojo training cluster is compute-bound (large batch training), but their FSD inference chip targets memory-bandwidth efficiency for single-vehicle inference at batch=1.
07
HardRTL Design
How does lock-step redundancy detect random hardware faults in a safety-critical CPU?

Lock-step: two identical CPU cores (or NPUs) receive the same inputs and execute the same instruction stream simultaneously. Outputs (register writes, memory writes, branch outcomes) are compared by dedicated comparison logic every cycle. Any divergence triggers a safety fault — the system can then switch to a safe state or use a third redundant core for voting (TMR — Triple Modular Redundancy).

Temporal offset variant: one core runs N cycles behind, comparing outputs with delay — detects transient (single-event upset) faults that would not affect two simultaneous cores identically. Spatial diversity: cores are physically separated on die to reduce common-cause failures (radiation, thermal gradients). In Cortex-R52 (common in ADAS): lock-step mode is built into the core — a single IP providing dual-core lock-step with internal comparators.

Interview tip: Lock-step doubles power and area — Tesla HW4 uses it only for safety-critical paths (CPU complex), not the high-throughput NPU (which uses output comparison instead).
08
HardCDC
What CDC challenges arise in a heterogeneous AI SoC with CPU, GPU, and NPU clock domains?

Heterogeneous SoC clock domains: CPU (1–3 GHz), GPU (800 MHz–1.5 GHz), NPU (500 MHz–1 GHz), LPDDR controller (fixed ratio to memory clock), ISP, video codec — each potentially asynchronous. CDC challenges: (1) Metastability at every domain crossing — require 2FF synchronizers (or 3FF for very high speed); (2) Multi-bit buses: cannot synchronize bus directly — need handshake protocol (req/ack) or FIFO-based crossing; (3) Control signals and data must be co-synchronized — a stale address with new data causes memory corruption; (4) Timing convergence: STA tools must correctly identify false paths at CDC boundaries — missing false paths creates phantom timing violations; (5) Formal CDC verification (SpyGlass CDC, Meridian) required to prove all crossings are safe.

Interview tip: Tesla's safety qualification requires formal CDC proof — not just simulation. Emphasize that CDC bugs are the hardest to catch in simulation because metastability is probabilistic.
09
HardRTL Design
What is the difference between an NPU, GPU, and fixed-function accelerator for ADAS workloads?

GPU: general SIMT, programmable, flexible data types. High power (200–400W). Non-deterministic scheduling. Best for: training, batch inference with irregular models. Latency not guaranteed.

NPU (Neural Processing Unit): structured MAC arrays (systolic or PE array), fixed dataflow, optimized for tensor ops. Programmable at layer granularity (compiler-generated schedules). Deterministic latency, power-efficient (1–10 TOPS/W). Best for: production inference of standard CNN/transformer layers.

Fixed-function accelerator: hard-wired pipeline for a specific algorithm (e.g., ISP pipeline, H.265 encoder, radar FFT engine). Lowest power, lowest area, fastest, most deterministic. Zero flexibility — cannot run a different algorithm. Used in camera ISP, video decode, radar processing in ADAS. Tesla FSD SoC combines: ARM CPUs (control), NPUs (neural net inference), fixed-function ISP (camera preprocessing), video decoder.

Interview tip: The ADAS SoC design philosophy: use fixed-function for deterministic, well-defined algorithms; NPU for DNN inference; CPU for control and edge cases.
10
MediumPower
How do you manage the power envelope of an autonomous driving SoC in a vehicle thermal environment?

Vehicle thermal constraints: automotive SoC must operate continuously at ambient up to 105°C (cabin temperature) + self-heating. Tesla FSD computer is liquid-cooled, but the chip must still meet TDP (Thermal Design Power) limits (~72W for HW3, ~100W for HW4) to prevent liquid cooling system from being oversized.

Power management: (1) Per-block power gating when camera/radar pipelines not active; (2) DVFS on CPU and GPU based on scene complexity (highway vs. urban); (3) NPU utilization capped when thermal headroom is low; (4) Workload shaping: Tesla's firmware defers non-critical tasks (map updates, logging) during peak thermal loads; (5) PMIC coordination: SoC signals power intent to PMIC via I²C/SMBus, which adjusts rail voltages; (6) Boot-time power characterization (OTP trim) ensures each chip operates at Vmin for its specific leakage.

Interview tip: Unlike mobile, automotive cannot throttle aggressively (safety functions must always run) — power budgets are allocated with guaranteed minimums for perception-critical paths.
01
MediumArchitecture
Compare cut-through vs store-and-forward switching and when each is preferred.

Store-and-forward: receive entire frame, check FCS (CRC), then forward. Latency = full frame transmission time (e.g., 64-byte frame at 100G = 5.12 ns; 1500-byte frame = 120 ns). Eliminates corrupt frames. Required when ingress and egress speeds differ (speed adaptation).

Cut-through: begin forwarding after reading the destination MAC (first 14 bytes = 112 ns at 1G, 1.12 ns at 100G). Dramatically lower latency for small frames. Cannot check FCS — corrupt frames propagate. Fragment-free mode (512-bit minimum) filters collision fragments (legacy Ethernet). Broadcom's Tomahawk series supports cut-through for low-latency HPC interconnects (RDMA/RoCE where latency matters more than error checking).

Interview tip: Broadcom switches support configurable cut-through/store-forward per port. Know that cut-through increases in-flight corrupted frames if any upstream link has high BER.
02
HardArchitecture
What is Head-of-Line (HOL) blocking and how do Virtual Output Queues solve it?

HOL blocking: in a shared input queue switch, if the head packet is destined for a busy output port, it blocks all packets behind it — even those destined for idle output ports. Efficiency degrades to ~58% under uniform random traffic (Karol-Hluchyj-Morgan theorem for FIFO input queuing).

VOQ (Virtual Output Queue): maintain one queue per (input, output) pair. Packet at head of VOQ for output X only blocks if output X is busy — it does not block packets destined for output Y. VOQ requires a central scheduler (crossbar arbiter) that grants one input per output per slot (matching problem). iSLIP algorithm (iterative round-robin matching) solves this in O(N) time converging to maximum weight matching. Broadcom's StrataXGS architecture uses VOQs for large-scale Clos fabric switches.

Interview tip: VOQ requires N² queues (N inputs × N outputs) — memory scales as N², which is why VOQ is used in high-radix switch chips (N=256 ports), not cheaper edge switches.
03
HardPHY/Analog
What DSP techniques are required to implement a PAM4 SerDes receiver at 112 Gbps?

112G PAM4 (56 Gbaud) receiver DSP chain: (1) CTLE (Continuous Time Linear Equalizer): analog high-pass filter compensates cable/PCB loss (frequency-dependent attenuation); (2) VGA (Variable Gain Amplifier): normalize signal amplitude; (3) ADC (4-bit, ~3× oversampled): digitize signal for DSP; (4) Digital FFE (Feed-Forward Equalizer, 5–11 taps): linear FIR filter, compensates remaining ISI; (5) DFE (Decision Feedback Equalizer, 6–15 taps): uses previous decisions to cancel post-cursor ISI (non-linear, avoids noise amplification); (6) CDR (Clock Data Recovery): Mueller-Müller or baud-rate phase detector adapts sampling phase; (7) PAM4 slicer: compare against 3 thresholds (−2/3Vmax, 0, +2/3Vmax) to decode 2 bits per symbol; (8) FEC: RS(544,514) corrects residual errors.

Interview tip: DFE cannot fully equalize in the presence of precursor ISI (which DFE cannot cancel) — FFE handles precursors; DFE handles post-cursors. Both required at 112G.
04
MediumProtocols
How does PFC (Priority Flow Control) enable lossless Ethernet for RoCE/RDMA?

Standard Ethernet is lossy — tail-drop when buffers fill. RDMA (and RoCE v2 over UDP) requires lossless transport (packet loss triggers expensive retransmission or connection reset). PFC (IEEE 802.1Qbb) adds per-priority flow control: when a switch ingress buffer nears full for priority class X, it sends a PAUSE frame to the upstream sender for class X only — other priorities continue unaffected.

Implementation in Broadcom ASIC: per-port, per-priority threshold triggers XOFF PAUSE generation. Sender hardware responds to PAUSE frame by halting class-X transmission within 100 ns (hardware-enforced, not software). The backpressure propagates hop-by-hop through the fabric. Headroom buffer per port-priority is sized to absorb in-flight packets while PAUSE propagates. ECN (Explicit Congestion Notification) is used end-to-end alongside PFC for congestion avoidance (DCQCN algorithm).

Interview tip: PFC deadlock is a real risk in large fabrics — circular PAUSE dependencies can freeze the fabric. Broadcom's switches implement PFC watchdog timers to detect and recover from deadlock.
05
HardArchitecture
How does ECMP (Equal-Cost Multipath) forwarding work in a switching ASIC?

ECMP: when multiple next-hops exist for a destination (equal routing metric), traffic is distributed across all of them. Hashing: a tuple (src IP, dst IP, src port, dst port, protocol) is hashed (CRC32 or Toeplitz hash) to select a next-hop index. Same flow always hashes to the same next-hop → consistent ordering within a flow (no reordering). Different flows hash to different next-hops → load balancing.

ASIC implementation: ECMP group table stores N next-hop entries. Hash output selects index 0..N-1. For N not a power of 2: modulo N or resilient hashing (Broadcom's Resilient ECMP) avoids remapping of surviving flows when one next-hop fails. Resilient ECMP: use a large bucket table (1024 entries) mapped to N next-hops; on failure, only reassign the failed next-hop's buckets, minimizing flow disruption.

Interview tip: Standard ECMP remaps ~50% of flows on any next-hop change. Resilient ECMP (Broadcom StrataXGS feature) limits remapping to 1/N of flows — critical for RDMA which cannot tolerate reordering.
06
MediumArchitecture
Explain DWRR scheduling and why it is preferred over strict WRR for networking ASICs.

WRR (Weighted Round-Robin): service queues in proportion to weight. Works well for fixed-size packets. For variable-size packets: a queue with large packets gets more bytes than intended per round even if its packet count matches the weight — unfair.

DWRR (Deficit WRR): each queue maintains a deficit counter. Per round, each queue receives Wi bytes of credit (not packets). A packet is dequeued if (deficit ≥ packet_size). After dequeue, deficit -= packet_size. Unused deficit carries over to next round. Result: byte-accurate fairness regardless of packet size. Implementation: hardware maintains per-queue deficit register, updated each dequeue decision. Broadcom's Trident series implements DWRR for best-effort queues with SP (strict priority) queues for latency-critical traffic (RDMA, latency-critical apps).

Interview tip: DWRR is the standard for data center Ethernet — IEEE 802.1Q-2014 (Credit-Based Shaper for AVB) uses a similar mechanism. Know that Broadcom also supports CBS (Credit-Based Shaper) for TSN.
07
HardProtocols
How is RoCE v2 (RDMA over Converged Ethernet) implemented in hardware on a switching ASIC?

RoCE v2 encapsulates InfiniBand transport packets in UDP/IP/Ethernet. Hardware offload on NIC: zero-copy DMA, QP (Queue Pair) management, CQ (Completion Queue) notifications — CPU is not involved per-packet. The switch ASIC is RoCE-unaware at packet level — it treats RoCE as UDP traffic. But the switch must support: (1) PFC for lossless delivery; (2) ECN marking (WRED/RED with ECN bit set) for DCQCN congestion control; (3) Low latency (cut-through forwarding); (4) RoCE v2 requires routable IP — switch does L3 routing same as any IP packet.

Broadcom StrataXGS implements hardware ECN marking: when queue depth exceeds a threshold, incoming RoCE packets get ECN CE (Congestion Experienced) bit set in IP header. The receiver NIC echoes this to the sender NIC, which reduces injection rate. This DCQCN loop runs entirely in hardware at 100ns timescales.

Interview tip: The key insight is that RoCE v2 losslessness is enforced by PFC in the switch, and congestion avoidance by ECN marking + DCQCN — two separate mechanisms working together.
08
HardArchitecture
Walk through the forwarding pipeline of a 400GbE switching ASIC from packet ingress to egress.

Broadcom Tomahawk 400G pipeline (simplified): (1) Ingress MAC: receive 400G signal, SerDes decode, RS-FEC decode, MAC framing, CRC check; (2) Ingress parser: extract packet headers (Ethernet, VLAN, IP, TCP/UDP) into header vector; (3) Ingress pipeline: VLAN lookup, ACL (TCAM), L2 FDB lookup, L3 LPM (ECMP), QoS marking, meter/policing — all in 1–2 ns pipeline stages; (4) Switching fabric: cell-based internal fabric (cells = 128-byte cells from packet segmentation) — crossbar or Clos; (5) Egress pipeline: rewrite (TTL decrement, MAC swap, VLAN push/pop), QoS queue selection, scheduler; (6) Egress MAC: RS-FEC encode, SerDes transmit.

Throughput requirement: 25.6 Tbps aggregate for 64×400G ports. Pipeline clock: 1.2 GHz, highly parallel. Packet rate: 64 bytes minimum → 3.2 Bpps — each pipeline stage processes one cell per clock.

Interview tip: Emphasize that TCAM (ACL/LPM) and SRAM (FDB/ARP) lookups are pipelined — Broadcom uses 4+ pipeline stages for L2/L3/ACL to run all lookups in parallel.
09
MediumDFT
How is a 112G SerDes PHY array tested in production using on-chip BIST at Broadcom?

Production test strategy for 112G SerDes: (1) PRBS-31 loopback BER test: all SerDes lanes enabled simultaneously in near-end analog loopback, PRBS-31 pattern generated and checked. BER < 10^-12 required for pass; (2) Eye monitor scan: hardware eye monitor sweeps phase (16 steps across UI) and voltage (16 levels), records error count per point — maps eye opening. Minimum eye height/width checked vs. spec; (3) EQ self-adaptation: CDR, CTLE, DFE adapt during PRBS run — verify adaptation convergence (final coefficient values within bounds); (4) Lane-to-lane crosstalk: run PRBS on all lanes simultaneously to stress worst-case crosstalk; (5) Clock jitter measurement: internal TDC (Time-to-Digital Converter) measures CDR recovered clock phase noise.

Test time: 64 lanes × PRBS run + eye scan ≈ 30–60 seconds per device. Parallel testing all lanes simultaneously cuts total time.

Interview tip: Broadcom tests SerDes PHY both at-speed (PRBS BER) and parametrically (eye scan, jitter) — both are needed because BER pass doesn't guarantee adequate margin for field reliability.
10
HardSTA
What makes the Ethernet MAC datapath a critical timing path in a 400G switch ASIC?

400G Ethernet MAC processes data at 400 Gbps. Using 256-bit internal datapath at 1.6 GHz: each cycle processes 256 bits = 32 bytes at 1.6 GHz = 51.2 GB/s per MAC. For 64 ports: 64×400G = 25.6 Tbps total. Critical paths: (1) FCS (CRC-32) computation: must compute CRC over 256-bit word in one cycle → requires parallel CRC algorithm (precomputed Galois field matrix); (2) Parser: extracting IP/TCP offsets from variable-length headers in one cycle → wide combinational mux tree; (3) Alignment FIFO: CDC between SerDes recovered clock and MAC core clock — synchronizer contributes latency; (4) Insertion of preamble/SFD and padding: modifies first/last cells — conditional logic on critical path; (5) Checksum offload: inner IP checksum calculation for tunnel traffic in the same cycle as outer MAC.

Interview tip: Broadcom's Tomahawk 4 MAC runs at ~1.2 GHz. Critical path closure requires retiming (adding pipeline flops), logic optimization, and physical awareness (placement-aware synthesis) for the CRC and header parser logic.
No questions match your search. Try a different keyword or select "All".