A latch is a level-sensitive storage element. When the enable (or clock) is HIGH, the output follows the input continuously — the latch is "transparent." When the enable goes LOW, the last value is held. A flip-flop is edge-triggered — it samples the input only at the exact rising (or falling) clock edge and ignores the input at all other times.
Why flip-flops dominate RTL design: Their predictable sampling window makes static timing analysis (STA) straightforward — setup and hold times are well-defined relative to one clock edge. Latches create "transparent windows" that make STA far more complex; timing tools must ensure that no combinational path through an open latch violates timing in any cycle.
When to use latches deliberately:
- Power savings: A latch consumes no clock dynamic power when transparent (no clock-to-output toggling).
- High-performance pipelines: In "latch-based" designs (common in custom datapath and CPUs), a latch pair (master + slave) forms a pseudo-FF but allows time-borrowing — a slow first half-cycle can steal time from a fast second half-cycle, improving throughput.
- Specialized cells: Sense amplifiers and certain memory cells use latch-based structures.
D = en ? data_in : Q. The synthesis tool maps this to a clock-gate cell, not a latch.Dividing by an odd number and achieving 50% duty cycle requires using both clock edges. A single-edge counter can only produce a 33%/67% duty cycle.
The technique: Create two signals derived from the same mod-3 counter — one toggled on the rising edge, one on the falling edge — then OR them together.
- Use a 2-bit counter clocked on the rising edge counting 0→1→2→0. Generate
out_r= HIGH when count == 0, LOW when count == 1. - Use the same counter logic clocked on the falling edge. Generate
out_fidentically. - Final output =
out_r OR out_f. Becauseout_randout_fare offset by half a source clock period, their OR produces exactly 1.5 periods HIGH and 1.5 periods LOW out of every 3 source periods — 50% duty cycle.
The FIFO must hold all data written during a burst before the reader catches up. The minimum depth is:
Depth ≥ (Write rate − Read rate) × Burst duration + Synchronizer latency guard
Breaking this down:
- Burst excess: If a writer sends at
fwwords/cycle forTcycles, and the reader drains atfrwords/cycle, the net accumulation is(fw − fr) × Twords. This is the minimum storage needed. - Synchronizer latency: The gray-code read pointer takes 2–3 destination-clock cycles to propagate through the synchronizer. During this window, the write side may falsely see the FIFO as full (or the read side sees it as empty). Add 2–3 words of margin per side.
- Round up to power of 2: Async FIFO address arithmetic requires a power-of-2 depth so that the gray code pointer MSB inversion trick for full/empty detection works correctly.
A glitch (or hazard) is a spurious, short-duration output pulse that occurs when multiple inputs change simultaneously and race through paths of unequal delay. Even though the steady-state output is correct, the transient produces an unwanted transition.
Classic example: A static-1 hazard in a 2-input AND gate where both inputs come from the same signal A through paths with different delays — one direct, one through an inverter. Mathematically A AND NOT(A) = 0, but if the direct path is faster, the gate briefly sees 1 AND 1 = 1 before the inverted path arrives.
Why glitches matter:
- Power waste: Every glitch is a switching event that consumes dynamic power (
αCV²f). High-activity buses can waste significant power. - Clock path corruption: A glitch on a clock or enable line can clock a flip-flop at the wrong time, causing functional failure.
- Latch transparency: Glitches on a latch enable propagate directly to the latch output while it is transparent.
Prevention:
- Register outputs: Sampling glitchy combinational logic in a flip-flop on the clock edge filters all glitches shorter than the setup window.
- Hazard-free logic: In Karnaugh map minimization, add "consensus" prime implicant terms that cover the transition between any two adjacent groups — eliminates static hazards.
- Clock gating cells (ICG): Use library clock gate cells that latch the enable on the clock LOW phase — ensures the gated clock output is always a complete pulse or no pulse at all.
A naive assign clk_out = sel ? clk1 : clk0 will produce a glitch when sel changes — the output can get a truncated pulse from one clock or a merged pulse from both. This corrupts any flop clocked by clk_out.
The safe design uses an interlocked two-branch structure:
- Each branch has a flip-flop clocked on the falling edge of its own clock to gate that branch on or off.
- Branch 0 FF:
D = !sel AND !en1_q, clocked on negedge clk0. Branch 1 FF:D = sel AND !en0_q, clocked on negedge clk1. - Each branch gates its own clock:
clk0_g = clk0 AND en0_q,clk1_g = clk1 AND en1_q. - Output:
clk_out = clk0_g OR clk1_g.
Why falling-edge clocking? Gating on the falling edge ensures that the enable change is captured while the clock is LOW, so the gated clock output is either a complete HIGH pulse or nothing — never a partial pulse.
Why the cross-interlocking? The !en1_q / !en0_q terms ensure only one branch is ever active at a time. The transition from clk0 to clk1 requires clk0's branch to deassert fully before clk1's branch asserts — preventing both from being active simultaneously.
Setup time (t_su) is the minimum time the data input must be stable before the active clock edge for the flip-flop to reliably capture it. Hold time (t_h) is the minimum time the data must remain stable after the clock edge.
Together they define a "forbidden window" around the clock edge where data must not change.
Setup violation (data arrives too late): The flip-flop samples data before it has settled to a valid logic level. The FF may capture the wrong value or enter a metastable state. This is a functional failure at the target frequency — the design either works slowly or not at all. Setup violations are frequency-dependent: slow down the clock enough and they disappear.
Hold violation (data changes too soon after the clock edge): The flip-flop's captured value is overwritten before it is fully stored. This causes the FF to capture a corrupted value — either the new data that hasn't fully arrived, or garbage. Hold violations are frequency-independent — they occur even at 1 Hz and are caused by short combinational paths (fast data propagation relative to clock skew). They are the more dangerous class because no amount of slowing down the clock fixes them.
Yes, the fixes are completely different and cannot be mixed up.
Fixing a setup violation (data arrives too late — reduce data path delay):
- Replace high-drive-strength cells with faster variants (higher Vt cells are slower; swap to lower Vt)
- Reduce logic depth — restructure combinational logic to fewer gate stages
- Use retiming — move registers across combinational logic to balance stages
- Add pipeline registers to split a long path into two shorter ones
- Optimize clock skew — use positive skew on the capture FF (delay the capture clock) to give the data path more time
- Last resort: reduce the clock frequency
Fixing a hold violation (data arrives too early — increase minimum data path delay):
- Insert delay buffers (DEL cells from the standard cell library) on the short data path
- Use higher-drive-strength cells — paradoxically, some cells have more internal delay than smaller ones
- Add logic stages that cancel each other (insert an even number of inverters)
- Adjust clock skew: negative skew on the capture FF (advance the capture clock) reduces the hold window
Clock skew is a static, deterministic difference in clock arrival time between two flip-flops on the same chip. It is fixed for a given netlist and process corner. Skew arises from different buffer depths or wire lengths in the clock tree.
Clock jitter is a dynamic, cycle-to-cycle variation in the clock edge position. It is random (caused by power supply noise, substrate coupling, PLL VCO noise) and varies every cycle. You cannot predict the sign or magnitude of jitter in any given cycle.
Effect on timing:
- Skew and setup: Positive skew (capture FF clock arrives later) helps setup — the data has more time to travel. Negative skew (capture clock earlier) hurts setup.
- Skew and hold: Positive skew hurts hold — data launched early from the launch FF might arrive at the capture FF before its (delayed) clock edge. Negative skew helps hold.
- Jitter and both: Jitter degrades both setup and hold because you can't know which direction the edge will shift. STA tools add a "clock uncertainty" (a worst-case jitter margin) that reduces both setup and hold slack. It cannot be recovered through skew optimization.
For a path from a launch flip-flop (FF1) to a capture flip-flop (FF2):
Setup check — data must arrive before the capture edge:
- Data arrival time =
T_clk_launch + T_cq(FF1) + T_comb - Data required time =
T_clk_capture − T_setup(FF2) - Setup slack = Required − Arrival =
(T_clk_capture − T_su) − (T_clk_launch + T_cq + T_comb)
Hold check — data must not arrive too early:
- Data must arrive after:
T_clk_capture + T_hold(FF2) - Hold slack = Arrival − Hold_required =
(T_clk_launch + T_cq_min + T_comb_min) − (T_clk_capture + T_hold)
Where:
T_cq= clock-to-Q propagation delay of the launch FFT_comb= total combinational path delay (sum of gate + wire delays)T_setup / T_hold= FF timing constraints from the cell libraryT_clk_capture − T_clk_launch= clock skew (positive = capture is later)
Slack = margin above the requirement. Positive slack → timing met. Negative slack → timing violated. The most negative slack in the design = worst negative slack (WNS); summing all negative slacks = total negative slack (TNS).
A simultaneous setup and hold violation on the same path means the combinational delay window is too narrow — the path is fast enough to threaten hold, but not fast enough to meet setup. This typically means the logic between two FFs is very shallow (perhaps only a wire or one gate), but there is also a clock tree imbalance creating large skew.
Diagnosis first:
- Check the clock skew between launch and capture FFs. Excessive positive skew can simultaneously worsen hold (by making the capture clock late) while hurting setup if the data path is borderline.
- Look at the path's actual combinational depth — very few gates means it's a structurally short path.
Fix strategy:
- Rebalance the clock tree first: Reduce skew between these two FFs. This is the most targeted fix — less skew directly improves both simultaneously.
- Insert delay cells: Add buffers on the data path to increase minimum delay (fix hold). Then verify setup is still met — if setup is tight, you may need to also optimize the logic depth.
- Restructure logic: If setup is violated because the path is in a long combinational chain overall, pipelining it (adding an intermediate register) can help. But this changes the design architecture.
Metastability occurs when a flip-flop's setup or hold time is violated — the flip-flop enters a metastable state where its output is neither a valid logic 0 nor a valid logic 1. The internal node of the FF is stuck near the switching threshold (V_DD/2) and takes an unpredictable time to resolve to a valid level.
The physics: A flip-flop is a bistable element with two stable equilibria (0 and 1) and one unstable equilibrium (the metastable point). When forced into the unstable point, it resolves exponentially fast — but how long it takes is governed by thermal noise and is therefore random.
Can it be eliminated? No — not completely. Any time asynchronous data crosses a clock boundary, there is a non-zero probability of violating the setup/hold window. The probability of remaining metastable beyond a time T_r decreases exponentially with T_r, but never reaches exactly zero.
What we do instead: We manage the probability using synchronizers. The key metric is MTBF (Mean Time Between Failures). A 2-flop synchronizer gives the metastable FF one full clock period to resolve — in modern CMOS (τ ≈ 30ps), at 1 GHz this gives MTBF of thousands of years, making failure astronomically unlikely.
A 2-flop synchronizer consists of two back-to-back flip-flops, both clocked by the destination domain clock, inserted on a signal crossing from another clock domain.
How it works: The first FF may go metastable when it samples the asynchronous input. It has one full clock period (minus the FF's own propagation delay and the second FF's setup time) to resolve. Because metastability resolution time is exponential, the probability that it remains metastable long enough to corrupt the second FF is extremely small.
Why not 1 flop? One flop doesn't give enough resolution time. The metastable FF must resolve within roughly T_clk − T_cq − T_setup2 — for a 1 GHz clock, that's ~500ps. The probability of remaining metastable that long is non-trivial in some technologies.
Why not 3 flops? Three flops are rarely necessary. With a 1 GHz destination clock and τ ≈ 30ps (modern 7nm), a 2-flop synchronizer gives:
- Resolution time T_r ≈ 500ps, τ ≈ 30ps
- MTBF ≈ e^(T_r/τ) / (f_c × f_d) ≈ e^16.7 / 10^18 → millions of years
Three flops extend MTBF to an astronomically larger number that adds no practical benefit. Use 3 flops only in safety-critical applications (automotive ASIL-D, aerospace) where even million-year MTBF is required to be proven insufficient with 2.
In an async FIFO, the read pointer lives in the read clock domain and the write pointer lives in the write clock domain. Each pointer must be compared against the other's synchronized version to determine full or empty.
The problem with binary counters: When a binary counter increments, multiple bits change simultaneously. For example, 0111 → 1000 changes all 4 bits. If you sample a binary counter while it's transitioning, you might read any of the 16 possible values — a catastrophic error that could falsely declare the FIFO full or empty, corrupting data.
Why Gray code solves this: A Gray code changes exactly one bit per count. When the pointer transitions from count N to N+1, only one bit flips. If the synchronized copy is sampled mid-transition, the worst case is that it sees either count N or count N+1 — off by at most one.
The FIFO full/empty logic deliberately has one count of tolerance: full is declared when the pointers are exactly DEPTH apart, not DEPTH−1. This one-count margin absorbs the maximum one-count error introduced by Gray code sampling, making the detection robust.
gray = bin XOR (bin >> 1). Store the Gray counter as the pointer, convert back to binary (requires a loop of XORs) only if you need the absolute address for memory indexing.MTBF (Mean Time Between Failures) quantifies how often a synchronizer is expected to allow a metastable signal to propagate into the destination domain. The standard formula is:
MTBF = e^(T_r / τ) / (f_c × f_d × t_w)
Where:
- T_r — resolution time available for metastability to resolve (≈ T_clk − T_cq_ff1 − T_su_ff2). The more time, the exponentially higher the MTBF.
- τ — technology metastability time constant. A smaller τ means the FF resolves faster, improving MTBF. Scales with process: ~100ps at 180nm, ~30ps at 7nm.
- f_c — destination clock frequency. Higher frequency = more sampling opportunities per second = more chances for metastability to cause failure.
- f_d — data toggle rate. How often does the incoming signal change near the clock edge?
- t_w — the setup+hold window width. Narrower window = smaller probability of entering metastability per clock cycle.
The exponential dependence on T_r/τ is why adding a second synchronizer flip-flop dramatically improves MTBF — it adds one full clock period to T_r.
Dynamic power = α × C_L × V_DD² × f — consumed when a node switches from 0 to 1 (charges the load capacitance) or 1 to 0 (discharges through the pull-down network). α is the activity factor (fraction of clock cycles the node switches). This power is zero when the circuit is idle.
Static (leakage) power = I_leakage × V_DD — consumed even when no gates are switching, due to sub-threshold current, gate oxide tunneling, and junction leakage. It does not depend on frequency and is present whenever power is supplied.
Historical trend: In nodes above 90nm, dynamic power dominated. Below 28nm and especially at 7nm/5nm, leakage has grown dramatically because transistors cannot be switched fully off at low supply voltages. Modern SoCs spend significant area on leakage management.
Reduction techniques:
- Dynamic: Clock gating (reduce α), operand isolation (prevent toggling of datapath), voltage scaling (V² dependence), frequency scaling, low-swing signaling
- Static: Power gating (cut V_DD to an entire domain via header/footer cells), multi-Vt design (use High-Vt cells in non-critical paths — slower but much lower leakage), reverse body biasing, state retention during power down
Clock gating removes the clock signal from a register bank when its value won't change, eliminating the clock-to-Q dynamic power and the switching power of all downstream logic. The clock network can account for 30–40% of total chip dynamic power, making clock gating one of the highest-impact power techniques.
Why you cannot simply write assign gated_clk = clk AND enable:
If enable changes while clk is HIGH, the AND gate output glitches — it produces a truncated clock pulse shorter than a full cycle. This truncated pulse can violate the setup/hold requirements of any flip-flop it clocks, corrupting stored data or causing metastability.
The correct implementation — Integrated Clock Gating (ICG) cell:
- A latch samples the enable signal on the LOW phase of the clock (when clock = 0)
- The latched enable is then ANDed with the clock
- Because the latch captures
enableonly when the clock is LOW, by the time the clock rises, the latch output is stable — the AND gate sees a stable enable and a clean rising edge → full clock pulse or no pulse, never a partial one
In RTL, you write: if (enable) register <= data; and the synthesis tool infers an ICG cell. Never write clock gating manually at the gate level in RTL — let the tool use the optimized library ICG cell.
A scan chain connects the flip-flops in a design into a long shift register that can be controlled and observed from the chip's I/O pins, purely for testing purposes.
How it works: Each flip-flop in the design is replaced with a scan flip-flop — identical to a normal FF but with an extra 2:1 mux at the data input:
- In functional mode (scan_enable = 0): the mux passes normal D input — the design operates as designed.
- In scan mode (scan_enable = 1): the mux passes the previous FF's output — all FFs form a shift register. You can shift in a test pattern, capture one functional clock cycle, and shift out the results for comparison.
Why it's essential: Without scan, testing whether a stuck-at fault (wire permanently stuck at 0 or 1) exists deep in the chip requires applying just the right sequence of primary input patterns — combinatorially explosive. With scan, an ATPG (Automatic Test Pattern Generation) tool can directly control any FF's state and observe any FF's captured output, enabling near-100% stuck-at fault coverage with a manageable number of test vectors.
Test flow at production: After packaging, every chip is tested on an ATE (Automated Test Equipment). The scan chain shifts in millions of vectors and compares the shifted-out responses against the fault-free model. Any mismatch → chip fails and is discarded.
Every AXI4 channel (AW, W, B, AR, R) uses a two-signal handshake: VALID (driven by the sender) and READY (driven by the receiver). A transfer occurs on the rising clock edge when both VALID and READY are simultaneously HIGH.
Rules:
- The sender asserts VALID when it has valid data/address to send and must not deassert VALID until the transfer completes (both signals HIGH on a clock edge).
- The receiver asserts READY when it can accept data. READY may be HIGH before VALID (pre-ready) — this is fine.
- If VALID is asserted and READY is LOW, both sides wait. Neither can "cancel" the transaction by deasserting VALID without completing the handshake.
The rule that must never be broken: VALID must not combinatorially depend on READY. If the master only asserts VALID after it sees READY, and the slave only asserts READY after it sees VALID, the result is a deadlock — neither ever fires first. READY is allowed to depend on VALID, but not vice versa.
AXI4 has 5 independent channels, enabling key features: a new write address (AW) can be accepted while write data (W) from a previous burst is still in-flight. Read and write transactions are completely independent, maximizing bus utilization.
AXI4 allows a master to issue multiple outstanding read or write transactions before receiving responses. Each transaction is tagged with a Transaction ID (ARID for reads, AWID for writes). The slave and interconnect are free to complete transactions in a different order from how they were issued — a fast SRAM access may return data before a slow DRAM access even if the DRAM request was issued first.
How the master reconciles responses: Read data returns on the R channel with RID matching the original ARID. Write responses return on the B channel with BID matching AWID. The master maintains an outstanding transaction table and uses the ID to match each response to the correct request.
Ordering rule per ID: Transactions with the same ID must complete in order. If a master issues two reads both with ARID=3, the interconnect must return them in order. Transactions with different IDs have no ordering guarantee relative to each other.
Interconnect ID widening: When multiple masters share an interconnect, the fabric appends a master-identifying prefix to each ID (e.g., 2-bit master select + original ARID = extended RID). On the response path, the prefix is used to route the response back to the correct master, which strips the prefix before comparing IDs.
Pipelining divides a long combinational operation into N sequential stages, each separated by flip-flops. Instead of one result every T_total clock period (dictated by the slowest path), you get one result per T_total/N clock period — throughput increases N× once the pipeline is full.
Example: A 5-stage 32-bit multiplier at 500 MHz produces one product every 2 ns. Without pipelining, the same logic would run at 100 MHz (5× slower combinational chain). With pipelining, a new multiply starts every cycle — though each individual result still takes 5 cycles of latency.
What pipelining improves: Throughput (results per unit time) — directly, by allowing clock frequency to be multiplied by the number of stages.
Trade-offs:
- Latency: Each result takes N cycles to complete instead of 1. This is often acceptable for bulk data, but hurts interactive or latency-sensitive operations.
- Area: N−1 extra register stages add flip-flop area and routing overhead.
- Power: More registers switching every cycle; however the lower V_DD enabled by higher-frequency operation may offset this.
- Hazards: Data hazards (RAW — read after write: a stage needs a result not yet produced by a later stage), control hazards (branches), and structural hazards (resource conflicts) require stalls, forwarding, or branch prediction logic — all of which reduce ideal throughput.
- Balancing: If one stage is slower than others, it bottlenecks the pipeline. All stages must be balanced to the same worst-case delay for the frequency gain to be fully realized.
Blocking assignment (=) executes sequentially within an always block — each statement completes before the next begins, exactly like a software assignment. The left-hand side updates immediately.
Non-blocking assignment (<=) evaluates all right-hand sides first (using values from the current time step), then schedules all left-hand side updates to happen simultaneously at the end of the time step. This models the parallel behavior of flip-flops sampling their D inputs on a clock edge.
The golden rules:
- Use
=(blocking) for combinational logic inalways @(*)blocks. The sequential evaluation correctly implements the logic function. - Use
<=(non-blocking) for sequential logic inalways @(posedge clk)blocks. The simultaneous update models how FFs all sample their D input on the same clock edge. - Never mix both types in the same
alwaysblock.
Classic bug with the wrong choice: A shift register written with blocking assignments (a = in; b = a; c = b;) immediately propagates the input through all stages in a single clock cycle. With non-blocking (a <= in; b <= a; c <= b;), all three FFs sample their current input simultaneously — correct shift register behavior.
wire, reg, and logic in SystemVerilog? Why was logic introduced?wire (Verilog): a net type representing a physical connection. It can only be driven by continuous assignments (assign) or module output ports. Multiple drivers resolve via wired-AND/OR logic depending on the net type. It cannot hold state.
reg (Verilog): a variable that can be driven inside procedural blocks (always, initial). Despite its name, it does NOT necessarily synthesize to a register — a reg inside always @(*) synthesizes to combinational logic. The name is misleading and a common source of confusion.
logic (SystemVerilog): a unified 4-state variable type that replaces both wire and reg for most use cases. It can be driven by both continuous assignments and procedural blocks. The key restriction: logic allows only one driver — the compiler flags multi-driver errors that wire silently allows. This catches accidental bus conflicts at compile time.
Why logic was introduced:
- Eliminates the confusing
regmisnomer —logiccommunicates data type, not inferred hardware. - Provides compile-time multiple-driver checking that
wirelacks. - Works in both continuous and procedural contexts, reducing declarations.
logic for almost everything. Use wire only when you explicitly need multiple drivers (e.g., tri-state buses, wired-AND). Avoid reg entirely in new SystemVerilog code.A synchronous FIFO uses a write pointer (wrptr) and a read pointer (rdptr) to track the head and tail. Both pointers start at 0. When the FIFO is empty, both point to the same location — and when it is completely full, both also point to the same location after wrapping around. This ambiguity is the core challenge of FIFO pointer design.
The naive approach fails: If both pointers are N-bit binary counters with range 0 to DEPTH-1, you cannot distinguish full from empty because both conditions result in wrptr == rdptr.
The extra bit trick: Use N+1 bit pointers, where N = log₂(DEPTH). The lower N bits are the actual memory address; the MSB (the "extra bit") acts as an overflow wrap indicator.
- Empty:
wrptr == rdptr(all N+1 bits equal — same wrap count, same address) - Full:
wrptr[N-1:0] == rdptr[N-1:0]ANDwrptr[N] != rdptr[N](same address, but one extra wrap ahead)
The MSBs differ when the write pointer has wrapped one more time than the read pointer — meaning the FIFO is exactly DEPTH entries deep.
False path: a timing path that exists in the netlist but will never carry valid data in the real operating system. STA should completely ignore it — no setup or hold analysis. Examples:
- Paths between two completely unrelated, never-simultaneously-active clock domains
- Paths from a test-mode-only mux output that is static during functional operation
- Reset synchronizer paths where the reset is never timing-critical
- Paths between scan mode logic not active during functional timing
SDC: set_false_path -from [get_cells launch_ff] -to [get_cells capture_ff]
Multi-cycle path (MCP): a path where data is intentionally designed to take N clock cycles to settle. The designer tells STA to use N×T_clk as the available time for the setup check instead of 1×T_clk.
SDC for a 2-cycle setup path: set_multicycle_path 2 -setup -from ... -to ...
Critical rule for MCP: When you relax setup by N cycles, you MUST also adjust the hold check. By default, STA places the hold check one cycle before the setup capture edge — correct for 1-cycle paths. For a 2-cycle setup, the hold check must also move back one cycle:
SDC: set_multicycle_path 1 -hold -from ... -to ...
set_multicycle_path 2 -setup without the matching -hold exception creates an overly pessimistic hold check one cycle before the new setup capture — often an impossible hold requirement that forces unnecessary delay insertion.OCV (On-Chip Variation) acknowledges that cells at different locations on the same die do not experience identical conditions. Spatial gradients in temperature, VDD (due to IR drop), and manufacturing process (oxide thickness, doping) cause cells in different parts of the chip to have slightly different delays — even if they are the same cell type running at the same nominal conditions.
This matters for STA because the clock path and data path typically run through physically different areas of the chip. If both paths were derated the same way, the error would cancel. But since one may be faster and the other slower, we must be pessimistic.
Flat OCV (Flat Derating): Applies a single multiplicative derating factor to all cells. The launch data path is made slower (multiply delays by e.g. 1.05) and the capture clock path is made faster (multiply by 0.95) for setup — worst-case pessimism everywhere. Simple but overly conservative.
AOCV (Advanced OCV): Uses a lookup table indexed by path depth (number of logic stages) and distance. Longer paths with more stages average out variation — a 30-stage path has less cell-to-cell variation than a 2-stage path. AOCV assigns less derating to deep paths, reducing pessimism and improving timing convergence without sacrificing accuracy.
POCV (Parametric OCV / LVF): Uses full statistical distributions (mean and sigma) for each cell's delay, propagating uncertainties through the path using statistical addition. This is the most accurate method and is becoming the industry standard at 7nm and below, where AOCV is no longer pessimistic enough.
When STA analyzes a flip-flop-to-flip-flop path, the launch clock path (from clock source to FF1) and the capture clock path (from clock source to FF2) often share common clock buffers near the root of the clock tree before they diverge.
With OCV derating, the tool pessimistically applies opposite deratings to the launch and capture paths: the launch clock is made slower (derated up) and the capture clock is made faster (derated down) for setup analysis. But the shared portion of the two paths cannot simultaneously be both slow and fast — it is the same physical cell running at the same moment in time.
CRPR removes this double-counting. For the portion of clock tree that is common to both launch and capture paths, the STA tool calculates how much pessimism was added by applying opposite deratings to the same cells, and adds that amount back as credit. The formula:
CRPR credit = max_delay(common) − min_delay(common)
This credit is added back to the setup slack. Typical CRPR values range from 10 ps to 100 ps depending on how much of the clock tree is shared and how aggressive the OCV derating is.
CRPR is sometimes called CPPR (Common Path Pessimism Removal) — both terms mean the same thing. Modern STA tools (PrimeTime, Tempus) apply it automatically.
Why per-bit synchronization fails: Each bit of the bus passes through its own 2-FF synchronizer independently. Each synchronizer may sample from a different source clock cycle — bit 3 might capture the value from cycle N while bit 0 captures the value from cycle N+1. The destination domain then reads a "torn" word that never existed in the source domain. For a 32-bit bus, this can produce completely wrong data.
Safe techniques for multi-bit CDC:
- Gray code (for counters/pointers): If the bus is a counter that increments by one at a time, encode it in Gray code before the crossing. Only one bit changes per count, so a sampled-in-transition value is at most off by one — which FIFO logic tolerates.
- Handshake (req/ack): Source asserts a request (req) after data has been stable for at least one source cycle. Destination synchronizes req (2-FF), samples the data only after req is asserted, then asserts ack. Source deasserts req after seeing synchronized ack. Both req and ack use separate 2-FF synchronizers. Low throughput (takes ~4–6 destination clock cycles per transfer) but works for any arbitrary data.
- Asynchronous FIFO: For streaming data, use an async FIFO with Gray-coded pointers. The FIFO internally handles all multi-bit CDC safely.
- Qualified sampling: Source keeps data stable for at least 3 destination clock cycles, then asserts a single "data valid" signal. Destination synchronizes the valid signal and samples the data on the synchronized valid. Risky — relies on the source holding data long enough.
A 2-FF level synchronizer is used when the source signal is a steady level that persists for many source clock cycles. The destination captures it safely after 2 destination clocks.
A pulse synchronizer is needed when the source generates a single-cycle pulse — a signal that is HIGH for exactly one source clock cycle. A 2-FF synchronizer cannot reliably capture this: if the destination clock is slower or at an unfortunate phase, the pulse may be missed entirely.
How a toggle-based pulse synchronizer works:
- Source domain: A toggle flip-flop converts each incoming pulse into a level change. Every time a pulse arrives, the FF inverts its output. The toggle signal therefore holds its value until the next pulse — making it a persistent level that won't be missed.
- Clock crossing: The toggle signal crosses the domain via a standard 2-FF synchronizer.
- Destination domain: An XOR of the synchronized output and its one-cycle-delayed copy detects each edge → generates a clean single-cycle pulse in the destination domain.
Constraint: Source pulses must be spaced at least 3 destination clock cycles apart so the previous toggle has fully propagated through the synchronizer before the next pulse arrives. If pulses can arrive faster, use an async FIFO instead.
UPF (IEEE 1801) is a standard format for capturing the power intent of a chip design in a separate file that accompanies the RTL. As SoCs moved to multiple power domains, it became impossible to express power management purely in RTL — the RTL describes logical functionality, not which block gets what voltage or when a domain shuts off.
What UPF defines:
- Supply networks: Which voltage rails exist (
VDD_CPU,VDD_MODEM,VDD_AON), their nominal voltages, and how they connect to design blocks. - Power domains: Which RTL modules belong to which supply rail. Each domain has a defined primary power supply.
- Power states: Which domains are ON or OFF in each operating mode (e.g., "sleep mode: modem ON, CPU OFF, AON ON").
- Isolation cells: Specifies where isolation cells must be inserted at the boundary of power-gatable domains, and what value they should clamp to when the domain is off.
- Retention registers: Which flip-flops need SRPG (State Retention Power Gating) cells to preserve state across a power-off event.
- Level shifters: Where voltage-level-shifting cells are needed between domains running at different voltages.
- Power switches: Header (PMOS) or footer (NMOS) transistors that gate the power supply to a domain.
Isolation cells are required at the output boundary of any power-gated domain. When a domain's power supply is cut, its flip-flops lose their state and outputs become undefined (float to a random value or X). If an always-on domain receives these floating signals, it may malfunction — latching garbage data, causing spurious state transitions, or drawing excessive short-circuit current.
An isolation cell is inserted on each output net of the power-gated block. It is connected to an always-on supply. When the domain is OFF, the isolation cell clamps the output to a safe known value (typically 0 for AND-based isolation, or 1 for OR-based) as specified in UPF. When the domain is ON, the isolation cell passes the signal through transparently.
Retention registers (SRPG — State Retention Power Gating) are special flip-flop variants with a small "shadow latch" connected to a separate always-on power rail (typically a low-leakage supply). The shadow latch holds only a few transistors, consuming a fraction of the normal FF's leakage.
Operation:
- Before power-off: The power management controller sends a SAVE signal → each SRPG cell captures its current state into its shadow latch.
- Domain is off: Main supply cut, shadow latch retains state at very low power.
- After power-on: A RESTORE signal pushes the shadow state back into the main FF.
Without retention, the block must re-initialize from scratch after every power-up, adding latency and requiring software re-programming of registers.
A voltage island is a physically distinct region of the chip that operates at a different supply voltage from surrounding blocks. By running low-activity blocks at a lower V_DD, dynamic power scales as V², giving dramatic savings — dropping from 1.0V to 0.8V reduces dynamic power by 36%.
Why Qualcomm uses voltage islands: A Snapdragon SoC has very different performance and power requirements across blocks. The modem baseband runs continuously but at moderate frequency. The application CPU cores spike to high performance on demand. The always-on sensor hub must run at <0.7V for weeks on battery. A single supply voltage optimized for the fastest block wastes enormous power in slower blocks.
Required boundary cells:
- Level shifters (LS): Signals crossing between domains at different voltages must be shifted to the receiving domain's logic levels. A signal from a 0.8V domain HIGH (0.8V) is not guaranteed to be a valid HIGH in a 1.1V domain without level shifting. Level shifters are inserted on every signal crossing.
- Isolation cells: If the lower-voltage island can be powered off completely, isolation cells (see previous question) are needed to clamp its outputs.
- Level-shifting isolation cells: Combined cells that both shift voltage and isolate — used at boundaries between always-on and power-gatable domains at different voltages.
Clock Tree Synthesis (CTS) is the physical design step that builds the clock distribution network — a buffered tree that delivers the clock signal from the clock source (PLL output or pad) to every flip-flop's clock pin across the entire chip.
Goals of CTS:
- Minimize clock skew: Every FF should see the clock edge at (nearly) the same time. Unbalanced trees create skew that consumes setup and hold timing margins.
- Meet insertion delay target: Total latency from clock source to FF clock pins must be within the budgeted range (typically set in SDC via
set_clock_latency). - Minimize clock power: The clock network toggles every cycle and can consume 30–40% of total chip dynamic power. The tool balances skew reduction against cell count and wire length.
- Respect no-touch (NDR) routing rules: Clock nets typically use special Non-Default Routing Rules (wider wires, more spacing, preferred upper metal layers) for reduced resistance and better EM reliability.
Flow position: CTS runs after placement (cell locations are fixed) but before detailed routing. After CTS, timing analysis uses real clock arrival times instead of ideal clock assumptions — hold violations often emerge here because real clock trees have skew that didn't exist in pre-CTS analysis.
IR drop is the voltage reduction along the power delivery network from the supply pins to the power pins of individual cells. The metal power grid has resistance (R), and the switching current (I) causes a voltage drop V = I × R. A cell operating at V_nominal − ΔV is slower than a cell at the full supply voltage.
Two types:
- Static IR drop: Average current × grid resistance. Determined by the long-term average switching activity. Used for power integrity sign-off of DC operating point.
- Dynamic (transient) IR drop: When a large number of cells switch simultaneously (e.g., a wide datapath all clocking at once), the instantaneous current surge exceeds the average. The power grid voltage transiently collapses by a larger amount, limited by the inductance and decoupling capacitance. This "voltage droop" is worse than static IR and is the primary concern at high frequencies.
Effect on timing: In a high-IR-drop region, cells are slower than characterized at nominal voltage. A path that passes STA at nominal conditions may violate setup timing in silicon due to IR-induced delay increase. Hold violations are less common (slower cells improve hold margin).
Fixes:
- Widen power stripes or add more power mesh layers
- Add decoupling capacitors (decaps) near high-switching density regions
- Spread high-activity cells during placement to avoid current hot spots
- Use power gating with controlled wake-up sequences to avoid simultaneous switching
- In STA: apply voltage derating in high-IR-drop regions for more accurate sign-off
During VLSI fabrication, metal layers are deposited and patterned one at a time using plasma etching. Plasma charges accumulate on exposed metal wires during etching. If a long metal wire is already connected to a transistor gate oxide but NOT yet connected to a diffusion region (which would discharge the charge safely), the accumulated charges can create a large voltage across the thin gate oxide — sufficient to cause permanent gate oxide damage: threshold voltage shifts, increased leakage, or immediate breakdown.
The antenna ratio = (metal area of the wire connected to the gate) / (gate oxide area). Process Design Kits (PDKs) specify maximum allowable antenna ratios (typically 400–1000 for metal, 200–600 for vias). Exceeding this ratio means the wire can accumulate enough charge to damage the oxide.
How it's detected: The router's DRC (Design Rule Check) engine computes the cumulative antenna ratio for every net using the partial routing built up layer by layer. If it exceeds the limit, an antenna violation is flagged.
Fixes:
- Metal jumper (layer hopping): Break the long wire by jumping to a higher metal layer and back. This "resets" the antenna accumulation because higher-layer routing is done later, after diffusion connections have been made. Most common fix.
- Antenna diode: Insert a reverse-biased diode near the gate, connected to the same metal wire. During plasma etching, the diode provides a discharge path to substrate, preventing charge buildup. Small area cost, always effective.
- Reduce net length: Re-route the net to use shorter wires on lower layers.
Code coverage measures how much of the RTL source code was exercised by the simulation:
- Line/statement coverage: Were all lines of RTL executed?
- Branch coverage: Were both sides of every
if/elseand everycasearm taken? - Toggle coverage: Did every signal toggle both 0→1 and 1→0?
- FSM coverage: Were all states visited and all transitions taken?
Code coverage is automatically collected by the simulator with no extra specification — easy to get, but tells you nothing about what scenarios were verified. You can hit 100% branch coverage while never testing the most critical protocol corner case.
Functional coverage is user-defined. The verification engineer specifies which scenarios, protocol states, and parameter combinations are important to verify — then measures whether simulations actually exercised them:
- Was an AXI4 burst of ARLEN=255 (256 beats) issued?
- Did a FIFO simultaneously receive a write and a read when exactly one slot was free?
- Did a CDC crossing happen with data changing every source cycle?
Which matters more? Both are necessary; neither alone is sufficient. Code coverage ensures no dead code was accidentally left un-exercised. Functional coverage ensures the right scenarios were tested. A mature sign-off process requires both to be above target (typically 95%+ code coverage, 100% defined functional coverpoints).
UVM (IEEE 1800.2) is a standardized SystemVerilog methodology for building reusable, scalable verification environments using an object-oriented framework. It replaces brittle, one-off directed testbenches.
Key UVM components:
- uvm_test: Top-level test class. Selects which scenario/sequence to run and configures the environment. Different tests reuse the same TB infrastructure.
- uvm_env: Container that instantiates and connects agents, scoreboards, and coverage collectors for one DUT.
- uvm_agent: Models one protocol interface (e.g., AXI4 master). Contains: Driver (applies stimulus to DUT pins), Monitor (observes DUT pins and creates transaction objects), Sequencer (arbitrates between sequences and feeds items to the driver).
- uvm_sequence / uvm_sequence_item: Defines the actual stimulus transactions. Sequences can be layered (a higher-level sequence calls lower-level sequences) and constrained-random.
- uvm_scoreboard: Compares DUT output (from monitor) against a reference model's expected output. Reports pass/fail.
- TLM ports (uvm_analysis_port): Standardized communication channels between components — no direct references between classes.
Vs. directed testbench: A directed testbench hand-codes every stimulus vector — it only tests what the engineer explicitly wrote. A UVM testbench with constrained-random stimulus explores the full stimulus space automatically within user-specified constraints, finding corner cases no human would write by hand.
ATPG (Automatic Test Pattern Generation) tools model physical manufacturing defects as logical faults and generate patterns to detect them. The main fault models are:
- Stuck-At Fault (SAF): A wire is permanently stuck at logic 0 (SA0) or 1 (SA1), regardless of what drives it. Models open circuits, resistive shorts to VDD/GND, and broken connections. The most widely used model. A stuck-at fault is detected by finding a test that excites the fault (drives the opposite value) and propagates the effect to a primary output or scan chain output. Industry target: 95–99% fault coverage.
- Transition Delay Fault (TDF): Tests whether a net can make a complete 0→1 or 1→0 transition within one clock cycle. Detects resistive defects that don't prevent correct logic levels but slow transitions — critical at high frequency where even a slightly slow net causes a setup violation. TDF requires two-pattern tests: launch the transition, then capture the response one cycle later.
- Path Delay Fault (PDF): Tests the end-to-end propagation delay of a specific signal path. More accurate timing characterization than TDF — detects accumulated small delays across many gates. Requires many patterns but provides the most complete timing sign-off.
- Bridging Fault: Models an unintended short between two adjacent nets. A short that combines two signals via wired-AND or wired-OR logic. Increasingly important at 7nm/5nm where metal pitch is very tight and coupling between adjacent wires is a common defect.
- Cell-Aware Fault: Tests for defects inside standard cells at the transistor level (open/short in the cell's internal netlist). Catches defects that SAF, modeled at the cell's logical interface, would miss.
CSI-2 (Camera Serial Interface 2) is a MIPI Alliance standard for connecting image sensors to application processors. It is the dominant camera interface in smartphones — virtually every mobile camera uses CSI-2.
Physical layer (D-PHY): CSI-2 uses MIPI D-PHY, a differential serial interface with two operating modes:
- High-Speed (HS) mode: Low-swing differential signaling (100–300 mV differential) at 80 Mbps to 4.5 Gbps per lane. Used for pixel data transmission.
- Low-Power (LP) mode: CMOS-level single-ended signaling. Used for control, synchronization, and lane management. Much lower speed.
Architecture: One clock lane + 1 to 4 data lanes. Each lane is a differential pair (DP/DN). For a quad-lane sensor at 4.5 Gbps/lane: total bandwidth = 4 × 4.5 = 18 Gbps — sufficient for 200 MP sensors at full frame rate.
Virtual channels: Up to 4 virtual channel IDs allow multiple cameras to share the same physical CSI-2 interface, multiplexed by the sensor or ISP.
C-PHY (newer alternative): Uses 3-wire "trios" with encoded 3-symbol signaling, achieving 5.7 Gsymbols/s per trio = ~2.28 bits per symbol → higher effective data rate without increasing frequency. Used in high-resolution cameras where D-PHY lane count limits bandwidth.
VLSI implementation: The CSI-2 receiver on a Snapdragon SoC consists of a D-PHY frontend (analog deserializer), a lane merger, a CSI-2 protocol decoder, and an interface to the Image Signal Processor (ISP). It must process pixels faster than they arrive to prevent FIFO overflow — typically 500 MHz+ operating frequency.
SIMD (Single Instruction, Multiple Data) means one instruction operates on many data elements in parallel. In a GPU's Streaming Multiprocessor (SM), this is implemented as SIMT — Single Instruction, Multiple Threads: 32 threads are grouped into a warp, and all 32 threads execute the same instruction simultaneously, each on its own private data.
The warp is the fundamental scheduling unit. A single SM can hold many warps in flight (e.g., 64 warps × 32 threads = 2048 threads per SM on H100). When one warp stalls on a memory access, the SM's warp scheduler instantly switches to another ready warp — this is how GPUs hide memory latency through latency hiding rather than large caches.
Warp divergence occurs when threads within a warp take different execution paths — for example, in if (threadIdx.x % 2 == 0), even-numbered threads go one way, odd threads another. The GPU must serialize execution:
- First execute all threads that took the "true" branch (odd threads masked off)
- Then execute all threads that took the "false" branch (even threads masked off)
A 50/50 divergent branch halves effective throughput. Nested divergence multiplies the penalty. Minimizing divergence is one of the most important GPU kernel optimization principles.
From fastest/smallest to slowest/largest (H100 as reference):
- Registers: Per-thread, 65,536 × 32-bit registers per SM. Latency: 0 cycles (operand bypass). Total bandwidth: ~20 TB/s across all SMs. Zero-latency access when available; register pressure causes spill to local memory.
- Shared Memory / L1 cache: Per-SM, programmer-managed scratchpad, 228 KB per SM (H100), configurable split with L1. Latency: ~4–5 cycles. Bandwidth: ~19 TB/s. Used for inter-thread communication within a thread block.
- L2 cache: Chip-wide, 50 MB in H100, shared by all 132 SMs. Latency: ~100–200 cycles. Bandwidth: ~12 TB/s. Caches both data and instructions.
- HBM (device memory): Off-chip stacked DRAM, 80 GB in H100 SXM5. Latency: ~400–800 cycles. Bandwidth: 3.35 TB/s. The primary memory bottleneck for memory-bound kernels.
- Peer GPU memory (NVLink): Another GPU's HBM accessed via NVLink 4.0. Bandwidth: 900 GB/s bidirectional, cache-coherent. Latency: ~1–2 µs.
- Host CPU memory: Via PCIe 5.0 x16. Bandwidth: 64 GB/s, Latency: ~5–10 µs. The slowest tier — minimize host↔device transfers.
GPU shared memory is physically organized into 32 banks, matching the warp width. Each bank can serve exactly one 32-bit access per clock cycle. If multiple threads in the same warp access different addresses within the same bank simultaneously, those accesses are serialized — reducing effective bandwidth proportionally.
Bank mapping rule: Address A maps to bank A % 32. So addresses 0, 128, 256… all map to bank 0; addresses 4, 132, 260… all map to bank 1 (for 4-byte elements).
Classic bank conflict example: A 32×32 matrix stored in shared memory, accessed column-first by a warp. Thread 0 reads M[0][0] (bank 0), Thread 1 reads M[1][0] (bank 32 % 32 = bank 0 also) → 32-way bank conflict → 32× slowdown.
Solutions:
- Padding: Declare the array as
float M[32][33]instead of[32][32]. The extra column shifts each row's bank alignment, breaking the conflict pattern. One of the most common GPU optimization tricks. - Access reordering: Restructure the algorithm so consecutive threads access consecutive 4-byte addresses (consecutive banks).
- Broadcast: If all threads access the same address in one bank, the memory system broadcasts it — no conflict. This is free.
HBM (High Bandwidth Memory) is a stacked DRAM architecture. Multiple DRAM dies are stacked vertically using Through-Silicon Vias (TSVs) and connected to the GPU die via a silicon interposer — a passive silicon layer with very fine-pitch connections. The stack sits physically adjacent to the GPU die on the interposer, connected by thousands of short, dense wires rather than long PCB traces.
Why this matters:
- Massive bus width: HBM3 provides a 1024-bit-wide bus per stack. An H100 with 5 stacks has a 5120-bit total memory bus. Compare to GDDR6X: 16-bit per chip × 24 chips on a high-end gaming card = 384-bit total. HBM is 13× wider.
- Bandwidth: H100 achieves 3.35 TB/s of HBM3 bandwidth. An RTX 4090 with GDDR6X achieves ~1 TB/s. The H100 is 3.35× faster despite being a smaller number of dies.
- Power efficiency: Short interconnects (millimeters vs centimeters on PCB) mean much lower capacitance per bit — lower switching energy. HBM typically consumes 50% less power per GB/s than GDDR.
- Package area: No GDDR chips around the periphery of a large PCB. HBM stacks sit compactly next to the die on the interposer.
Why GDDR wins for gaming GPUs: HBM requires a silicon interposer, which is expensive (2–3× PCB packaging cost). For gaming budgets, GDDR6X offers enough bandwidth at lower cost. HBM's cost is justified only where bandwidth-per-dollar is critical — AI training, HPC, data center.
PCIe Gen5 (PCIe 5.0) doubles the per-lane data rate from 16 GT/s (Gen4) to 32 GT/s. Using 128b/130b encoding (2-bit overhead vs 8b/10b's 20% overhead), the effective throughput per lane is ~31 Gbps. A ×16 link delivers 64 GB/s per direction (128 GB/s total bidirectional).
Signal integrity challenges at 32 GT/s:
- Insertion loss: FR4 PCB material absorbs high-frequency signal energy. At 16 GHz Nyquist (32 GT/s), losses across even short traces become severe. Requires low-loss dielectric materials (Megtron 6, Rogers) or very short trace lengths.
- Crosstalk: Adjacent differential pairs couple more aggressively at higher frequencies. Tighter guard spacing and reference planes are needed.
- Equalization demands: Both transmitter and receiver require aggressive equalization — CTLE (Continuous Time Linear Equalization) and DFE (Decision Feedback Equalization) at the receiver, and FIR (finite impulse response) transmitter pre-emphasis. Gen5 standardizes more complex equalization than Gen4.
- Connector and via design: Even PCB connectors and vias introduce resonant stubs at Gen5 frequencies. Via back-drilling (removing unused via stubs) becomes mandatory.
NVLink is NVIDIA's proprietary high-speed, cache-coherent interconnect for direct GPU-to-GPU communication. NVLink 4.0 (H100) provides 900 GB/s bidirectional bandwidth per GPU across 18 NVLink connections. Compare this to PCIe 5.0 x16: 64 GB/s per direction — NVLink is 14× higher bandwidth.
Why PCIe fails for GPU-to-GPU at scale:
- Topology: PCIe is a CPU-centric tree topology. GPU-to-GPU data must traverse CPU root complex, adding ~2–5 µs latency and halving effective bandwidth (each hop is bidirectional but the shared root complex is a bottleneck).
- Bandwidth: A GPT-4 scale model training run performs AllReduce across 8+ GPUs every few hundred milliseconds. Each AllReduce requires each GPU to send and receive its full gradient tensor (~10s of GB). PCIe's 64 GB/s would be saturated; NVLink's 900 GB/s handles it comfortably.
- No cache coherency in PCIe: PCIe lacks hardware cache coherency between GPUs. Memory copies must be explicit and managed by software/driver. NVLink supports hardware cache coherence — GPU A can read GPU B's memory with the same semantics as its own, dramatically simplifying programming models like NVLink's NVSHM.
NVSwitch: NVIDIA's NVSwitch chip (3.2 Tbps per switch in NVSwitch 3.0) connects 8 GPUs in an all-to-all topology inside a DGX H100 node. Every GPU can communicate with every other GPU at full NVLink bandwidth simultaneously — no head-of-line blocking.
A hazard is a condition that prevents the next instruction from executing in the following clock cycle, threatening to give incorrect results.
1. Structural Hazard: Two instructions need the same hardware resource simultaneously (e.g., both need to write to the register file in the same cycle, but there is only one write port). Resolution: add a second hardware unit (more write ports, more functional units), or stall one instruction.
2. Data Hazard: An instruction depends on the result of a preceding instruction that hasn't yet written its output to the register file.
- RAW (Read After Write): Most common. Instruction B reads a register before instruction A has written it. Resolution: forwarding/bypassing — route the ALU output directly back to the ALU input without waiting for register writeback. If forwarding can't bridge the gap (e.g., load-use hazard), insert a stall (pipeline bubble).
- WAW (Write After Write): Two instructions write the same register — later one might arrive first in out-of-order execution. Resolved via register renaming.
- WAR (Write After Read): Instruction writes a register before an earlier instruction reads it (only in out-of-order). Resolved via register renaming.
3. Control Hazard: A branch changes the program counter, but instructions after the branch have already been fetched. Resolution: branch prediction (speculate the branch outcome, flush on misprediction), delayed branching (execute one instruction after branch regardless — RISC classic), or speculative execution with rollback.
Slicing + generate: Break the 512-bit operation into N independent or semi-independent slices and instantiate them with a generate loop. For operations that are fully parallel (bitwise AND, OR, XOR), this gives linear throughput with the number of slices and synthesis has no trouble.
The carry problem for adders: A naive 512-bit binary adder using ripple carry has a critical path through all 512 stages — completely unacceptable. Use a hierarchical adder structure:
- Carry Lookahead Adder (CLA): Compute generate (G=A&B) and propagate (P=A^B) for each bit, then compute carries in parallel using a tree. Reduces depth from O(N) to O(log N).
- Carry-Select Adder: Pre-compute two copies of each upper slice (one assuming carry-in=0, one assuming carry-in=1), then mux between them when the actual carry arrives. Trades area for speed.
- For synthesis: write the adder behaviorally (
assign sum = a + b;) and let the synthesis tool select the adder topology from the technology library. Modern tools (DC, Genus) will choose appropriately based on timing constraints.
Additional concerns:
- Routing congestion: A 512-bit bus creates dense wiring in physical design. The datapath should be placed in a compact rectangular region with proper floorplanning.
- Retiming: If the datapath is pipelined, enable retiming (
set_optimize_registers truein DC) to let the tool rebalance registers across the datapath stages automatically. - Operand isolation: Wide datapaths toggle a lot of bits. Gate the inputs when the datapath result is not needed to reduce switching power.
At older nodes (28nm+), gate delay dominated wire delay. At 4nm/5nm, the relationship has reversed — wire RC delay now dominates on many paths, making interconnect the primary timing bottleneck.
Key new challenges:
- Wire resistance explosion: As metal pitch shrinks, wire cross-section shrinks, resistance per unit length increases dramatically. A long metal wire at 5nm can have 5–10× higher resistance than the same wire at 28nm. RC delay scales as R×C — both R and C worsen at smaller nodes.
- FinFET / GAAFET parasitic capacitance: Gate-to-drain capacitance (Miller capacitance) in FinFET and Gate-All-Around FET structures is proportionally larger, adding input/output loading that didn't exist at planar bulk CMOS.
- Variation (OCV/POCV): Random dopant fluctuation and gate length variation are larger fractions of the total delay at small nodes. Statistical timing (POCV/LVF) is required rather than flat derating — adding complexity to STA flow.
- Double/triple patterning constraints: Metal layers below M4 require multi-patterning. This forces additional spacing rules, reducing routing freedom and forcing the router to use longer detours → more wire → worse RC delay.
- IR drop impact on timing: Higher current density at same power with smaller metal → worse static and dynamic IR drop. Cells in IR-drop hot spots run slower and create timing violations not visible in STA at nominal VDD.
- Crosstalk aggressor coupling: Tighter metal pitch increases coupling capacitance between adjacent wires. A switching aggressor can add delay (or reduce delay) to a victim net, creating timing violations that only appear with specific data patterns.
Electromigration (EM) is the gradual displacement of metal atoms caused by momentum transfer from electron flow (high current density). At elevated current density, atoms migrate toward the cathode end, creating voids (open circuits) at the anode and hillocks (shorts) elsewhere — a wearout failure that grows over months or years in the field.
Why it matters for NVIDIA: High-TDP GPUs (300–700W) draw hundreds of amperes. The power distribution network carries enormous currents through metal layers. Clock nets and wide datapath buses also carry high current due to their switching activity.
The constraint: PDKs define maximum DC current density (J_DC, mA/µm²) and RMS current density (J_RMS for AC/switching) per metal layer and via. Routers check these limits during sign-off. Violations require fixes before tape-out.
Common EM weak points:
- Vias: Single-cut vias have much higher current density than the wire they connect. Via-EM is the most common EM failure mechanism. Fix: use minimum 2 vias wherever current exceeds the single-via limit (the router enforces this via "via doubling" rules).
- Clock buffers: Clock networks drive large loads at full VDD swing every cycle — highest RMS current in the design. Clock routing uses upper metal layers with wide wires by design.
- Power grid stripes: Size power stripes based on average current drawn by each domain, with margin.
Fixes: Widen the wire, add parallel wires, add redundant vias, move to upper metal layers (lower resistance, higher J_max), spread high-current logic across more stripes.
A 400W GPU at 0.85V core voltage draws approximately 470 amperes. Delivering this cleanly — without voltage droops that cause functional failures or reliability damage — is a major system and chip design challenge.
Voltage droop: When the GPU transitions from light to heavy workload (e.g., a kernel launch), the current demand steps up in nanoseconds. The PDN inductance (L) resists this fast current change: V_droop = L × dI/dt. A 10 nH package inductance with a 100 A/ns current ramp creates 1 V of instantaneous droop — catastrophic for a 0.85V rail.
PDN design hierarchy — capacitors at three timescales:
- Die-level decaps (on-chip, ~1–10 ns): Inserted as standard cells in unused routing areas and power domain boundaries. Handle the fastest transients. Limited by available die area. Capacitance: ~10–100 nF total.
- Package decaps (~10–100 ns): Capacitors embedded in the package substrate or placed as discrete SMDs on the package interposer. Handle intermediate transients. Capacitance: ~100 nF – 10 µF.
- Board bulk caps (>100 ns): Large ceramic and electrolytic capacitors on the PCB, close to the VRM (Voltage Regulator Module). Handle slow load steps. Capacitance: 100 µF – 1 mF+.
Target PDN impedance: Design the PDN so its impedance Z(f) = V_droop_budget / I_max is flat across all relevant frequencies (DC to ~1 GHz). Resonant peaks in impedance cause amplified droop at those frequencies and must be damped.
Chip-level: Top metal layers (M8–M14 in advanced nodes) are dedicated entirely to power distribution — wide horizontal and vertical stripes forming a mesh. The mesh resistance across the die must be kept below ~0.1 mΩ to limit static IR drop to under 50 mV.
Simulation-based CDC verification does not scale for a GPU-class design. There are too many clock phase combinations, too many data patterns, and too many cycles to simulate to ever encounter all CDC metastability scenarios. Missing a single unsafe synchronizer can cause silent data corruption that only appears under specific workloads in the field.
Formal CDC verification is the industry-standard solution. Tools like Synopsys SpyGlass CDC, Mentor Questa CDC, or Cadence JasperGold CDC analyze the entire netlist structurally and mathematically:
- Topology analysis: Identify every net that crosses a clock domain boundary (source FF in domain A, destination FF in domain B).
- Synchronizer recognition: Detect whether each crossing has a valid synchronizer structure (2-FF chain, handshake, async FIFO pointers). Compliant structures are "promoted" to safe.
- Multi-bit coherency: Flag any multi-bit bus where individual bits are synchronized independently (torn-word risk). Require Gray code, handshake, or FIFO.
- Reconvergence analysis: Detect where two signals from the same CDC crossing reconverge into the same logic — one through a synchronizer, one not. This is the most dangerous CDC pattern and is hard to find manually.
RTL coding guidelines enforce synchronizer templates that formal tools can recognize. Deviations from approved templates are flagged automatically. Waivers document paths that are safe for non-RTL reasons (e.g., a path only crosses during reset when data is irrelevant).
DVFS dynamically adjusts both supply voltage (V_DD) and clock frequency (f) based on workload demand and thermal conditions. Since dynamic power scales as P = αCV²f, simultaneously reducing V and f provides cubic power reduction in the ideal case — halving both V and f reduces power by 8×.
The relationship: Maximum safe frequency is approximately proportional to (V_DD − V_th)/V_DD. Lower voltage → lower max frequency. The curve of achievable (V, f) operating points is characterized during silicon bring-up and stored as a VF table (V-F curve).
Hardware implementation in a GPU:
- On-chip Performance Monitoring Unit (PMU): Monitors SM utilization, power draw, die temperature, and throttle signals every ~1 ms. Decides the target power state (P-state).
- Multiple P-states: The GPU ships with a defined set of (V, f) operating points — e.g., Base Clock (guaranteed), Boost Clock (sustained under thermal headroom), Max Boost (burst, thermally limited). NVIDIA's GPU Boost algorithm dynamically selects among these.
- Voltage regulator: An external PMIC or on-package VR changes V_DD on command. Voltage settling takes ~10–50 µs — during this window, frequency must stay within the safe range for the transitioning voltage.
- PLL reprogramming: The on-chip PLL changes frequency by updating its divider ratios. Must happen after voltage is stable when scaling up, and before voltage drops when scaling down — violating this order can cause timing failures.
Simulation checks specific test cases: apply a sequence of inputs, observe the outputs, compare against expected. Even with millions of random vectors (constrained-random UVM), simulation covers only a tiny fraction of the total state space. It proves the design is correct for those specific scenarios, not in general.
Formal verification (property checking / model checking) mathematically proves that a property holds for all possible input sequences and all reachable states — or produces a concrete counterexample. No test vectors needed. Tools like JasperGold (Cadence), VC Formal (Synopsys), and Questa Formal (Siemens) use SAT/BDD solvers to explore the full state space.
When formal is preferred:
- Safety-critical properties: "The FIFO never overflows", "The arbiter never grants two requestors simultaneously", "A valid handshake always completes within 16 cycles." These must be proven, not just tested.
- CDC checking: Formal tools exhaustively verify all synchronizer topologies across all clock domains (as described in the previous question).
- RTL-to-gate equivalence checking: After synthesis, prove the gate-level netlist is functionally identical to the RTL. Catches synthesis tool bugs. Industry standard for ASIC tape-out.
- Reset/initialization verification: Prove all state elements reach known values after reset sequences.
- Protocol compliance: Verify that an AXI4 interface implementation correctly follows the spec for all legal sequences of VALID/READY.
Limitation: State space explosion. Complex datapaths (e.g., floating-point units with 52-bit mantissas) have state spaces too large for formal to solve without heroic abstraction. Formal is most powerful on control paths; datapath is verified by simulation.
Hardware emulation compiles RTL onto a large array of FPGAs (or custom emulation processors like Cadence Palladium or Siemens Veloce). The emulated design runs at 1–10 MHz — 100–10,000× faster than RTL simulation. At 5 MHz emulation speed vs 500 Hz simulation speed, a test that takes 1 year in simulation completes in hours on an emulator.
Why emulation is essential for GPUs:
- Software bring-up: The GPU driver stack, firmware, CUDA runtime, and OS interactions involve billions of transactions and complex state machines. Running real software stacks on the emulator is the only way to validate pre-silicon software behavior. Simulation is simply too slow.
- Latent bugs: Some bugs only manifest after millions or billions of transactions under realistic workloads — memory coherency races, power state machine errors, firmware edge cases. Emulation can run full AI model training passes on pre-silicon hardware.
- System-level integration: Connect the emulated GPU RTL to real PCIe hardware and run real-world benchmarks (ResNet training, CUDA programs) to validate system integration months before silicon is available.
- DFT validation: Run production test patterns on the emulator to validate scan chain behavior, scan shift/capture at-speed before committing to tape-out.
Out-of-order (OOO) execution allows a CPU to execute instructions in a different order from program order when earlier instructions stall (e.g., on a cache miss), so that later independent instructions can proceed. This improves instruction-level parallelism (ILP) and hides memory latency.
The challenge: Although instructions execute out of order, they must commit (become architecturally visible — update registers and memory) in program order. If an instruction causes an exception or a branch misprediction is detected, all subsequent instructions must be discarded as if they never executed. This requires the ability to "undo" out-of-order execution.
The Reorder Buffer (ROB) solves this:
- A circular buffer that tracks all in-flight instructions in program order. Each entry stores the instruction, its destination register, its result (when computed), and its status (executing / done / excepted).
- Instructions are allocated (in order) at dispatch and deallocated (committed) only from the head — always in program order.
- An instruction commits when: it is at the ROB head AND its result is ready AND no exception occurred. Commitment writes the result to the architectural register file.
- On misprediction/exception: The ROB is flushed from the faulting instruction to the tail — all results are discarded, the architectural state is restored to the last committed state. Execution resumes from the correct PC.
Register renaming: Architecturally, there may be 32 registers. The OOO engine maps these to a larger physical register file (hundreds of registers). This eliminates WAW and WAR hazards by giving each instruction its own private physical register for its result — multiple "versions" of the same architectural register can be in-flight simultaneously.
APB (Advanced Peripheral Bus) is the simplest — non-pipelined, low-speed, no burst support. Designed for low-bandwidth peripherals: UARTs, timers, GPIOs, I2C controllers. Transfers take minimum two cycles. Minimal area overhead makes it ideal for the "leaf" nodes at the edge of the SoC.
AHB (Advanced High-performance Bus) is pipelined with burst support. Address is presented one cycle before data. Used for mid-tier peripherals: DMA controllers, USB, internal SRAM, Ethernet MACs. AHB-Lite (single master) is the standard form used in Cortex-M-based systems.
AXI4 (Advanced eXtensible Interface 4) is the highest-performance AMBA bus. Separate read/write address and data channels support out-of-order transactions and up to 256-beat bursts. Used for high-bandwidth masters: CPU cores, GPU, DRAM controllers, DMA engines. AXI4-Lite (no bursts) and AXI4-Stream (no addressing) are simplified variants.
- Rule of thumb: AXI for processors and high-BW masters; AHB for mid-BW SoC subsystems; APB for peripheral registers needing under 1 MB/s.
- Standard ARM bridges (AXI-to-AHB, AHB-to-APB) connect the protocol layers. A typical SoC has one AXI fabric at the center, with AHB and APB subtrees hanging off it.
AXI4 has five independent channels, each with its own VALID/READY handshake:
- AW (Write Address): Master presents AWADDR, AWID, AWLEN, AWSIZE, AWBURST. Handshake completes when both AWVALID and AWREADY are high.
- W (Write Data): Master sends data beats with WDATA, WSTRB, WLAST. Independent of AW — data can be issued before or after the address.
- B (Write Response): Slave returns BRESP and BID after accepting all write data. Master asserts BREADY.
- AR (Read Address): Master presents ARADDR, ARID, ARLEN. Slave asserts ARREADY.
- R (Read Data): Slave returns RDATA, RRESP, RLAST, RID. Master asserts RREADY.
Outstanding transactions: Because address and data channels are decoupled, a master can issue multiple addresses before receiving responses. The ARID/AWID tags identify which response belongs to which transaction. An interconnect can have N simultaneous in-flight transactions (N = ID space depth).
Ordering rule: Transactions with the same ID must complete in order. Transactions with different IDs can complete out of order, allowing the interconnect to service faster-responding slaves first without corrupting per-source ordering.
ACE (AXI Coherency Extensions) extends AXI4 with three additional channels and coherent transaction types to maintain hardware cache coherency between CPU clusters connected to a shared interconnect (CCI-400, CCI-500).
Three additional ACE channels:
- AC (Snoop Address): Interconnect sends to a cache master — "do you have address X cached, and is it dirty?"
- CR (Snoop Response): Cache master replies — found/not-found, clean/dirty, passed/invalidated.
- CD (Snoop Data): If the snooped cache has a dirty line, it sends data back on CD. The interconnect forwards it to the requester or writes it back to memory.
Coherent transaction types: ACE adds new transaction types (ReadShared, ReadUnique, CleanUnique, MakeUnique, WriteBack) specifying what cache state the requester needs — enabling the interconnect to orchestrate the correct snoop sequence.
AxDOMAIN field: Each transaction specifies its shareability domain (Non-shareable, Inner-shareable, Outer-shareable, System). The interconnect only snoops agents within the specified domain, avoiding unnecessary broadcast snoops.
CHI (Coherent Hub Interface), introduced in AMBA 5, is ARM's next-generation coherent protocol for high core-count SoCs (8–64+ cores). Used in CoreLink CMN-600/CMN-700 and Neoverse N1/N2 server-class CPUs.
Key differences from ACE:
- Packet-based vs. channel-based: ACE extends AXI4 channels on a shared-bus/crossbar topology. CHI uses a packet-based, link-layer protocol over a point-to-point mesh network with credit-based flow control — no shared bus bottleneck.
- Distributed Home Node (HN): Manages coherency per address range and holds a snoop filter (which nodes have each line). Snoops go only to nodes that actually have the line — no broadcast snooping.
- Peer-to-peer data: When a snoop hits a dirty line, the snooped node returns data directly to the requester — bypassing the HN for data, reducing latency compared to ACE's centralized data path.
- Node types: RN-F (fully coherent, CPU cluster), RN-I (I/O coherent, GPU/DMA), SN-F (slave, DRAM controller), HN-F (home node with snoop filter).
ACE scales well to 4–8 cores with a central CCI. CHI's mesh topology scales to 64+ cores — critical for ARM's Neoverse infrastructure CPUs targeting 64–96 core server chips.
RISC principles as embodied in ARM:
- Load/Store architecture: Only LOAD and STORE access memory. All arithmetic operates only on registers — no "add [memory], reg" like x86. Simplifies the pipeline: execution stage always operates on register values, never needing a memory read mid-instruction.
- Fixed instruction width: Classic ARM uses 32-bit fixed-width instructions (Thumb: 16-bit). Fixed width simplifies fetch/decode — the pipeline always knows where the next instruction starts.
- Large register file: 16 GPRs in AArch32, 31 in AArch64. More registers reduce memory spills — data stays in fast registers longer, improving compiler output.
- Simple addressing modes: Fewer, more regular modes than x86 — simpler decode logic, lower decode energy.
- Single-cycle execution (originally): Most instructions complete in one clock cycle, enabling simple in-order pipelines and accurate power modeling.
vs. CISC: CISC (x86) has complex multi-cycle instructions — smaller code size per program. Modern x86 CPUs decode CISC to micro-ops internally (recovering RISC pipeline efficiency). The difference is now mainly at the ISA level: ARM is simpler to verify, lower decode energy, historically better power efficiency — why ARM dominates mobile and is winning in servers.
TrustZone partitions the SoC into two worlds: Secure World (trusted OS, cryptographic keys, secure boot) and Normal World (Android/Linux, untrusted apps). The CPU operates in either world; transitions go through the Secure Monitor (EL3 in AArch64) via the SMC instruction.
The NS bit: Every AMBA transaction carries NS via ARPROT[1] / AWPROT[1]. Set by the CPU based on current security world:
- NS=0 (Secure): can access both Secure and Non-Secure resources.
- NS=1 (Normal): can only access Non-Secure resources.
Hardware enforcement:
- TZASC: Between interconnect and DRAM. Secure software programs it to mark address ranges as Secure or Non-Secure. Blocks NS=1 transactions from Secure DRAM — returns a bus error.
- TZPC: Controls per-peripheral access. Marks peripherals as Secure-only; Normal world accesses are rejected at the bus level.
- Caches: Lines carry the NS bit. NS=1 accesses cannot hit NS=0 cache lines — treated as a miss, preventing secure data leakage to Normal world.
NEON is ARM's fixed-width 128-bit SIMD extension, mandatory in AArch64. Processes 4×32-bit floats, 8×16-bit integers, etc. in parallel. Uses V0–V31 (128-bit vector registers, shared with the FP register file). Accelerates media codecs, image processing, signal processing, and early ML inference.
SVE (Scalable Vector Extension), introduced in ARMv8.2, solves NEON's fixed-width limitation. SVE vectors can be any multiple of 128 bits. The ISA is width-agnostic — the same binary runs on a 128-bit SVE core and a 512-bit SVE core without recompilation. The CPU exposes its vector length via a system register; loops query it at runtime to set strides correctly.
Hardware design implications:
- Datapath width: A 512-bit SVE unit requires 512-bit-wide ALUs and register file ports — 4× the area and power of a 128-bit NEON unit. Significant die area and power budget impact.
- Predicate registers: SVE adds 16 predicate registers (P0–P15), each (VL/8) bits wide — 1 bit per byte. At 512-bit SVE: 64-bit predicates × 16 = 128 bytes of predicate register file. Enables per-element masking without branch-based loop tails.
- SVE2 (ARMv9): Adds gather/scatter, complex arithmetic, feeds SME (Scalable Matrix Extension) for native matrix multiply — targeting AI workloads directly in the CPU pipeline.
Out-of-order CPUs and write-combining memory systems do not guarantee memory operations complete in program order. Memory barriers force ordering constraints required by concurrent software or hardware interaction.
DMB (Data Memory Barrier): Ensures all memory accesses before DMB are visible to the memory system before any accesses after it. Does not stall instruction execution — subsequent non-memory instructions proceed. Use case: write data to a shared buffer, then DMB, then set the "data ready" flag.
DSB (Data Synchronization Barrier): Stronger. Stalls instruction fetch until all outstanding memory transactions — including cache maintenance operations and TLB invalidates — complete. Required before modifying page tables, enabling caches, or switching address spaces. Always precedes an ISB when system registers change.
ISB (Instruction Synchronization Barrier): Flushes the instruction pipeline. Guarantees instructions fetched after ISB use updated processor state — new exception vectors, MMU configuration, system register values. Always follows a DSB when writing to system registers that affect fetch behavior (SCTLR_EL1, VBAR_EL1, TTBR0_EL1).
Hardware cost: The CPU's load/store unit drains outstanding memory transactions — write buffer empty, store queue drained, ACE/CHI coherency acknowledgments received. On a loaded memory system, DSB can stall the pipeline hundreds of cycles — use sparingly at actual synchronization points only.
A combinational mux on a clock signal produces glitches — partial pulses when select changes while the clock is active. Even a single glitch can corrupt flip-flop state. A glitch-free mux must switch between two clock sources without producing any pulse shorter than a full clock period.
Standard 4-FF implementation:
- Two parallel flip-flop paths — one clocked by CLK_A, one by CLK_B.
- The select signal is ANDed with a feedback term (inverted grant of the other source) before being clocked by each source. Creates a mutual-exclusion handshake: one source deasserts before the other asserts.
- Outputs drive a simple OR gate. Only one input can be HIGH at a time — no overlap, no glitch.
- Switch latency: 2–3 cycles of each clock domain for the handshake to complete.
In Cortex-M SoCs: DVFS requires switching between a PLL (high-frequency, active mode) and an RC oscillator (low-frequency, sleep mode) without glitches. During DFT scan, switching to the test clock must happen glitch-free — a partial pulse on the scan clock corrupts the shift chain state. ARM's Reference Methodology documents the exact RTL template and the STA false-path annotations for the switching window.
set_false_path -from clk_sel_ff or set_clock_groups -asynchronous to suppress STA violations on the select synchronization path — the bounded switch latency is correct by design.When a Cortex-M CPU takes an exception, hardware automatically stacks 8 registers before jumping to the handler — allowing handlers to be written as plain C functions with no special prologue/epilogue.
Automatically saved (pushed to stack):
- R0, R1, R2, R3: ARM AAPCS caller-saved argument/scratch registers. Exception handlers call sub-functions that may clobber these — saving them lets the interrupted code resume correctly.
- R12 (IP): Intra-procedure scratch register — also caller-saved, freely clobbered by function calls.
- LR (R14): Link register — the interrupted function's return address. Must be saved so the handler can make function calls without corrupting the interrupted context's return path.
- PC (R15): Address of the instruction that would have executed next — restored on exception return to resume interrupted code.
- xPSR: Program Status Register — contains APSR flags (N/Z/C/V), thumb state bit, and active exception number. Must be restored to return to the same execution state.
Why exactly these 8? The AAPCS defines R0–R3 and R12 as caller-saved. By saving exactly caller-saved registers + PC/LR/xPSR, hardware guarantees the handler can use R0–R3/R12 freely and make sub-function calls normally. R4–R11 are callee-saved — if the handler uses them, the C compiler saves/restores them automatically in the function prologue/epilogue.
By default, STA assumes every combinational path between flip-flops has one clock cycle. A multi-cycle path (MCP) exception tells the tool a specific path legitimately has more cycles:
set_multicycle_path 2 -setup -from FF_A -to FF_B— setup check uses 2 cycles. Data launched at cycle N must be valid by N+2.- A matching hold adjustment is almost always required:
set_multicycle_path 1 -holdon the same path — shifts the hold check back to prevent impossibly tight hold requirements.
When to apply:
- Paths that physically cannot meet single-cycle timing (deep combinational trees: large muxes, iterative dividers) in designs where increasing frequency isn't required.
- Datapath stages deliberately using 2 cycles — pipeline registers omitted for area savings when throughput allows.
- Interfaces that present valid data only every N cycles by protocol definition — the downstream FF only samples every N cycles by design.
Risks:
- Wrong application: An MCP on a path that runs every cycle silently hides a real timing violation — a serious tape-out risk.
- RTL/SDC divergence: Design intent must be consistent between RTL and SDC. Divergence is a common cause of post-silicon failures that pass timing sign-off.
ARM delivers hard IP (Cortex-A78, X2, Neoverse V1, etc.) as encrypted DEF/GDSII macros with characterized Liberty (.lib) timing models — the macro is a timing black box with accurate I/O pin arcs.
What the integrator receives:
- Liberty files for each PVT corner (SS/TT/FF × -40/25/125°C) — model all interface pin timing arcs.
- SDC constraints file —
create_clockfor macro-boundary clocks, MCP exceptions for AXI/APB interfaces not running every cycle, false paths for async reset trees and internal CDC crossings. - MCMM constraints — separate SDC for functional, scan, and MBIST operating modes.
Typical SoC-level exceptions needed:
- Async reset false path:
set_false_path -to [get_pins ...nreset...]— NRESET is asynchronous, no setup check applicable. - AXI interface CDC false paths: AXI between core clock and SoC interconnect clock is synchronized inside the macro. STA sees apparent CDC crossing paths — must be false-pathed at SoC level to avoid thousands of bogus violations.
- ACLKEN multi-cycle paths: If AXI clock enable throttles the interface to every N cycles, matching MCPs must be set on all AXI data paths.
- Scan mode false paths:
set_false_path -from scan_enablesuppresses violations during scan shift analysis.
create_clock. Without the ARM-provided SDC, you'll see thousands of false violations — real violations become invisible in the noise. Always source the ARM SDC before adding SoC-level constraints.Each L1 cache line is in one of four MESI states: Modified (dirty, owned exclusively), Exclusive (clean, owned exclusively), Shared (clean, may be in other caches), or Invalid. The CCI manages coherency across cores.
Dirty-line snoop scenario:
- Core 0 writes address 0x4000 → Core 0's L1 line is Modified (dirty). DRAM is stale.
- Core 1 reads 0x4000. Issues a ReadShared request to the CCI.
- CCI's snoop filter knows Core 0 has 0x4000 Modified. CCI sends a snoop (CleanShared) on AC channel to Core 0.
- Core 0 responds on CR: "found, dirty." Sends dirty data on CD channel back to CCI (or L2). Transitions its copy Modified → Shared.
- CCI forwards clean data to Core 1. Core 1's L1 → Shared. DRAM is now up-to-date.
- Both cores have clean Shared copies. A subsequent write by either core requires another coherency transaction (MakeUnique) before writing.
CDC angle: In multi-frequency ARM clusters, the AC/CR/CD snoop channels cross clock domain boundaries between the CCI clock and each core's private clock domain. ARM's ACE protocol uses VALID/READY handshaking on each channel to handle these crossings safely — snoop response timing is bounded by the protocol, not by a fixed cycle count.
big.LITTLE (2011, Cortex-A15/A7) pairs a high-performance "big" core (e.g., Cortex-A78) with a power-efficient "LITTLE" core (e.g., Cortex-A55) on the same die. Both implement the same ISA — software runs unchanged on either. The OS scheduler migrates tasks based on workload demand.
Power reduction mechanism:
- A background task on Cortex-A78 at 2.8 GHz might consume 1.5 W. The same task on Cortex-A55 at 1.0 GHz might consume 0.15 W — 10× less power — with only 2–3× lower performance. Excellent tradeoff for background work.
- Since P ∝ CV²f, the LITTLE core has lower V (lower supply voltage), lower f, and smaller C (simpler microarchitecture = fewer transistors switching). Savings multiply across all three terms.
- Cache coherency requirement: Migration requires the task to see the same memory state on the new core. ACE/CCI ensures L1 cache state is visible across clusters — no software cache flushes needed on migration.
Scheduler: Linux EAS (Energy Aware Scheduler) models energy cost of each task at each frequency on each core type and assigns to minimize energy while meeting performance deadlines.
DynamIQ (2017, starting with Cortex-A75/A55) places all CPU cores — big and LITTLE — in a single cluster sharing one L3 cache via the DSU (DynamIQ Shared Unit), rather than separate clusters with a CCI bridge.
Key improvements over big.LITTLE:
- Unified L3 cache: All cores share the same L3 directly. Coherency no longer crosses a CCI bridge — uses the simpler, lower-latency DSU fabric. Inter-core communication latency drops ~30–40%.
- Per-core power domains: In big.LITTLE, all cores in a cluster shared one voltage rail. DynamIQ allows each core its own power domain — LITTLE core at 0.6V/0.8 GHz simultaneously with a big core at 0.9V/2.8 GHz. True per-core DVFS.
- Flexible core mix: Up to 8 cores in any combination — 1×X2 + 3×A78 + 4×A55 in one cluster. How Qualcomm Snapdragon's Prime+Performance+Efficiency topology is built.
- NPU integration: DynamIQ clusters connect to ARM Ethos NPU via the DSU fabric — low-latency data sharing between CPU and NPU for ML inference workloads.
DSU hardware: ARM delivers the DSU as parameterizable RTL — configurable for core count, L3 size, power domain count, and PMU event counters. The DSU also hosts the GIC (interrupt controller) interface and the CoreSight external debug interface. SoC integrators configure it at synthesis time; the OS reads DSU registers at runtime to understand cluster topology.
CoreSight is ARM's standardized debug and trace architecture — hardware IP components providing non-invasive visibility into a running SoC. Built into every Cortex-A/M/R core, it is the foundation for all JTAG-based debugging of ARM chips.
Key CoreSight components:
- DAP (Debug Access Port): Entry point for an external debugger. Connects JTAG or SWD (Serial Wire Debug — ARM's 2-pin alternative to 5-pin JTAG) to the internal APB debug bus. Every ARM chip exposes at least one DAP.
- CTI (Cross Trigger Interface): Routes trigger signals between debug components. A breakpoint on Core 0 can simultaneously trigger trace capture on Core 1 — enabling synchronized multi-core debugging and cross-trigger halt.
- ETM (Embedded Trace Macrocell): Per-core instruction trace — records every executed instruction and its address. Streams to the trace bus for offline analysis. Used for profiling, coverage measurement, and reproducing timing-dependent bugs.
- STM (System Trace Macrocell): Software-generated trace via memory-mapped registers. Firmware writes messages to STM ports; they appear in the trace stream alongside ETM hardware traces — mixing printf-style output with hardware instruction flow.
- ETF / TPIU: ETF buffers trace on-chip for post-mortem capture after a crash. TPIU streams trace off-chip to a host analyzer via a parallel trace port.
SoC bring-up use: On first silicon, engineers connect an ARM DSTREAM probe to the JTAG/SWD port. CoreSight allows setting breakpoints in the ROM bootloader, reading/writing memory and registers, single-stepping through reset sequences, and verifying memory-map initialization — all before the silicon can boot. CoreSight is the first debug tool used after power-on, before even the first successful boot.
Blocking (=): Executes sequentially within the current time step — each line completes before the next begins, like C code. Used in combinational always_comb blocks.
Non-blocking (<=): Schedules the RHS evaluation immediately but defers the LHS update until the end of the current time step. All flops in the block sample their old values simultaneously — matching real flip-flop behavior where all Q outputs change after the clock edge.
Why never mix: Mixing creates order-dependent, simulation-synthesis mismatches. A shift register written with blocking in a clocked block simulates as a wire (each stage immediately sees the updated value of the previous), while synthesis infers the correct flip-flop chain. The simulation passes but silicon is wrong.
Golden rules: always_ff @(posedge clk) → always use <=. always_comb → always use =. Never mix in one block.
A round-robin arbiter uses a rotating priority mask. After granting requestor N, the mask disables requestors 0..N for the next cycle, giving priority to N+1..last. When no masked request exists, it falls back to unmasked priority (wrap-around).
module rr_arb #(parameter N=4)(
input logic clk, rst_n,
input logic [N-1:0] req,
output logic [N-1:0] gnt
);
logic [N-1:0] mask;
logic [N-1:0] masked_req;
assign masked_req = req & mask;
// Grant lowest active bit within mask; fallback to unmasked
assign gnt = masked_req ? (masked_req & (~masked_req + 1))
: (req & (~req + 1));
always_ff @(posedge clk or negedge rst_n)
if (!rst_n) mask <= '1;
else if (|gnt) mask <= ~((gnt << 1) - 1); // rotate past granted bit
endmodule
Fairness guarantee: Every active requestor is served within N consecutive grant cycles — no requestor can be starved as long as it holds its request. Under fully loaded conditions all N requestors share bandwidth equally.
- Incomplete sensitivity list: Missing signals in
always @(...)make simulation miss updates. Fix: usealways_comb— tool auto-completes the list. - Unintentional latch inference: An if/case without a complete else/default infers a latch in synthesis, but simulation sees the last assigned value. Fix: always assign every output in every branch; use
always_combwhich flags latches as errors. - Blocking in clocked blocks: A shift register with
=simulates as a wire, synthesizes as flops. Fix: always use<=inalways_ff. - Uninitialized registers: Verilog initializes regs to X; synthesis assumes 0 or 1. RTL simulation passes; gate-level simulation reveals X-propagation failures. Fix: always include a synchronous or asynchronous reset.
- Implicit truncation: Assigning a wide bus to a narrow reg silently drops MSBs in both sim and synth — but the intent is wrong. Fix: use explicit width casts and enable lint truncation warnings.
AMD RTL style: mandate always_comb/always_ff, run SpyGlass/Jasper lint gates as a pre-synthesis check, and require sign-off on zero lint errors before RTL freeze.
Setup Slack = Data Required Time − Data Arrival Time
Data Arrival Time (how late the data gets to the capture flop's D pin):
= Launch clock edge + Clock source latency + Launch clock network delay + Tco (launch flop clock-to-Q) + Combinational logic delay + Net delay to capture D
Data Required Time (latest the data can arrive and still be captured):
= Capture clock edge + Clock source latency + Capture clock network delay − Tsetup (capture flop setup time)
Clock skew = capture clock arrival − launch clock arrival. Positive skew (capture arrives later) helps setup; negative skew hurts it.
Positive slack → timing met. Negative slack → setup violation — the critical path. STA tools report the worst negative slack (WNS) and total negative slack (TNS) per corner. AMD Zen sign-off runs SS (slow-slow) corner at worst-case voltage/temperature for setup, and FF (fast-fast) for hold.
OCV (On-Chip Variation): At advanced nodes, identical cells at different physical locations experience different threshold voltage (Vt), temperature, and IR drop — making their delays vary. OCV models this by applying a flat derate factor to the entire launch path (slow it) and capture path (speed it), creating a pessimistic worst-case margin.
Problem with flat OCV: A short path with 2 gates gets the same derate as a 50-gate critical path. Statistically, variation averages out over many stages — a 50-gate path has far less total variation than a 2-gate path. Flat OCV is massively over-pessimistic for long paths, forcing unnecessary ECO iterations and area/power overhead.
POCV (Parametric OCV): Applies a statistical model where variation reduces with path length (central limit theorem — more stages → more averaging). Each cell gets a sensitivity coefficient derived from Monte Carlo silicon characterization. Long paths get less derate; short paths get more. POCV is accurate rather than merely pessimistic.
AOCV (Advanced OCV): A simpler intermediate — uses lookup tables indexed by path depth and physical distance. Less accurate than full POCV but simpler to apply.
At TSMC 7nm/5nm (Zen 3/4), POCV recovers 5–15% timing margin vs flat OCV — directly translating to lower voltage guardbands, enabling higher frequency or lower power at the same frequency. AMD uses Synopsys PrimeTime with POCV (CRPR + POCV derates) for Zen 4 sign-off.
A 2-FF synchronizer handles metastability for a single bit. When applied to a 4-bit bus independently, each bit settles to its synchronized value in a different cycle — the capture domain may sample 2 bits from the new value and 2 from the old, forming a spurious intermediate that never existed in the source domain.
Example: Bus transitions from 0b0111 (7) to 0b1000 (8). Bit[3] synchronizes one cycle early; bits[2:0] synchronize one cycle later. Destination briefly sees 0b1111 (15) — a value that never occurred in the source.
Correct approaches:
- Gray coding: If the bus is a counter, encode it as Gray code (only 1 bit changes per step). Pass through a 2-FF sync. Decode on receive side. This is the basis of async FIFO pointer crossing.
- Handshake: Assert a request signal, wait for acknowledge synchronized back. Only then is the data sampled. Safe for any data, but one transaction per round-trip latency.
- Async FIFO: For streaming data, use an async FIFO with Gray-coded read/write pointers. Standard solution for a continuous data stream crossing clock domains.
At 1 GHz → 200 MHz (5:1 ratio), the receive domain has 5 source cycles per receive cycle — ample metastability resolution time, but data integrity across the boundary still requires one of the above protocols.
Reconvergence hazard: A single source signal fans out to two registers (R1, R2) in the source domain. Both cross to the destination via separate 2-FF synchronizers. In the destination domain, logic combines the synchronized R1' and R2' signals.
Because R1 and R2 may have different routing delays, they can be captured in different source clock cycles. After synchronization, R1' and R2' may represent different "generations" of the originating signal — a combination that never simultaneously existed in the source domain.
Concrete example: Source generates a 2-bit state: valid=1, data=0xAB. Valid crosses through sync path A (2 cycles). Data crosses through sync path B (3 cycles). Destination sees valid=1 while data still holds the old value — it processes stale data with a valid flag, causing a silent data corruption.
Fix:
- Synchronize a single control signal (valid/enable) and use it to capture all associated data simultaneously in the destination domain using a load register.
- Move reconvergent logic entirely into the source domain before crossing.
- Use an async FIFO which safely bundles all associated bits into one crossing.
CDC tools (Questa CDC, Meridian, Jasper CDC) specifically flag reconvergence paths because 2-FF synchronizers alone cannot fix this class of bug.
Clock gating: Stops the clock to a register bank when its data won't change. An Integrated Clock Gating (ICG) cell ANDs the enable signal with CLK on the falling edge (latch-based to prevent glitches). Supply rail remains on — state is preserved. Wake-up is instantaneous (next clock edge). Eliminates dynamic power (P = C·V²·f·α — α drops to 0 for gated registers). Used for microsecond-to-millisecond idle periods.
RTL pattern automatically recognized by synthesis:
always_ff @(posedge clk)
if (enable) q <= d; // synthesis tool inserts ICG automatically
Power gating: Cuts the VDD supply rail entirely to a block using a PMOS header or NMOS footer switch cell. Eliminates both dynamic and leakage power. State is lost unless retention flip-flops (with separate always-on supply) are used. Wake-up requires power rail ramp + state restore — typically 1–100 µs penalty. Used for blocks idle for long durations.
When to use each: At 5nm, leakage can exceed dynamic power for idle blocks. For fine-grained, frequent idle (display pipeline idle for one frame): clock gate. For coarse-grained, infrequent idle (USB controller during airplane mode): power gate. AMD Zen uses both — clock gating per execution unit, power gating per core and per CCD.
- Operand isolation: Gate the inputs to a functional unit (multiplier, FP adder) when its output won't be consumed. Even with the clock running, propagating switching activity through a 64-bit multiplier dissipates significant power. A simple AND gate on inputs with an enable signal prevents this. Zen's FP units are operand-isolated when the scheduler has no ready FP micro-ops.
- Fine-grained per-register clock gating: Rather than gating an entire register file, gate individual rows or entries. Zen uses per-physical-register clock gating so that inactive rename table entries and retirement slots consume zero dynamic power — critical in a 320-entry ROB where most entries are idle each cycle.
- Bus-invert encoding: Wide internal buses (128-bit operand buses, 256-bit cache read data) toggle frequently. Bus-invert encoding checks if the Hamming distance between the current and previous bus value exceeds N/2 — if so, invert the bus and set an invert bit. The receiver reinverts. This halves the worst-case toggle count and reduces average bus switching by 20–35%. AMD applies this on the L1↔L2 data bus in Zen cores.
Standard scan chains connect all flip-flops end-to-end. Test time = (flops/chains) × patterns × 2 (shift in + shift out). For a 50M-flop AMD Zen CCD with 200 scan chains and 100K patterns — this is weeks of tester time at high cost per second.
Scan compression inserts a decompressor/compressor network between the external tester and the internal scan chains. A small number of external channels (e.g., 16) drive many more internal chains (e.g., 3200) through an LFSR-based decompressor. Responses are compressed through a MISR (Multiple Input Signature Register) before going to the tester.
Compression ratio = internal chains / external channels. Typical: 50:1 to 200:1. A 200:1 ratio reduces test time and tester data by ~200×. AMD uses Synopsys DFT Compiler with compression ratios of 100–150:1 across Zen designs.
Trade-offs:
- Fault coverage slightly reduces (compressed stimulus can't fully control every flop independently — called "care bits" limitation)
- X-sources (unknowns from uninitialized memories, analog blocks) corrupt the MISR — must be masked or controlled during DFT planning
- Area overhead: 2–3% for compressor/decompressor logic — worthwhile given the 100–200× test time savings
JTAG defines a 4-pin serial interface: TCK (clock), TMS (mode select), TDI (data in), TDO (data out). A 16-state TAP (Test Access Port) controller state machine manages all operations.
Mandatory registers:
- Instruction Register (IR): Selects the active data register. Typically 4–8 bits wide. Loaded via the Shift-IR state.
- Bypass Register: Single bit. When selected, TDI connects directly to TDO — allows a chip in a daisy-chain to be bypassed without shifting through its full scan chain.
- Boundary Scan Register (BSR): One cell per I/O pin. EXTEST mode drives/captures pin values for board-level continuity testing. INTEST mode tests internal logic.
- IDCODE Register: 32-bit device identification (JEDEC manufacturer ID, part number, version). Read automatically after TAP reset — the first thing a bring-up engineer checks.
AMD bring-up use: On first silicon of a new Zen or RDNA die, engineers connect a JTAG probe before the chip can boot. JTAG allows loading bootcode into internal SRAM, reading/writing CSRs (control/status registers), single-stepping through the reset sequence, and verifying memory-map initialization — all before the first successful boot. It is the primary debug entry point on day one of silicon validation.
Physical challenges:
- Bump pitch bandwidth limit: Micro-bumps (~40 µm pitch on EMIB, ~9 µm on TSMC SoIC) limit how many signal wires cross the die boundary. Monolithic on-chip metals can be <1 µm pitch — die-to-die bandwidth per mm of perimeter is severely constrained.
- Signal integrity: Package traces have higher resistance, capacitance, and inductance than on-chip metals. Multi-Gbps Infinity Fabric links require equalization (CTLE/DFE) and careful impedance matching.
- Latency: A cross-chiplet cache miss on Infinity Fabric adds ~50–100 ns versus ~1–5 ns for on-die L3 access. This affects cache architecture decisions (EPYC's large L3 per CCD reduces cross-chiplet traffic).
- Power per bit: Die-to-die PHYs consume ~1–2 pJ/bit versus ~0.05 pJ/bit for on-chip wires. Total fabric power is a key design constraint.
Protocol challenges:
- Cache coherence across die boundary: MOESI protocol messages (probes, responses, data) must traverse Infinity Fabric. The I/O die acts as the coherence home node — every cross-CCD cache miss touches it.
- Credit-based flow control: High, non-deterministic fabric latency requires credit-based flow control so the transmitter never overflows the receiver's buffers without per-cycle handshaking.
- CRC per flit: On-chip wires are effectively error-free. Die-to-die links need per-flit CRC for error detection, with link-level retry for corrections.
AMD addressed these with XGMI protocol layered over a UCIe-compatible PHY, and a directory-based coherence home in the I/O die for Zen 4 / EPYC Genoa.
PCIe is a packet-based, full-duplex serial bus. The Transaction Layer creates TLPs (Transaction Layer Packets) for all host-device communication. TLPs carry a 12 or 16-byte header plus optional data payload.
- Memory Write (MWr) — posted: The initiator sends the TLP (address + data) and immediately moves on. No completion is returned. Highest throughput — used for DMA streaming from GPU to host. "Posted" means the initiator trusts delivery without acknowledgment.
- Memory Read (MRd) — non-posted: Initiator sends a request TLP (address + length). The target responds with a Completion with Data (CplD) TLP. Latency-bound — initiator waits. Outstanding requests tracked by a tag field (up to 256 tags in PCIe 4.0). Used for register reads, configuration access.
- Completion (Cpl/CplD): Response to a non-posted request. Contains the data and a status field indicating success or error (CA/UR).
Credit system: Each receiver advertises its buffer capacity in credits (1 credit = 1 header or 4 DW of data). Three independent credit pools: Posted (P), Non-Posted (NP), Completion (Cpl). The transmitter tracks available credits and never sends a TLP that would overflow the receiver. Credits are returned as the receiver's buffers drain. This removes per-cycle handshaking — the link runs at full bandwidth as long as credits are available.
Advantages:
- Yield: A 5nm CCD is ~80 mm². A monolithic 16-core 5nm die would be ~500+ mm² with catastrophically low yield. Multiple small, high-yield CCDs combined via packaging achieve equivalent compute with far better economics.
- Process node mixing: Compute logic (CCDs) uses 5nm — fast, low leakage. I/O, memory PHY, and analog (IOD) use 6nm/12nm — cheaper and more reliable for analog. Monolithic designs must compromise.
- Reuse and modularity: The same CCD is used across Ryzen, EPYC, and Threadripper. Scale core count by adding CCDs. NRE cost is amortized across millions of units.
- Competitive time-to-market: CCD and IOD can be taped out in parallel and validated independently before integration.
Verification challenges:
- Interface complexity: Infinity Fabric D2D must be verified at block level (UVM) and as a full system. Latency modeling across the fabric must be cycle-accurate.
- Cross-die cache coherence: MOESI correctness across the die boundary, including simultaneous probe + local write corner cases.
- Power-on sequencing: CCD and IOD power up separately; link training and initialization order must be verified and fault-tolerant.
- Full-system simulation cost: Cycle-accurate RTL simulation of 8 CCDs + IOD is intractable — AMD uses FPGA emulation and architectural simulators for system-level validation.
MESI states: Modified (dirty, exclusive) → Exclusive (clean, exclusive) → Shared (clean, multiple copies) → Invalid.
MESI problem: When a line is in Modified state and another core requests a read, the owner must write back to DRAM (slow — full DRAM latency), then supply the data. Two memory transactions per shared read.
MOESI adds the Owned (O) state: A dirty line that is being shared. The Owner holds the authoritative (dirty) copy without writing back to DRAM. Other sharers hold Shared (S) copies. The Owner supplies data directly to new requestors cache-to-cache — L3 speed, not DRAM speed. The Owner is responsible for eventual write-back on eviction.
MOESI advantage: Eliminates the mandatory write-back on a shared read, replacing DRAM-latency write-back with a fast cache-to-cache transfer. In producer-consumer workloads (one core writes, several read), DRAM traffic drops dramatically.
AMD Zen implementation: MOESI operates within a CCD's L1/L2/L3 hierarchy — cores on the same CCD share a 32 MB L3 (Zen 4), so intra-CCD coherence is handled by the L3 snoop filter. Cross-CCD coherence over Infinity Fabric uses a directory-based extension of MOESI, with the I/O die's system management unit acting as the coherence home node. Cross-CCD MOESI adds ~50–100 ns to the miss penalty versus <5 ns for intra-CCD L3 hit.
Out-of-order execution allows a processor to execute instructions in data-ready order rather than program order, hiding latency. A multiply waiting on a cache miss is bypassed by later independent instructions — the pipeline stays busy.
Key OoO structures:
- Frontend: Fetch → Decode → Rename (map architectural registers to a larger physical register file, eliminating WAR/WAW hazards)
- Scheduler (Reservation Station): Holds micro-ops waiting for source operands. Dispatches to execution units the moment all sources are ready — this is where "out-of-order" happens.
- Execution units: ALU, FP, load/store — execute out of program order based on readiness.
- ROB (Reorder Buffer): Circular buffer holding all in-flight micro-ops in program order. Instructions enter at dispatch and retire in order from the ROB head.
Precise exceptions via ROB: When an instruction causes an exception (page fault, divide-by-zero), the processor must appear as if all prior instructions completed and no later instructions executed. The ROB makes this precise: at the exception point, the ROB head identifies the faulting instruction in program order. All ROB entries before it are committed (register file and memory updated in order). All entries after it are flushed (results discarded, physical registers freed). The machine state exactly matches the program-order point of the fault — the OS can handle it and restart cleanly.
Zen 4 ROB: 320 entries deep, supporting up to 320 micro-ops in flight simultaneously — a key contributor to high IPC on memory-latency-bound server workloads in EPYC Genoa.
Floorplanning establishes the physical die outline and positions macros (memories, IP blocks) before standard cell placement. Key considerations:
Aspect ratio: Determined by cell area estimate plus 15–25% whitespace for routing. A square die minimises wire length.
Macro placement: Memories and hard IPs are pushed to die edges/corners so the remaining core area is a contiguous rectangle for standard cells. Critical-path macros are placed close together. Timing-driven floorplanning uses estimated wire lengths from the netlist to guide placement.
Pin placement: I/O pins are aligned to the data-flow direction to avoid internal feedthroughs that congest routing.
Power planning: VDD/VSS stripes are planned during floorplanning — horizontal stripes on one metal layer, vertical on the next — running over and around macros.
Halos and blockages: A halo (keep-out margin) around each macro prevents standard cells from being placed in congested macro boundaries. Hard blockages prevent routing in sensitive regions.
Scan chain insertion replaces standard flip-flops with scan-equivalent cells that add a scan-data input (SI) and a scan-enable (SE) control pin.
Functional mode (SE=0): The scan MUX selects the normal data input D. The circuit operates identically to the original design.
Shift mode (SE=1): The scan MUX selects SI. All flip-flops in a chain form a serial shift register — test vectors are shifted in from scan-in (SCI) and responses shifted out from scan-out (SCO).
Insertion flow: (1) Identify scan-eligible flip-flops (excludes clock-domain boundary FFs, async cells). (2) Replace with scan cells from the tech library. (3) Stitch into balanced chains to equalise shift time. (4) Add dedicated scan-in/scan-out ports (or reuse functional I/Os). DFT compilers (Synopsys DFT Compiler, Cadence Modus) automate this. Chain balance is critical — a long chain increases test application time linearly.
A hold violation means data changes too quickly — it arrives at the capture flop's D input before the hold window closes after the capture clock edge.
Condition: Data_arrival_time < Capture_clock_arrival + t_hold
Root cause post-CTS: Before CTS, ideal clocks assume zero skew. After CTS, if the launch clock arrives later than the capture clock (positive skew toward launch), data launches later and arrives at capture earlier relative to the capture edge — tightening the hold margin.
Fixes:
- Insert delay buffers (DEL cells or BUF chains) on the data path between the launch and capture flops to slow data arrival.
- Use higher-drive-strength buffers that inherently have longer propagation delay.
- Unlike setup violations, hold violations cannot be fixed by slowing the clock — they require actual path delay increase.
ECO insertion of delay cells is the standard physical design fix.
IR drop is the voltage loss along power grid resistance (V = I × R), reducing supply voltage at standard cells and increasing their delay.
Static IR drop: Computed using average (DC) current draw. Identifies chronic hotspots where the power grid is permanently stressed during normal operation. Analysis is fast and runs early in the design flow. Fix: add wider or additional metal stripes, add vias between layers to reduce resistance.
Dynamic IR drop: Peak instantaneous current from simultaneous switching events (SSO — Simultaneous Switching Output). Much higher than static but short in duration. Requires transient simulation using VCD/SAIF activity files to generate current waveforms. Fix: add decoupling capacitors (decaps) near switching hotspots to supply local charge during peaks; redistribute clock domain switching to spread activity in time.
Signoff: Tools like Cadence Voltus and Synopsys RedHawk annotate per-cell voltages back into STA (voltage-aware timing) so IR-induced timing violations are caught at signoff.
ATPG (Automatic Test Pattern Generation) creates test vectors that expose manufacturing defects by driving a distinguishable response on an observable output.
Stuck-at fault model (SAF): Assumes a signal net is permanently stuck at logic 0 (SA0) or 1 (SA1). To detect SA1 on net N: (1) Activate — force net N to logic 0 by choosing input values. (2) Propagate — sensitise a path from N to a primary output so the wrong value is observable. Covers 95–98% of real defects at ≥90nm. Fast to generate and simulate.
Transition fault model (TF): Assumes a gate output transitions too slowly (slow-to-rise or slow-to-fall). Critical at sub-28nm where timing defects dominate over stuck-at. Tests are applied using launch-on-shift (LOS) or launch-on-capture (LOC) protocol — the pattern is launched and captured within a functional clock cycle, checking that the transition completes on time.
Path-delay faults: Test entire combinational paths end-to-end for timing violations. Most comprehensive but pattern count is large. Used in automotive-grade (AEC-Q100) and safety-critical (ISO 26262) parts.
CTS builds the clock distribution network from the clock source to all sequential elements (flip-flops, latches, macros).
Clock skew: The difference in arrival time between the earliest-arriving and latest-arriving clock at any two registers. High skew consumes timing margin on both setup (capture arrives too late) and hold (capture arrives too early). Post-CTS skew target is typically <50–100 ps.
Insertion delay: Total delay from the clock source to the clock pin of a leaf register. Large insertion delay adds to the cycle time overhead seen by the PLL. Minimised by using clock buffers with high drive strength and short wire lengths.
CTS goals:
- Minimise skew across all clock sinks of a domain.
- Minimise insertion delay to preserve timing budget.
- Maintain fast slew (transition time) at clock pins — slow slew increases dynamic power and setup uncertainty.
- Stay within power budget — clock network is typically 30–40% of chip dynamic power.
Tools (Cadence Innovus, Synopsys ICC2) use H-tree or mesh topologies with cells from a dedicated clock cell library.
BIST (Built-In Self-Test) embeds test circuitry on-chip so the device can test itself without an external tester.
LBIST (Logic BIST): Tests combinational and sequential logic. A PRPG/LFSR (pseudo-random pattern generator) produces stimulus internally. Responses are compressed by a MISR (Multiple-Input Signature Register) into a single signature, compared against a golden value. No external tester is needed after manufacturing — ideal for in-system test during field operation. Used in automotive (ISO 26262) and aerospace for periodic self-check. Limitation: lower fault coverage than ATPG because patterns are pseudo-random.
MBIST (Memory BIST): Tests embedded SRAMs, register files, and ROM arrays. Implements standard memory test algorithms — MARCH C−, MATS+, or Checkerboard — that write and read specific data patterns to detect stuck-at, transition, coupling, and data-retention faults. ATPG cannot model the internal fault models of array structures, making MBIST mandatory for high-density embedded memories.
In practice, most production chips include both: LBIST for logic and MBIST for each embedded memory instance, all controllable via JTAG.
OCV (On-Chip Variation) accounts for spatial variation in process, voltage, and temperature across a single die. Two transistors at opposite corners of the chip can differ in speed by 10–20% even from the same wafer.
Flat OCV modeling: Apply a derating factor to all cells on each path:
- Setup: launch clock path × 1.1 (late), capture clock path × 0.9 (early), data path × 1.1 (late).
- Hold: launch × 0.9 (early), capture × 1.1 (late).
This is pessimistic because spatially adjacent cells on the same path are correlated — they will not simultaneously see worst-case and best-case conditions.
AOCV (Advanced OCV): Uses depth/distance lookup tables. Short paths (few logic levels, cells close together) are highly correlated → lower derating. Long paths with many independent stages → higher derating.
POCV (Parametric OCV): Characterises each cell's delay as a statistical distribution (mean + σ). STA combines path delays using σ_total = √(Σσᵢ²) (independent stages) or σ_total = Σσᵢ (fully correlated). Recovers 10–20% of timing margin versus flat OCV while remaining statistically justified. Required at 7nm and below.
Crosstalk occurs when a switching net (aggressor) induces a voltage disturbance on an adjacent net (victim) via parasitic coupling capacitance (Cc) between parallel wires.
Crosstalk noise: Glitch on a static victim net. If the glitch crosses the switching threshold it can cause functional failure in combinational logic.
Crosstalk delay: On a transitioning victim, an aggressor switching in the same direction speeds the victim up (beneficial). An aggressor switching in the opposite direction slows the victim down — this is the dangerous case for setup timing (increases data path delay) and can also worsen hold (decreases data path delay if victim speeds up).
Mitigation:
- Shield critical nets by inserting VDD/VSS wires between aggressor and victim.
- Increase wire spacing (reduces Cc ∝ 1/spacing).
- Downsize the aggressor driver to reduce dV/dt.
- Insert buffers to break long victim nets into shorter segments.
- Change routing layer or direction — wires crossing at 90° have minimal coupling.
Tools: Cadence Voltus-Fi, Synopsys PrimeTime SI perform crosstalk analysis using extracted parasitics (SPEF).
Full-scan test data volume grows as O(N × P) — N scan cells × P patterns. A modern 28nm chip with 10M scan cells and 100K patterns generates terabytes of test data — impractical for ATE (Automatic Test Equipment) storage and tester time.
EDT (Embedded Deterministic Test / Synopsys): Achieves 50–100× compression.
Architecture:
- Decompressor: An LFSR-based network takes a narrow external scan channel (e.g., 32 ATE pins) and expands it into many wide internal scan chains (e.g., 3200 chains). Care bits (fully specified bits needed for a specific fault) are encoded into LFSR seeds; the rest are pseudo-random fill.
- Compactor: An XOR network (space compactor) compresses all internal chain outputs back to a narrow external scan-out channel.
Result: 100× fewer ATE channels and 100× less test application time while maintaining equivalent stuck-at coverage, because the LFSR seed encodes all deterministic care bits. The tool solves for the LFSR seed that simultaneously satisfies all care-bit constraints for a given pattern.
Clock gating inserts a gating cell between the clock source and a register's clock pin. When the gate enable is low, the clock is suppressed — the flip-flop holds state without toggling, eliminating its dynamic switching power.
Power savings: Dynamic power P = α × C × V² × f. Clock gating reduces α (activity factor) for idle registers to zero. Since the clock network consumes 30–40% of dynamic power, even partial gating yields large savings.
ICG cell (Integrated Clock Gate): A latch-based gate that samples the enable on the low phase of the clock and produces a glitch-free gated output. The latch prevents enable glitches from propagating to the clock pin (which would cause spurious register captures).
RTL inference:
always @(posedge clk)
if (en) q <= d; // synthesis tool infers ICG
Avoid assign clk_g = clk & en; — this creates a glitchy gated clock. Let the synthesis tool infer the ICG from the conditional register enable. Modern tools insert ICGs automatically when the clock_gating_style attribute is set.
Boundary scan (IEEE 1149.1) provides access to chip I/O pins without physical probing — essential for board-level testing of BGA packages where pins are inaccessible.
Each I/O pin has a Boundary Scan Cell (BSC): a MUX that passes functional data in normal mode or routes scan-shift data during test.
TAP (Test Access Port): A 4-pin interface — TCK (test clock), TMS (test mode select), TDI (data in), TDO (data out). The TAP controller is a 16-state FSM driven by TMS. Key states:
- Test-Logic-Reset: All test logic is inactive; chip operates normally.
- Shift-DR / Shift-IR: Data or instruction register is shifted serially via TDI→TDO.
- Capture-DR: The boundary scan register captures the live pin state.
- Update-DR: Latches the shifted data to drive pins or internal logic.
Key instructions: BYPASS (1-bit pass-through to shorten chain), SAMPLE/PRELOAD (capture I/O without disturbing function), EXTEST (drive/observe boundary cells for board-level test), INTEST (test internal logic).
JTAG is also the transport for ARM CoreSight debug, RISC-V debug, and IEEE 1687 (iJTAG) for embedded instrument access.
Electromigration is the gradual displacement of metal atoms in a conductor caused by momentum transfer from electrons to lattice ions at high current density. Over time, this creates voids (open circuits) and hillocks (short circuits) in metal interconnect.
Black's equation: MTTF = A × j−n × e(Ea/kT), where j is current density, Ea is activation energy (~0.9 eV for Cu), T is temperature, and n ≈ 2. EM lifetime degrades rapidly with increasing current density and temperature.
Violation trigger: A wire's RMS or peak current density exceeds the PDK-specified limit for its width and length. Power/ground stripes and clock nets are most susceptible due to high sustained current.
Fixes:
- Widen the metal wire (reduces current density for the same current).
- Add parallel metal stripes to distribute current.
- Add vias (stacked vias lower via EM risk which is often the bottleneck).
- Reduce switching activity on the violating signal net.
EM sign-off is performed by tools like Synopsys PrimeRail or Cadence Voltus using extracted parasitics and activity-based current profiles.
A multicycle path (MCP) constraint relaxes the setup check so data may propagate across N clock cycles rather than 1. Used when a combinational path is too slow to meet a single-cycle constraint and the design intentionally uses multiple cycles (qualified by a valid/enable signal in RTL).
Example: A 32-bit multiplier takes 3 ns in a 2 ns clock design:
set_multicycle_path -setup 2 \
-from [get_cells mult_reg_A] \
-to [get_cells mult_result_reg]
set_multicycle_path -hold 1 \
-from [get_cells mult_reg_A] \
-to [get_cells mult_result_reg]
What happens: -setup 2 moves the setup check from cycle 1 to cycle 2 (opens a full extra cycle of margin). Without the paired -hold 1, STA moves the hold check to cycle 1 (default), which becomes extremely tight. The -hold 1 keeps the hold check at 1 cycle before the new setup edge, preventing over-constraining hold. Always pair -setup N with -hold N−1 for correct MCP analysis.
A false path is a topologically valid timing path that can never be sensitised during actual circuit operation. STA would otherwise pessimistically flag it as a violation.
Common cases:
- Asynchronous reset MUX output to flop data pin — reset is asserted asynchronously; the combinational path from the reset MUX is never a functional data path.
- Paths between unrelated clock domains where a CDC synchronizer already handles the crossing.
- Scan data paths during functional mode.
- One-time configuration registers only written at power-on initialisation.
SDC: set_false_path -from [get_clocks clk_a] -to [get_clocks clk_b]
Key difference from MCP:
set_multicycle_path: relaxes the timing check by N cycles but still performs the check. The path is real; it just needs more time.set_false_path: removes the path from analysis entirely. Over-using it masks real timing problems — apply conservatively and always document the reason in the constraints file.
The power grid is the network of metal stripes (VDD and VSS) that distributes supply voltage from chip pads to every standard cell and macro with minimal resistive loss.
Analysis flow:
- Extract parasitic resistance of all power grid metals (SPEF or internal extraction).
- Assign current sources at each cell using activity factor × Imax from the liberty model (or VCD/SAIF for dynamic).
- Solve the resistive network using a matrix solver (SPICE-accurate or faster linear solver in RedHawk/Voltus).
- Compute voltage at every grid node; flag nodes where V < VDD − ΔV_budget (typically 5–10% of VDD).
Fixes:
- Add wider or additional metal stripes in congested current-density regions.
- Add vias between metal layers to reduce via resistance (often the bottleneck).
- Insert decoupling capacitors (decaps) near switching hotspots for dynamic IR.
- Add a power ring around macros or add additional power rails through the macro halo.
Post-fix, voltage annotations are fed back to PrimeTime for voltage-aware timing signoff, ensuring IR-induced timing degradation is correctly modelled.
Setup check: Data must arrive before the capture clock edge minus setup margin: T_launch + T_comb ≤ T_capture − t_setup
Identification: Use report_timing -max_paths 10 -path_type full in PrimeTime. Examine: data arrival time, required time, negative slack (WNS/TNS), and the logic levels on the critical path.
Optimisation techniques:
- Cell sizing: Replace HVT (slow, low-leakage) cells with SVT or LVT on the critical path — LVT is typically 20–30% faster but has higher leakage. Apply selectively only to failing cells to control leakage budget.
- Buffering: Insert buffers to break long wires with high RC delay. A single long wire is slower than two shorter wired segments with a buffer.
- Retiming: Move registers across combinational logic to rebalance stage depths — shift a register forward to absorb logic from the critical stage into a lighter adjacent stage.
- Placement: Reduce physical distance between critical-path cells (re-place closer together to shorten wire RC).
- Logic restructuring: Replace multi-level AND/OR trees with fewer levels (e.g., 3-level NAND tree instead of 4-level AND tree).
- Architecture: Pipeline — insert a register mid-path, accepting extra latency but doubling frequency.
UPF (Unified Power Format, IEEE 1801) specifies power intent separately from RTL, allowing power-aware verification and implementation without modifying the design source.
Key constructs:
create_power_domain PD_CPU -include_scope— defines a set of logic sharing a common supply.create_supply_net VDD_CPU -domain PD_CPU— declares the supply net for that domain.set_domain_supply_net PD_CPU -primary_power_net VDD_CPU— connects domain to supply.
Isolation: When a domain powers down, its outputs can float or be indeterminate. Isolation cells clamp outputs to a known safe value (0 or 1) before the domain is cut:
set_isolation ISO_CPU \
-domain PD_CPU \
-isolation_signal cpu_iso_en \
-isolation_sense high \
-clamp_value 0 \
-applies_to outputs
The isolation cell must be in an always-on domain, powered continuously.
Level shifters: Required when two domains operate at different voltages (e.g., 0.8 V core → 1.8 V I/O). set_level_shifter in UPF directs the tool to insert level-shifter cells at domain crossings. EDA tools (Synopsys Verdi, Cadence Voltus) use UPF to verify correct isolation, level-shifting, and retention implementation.
Hold check: Data_arrival ≥ Capture_clock_arrival + t_hold. A hold violation means data changes too quickly relative to the capture edge.
Why post-CTS: In pre-CTS STA, ideal clocks assume zero insertion delay and zero skew — the launch and capture clocks appear to arrive simultaneously. After CTS, real clock buffers introduce real delays. If the launch flop's clock arrives noticeably later than the capture flop's clock (positive skew toward launch), the launched data arrives later at the capture D-pin — reducing the margin above the hold requirement. Paths where launch clock delay > capture clock delay become hold-critical.
Fixes:
- Delay cell insertion (ECO): Add delay buffers (DEL/BUF chains) on the data path to slow data arrival. These ECO cells are placed in available whitespace and routed. This is the standard post-CTS hold fix.
- Clock skew adjustment: Re-balance the CTS buffer tree to equalise arrival times — but this risks disturbing setup closure elsewhere.
Hold violations must be fixed before tapeout — unlike setup failures, they cannot be addressed by slowing the operating clock frequency in production.
Moore FSM: Outputs depend only on the current state (not inputs). Outputs are registered — glitch-free. One extra clock cycle of latency since the state must transition before the output changes. Simpler to verify.
Mealy FSM: Outputs depend on both the current state and current inputs. Responds one cycle faster (output changes immediately when input changes, without waiting for a state transition). Risk of combinational glitches on outputs if inputs are noisy.
One-hot encoding: One flip-flop per state; exactly one bit is high at any time (e.g., 8 states → 8 FFs). Next-state logic reduces to simple OR/AND of predecessor state bits — fewer logic levels, faster transitions.
When to prefer one-hot:
- State machine is timing-critical and needs minimum propagation delay through next-state logic.
- Available flip-flops are plentiful (FPGA always; ASIC when state count is small).
- Glitch immunity is important — one-hot decoding avoids the decoding hazards of binary encoding during state transitions.
Binary encoding uses log₂(N) flip-flops — fewer FFs, but more complex next-state logic. Preferred for large state machines where FF count dominates area.
All three model on-chip process variation (PVT) but with increasing physical accuracy and decreasing pessimism:
OCV (flat): Applies a single worst-case derating multiplier to all cells on the launch/data path and a best-case multiplier on the capture/clock path. Simple but extremely conservative — it assumes every cell on a long path simultaneously hits its worst-case delay, which is physically impossible for correlated cells.
AOCV (Advanced OCV): Replaces the flat factor with a lookup table indexed by path depth (number of stages) and spatial distance between cells. Short paths (few stages, cells close together) are strongly correlated → smaller derate. Long paths with many independent stages → larger derate. Recovers 5–10% timing margin versus flat OCV while remaining deterministic.
POCV (Parametric OCV): Characterises each cell's delay as a mean μ plus a standard deviation σ from silicon measurements or Monte Carlo simulation. STA combines path delays statistically:
- Independent stages: σ_total = √(Σσᵢ²)
- Correlated (common source) stages: σ_total = Σσᵢ
The timing budget is set at μ ± k×σ (typically k = 3 for 99.87% yield). POCV recovers 10–20% more margin versus AOCV, using a physically justified statistical model. Required at 7 nm and below where OCV pessimism would make closure impossible.
Retention flip-flops (also called balloon or shadow-latch FFs) allow a power domain to be shut down while preserving register state, avoiding a full re-initialisation on power-up.
Structure: Two storage elements in one cell:
- Primary latch: Connected to the normal domain supply (VDD). Operates during functional mode. Powers down when the domain is switched off.
- Shadow latch: Connected to a separate always-on retention supply (VRET, typically lower voltage, e.g., 0.5 V). Retains data through domain power-down.
Save/restore sequence:
- Save: Assert SAVE signal while VDD is still active. Data propagates from primary latch → shadow latch.
- Power-down: VDD is removed. Primary latch loses state. Shadow latch retains data on VRET.
- Power-up: VDD is restored.
- Restore: Assert RESTORE signal. Data propagates from shadow latch → primary latch. Domain resumes normal operation.
Critical: SAVE must complete before VDD is cut. RESTORE must complete after VDD is stable. The PMU (Power Management Unit) enforces this sequencing. UPF: set_retention -retention_condition {save_en} -save_signal {save_net rising}.
Metastability occurs when a flip-flop's D input changes within its setup or hold window. The flip-flop enters an indeterminate analog state — neither a valid 0 nor 1 — and resolves to one after a random resolution time τ that follows an exponential distribution. If τ exceeds the available resolution window, the output propagates metastability to downstream logic, causing functional failures.
Probability of failure for one synchronizer:
P_fail = f_src × f_dst × T_met × e^(−Tr/τ)
f_src,f_dst— source and destination clock frequenciesT_met— metastability window width (~t_setup + t_hold of receiving FF, typically 10–50 ps at 28 nm)Tr— available resolution time (destination clock period minus timing margin)τ— flip-flop characterisation constant from silicon (technology-dependent)
MTBF = 1 / P_fail
Adding synchronizer stages: Each additional stage increases Tr by one destination clock period, reducing P_fail exponentially. At 1 GHz with 28 nm FFs, two stages typically yield MTBF > 10¹⁰ years. Three stages are used for safety-critical CDC crossings (ISO 26262 ASIL-D).
Clock uncertainty is the total uncertainty in when a clock edge arrives at any flip-flop. It is added as a margin to both setup and hold checks, consuming timing budget. It decomposes into:
1. Jitter: Cycle-to-cycle variation in the PLL/oscillator output period. Types:
- Period jitter: Variation in a single cycle's period from the ideal. Adds directly to setup uncertainty.
- Phase jitter (RMS): Long-term phase deviation, specified in ps RMS in the PLL datasheet.
2. Clock skew: Systematic difference in clock arrival time between two registers due to unequal path lengths in the clock distribution tree. Ideal CTS targets <50 ps post-route.
3. Crosstalk on clock nets: Aggressor switching events perturb clock edge timing — adds a small random component.
SDC modeling:
set_clock_uncertainty -setup 0.15 [get_clocks clk]
set_clock_uncertainty -hold 0.05 [get_clocks clk]
Setup uncertainty is larger (typically 100–200 ps) because it must account for worst-case PLL jitter across both launch and capture cycles. Hold uncertainty is smaller since jitter on adjacent cycles is correlated. POCV-based flows partially absorb jitter into the statistical model.
Clock gating reduces dynamic power by stopping the clock to idle registers (P = α × C × V² × f — reducing α to 0 for gated sections).
Fine-grain clock gating: ICG cells are placed close to individual registers or small groups (8–16 bits). Each idle register (or small cluster) is gated independently. Maximum power savings because the gate exactly tracks which registers are idle at any given cycle. High area overhead — one ICG per register or small group. Synthesis tools infer fine-grain CG from RTL conditional enables: if (en) q <= d.
Coarse-grain clock gating: ICG cells are placed higher in the clock tree, gating an entire functional block (e.g., an entire ALU, cache bank, or peripheral unit). The entire block must be idle to benefit — less granular. Minimal area overhead (one ICG per block). Implemented as a block-level enable signal at RTL module boundaries.
Trade-off:
- Fine-grain: more ICG area, better power granularity, synthesised automatically.
- Coarse-grain: less area, simpler power control, requires architectural power management.
Best practice: use coarse-grain gating at the power-domain level (via UPF) and supplement with fine-grain gating inferred by synthesis within each always-on block.
Why async assert: Reset must take effect immediately regardless of the clock state — at power-on the clock may not yet be running. Async assert guarantees every flip-flop reaches a known state even before the clock stabilises.
Why sync deassert: If reset deasserts asynchronously, different flip-flops may exit reset on different clock edges (due to reset net skew), causing part of the design to start operating while other parts are still held in reset — creating illegal intermediate states. Synchronous deassert ensures all registers exit reset together on the same rising clock edge.
Implementation: The external reset input passes through a 2-FF synchroniser (in the destination clock domain) before distributing to all registers:
// 2-FF reset synchroniser
always @(posedge clk or negedge rst_n_ext)
if (!rst_n_ext) {sync1, rst_n} <= 2'b00; // async assert
else {sync1, rst_n} <= {1'b1, sync1}; // sync deassert
// Design flop using synchronised rst_n
always @(posedge clk or negedge rst_n)
if (!rst_n) q <= 0;
else q <= d;
PCIe hot-reset, power-on reset (POR), and domain resets all follow this pattern. The synchroniser itself uses async assert to ensure the reset state is captured even without a running clock.
An ECO (Engineering Change Order) fixes timing violations at the end of the physical design cycle without re-running full synthesis or place-and-route — preserving the existing floorplan, routing, and overall timing closure.
Setup ECO flow:
- Identify failing paths from PrimeTime signoff (
report_timing -slack_lesser_than 0). - Select fix: upsize critical-path cells to higher drive strength; replace HVT with SVT/LVT; remove unnecessary buffer stages on the data path; downsize high-fanout buffers to reduce load on critical nets.
- ECO insertion in PD tool (Innovus
eco_place, ICC2place_eco_cells): place new/resized cells in available filler cell gaps. eco_routere-routes only the affected nets; existing routing is preserved.- Re-run signoff STA to verify closure.
Constraints: Filler/decap cells must be removed to make room for new ECO cells. Large ECOs risk congestion if many cells are moved. ECO changes must not violate DRC/LVS.
Metal ECO (post-tapeout): Only changes metal layers — no base-layer changes. Faster and cheaper for a re-spin. Used to fix bugs or timing paths discovered after mask generation using spare cells (ECO cells pre-placed in the original design).
Cell delay depends on supply voltage — when VDD drops due to IR drop, the transistor drive current decreases (I_ds ∝ (V_gs − V_t)^n), increasing propagation delay. Approximately: a 5% VDD reduction causes a 5–10% increase in cell delay, depending on the process and cell type.
Voltage-aware STA (signoff flow):
- Run power grid analysis (Cadence Voltus or Synopsys RedHawk) to compute per-cell supply voltage under realistic activity (from VCD/SAIF).
- Export per-cell voltage as a SPEF-format annotation or a voltage map file.
- Import into PrimeTime:
read_parasitics -format SPEF voltages.spefor useset_voltageper cell. - STA re-queries the liberty model at the annotated voltage for each cell, computing voltage-derated delay.
- Paths through high-IR-drop regions show additional setup slack erosion.
Fixing IR-induced timing failures:
- Widen power stripes in the hotspot region to reduce resistance.
- Add power vias to improve connectivity between layers.
- Insert decoupling capacitors near the switching activity hotspot.
- As a last resort, downsize the switching logic to reduce current draw.
IR-aware timing is mandatory for advanced nodes where VDD is low (0.7–0.8 V) and 50 mV of drop is a significant fraction of the supply.
When a power domain is shut down, its output signals can float to an intermediate voltage or go to an unknown state (X). If these outputs drive logic in an always-on domain, they can cause X-propagation — corrupting state — or create a short-circuit DC current path through CMOS gates (one transistor on, one off but with floating input).
Isolation cells clamp the domain outputs to a known logic value before the domain powers down:
- AND-type isolator: output = data AND ISO_ENABLE. When ISO_ENABLE = 0 (isolate), output is forced to 0.
- OR-type isolator: output = data OR ISO_ENABLE. When ISO_ENABLE = 1 (isolate), output is forced to 1.
ISO_ENABLE is driven by the always-on Power Management Unit (PMU). It must be asserted (domain isolated) before the domain supply is cut, and de-asserted after the domain supply is restored and stable.
Placement: Isolation cells must be placed in the always-on power domain, not in the domain being shut down — otherwise they lose power along with the domain they are supposed to isolate.
UPF: set_isolation ISO_PD -domain PD_CPU -applies_to outputs -clamp_value 0 -isolation_signal iso_ctrl -isolation_sense high
Pipelining divides a combinational logic block into N stages separated by registers, so each stage can start on a new input every clock cycle while the previous input is still in flight.
Throughput vs latency: A non-pipelined block with T_comb delay produces one result per T_comb. An N-stage pipeline produces one result per T_comb/N (ideally), with latency of N cycles.
Optimal stage count: T_stage = T_comb / N must satisfy T_stage ≥ t_setup + t_flop + t_skew for the target clock period T_clk. Ideal N = floor(T_comb / (T_clk − t_overhead)). Adding stages beyond this yields diminishing returns — register overhead (t_setup, t_hold, t_skew) begins to dominate.
RTL considerations:
- Retiming: move register boundaries to equalise stage delays — synthesis tools do this automatically with the
retimeattribute. - Pipeline hazards: data hazards (result not ready for dependent instruction) → stall logic or data forwarding. Control hazards (branches) → pipeline flush or prediction.
- Valid/ready signals: propagate through each stage alongside data so partial-valid inputs are handled correctly.
For arithmetic (multiplier, FIR filter), 3–8 stages is typical at 1–2 GHz in 16 nm. Apple's custom cores use deep pipelines (12–15 stages) to achieve 3+ GHz operation.
A dual-clock asynchronous FIFO transfers a multi-bit data stream between two unrelated clock domains without metastability, using separate write (wclk) and read (rclk) pointers.
Gray-code pointers: Write and read pointers are binary counters converted to Gray code before synchronisation. Gray code ensures only one bit changes per increment — critical because if multiple bits changed simultaneously, a 2-FF synchroniser capturing mid-transition could produce an illegal intermediate pointer value, falsely triggering full/empty detection.
Synchronisation: The write-domain Gray pointer is synchronised into the read domain (to detect full), and the read-domain Gray pointer is synchronised into the write domain (to detect empty), each through a 2-FF synchroniser.
FIFO depth calculation:
Minimum depth must absorb the maximum write burst while accounting for the synchronisation latency before the read side can start draining:
depth ≥ burst_size + sync_latency_in_write_cycles
Where sync_latency_in_write_cycles = ceil(2 / f_read × f_write) + pipeline registers.
Example: 1 GHz write, 500 MHz read, 16-word burst, 2-cycle sync latency on read side = 2 / 0.5 ns × 1 ns = 4 write cycles latency. Depth ≥ 16 + 4 = 20 → round up to 32 (power of 2).
Formal verification (JasperGold CDC, Questa CDC) confirms Gray-code conversion, pointer synchronisation, and absence of overflow/underflow.
The antenna effect (plasma-induced gate oxide damage) occurs during the metal etch steps of fabrication. The plasma used for reactive-ion etching charges the metal being patterned. A long floating metal wire accumulates charge proportional to its area. If that wire is directly connected to a gate oxide, the voltage can build up and tunnel through the thin oxide — causing permanent degradation (increased leakage, Vt shift) or hard breakdown (short circuit).
Antenna ratio: Metal area of the wire / connected gate oxide area. PDKs specify a maximum ratio per metal layer (typically 200–400 for metal-1, larger for higher layers etched later).
Fixes during physical design:
- Antenna diode: Place a reverse-biased diode (P+/N-well) at the gate input. During fabrication the diode conducts and bleeds accumulated charge to the substrate safely. Most common fix — inserted automatically by the router or added as an ECO.
- Wire jumping (layer hopping): Break the long metal segment by routing up to a higher metal layer for a short segment and routing back down. Higher layers are etched later, so charge has less time to accumulate on the gate before it is connected to a driver. The jump resets the antenna ratio.
- Re-route: Use shorter wire segments that stay below the antenna ratio limit.
DRC tools flag antenna violations using the PDK antenna rules; PD tools like Innovus can auto-insert diodes post-route.
A 1T1C DRAM cell uses one access transistor and one capacitor. The capacitor stores charge representing a bit (charged = 1, discharged = 0). Because the capacitor leaks charge over time (retention ~64 ms typical), all rows must be refreshed periodically using a RAS-only cycle (RAS-only Refresh or Auto-Refresh). During refresh, the sense amplifier reads and rewrites each row before the charge decays below the sensing threshold.
Refresh overhead impacts bandwidth — at 64ms/8192 rows, one row is refreshed roughly every 7.8 µs. LPDDR5 adds per-bank refresh (PBR) to reduce latency penalty.
A cross-coupled latch (two inverters back to back) is pre-charged to VDD/2, then the bit lines are slightly separated by the cell current. The sense amp fires, amplifying the small differential into full swing. Speed depends on the gain-bandwidth and the initial ΔV developed.
Minimum Vmin is constrained by the 6T cell's read stability (read static noise margin, RSNM) and the sense amp's offset. At low VDD, transistor threshold variation (σVt) can exceed the developed ΔV, causing read failures. Assist techniques (negative word-line, bit-line bias) extend Vmin.
HBM stacks multiple DRAM dies vertically connected by Through-Silicon Vias (TSVs). A logic base die interfaces to the host via a 2.5D silicon interposer. Each HBM2E stack has up to 8 DRAM dies, 1024-bit-wide interface per stack, achieving ~460 GB/s per stack vs. GDDR6's 64-bit × 16 Gbps = 64 GB/s.
The wide parallel bus (1024 b) at lower per-pin frequency (2 Gbps) dramatically cuts power per bit transferred. TSV pitch is ~55 µm; the interposer carries thousands of TSV connections between GPU/AI die and HBM stack.
DVFS (Dynamic Voltage and Frequency Scaling) works by selecting an operating point (OPP) from a pre-characterized table of {frequency, voltage} pairs. The PMIC adjusts VDD, then the PLL is reprogrammed to the new frequency. The CPUFreq governor (schedutil, interactive) monitors CPU load and selects the OPP accordingly.
In Exynos: big.LITTLE cores have separate voltage domains. The Mali GPU has its own DVFS loop. A Power Management Unit (PMU) coordinates domain isolation before power gating idle cores. Transition latency (voltage ramp + PLL lock) is typically 50–200 µs.
LPDDR5 improvements: (1) Speed: 6400 Mbps vs LPDDR4X's 4266 Mbps; (2) Per-bank refresh (PBR) reduces refresh stall latency; (3) Write X (WRX) allows partial write masking; (4) Decision Feedback Equalization (DFE) for signal integrity at high speed; (5) Lower VDDQ (0.5V) reduces I/O power; (6) Link ECC for in-line error correction.
Power savings: ~20% lower power per transfer at equivalent bandwidth due to lower VDDQ and better termination control.
Each NAND flash cell stores charge in a floating gate or charge-trap layer. The threshold voltage (Vt) is programmed to one of 2^n levels representing n bits. MLC=2 bits (4 levels), TLC=3 bits (8 levels), QLC=4 bits (16 levels). Vt distributions must not overlap — tighter margins at higher bit density.
Reliability trade-off: More bits per cell → narrower Vt windows → more program/erase (P/E) cycle wear → lower endurance. SLC endures 100K cycles; TLC ~3000 cycles; QLC ~1000 cycles. ECC strength (LDPC) must scale up to correct wider error rates, adding latency.
Planar MOSFET: gate controls channel from one side only → short-channel effects (SCE) dominate below ~28nm (drain-induced barrier lowering, punchthrough, subthreshold leakage). FinFET: gate wraps three sides of a thin silicon fin → superior electrostatic control, effectively gates from three sides → dramatically reduced SCE, lower Vt variation, steeper subthreshold slope (~65 mV/dec vs. ~80+ for planar).
Leakage: FinFET reduces Ioff by 10–100× at same Ion vs. planar at same node. Fin height/width is quantized, so Vt tuning is limited to gate metal work function and fin width rather than channel doping.
Transition fault model: tests whether a node can change from 0→1 or 1→0 within one clock cycle. It targets delay defects where a path is slow but functional at low speed. Two patterns are applied: initialization + launch pattern on the same clock edge.
Scan power: during scan shift, half the flip-flops toggle every clock (high switching activity → IR drop spikes, potential timing violations, overheating). Techniques: scan segmentation (multiple shift enables), X-filling (reduce transitions), enhanced scan (OCC-based hold), low-power scan cells. Power faults are not a separate fault model — they are a concern about structural test conditions damaging the chip or causing false failures due to excessive IR drop.
Electromigration (EM): atomic migration of metal ions driven by electron wind in current-carrying wires. Results in void or hillock formation, eventually causing open or short circuits. Failure follows Black's equation: MTTF = A · J^−n · e^(Ea/kT), where J is current density and Ea is activation energy.
Sign-off flow: (1) Extract average and peak current from switching activity (ECSM/CCS models + dynamic simulation); (2) Compare per-segment J vs. PDK EM limits; (3) Fix violations by widening wires (lower J), adding parallel straps, or reducing frequency. Tools: Voltus (Cadence), RedHawk (Ansys). Via EM and power rail EM are checked separately.
2.5D: dies are placed side-by-side on a silicon interposer (or organic RDL). Interconnects are dense (interposer bump pitch ~10–55 µm) but dies are not stacked. HBM + GPU on interposer is classic 2.5D. Good thermal dissipation since both dies are cooled from top. Bandwidth: limited by interposer wire density.
3D: dies are stacked vertically via TSVs or hybrid bonding (Cu-Cu, pitch <1 µm). Extremely short interconnects → high bandwidth, low latency, low power per bit. Thermal challenge: top die blocks heat from bottom die. Logic-on-DRAM (Samsung X-Cube) or SRAM-on-Logic (Apple M-series) uses 3D.
Shared bus: all masters time-share one bus. Simple arbitration, low area, but bandwidth is O(1) — adding masters degrades per-master bandwidth linearly. Bus stalls when any master holds the bus. Works for ≤4 masters.
NoC: packet-switched mesh/ring/crossbar of routers. Bandwidth scales with topology — multiple transactions in flight simultaneously. Latency is higher (router hops) but throughput is O(N) for properly designed topology. MediaTek Dimensity SoCs use a ring-bus NoC interconnecting Cortex-A cores, GPU, APU (AI), modem, camera ISP, and display subsystem.
Flow: (1) OS kernel (schedutil governor) measures CPU utilization every scheduler tick; (2) Selects target OPP from cpufreq table; (3) Issues voltage change request to PMIC via I²C/SPMI — voltage ramps up; (4) After voltage settled (PMIC PGOOD signal or fixed delay), PLL reprogrammed to new frequency; (5) CPUFreq driver updates clk_rate. For frequency decrease: lower frequency first, then voltage.
Thermal governor (step_wise) can further cap OPP if Tj exceeds threshold. MediaTek's SVS (Smart Voltage Scaling) characterizes each chip at test to find minimum voltage per OPP, reducing average Vdd by 30–50 mV.
A central memory controller (EMC) arbitrates DRAM access. Each subsystem (CPU, GPU, APU, camera, display) has a QoS priority and bandwidth limiter in the NoC. The EMC uses weighted round-robin (WRR) scheduling with urgency preemption for latency-sensitive paths (display must refill frame buffer before vsync).
LPDDR5 at 6400 Mbps × 128-bit bus = ~100 GB/s theoretical peak. Real sustained BW is ~60–70% of peak. MediaTek's BWM (bandwidth management) unit monitors per-master usage and dynamically adjusts QoS weights to prevent one subsystem (GPU) from starving others.
D-PHY: differential pairs (lane + clock lane), NRZ encoding. Each lane: 80 Mbps (LP mode) to 2.5 Gbps (HS mode). Simple, widely adopted for camera and display (DSI). Separate clock lane adds overhead. Max ~6 Gbps with 4 data lanes.
C-PHY: 3-wire trio (no dedicated clock — embedded in symbol transitions). Uses 3-phase symbol encoding (5-wire equivalent efficiency: 2.28 bits per symbol). Achieves higher data density per wire count: ~2.5 Gsps per trio ≈ 5.7 Gbps. No clock lane means fewer pins. Trade-off: complex receiver CDR, higher power in PHY, less ecosystem support.
H.265 (HEVC) encoder pipeline: (1) Input fetch + preprocessing (color space, padding); (2) CTU partitioning (Coding Tree Unit, up to 64×64); (3) Intra/inter mode decision (IME — integer motion estimation, FME — fractional ME); (4) Transform (DCT/DST on residuals); (5) Quantization; (6) CABAC entropy coding; (7) Deblocking + SAO filter; (8) Reconstructed frame buffer write-back.
RTL challenges: IME requires searching a large reference frame region (hundreds of candidates) — needs highly parallel SAD computation (SIMD-style array of adders). CABAC is inherently sequential (context-adaptive arithmetic coding) — limits throughput and is difficult to pipeline. Memory bandwidth for reference frames is large — requires DMA engines and on-chip line buffers.
Clock gating: disables clock to flip-flops in an idle block using ICG cells. Eliminates dynamic switching power (C·V²·f·α). Saves 20–40% of active power. State is retained. Wake-up latency: 0 cycles (clock gating cell re-enables immediately). No level shifters or isolation cells needed.
Power gating: shuts off VDD to an entire block via header/footer switches. Eliminates both dynamic AND static (leakage) power. State is lost unless retention FFs are used. Wake-up latency: several µs (power-up sequence, state restore). Requires isolation cells and power switches. Use power gating for blocks idle for >100 µs (e.g., video encoder idle between sessions).
PLL components: Phase Frequency Detector (PFD) compares reference vs. feedback phase → Charge Pump (CP) generates correction current → Loop Filter (LF) integrates to VCO control voltage → VCO oscillates at target frequency → Frequency Divider (/N) feeds back. Closed loop phase-locks so fout = N·fref.
Jitter sources: (1) PFD/CP noise: mismatched charge pump currents → phase error; (2) VCO phase noise: flicker noise dominates close-in, thermal noise at high offset; (3) Supply/substrate noise coupling into VCO (solved with differential LC-VCO, regulated supply); (4) Reference clock jitter; (5) Divider quantization noise (for fractional-N: Σ-Δ modulator adds high-frequency noise shaped out of band).
Workload partitioning depends on operation type: (1) Control flow, low-throughput tasks → ARM CPU; (2) Signal processing (FFT, FIR, audio) → DSP (VLIW architecture, MAC arrays, low power); (3) Tensor operations (matrix multiply, convolution) → NPU (systolic array or PE array, optimized for 8-bit/16-bit fixed-point).
MediaTek's APU (AI Processing Unit) uses a multi-core design: MDLA (deep learning accelerator, INT8 systolic array) + VPU (vector processor for pre/post-processing). The scheduler in the APU driver partitions a neural network graph into subgraphs and assigns to MDLA or VPU based on operator support and latency targets.
Thermal throttling: when Tj approaches TJ_MAX (typically 85–95°C), the thermal framework reduces OPP to lower power dissipation. Steps: (1) Thermal sensors (on-die BJT or resistive) report temperature; (2) Linux thermal framework compares to trip points; (3) Passive cooling: reduce CPU/GPU frequency; (4) Active cooling: fan on premium phones; (5) Emergency shutdown at TJ_CRIT.
Trade-off: throttling causes user-visible performance drops (game frame rate drops, benchmark inconsistency). MediaTek's AI-based thermal prediction (in Dimensity 9000+) pre-emptively throttles before reaching the thermal limit, maintaining a more consistent temperature and smoother performance curve vs. reactive throttling.
Massive MIMO: base station uses 64–256 antennas to form narrow beams toward individual UEs using spatial multiplexing. Precoding: each antenna applies a complex weight (phase + amplitude) so signals add constructively at the target UE and destructively elsewhere. Precoding matrix computed from CSI feedback.
UE modem challenges: (1) CSI-RS measurement across hundreds of beams → huge compute for beam management; (2) Channel estimation for massive MIMO reference signals requires large matrix operations; (3) PDSCH decoding with LDPC at 1+ Gbps requires highly parallel decoders; (4) mm-Wave beamforming (FR2) at UE side requires phased array RF front-end with per-element phase control — separate analog beamforming IC or integrated in modem SoC.
Successive Approximation Register ADC: (1) Sample: sample-and-hold circuit captures Vin; (2) MSB trial: DAC outputs Vref/2, comparator checks if Vin > Vref/2 → if yes, bit=1 and keep; (3) Next bit: DAC outputs current output ± Vref/4, repeat binary search; (4) After N cycles, N-bit digital code in SAR register = approximation of Vin/Vref.
Key specs: conversion time = N clock cycles. No latency pipeline — result available after N cycles. Power scales with sample rate. Accuracy limited by DAC linearity (INL/DNL), comparator offset, and kT/C noise in sample-and-hold. TI's ADS8588 achieves 16-bit, 500 kSPS with <1 LSB INL.
Sigma-Delta (ΣΔ) ADC: oversamples the input at many times Nyquist (OSR = 64–512×). A 1-bit quantizer generates a high-frequency bitstream. Noise shaping: the feedback loop pushes quantization noise to high frequencies (out-of-band). A decimation filter (sinc or FIR) downsamples and low-pass filters, removing shaped noise, yielding high-resolution output at lower rate.
Resolution gain: each doubling of OSR adds 0.5 bits; first-order ΣΔ adds 1.5 bits per octave; higher-order modulators add more. A 2nd-order modulator at OSR=256 achieves ~16-bit ENOB. Used in audio (TI PCM1794A) and precision measurement (ADS124S08, 24-bit).
CAN uses a wired-AND bus: dominant bit (0) overrides recessive bit (1). During arbitration, all nodes simultaneously transmit their 11-bit (or 29-bit) identifier. Each node monitors the bus while transmitting. When a node transmits a recessive bit (1) but reads a dominant bit (0), it lost arbitration and backs off immediately — no collision, no data corruption.
The node with the lowest numerical ID wins (more 0s in MSBs). Non-destructive: the winning frame continues uninterrupted; losing nodes retry after the winning frame completes. Bit rate limited to ~1 Mbps at 40m to ensure all nodes see a stable level within one bit time (propagation constraint). CAN FD extends data phase to 8 Mbps.
In a half-bridge, high-side and low-side switches must never be on simultaneously (shoot-through → destructive cross-conduction short). Dead-time is a forced delay between turning off one switch and turning on the other. During dead-time, body diode of the off switch conducts the inductor current (freewheeling).
Implementation in RTL/hardware: two independent PWM outputs with asymmetric edge delays. A dead-time generator inserts N clock cycles of both-off state on the falling edge of each output. TI's C2000 ePWM module has programmable dead-band (DBRED, DBFED registers). Minimum dead-time = gate driver propagation delay + worst-case VGS fall time of the turning-off device.
TI C66x is a VLIW (Very Long Instruction Word) DSP. Each core has 8 functional units: 2× L units (logical), 2× S units (shift/branch), 2× M units (multiply — each does 4× 16-bit MACs/cycle = 8 MACs per core), 2× D units (load/store). VLIW executes up to 8 operations per cycle in parallel with no hardware interlocking — compiler schedules all hazards statically.
Throughput: at 1.25 GHz, each core delivers 40 GMACS (16-bit). A 8-core C6678 reaches 320 GMACS. Deep pipeline (10–12 stages) maximizes clock rate. Load-use latency: 4 cycles → compiler fills with independent instructions. Loop buffer (SPLOOP) executes tight loops without fetch overhead.
Buck converter (switching): efficiency = Vout/Vin (ideally); 85–95% typical. Use when Vin/Vout ratio is large (e.g., 5V→1V — LDO would dissipate 80% as heat). Requires LC filter, inductor, controller — larger PCB area and cost. Best for high-current loads (0.5A+).
LDO (linear): efficiency = Vout/Vin — poor for large drop. But: no switching noise, small PCB area, fast transient response, no EMI. Use for sensitive analog supplies (ADC AVDD, PLL supply) where switching noise would degrade PSRR. Also acceptable for small dropout (e.g., 1.8V→1.5V at 100mA — only 30mW heat).
ENOB (Effective Number of Bits) = (SINAD − 1.76) / 6.02. SINAD = Signal-to-Noise And Distortion ratio. An ideal N-bit ADC has SINAD = 6.02N + 1.76 dB. ENOB accounts for all noise and distortion sources: quantization noise, thermal noise, DNL/INL nonlinearity, and clock jitter.
Aperture jitter: uncertainty in the sampling instant (Δt). At input frequency fin, voltage uncertainty ΔV = 2π·fin·Vin·σt. This appears as additive noise. For a 16-bit ADC at 1 MHz input, required jitter σt < 1/(2π·fin·2^16) ≈ 2.4 ps RMS. High-speed ADCs use low-jitter LC oscillator or external clock reference to meet this. Jitter-limited ENOB = −20·log10(2π·fin·σt) / 6.02.
Full code histogram method: apply a slowly ramping or sinusoidal input, collect a large sample (10K–100K points), count hits per output code. Ideal uniform distribution → each code gets equal hits. DNL = (actual hits − ideal hits) / ideal hits. INL = cumulative sum of DNL. Accurate but slow.
Fast alternatives: (1) Sine-wave FFT test (SINAD, THD, SFDR in one measurement — covers linearity and noise together); (2) Beat frequency histogram (two tones near Nyquist create slow beat, sweeping all codes quickly); (3) Digital post-correction using factory trim: measure INL, store correction coefficients in OTP, apply digitally. TI uses ATE (Automatic Test Equipment) with precision signal sources (better than ADC under test).
Key RTL blocks: (1) Prescaler counter: divide system clock to generate SCLK at configurable baud rate; (2) Shift register (8/16/32-bit): parallel-load tx_data, shift out MSB/LSB first on SCLK edge, shift in MISO; (3) CS mux: decode cs_sel to assert the correct CS_N line low during transfer; (4) State machine: IDLE → CS_ASSERT → TRANSFER (N bits) → CS_DEASSERT → IDLE; (5) CPOL/CPHA configuration: control idle clock level and which edge is capture/launch.
Interface: AXI-lite register map for tx_data, rx_data, control (baud, mode, cs_sel, start). Interrupt on transfer complete. For multiple slaves: one MOSI, one MISO (tri-stated when CS deasserted), N CS lines. Each slave may need different CPOL/CPHA — store per-CS config.
CORDIC (COordinate Rotation DIgital Computer): computes trigonometric functions (sin, cos, arctan, magnitude) using only shifts and adds — no multipliers. Algorithm: iteratively rotate a vector by pre-defined angles (arctan(2^−i)) using shift-and-add until the rotation error is minimized. Two modes: rotation mode (given angle, find sin/cos) and vectoring mode (given vector, find angle/magnitude).
Why preferred in fixed-point hardware: no multipliers needed (shifts are free in hardware), predictable latency (N iterations = N-bit precision), low gate count, no LUT tables. Used in TI C2000 for FOC (Field-Oriented Control) motor drives where real-time sin/cos calculation at fixed-point is needed without floating-point overhead.
PAM4 (4-level Pulse Amplitude Modulation) encodes 2 bits per symbol. At 56 Gbaud, PAM4 achieves 112 Gbps vs. NRZ's 56 Gbps at the same baud rate. Trade-off: PAM4 voltage levels are 1/3 the spacing of NRZ → 9.5 dB worse SNR margin → much stricter EQ requirements.
SerDes blocks: (1) TX: DSP pre-emphasis (FFE — Feed Forward Equalization, 3–7 taps) to compensate channel loss; (2) RX: CTLE (Continuous Time Linear Equalizer) + DFE (Decision Feedback Equalizer, 6–10 taps) + CDR (Clock Data Recovery, Mueller-Müller or Alexander PD); (3) FEC: Reed-Solomon FEC (RS-528,514 for 112G-LR) mandatory due to higher raw BER. JTOL (jitter tolerance) spec: must tolerate 0.15 UI random jitter at 112 Gbaud.
8b10b maps each 8-bit data byte to a 10-bit code word ensuring: (1) DC balance (equal 0s and 1s → no low-frequency content → AC-coupled links work); (2) Sufficient transitions for CDR clock recovery (running disparity control); (3) K-codes (special characters) for comma, idle, and flow control.
Trade-off: 20% bandwidth overhead (8 bits → 10 bits). At 10 Gbps, effective data rate is 8 Gbps. Used in PCIe Gen1/2, SATA, USB 3.0, Gigabit Ethernet. Replaced by 128b130b in PCIe Gen3+ and USB 3.1+ for lower overhead (1.5%). 128b130b does not guarantee DC balance — requires scrambler for sufficient transitions.
TCAM (Ternary CAM) stores entries with don't-care bits (X) in addition to 0 and 1. Each cell has two storage nodes: one for data, one for mask. A lookup word is broadcast to all rows simultaneously; each row compares bit-by-bit with its mask applied. All matching rows assert a match signal. A priority encoder selects the highest-priority (longest prefix = most specific) match in O(1) time.
Implementation: TCAM arrays operate in parallel — the lookup takes 1–2 clock cycles regardless of table size (vs. O(log N) for SRAM binary search). Drawback: 4–10× area and power vs. SRAM. Marvell Prestera ASICs pair TCAM with SRAM: TCAM finds the match index, SRAM retrieves the action/next-hop. Power saving: narrow search (TCAM search key = 80/160/320 bits selectable) reduces active cells.
LTSSM (Link Training and Status State Machine): Detect → Polling → Configuration → L0. Key phases: (1) Detect: both sides detect electrical idle exit (EIEOS); (2) Polling: exchange TS1/TS2 ordered sets to establish bit lock and symbol lock; (3) Configuration: negotiate lane width, link number, speed; (4) Recovery (Gen2→Gen5 speed transition): retrain at new speed with equalization.
Gen4/5 Equalization: mandatory 3-phase EQ handshake (Preset exchange, coefficient request, coefficient set). Each direction negotiates TX pre-set and coefficients via TS2s. Failure causes: (1) Signal integrity issues (insufficient EQ at high speed); (2) Compliance pattern errors; (3) PLL lock failure at high baud rate; (4) Receiver detection failure (termination mismatch). Debug via LTSSM state capture and eye diagram.
SATA: 6 Gbps max (SATA III), single command queue depth (NCQ: 32 commands), HBA-based arbitration, legacy ATA command set. Simple controller, low CPU overhead for HDDs. Bottleneck at ~600 MB/s for SSDs.
NVMe over PCIe: PCIe Gen4 ×4 = 8 GB/s effective; up to 64K queues × 64K depth — parallelism matches NAND flash internal parallelism. NVMe command set is streamlined (2 doorbell register writes to submit/complete a command vs. SATA's 7 register writes). Controller ASIC: needs PCIe MAC, NVMe controller (queue management, completion), FTL (Flash Translation Layer) for NAND, ECC engine (LDPC). Marvell's 88SS9187 is a mainstream NVMe SSD controller.
Hardware hash tables are used for MAC address lookup, flow tables, and ARP caches in switch ASICs. Collision handling: (1) Open addressing (linear or quadratic probing): deterministic, bad worst-case; (2) Chaining: overflow list in a separate SRAM — requires pointer following (extra latency); (3) Cuckoo hashing: two hash functions, two tables — if primary bucket full, displace existing entry to alternate bucket. Lookup always checks both locations in parallel → O(1) guaranteed lookup, insertion O(1) amortized; (4) Perfect hashing: pre-compute collision-free table offline — feasible for static tables (e.g., VLAN tables).
Marvell uses cuckoo hashing in Prestera switches for the FDB (Forwarding Database). Two 512K-entry SRAM banks, each checked in parallel per lookup cycle. Guaranteed 2-cycle lookup at 400 MHz → under 5 ns, fast enough for 400GbE line rate.
WRR assigns a weight Wi to each priority queue i. The scheduler services queues in round-robin order but proportional to weight: queue i gets Wi/(sum of weights) fraction of bandwidth. Example: 3 queues, weights 4:2:1 → high priority gets 4/7 = 57% bandwidth.
Implementation: credit counter per queue. Each queue starts with Wi credits per round. Dequeue one packet per credit; when credits reach 0, skip to next queue. After full round, refill all credits. Strict priority queues (SP) for latency-sensitive flows (e.g., RoCE, voice) preempt WRR queues. Marvell Prestera uses SP+WRR hierarchy: top 2 queues strict priority, lower 6 queues WRR.
RS-FEC (RS(544,514) for 100G-LR): systematic block code — 514 data symbols + 30 parity symbols per codeword (each symbol = 10 bits). Can correct up to 15 symbol errors per codeword (t = 15). Key property: corrects burst errors within a codeword window — ideal for SerDes where PAM4 lane errors tend to cluster.
Hardware implementation: RS encoder is a shift register with GF(2^10) polynomial multiplications. Decoder: syndrome computation → Berlekamp-Massey algorithm (find error locator polynomial) → Chien search (find error locations) → Forney algorithm (find error values) → correct. Throughput: must process one FEC codeword every 80 ns at 100Gbps. Requires highly parallel pipelined GF arithmetic. Pre-FEC BER target: 2.4×10^-4; post-FEC BER target: 10^-15.
SerDes BIST (Built-In Self-Test): (1) PRBS generator (PRBS-7, PRBS-15, PRBS-31) generates pseudo-random bit stream in TX; RX checks against locally generated PRBS to count bit errors — BER measurement without external tester; (2) Loopback modes: (a) near-end analog loopback (TX analog out → RX input on same die — tests TX+RX path); (b) far-end loopback (loop at remote end — tests full channel); (c) digital loopback (TX digital data → RX digital input — tests only digital portion); (3) Eye diagram capture: built-in eye monitor samples RX at different phase/voltage offsets to map eye opening — reports horizontal/vertical eye margins.
Production test: ATE uses SerDes BIST + loopback for rapid go/no-go without expensive external high-speed stimulus. At-speed PRBS test catches SerDes-specific defects (CDR failure, EQ misadaptation, PLL phase noise).
Total jitter budget for 112G (56 Gbaud): IEEE 802.3ck specifies JTOL (jitter tolerance) and Jitter Generation limits. Budget allocation: Total Jitter (TJ) = Deterministic Jitter (DJ) + Random Jitter (RJ) × Q_factor. At BER=10^-12, Q≈7 → TJ = DJ + 14·σRJ.
Typical 112G-CR4 budget (0.5m copper cable): TX jitter generation ≤ 0.2 UI DJ + 0.03 UI RJ; channel ISI + crosstalk adds 0.15 UI DJ; RX must tolerate 0.5 UI DJ + 0.07 UI RJ at its CDR input. RX JTOL spec (from IEEE): 0.65 UI total at low frequency. Equalization (FFE+CTLE+DFE) reduces ISI-induced DJ at RX. CDR bandwidth determines RJ filtering. FEC reclaims 0.1 UI effective margin by correcting residual errors.
GPU: general-purpose SIMT (Single Instruction Multiple Threads) — thousands of small cores with shared memory and flexible compute. Good for training (irregular sparsity, variable batch sizes). High memory bandwidth required; DRAM bandwidth often the bottleneck.
Tesla FSD (HW3/HW4): two dedicated Neural Processing Units (NPUs), each a systolic array optimized for INT8/INT16 matrix multiply. Fixed dataflow (weight-stationary or output-stationary) eliminates scheduling overhead. On-chip SRAM sized for weight reuse across a batch of frames. Camera pipeline feeds directly into NPU memory — no DMA round-trips. Result: 36 TOPS at 72W (HW3) — better TOPS/W than GPU for this specific workload. Deterministic latency for safety-critical ADAS.
A systolic array is a 2D grid of Processing Elements (PEs). Each PE performs a multiply-accumulate (MAC). Data flows through the array rhythmically: activations flow left-to-right, weights flow top-to-bottom (or are pre-loaded). Each PE computes one partial sum, passes activations to the right and weights downward.
For an N×N array computing C = A×B: each output element C[i][j] = Σ A[i][k]·B[k][j] accumulates over K cycles. No global interconnect — each PE only communicates with neighbors → very regular layout, high utilization, low power. Data reuse: once a weight tile is loaded into the array, many input vectors flow through, amortizing DRAM bandwidth. Google TPU, Tesla FSD NPU, and Arm Ethos all use systolic arrays.
Quantization maps floating-point weights and activations to 8-bit integers: Q = round(FP / scale + zero_point). Scale and zero_point are per-layer or per-channel calibration parameters. Benefits: (1) 4× smaller model size vs. FP32; (2) 4× faster matrix multiply (INT8 SIMD 4× throughput); (3) Lower memory bandwidth; (4) Lower power.
Accuracy impact: post-training quantization (PTQ) typically loses <1% accuracy for CNN classification but can degrade more for detection/segmentation heads and transformers. Quantization-aware training (QAT) simulates INT8 during training — recovers most accuracy. Per-channel weight quantization (different scale per output channel) significantly reduces accuracy loss vs. per-tensor.
ISO 26262 ASIL-D (highest automotive safety integrity level): hardware architectural metrics — SPFM (Single-Point Fault Metric) ≥ 99%, LFM (Latent Fault Metric) ≥ 90%, PMHF (Probabilistic Metric for random Hardware Failures) < 1×10^-8 /h. Requires hardware redundancy and diagnostic coverage to meet these targets.
Implementation at SoC level: (1) Lock-step CPU cores (two identical cores execute same instructions, outputs compared cycle-by-cycle — any divergence = fault); (2) ECC on all on-chip SRAMs and caches; (3) Parity on buses and register files; (4) Watchdog timers; (5) BIST for logic (LBIST) and memory (MBIST) at boot and periodically during operation; (6) Fault injection testing to verify diagnostic coverage. Tesla FSD ASIC has redundant NPU cores with output comparison for safety-critical perception paths.
Large SRAM challenges for DFT: (1) MBIST (Memory BIST): testing large SRAM arrays at-speed requires MBIST controllers per array. March algorithms (March C-, March LR) detect stuck-at and transition faults. At high density, coupling faults between adjacent cells become significant; (2) Repair: use redundant rows/columns (column repair common) with programmable fuses (eFuse, polyfuse); (3) Compiler-generated SRAM: foundry memory compilers provide BIST interfaces — verify timing across all PVT corners; (4) Power: during BIST, all SRAM arrays switching simultaneously → huge IR drop — must stagger MBIST enable signals; (5) Test time: multi-megabyte on-chip SRAMs can take minutes at-speed — hierarchical BIST with concurrent testing of independent banks reduces test time.
Transformer inference (auto-regressive generation): each token generation requires loading all model weights from memory (for each attention head and FFN layer). For GPT-3 (175B parameters, FP16): 350 GB of weights must transit the memory bus per token step. At HBM bandwidth of 3.35 TB/s (A100), this takes 350/3350 ≈ 105 ms/token — severely limits throughput.
The compute-to-memory ratio (arithmetic intensity) for token generation is very low: few MACs per byte loaded (weights are loaded once, multiply against one token vector). Prefill (prompt processing) is compute-bound (large batch); decode (generation) is memory-bandwidth-bound (batch=1 typically). Solutions: weight quantization (INT4 → 4× bandwidth reduction), speculative decoding, KV-cache compression, and compute-in-memory (CIM) architectures.
Lock-step: two identical CPU cores (or NPUs) receive the same inputs and execute the same instruction stream simultaneously. Outputs (register writes, memory writes, branch outcomes) are compared by dedicated comparison logic every cycle. Any divergence triggers a safety fault — the system can then switch to a safe state or use a third redundant core for voting (TMR — Triple Modular Redundancy).
Temporal offset variant: one core runs N cycles behind, comparing outputs with delay — detects transient (single-event upset) faults that would not affect two simultaneous cores identically. Spatial diversity: cores are physically separated on die to reduce common-cause failures (radiation, thermal gradients). In Cortex-R52 (common in ADAS): lock-step mode is built into the core — a single IP providing dual-core lock-step with internal comparators.
Heterogeneous SoC clock domains: CPU (1–3 GHz), GPU (800 MHz–1.5 GHz), NPU (500 MHz–1 GHz), LPDDR controller (fixed ratio to memory clock), ISP, video codec — each potentially asynchronous. CDC challenges: (1) Metastability at every domain crossing — require 2FF synchronizers (or 3FF for very high speed); (2) Multi-bit buses: cannot synchronize bus directly — need handshake protocol (req/ack) or FIFO-based crossing; (3) Control signals and data must be co-synchronized — a stale address with new data causes memory corruption; (4) Timing convergence: STA tools must correctly identify false paths at CDC boundaries — missing false paths creates phantom timing violations; (5) Formal CDC verification (SpyGlass CDC, Meridian) required to prove all crossings are safe.
GPU: general SIMT, programmable, flexible data types. High power (200–400W). Non-deterministic scheduling. Best for: training, batch inference with irregular models. Latency not guaranteed.
NPU (Neural Processing Unit): structured MAC arrays (systolic or PE array), fixed dataflow, optimized for tensor ops. Programmable at layer granularity (compiler-generated schedules). Deterministic latency, power-efficient (1–10 TOPS/W). Best for: production inference of standard CNN/transformer layers.
Fixed-function accelerator: hard-wired pipeline for a specific algorithm (e.g., ISP pipeline, H.265 encoder, radar FFT engine). Lowest power, lowest area, fastest, most deterministic. Zero flexibility — cannot run a different algorithm. Used in camera ISP, video decode, radar processing in ADAS. Tesla FSD SoC combines: ARM CPUs (control), NPUs (neural net inference), fixed-function ISP (camera preprocessing), video decoder.
Vehicle thermal constraints: automotive SoC must operate continuously at ambient up to 105°C (cabin temperature) + self-heating. Tesla FSD computer is liquid-cooled, but the chip must still meet TDP (Thermal Design Power) limits (~72W for HW3, ~100W for HW4) to prevent liquid cooling system from being oversized.
Power management: (1) Per-block power gating when camera/radar pipelines not active; (2) DVFS on CPU and GPU based on scene complexity (highway vs. urban); (3) NPU utilization capped when thermal headroom is low; (4) Workload shaping: Tesla's firmware defers non-critical tasks (map updates, logging) during peak thermal loads; (5) PMIC coordination: SoC signals power intent to PMIC via I²C/SMBus, which adjusts rail voltages; (6) Boot-time power characterization (OTP trim) ensures each chip operates at Vmin for its specific leakage.
Store-and-forward: receive entire frame, check FCS (CRC), then forward. Latency = full frame transmission time (e.g., 64-byte frame at 100G = 5.12 ns; 1500-byte frame = 120 ns). Eliminates corrupt frames. Required when ingress and egress speeds differ (speed adaptation).
Cut-through: begin forwarding after reading the destination MAC (first 14 bytes = 112 ns at 1G, 1.12 ns at 100G). Dramatically lower latency for small frames. Cannot check FCS — corrupt frames propagate. Fragment-free mode (512-bit minimum) filters collision fragments (legacy Ethernet). Broadcom's Tomahawk series supports cut-through for low-latency HPC interconnects (RDMA/RoCE where latency matters more than error checking).
HOL blocking: in a shared input queue switch, if the head packet is destined for a busy output port, it blocks all packets behind it — even those destined for idle output ports. Efficiency degrades to ~58% under uniform random traffic (Karol-Hluchyj-Morgan theorem for FIFO input queuing).
VOQ (Virtual Output Queue): maintain one queue per (input, output) pair. Packet at head of VOQ for output X only blocks if output X is busy — it does not block packets destined for output Y. VOQ requires a central scheduler (crossbar arbiter) that grants one input per output per slot (matching problem). iSLIP algorithm (iterative round-robin matching) solves this in O(N) time converging to maximum weight matching. Broadcom's StrataXGS architecture uses VOQs for large-scale Clos fabric switches.
112G PAM4 (56 Gbaud) receiver DSP chain: (1) CTLE (Continuous Time Linear Equalizer): analog high-pass filter compensates cable/PCB loss (frequency-dependent attenuation); (2) VGA (Variable Gain Amplifier): normalize signal amplitude; (3) ADC (4-bit, ~3× oversampled): digitize signal for DSP; (4) Digital FFE (Feed-Forward Equalizer, 5–11 taps): linear FIR filter, compensates remaining ISI; (5) DFE (Decision Feedback Equalizer, 6–15 taps): uses previous decisions to cancel post-cursor ISI (non-linear, avoids noise amplification); (6) CDR (Clock Data Recovery): Mueller-Müller or baud-rate phase detector adapts sampling phase; (7) PAM4 slicer: compare against 3 thresholds (−2/3Vmax, 0, +2/3Vmax) to decode 2 bits per symbol; (8) FEC: RS(544,514) corrects residual errors.
Standard Ethernet is lossy — tail-drop when buffers fill. RDMA (and RoCE v2 over UDP) requires lossless transport (packet loss triggers expensive retransmission or connection reset). PFC (IEEE 802.1Qbb) adds per-priority flow control: when a switch ingress buffer nears full for priority class X, it sends a PAUSE frame to the upstream sender for class X only — other priorities continue unaffected.
Implementation in Broadcom ASIC: per-port, per-priority threshold triggers XOFF PAUSE generation. Sender hardware responds to PAUSE frame by halting class-X transmission within 100 ns (hardware-enforced, not software). The backpressure propagates hop-by-hop through the fabric. Headroom buffer per port-priority is sized to absorb in-flight packets while PAUSE propagates. ECN (Explicit Congestion Notification) is used end-to-end alongside PFC for congestion avoidance (DCQCN algorithm).
ECMP: when multiple next-hops exist for a destination (equal routing metric), traffic is distributed across all of them. Hashing: a tuple (src IP, dst IP, src port, dst port, protocol) is hashed (CRC32 or Toeplitz hash) to select a next-hop index. Same flow always hashes to the same next-hop → consistent ordering within a flow (no reordering). Different flows hash to different next-hops → load balancing.
ASIC implementation: ECMP group table stores N next-hop entries. Hash output selects index 0..N-1. For N not a power of 2: modulo N or resilient hashing (Broadcom's Resilient ECMP) avoids remapping of surviving flows when one next-hop fails. Resilient ECMP: use a large bucket table (1024 entries) mapped to N next-hops; on failure, only reassign the failed next-hop's buckets, minimizing flow disruption.
WRR (Weighted Round-Robin): service queues in proportion to weight. Works well for fixed-size packets. For variable-size packets: a queue with large packets gets more bytes than intended per round even if its packet count matches the weight — unfair.
DWRR (Deficit WRR): each queue maintains a deficit counter. Per round, each queue receives Wi bytes of credit (not packets). A packet is dequeued if (deficit ≥ packet_size). After dequeue, deficit -= packet_size. Unused deficit carries over to next round. Result: byte-accurate fairness regardless of packet size. Implementation: hardware maintains per-queue deficit register, updated each dequeue decision. Broadcom's Trident series implements DWRR for best-effort queues with SP (strict priority) queues for latency-critical traffic (RDMA, latency-critical apps).
RoCE v2 encapsulates InfiniBand transport packets in UDP/IP/Ethernet. Hardware offload on NIC: zero-copy DMA, QP (Queue Pair) management, CQ (Completion Queue) notifications — CPU is not involved per-packet. The switch ASIC is RoCE-unaware at packet level — it treats RoCE as UDP traffic. But the switch must support: (1) PFC for lossless delivery; (2) ECN marking (WRED/RED with ECN bit set) for DCQCN congestion control; (3) Low latency (cut-through forwarding); (4) RoCE v2 requires routable IP — switch does L3 routing same as any IP packet.
Broadcom StrataXGS implements hardware ECN marking: when queue depth exceeds a threshold, incoming RoCE packets get ECN CE (Congestion Experienced) bit set in IP header. The receiver NIC echoes this to the sender NIC, which reduces injection rate. This DCQCN loop runs entirely in hardware at 100ns timescales.
Broadcom Tomahawk 400G pipeline (simplified): (1) Ingress MAC: receive 400G signal, SerDes decode, RS-FEC decode, MAC framing, CRC check; (2) Ingress parser: extract packet headers (Ethernet, VLAN, IP, TCP/UDP) into header vector; (3) Ingress pipeline: VLAN lookup, ACL (TCAM), L2 FDB lookup, L3 LPM (ECMP), QoS marking, meter/policing — all in 1–2 ns pipeline stages; (4) Switching fabric: cell-based internal fabric (cells = 128-byte cells from packet segmentation) — crossbar or Clos; (5) Egress pipeline: rewrite (TTL decrement, MAC swap, VLAN push/pop), QoS queue selection, scheduler; (6) Egress MAC: RS-FEC encode, SerDes transmit.
Throughput requirement: 25.6 Tbps aggregate for 64×400G ports. Pipeline clock: 1.2 GHz, highly parallel. Packet rate: 64 bytes minimum → 3.2 Bpps — each pipeline stage processes one cell per clock.
Production test strategy for 112G SerDes: (1) PRBS-31 loopback BER test: all SerDes lanes enabled simultaneously in near-end analog loopback, PRBS-31 pattern generated and checked. BER < 10^-12 required for pass; (2) Eye monitor scan: hardware eye monitor sweeps phase (16 steps across UI) and voltage (16 levels), records error count per point — maps eye opening. Minimum eye height/width checked vs. spec; (3) EQ self-adaptation: CDR, CTLE, DFE adapt during PRBS run — verify adaptation convergence (final coefficient values within bounds); (4) Lane-to-lane crosstalk: run PRBS on all lanes simultaneously to stress worst-case crosstalk; (5) Clock jitter measurement: internal TDC (Time-to-Digital Converter) measures CDR recovered clock phase noise.
Test time: 64 lanes × PRBS run + eye scan ≈ 30–60 seconds per device. Parallel testing all lanes simultaneously cuts total time.
400G Ethernet MAC processes data at 400 Gbps. Using 256-bit internal datapath at 1.6 GHz: each cycle processes 256 bits = 32 bytes at 1.6 GHz = 51.2 GB/s per MAC. For 64 ports: 64×400G = 25.6 Tbps total. Critical paths: (1) FCS (CRC-32) computation: must compute CRC over 256-bit word in one cycle → requires parallel CRC algorithm (precomputed Galois field matrix); (2) Parser: extracting IP/TCP offsets from variable-length headers in one cycle → wide combinational mux tree; (3) Alignment FIFO: CDC between SerDes recovered clock and MAC core clock — synchronizer contributes latency; (4) Insertion of preamble/SFD and padding: modifies first/last cells — conditional logic on critical path; (5) Checksum offload: inner IP checksum calculation for tunnel traffic in the same cycle as outer MAC.