The systolic array is the most regular, dense, and power-hungry block. It should be placed as a hard macro in a corner of the die with: (1) its scratchpad memories directly adjacent to minimise wire length, (2) the DMA engine between the systolic array and the AXI crossbar, (3) the RISC-V CPU core near the boot ROM and UART for short control paths.
| Block | Placement Strategy | Reason |
|---|---|---|
| Systolic array | Hard macro, bottom-left corner | Regular array, dense, away from pads |
| Scratchpad SRAM | Directly above systolic array | Minimise data wire length |
| DMA engine | Between SRAM and AXI crossbar | On the data path, short wire to both |
| RISC-V CPU | Top-right, near AXI crossbar | Control traffic, near IO pads |
| UART / PLIC | Near IO pads | Short routing to chip boundary |
| PLL / Clock | Centre, clock spine entry point | Equal distance to all clock endpoints |
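The placement strategy above can be sanity-checked with a rough wire-length estimate. The sketch below is illustrative only: the block centre coordinates and the 2 mm × 2 mm die are hypothetical values chosen to match the described layout (array bottom-left, SRAM directly above, CPU top-right), and the helper names are not from any tool.

```python
# Hypothetical block centres in millimetres on an assumed 2 mm x 2 mm die.
# These coordinates are illustrative assumptions, not values from a real floorplan.
BLOCK_CENTRES = {
    "systolic_array": (0.5, 0.4),   # bottom-left corner
    "scratchpad_sram": (0.5, 0.9),  # directly above the array
    "dma_engine": (1.0, 1.1),       # between SRAM and crossbar
    "axi_crossbar": (1.4, 1.4),
    "riscv_cpu": (1.6, 1.7),        # top-right
}

# Data flows array -> SRAM -> DMA -> crossbar, so these hops should be short.
DATA_PATH = ["systolic_array", "scratchpad_sram", "dma_engine", "axi_crossbar"]


def manhattan_mm(a, b):
    """Manhattan distance between two block centres (routing is rectilinear)."""
    (xa, ya), (xb, yb) = BLOCK_CENTRES[a], BLOCK_CENTRES[b]
    return abs(xa - xb) + abs(ya - yb)


def path_length_mm(path):
    """Total centre-to-centre length along a chain of blocks."""
    return sum(manhattan_mm(a, b) for a, b in zip(path, path[1:]))


if __name__ == "__main__":
    for a, b in zip(DATA_PATH, DATA_PATH[1:]):
        print(f"{a} -> {b}: {manhattan_mm(a, b):.2f} mm")
    print(f"total data path: {path_length_mm(DATA_PATH):.2f} mm")
```

Centre-to-centre Manhattan distance is only a first-order proxy; a real flow would measure pin-to-pin routed length after placement.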
Power Domain Partitioning
Two Power Domains
PD_ALWAYS_ON: RISC-V CPU, UART, PLIC, boot ROM, AXI crossbar — always powered. PD_ACCEL: Systolic array + DMA + scratchpad — can be power-gated when idle. Isolation cells on all signals crossing from PD_ACCEL to PD_ALWAYS_ON clamp outputs to a safe value when PD_ACCEL is off.
The critical timing path in the systolic array is: psum_in → 8×8 multiplier → 32-bit adder → psum_out register. At 1 GHz (1 ns period), this must complete in under ~0.85 ns (after setup margin). The multiplier alone is typically 0.5–0.7 ns in 7nm. Fix options:
| Fix | Cost | Benefit |
|---|---|---|
| Pipeline the multiplier (2-stage) | +1 cycle latency | Halves multiplier path |
| Use higher-drive cells on critical net | +area, +power | Reduces net delay |
| Place adder near multiplier (manual) | Physical design (PD) effort | Reduces wire delay |
| Reduce clock frequency to 800 MHz | −20% throughput | Easiest; no RTL change |
| Use DSP hard macro for multiply-add | DSP count | Fastest; dedicated silicon |
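To compare the first and fourth options numerically, here is a minimal slack sketch. It takes the 1 ns period, the ~0.85 ns budget and the 0.5–0.7 ns multiplier range from the text. The 0.25 ns adder delay, the even split of the multiplier across two stages, and the reuse of the same 0.15 ns margin at 800 MHz are assumptions made for illustration.

```python
SETUP_MARGIN_NS = 0.15  # 1.00 ns period minus the ~0.85 ns budget from the text
ADDER_NS = 0.25         # ASSUMPTION: 32-bit adder delay, not given in the text


def slack_ns(period_ns, stage_delays_ns):
    """Slack of the slowest stage against (period - setup margin)."""
    return (period_ns - SETUP_MARGIN_NS) - max(stage_delays_ns)


def single_stage(mult_ns):
    """Original path: multiplier and adder in the same cycle."""
    return [mult_ns + ADDER_NS]


def two_stage(mult_ns):
    """ASSUMPTION: a pipeline register splits the multiplier evenly;
    the adder stays in the same cycle as the second half."""
    return [mult_ns / 2, mult_ns / 2 + ADDER_NS]


if __name__ == "__main__":
    for mult_ns in (0.5, 0.7):
        print(f"multiplier {mult_ns:.2f} ns")
        print(f"  1 GHz, unpipelined  : {slack_ns(1.00, single_stage(mult_ns)):+.2f} ns")
        print(f"  1 GHz, 2-stage mult : {slack_ns(1.00, two_stage(mult_ns)):+.2f} ns")
        print(f"  800 MHz, unpipelined: {slack_ns(1.25, single_stage(mult_ns)):+.2f} ns")
```

Under these assumptions the slow-end multiplier (0.7 ns) misses the 1 GHz budget by about 0.10 ns unpipelined, while either pipelining or dropping to 800 MHz restores positive slack. Real numbers must come from static timing analysis on the placed netlist.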
Day 14 — Interview Questions
Q1. What is a floorplan and why is it important for SoC design?
A floorplan is the spatial arrangement of major functional blocks (macros, power domains, clock regions, IO pads) on the die before detailed placement and routing. It is important because: (1) Wire length between blocks is fixed by their relative positions — placing the systolic array far from its scratchpad creates long data wires with high latency and power, (2) Power distribution (PDN — power delivery network) must be planned to deliver adequate current to high-power blocks like the systolic array; poor floorplanning leads to IR drop, (3) Clock tree synthesis (CTS) works best when all registers are in a compact, symmetric region — a poorly shaped floorplan creates an unbalanced clock tree with high skew, (4) The floorplan determines which blocks share routing channels, affecting congestion. A good floorplan is shaped to match the design's dataflow — data flows short distances, control flows longer distances.
Q2. What is an isolation cell and when is it required?
An isolation cell is inserted at every signal that crosses from a power-gated domain to an always-on domain. When the power-gated domain (PD_ACCEL) is turned off, its outputs float to an unknown state — this could propagate as random logic into the always-on domain (PD_ALWAYS_ON), causing incorrect behaviour. An isolation cell clamps the output to a defined safe value (0 or 1, specified in the UPF via -clamp_value) when the isolate enable signal is high (domain is off). It is placed on the always-on side, powered by VDD_AON, so it remains functional even when PD_ACCEL is gated. Without isolation cells, power-gating is functionally dangerous — the floating outputs corrupt the CPU's AXI response channel or interrupt signals.
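The clamping behaviour can be illustrated with a small behavioural model. This is a sketch, not generated hardware or UPF: the function name `isolation_cell` is hypothetical, and `None` is used here to stand in for the unknown value of a floating output.

```python
from typing import Optional


def isolation_cell(signal: Optional[int], iso_en: int, clamp_value: int = 0) -> Optional[int]:
    """Behavioural model of an isolation cell on the always-on side.

    signal      -- value driven by the gated domain; None models a floating output
    iso_en      -- 1 while the gated domain is off, 0 during normal operation
    clamp_value -- the safe value (0 or 1) presented to the always-on domain
    """
    if clamp_value not in (0, 1):
        raise ValueError("clamp_value must be 0 or 1")
    return clamp_value if iso_en else signal


if __name__ == "__main__":
    # Domain powered and isolation released: the real value passes through.
    print(isolation_cell(1, iso_en=0))
    # Domain off, isolation asserted: the floating output is replaced by the clamp.
    print(isolation_cell(None, iso_en=1, clamp_value=0))
    # Domain off but isolation NOT asserted: the unknown value leaks through.
    print(isolation_cell(None, iso_en=0))
```

The third call shows the failure mode the answer describes: without an asserted isolation enable, the unknown value reaches the always-on logic.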
Q3. What is clock tree synthesis (CTS) and what causes clock skew?
Clock tree synthesis builds the distribution network from the PLL/clock source to every flip-flop's clock pin. The goal is to minimise clock skew (difference in arrival time at different FFs) and insertion delay. The CTS tool inserts buffer trees to drive the large capacitive load of thousands of FFs. Clock skew causes: (1) Hold violations: if the clock arrives later at the capturing FF than at the launching FF, newly launched data can race through a short path and overwrite the old data before the capturing FF has latched it, (2) Setup violations: skew in the other direction (clock arrives earlier at the capturing FF than at the launching FF) shortens the time available for the data path. For the systolic array, all 16×16 = 256 PE registers must have near-identical clock arrival times (skew budget ~ clock_period × 5% = 50 ps at 1 GHz). CTS achieves this using a balanced H-tree or shielded clock trunk. The PE array's regular structure makes it an ideal CTS target.
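The effect of skew on both checks follows from the standard single-cycle timing relations. The sketch below defines skew as capture-clock arrival minus launch-clock arrival (a sign convention chosen here), and all delay values are illustrative assumptions rather than figures from a real library.

```python
def setup_slack_ns(period, skew, t_cq, t_logic_max, t_setup):
    """Setup check: data must arrive t_setup before the NEXT capture edge.
    skew = capture clock arrival - launch clock arrival."""
    return (period + skew) - (t_cq + t_logic_max + t_setup)


def hold_slack_ns(skew, t_cq, t_logic_min, t_hold):
    """Hold check: new data must not arrive until t_hold after the SAME capture edge."""
    return (t_cq + t_logic_min) - (t_hold + skew)


if __name__ == "__main__":
    # ASSUMED delays in ns, for illustration only.
    period, t_cq, t_setup, t_hold = 1.00, 0.05, 0.03, 0.02
    t_logic_max, t_logic_min = 0.80, 0.02

    for skew in (-0.05, 0.0, 0.05, 0.10):
        s = setup_slack_ns(period, skew, t_cq, t_logic_max, t_setup)
        h = hold_slack_ns(skew, t_cq, t_logic_min, t_hold)
        print(f"skew {skew:+.2f} ns: setup slack {s:+.2f} ns, hold slack {h:+.2f} ns")
```

With this convention, a late capture clock (positive skew) helps setup and hurts hold, and an early capture clock does the opposite, which is why CTS targets a small skew budget rather than skew in either direction.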
Q4. What is IR drop and how does it affect the systolic array?
IR drop is the voltage reduction across the power supply network caused by the resistance of metal interconnects (V = I × R). When the systolic array switches all 256 PEs simultaneously (worst-case: 0→1 transition on all MAC inputs), the instantaneous current spike causes the local supply voltage to drop below nominal VDD. Lower voltage means slower gates (less gate overdrive, VDD − Vt, gives less drive current and longer delay), which can cause setup violations: the accumulator add may not complete before the clock edge. Mitigation: (1) Decap cells (on-chip decoupling capacitors) placed near the systolic array absorb the transient current spike, (2) Multiple power straps (wide M8/M9 rails) reduce the resistance path, (3) The power switch (for PD_ACCEL) must be sized to supply the peak current (calculate I = P/V; for 500 mW at 0.7 V, I ≈ 714 mA). IR drop analysis must be run in signoff to verify the drop is < 5% of VDD.
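The arithmetic in this answer can be turned into a quick resistance budget. The sketch below uses the 500 mW, 0.7 V and 5% figures from the text; it treats the drop as purely static (DC) with a single lumped resistance, which is a simplifying assumption, and the function names are hypothetical.

```python
def supply_current_a(power_w, vdd_v):
    """Average supply current from I = P / V."""
    return power_w / vdd_v


def ir_drop_v(current_a, resistance_ohm):
    """Static IR drop across a lumped supply resistance, V = I * R."""
    return current_a * resistance_ohm


def max_supply_resistance_ohm(power_w, vdd_v, drop_fraction=0.05):
    """Largest lumped resistance that keeps the drop within drop_fraction of VDD."""
    return (drop_fraction * vdd_v) / supply_current_a(power_w, vdd_v)


if __name__ == "__main__":
    power_w, vdd_v = 0.5, 0.7
    i = supply_current_a(power_w, vdd_v)
    r_max = max_supply_resistance_ohm(power_w, vdd_v)
    print(f"current      : {i * 1e3:.0f} mA")
    print(f"drop budget  : {0.05 * vdd_v * 1e3:.0f} mV")
    print(f"max lumped R : {r_max * 1e3:.0f} mOhm")
    print(f"drop at R_max: {ir_drop_v(i, r_max) * 1e3:.0f} mV")
```

Signoff IR analysis is dynamic and grid-based, so this lumped figure is only a starting point for sizing straps and power switches.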
Q5. What is OCV (On-Chip Variation) and how does it affect timing signoff?
OCV refers to variation in transistor characteristics across the die due to process gradients, temperature gradients, and IR drop. Two FFs on the same chip may have different drive strengths and delays even though they're identical in the netlist. For timing signoff, OCV is modelled using derating factors. For a setup check, the launch clock path and the data path are derated late (slower), and the capture clock path (from the clock source to the capturing FF's clock pin) is derated early (faster); for a hold check the derates are reversed. This creates a pessimistic but safe timing analysis. Advanced tools use AOCV (Advanced OCV) tables that derate paths based on distance and number of stages: paths with more stages average out more of the random variation and need less derate. For the systolic array, the regular placement means adjacent PEs have similar OCV, which is a beneficial side effect of the regular structure.
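A flat-derate setup check can be sketched in a few lines. The derate factors (1.05 late, 0.95 early) and all path delays below are assumed values for illustration. The model also ignores common-path pessimism removal, which real tools apply to the shared part of the clock tree.

```python
def setup_slack_ocv_ns(period, launch_clk, data, capture_clk, t_setup,
                       late_derate=1.0, early_derate=1.0):
    """Setup slack with flat OCV derates.

    Launch clock + data path are multiplied by late_derate (slower);
    the capture clock path is multiplied by early_derate (faster).
    """
    arrival = (launch_clk + data) * late_derate
    required = period + capture_clk * early_derate - t_setup
    return required - arrival


if __name__ == "__main__":
    # ASSUMED delays in ns, for illustration only.
    args = dict(period=1.00, launch_clk=0.30, data=0.80, capture_clk=0.30, t_setup=0.03)

    nominal = setup_slack_ocv_ns(**args)
    derated = setup_slack_ocv_ns(**args, late_derate=1.05, early_derate=0.95)
    print(f"nominal slack: {nominal:+.3f} ns")
    print(f"derated slack: {derated:+.3f} ns")
    print(f"pessimism    : {nominal - derated:.3f} ns")
```

The gap between the two results is the margin that OCV derating removes from the path. AOCV would replace the flat factors with values looked up by stage count and distance.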
Q6. How do you handle the power-up sequence for a multi-domain SoC?
The always-on domain (PD_ALWAYS_ON) powers up first and must be fully stable before the accelerator domain (PD_ACCEL) powers up. The power-up sequence: (1) VDD_AON ramps to nominal, with a stabilisation time of typically 100 µs, (2) PLL acquires lock on the system clock, typically 50–200 µs, (3) CPU comes out of reset, executes boot code, (4) Boot code initialises PLIC, UART, and other always-on peripherals, (5) When the application decides to use the accelerator: write accel_power_gate=0 (enable power switch), wait for VDD_ACCEL to stabilise (~10 µs), deassert accel_iso_en (release isolation), then deassert the accelerator's reset signal. The order matters: if the reset is released while isolation is still active, the accelerator starts running while its outputs are still clamped, so the always-on domain can miss its first responses or interrupts. If isolation is released before VDD_ACCEL is stable, floating signals corrupt the always-on domain. Always add settling time between power switch enable and isolation release.
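The ordering in step (5) can be captured as a small software model that refuses out-of-order steps. This is a sketch of the sequence as described above, not driver code: the class `AccelPowerController` and its method names are hypothetical, and the settling wait is modelled as a flag rather than a real delay.

```python
class AccelPowerController:
    """Toy model of the PD_ACCEL power-up sequence; raises on out-of-order steps."""

    def __init__(self):
        self.power_on = False    # power switch state (accel_power_gate == 0 means on)
        self.vdd_stable = False  # set after the settling wait (~10 us in the text)
        self.isolated = True     # accel_iso_en asserted
        self.in_reset = True     # accelerator reset asserted

    def enable_power_switch(self):
        self.power_on = True

    def wait_for_vdd(self):
        if not self.power_on:
            raise RuntimeError("power switch must be enabled before waiting for VDD_ACCEL")
        self.vdd_stable = True

    def release_isolation(self):
        if not self.vdd_stable:
            raise RuntimeError("isolation released before VDD_ACCEL is stable")
        self.isolated = False

    def release_reset(self):
        if self.isolated:
            raise RuntimeError("reset released while isolation is still active")
        self.in_reset = False

    def power_up(self):
        """Run the steps in the order given in the text."""
        self.enable_power_switch()
        self.wait_for_vdd()
        self.release_isolation()
        self.release_reset()


if __name__ == "__main__":
    ctrl = AccelPowerController()
    ctrl.power_up()
    print("accelerator ready:", ctrl.power_on and not ctrl.isolated and not ctrl.in_reset)
```

Encoding the preconditions as checks makes a skipped settling wait or a swapped step fail loudly in simulation instead of silently in silicon.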