Why does the systolic array need its own power domain?

The systolic array is the highest-power block in the SoC: a 16×16 INT8 array switching at 1 GHz can consume 100–500 mW depending on the process node. Giving it its own power domain (separate power rail, isolation cells, and power switch) lets the OS or runtime power-gate the accelerator when it is not in use, saving significant leakage power. A separate rail also lets the accelerator run at a different voltage from the CPU for a better power-performance trade-off: at peak performance the accelerator voltage can be raised above nominal (overdrive); at idle it can be lowered to near-threshold. Without a separate power domain, the accelerator is always on and always leaking, whether or not it is computing.

RISC-V Accelerator Day 14 — Physical Design: Floorplan, Power Domains, CTS & Timing Closure

SoC Floorplan Strategy

The systolic array is the most regular, dense, and power-hungry block. It should be placed as a hard macro in a corner of the die with: (1) its scratchpad memories directly adjacent to minimise wire length, (2) the DMA engine between the systolic array and the AXI crossbar, (3) the RISC-V CPU core near the boot ROM and UART for short control paths.

Block	Placement Strategy	Reason
Systolic array	Hard macro, bottom-left corner	Regular array, dense, away from pads
Scratchpad SRAM	Directly above systolic array	Minimise data wire length
DMA engine	Between SRAM and AXI crossbar	On the data path, short wire to both
RISC-V CPU	Top-right, near AXI crossbar	Control traffic, near IO pads
UART / PLIC	Near IO pads	Short routing to chip boundary
PLL / Clock	Centre, clock spine entry point	Equal distance to all clock endpoints
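
One rough way to compare candidate floorplans is to total the Manhattan distance between the centres of connected blocks. The sketch below does this for the data path named in the table. The coordinates (in mm, on a hypothetical 2 mm × 2 mm die) and the connection list are illustrative assumptions, not values from a real design.

```python
# Rough floorplan comparison: sum of Manhattan distances between connected
# block centres. All coordinates are hypothetical (mm on an assumed 2x2 mm die).

def manhattan(a, b):
    """Manhattan distance between two (x, y) points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def total_wire_length(centres, connections):
    """Sum centre-to-centre Manhattan distance over all connected pairs."""
    return sum(manhattan(centres[u], centres[v]) for u, v in connections)

# Data path from the table: systolic array <-> SRAM <-> DMA <-> crossbar <-> CPU
connections = [
    ("systolic", "sram"),
    ("sram", "dma"),
    ("dma", "crossbar"),
    ("crossbar", "cpu"),
]

# Layout following the table: array bottom-left, SRAM directly above it.
good = {
    "systolic": (0.4, 0.4), "sram": (0.4, 1.0), "dma": (1.0, 1.0),
    "crossbar": (1.4, 1.4), "cpu": (1.7, 1.7),
}
# Same blocks, but with the SRAM moved to the opposite side of the die.
bad = dict(good, sram=(1.7, 0.3))

print(f"table-style floorplan: {total_wire_length(good, connections):.2f} mm")
print(f"SRAM far from array:   {total_wire_length(bad, connections):.2f} mm")
```

Centre-to-centre distance ignores bus width and routing blockages, so it is only a first-order screen before running a real placement.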

Power Domain Partitioning

Two Power Domains

PD_ALWAYS_ON: RISC-V CPU, UART, PLIC, boot ROM, AXI crossbar — always powered. PD_ACCEL: Systolic array + DMA + scratchpad — can be power-gated when idle. Isolation cells on all signals crossing from PD_ACCEL to PD_ALWAYS_ON clamp outputs to a safe value when PD_ACCEL is off.

UPF (Tcl): power intent (UPF 2.1 snippet)

# Always-on domain
create_power_domain PD_ALWAYS_ON -elements {cpu uart plic crossbar}
create_supply_net VDD_AON -domain PD_ALWAYS_ON
connect_supply_net VDD_AON -ports {VDD}

# Accelerator domain (can be gated)
create_power_domain PD_ACCEL -elements {systolic_array dma sram}
create_supply_net VDD_ACCEL -domain PD_ACCEL
create_power_switch PS_ACCEL -domain PD_ACCEL \
    -input_supply_port {vin VDD_AON} \
    -output_supply_port {vout VDD_ACCEL} \
    -control_port {sleep accel_power_gate} \
    -on_state {on_state vin {!sleep}}

# Isolation cells: clamp accel outputs LOW when off
set_isolation ISO_ACCEL \
    -domain PD_ACCEL \
    -isolation_signal accel_iso_en \
    -isolation_sense high \
    -clamp_value 0 \
    -applies_to outputs

Critical Path: Accumulator in the PE

The critical timing path in the systolic array runs through the PE's multiply-accumulate: operand registers (activation and weight) → 8×8 multiplier → 32-bit adder (where psum_in joins) → psum_out register. At 1 GHz (1 ns period), this must complete in under ~0.85 ns (after setup margin). The multiplier alone is typically 0.5–0.7 ns in 7 nm. Fix options:

Fix	Cost	Benefit
Pipeline the multiplier (2-stage)	+1 cycle latency	Halves multiplier path
Use higher-drive cells on critical net	+area, +power	Reduces net delay
Place adder near multiplier (manual)	PD effort	Reduces wire delay
Reduce clock frequency to 800 MHz	−20% throughput	Easiest; no RTL change
Use DSP hard macro for multiply-add (FPGA targets only)	DSP count	Fastest; dedicated silicon
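
The budget arithmetic behind the first and fourth rows can be sketched as follows. The 1 ns period, the ~0.85 ns usable window, and the 0.5–0.7 ns multiplier delay come from the text above. The 0.25 ns adder delay is a placeholder assumption, and the 2-stage case assumes the pipeline register cuts the multiplier roughly in half.

```python
# Timing-budget sketch for the PE critical path (all delays in ns).

MARGIN_NS = 0.15   # 1.0 ns period minus the ~0.85 ns usable window, from the text
ADDER_NS = 0.25    # ASSUMPTION: 32-bit adder delay, illustrative only

def slack(path_ns, period_ns=1.0):
    """Positive slack means the path fits in the period after margin."""
    return period_ns - MARGIN_NS - path_ns

def single_cycle(mult_ns, period_ns=1.0):
    """Multiplier and adder evaluated in the same cycle."""
    return slack(mult_ns + ADDER_NS, period_ns)

def two_stage(mult_ns, period_ns=1.0):
    """Multiplier split across two cycles; the slower stage sets the slack.
    Assumes the pipeline register cuts the multiplier roughly in half."""
    stage1 = mult_ns / 2
    stage2 = mult_ns / 2 + ADDER_NS
    return slack(max(stage1, stage2), period_ns)

for mult in (0.5, 0.7):
    print(f"mult={mult} ns: "
          f"1 GHz single-cycle={single_cycle(mult):+.2f}  "
          f"1 GHz two-stage={two_stage(mult):+.2f}  "
          f"800 MHz single-cycle={single_cycle(mult, 1.25):+.2f}")
```

With the assumed adder delay, the slow end of the multiplier range misses timing in a single cycle at 1 GHz, while either pipelining or dropping to 800 MHz recovers positive slack.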

Day 14 — Interview Questions

Q1. What is a floorplan and why is it important for SoC design?

A floorplan is the spatial arrangement of major functional blocks (macros, power domains, clock regions, IO pads) on the die before detailed placement and routing. It is important because: (1) Wire length between blocks is fixed by their relative positions — placing the systolic array far from its scratchpad creates long data wires with high latency and power, (2) Power distribution (PDN — power delivery network) must be planned to deliver adequate current to high-power blocks like the systolic array; poor floorplanning leads to IR drop, (3) Clock tree synthesis (CTS) works best when all registers are in a compact, symmetric region — a poorly shaped floorplan creates an unbalanced clock tree with high skew, (4) The floorplan determines which blocks share routing channels, affecting congestion. A good floorplan is shaped to match the design's dataflow — data flows short distances, control flows longer distances.

Q2. What is an isolation cell and when is it required?

An isolation cell is inserted at every signal that crosses from a power-gated domain to an always-on domain. When the power-gated domain (PD_ACCEL) is turned off, its outputs float to an unknown state — this could propagate as random logic into the always-on domain (PD_ALWAYS_ON), causing incorrect behaviour. An isolation cell clamps the output to a defined safe value (0 or 1, specified in the UPF via -clamp_value) when the isolate enable signal is high (domain is off). It is placed on the always-on side, powered by VDD_AON, so it remains functional even when PD_ACCEL is gated. Without isolation cells, power-gating is functionally dangerous — the floating outputs corrupt the CPU's AXI response channel or interrupt signals.
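
The clamping behaviour can be illustrated with a small behavioural model. This is only a sketch of a clamp-to-0 cell, not a library cell definition. The signal names `irq` and `axi_rvalid` are hypothetical, and `None` stands in for an unknown (X) value coming from the powered-down domain.

```python
# Behavioural model of a clamp-to-0 isolation cell (AND-type), illustrative only.
# None stands in for an undriven/unknown (X) value from the powered-down domain.

def iso_clamp0(signal, iso_en):
    """Return the value seen by the always-on domain.

    iso_en=1 -> output forced to 0 regardless of the (possibly unknown) input.
    iso_en=0 -> input passes through unchanged.
    """
    return 0 if iso_en else signal

def always_on_view(accel_outputs, iso_en):
    """Apply isolation to every signal crossing out of PD_ACCEL."""
    return {name: iso_clamp0(value, iso_en)
            for name, value in accel_outputs.items()}

# PD_ACCEL powered down: its outputs are unknown (None).
floating = {"irq": None, "axi_rvalid": None}

print(always_on_view(floating, iso_en=1))  # isolation active: defined values
print(always_on_view(floating, iso_en=0))  # no isolation: unknowns leak through
```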

Q3. What is clock tree synthesis (CTS) and what causes clock skew?

Clock tree synthesis builds the distribution network from the PLL/clock source to every flip-flop's clock pin. The goal is to minimise clock skew (the difference in clock arrival time at different FFs) and insertion delay. The CTS tool inserts buffer trees to drive the large capacitive load of thousands of FFs. Skew arises from unequal wire lengths, unequal buffer counts and loads, and on-chip variation between branches of the tree. Clock skew causes: (1) Hold violations: if the clock arrives early at the launching FF and late at the capturing FF, the new data can race through and overwrite the old data before it is captured, (2) Setup violations: skew in the other direction (capture clock earlier than launch clock) shortens the time available for the data path. For the systolic array, all 16×16 = 256 PE registers must have near-identical clock arrival times (skew budget ~ clock_period × 5% = 50 ps at 1 GHz). CTS achieves this using a balanced H-tree or shielded clock trunk. The PE array's regular structure makes it an ideal CTS target.
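
The effect of skew on both checks can be worked through with the standard slack equations. In this sketch, skew is defined as capture clock arrival minus launch clock arrival. The 1 ns period and 5% skew budget come from the text. The clock-to-Q, logic, setup, and hold delays are illustrative assumptions, not library values.

```python
# Setup/hold slack between two flip-flops as a function of clock skew (ns).
# skew = capture clock arrival - launch clock arrival.
# The cell delays used below are illustrative assumptions, not library values.

def setup_slack(period, skew, clk2q, logic_max, t_setup):
    """Data must arrive t_setup before the next capture edge."""
    return period + skew - (clk2q + logic_max + t_setup)

def hold_slack(skew, clk2q, logic_min, t_hold):
    """Data must not change until t_hold after the same capture edge."""
    return (clk2q + logic_min) - (t_hold + skew)

PERIOD = 1.0                 # 1 GHz, from the text
SKEW_BUDGET = 0.05 * PERIOD  # 5% of the period = 50 ps, from the text

for skew in (-SKEW_BUDGET, 0.0, SKEW_BUDGET):
    s = setup_slack(PERIOD, skew, clk2q=0.05, logic_max=0.80, t_setup=0.03)
    h = hold_slack(skew, clk2q=0.05, logic_min=0.02, t_hold=0.03)
    print(f"skew={skew * 1000:+.0f} ps  "
          f"setup slack={s * 1000:+.0f} ps  hold slack={h * 1000:+.0f} ps")
```

A later capture clock (positive skew) adds setup slack but removes hold slack by the same amount, which is why a short PE-to-PE path can fail hold even when setup is comfortable.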

Q4. What is IR drop and how does it affect the systolic array?

IR drop is the voltage reduction across the power supply network caused by the resistance of metal interconnects (V = I × R). When the systolic array switches all 256 PEs simultaneously (worst case: 0→1 transition on all MAC inputs), the instantaneous current spike causes the local supply voltage to drop below nominal VDD. Lower voltage means slower gates (less gate overdrive, VDD − Vt, so less drive current and more delay), which can cause setup violations: the accumulator add may not complete before the clock edge. Mitigation: (1) Decap cells (on-chip decoupling capacitors) placed near the systolic array absorb the transient current spike, (2) Multiple power straps (wide M8/M9 rails) reduce the resistance path, (3) The power switch (for PD_ACCEL) must be sized to supply the peak current (calculate I = P/V; for 500 mW at 0.7 V, I ≈ 714 mA). IR drop analysis must be run in signoff to verify the drop is < 5% of VDD.
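
The same figures give a first-order resistance budget for the supply path. The sketch below uses only the numbers quoted above (500 mW, 0.7 V, 5% drop) and treats the drop as purely resistive, which is a simplification: it ignores the transient behaviour that decap cells address.

```python
# IR-drop budget arithmetic using the figures quoted in the text.
# Static (resistive) approximation only; transient effects are ignored.

def supply_current(power_w, vdd_v):
    """I = P / V."""
    return power_w / vdd_v

def max_pdn_resistance(power_w, vdd_v, drop_fraction):
    """Largest effective supply-path resistance that keeps the static
    IR drop within drop_fraction of VDD (V = I * R)."""
    allowed_drop_v = drop_fraction * vdd_v
    return allowed_drop_v / supply_current(power_w, vdd_v)

P, VDD = 0.5, 0.7   # 500 mW at 0.7 V, from the text
DROP = 0.05         # 5% of VDD signoff limit, from the text

i_peak = supply_current(P, VDD)
r_max = max_pdn_resistance(P, VDD, DROP)

print(f"current      : {i_peak * 1e3:.0f} mA")
print(f"allowed drop : {DROP * VDD * 1e3:.0f} mV")
print(f"max path R   : {r_max * 1e3:.0f} mOhm")
```

The resulting resistance has to cover the whole path, including the on-resistance of the PD_ACCEL power switch, so the switch alone gets only part of that budget.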

Q5. What is OCV (On-Chip Variation) and how does it affect timing signoff?

OCV refers to variation in transistor characteristics across the die due to process gradients, temperature gradients, and IR drop. Two FFs on the same chip may have different drive strengths and delays even though they're identical in the netlist. For timing signoff, OCV is modelled using derating factors. For a setup check, the launch side (launch clock path plus the data path) is derated late (worst-case delay) and the capture clock path (from the clock source to the capturing FF's clock pin) is derated early (best-case delay); for a hold check the derates are swapped. This creates a pessimistic but safe timing analysis. Advanced tools use AOCV (Advanced OCV) tables that derate paths based on distance and number of stages: paths with more stages average out random variation and need less derate, while paths spread over a larger distance need more. For the systolic array, the regular placement means adjacent PEs have similar OCV, which is a beneficial side effect of the regular structure.
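
A flat-derate setup check can be sketched in a few lines. The ±5% derates and all path delays here are assumed values for illustration. Real signoff tools also remove pessimism on the clock segment shared by launch and capture paths, which this sketch does not model.

```python
# Flat-OCV setup check sketch (delays in ns).
# Late derate on launch clock + data path, early derate on capture clock path.
# Derates and delays are ASSUMED values; common-path pessimism removal is
# not modelled.

def setup_slack_ocv(period, launch_clk, data, capture_clk, t_setup,
                    late=1.0, early=1.0):
    arrival = (launch_clk + data) * late                 # slowest plausible launch
    required = period + capture_clk * early - t_setup    # earliest plausible capture
    return required - arrival

path = dict(period=1.0, launch_clk=0.30, data=0.80,
            capture_clk=0.30, t_setup=0.03)

nominal = setup_slack_ocv(**path)
derated = setup_slack_ocv(**path, late=1.05, early=0.95)

print(f"nominal slack        : {nominal * 1000:+.1f} ps")
print(f"with 5% OCV derates  : {derated * 1000:+.1f} ps")
```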

Q6. How do you handle the power-up sequence for a multi-domain SoC?

The always-on domain (PD_ALWAYS_ON) powers up first and must be fully stable before the accelerator domain (PD_ACCEL) powers up. The power-up sequence: (1) VDD_AON ramps to nominal, with a stabilisation time of typically 100 µs, (2) PLL acquires lock on the system clock, 50–200 µs, (3) CPU comes out of reset and executes boot code, (4) Boot code initialises PLIC, UART, and other always-on peripherals, (5) When the application decides to use the accelerator: write accel_power_gate=0 (enable power switch), wait for VDD_ACCEL to stabilise (~10 µs), deassert accel_iso_en (release isolation), then deassert the accelerator's reset signal. The order matters: if the reset is released while isolation is still active, the always-on domain still sees the clamped values, so anything the accelerator signals while it initialises is lost. If isolation is released before VDD_ACCEL is stable, floating signals corrupt the always-on domain. Always add settling time between power switch enable and isolation release.
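
Step (5) can be sketched as a small ordering model. The class and method names are hypothetical and exist only for illustration. The model checks the order of operations described above and does not represent real settling times or a real driver API.

```python
# Minimal model of the PD_ACCEL power-up ordering described above.
# Names are hypothetical; only the ordering is checked, not real timing.

class AccelPowerModel:
    def __init__(self):
        self.power_gate = 1         # accel_power_gate: 1 = switch off (sleep)
        self.supply_stable = False  # VDD_ACCEL settled?
        self.iso_en = 1             # accel_iso_en: 1 = outputs clamped
        self.reset = 1              # 1 = accelerator held in reset

    def enable_switch(self):
        self.power_gate = 0

    def wait_supply_settle(self):
        if self.power_gate != 0:
            raise RuntimeError("power switch must be enabled before waiting")
        self.supply_stable = True   # stands in for the ~10 us settling delay

    def release_isolation(self):
        if not self.supply_stable:
            raise RuntimeError("isolation released before VDD_ACCEL is stable")
        self.iso_en = 0

    def release_reset(self):
        if self.iso_en != 0:
            raise RuntimeError("reset released while isolation is still active")
        self.reset = 0

def power_up(model):
    """Run the sequence in the order given in the text."""
    model.enable_switch()
    model.wait_supply_settle()
    model.release_isolation()
    model.release_reset()
    return model

accel = power_up(AccelPowerModel())
print(accel.power_gate, accel.iso_en, accel.reset)
```

Calling the steps out of order raises an error, which mirrors the kind of check a power-aware simulation or a firmware assertion would make.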
