Day 30

RTL to Silicon

The complete journey: simulation, synthesis, place & route, tape-out. From Verilog to wafer. Lessons from Google, Apple, NVIDIA.

The Design Flow

Days 26-27: RTL Design (Verilog/SystemVerilog)
│
├─ Simulation (VCS, ModelSim)
├─ Logic synthesis (Cadence Genus, Synopsys DC)
└─ Netlist (gates + wires)
│
Days 28-29: Physical Design
├─ Floorplanning (where to place blocks)
├─ Place & Route (P&R) (Cadence Innovus)
├─ Timing closure (meet clock targets)
├─ Power/ground routing
└─ GDS file (layout geometry on each mask layer)
│
Day 30: Verification & Tape-out
├─ Design rule checks (DRC)
├─ Layout vs schematic (LVS)
├─ Final netlist
└─ TO FOUNDRY (TSMC, Samsung, Intel)

Phase 1: RTL & Verification

Simulation

Prove correctness before synthesis: drive the RTL with a testbench and compare its outputs against a known-good reference model.
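As an illustration of checking a design against a reference, here is a minimal Python sketch rather than an actual VCS or ModelSim testbench. It assumes an output-stationary array in which PE (i, j) accumulates C[i][j]; the function names `reference_matmul`, `systolic_matmul`, and `run_testbench` are hypothetical and not part of the course code.

```python
import random


def reference_matmul(a, b):
    """Golden model: plain triple-loop matrix multiply."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def systolic_matmul(a, b):
    """Cycle-level model of an N x N output-stationary systolic array.

    Assumption: row i of A enters from the left delayed by i cycles and
    column j of B enters from the top delayed by j cycles, so PE (i, j)
    sees the operand pair A[i][k], B[k][j] at cycle t = i + j + k.
    Returns (result matrix, number of cycles simulated).
    """
    n = len(a)
    acc = [[0] * n for _ in range(n)]
    cycles = 3 * n - 2  # the last pair reaches PE (n-1, n-1) at t = 3n - 3
    for t in range(cycles):
        for i in range(n):
            for j in range(n):
                k = t - i - j
                if 0 <= k < n:
                    acc[i][j] += a[i][k] * b[k][j]
    return acc, cycles


def run_testbench(n=4, trials=100, seed=0):
    """Random-stimulus check: the array model must match the golden model."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = [[rng.randint(-128, 127) for _ in range(n)] for _ in range(n)]
        b = [[rng.randint(-128, 127) for _ in range(n)] for _ in range(n)]
        got, _ = systolic_matmul(a, b)
        if got != reference_matmul(a, b):
            return False
    return True
```

In a real flow the second function would be replaced by the simulated RTL, with the golden model supplying expected values to the testbench.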

Linting

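Catch bugs before hardware: static checks flag problems such as unintended latches, width mismatches, and multiply driven signals without running a simulation.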

Phase 2: Logic Synthesis

Verilog → Gate-level netlist

Input: systolic_4x4.sv
Output: systolic_4x4.netlist (references to standard cells)

Standard cell library (5nm, TSMC):
- AND, OR, XOR, MUX gates
- Sequential cells: flip-flops (DFF), latches (DLATCH)
- Specialized: Full adders, multipliers
- Each has timing/power characteristics

Example: 4×4 systolic array synthesizes to ~18,000 gates

Phase 3: Place & Route (P&R)

Arrange gates on die, connect with wires. Critical for timing & power.

Floorplanning

Systolic array floorplan (80 mm² for TPU):

┌─────────────────────────────────────────┐
│ Control (5 mm²)  │ SRAM (10 mm²)        │
├──────────────────┼──────────────────────┤
│                                         │
│        Systolic Array (256×256)         │
│           65K MACs, 65 mm²              │
│                                         │
├──────────────────┴──────────────────────┤
│ Power distribution, clock tree, I/O     │
└─────────────────────────────────────────┘

Critical Path Analysis

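Timing closure: ensure every path meets the clock constraint.

  • Critical path: multiplier + adder ≈ 0.7 ns, against a 1.0 ns clock period at 1 GHz
  • Slack: 1.0 ns - 0.7 ns = 0.3 ns before flip-flop setup time and clock uncertainty are subtracted, so the real margin is smaller
  • If slack is negative: the path must be optimized (buffer insertion, gate sizing)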
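The slack arithmetic above can be sketched in a few lines of Python. The 1.0 ns period and 0.7 ns path delay come from the text; the setup time (0.05 ns) and clock uncertainty (0.10 ns) are assumed values chosen only to show how the margin shrinks, and the function names are hypothetical.

```python
def setup_slack_ns(clock_period_ns, path_delay_ns, setup_ns=0.0, uncertainty_ns=0.0):
    """Setup slack: time left over after the data path, the flip-flop
    setup requirement, and the clock uncertainty are taken out of one period."""
    return clock_period_ns - uncertainty_ns - setup_ns - path_delay_ns


def max_clock_ghz(path_delay_ns, setup_ns=0.0, uncertainty_ns=0.0):
    """Highest clock frequency at which the slack is still non-negative."""
    return 1.0 / (path_delay_ns + setup_ns + uncertainty_ns)


# Figures from the text: 1 GHz clock (1.0 ns period), 0.7 ns critical path.
ideal_slack = setup_slack_ns(1.0, 0.7)

# Same path with assumed (illustrative) setup time and clock uncertainty.
derated_slack = setup_slack_ns(1.0, 0.7, setup_ns=0.05, uncertainty_ns=0.10)

print(f"ideal slack:   {ideal_slack:.2f} ns")
print(f"derated slack: {derated_slack:.2f} ns")
print(f"max clock:     {max_clock_ghz(0.7, 0.05, 0.10):.2f} GHz")
```

Static timing tools report this number for every path in the design; the path with the smallest slack is the one that limits the clock frequency.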

Phase 4: Verification & Tape-Out

Design Rule Checks (DRC)

Ensure layout follows process rules (TSMC 5nm):

  • Minimum metal width (14 nm for metal 1)
  • Minimum spacing between wires (20 nm)
  • Via density rules (must have enough via connections)
  • Electromigration (current density limits)
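To make the width and spacing rules concrete, here is a toy Python check over parallel wires on one metal layer, reduced to one dimension. It uses the 14 nm and 20 nm figures quoted above as illustrative thresholds; real sign-off DRC runs a foundry rule deck over full 2-D geometry, and the `check_track` function and wire names are hypothetical.

```python
MIN_WIDTH_NM = 14    # metal-1 minimum width quoted in the text
MIN_SPACING_NM = 20  # minimum wire-to-wire spacing quoted in the text


def check_track(wires):
    """Check parallel wires given as (name, left_edge_nm, right_edge_nm).

    Returns a list of (rule, offender) tuples: width violations first,
    then spacing violations between neighbouring wires.
    """
    ordered = sorted(wires, key=lambda w: w[1])
    violations = []
    for name, left, right in ordered:
        if right - left < MIN_WIDTH_NM:
            violations.append(("width", name))
    for (name_a, _, right_a), (name_b, left_b, _) in zip(ordered, ordered[1:]):
        if left_b - right_a < MIN_SPACING_NM:
            violations.append(("spacing", f"{name_a}/{name_b}"))
    return violations


wires = [
    ("a", 0, 14),   # 14 nm wide: passes
    ("b", 30, 44),  # only 16 nm away from "a": spacing violation
    ("c", 64, 74),  # 10 nm wide: width violation
]
for rule, offender in check_track(wires):
    print(f"DRC {rule} violation: {offender}")
```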

Layout vs Schematic (LVS)

Verify that the layout matches the final gate-level netlist (the schematic), not the RTL directly:

  • Extract devices and their connections from the GDS, compare to the netlist
  • Check all signals are connected
  • Identify floating nets (dangerous!)
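The comparison step can be sketched in Python under a strong simplifying assumption: instance and net names are identical in the extracted layout and in the reference netlist. Production LVS tools match the two connectivity graphs without relying on names. The netlist format (instance name mapped to a cell type and a pin-to-net dictionary) and the function names are hypothetical, and a net touching fewer than two pins is used here as a rough proxy for a floating net.

```python
from collections import Counter


def compare_netlists(layout, schematic):
    """Report instances that are missing, extra, or wired differently."""
    mismatches = []
    for inst in sorted(set(layout) | set(schematic)):
        if inst not in layout:
            mismatches.append((inst, "missing in layout"))
        elif inst not in schematic:
            mismatches.append((inst, "extra in layout"))
        elif layout[inst] != schematic[inst]:
            mismatches.append((inst, "cell or connection differs"))
    return mismatches


def floating_nets(netlist, ports=()):
    """Nets that touch fewer than two pins and are not top-level ports."""
    pin_counts = Counter()
    for _cell, pins in netlist.values():
        pin_counts.update(pins.values())
    return sorted(net for net, count in pin_counts.items()
                  if count < 2 and net not in ports)


schematic = {
    "u1": ("AND2", {"A": "a", "B": "b", "Y": "n1"}),
    "u2": ("DFF", {"D": "n1", "CK": "clk", "Q": "q"}),
}
# Extracted layout with an open: u2.D landed on a separate net "n2".
layout = {
    "u1": ("AND2", {"A": "a", "B": "b", "Y": "n1"}),
    "u2": ("DFF", {"D": "n2", "CK": "clk", "Q": "q"}),
}
ports = {"a", "b", "clk", "q"}

print(compare_netlists(layout, schematic))
print(floating_nets(layout, ports))
```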

Real Timelines

Stage                    Time (weeks)    Tools
RTL design               6-12            Verilog, Git
Verification             4-8             VCS, formal tools
Synthesis                2-4             Cadence Genus
P&R                      4-8             Cadence Innovus
DRC/LVS                  1-2             Mentor Calibre
To foundry (tape-out)    1               GDS submission
Manufacturing            12-16           TSMC foundry

Real Cost

Full-chip project (256×256 systolic, 5nm):
- Engineering: $2-5M (20-40 engineers × 1 year)
- NRE (non-recurring engineering): $500k-2M
- Masks: $1M
- Simulation licenses: $100k/year
- Tools (Cadence, Synopsys): $50k/year
- First wafer run: $10-20k
- Test/validation: $100k+
- Total: $3-8M before selling first chip

Break-even: ~10,000 units at $1k margin each = $10M gross margin, enough to cover the high end of that range
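The totals and the break-even point can be reproduced from the line items with a short Python sketch. It assumes one year of license and tool fees and takes the "$100k+" test figure at its floor; the dictionary keys and function names are hypothetical.

```python
import math

# (low, high) estimates in USD, copied from the line items above.
COSTS_USD = {
    "engineering": (2_000_000, 5_000_000),
    "nre": (500_000, 2_000_000),
    "masks": (1_000_000, 1_000_000),
    "simulation licenses (1 year)": (100_000, 100_000),
    "eda tools (1 year)": (50_000, 50_000),
    "first wafer run": (10_000, 20_000),
    "test/validation (floor)": (100_000, 100_000),
}


def total_cost(costs):
    """Sum the low and high ends of every line item."""
    low = sum(lo for lo, _ in costs.values())
    high = sum(hi for _, hi in costs.values())
    return low, high


def break_even_units(cost_usd, margin_per_unit_usd):
    """Units that must be sold for cumulative margin to cover the cost."""
    return math.ceil(cost_usd / margin_per_unit_usd)


low, high = total_cost(COSTS_USD)
print(f"total: ${low:,} to ${high:,}")
print(f"break-even at $1k margin: {break_even_units(low, 1_000):,} "
      f"to {break_even_units(high, 1_000):,} units")
```

Under these assumptions the itemized sum lands close to the quoted $3-8M range, and the break-even volume scales inversely with the per-unit margin.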

30-Day Journey Recap

What You Learned

  • Days 1-10: Why AI chips, systolic arrays, dataflow patterns, real implementations
  • Days 11-15: Precision (FP32, BF16, INT8), quantization techniques, production flows
  • Days 16-20: Memory, bandwidth, roofline model, cache hierarchy, NoC, co-design
  • Days 21-25: Real chips (Apple A17, Google TPU v4, NVIDIA H100, specialized ASICs, mobile)
  • Days 26-30: Building hardware: MAC design, systolic arrays, power, area, tape-out

Next Steps

  • Build: Implement a 16×16 systolic array in Verilog (extend Day 27)
  • Simulate: Run matrix multiply, verify correctness
  • Synthesize: Use open-source tools (Yosys) or academic licenses
  • Join: Teams at Google, NVIDIA, Apple, Qualcomm (they hire people who ship chip RTL)
  • Research: New architectures (dataflow computers, analog AI, quantum acceleration)

The Big Picture

AI chip design is not just about peak TFLOPS. It's a delicate balance:

  • Performance ↔ Power consumption (TFLOPS/W)
  • Throughput ↔ Latency (batch inference vs real-time)
  • Specialization ↔ Flexibility (systolic vs GPU)
  • Area ↔ Cost (die size, yield, manufacturing)

The best chip solves the customer's problem with the fewest resources (power, area, cost, latency).

🎓 You've completed the 30-day AI Chip Design course! 🎓