AI Chip Design Day 30

The Design Flow

Days 26-27: RTL Design (Verilog/SystemVerilog)
│
├─ Simulation (VCS, ModelSim)
├─ Logic synthesis (Cadence Genus, Synopsys DC)
└─ Netlist (gates + wires)
│
Days 28-29: Physical Design
├─ Floorplanning (where to place blocks)
├─ Place & Route (P&R) (Cadence Innovus)
├─ Timing closure (meet clock targets)
├─ Power/ground routing
└─ GDS file (layout as mask geometry)
│
Day 30: Verification & Tape-out
├─ Design rule checks (DRC)
├─ Layout vs schematic (LVS)
├─ Final netlist
└─ TO FOUNDRY (TSMC, Samsung, Intel)

Phase 1: RTL & Verification

Simulation

Establish correctness before synthesis:

Testbenches (SystemVerilog): drive inputs, check outputs
Coverage: Did we test all code paths? (>95% target)
Formal verification: Mathematical proof of correctness (for critical blocks)
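
The "drive inputs, check outputs" pattern and the coverage question can be sketched in plain Python before any SystemVerilog is written. This is a minimal illustration, not the course's testbench: `mac_dut` is a hypothetical stand-in for values read back from a simulator, and the 8-bit operands, 32-bit accumulator, and coverage bins are all assumptions.

```python
import random

ACC_BITS = 32                 # assumed accumulator width
MASK = (1 << ACC_BITS) - 1


def mac_reference(acc, a, b):
    """Golden model: multiply-accumulate with wraparound at ACC_BITS."""
    return (acc + a * b) & MASK


def mac_dut(acc, a, b):
    """Hypothetical stand-in for the simulated RTL. A real flow would
    read this value back from the simulator instead of computing it."""
    return (acc + a * b) & MASK


def run_testbench(num_random=1000, seed=0):
    rng = random.Random(seed)
    # Functional coverage bins: which operand corner cases were exercised.
    bins = {"a_zero": False, "b_zero": False, "a_max": False, "b_max": False}
    # Directed tests hit the corners; random tests fill in the rest.
    directed = [(0, 0), (0, 255), (255, 0), (255, 255)]
    stimulus = directed + [(rng.randrange(256), rng.randrange(256))
                           for _ in range(num_random)]
    acc_ref = acc_dut = 0
    mismatches = 0
    for a, b in stimulus:
        acc_ref = mac_reference(acc_ref, a, b)
        acc_dut = mac_dut(acc_dut, a, b)
        if acc_ref != acc_dut:
            mismatches += 1
        bins["a_zero"] |= (a == 0)
        bins["b_zero"] |= (b == 0)
        bins["a_max"] |= (a == 255)
        bins["b_max"] |= (b == 255)
    coverage = sum(bins.values()) / len(bins)
    return mismatches, coverage


mismatches, coverage = run_testbench()
print(f"mismatches={mismatches}, coverage={coverage:.0%}")
```

The directed corner cases are what guarantee the coverage bins get hit; random stimulus alone may or may not reach them.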

Linting

Catch bugs before hardware:

Unused variables
Logic errors (combinational loops, latches inferred where flip-flops were intended)
Clock domain crossing issues
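
One of the lint checks above, combinational loop detection, comes down to finding a cycle in the signal dependency graph. The sketch below assumes a simplified representation that is not taken from any real lint tool: `comb_fanin` maps each combinationally driven signal to the signals it depends on, and anything not in the map (primary inputs, flip-flop outputs) ends the path.

```python
def find_combinational_loop(comb_fanin):
    """Return a list of signals forming a combinational loop, or None.

    comb_fanin: dict mapping a combinational signal name to the list of
    signal names it depends on. Primary inputs and register outputs are
    not keys, so they terminate the search.
    """
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []

    def visit(sig):
        if sig not in comb_fanin:
            return None                      # input or register output
        s = state.get(sig, UNVISITED)
        if s == DONE:
            return None
        if s == IN_PROGRESS:                 # reached a signal on our own path
            return path[path.index(sig):] + [sig]
        state[sig] = IN_PROGRESS
        path.append(sig)
        for src in comb_fanin[sig]:
            loop = visit(src)
            if loop:
                return loop
        path.pop()
        state[sig] = DONE
        return None

    for sig in comb_fanin:
        loop = visit(sig)
        if loop:
            return loop
    return None


# "q" is a flip-flop output, so feedback through it is legal.
clean = {"sum": ["a", "b"], "y": ["sum", "q"]}
# "x" depends on "y" and "y" on "x" with no register in between.
looped = {"x": ["y", "a"], "y": ["x"]}

print(find_combinational_loop(clean))    # None
print(find_combinational_loop(looped))   # ['x', 'y', 'x']
```

The recursive search is fine for small examples; a large netlist would call for an iterative version.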

Phase 2: Logic Synthesis

Verilog → Gate-level netlist

Input: systolic_4x4.sv
Output: systolic_4x4.netlist (references to standard cells)

Standard cell library (5nm, TSMC):
- AND, OR, XOR, MUX gates
- Sequential cells: flip-flops (DFF), latches (DLATCH)
- Specialized: Full adders, multipliers
- Each cell has timing/power characteristics

Example: 4×4 systolic array synthesizes to ~18,000 gates
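
To put the ~18,000-gate figure in perspective, here is a back-of-the-envelope calculation. It assumes the gates are spread evenly across the 16 processing elements (PEs) and that gate count scales linearly with array size. Both are simplifications, since shared control logic does not scale that way.

```python
def gates_per_pe(total_gates, rows, cols):
    """Average gate count per processing element (PE)."""
    return total_gates / (rows * cols)


def extrapolate_gates(per_pe, rows, cols):
    """Linear extrapolation to another array size (ignores shared logic)."""
    return per_pe * rows * cols


per_pe = gates_per_pe(18_000, 4, 4)
full_array = extrapolate_gates(per_pe, 256, 256)

print(f"{per_pe:.0f} gates per PE")                   # 1125 gates per PE
print(f"{full_array / 1e6:.1f}M gates for 256x256")   # 73.7M gates for 256x256
```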

Phase 3: Place & Route (P&R)

Arrange gates on die, connect with wires. Critical for timing & power.

Floorplanning

Systolic array floorplan (80 mm² for TPU):
┌─────────────────────────────────────────┐
│ Control (5 mm²)  │ SRAM (10 mm²)        │
├──────────────────┼──────────────────────┤
│                                         │
│        Systolic Array (256×256)         │
│        65K MACs, 65 mm²                 │
│                                         │
├──────────────────┴──────────────────────┤
│ Power distribution, clock tree, I/O     │
└─────────────────────────────────────────┘
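
The floorplan numbers can be sanity-checked with a few lines of arithmetic. This sketch uses only the areas stated above. The power/clock/I/O strip has no stated area, so it is left out of the budget, which is an assumption.

```python
UM2_PER_MM2 = 1_000_000   # 1 mm² = 10^6 µm²

blocks_mm2 = {"control": 5.0, "sram": 10.0, "systolic_array": 65.0}
die_budget_mm2 = 80.0
num_macs = 256 * 256      # 65,536, the "65K MACs" above


def floorplan_summary(blocks, budget_mm2, macs):
    """Check the block areas against the die budget and derive area per MAC."""
    used = sum(blocks.values())
    return {
        "used_mm2": used,
        "fits": used <= budget_mm2,
        "utilization": used / budget_mm2,
        "um2_per_mac": blocks["systolic_array"] * UM2_PER_MM2 / macs,
    }


summary = floorplan_summary(blocks_mm2, die_budget_mm2, num_macs)
print(summary["used_mm2"], summary["fits"])       # 80.0 True
print(f"{summary['um2_per_mac']:.0f} µm² per MAC")  # 992 µm² per MAC
```

The three listed blocks already use the full 80 mm², so any real area for power distribution and I/O would have to come out of one of them or enlarge the die.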

Critical Path Analysis

Timing closure: ensure every path meets its clock constraint

Critical path: Multiplier + adder = ~0.7 ns (70% of the 1.0 ns period at 1 GHz)
Slack: 1.0 ns - 0.7 ns = 0.3 ns (before subtracting flip-flop setup time and clock uncertainty)
If slack is negative: must optimize (buffer insertion, gate sizing)
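
The slack arithmetic can be written out as a small calculation. The 1.0 ns period and 0.7 ns path delay come from the text. The 0.05 ns setup time and 0.05 ns clock uncertainty in the second case are illustrative assumptions, not library values.

```python
def setup_slack_ns(period_ns, path_delay_ns, setup_ns=0.0, uncertainty_ns=0.0):
    """Setup slack: time left in the clock period after the data path,
    the capturing flip-flop's setup time, and clock uncertainty."""
    return period_ns - path_delay_ns - setup_ns - uncertainty_ns


def max_frequency_ghz(path_delay_ns, setup_ns=0.0, uncertainty_ns=0.0):
    """Highest clock frequency at which slack is still non-negative."""
    return 1.0 / (path_delay_ns + setup_ns + uncertainty_ns)


# Ideal case from the text: 1 GHz clock, 0.7 ns multiplier + adder path.
ideal = setup_slack_ns(1.0, 0.7)
# Same path with assumed setup time and clock uncertainty.
derated = setup_slack_ns(1.0, 0.7, setup_ns=0.05, uncertainty_ns=0.05)

print(f"ideal slack:   {ideal:.2f} ns")     # 0.30 ns
print(f"derated slack: {derated:.2f} ns")   # 0.20 ns
print(f"fmax (derated): {max_frequency_ghz(0.7, 0.05, 0.05):.2f} GHz")  # 1.25 GHz
```

Positive slack means the path meets timing. A negative result is the signal to start buffer insertion or gate sizing.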

Phase 4: Verification & Tape-Out

Design Rule Checks (DRC)

Ensure layout follows process rules (TSMC 5nm):

Minimum metal width (14 nm for metal 1)
Minimum spacing between wires (20 nm)
Via density rules (must have enough via connections)
Electromigration (current density limits, typically verified in a separate EM/IR sign-off run)
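
A width and spacing check can be illustrated in one dimension. This toy sketch assumes parallel wires on a single metal layer, each described by hypothetical `(left_edge_nm, right_edge_nm)` coordinates, and reuses the 14 nm and 20 nm figures above as example limits. Real DRC decks run on 2D polygons with far more rules.

```python
def check_width_spacing(wires, min_width=14, min_spacing=20):
    """Return a list of violations for parallel wires on one layer.

    wires: list of (left_edge_nm, right_edge_nm) tuples.
    """
    violations = []
    ordered = sorted(wires)
    for i, (lo, hi) in enumerate(ordered):
        if hi - lo < min_width:
            violations.append(("width", (lo, hi)))
        if i + 1 < len(ordered):
            next_wire = ordered[i + 1]
            # Gap between this wire's right edge and the next wire's left edge.
            if next_wire[0] - hi < min_spacing:
                violations.append(("spacing", (lo, hi), next_wire))
    return violations


wires = [(0, 14), (34, 48), (60, 70)]
for v in check_width_spacing(wires):
    print(v)
# ('spacing', (34, 48), (60, 70))   gap is 12 nm, below 20 nm
# ('width', (60, 70))               width is 10 nm, below 14 nm
```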

Layout vs Schematic (LVS)

Verify the layout matches the final gate-level netlist (the "schematic"):

Extract transistors from GDS, compare to netlist
Check all signals are connected
Identify floating nets (dangerous!)
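
The comparison step can be sketched with dictionaries. This is a deliberately simplified model: it assumes net and instance names survive extraction, so nets can be matched by name, whereas production LVS tools match by circuit topology. The net names and pin tuples are hypothetical.

```python
def compare_netlists(schematic, layout):
    """Compare two netlists given as {net_name: set of (instance, pin)}.

    Returns nets missing from the layout, extra nets in the layout,
    nets whose connections differ, and layout nets with fewer than
    two connections (floating).
    """
    sch_nets, lay_nets = set(schematic), set(layout)
    return {
        "missing_nets": sorted(sch_nets - lay_nets),
        "extra_nets": sorted(lay_nets - sch_nets),
        "mismatched": sorted(n for n in sch_nets & lay_nets
                             if schematic[n] != layout[n]),
        "floating": sorted(n for n, pins in layout.items() if len(pins) < 2),
    }


schematic = {
    "n1": {("U1", "Y"), ("U2", "A")},
    "n2": {("U2", "Y"), ("U3", "A")},
}
# In the extracted layout, n2 never reaches U3: an open circuit.
layout = {
    "n1": {("U1", "Y"), ("U2", "A")},
    "n2": {("U2", "Y")},
}

report = compare_netlists(schematic, layout)
print(report["mismatched"])   # ['n2']
print(report["floating"])     # ['n2']
```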

Real Timelines

Stage	Time (weeks)	Tools
RTL design	6-12	Verilog, Git
Verification	4-8	VCS, formal tools
Synthesis	2-4	Cadence Genus
P&R	4-8	Cadence Innovus
DRC/LVS	1-2	Siemens Calibre (formerly Mentor)
To foundry (tape-out)	1	GDS submission
Manufacturing	12-16	TSMC foundry
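
Adding up the table gives a rough end-to-end schedule. The sketch assumes the stages run strictly one after another. In practice verification overlaps RTL design, so this is an upper-bound style estimate.

```python
# (stage, min_weeks, max_weeks), taken from the table above
stages = [
    ("RTL design", 6, 12),
    ("Verification", 4, 8),
    ("Synthesis", 2, 4),
    ("P&R", 4, 8),
    ("DRC/LVS", 1, 2),
    ("Tape-out", 1, 1),
    ("Manufacturing", 12, 16),
]


def schedule_range(stage_list):
    """Sum the minimum and maximum durations of sequential stages."""
    return (sum(lo for _, lo, _ in stage_list),
            sum(hi for _, _, hi in stage_list))


pre_silicon = schedule_range(stages[:-1])   # everything up to tape-out
total = schedule_range(stages)

print(f"Design to tape-out: {pre_silicon[0]}-{pre_silicon[1]} weeks")  # 18-35 weeks
print(f"Including manufacturing: {total[0]}-{total[1]} weeks")         # 30-51 weeks
```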

Real Cost

Full-chip project (256×256 systolic, 5nm):
- Engineering: $2-5M (20-40 engineers × 1 year)
- NRE (non-recurring engineering): $500k-2M
- Masks: $1M
- Simulation licenses: $100k/year
- Tools (Cadence, Synopsys): $50k/year
- First wafer run: $10-20k
- Test/validation: $100k+
- Total: roughly $4-8M before selling first chip

Break-even: at a $1k margin per unit, recovering $4-8M takes roughly 4,000-8,000 units; ~10,000 units ($10M in margin) covers the upper end with headroom.
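
The cost roll-up and break-even point can be checked by summing the line items. The sketch assumes one year of license costs and takes "Test/validation: $100k+" at its $100k floor for both bounds. The $1k margin per unit is the figure used above.

```python
# (item, low_usd, high_usd), from the list above
costs = [
    ("Engineering", 2_000_000, 5_000_000),
    ("NRE", 500_000, 2_000_000),
    ("Masks", 1_000_000, 1_000_000),
    ("Simulation licenses (1 year)", 100_000, 100_000),
    ("Tools (1 year)", 50_000, 50_000),
    ("First wafer run", 10_000, 20_000),
    ("Test/validation (floor)", 100_000, 100_000),
]


def total_cost(items):
    """Sum the low and high estimates across all line items."""
    return (sum(lo for _, lo, _ in items), sum(hi for _, _, hi in items))


def break_even_units(cost_usd, margin_per_unit_usd):
    """Smallest whole number of units whose margin covers the cost."""
    return -(-cost_usd // margin_per_unit_usd)   # ceiling division


low, high = total_cost(costs)
print(f"Total: ${low:,} to ${high:,}")          # $3,760,000 to $8,270,000
print(break_even_units(low, 1_000),             # 3760
      break_even_units(high, 1_000))            # 8270
```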

30-Day Journey Recap

What You Learned

Days 1-10: Why AI chips, systolic arrays, dataflow patterns, real implementations
Days 11-15: Precision (FP32, BF16, INT8), quantization techniques, production flows
Days 16-20: Memory, bandwidth, roofline model, cache hierarchy, NoC, co-design
Days 21-25: Real chips (Apple A17, Google TPU v4, NVIDIA H100, specialized ASICs, mobile)
Days 26-30: Building hardware: MAC design, systolic arrays, power, area, tape-out

Next Steps

Build: Implement a 16×16 systolic array in Verilog (extend Day 27)
Simulate: Run matrix multiply, verify correctness
Synthesize: Using open-source tools (yosys) or academic licenses
Join: Teams at Google, NVIDIA, Apple, Qualcomm (they hire people who ship chip RTL)
Research: New architectures (dataflow computers, analog AI, quantum acceleration)

The Big Picture

AI chip design is not just about peak TFLOPS. It's a delicate balance:

Performance ↔ Power consumption (TFLOPS/W)
Throughput ↔ Latency (batch inference vs real-time)
Specialization ↔ Flexibility (systolic vs GPU)
Area ↔ Cost (die size, yield, manufacturing)

The best chip solves the customer's problem with the fewest resources (power, area, cost, latency).

🎓 You've completed the 30-day AI Chip Design course! 🎓
