The Design Flow
Days 26-27: RTL Design (Verilog/SystemVerilog)
│
├─ Simulation (VCS, ModelSim)
├─ Logic synthesis (Cadence Genus, Synopsys DC)
└─ Netlist (gates + wires)
│
Days 28-29: Physical Design
├─ Floorplanning (where to place blocks)
├─ Place & Route (P&R) (Cadence Innovus)
├─ Timing closure (meet clock targets)
├─ Power/ground routing
└─ GDSII file (layout as polygons on each mask layer)
│
Day 30: Verification & Tape-out
├─ Design rule checks (DRC)
├─ Layout vs schematic (LVS)
├─ Final netlist
└─ TO FOUNDRY (TSMC, Samsung, Intel)
Phase 1: RTL & Verification
Simulation
Build confidence in correctness before synthesis:
- Testbenches (SystemVerilog): drive inputs, check outputs
- Coverage: Did we test all code paths? (>95% target)
- Formal verification: Mathematical proof of correctness (for critical blocks)
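To make "drive inputs, check outputs" concrete, here is a minimal Python sketch of the kind of golden reference model a testbench compares RTL results against. It assumes an output-stationary dataflow, which this section does not specify. The function names `reference_matmul` and `systolic_matmul` are hypothetical.

```python
def reference_matmul(A, B):
    """Plain matrix multiply: the 'expected' side of the check."""
    n, k_dim, m = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(k_dim)) for j in range(m)]
            for i in range(n)]


def systolic_matmul(A, B):
    """Cycle-level model of an n x m output-stationary systolic array.

    Row i of A enters from the left delayed by i cycles, column j of B
    enters from the top delayed by j cycles. Each PE multiplies the two
    values passing through it and accumulates the product locally.
    """
    n, k_dim, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]
    a_reg = [[0] * m for _ in range(n)]  # values moving left to right
    b_reg = [[0] * m for _ in range(n)]  # values moving top to bottom
    for t in range(k_dim + n + m - 2):
        new_a = [[0] * m for _ in range(n)]
        new_b = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:  # left edge: inject skewed A, zero when out of range
                    new_a[i][j] = A[i][t - i] if 0 <= t - i < k_dim else 0
                else:
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:  # top edge: inject skewed B, zero when out of range
                    new_b[i][j] = B[t - j][j] if 0 <= t - j < k_dim else 0
                else:
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert systolic_matmul(A, B) == reference_matmul(A, B)
```

In a real flow the same comparison runs inside the simulator, with the RTL outputs in place of `systolic_matmul`.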
Linting
Catch bugs before hardware:
- Unused variables
- Logic errors (combinational loops, latches inferred where flip-flops were intended)
- Clock domain crossing issues
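One of the checks above, combinational loop detection, comes down to finding a cycle in a directed graph of signals. The sketch below is a toy version, not how any particular lint tool works. The `fanout` dictionary format and the function name are assumptions for illustration, and edges through flip-flops are assumed to be left out of the graph already.

```python
def find_combinational_loop(fanout):
    """Return one loop as a list of signal names, or None if there is none.

    fanout maps each signal to the signals it drives through purely
    combinational logic (hypothetical format for this sketch).
    """
    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state = {}
    path = []

    def visit(node):
        state[node] = ACTIVE
        path.append(node)
        for nxt in fanout.get(node, []):
            s = state.get(nxt, UNSEEN)
            if s == ACTIVE:  # reached a signal already on the current path
                return path[path.index(nxt):] + [nxt]
            if s == UNSEEN:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in list(fanout):
        if state.get(node, UNSEEN) == UNSEEN:
            found = visit(node)
            if found:
                return found
    return None


clean = {"a": ["b"], "b": ["c"], "c": []}
looped = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(find_combinational_loop(clean))   # None
print(find_combinational_loop(looped))  # ['a', 'b', 'c', 'a']
```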
Phase 2: Logic Synthesis
Verilog → Gate-level netlist
Input: systolic_4x4.sv
Output: systolic_4x4.netlist (instances of standard cells and the nets connecting them)
Standard cell library (5nm, TSMC):
- AND, OR, XOR, MUX gates
- Sequential cells: flip-flops (DFF), latches (DLATCH)
- Specialized arithmetic cells: full adders, half adders (multipliers are built from these during synthesis)
- Each has timing/power characteristics
Example: 4×4 systolic array synthesizes to ~18,000 gates
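For a rough feel of how that figure scales, the sketch below extrapolates from the ~18,000-gate number. It assumes gate count grows linearly with the number of processing elements, which ignores control logic, buffering, and any optimization the synthesis tool does. Treat the results as order-of-magnitude only. The function name is hypothetical.

```python
GATES_4X4 = 18_000  # figure quoted above for a 4x4 array


def estimate_gates(n, reference_n=4, reference_gates=GATES_4X4):
    """Linear extrapolation: gates per PE times the number of PEs (assumption)."""
    return reference_gates * n * n // (reference_n * reference_n)


for size in (4, 16, 256):
    print(f"{size}x{size}: ~{estimate_gates(size):,} gates")
```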
Phase 3: Place & Route (P&R)
Arrange gates on die, connect with wires. Critical for timing & power.
Floorplanning
Systolic array floorplan (80 mm² for TPU):
┌──────────────────┬──────────────────────┐
│ Control (5 mm²)  │ SRAM (10 mm²)        │
├──────────────────┴──────────────────────┤
│                                         │
│        Systolic Array (256×256)         │
│            65K MACs, 65 mm²             │
│                                         │
├─────────────────────────────────────────┤
│ Power distribution, clock tree, I/O     │
└─────────────────────────────────────────┘
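A quick sanity check on the floorplan numbers, as a small Python sketch: the three blocks should add up to the 80 mm² total, and dividing the array area by the MAC count gives the area budget per MAC. The variable names are made up for this example.

```python
BLOCK_AREAS_MM2 = {"control": 5, "sram": 10, "systolic_array": 65}
ARRAY_DIM = 256

total_mm2 = sum(BLOCK_AREAS_MM2.values())
mac_count = ARRAY_DIM * ARRAY_DIM

# 1 mm^2 = 1,000,000 um^2
area_per_mac_um2 = BLOCK_AREAS_MM2["systolic_array"] * 1_000_000 / mac_count

print(f"Total die area: {total_mm2} mm^2")
print(f"MACs: {mac_count}")
print(f"Area budget per MAC: {area_per_mac_um2:.1f} um^2")
```

Under these figures each MAC, including its share of local wiring, gets a little under 1,000 µm².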
Critical Path Analysis
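Timing closure: ensure every path meets the clock constraint
- Critical path: multiplier + adder ≈ 0.7 ns, against a 1.0 ns clock period at 1 GHz
- Slack: 1.0 ns - 0.7 ns = 0.3 ns before setup time and clock uncertainty are subtracted, so the usable margin is smaller than it looks
- If slack is negative: optimize the path (buffer insertion, gate sizing)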
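The slack arithmetic can be sketched in a few lines of Python. The 1.0 ns period and 0.7 ns path delay come from the text above. The setup time (0.05 ns) and clock uncertainty (0.10 ns) are assumed values chosen only to show how the margin shrinks. They are not taken from any real library.

```python
def setup_slack_ns(clock_period_ns, path_delay_ns, setup_ns=0.0, uncertainty_ns=0.0):
    """Setup slack: time left in the cycle after the data path has settled."""
    return clock_period_ns - uncertainty_ns - setup_ns - path_delay_ns


def max_frequency_ghz(path_delay_ns, setup_ns=0.0, uncertainty_ns=0.0):
    """Highest clock rate at which the slack is still non-negative."""
    return 1.0 / (path_delay_ns + setup_ns + uncertainty_ns)


ideal = setup_slack_ns(1.0, 0.7)
realistic = setup_slack_ns(1.0, 0.7, setup_ns=0.05, uncertainty_ns=0.10)  # assumed overheads

print(f"Ideal slack:     {ideal:.2f} ns")
print(f"With overheads:  {realistic:.2f} ns")
print(f"Max frequency:   {max_frequency_ghz(0.7, 0.05, 0.10):.2f} GHz")
```

A negative result from `setup_slack_ns` is the case where the path needs optimization.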
Phase 4: Verification & Tape-Out
Design Rule Checks (DRC)
Ensure layout follows process rules (TSMC 5nm):
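- Minimum metal width (e.g., 14 nm for metal 1; illustrative, the actual limits come from the foundry's rule deck)
- Minimum spacing between wires (e.g., 20 nm)
- Via density rules (enough vias per connection and per area)
- Electromigration (current density limits; typically checked in a separate reliability signoff step)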
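As a toy illustration of what a width and spacing rule check does, the sketch below treats wires on one routing track as 1-D intervals in nanometers. Real DRC works on 2-D polygons with far more rules. The 14 nm and 20 nm defaults are the illustrative figures from the list above, and `check_wires` is a hypothetical helper that assumes the wires do not overlap.

```python
def check_wires(wires, min_width=14, min_spacing=20):
    """Return a list of rule violations for wires given as (left, right) in nm."""
    violations = []
    ordered = sorted(wires)
    for left, right in ordered:
        if right - left < min_width:
            violations.append(("width", (left, right)))
    # Compare each wire with its right-hand neighbour on the track.
    for (l1, r1), (l2, r2) in zip(ordered, ordered[1:]):
        if l2 - r1 < min_spacing:
            violations.append(("spacing", (l1, r1), (l2, r2)))
    return violations


track = [(0, 14), (34, 48), (60, 70)]
for violation in check_wires(track):
    print(violation)
```

Here the third wire is too narrow (10 nm) and sits only 12 nm from its neighbour, so both rules fire.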
Layout vs Schematic (LVS)
Verify the layout matches the post-synthesis netlist (RTL-to-netlist equivalence is a separate check):
- Extract transistors from GDS, compare to netlist
- Check all signals are connected
- Identify floating nets (dangerous!)
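The comparison step can be sketched as a connectivity diff. This is a heavily simplified model. It assumes both netlists use the same net names and represents each net as a set of `instance.pin` strings, whereas real LVS tools match devices and nets structurally. All names below are hypothetical.

```python
def compare_netlists(schematic, layout):
    """Diff two {net_name: set_of_pins} dictionaries and flag floating nets."""
    report = {"missing": [], "extra": [], "mismatched": [], "floating": []}
    for net, pins in schematic.items():
        if net not in layout:
            report["missing"].append(net)      # net never drawn in the layout
        elif layout[net] != pins:
            report["mismatched"].append(net)   # connected to different pins
    for net, pins in layout.items():
        if net not in schematic:
            report["extra"].append(net)        # layout net with no schematic match
        if len(pins) < 2:
            report["floating"].append(net)     # only one pin: nothing on the other end
    return report


schematic = {"n1": {"U1.Y", "U2.A"}, "n2": {"U2.Y", "U3.A"}}
layout = {"n1": {"U1.Y", "U2.A"}, "n2": {"U2.Y"}, "n3": {"U4.A"}}
print(compare_netlists(schematic, layout))
```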
Real Timelines
| Stage | Time (weeks) | Tools |
|---|---|---|
| RTL design | 6-12 | Verilog, Git |
| Verification | 4-8 | VCS, formal tools |
| Synthesis | 2-4 | Cadence Genus |
| P&R | 4-8 | Cadence Innovus |
| DRC/LVS | 1-2 | Siemens Calibre (formerly Mentor) |
| To foundry (tape-out) | 1 | GDS submission |
| Manufacturing | 12-16 | TSMC foundry |
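
Adding up the table gives a rough end-to-end schedule. The sketch below assumes the stages run strictly one after another, which is a simplification (verification usually overlaps RTL design), so read the result as an upper-bound style estimate. The dictionary and function names are made up for this example.

```python
# (min_weeks, max_weeks) per stage, copied from the table above
STAGES_WEEKS = {
    "RTL design": (6, 12),
    "Verification": (4, 8),
    "Synthesis": (2, 4),
    "P&R": (4, 8),
    "DRC/LVS": (1, 2),
    "Tape-out": (1, 1),
    "Manufacturing": (12, 16),
}


def total_weeks(stages):
    """Sum the low and high ends, assuming no overlap between stages."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


low, high = total_weeks(STAGES_WEEKS)
print(f"RTL to first silicon: {low}-{high} weeks")  # 30-51 weeks
```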
Real Cost
Full-chip project (256×256 systolic, 5nm):
- Engineering: $2-5M (20-40 engineers × 1 year)
- NRE (non-recurring engineering): $500k-2M
- Masks: $1M
- Simulation licenses: $100k/year
- Tools (Cadence, Synopsys): $50k/year
- First wafer run: $10-20k
- Test/validation: $100k+
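- Total: roughly $3.8-8.3M before selling the first chip
Break-even: at a $1k margin per unit, recovering that takes roughly 3,800-8,300 units; 10,000 units ($10M gross margin) covers it with room for overruns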
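The break-even arithmetic can be checked with a short Python sketch that sums the line items above. It assumes a single year of licenses and a flat $1k margin per unit. Amounts are kept in thousands of dollars to stay in integer math, and the names are hypothetical.

```python
import math

# (low, high) in thousands of dollars, copied from the list above
COSTS_K = {
    "engineering": (2000, 5000),
    "nre": (500, 2000),
    "masks": (1000, 1000),
    "simulation_licenses": (100, 100),  # one year assumed
    "tools": (50, 50),                  # one year assumed
    "first_wafer_run": (10, 20),
    "test_validation": (100, 100),
}
MARGIN_PER_UNIT_K = 1  # $1k margin per chip


def total_cost_k(costs):
    """Sum the low and high ends of every line item."""
    return (sum(lo for lo, _ in costs.values()),
            sum(hi for _, hi in costs.values()))


def break_even_units(cost_k, margin_k=MARGIN_PER_UNIT_K):
    """Units needed before cumulative margin covers the up-front cost."""
    return math.ceil(cost_k / margin_k)


low_k, high_k = total_cost_k(COSTS_K)
print(f"Up-front cost: ${low_k / 1000:.2f}M to ${high_k / 1000:.2f}M")
print(f"Break-even: {break_even_units(low_k):,} to {break_even_units(high_k):,} units")
```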
30-Day Journey Recap
What You Learned
- Days 1-10: Why AI chips, systolic arrays, dataflow patterns, real implementations
- Days 11-15: Precision (FP32, BF16, INT8), quantization techniques, production flows
- Days 16-20: Memory, bandwidth, roofline model, cache hierarchy, NoC, co-design
- Days 21-25: Real chips (Apple A17, Google TPU v4, NVIDIA H100, specialized ASICs, mobile)
- Days 26-30: Building hardware: MAC design, systolic arrays, power, area, tape-out
Next Steps
- Build: Implement a 16×16 systolic array in Verilog (extend Day 27)
- Simulate: Run matrix multiply, verify correctness
- Synthesize: Use open-source tools (Yosys) or academic licenses
- Join: Teams at Google, NVIDIA, Apple, Qualcomm (they hire people who ship chip RTL)
- Research: New architectures (dataflow computers, analog AI, quantum acceleration)
The Big Picture
AI chip design is not just about peak TFLOPS. It's a delicate balance:
- Performance ↔ Power consumption (TFLOPS/W)
- Throughput ↔ Latency (batch inference vs real-time)
- Specialization ↔ Flexibility (systolic vs GPU)
- Area ↔ Cost (die size, yield, manufacturing)
The best chip solves the customer's problem with the least resources (power, area, cost, latency).
🎓 You've completed the 30-day AI Chip Design course! 🎓