1. Technology Node Selection
| Node | Year | Density | Cost/die | Use Case |
|---|---|---|---|---|
| 5nm | 2020+ | Highest | High | High-perf (GPU, TPU) |
| 7nm | 2018+ | High | Medium | Balanced (inference) |
| 14nm | 2015+ | Medium | Low | Mobile, volume |
| 28nm | 2012+ | Low | Very Low | Legacy, cost-critical |
Trade-off: An advanced node gives higher density and performance, but its higher mask cost needs high volume to amortize.
2. Architecture Decision Tree
Question 1: Use case?
- Inference only → Simpler, optimize latency/throughput
- Training + inference → More complex; needs both forward and backward passes
Question 2: Performance target?
- Latency-critical → Few large cores, high clock
- Throughput-critical → Many smaller cores, lower clock
Question 3: Flexibility?
- Specialized → Systolic array (fast, inflexible)
- General-purpose → GPU-like (slower, flexible)
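
The three questions above can be sketched as a small lookup function. This is only an illustration of the decision tree as written; the names `Requirements` and `recommend` are hypothetical, and a real architecture study would weigh many more factors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    training: bool          # Question 1: must the chip train as well as infer?
    latency_critical: bool  # Question 2: latency (True) or throughput (False)?
    specialized: bool       # Question 3: fixed workload (True) or general-purpose (False)?


def recommend(req: Requirements) -> dict:
    """Map the three answers to the options listed in the decision tree."""
    return {
        "datapath": "forward + backward" if req.training else "forward only",
        "cores": ("few large cores, high clock" if req.latency_critical
                  else "many smaller cores, lower clock"),
        "compute": "systolic array" if req.specialized else "GPU-like",
    }


# Example: a throughput-oriented, inference-only, specialized accelerator.
choice = recommend(Requirements(training=False, latency_critical=False, specialized=True))
print(choice)
```
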
3. Design Flow
- Specification: Performance target, power budget, cost
- Architecture: Core count, memory hierarchy, interconnect
- RTL design: Implement in Verilog/SystemVerilog
- Simulation: Verify functional correctness of the RTL against the specification
- Synthesis: Convert RTL to a gate-level netlist that meets timing constraints
- Physical design: Place & route, power/ground distribution
- Sign-off: Static timing, DRC/LVS, power analysis
- Tape-out: Send to fab
- Post-silicon: Test, bring-up, production
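
To make the static timing step in sign-off concrete, here is a minimal sketch of the core idea: treat the combinational logic as a DAG, take the longest path as the critical path, and compare it against the clock period. The netlist, delay values, and the function name `worst_slack` are made up for illustration; production STA tools also model wire delay, clock skew, hold checks, and process corners.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def worst_slack(delays, fanin, clock_period_ns, setup_ns=0.0):
    """Return (slack, arrival_times) for a combinational DAG.

    delays: node -> propagation delay in ns
    fanin:  node -> list of predecessor nodes (nodes without an entry are start points)
    """
    arrival = {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(fanin).static_order():
        latest_input = max((arrival[p] for p in fanin.get(node, [])), default=0.0)
        arrival[node] = latest_input + delays[node]
    critical_path_ns = max(arrival.values())
    return clock_period_ns - setup_ns - critical_path_ns, arrival


# Hypothetical four-node datapath: two inputs feed a multiplier, then an adder.
delays = {"a": 0.2, "b": 0.3, "mul": 0.9, "add": 0.4}
fanin = {"mul": ["a", "b"], "add": ["mul", "a"]}

slack, arrival = worst_slack(delays, fanin, clock_period_ns=2.0, setup_ns=0.1)
print(f"slack = {slack:.2f} ns")  # positive slack means timing is met
```

A negative slack would mean the critical path does not fit in one clock period, which sends the design back to synthesis or physical design.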
4. Verification Strategy
- Unit testing: Each module independently (MAC unit, cache, etc.)
- Integration testing: Full chip with real models (ResNet, BERT)
- Formal verification: Arithmetic correctness proofs
- Timing closure: Static timing analysis (STA)
- Power simulation: Worst-case power scenarios
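
Unit testing a module such as a MAC unit usually means comparing it against a software reference ("golden") model on random vectors. The sketch below assumes an INT8 × INT8 multiply with a wrapping INT32 accumulator, which is an illustrative choice and not something the notes specify. `dut_mac` is a hypothetical stand-in for whatever drives the real RTL simulation.

```python
import random


def wrap_int32(x: int) -> int:
    """Wrap a Python int to signed 32-bit two's complement."""
    x &= 0xFFFFFFFF
    return x - 0x1_0000_0000 if x & 0x8000_0000 else x


def mac_golden(acc: int, a: int, b: int) -> int:
    """Reference model: acc + a*b with INT8 operands and an INT32 accumulator."""
    assert -128 <= a <= 127 and -128 <= b <= 127
    return wrap_int32(acc + a * b)


def run_random_test(dut_mac, n_vectors=1000, seed=0):
    """Drive identical random vectors through the DUT and the golden model."""
    rng = random.Random(seed)
    acc_ref = acc_dut = 0
    for i in range(n_vectors):
        a, b = rng.randint(-128, 127), rng.randint(-128, 127)
        acc_ref = mac_golden(acc_ref, a, b)
        acc_dut = dut_mac(acc_dut, a, b)
        if acc_ref != acc_dut:
            return f"mismatch at vector {i}: a={a}, b={b}, ref={acc_ref}, dut={acc_dut}"
    return "pass"


def buggy_mac(acc, a, b):
    """Deliberately broken stand-in DUT: treats operand b as unsigned."""
    return wrap_int32(acc + a * (b & 0xFF))


print(run_random_test(mac_golden))  # the model trivially agrees with itself
print(run_random_test(buggy_mac))   # reports the first failing vector
```
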
5. Real-World Design Examples
Google TPU v4 Decisions
- Node: 7nm
- Architecture: 128×128 systolic arrays (MXUs), several per chip (specialized)
- Memory: 32GB in-package HBM2, plus on-chip SRAM for caching
- Power: ~170W typical, under 200W peak (datacenter power budget)
- Design choice: Specialize for matrix multiply, accept inflexibility
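
To show why a systolic array suits matrix multiply, here is a cycle-by-cycle sketch of a generic output-stationary dataflow in plain Python. It is a textbook-style illustration chosen for simplicity, not a description of how the TPU is actually implemented; the function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(A, B):
    """Simulate an n x m grid of PEs computing C = A @ B.

    Each PE (i, j) keeps its own accumulator. A values flow left to right,
    B values flow top to bottom, and inputs are skewed so matching operands
    meet in the right PE on the right cycle.
    """
    n, k, m = len(A), len(B), len(B[0])
    assert all(len(row) == k for row in A), "inner dimensions must match"
    acc = [[0] * m for _ in range(n)]
    a_reg = [[0] * m for _ in range(n)]  # operand arriving from the left
    b_reg = [[0] * m for _ in range(n)]  # operand arriving from above
    cycles = n + m + k - 2               # last operand pair reaches PE (n-1, m-1)
    for t in range(cycles):
        # Shift operands one PE to the right / down (far edge first).
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Feed the edges: row i of A is delayed by i cycles, column j of B by j.
        for i in range(n):
            a_reg[i][0] = A[i][t - i] if 0 <= t - i < k else 0
        for j in range(m):
            b_reg[0][j] = B[t - j][j] if 0 <= t - j < k else 0
        # Every PE performs one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


C, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C, cycles)  # [[19, 22], [43, 50]] 4
```

Each PE only talks to its neighbours and does one multiply-accumulate per cycle, which is what makes the structure dense and fast, and also why it is inflexible for anything that is not a matrix multiply.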
NVIDIA H100 Decisions
- Node: TSMC 4N, a custom 5nm-class process
- Architecture: GPU-like (many cores, tensor units)
- Memory: 80GB HBM3 (large, flexible)
- Power: Up to 700W for the SXM module (higher than TPU)
- Design choice: Flexibility over specialization (supports various models)
6. Cost Considerations
Mask cost (part of NRE): ~$5M at 5nm, ~$500K at 28nm
- Amortizes well only when volume exceeds ~100K units
- For low-volume projects (e.g., academia), use an FPGA or a mature node such as 65nm
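
The amortization argument is simple arithmetic, sketched below with the mask-cost figures quoted above. The function name `mask_cost_per_unit` is hypothetical, and the calculation ignores every other NRE item (design effort, EDA licenses, IP).

```python
# Mask-set costs as quoted in these notes (USD).
MASK_COST_USD = {"5nm": 5_000_000, "28nm": 500_000}


def mask_cost_per_unit(node: str, volume: int) -> float:
    """Mask cost spread evenly over every unit shipped."""
    return MASK_COST_USD[node] / volume


for volume in (1_000, 10_000, 100_000, 1_000_000):
    row = ", ".join(
        f"{node}: ${mask_cost_per_unit(node, volume):,.2f}" for node in MASK_COST_USD
    )
    print(f"{volume:>9,} units -> {row}")
```

At 100K units the 5nm mask set adds $50 to every chip; at 1K units it adds $5,000, which is why low-volume projects avoid advanced nodes.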
Die cost: Grows with die area and falls as yield improves
- 256×256 systolic array: ~100mm² at 5nm
- Cost: ~$50 per die (5nm, high yield)
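
How area and yield feed into die cost can be sketched with two common approximations: a dies-per-wafer estimate with an edge-loss term, and the Poisson yield model. The wafer cost ($17,000) and defect density (0.1 defects/cm²) below are placeholder assumptions for illustration, not quoted foundry figures.

```python
import math


def dies_per_wafer(die_area_mm2, wafer_diameter_mm=300.0):
    """Approximation: wafer area / die area, minus dies lost at the round edge."""
    radius = wafer_diameter_mm / 2
    gross = math.pi * radius**2 / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(gross - edge_loss)


def poisson_yield(die_area_mm2, defects_per_cm2):
    """Poisson yield model: Y = exp(-A * D0), with A converted to cm^2."""
    return math.exp(-(die_area_mm2 / 100.0) * defects_per_cm2)


def cost_per_good_die(wafer_cost_usd, die_area_mm2, defects_per_cm2):
    """Wafer cost divided by the expected number of working dies."""
    good_dies = dies_per_wafer(die_area_mm2) * poisson_yield(die_area_mm2, defects_per_cm2)
    return wafer_cost_usd / good_dies


# Placeholder inputs (assumptions): $17,000 wafer, 0.1 defects/cm^2, 100 mm^2 die.
print(dies_per_wafer(100))                          # 640 candidate dies on a 300 mm wafer
print(f"{poisson_yield(100, 0.1):.1%}")             # about 90% yield
print(f"${cost_per_good_die(17_000, 100, 0.1):.2f}")
```

With these placeholder inputs the result is about $29 per good die, the same order of magnitude as the ~$50 figure above. Doubling the die area both halves the die count and lowers yield, so cost rises faster than area.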
7. Design Checklist
- ✅ Define specs: Performance, power, cost targets
- ✅ Choose technology: Balance density, cost, schedule
- ✅ Architecture: Systolic, GPU-like, or hybrid?
- ✅ Memory hierarchy: On-chip SRAM, off-chip DRAM
- ✅ Precision support: Which mix of FP32, BF16, and INT8?
- ✅ Design flow: Select EDA tools (e.g., Synopsys, Cadence)
- ✅ Verification: Testbenches, formal proofs
- ✅ Timing/power: Closure and sign-off
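
For the precision item in the checklist, it helps to see what each format does to a value. The sketch below converts FP32 to BF16 by keeping the top 16 bits (with round-to-nearest-even) and applies a simple symmetric INT8 quantization. The function names and the example scale are illustrative assumptions; NaN handling is omitted for brevity.

```python
import struct


def fp32_to_bf16_bits(x: float) -> int:
    """Round an FP32 value to BF16 (nearest-even); returns the 16 raw bits.

    BF16 keeps FP32's sign and 8-bit exponent but only 7 mantissa bits.
    NaN inputs are not handled specially in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(b: int) -> float:
    """Expand 16 BF16 bits back to a float by zero-filling the low mantissa."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


def quantize_int8(values, scale):
    """Symmetric INT8 quantization: q = clamp(round(v / scale), -128, 127)."""
    return [max(-128, min(127, round(v / scale))) for v in values]


print(bf16_bits_to_float(fp32_to_bf16_bits(3.14159)))  # 3.140625
print(quantize_int8([0.5, -1.0, 5.0], scale=0.02))     # [25, -50, 127]
```

BF16 keeps FP32's dynamic range at reduced precision, while INT8 needs an explicit scale and clips anything outside its range, as the last value shows.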
Next (Day 15): System integration and production.