1. SoC Architecture
System-on-Chip (SoC): AI accelerator, CPU, memory, and I/O integrated on a single die
| Component | Purpose | Design Considerations |
|---|---|---|
| AI Accelerator | Matrix multiply, convolution | Specialized, optimized for throughput |
| CPU Cores | Control, data preprocessing | General-purpose, lower power |
| Shared Memory | Weights, intermediate results | Multi-port SRAM, cache coherency |
| Interconnect | Data movement | High bandwidth (NoC or bus) |
| I/O Controllers | Camera, network, storage | Standard interfaces (USB, Ethernet) |
2. On-Device Inference Stack
Software layers for mobile/edge AI:
- ML Framework: TensorFlow Lite, PyTorch Mobile
- Compiler: Converts model to device-specific operations
- Runtime: Schedules execution, manages memory
- Driver: Hardware-specific command submission and memory management
- Firmware: Low-level hardware control
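Example: TensorFlow Lite targeting the Apple Neural Engine (reached through TFLite's Core ML delegate)
- Model (ResNet-50) → TFLite converter
- Quantize to INT8
- Compile to Neural Engine operations
- Runtime schedules work across the 16 Neural Engine cores in parallel
- Result (illustrative, not a measured benchmark): ~50 ms inference at under 2 W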
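The "Quantize to INT8" step can be sketched in plain Python. This is a minimal illustration of affine (asymmetric) quantization on a short list of weights, not the algorithm any particular converter uses. The function names `quantize_int8` and `dequantize_int8` are hypothetical.

```python
def quantize_int8(values):
    """Affine quantization of floats to signed 8-bit integers (illustrative)."""
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # keep 0.0 inside the representable range
    scale = (hi - lo) / 255.0 or 1.0      # fall back to 1.0 for all-zero input
    zero_point = round(-128 - lo / scale)
    # Round to the nearest integer step, shift by the zero point, clamp to int8
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize_int8(q, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [(x - zero_point) * scale for x in q]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
q, scale, zp = quantize_int8(weights)
restored = dequantize_int8(q, scale, zp)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

For this input the round-trip error stays within one quantization step (`scale`), which is why well-conditioned weight tensors tolerate INT8 with little accuracy loss.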
3. Memory Subsystem Integration
Shared memory hierarchy:
- AI accelerator: Local SRAM (weights cache)
- CPU: L1/L2 cache (program, small data)
- Shared: L3 cache or DRAM (larger working set)
- External: DRAM or flash (model storage)
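Challenge: Keeping CPU caches and accelerator memory coherent, so neither side reads stale data
Solutions: A hardware-coherent interconnect, explicit cache flush/invalidate operations with memory barriers, or separate memory spaces with explicit copies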
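The stale-data problem behind the coherency challenge can be shown with a toy model. This sketch assumes a CPU with a write-back cache and an accelerator that reads DRAM directly, with no hardware coherency. The class `ToySystem` and its methods are hypothetical and do not model any real SoC.

```python
class ToySystem:
    """Toy model: CPU with a write-back cache, accelerator reading DRAM directly."""

    def __init__(self):
        self.dram = {}
        self.cpu_cache = {}  # address -> value for dirty lines not yet in DRAM

    def cpu_write(self, addr, value):
        # Write-back policy: the value lands in the cache, DRAM is not updated yet
        self.cpu_cache[addr] = value

    def cpu_flush(self):
        # Explicit cache maintenance: write dirty lines back to DRAM
        self.dram.update(self.cpu_cache)
        self.cpu_cache.clear()

    def accel_read(self, addr):
        # Non-coherent accelerator: bypasses the CPU cache entirely
        return self.dram.get(addr, 0)


system = ToySystem()
system.cpu_write(0x1000, 42)
stale = system.accel_read(0x1000)  # the write is still sitting in the CPU cache
system.cpu_flush()
fresh = system.accel_read(0x1000)  # visible to the accelerator after the flush
```

In this model `stale` is 0 and `fresh` is 42: the accelerator only sees the CPU's write once the cache line has been written back.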
4. Software-Hardware Co-Design
Key insight: Hardware and software must be optimized together
Hardware perspective: Systolic array expects dense matrix multiply, tiled data
Software perspective: Framework must generate code for systolic tiling
Example trade-off:
- Option A: Hardware supports arbitrary strides → software flexibility, but more complex hardware
- Option B: Hardware requires aligned blocks → simpler hardware, but software must reformat data
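Option B can be sketched from the software side: the framework pads matrices to a multiple of the tile size, then multiplies tile by tile. The tile edge `TILE = 4` and the helpers `pad_to_tiles` and `tiled_matmul` are assumptions for illustration; real accelerators use larger tiles and hardware-specific layouts.

```python
TILE = 4  # hypothetical tile edge required by the hardware


def pad_to_tiles(matrix, tile=TILE):
    """Zero-pad a 2D list so both dimensions are multiples of the tile size."""
    rows, cols = len(matrix), len(matrix[0])
    padded_rows = -(-rows // tile) * tile  # ceiling division
    padded_cols = -(-cols // tile) * tile
    return [
        [matrix[r][c] if r < rows and c < cols else 0 for c in range(padded_cols)]
        for r in range(padded_rows)
    ]


def tiled_matmul(a, b, tile=TILE):
    """Multiply already-padded matrices one tile-sized block at a time."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # One tile x tile block product, accumulated into the output tile
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        out[i][j] += sum(
                            a[i][kk] * b[kk][j] for kk in range(k0, k0 + tile)
                        )
    return out


a = [[1, 2, 3], [4, 5, 6]]        # 2x3
b = [[7, 8], [9, 10], [11, 12]]   # 3x2
pa, pb = pad_to_tiles(a), pad_to_tiles(b)  # both become 4x4
full = tiled_matmul(pa, pb)
result = [row[:2] for row in full[:2]]     # crop back to the true 2x2 result
```

The zero padding does not change the product, but it costs extra memory and wasted multiplies. That is the software price paid for the simpler hardware.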
5. Production Inference Optimization
Model-specific tuning:
- Quantization: ResNet-50 typically loses < 1% top-1 accuracy when quantized to INT8
- Pruning: ~50% sparsity often preserves accuracy after fine-tuning
- Distillation: A smaller student model is trained to match a larger teacher model's outputs
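Pruning to a target sparsity can be sketched as unstructured magnitude pruning, one common approach: zero the weights with the smallest absolute values. The function `magnitude_prune` and the sample weights are hypothetical, and a real pipeline would fine-tune afterwards to recover accuracy.

```python
def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.2, -0.02, 0.6]
pruned = magnitude_prune(weights, sparsity=0.5)
achieved = pruned.count(0.0) / len(pruned)  # fraction of weights that are now zero
```

Here the four smallest-magnitude weights are removed and the four largest survive unchanged. Whether the hardware can turn those zeros into speedup depends on its support for sparse computation.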
Runtime optimization:
- Kernel fusion: Fuse convolution + batch norm + ReLU into a single operation
- Memory layout: Store weights in accelerator-friendly format
- Prefetching: Load the next layer's weights while computing the current layer
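Kernel fusion can be illustrated by folding an inference-mode batch norm into the preceding layer's weights, so that one multiply-add plus ReLU replaces three separate operations. This sketch uses a single scalar channel for clarity. The function names and parameter values are hypothetical.

```python
import math


def conv_bn_relu_unfused(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Three separate steps: scalar 'convolution', batch norm, ReLU."""
    y = w * x + b
    y = gamma * (y - mean) / math.sqrt(var + eps) + beta
    return max(y, 0.0)


def fold_batch_norm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm statistics into the weight and bias ahead of time."""
    s = gamma / math.sqrt(var + eps)
    return w * s, (b - mean) * s + beta


def conv_bn_relu_fused(x, w_folded, b_folded):
    """Single fused step: one multiply-add followed by ReLU."""
    return max(w_folded * x + b_folded, 0.0)


params = dict(w=0.8, b=0.1, gamma=1.5, beta=-0.2, mean=0.3, var=4.0)
w_f, b_f = fold_batch_norm(**params)
inputs = [-2.0, -0.5, 0.0, 1.0, 3.0]
unfused = [conv_bn_relu_unfused(x, **params) for x in inputs]
fused = [conv_bn_relu_fused(x, w_f, b_f) for x in inputs]
```

Both paths agree up to floating-point rounding, but the fused version avoids writing the intermediate activations to memory and reading them back.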
6. Real-World SoC Examples
Apple A17 Pro
- 6 CPU cores (2 performance + 4 efficiency)
- 16-core Neural Engine (AI acceleration)
- Shared 8MB L3 cache
- Peak ~35 TOPS (Apple's stated Neural Engine throughput; low-precision operations, not FP32 FLOPS)
- Used for on-device speech recognition, image processing, and AR
Google Tensor (Pixel)
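- On-chip TPU (Tensor Processing Unit) block as the ML co-processor
- Tensor Processing Unit: ~100 GFLOPS peak (rough estimate; Google has not published an official figure)
- Specialized for Google's models (Magic Eraser, Face Unblur)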
- Shares DRAM with CPU
7. Phase 1 Completion Checklist
- ✅ Days 1-6: Fundamentals (accelerators, networks, precision, quantization)
- ✅ Days 7, 10, 16, 22: Core architectures (systolic, processors, roofline, TPU)
- ✅ Days 8-9: Optimization (mixed precision, sparsity)
- ✅ Days 11-15: Integration (memory, power, performance, design, system)
- ✅ Total: 50,000+ words across 15 comprehensive topics
Phase 1 Summary:
You now understand the complete pipeline from neural network operations to hardware design decisions:
- ✅ What makes AI accelerators fast (matrix multiply, memory hierarchy)
- ✅ How to optimize inference vs training (latency vs throughput)
- ✅ Precision trade-offs (FP32, BF16, INT8, mixed)
- ✅ Quantization techniques (PTQ, QAT)
- ✅ Sparsity and pruning benefits
- ✅ Architecture designs (systolic arrays, TPU, GPU)
- ✅ Memory hierarchy and bandwidth optimization
- ✅ Power and thermal management
- ✅ Performance metrics and trade-offs
- ✅ Practical design decisions and SoC integration
Phase 2 (Days 16-30) covers advanced topics: the roofline model, detailed case studies, FPGA implementation, tools, production verification, and next-gen architectures.