1. What is an AI Accelerator?
An AI accelerator (or AI chip) is specialized hardware designed to perform neural network computations far faster and more efficiently than a general-purpose CPU.
Key insight: Neural networks are compute-intensive (trillions of operations), memory-intensive (gigabytes of weights), and parallelizable (matrix multiplications). CPUs are optimized for sequential logic. AI accelerators are optimized for parallel matrix math.
Why not use CPUs?
A CPU can run neural networks, but inefficiently:
- CPU core @ 3 GHz: ~3 billion scalar operations per second (one per cycle; SIMD units and multiple cores raise this considerably, but not to accelerator levels)
- GPU @ 2 GHz: ~10 trillion operations per second (5000+ cores working in parallel)
- TPU @ ~1 GHz: 100+ trillion operations per second (systolic arrays dedicated to matrix multiply)
Accelerators achieve this through:
- Parallelism: Thousands of cores working simultaneously
- Specialized units: Matrix multiply engines (MAC arrays)
- Memory bandwidth: Optimized hierarchy for weight reuse
- Data types: Lower precision (INT8, BF16) instead of FP32 or FP64
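The throughput figures above follow from simple arithmetic: number of parallel units, times operations per unit per cycle, times clock rate. The sketch below reproduces the scalar CPU and GPU estimates. The 256x256 MAC array at 1 GHz is a hypothetical configuration chosen for illustration, not a specific product, and real chips rarely sustain their theoretical peak.

```python
def peak_ops_per_second(units: int, ops_per_unit_per_cycle: int, clock_hz: float) -> float:
    """Theoretical peak: every unit issues its maximum ops on every cycle."""
    return units * ops_per_unit_per_cycle * clock_hz

# One scalar CPU core at 3 GHz, one operation per cycle.
cpu_scalar = peak_ops_per_second(1, 1, 3e9)
# 5000 GPU cores at 2 GHz, one operation per core per cycle.
gpu = peak_ops_per_second(5000, 1, 2e9)
# Hypothetical 256x256 MAC array at 1 GHz; each MAC counts as 2 ops (multiply + add).
mac_array = peak_ops_per_second(256 * 256, 2, 1e9)

for name, ops in [("cpu_scalar", cpu_scalar), ("gpu", gpu), ("mac_array", mac_array)]:
    print(f"{name}: {ops / 1e12:.3f} tera-ops/s")
```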
2. Inference vs. Training
Inference: Fast Prediction
Definition: Using a pre-trained model to make predictions on new data.
Characteristics:
- Read-only weights: Pre-trained, don't change
- One forward pass: Input → hidden layers → output
- Low latency required: Milliseconds (for real-time, e.g., image recognition)
- Throughput: Process many inputs (batches)
Example: Phone recognizes face (inference on NPU) in 50ms
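As a minimal sketch of what inference involves, the NumPy code below runs one forward pass through a tiny two-layer network. The weights are random stand-ins for a pre-trained model, and the layer sizes and the `predict` name are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "pre-trained" weights for a 2-layer network (random here).
W1, b1 = rng.standard_normal((8, 16)), np.zeros(16)
W2, b2 = rng.standard_normal((16, 3)), np.zeros(3)

def predict(x: np.ndarray) -> np.ndarray:
    """One forward pass: input -> hidden layer (ReLU) -> output scores."""
    hidden = np.maximum(x @ W1 + b1, 0.0)
    return hidden @ W2 + b2

batch = rng.standard_normal((4, 8))   # 4 inputs, 8 features each
scores = predict(batch)               # weights are only read, never modified
classes = scores.argmax(axis=1)       # predicted class per input
```

Processing the four inputs as one batch turns the work into two matrix multiplies, which is the shape of computation accelerators are built for.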
Training: Learning Weights
Definition: Adjusting weights to minimize loss on labeled data.
Characteristics:
- Forward + backward passes: Compute gradients for all weights
- Weight updates: Gradient descent, momentum, optimization algorithms
- High throughput required: Process thousands of images per second
- Memory intensive: Store activations for backward pass
Example: Datacenter trains ResNet-50 on ImageNet (hours on a modern GPU cluster, days on a single GPU)
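The training loop can be sketched on a toy problem. The code below fits a linear model with plain gradient descent, showing the forward pass, the backward pass (gradient computation), and the weight update. The data, the target weights `true_w`, and the learning rate are all made up for illustration; real training uses deep networks and automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 4))          # labeled inputs
true_w = np.array([1.0, -2.0, 0.5, 3.0])   # hypothetical target weights
y = X @ true_w                             # labels

w = np.zeros(4)   # trainable weights
lr = 0.1          # learning rate (assumed value)
losses = []
for step in range(200):
    pred = X @ w                           # forward pass
    err = pred - y
    losses.append(float(np.mean(err ** 2)))
    grad = 2.0 * X.T @ err / len(X)        # backward pass: d(loss)/dw
    w -= lr * grad                         # weight update (gradient descent)
```

Each step runs two matrix products (`X @ w` and `X.T @ err`) instead of one, and `err` must be kept from the forward pass for the backward pass, which hints at why training needs more compute and memory than inference.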
3. Compute Patterns in Neural Networks
Pattern 1: Matrix Multiplication (GEMM)
The dominant operation, typically more than 90% of compute. Each output element is a chain of multiply-accumulate (MAC) operations.
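To make the operation count concrete, here is a sketch that computes a matrix product one multiply-accumulate at a time and counts the MACs. An (M x K) by (K x N) product needs M * N * K of them. The function name `matmul_mac` is illustrative, and the pure-Python loops are for clarity, not speed.

```python
import numpy as np

def matmul_mac(A: np.ndarray, B: np.ndarray):
    """Compute C = A @ B one multiply-accumulate at a time, counting MACs."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N))
    macs = 0
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i, k] * B[k, j]   # one MAC
                macs += 1
            C[i, j] = acc
    return C, macs

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 3))
C, macs = matmul_mac(A, B)   # macs == 4 * 3 * 5
```

All M * N accumulations are independent of one another, which is what lets hardware run them in parallel.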
Pattern 2: Convolution
Used in image processing (CNNs). Sliding window of weights across spatial dimensions.
- Compute: Similar to matrix multiply (can be rewritten as GEMM)
- Data reuse: Weights are reused across spatial positions
- Optimization: Winograd and FFT-based convolution reduce the number of multiplications for some kernel sizes
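The rewrite of convolution as a matrix product is often called im2col. The sketch below, for a single channel with stride 1 and no padding (simplifying assumptions), unfolds every input patch into a row and then applies the kernel with one matrix product. It is checked against a direct sliding-window version. Function names are illustrative.

```python
import numpy as np

def conv2d_direct(x, w):
    """Valid 2-D cross-correlation of one channel, stride 1 (sliding window)."""
    H, W = x.shape
    kh, kw = w.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def conv2d_im2col(x, w):
    """Same result via im2col: unfold patches into rows, then one matrix product."""
    H, W = x.shape
    kh, kw = w.shape
    oh, ow = H - kh + 1, W - kw + 1
    cols = np.stack([x[i:i + kh, j:j + kw].ravel()
                     for i in range(oh) for j in range(ow)])   # shape (oh*ow, kh*kw)
    # With F filters, w would become a (kh*kw, F) matrix and this a full GEMM.
    return (cols @ w.ravel()).reshape(oh, ow)

rng = np.random.default_rng(0)
image = rng.standard_normal((6, 6))
kernel = rng.standard_normal((3, 3))
direct = conv2d_direct(image, kernel)
via_gemm = conv2d_im2col(image, kernel)
```

The trade-off is memory: `cols` duplicates overlapping pixels, so im2col spends extra storage to get a hardware-friendly matrix shape.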
Pattern 3: Element-Wise Operations
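Addition, multiplication, and activation functions such as ReLU. Softmax is mostly element-wise but also needs a reduction across each vector.
- Compute light: Typically less than 10% of total operations
- Simple hardware: Plain vector ALUs suffice; these operations are usually limited by memory bandwidth rather than arithmetic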
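A short sketch of two such operations in NumPy. ReLU touches each element independently, while softmax combines element-wise exponentials with a per-row sum, so it needs a reduction step as well. The function names are illustrative.

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    """Purely element-wise: each output depends on exactly one input."""
    return np.maximum(x, 0.0)

def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax: element-wise exp plus a reduction (sum) per row."""
    shifted = x - x.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

x = np.array([[-1.0, 0.0, 2.0],
              [3.0, 3.0, 3.0]])
activated = relu(x)
probs = softmax(x)   # each row sums to 1
```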
4. Memory Hierarchy & Bandwidth
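Many neural network workloads are memory-bound rather than compute-bound. When each weight loaded from memory is used for only a few operations (small-batch inference, element-wise layers), the bottleneck is moving data, not computing. Large batched matrix multiplies reuse each weight many times and can become compute-bound, which is why accelerators pair high-bandwidth memory with large on-chip buffers.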
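One common way to reason about this is the roofline model: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity (FLOPs per byte moved). The sketch below applies it to a hypothetical device with 100 TFLOPS peak and 1 TB/s of bandwidth. Both numbers are assumptions, and the traffic estimate assumes each matrix is read or written exactly once.

```python
def gemm_intensity(M: int, N: int, K: int, bytes_per_element: int = 2) -> float:
    """FLOPs per byte for C = A @ B, assuming each matrix crosses memory once."""
    flops = 2 * M * N * K                                  # multiply + add per MAC
    bytes_moved = bytes_per_element * (M * K + K * N + M * N)
    return flops / bytes_moved

def attainable_tflops(peak_tflops: float, bandwidth_tb_s: float, flops_per_byte: float) -> float:
    """Roofline: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_tflops, bandwidth_tb_s * flops_per_byte)

PEAK, BW = 100.0, 1.0   # hypothetical device: 100 TFLOPS, 1 TB/s

# Batch-1 inference: a vector times a 4096x4096 weight matrix.
single = attainable_tflops(PEAK, BW, gemm_intensity(1, 4096, 4096))
# Large batch: a 4096x4096 activation matrix times the same weights.
batched = attainable_tflops(PEAK, BW, gemm_intensity(4096, 4096, 4096))
print(f"batch 1: {single:.2f} TFLOPS, batch 4096: {batched:.2f} TFLOPS")
```

Under these assumptions the batch-1 case reaches only about 1% of peak because every weight is loaded for a single MAC, while the batched case hits the compute ceiling.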
5. Power Consumption Considerations
At datacenter scale, large accelerator clusters draw tens of megawatts or more. Power is a primary design constraint.
| Device | Peak Power | Typical Use | Energy/OP |
|---|---|---|---|
| Smartphone NPU | 2-5W | Inference only | 10-20 pJ |
| GPU (consumer) | 250-450W | Inference + training | 5-10 pJ |
| TPU (datacenter) | roughly 170-280W (TPU v2-v4, published figures) | Inference + training | 3-8 pJ |
| H100 (NVIDIA) | 700W | Training, inference | 2-5 pJ |
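Energy per operation is simply power divided by throughput, so a figure like those in the table can be sanity-checked in a few lines. The device below (300 W sustaining 100 trillion operations per second) is hypothetical, and the result depends heavily on which precision and utilization are assumed.

```python
def energy_per_op_pj(power_watts: float, ops_per_second: float) -> float:
    """Average energy per operation in picojoules (1 W = 1 J/s, 1 pJ = 1e-12 J)."""
    return power_watts / ops_per_second * 1e12

# Hypothetical accelerator: 300 W while sustaining 100 tera-ops/s.
example_pj = energy_per_op_pj(300.0, 100e12)
print(f"{example_pj:.2f} pJ per operation")
```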
Power optimization techniques:
- Lower precision: INT8 uses 4x fewer bits than FP32, cutting memory traffic by about 4x and arithmetic energy by even more
- Gating idle units: Clock gating, power gating
- Voltage scaling: Reduce voltage when frequency low
- Sparsity: Skip multiplications by zero (pruned weights, and activations zeroed by ReLU)
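The lower-precision technique can be illustrated with symmetric per-tensor INT8 quantization, one simple scheme among several. In this sketch the weights are random FP32 values, the function name `quantize_int8` is illustrative, and the input is assumed to contain at least one non-zero value.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization: x ~= scale * q with q in [-127, 127]."""
    scale = float(np.max(np.abs(x))) / 127.0   # assumes x is not all zeros
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)   # stand-in FP32 weights
q, scale = quantize_int8(w)
w_restored = q.astype(np.float32) * scale           # dequantize to compare
max_err = float(np.max(np.abs(w - w_restored)))

print(f"FP32 bytes: {w.nbytes}, INT8 bytes: {q.nbytes}, max error: {max_err:.4f}")
```

Storage and memory traffic shrink by 4x, and the rounding error per weight is bounded by about half of `scale`, which is the accuracy cost being traded away.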
6. Real-World Examples
NVIDIA GPU (H100)
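Architecture: 132 SMs (Streaming Multiprocessors) on the SXM5 version, ~3.35 TB/s HBM3 memory bandwidth
Specs:
- Peak FP32: ~67 TFLOPS
- Peak FP8: ~1,979 TFLOPS dense on Tensor Cores (roughly 30x the FP32 rate, from lower precision plus dedicated matrix units)
- Memory: 80 GB HBM3
- Power: 700W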
Use case: Training LLMs (GPT, BERT), inference at scale
Google TPU v4
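Architecture: Two TensorCores per chip, each with four 128×128 systolic-array matrix units (MXUs), backed by 32 GB of in-package HBM
Specs:
- Peak BF16: 275 TFLOPS
- Peak INT8: 275 TOPS (same peak rate as BF16 on this chip)
- Memory bandwidth: 1.2 TB/s (HBM)
- Power: ~170W typical, under 200W peak (Google's published measurements)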
Use case: Training large models at Google scale, inference
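To show how a systolic array computes a matrix product, here is a small cycle-by-cycle simulation of an output-stationary design, one of several possible dataflows and not necessarily the one used in any TPU. Each processing element (PE) keeps one output value while rows of A and columns of B stream past with a one-cycle skew per row and column. The function name and structure are illustrative.

```python
import numpy as np

def systolic_matmul(A: np.ndarray, B: np.ndarray):
    """Simulate an output-stationary systolic array computing C = A @ B.

    PE (i, j) holds C[i, j]. Row i of A enters from the left delayed by i
    cycles and column j of B enters from the top delayed by j cycles, so at
    cycle t PE (i, j) sees A[i, k] and B[k, j] with k = t - i - j.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N))
    cycles = M + N + K - 2            # last operand pair arrives at t = M+N+K-3
    for t in range(cycles):
        for i in range(M):            # in hardware, all PEs act in the same cycle
            for j in range(N):
                k = t - i - j
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]   # one MAC in PE (i, j)
    return C, cycles

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((6, 5))
C, cycles = systolic_matmul(A, B)
```

Each operand is fetched from memory once and then passed from PE to PE, which is how the design reaches high throughput without a matching increase in memory bandwidth.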
Apple Neural Engine
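Architecture: 16 cores (A14/M1 generation); Apple does not publicly document the internal microarchitecture
Specs:
- Peak: ~11 TOPS (Apple's quoted figure for the 16-core A14/M1 generation)
- On-chip memory: Limited but efficient
- Power: On the order of a few watts (smartphone power budget; Apple does not publish a figure)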
Use case: Face recognition, image processing, on-device inference (privacy)
7. Key Design Trade-Offs
Trade-Off 1: Generality vs Specialization
General-purpose GPU: Can run almost any parallel algorithm; modern GPUs add Tensor Cores for neural networks but still carry general-purpose overhead
Specialized TPU: Optimized for matrix multiply, but harder to support other workloads
Choice depends on: Workload diversity, product roadmap, time-to-market
Trade-Off 2: Latency vs Throughput
Low latency (mobile): Optimize for fast single inference, sacrifice throughput
High throughput (datacenter): Process batches, optimize pipeline bandwidth
Illustrative example: Phone inference: 50ms for a single image. Datacenter: 1000 images/second in batches
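The trade-off can be sketched with a simple cost model in which each batch pays a fixed overhead plus a per-item cost. The 40 ms overhead and 1 ms per item are made-up numbers for illustration, and real devices do not scale this linearly.

```python
def batch_latency_ms(batch_size: int, fixed_ms: float = 40.0, per_item_ms: float = 1.0) -> float:
    """Time until the whole batch is done: fixed overhead plus per-item work."""
    return fixed_ms + per_item_ms * batch_size

def throughput_per_s(batch_size: int, fixed_ms: float = 40.0, per_item_ms: float = 1.0) -> float:
    """Items completed per second when batches run back to back."""
    return batch_size / (batch_latency_ms(batch_size, fixed_ms, per_item_ms) / 1000.0)

for b in (1, 8, 64, 512):
    print(f"batch {b:>3}: latency {batch_latency_ms(b):6.1f} ms, "
          f"throughput {throughput_per_s(b):7.1f} items/s")
```

In this model larger batches amortize the fixed overhead, so throughput climbs toward 1000 items/s while every request waits longer. A phone picks batch 1, a datacenter picks the largest batch its latency budget allows.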
8. Complete Checklist: AI Chip Design Fundamentals
- ✅ Understand inference vs training (2 fundamentally different problems)
- ✅ Identify dominant compute patterns (matrix multiply = 90% of work)
- ✅ Memory bandwidth is often the bottleneck (moving data can cost more than computing on it)
- ✅ Power is a first-class constraint (not an afterthought)
- ✅ Parallelism is key (systolic arrays, many cores)
- ✅ Lower precision trades accuracy for speed/power (INT8 vs FP32)
- ✅ Specialization pays off on the target workload (a TPU-style matrix engine is more efficient at matrix multiply, at the cost of flexibility)
- ✅ Real-world constraints matter (mobile: power, datacenter: throughput)
Next (Day 2): Neural network fundamentals (layers, operations, and math).