1. The Perceptron Model
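Foundation of all neural networks: a single processing unit, loosely modeled on a biological neuron, that computes a weighted sum of its inputs plus a bias and passes the result through an activation function.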
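As a minimal sketch of the idea, the snippet below implements one perceptron with a step activation. The function name `perceptron` and the hand-picked AND-gate weights are illustrative assumptions, not taken from any particular library or chip.

```python
def perceptron(inputs, weights, bias):
    """Weighted sum of inputs plus bias, followed by a step activation."""
    # Multiply-accumulate (MAC): one multiply and one add per input.
    total = bias
    for x, w in zip(inputs, weights):
        total += x * w
    # Step activation: output 1 if the sum is positive, else 0.
    return 1 if total > 0 else 0


# Hypothetical hand-picked parameters that make the unit act as a logical AND.
AND_WEIGHTS = [1.0, 1.0]
AND_BIAS = -1.5

for a in (0, 1):
    for b in (0, 1):
        print(a, b, perceptron([a, b], AND_WEIGHTS, AND_BIAS))
```

The inner loop is the multiply-accumulate pattern that the rest of these notes count as a MAC.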
2. Multi-Layer Networks
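Stacking layers of perceptrons, with a non-linear activation between layers, creates expressive models. Without the non-linearity, the stack collapses into a single linear map.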
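One way to see what stacking buys is XOR, which no single perceptron can compute but a two-layer network can. The sketch below uses hand-chosen weights (an OR unit and a NAND unit feeding an AND unit); the helper names `step`, `dense`, and `xor_net` are assumptions made for this example.

```python
def step(z):
    """Step activation: 1 if z is positive, else 0."""
    return 1 if z > 0 else 0


def dense(inputs, weight_rows, biases, activation):
    """One layer: each row of weights defines one output unit."""
    outputs = []
    for weights, bias in zip(weight_rows, biases):
        z = bias + sum(x * w for x, w in zip(inputs, weights))
        outputs.append(activation(z))
    return outputs


def xor_net(a, b):
    # Hidden layer: unit 0 computes OR, unit 1 computes NAND.
    hidden = dense([a, b], [[1, 1], [-1, -1]], [-0.5, 1.5], step)
    # Output layer: AND of the two hidden units gives XOR.
    return dense(hidden, [[1, 1]], [-1.5], step)[0]


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))
```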
3. Common Layer Types
Fully Connected (Dense)
Every input connected to every output. Most straightforward layer.
- Compute: O(input_size × output_size) MACs
- Hardware: Systolic array excels at this (matrix multiply)
- Example: Output layer of classifiers
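The MAC count above can be checked with a small sketch. `dense_forward` is a hypothetical helper that computes one dense layer in plain Python and counts its multiply-accumulates; a real accelerator would map the same loops onto a matrix-multiply unit.

```python
def dense_forward(x, W, b):
    """Compute y[j] = b[j] + sum_i x[i] * W[i][j]; return (y, mac_count)."""
    n_in, n_out = len(W), len(W[0])
    y = list(b)  # start each output from its bias
    macs = 0
    for j in range(n_out):
        for i in range(n_in):
            y[j] += x[i] * W[i][j]  # one multiply-accumulate
            macs += 1
    return y, macs


# 3 inputs, 2 outputs: expect 3 * 2 = 6 MACs.
y, macs = dense_forward([1, 2, 3], [[1, 0], [0, 1], [1, 1]], [0, 0])
print(y, macs)
```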
Convolutional (CNN)
Sliding window of weights across spatial input.
- Compute: O(output_spatial_size × kernel_size² × in_channels × out_channels) MACs
- Hardware: Parallelizable across spatial positions
- Optimization: Winograd, FFT acceleration possible
- Example: Image recognition (ResNet, VGG)
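To make the sliding-window cost concrete, here is a minimal sketch of a single-channel convolution with stride 1 and no padding, plus a MAC estimate for the multi-channel case. The names `conv2d_valid` and `conv2d_macs` are assumptions for this example; as in most deep-learning frameworks, the "convolution" is computed as a cross-correlation (the kernel is not flipped).

```python
def conv2d_valid(image, kernel):
    """Slide a k x k kernel over a 2D image (stride 1, no padding)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    out = [[0] * out_w for _ in range(out_h)]
    macs = 0
    for r in range(out_h):          # every output row ...
        for c in range(out_w):      # ... and column is independent work
            for i in range(k):
                for j in range(k):
                    out[r][c] += image[r + i][c + j] * kernel[i][j]
                    macs += 1
    return out, macs


def conv2d_macs(out_h, out_w, kernel_size, in_channels, out_channels):
    """MAC estimate for a full layer: one k x k window per in/out channel pair."""
    return out_h * out_w * kernel_size * kernel_size * in_channels * out_channels


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
out, macs = conv2d_valid(image, kernel)
print(out, macs, conv2d_macs(2, 2, 2, 1, 1))
```

The two outer loops have no dependencies between iterations, which is why the layer parallelizes well across spatial positions.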
Recurrent (LSTM, GRU)
Sequential processing with hidden state (memory).
- Compute: Sequential (hard to parallelize between timesteps)
- Latency: High (O(sequence_length) dependent steps, each waiting on the previous one)
- Example: Language models, time series
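The sequential dependency is easiest to see in code. The sketch below is a deliberately tiny recurrent cell with a scalar hidden state; `rnn_forward` and its parameters `w_x`, `w_h`, and `b` are illustrative assumptions, and real LSTM/GRU cells add gating on top of this pattern.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a scalar recurrent cell over a sequence and return all hidden states."""
    h = h0
    states = []
    for x in inputs:
        # Step t needs h from step t-1, so this loop cannot be
        # parallelized across timesteps.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


print(rnn_forward([1.0, 0.5, -0.5], w_x=1.0, w_h=0.5, b=0.0))
```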
Attention (Transformer)
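Each token's query is compared against every key by dot product, and the resulting weights mix the values (matrix multiply again!).
- Compute: O(seq_length² × model_dim) for self-attention (quadratic in sequence length, so expensive for long inputs)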
- Hardware: Also systolic array friendly
- Example: GPT, BERT, large language models
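A minimal sketch of scaled dot-product attention in plain Python follows. The function names `softmax` and `attention` and the list-of-lists layout are assumptions for this example; production code would use batched matrix multiplies instead of nested loops.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V are lists of row vectors."""
    d = len(Q[0])
    # Scores: one dot product per (query, key) pair, so seq_len^2 entries.
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        for q in Q
    ]
    weights = [softmax(row) for row in scores]
    # Output: each row is a weighted mix of the value vectors.
    out = [
        [sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))]
        for row in weights
    ]
    return out, weights


Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, weights = attention(Q, K, V)
print(weights)
print(out)
```

Both the score computation and the weighted sum are matrix multiplies, which is what makes the layer a good fit for systolic arrays.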
4. Backpropagation (Training)
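How weights are updated during training: backpropagation applies the chain rule backward through the layers to get the gradient of the loss with respect to every weight, and an optimizer then adjusts the weights. Chips that target training need to support this; inference-only accelerators do not.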
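As a sketch of the mechanics, the snippet below runs one forward pass, one backward pass, and one gradient-descent update for a single sigmoid neuron with a squared-error loss. The names `train_step` and `lr`, and the choice of loss, are assumptions made to keep the example small.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_step(w, b, x, target, lr):
    """One training step for y = sigmoid(w*x + b), loss = 0.5*(y - target)^2."""
    # Forward pass.
    z = w * x + b
    y = sigmoid(z)
    loss = 0.5 * (y - target) ** 2
    # Backward pass: chain rule, from the loss back to the parameters.
    dloss_dy = y - target
    dy_dz = y * (1.0 - y)          # derivative of the sigmoid
    dloss_dz = dloss_dy * dy_dz
    grad_w = dloss_dz * x
    grad_b = dloss_dz
    # Gradient-descent update.
    return w - lr * grad_w, b - lr * grad_b, loss


w, b = 0.0, 0.0
for _ in range(5):
    w, b, loss = train_step(w, b, x=1.0, target=1.0, lr=0.5)
    print(round(loss, 6))
```

Note that the backward pass reuses values from the forward pass (`y` here), which is why training hardware needs extra memory for stored activations.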
5. Activation Functions
Introduce non-linearity. Hardware must support these efficiently.
| Function | Formula | Hardware Cost | Use Case |
|---|---|---|---|
| ReLU | max(0, x) | 1 comparison | Hidden layers (most common) |
| Sigmoid | 1/(1+e^(-x)) | Expensive (exp, divide) | Binary classification output, LSTM/GRU gates |
| Tanh | (e^x - e^(-x))/(e^x + e^(-x)) | Expensive (exp, divide) | RNN hidden and candidate states |
| Softmax | e^(x_i) / Σ_j e^(x_j) | Expensive (exp, reduce, divide) | Multi-class output |
| GELU | x × Φ(x), Φ = standard normal CDF | Moderate (approx) | Transformers (modern) |
Hardware design tip: ReLU is nearly free (a sign check and a mux). The others need an exponential, which is typically approximated with lookup tables or piecewise fits rather than computed exactly.
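The functions in the table, and the lookup-table idea from the tip, can be sketched in a few lines. All names here are illustrative; the 256-entry table over [-8, 8] is an assumed configuration chosen for the example, not a recommendation for any specific chip.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gelu(x):
    """Exact GELU: x times the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Common tanh-based approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


# Lookup-table sigmoid: precompute once, then index instead of calling exp.
LUT_MIN, LUT_MAX, LUT_SIZE = -8.0, 8.0, 256
LUT_STEP = (LUT_MAX - LUT_MIN) / (LUT_SIZE - 1)
SIGMOID_LUT = [sigmoid(LUT_MIN + i * LUT_STEP) for i in range(LUT_SIZE)]


def sigmoid_lut(x):
    x = min(max(x, LUT_MIN), LUT_MAX)           # clamp to the table range
    index = round((x - LUT_MIN) / LUT_STEP)     # nearest table entry
    return SIGMOID_LUT[index]


for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), sigmoid(x), sigmoid_lut(x), gelu(x), gelu_tanh(x))
```

A larger table or interpolation between entries trades silicon area for accuracy.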
6. Batch Processing
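Processing multiple samples simultaneously (critical for throughput): each weight is fetched once and reused across every sample in the batch, which amortizes memory traffic.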
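The effect on memory traffic can be sketched with a simple model. The snippet assumes, for illustration only, that the full weight matrix is re-read from memory once per batch; `weight_fetches` and `batched_dense` are hypothetical helpers.

```python
def weight_fetches(n_in, n_out, n_samples, batch_size):
    """Weight words fetched if the whole matrix is re-read once per batch."""
    n_batches = -(-n_samples // batch_size)  # ceiling division
    return n_batches * n_in * n_out


def batched_dense(X, W):
    """Y = X @ W for a batch X (rows are samples), reusing each weight."""
    n_in, n_out = len(W), len(W[0])
    Y = [[0.0] * n_out for _ in X]
    for i in range(n_in):
        for j in range(n_out):
            w = W[i][j]                  # fetched once ...
            for s, x in enumerate(X):
                Y[s][j] += x[i] * w      # ... reused for every sample
    return Y


print(weight_fetches(4, 4, n_samples=16, batch_size=1))
print(weight_fetches(4, 4, n_samples=16, batch_size=16))
print(batched_dense([[1, 2], [3, 4]], [[1, 0], [0, 1]]))
```

Under this model the MAC count is unchanged by batching; only the number of weight fetches drops.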
7. Model Architectures
CNNs (Convolutional Neural Networks)
- Optimized for images
- ResNet, VGG, EfficientNet
- Hardware-friendly (parallel convolutions)
RNNs/LSTMs
- Optimized for sequences
- Latency challenges (sequential)
- Harder to keep parallel inference accelerators busy, since each timestep depends on the previous one
Transformers
- State-of-the-art for NLP
- Attention is compute-heavy but parallelizable
- GPT, BERT, modern LLMs
- Hardware: Dominated by matrix multiply again, which suits systolic-array designs such as the TPU
8. Hardware Implications
- ✅ Matrix multiply is king: it typically accounts for the bulk of compute in these models (roughly 80-90% as a rule of thumb)
- ✅ ReLU should be free: Just max logic, don't spend silicon on expensive activations
- ✅ Batching is critical: 16+ samples per batch for efficiency
- ✅ Memory reuse matters: Keep weights in on-chip memory where possible, since an off-chip fetch costs far more energy than a MAC
- ✅ Precision trade-offs: INT8 works for most models, saves power and memory
- ✅ Sparsity wins: Many weights are zero (or can be pruned to zero) and can be skipped, but the skipping logic adds hardware complexity that must be weighed against the gains
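The precision and sparsity points can be illustrated together. The sketch below uses symmetric per-tensor INT8 quantization (one scale for the whole weight vector) and a dot product that skips zero weights; both are assumed, simplified schemes, and the names `quantize_int8`, `dequantize`, and `sparse_dot` are hypothetical.

```python
def quantize_int8(values):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


def sparse_dot(weights, activations):
    """Dot product that skips zero weights; returns (result, macs_done)."""
    total, macs = 0, 0
    for w, a in zip(weights, activations):
        if w == 0:
            continue  # no multiply needed for a zero weight
        total += w * a
        macs += 1
    return total, macs


weights = [0.5, -1.0, 0.0, 0.25]
q, scale = quantize_int8(weights)
print(q, scale)
print(dequantize(q, scale))
print(sparse_dot(q, [1, 1, 1, 1]))
```

The rounding error per weight is at most half the scale, which is the accuracy cost paid for the smaller storage and cheaper integer MACs.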
Next (Day 3): Inference architecture and data flow optimization.