AI Chip Day 2 Enhanced — Neural Network Fundamentals

1. The Perceptron Model

Foundation of all neural networks: a single processing unit that mimics a neuron.

Perceptron Output:
    output = activation(Σ(weight_i × input_i) + bias)

Example (3 inputs):
    z = w₁×x₁ + w₂×x₂ + w₃×x₃ + b
    output = ReLU(z) = max(0, z)

Hardware implication:
- Multiply-accumulate (MAC): 3 multiplications + 3 additions (2 to sum the products, 1 to add the bias)
- Simple, parallelizable
- Can be pipelined so that one result completes per cycle

Why it works:
- Weights learn features (e.g., edge detection in images)
- Bias adjusts threshold
- Activation adds non-linearity (essential for learning)
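The perceptron formula can be sketched in a few lines of Python. This is a minimal illustration, not a hardware model: the function names and the input values are made up for the example, and the explicit loop is only there to make each MAC visible.

```python
def relu(z):
    """ReLU activation: max(0, z)."""
    return max(0.0, z)


def perceptron(inputs, weights, bias, activation=relu):
    """Compute activation(sum(w_i * x_i) + bias) with an explicit MAC loop."""
    acc = bias  # start the accumulator at the bias
    for w, x in zip(weights, inputs):
        acc += w * x  # one multiply-accumulate (MAC) per input
    return activation(acc)


# Illustrative values (assumptions, not taken from the notes)
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 1.0]
b = 0.5

out = perceptron(x, w, b)
print(out)  # prints 2.0 (z = 0.5 + 0.5 - 2.0 + 3.0)
```

Starting the accumulator at the bias is one common way to fold the bias add into the MAC chain, so the loop body is exactly one MAC per input.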

2. Multi-Layer Networks

Stacking perceptrons creates expressive models.

Neural Network Structure:
    Input Layer (28×28 = 784 pixels)
        ↓
    Hidden Layer 1 (128 neurons)
        ↓
    Hidden Layer 2 (64 neurons)
        ↓
    Output Layer (10 classes: digits 0-9)

Forward Pass:
    h₁ = ReLU(W₁ × x + b₁)       # 784×128 matrix multiply
    h₂ = ReLU(W₂ × h₁ + b₂)      # 128×64 matrix multiply
    y  = softmax(W₃ × h₂ + b₃)   # 64×10 matrix multiply

Total MACs:
    784×128 + 128×64 + 64×10 = 100,352 + 8,192 + 640 = 109,184 ≈ 110,000 MACs per inference
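The MAC count and the forward pass for this 784-128-64-10 network can be checked with a small pure-Python sketch. The random weights, the helper names, and the choice to store each weight matrix as one row per output neuron are assumptions made for illustration; a real model would load trained weights.

```python
import math
import random

LAYER_SIZES = [784, 128, 64, 10]  # input, hidden 1, hidden 2, output


def count_macs(sizes):
    """One MAC per weight: sum of n_in * n_out over consecutive layers."""
    return sum(n_in * n_out for n_in, n_out in zip(sizes, sizes[1:]))


def dense(W, x, b):
    """Matrix-vector product plus bias. W is stored as n_out rows of n_in weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu_vec(v):
    return [max(0.0, t) for t in v]


def softmax(v):
    """Softmax with the usual max-subtraction for numerical stability."""
    m = max(v)
    exps = [math.exp(t - m) for t in v]
    total = sum(exps)
    return [e / total for e in exps]


def forward(params, x):
    """ReLU on hidden layers, softmax on the final layer."""
    h = x
    for i, (W, b) in enumerate(params):
        z = dense(W, h, b)
        h = softmax(z) if i == len(params) - 1 else relu_vec(z)
    return h


rng = random.Random(0)  # fixed seed so the sketch is repeatable
params = [
    ([[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)],
     [0.0] * n_out)
    for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:])
]
x = [rng.random() for _ in range(LAYER_SIZES[0])]  # stand-in for a 28x28 image

y = forward(params, x)
total_macs = count_macs(LAYER_SIZES)
print(total_macs)  # prints 109184
```

The first layer accounts for 100,352 of the 109,184 MACs, which is why the input-facing matrix usually dominates in small fully connected models like this one.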

3. Common Layer Types

Fully Connected (Dense)

Every input connected to every output. Most straightforward layer.

Compute: O(input_size × output_size) MACs
Hardware: Systolic array excels at this (matrix multiply)
Example: Output layer of classifiers

Convolutional (CNN)

Sliding window of weights across spatial input.

Compute: O(output_positions × kernel_size² × in_channels × out_channels) MACs
Hardware: Parallelizable across spatial positions
Optimization: Winograd, FFT acceleration possible
Example: Image recognition (ResNet, VGG)
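To make the sliding-window idea and its cost concrete, here is a minimal sketch of a direct single-channel convolution plus a MAC counter for a multi-channel layer. It assumes a square kernel, stride 1, and no padding; the layer shape passed to conv_macs is hypothetical.

```python
def conv_macs(h_out, w_out, kernel, c_in, c_out):
    """MACs for one conv layer: every output position and output channel
    needs kernel*kernel*c_in multiply-accumulates."""
    return h_out * w_out * kernel * kernel * c_in * c_out


def conv2d_single(image, kernel):
    """Direct 2D convolution (cross-correlation form) on one channel,
    stride 1, no padding."""
    k = len(kernel)
    h_out = len(image) - k + 1
    w_out = len(image[0]) - k + 1
    out = [[0.0] * w_out for _ in range(h_out)]
    for i in range(h_out):          # each output position is independent,
        for j in range(w_out):      # so these two loops can run in parallel
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            out[i][j] = acc
    return out


image = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]
kernel = [[1.0, 0.0], [0.0, -1.0]]  # difference along the diagonal

result = conv2d_single(image, kernel)
print(result)  # every entry is -5.0 for this ramp image

# Hypothetical layer: 56x56 output, 3x3 kernel, 64 input and 64 output channels
print(conv_macs(56, 56, 3, 64, 64))  # prints 115605504
```

The same small kernel is reused at every position, which is the weight reuse that makes convolutions a good fit for parallel hardware.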

Recurrent (LSTM, GRU)

Sequential processing with hidden state (memory).

Compute: Sequential (hard to parallelize between timesteps)
Latency: High (O(sequence_length) cycles)
Example: Language models, time series
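The sequential dependency is easiest to see in code. The sketch below uses a plain tanh RNN cell rather than an LSTM or GRU to keep it short; the weights and the sequence are made-up values, and the function names are illustrative.

```python
import math


def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One timestep: h_t = tanh(W_x * x_t + W_h * h_prev + b)."""
    h_new = []
    for i in range(len(b)):
        acc = b[i]
        acc += sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))
        acc += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_new.append(math.tanh(acc))
    return h_new


def run_rnn(sequence, h0, W_x, W_h, b):
    """Process a sequence one step at a time. Step t cannot start until
    step t-1 has produced its hidden state, so latency grows with length."""
    h = h0
    steps = 0
    for x_t in sequence:
        h = rnn_step(x_t, h, W_x, W_h, b)
        steps += 1
    return h, steps


# Illustrative sizes: 1 input feature, 2 hidden units, 4 timesteps
W_x = [[0.5], [-0.5]]
W_h = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]
sequence = [[1.0], [0.5], [-1.0], [0.0]]

h_final, steps = run_rnn(sequence, [0.0, 0.0], W_x, W_h, b)
print(steps)  # prints 4: one dependent step per timestep
```

Within a single step the matrix-vector products can still be parallelized; only the loop over timesteps is inherently serial.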

Attention (Transformer)

Query, key, value dot products (matrix multiply again!).

Compute: O(seq_length² × d_model) for self-attention (quadratic in sequence length, so expensive for long inputs)
Hardware: Also systolic array friendly
Example: GPT, BERT, large language models
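A minimal single-head sketch of scaled dot-product attention shows where the quadratic cost comes from. It omits masking, multiple heads, and the learned projections, and the MAC estimate counts only the two matrix products; all names and values are illustrative assumptions.

```python
import math


def softmax(v):
    m = max(v)
    exps = [math.exp(t - m) for t in v]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    outputs = []
    for q in Q:
        # One score per (query, key) pair: seq_len * seq_len dot products overall
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        row = [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        outputs.append(row)
    return outputs


def attention_macs(seq_len, d):
    """MACs for Q K^T plus (weights x V): two seq_len x seq_len x d products."""
    return 2 * seq_len * seq_len * d


# With identical keys every query attends uniformly, so the output is the mean of V
Q = [[0.3, -0.2], [2.0, 1.0]]
K = [[1.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
print(out)

print(attention_macs(512, 64))   # prints 33554432
print(attention_macs(1024, 64))  # 4x the value above: doubling length quadruples MACs
```

Both products are ordinary matrix multiplies, which is why the same systolic hardware used for dense layers applies here.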

4. Backpropagation (Training)

How weights are updated during training (chips that target training must support this; inference-only chips need just the forward pass).

Forward Pass:
    z = Wx + b
    a = ReLU(z)

Loss Computation:
    L = (a - target)²

Backward Pass (gradient computation):
    dL/da = 2(a - target)
    dL/dz = dL/da × ReLU'(z)
    dL/dW = dL/dz × x^T
    dL/db = dL/dz

Weight Update (Gradient Descent):
    W_new = W_old - learning_rate × dL/dW
    b_new = b_old - learning_rate × dL/db

Hardware: Forward + backward = 2-3x the compute of inference alone
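The forward, backward, and update steps above can be traced for a single ReLU neuron. This is a sketch under simplifying assumptions: one neuron, one training sample, and made-up starting values. With a single neuron, W is a vector, so dL/dW reduces to dL/dz scaled by each input.

```python
def relu(z):
    return max(0.0, z)


def relu_grad(z):
    """Derivative of ReLU: 1 where z > 0, else 0."""
    return 1.0 if z > 0 else 0.0


def train_step(w, b, x, target, lr):
    """One gradient-descent step for a = ReLU(w . x + b) with squared-error loss."""
    # Forward pass
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    a = relu(z)
    loss = (a - target) ** 2

    # Backward pass (chain rule, same order as the equations above)
    dL_da = 2.0 * (a - target)
    dL_dz = dL_da * relu_grad(z)
    dL_dw = [dL_dz * xi for xi in x]
    dL_db = dL_dz

    # Weight update
    w_new = [wi - lr * g for wi, g in zip(w, dL_dw)]
    b_new = b - lr * dL_db
    return w_new, b_new, loss


# Illustrative starting point (assumed values)
w, b = [0.5, 0.5], 0.0
x, target, lr = [1.0, 2.0], 1.0, 0.1

losses = []
for _ in range(3):
    w, b, loss = train_step(w, b, x, target, lr)
    losses.append(loss)

print(losses)  # the loss shrinks each step: roughly 0.25, 0.01, 0.0004
```

Note that the backward pass reuses z and x from the forward pass, so training hardware has to keep those activations in memory until the gradients are computed.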

5. Activation Functions

Introduce non-linearity. Hardware must support these efficiently.

Function	Formula	Hardware Cost	Use Case
ReLU	max(0, x)	1 comparison	Hidden layers (most common)
Sigmoid	1/(1+e^(-x))	Expensive (exp)	Binary classification, LSTM/GRU gates
Tanh	(e^x - e^(-x))/(e^x + e^(-x))	Expensive (exp)	RNN/LSTM state activations
Softmax	e^(x_i) / Σ_j e^(x_j)	Expensive (exp, reduce)	Multi-class output
GELU	x × Φ(x)	Moderate (approx)	Transformers (modern)

Hardware design tip: ReLU is nearly free (a single comparison). The others need exponential hardware, lookup tables, or piecewise approximations.
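The functions in the table, together with one way a lookup table can stand in for the exponential, can be sketched as follows. The table range [-8, 8], the 65 entries, and the linear interpolation are assumptions chosen for illustration, not a recommendation for any particular chip; the GELU approximation is the widely used tanh form.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    """Sigmoid written so that exp() never sees a large positive argument."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softmax(xs):
    m = max(xs)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gelu_exact(x):
    """x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Common tanh-based approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


# Lookup-table sigmoid: 65 precomputed points on [-8, 8], linear interpolation between them
LUT_MIN, LUT_MAX, LUT_SIZE = -8.0, 8.0, 65
STEP = (LUT_MAX - LUT_MIN) / (LUT_SIZE - 1)
SIGMOID_LUT = [sigmoid(LUT_MIN + i * STEP) for i in range(LUT_SIZE)]


def sigmoid_lut(x):
    if x <= LUT_MIN:
        return SIGMOID_LUT[0]   # clamp: sigmoid is nearly 0 out here
    if x >= LUT_MAX:
        return SIGMOID_LUT[-1]  # clamp: sigmoid is nearly 1 out here
    pos = (x - LUT_MIN) / STEP
    i = int(pos)
    frac = pos - i
    return SIGMOID_LUT[i] + frac * (SIGMOID_LUT[i + 1] - SIGMOID_LUT[i])


max_err = max(abs(sigmoid_lut(i / 100.0) - sigmoid(i / 100.0)) for i in range(-1000, 1001))
print(max_err)  # small: the table trades a little accuracy for no exp() at run time
```

In hardware terms, the table version replaces an exponential and a divide with one memory read, one subtract, and one multiply-add, at the cost of storing the table.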

6. Batch Processing

Processing multiple samples simultaneously (critical for throughput).

Single Sample:
    output = f(W × x + b)    # 1 sample
    Compute: K MACs (K = number of weight parameters)

Batch of N Samples:
    Output = f(W × X + b)    # X is an input_size×N matrix, one column per sample
    Compute: K × N MACs (same hardware, up to N times the throughput, because each weight is reused for all N samples)

Utilization:
    Single sample: Low (if hardware sized for batches)
    Batch of 32: Good
    Batch of 256: Excellent (memory bandwidth limits)

Chip design: Systolic array sized for batch processing
    Example: 256×256 array processes a batch of 16 256-dimensional vectors
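The benefit of batching comes from weight reuse, which a short sketch can make explicit. It assumes W is stored as output rows by input columns, that X holds one sample per column, and that each weight is fetched once and then reused for the whole batch; that last point is an idealization of on-chip weight caching.

```python
def matmul(W, X):
    """W is n_out x n_in, X is n_in x N (one column per sample). Returns n_out x N."""
    n_out, n_in, n = len(W), len(X), len(X[0])
    return [
        [sum(W[i][k] * X[k][j] for k in range(n_in)) for j in range(n)]
        for i in range(n_out)
    ]


def matvec(W, x):
    """Single-sample version, for comparison."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def batch_stats(n_in, n_out, batch):
    """Return (total MACs, MACs per weight fetched), assuming each weight
    is read once and reused for every sample in the batch."""
    weight_reads = n_in * n_out
    macs = weight_reads * batch
    return macs, macs / weight_reads


W = [[1.0, 2.0], [3.0, 4.0]]
samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Stack the samples as columns of X
X = [[s[k] for s in samples] for k in range(2)]

Y = matmul(W, X)
print(Y)  # prints [[1.0, 2.0, 3.0], [3.0, 4.0, 7.0]]

# Each column of Y matches the single-sample result
columns = [[Y[i][j] for i in range(2)] for j in range(3)]
print(columns == [matvec(W, s) for s in samples])  # prints True

print(batch_stats(256, 256, 16))  # prints (1048576, 16.0)
```

MACs per weight fetched equals the batch size under this assumption, which is why larger batches move a layer away from being limited by weight bandwidth.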

7. Model Architectures

CNNs (Convolutional Neural Networks)

Optimized for images
ResNet, VGG, EfficientNet
Hardware-friendly (parallel convolutions)

RNNs/LSTMs

Optimized for sequences
Latency challenges (sequential)
Less suitable for inference accelerators

Transformers

State-of-the-art for NLP
Attention is compute-heavy but parallelizable
GPT, BERT, modern LLMs
Hardware: Matrix multiply again (TPU advantage)

8. Hardware Implications

✅ Matrix multiply is king: typically 80-90% of neural network compute
✅ ReLU should be free: Just max logic, don't spend silicon on expensive activations
✅ Batching is critical: 16+ samples per batch for efficiency
✅ Memory reuse matters: Weights must be cached locally
✅ Precision trade-offs: INT8 works for inference on most models, saves power and memory
✅ Sparsity wins: Many weights are zero, skip them (hardware complexity vs. gains)
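Two of these points, INT8 precision and zero skipping, can be illustrated with a short sketch. It uses symmetric per-tensor quantization with a single scale factor, which is one simple scheme among several, and a software zero check standing in for sparsity hardware. All names and values are illustrative assumptions.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map [-max_abs, max_abs] onto [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    return [qi * scale for qi in q]


def int8_dot(qa, scale_a, qb, scale_b):
    """Integer MACs into a wide accumulator, then one rescale at the end."""
    acc = 0  # hardware would typically use a 32-bit accumulator here
    for a, b in zip(qa, qb):
        acc += a * b
    return acc * scale_a * scale_b


def sparse_dot(weights, inputs):
    """Skip MACs whose weight is exactly zero. Returns (result, MACs performed)."""
    acc, macs = 0.0, 0
    for w, x in zip(weights, inputs):
        if w == 0.0:
            continue  # no multiply issued for a zero weight
        acc += w * x
        macs += 1
    return acc, macs


weights = [0.5, -1.0, 0.0, 0.25]
inputs = [1.0, 2.0, 3.0, 4.0]

q_w, s_w = quantize_int8(weights)
q_x, s_x = quantize_int8(inputs)

exact = sum(w * x for w, x in zip(weights, inputs))
approx = int8_dot(q_w, s_w, q_x, s_x)
print(exact, approx)  # the INT8 result is close to the float result, not identical

print(sparse_dot(weights, inputs))  # prints (-0.5, 3): one of four MACs skipped
```

The zero check itself costs logic and complicates scheduling, which is the hardware complexity versus gains trade-off noted in the sparsity bullet.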

Next (Day 3): Inference architecture and data flow optimization.