AI Chip Day 4 Enhanced — Training Accelerators

1. Training vs Inference Hardware

Key differences:

Inference: Read-only weights, forward pass only, optimized for latency
Training: Read-write weights, forward + backward pass, optimized for throughput
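Memory: Training needs roughly 5-10x the model size, versus about 1x for inference (stored activations for the backward pass, plus gradients and optimizer state)
Compute: Training costs about 3x the compute of a forward pass (forward + weight-gradient and input-gradient matrix multiplies)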

2. Backpropagation Hardware

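Forward pass (inference):
    a = f(Wx + b)              # matrix multiply + activation
    Cost: K MACs per sample (K = number of weights)

Backward pass (training):
    dL/dz = dL/da * f'(z)      # activation gradient (elementwise)
    dL/dW = dL/dz * x^T        # weight gradient (matrix multiply)
    dL/dx = W^T * dL/dz        # input gradient (passed back to the previous layer)
    dL/db = sum(dL/dz)         # bias gradient (reduce over the batch)
    Cost: ~2K MACs (two matrix multiplies, each similar to forward), so ~3K MACs for a full training step

Hardware implication:
    - The same systolic array computes both forward and backward passes
    - Activations must be stored for the backward pass (memory overhead)
    - Activation functions and their gradients (sigmoid/tanh) are elementwise and parallel, but they run on vector or special-function units rather than the systolic array, so those units can become the bottleneck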
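The four backward-pass equations can be checked numerically. The following is a minimal NumPy sketch, not a hardware model. It assumes a single dense layer with tanh as f and a toy loss L = sum(a); the function names forward and backward are illustrative and do not come from these notes.

```python
import numpy as np

def forward(W, b, x):
    """Forward pass for one dense layer with tanh activation."""
    z = W @ x + b          # matrix multiply: the systolic-array work
    a = np.tanh(z)         # elementwise activation
    return z, a

def backward(W, x, z, dL_da):
    """Backward pass; reuses x and z that were stored during the forward pass."""
    dL_dz = dL_da * (1.0 - np.tanh(z) ** 2)    # activation gradient (elementwise)
    dL_dW = dL_dz @ x.T                        # weight gradient (matrix multiply)
    dL_dx = W.T @ dL_dz                        # input gradient (matrix multiply)
    dL_db = dL_dz.sum(axis=1, keepdims=True)   # bias gradient (reduce over batch)
    return dL_dW, dL_db, dL_dx

rng = np.random.default_rng(0)
n_in, n_out, batch = 4, 3, 5
W = rng.standard_normal((n_out, n_in))
b = rng.standard_normal((n_out, 1))
x = rng.standard_normal((n_in, batch))

z, a = forward(W, b, x)
# Toy loss L = sum(a), so dL/da is all ones.
dL_dW, dL_db, dL_dx = backward(W, x, z, np.ones_like(a))

# Finite-difference check on a single weight entry.
eps = 1e-6
W_pert = W.copy()
W_pert[1, 2] += eps
numeric = (forward(W_pert, b, x)[1].sum() - a.sum()) / eps
print("analytic:", dL_dW[1, 2], "numeric:", numeric)
```

The two matrix multiplies in backward have the same shapes as the forward multiply (transposed), which is why the same array hardware can serve both passes.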

3. Memory Requirements for Training

Component	Size	Notes
Model weights	X GB	Parameters to learn
Forward activations	2-5X GB	Stored for backward pass
Gradients (dW, db)	X GB	Same size as weights
Optimizer state	1-2X GB	Momentum (1X) or Adam first and second moments (2X)
Total	5-10X GB	5-10x model size needed

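Example: Training GPT-3 (175B parameters). FP32 weights alone are 700 GB (175B x 4 bytes), so the 5-10x rule gives roughly 3.5-7 TB of total memory. An H100 has 80 GB, so capacity alone requires roughly 45-90 GPUs.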
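The multipliers in the table can be turned into a quick estimator. This is a rough sketch under stated assumptions: FP32 storage (4 bytes per parameter), and default multipliers picked from inside the table's ranges. The function name training_memory_gb and its parameters are hypothetical, and the result is sensitive to the byte width and multipliers chosen.

```python
import math

def training_memory_gb(num_params, bytes_per_param=4,
                       activation_factor=2.0, optimizer_factor=2.0):
    """Estimate training memory in GB from the table's multipliers.

    activation_factor: 2-5 in the table (depends on batch size and depth).
    optimizer_factor:  1 for momentum, 2 for Adam.
    """
    weights = num_params * bytes_per_param / 1e9
    return {
        "weights": weights,
        "activations": activation_factor * weights,
        "gradients": weights,                      # same size as the weights
        "optimizer": optimizer_factor * weights,
        "total": (2.0 + activation_factor + optimizer_factor) * weights,
    }

est = training_memory_gb(175e9)           # 175B parameters, Adam, FP32
gpus = math.ceil(est["total"] / 80)       # 80 GB per accelerator
print(est)
print("accelerators needed for capacity alone:", gpus)
```

With these assumptions the estimate lands at 4.2 TB; raising activation_factor to 5 pushes it to 6.3 TB, which shows how much the activation term dominates the spread.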

4. Optimization Algorithms

SGD (Stochastic Gradient Descent)

Simple: W_new = W - lr * dW

Hardware: Just subtract (cheap)

Momentum

Accelerates: v = β*v + dW; W = W - lr*v

Hardware: Extra memory for velocity (1x parameter size), one extra multiply-add per parameter

Adam (Adaptive Moment Estimation)

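Widely used default for large models: Maintains running estimates of the first and second moments of the gradients

Hardware cost: Extra memory of 2x parameter size (m and v per parameter), more ALU operations (including a square root and a divide per parameter)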

Trade-off: Better convergence, but higher memory and compute cost
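To make the per-parameter state and arithmetic concrete, the three update rules can be written side by side. This is a minimal NumPy sketch: the momentum rule follows the formula given above, the Adam rule is the standard bias-corrected form, and the hyperparameter defaults (beta=0.9, beta1=0.9, beta2=0.999, eps=1e-8) and the toy quadratic loss are assumptions for illustration.

```python
import numpy as np

def sgd_step(W, dW, lr):
    # No optimizer state: one multiply and one subtract per parameter.
    return W - lr * dW

def momentum_step(W, dW, v, lr, beta=0.9):
    # One extra array (v) the same size as W.
    v = beta * v + dW
    return W - lr * v, v

def adam_step(W, dW, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    # Two extra arrays (m, v) plus a square root and a divide per parameter.
    m = beta1 * m + (1 - beta1) * dW
    v = beta2 * v + (1 - beta2) * dW ** 2
    m_hat = m / (1 - beta1 ** t)      # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    return W - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def grad(W):
    """Gradient of the toy loss 0.5 * ||W||^2."""
    return W

W0 = np.array([1.0, -2.0, 3.0])
W_sgd = W0.copy()
W_mom, vel = W0.copy(), np.zeros_like(W0)
W_adam, m, v = W0.copy(), np.zeros_like(W0), np.zeros_like(W0)

for t in range(1, 201):
    W_sgd = sgd_step(W_sgd, grad(W_sgd), lr=0.1)
    W_mom, vel = momentum_step(W_mom, grad(W_mom), vel, lr=0.1)
    W_adam, m, v = adam_step(W_adam, grad(W_adam), m, v, t, lr=0.1)

for name, W in [("sgd", W_sgd), ("momentum", W_mom), ("adam", W_adam)]:
    print(name, np.linalg.norm(W))
```

Counting the arrays each function carries between steps gives the memory picture directly: none for SGD, one for momentum, two for Adam.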

5. Distributed Training

Problem: No single chip has enough memory for LLM training (175B params at 5-10x model size is several TB)

Solution: Distribute across multiple chips

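Data parallelism: Each chip processes different batch samples; all chips hold the same weights and average their gradients each step
Model parallelism: The model is split across chips, by layer or by slicing individual weight matrices (tensor parallelism)
Pipeline parallelism: Layers are grouped into stages on different chips, and micro-batches are staggered through the stages to overlap compute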

Hardware requirement: High-speed interconnect (NVLink, InfiniBand) to synchronize gradients
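Gradient synchronization in data parallelism amounts to averaging per-chip gradients, which is what an all-reduce over the interconnect does. The sketch below simulates that on one machine with NumPy. It assumes a linear model with mean-squared-error loss and four equal shards; the helper grad_mse and the worker count are illustrative, and no real communication library is involved.

```python
import numpy as np

def grad_mse(W, X, y):
    """Gradient of 0.5 * mean((X @ W - y)^2) with respect to W."""
    err = X @ W - y
    return X.T @ err / len(y)

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))    # global batch of 64 samples
y = rng.standard_normal(64)
W = rng.standard_normal(8)          # every worker holds the same weights

num_workers = 4
shards_X = np.split(X, num_workers)  # 16 samples per simulated chip
shards_y = np.split(y, num_workers)

# Each worker computes a gradient on its own shard only.
local_grads = [grad_mse(W, Xs, ys) for Xs, ys in zip(shards_X, shards_y)]

# Stand-in for all-reduce: average the local gradients.
synced_grad = np.mean(local_grads, axis=0)

# Reference: gradient computed on the full batch by a single device.
full_grad = grad_mse(W, X, y)

print("match:", np.allclose(synced_grad, full_grad))
print("bytes exchanged per worker per step (one gradient copy):", W.nbytes)
```

The last line hints at why the interconnect matters: the traffic per step scales with the parameter count, not with the batch size.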

6. Gradient Accumulation & Mixed Precision

Problem: A large batch may not fit in memory, but large batches use the hardware more efficiently and give less noisy gradients

Solution: Gradient accumulation (accumulate gradients over smaller batches)

Standard training (batch=64):
    gradients_64 = backward_pass(batch_64)
    weights -= lr * gradients_64

Gradient accumulation (batch=16, accumulate 4 times):
    gradients = 0                                     # reset the accumulator
    for i in range(4):
        small_batch_16 = get_batch()
        gradients += backward_pass(small_batch_16)    # accumulate
    weights -= lr * (gradients / 4)                   # same effect as batch=64 (for batch-mean gradients)
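The equivalence claimed in the pseudocode can be verified on a small problem. This is a minimal NumPy sketch assuming a linear model with a mean-reduced squared-error loss; grad_mse and the sizes are illustrative. The equivalence does not hold exactly for layers whose output depends on batch statistics, such as batch normalization.

```python
import numpy as np

def grad_mse(W, X, y):
    """Batch-mean gradient of 0.5 * (X @ W - y)^2 with respect to W."""
    err = X @ W - y
    return X.T @ err / len(y)

rng = np.random.default_rng(1)
X = rng.standard_normal((64, 8))
y = rng.standard_normal(64)
W = rng.standard_normal(8)
lr = 0.1

# Standard training: one backward pass over all 64 samples.
W_big = W - lr * grad_mse(W, X, y)

# Gradient accumulation: four backward passes over 16 samples each.
accum_steps = 4
grad_sum = np.zeros_like(W)                      # reset the accumulator
for X_small, y_small in zip(np.split(X, accum_steps), np.split(y, accum_steps)):
    grad_sum += grad_mse(W, X_small, y_small)    # only 16 samples live at once
W_accum = W - lr * grad_sum / accum_steps        # average, then one update

print("same update:", np.allclose(W_big, W_accum))
```

The cost is one extra gradient-sized buffer (grad_sum) and four sequential passes instead of one, in exchange for a quarter of the activation memory.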

7. Hardware Optimizations for Training

Tensor cores: Specialized matrix multiply (faster, lower precision)
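Gradient clipping: Prevent exploding gradients (value clipping is a simple threshold; norm clipping needs a global reduction first)
Mixed precision: Compute forward and backward in FP16/BF16, keep FP32 master weights and accumulation (faster and saves memory; FP16 needs loss scaling)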
Checkpointing: Don't store all activations, recompute during backward pass
Sparsity: Skip computation on zero weights, activations, or gradients (complex hardware; payoff depends on how sparse the tensors are)
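Two numerical effects explain why mixed-precision schemes usually keep some state in FP32. The sketch below illustrates both with NumPy scalars. The specific values (an update of 1e-4, a gradient of 1e-8, a loss scale of 1024) are assumptions chosen to trigger the effects, not values from any real training run.

```python
import numpy as np

# 1) Small updates vanish when the weight itself lives in FP16.
update = np.float16(1e-4)       # lr * gradient, assumed small
w_fp16 = np.float16(1.0)        # weight stored only in FP16
w_master = np.float32(1.0)      # FP32 master copy of the same weight
for _ in range(1000):
    w_fp16 = w_fp16 - update                    # rounds back to 1.0 each step
    w_master = w_master - np.float32(update)    # accumulates the change

print("FP16 weight after 1000 updates:", float(w_fp16))
print("FP32 master weight after 1000 updates:", float(w_master))

# 2) Tiny gradients underflow to zero in FP16 unless scaled up first.
true_grad = 1e-8
loss_scale = 1024.0
naive = np.float16(true_grad)                    # below FP16 range
scaled = np.float16(true_grad * loss_scale)      # representable in FP16
recovered = np.float32(scaled) / np.float32(loss_scale)   # unscale in FP32

print("FP16 gradient without scaling:", float(naive))
print("gradient recovered with loss scaling:", float(recovered))
```

The spacing between FP16 values near 1.0 is about 0.0005 below and 0.001 above, so any update smaller than roughly half that is rounded away; this is the case the FP32 master copy protects against.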

8. Real-World Training Examples

ResNet-50 Training

Model size: ~25M parameters
Batch size: 256-512
Time: Hours on a multi-GPU node or TPU pod slice, up to days on a single GPU
Memory: 16-32GB (1 GPU)
Optimizer: SGD with momentum

BERT Training

Model size: 110M-340M parameters
Batch size: 256-1024
Time: About 4 days on 16 TPU chips (BERT-Base) or 64 TPU chips (BERT-Large), per the original paper
Memory: 256GB+ (distributed)
Optimizer: Adam (higher memory, better convergence)

GPT-3 Training

Model size: 175B parameters
Batch size: 3.2M tokens (distributed)
Time: Weeks on a large GPU cluster (V100s, per the GPT-3 paper)
Memory: Several TB total (distributed)
Optimizer: Adam (activation checkpointing is commonly used at this scale to fit in memory)
Cost: Millions of dollars in compute
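The "3x compute" rule from section 1 gives a back-of-the-envelope estimate of total training compute. The sketch below applies it to GPT-3. It assumes 2 FLOPs per MAC and about one MAC per parameter per token, uses the 300B training tokens reported in the GPT-3 paper (a figure not given in these notes), and plugs in a purely hypothetical cluster of 1024 accelerators sustaining 100 TFLOP/s each.

```python
def training_flops(num_params, num_tokens):
    """Approximate training FLOPs: forward pass plus a backward pass costing 2x forward."""
    forward = 2 * num_params * num_tokens   # 1 MAC = 2 FLOPs, per parameter per token
    backward = 2 * forward                  # weight-gradient + input-gradient matmuls
    return forward + backward               # = 6 * params * tokens

def training_days(total_flops, num_chips, sustained_flops_per_chip):
    """Wall-clock days at a given sustained (not peak) throughput."""
    seconds = total_flops / (num_chips * sustained_flops_per_chip)
    return seconds / 86400

flops = training_flops(175e9, 300e9)
days = training_days(flops, num_chips=1024, sustained_flops_per_chip=100e12)

print(f"total training compute: {flops:.3e} FLOPs")
print(f"days on the hypothetical cluster: {days:.1f}")
```

The estimate ignores attention FLOPs, communication stalls, and restarts, so it is best read as a lower bound on wall-clock time for the assumed throughput.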

9. Training Hardware Checklist

✅ Support both forward and backward: Same compute units do both
✅ Enough memory: 5-10x model size for activations, gradients, optimizer
✅ Efficient gradient computation: Parallel systolic arrays
✅ Fast interconnect: For distributed gradient synchronization
✅ Mixed precision support: FP16/BF16 compute with FP32 accumulation for memory/speed tradeoff
✅ Gradient accumulation: Reach large effective batch sizes from small per-step batches
✅ Checkpointing capability: Recompute vs store trade-off
✅ Measure energy/iteration: Larger training = more power consumption
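The recompute-versus-store trade-off behind checkpointing can be sketched in a few lines. This is a toy NumPy illustration, assuming a chain of tanh layers without biases; the function names and the segment length (every) are hypothetical. To keep it short, both variants recompute tanh inside layer_backward, so the quantity being compared is only how many layer inputs are held in memory at once.

```python
import numpy as np

def layer(x, W):
    return np.tanh(W @ x)

def layer_backward(x, W, grad_out):
    """Given a layer's input x and dL/d(output), return dL/d(input)."""
    a = np.tanh(W @ x)
    return W.T @ (grad_out * (1.0 - a ** 2))

def backward_full(x0, Ws, grad_out):
    """Store every layer input during forward; memory grows with depth."""
    inputs = [x0]
    for W in Ws[:-1]:
        inputs.append(layer(inputs[-1], W))
    g = grad_out
    for x, W in zip(reversed(inputs), reversed(Ws)):
        g = layer_backward(x, W, g)
    return g, len(inputs)

def backward_checkpointed(x0, Ws, grad_out, every):
    """Store one input per segment; recompute the rest during backward."""
    checkpoints = {}
    x = x0
    for i, W in enumerate(Ws):
        if i % every == 0:
            checkpoints[i] = x          # keep only segment-start inputs
        x = layer(x, W)
    g = grad_out
    peak = len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        end = min(start + every, len(Ws))
        seg_inputs = [checkpoints[start]]
        for W in Ws[start:end - 1]:     # extra forward work for this segment
            seg_inputs.append(layer(seg_inputs[-1], W))
        peak = max(peak, len(checkpoints) + len(seg_inputs) - 1)
        for x_in, W in zip(reversed(seg_inputs), reversed(Ws[start:end])):
            g = layer_backward(x_in, W, g)
    return g, peak

rng = np.random.default_rng(0)
depth, width = 8, 6
Ws = [0.5 * rng.standard_normal((width, width)) for _ in range(depth)]
x0 = rng.standard_normal(width)
grad_out = np.ones(width)

g_full, n_full = backward_full(x0, Ws, grad_out)
g_ckpt, n_ckpt = backward_checkpointed(x0, Ws, grad_out, every=4)
print("same gradient:", np.allclose(g_full, g_ckpt))
print("layer inputs held at peak:", n_full, "vs", n_ckpt)
```

The checkpointed version pays roughly one extra forward pass in exchange for holding fewer activations, which is the trade a training chip makes when memory, not compute, is the limiting resource.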

Next (Day 5): Floating-point and precision formats.