1. Why Sparsity Matters
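Key insight: Trained neural network weights cluster near zero, so many of them can be set to exactly zero (pruned) with little accuracy loss. The figures below are typical reported ranges, not guarantees.
- ResNet-50: around 70% zeros after pruning and fine-tuning
- BERT layer: 50-80% zeros after pruning
- LLaMA-class models: 90%+ zeros is an aggressive target; published one-shot pruning results more often report around 50% with small quality loss
Benefit: Skip multiplications by zero (less compute, memory bandwidth, and power), provided the hardware and kernels can actually exploit the zeros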
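To make the benefit concrete, here is a minimal NumPy sketch that measures the sparsity of a weight matrix and counts multiply-accumulates (MACs) for a matrix-vector product. The helper names, the 64x64 shape, and the threshold of 1.0 are illustrative assumptions, and the "sparse" count is an ideal that ignores indexing overhead.

```python
import numpy as np

def sparsity(w):
    """Fraction of entries that are exactly zero."""
    return 1.0 - np.count_nonzero(w) / w.size

def dense_vs_sparse_macs(w):
    """MACs for y = w @ x: a dense kernel does one per entry,
    an ideal sparse kernel does one per nonzero entry."""
    return w.size, int(np.count_nonzero(w))

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))      # stand-in for a trained layer
w[np.abs(w) < 1.0] = 0.0           # illustrative threshold, not a tuned value

dense_macs, sparse_macs = dense_vs_sparse_macs(w)
print(f"sparsity={sparsity(w):.2f}, dense MACs={dense_macs}, ideal sparse MACs={sparse_macs}")
```

The gap between the two MAC counts is an upper bound on the saving; real kernels recover only part of it.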
2. Pruning Methods
Magnitude Pruning
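Idea: Set weights whose magnitude falls below a threshold to zero, on the assumption that small magnitude means low importance
- Unstructured: Zero individual weights anywhere in the tensor
- Structured: Remove entire filters/channels, usually ranked by a norm of their weights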
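Both variants can be sketched in a few lines of NumPy. The function names are hypothetical, and ranking filters by L1 norm is one common choice assumed here rather than something the notes prescribe.

```python
import numpy as np

def prune_unstructured(w, sparsity):
    """Zero the smallest-|w| entries so that about `sparsity` of them are zero.
    Ties at the threshold may zero slightly more than requested."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]  # k-th smallest magnitude
    return np.where(np.abs(w) <= threshold, 0.0, w)

def prune_filters(w, n_remove):
    """Structured variant: zero the `n_remove` filters (rows) with the smallest L1 norm."""
    norms = np.abs(w).sum(axis=1)
    drop = np.argsort(norms)[:n_remove]
    out = w.copy()
    out[drop, :] = 0.0
    return out

w = np.array([[0.1, -2.0, 0.3, 1.5],
              [0.05, 0.2, -0.1, 0.02],
              [3.0, -0.4, 0.6, -1.0]])
w_unstructured = prune_unstructured(w, 0.5)   # zeros scattered across all rows
w_structured = prune_filters(w, 1)            # the whole weakest row is zeroed
```

In a real deployment the structured variant would delete the zeroed rows outright, which is what lets a plain dense kernel run faster.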
Magnitude + Fine-Tuning
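Prune, then retrain the surviving weights (keeping pruned weights at zero) to recover accuracy; often repeated over several prune/fine-tune rounds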
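A toy sketch of the prune-then-fine-tune loop, assuming a linear regression model in place of a neural network so that it stays self-contained. The data, the `train` helper, and the choice to keep 3 of 8 weights are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))
true_w = np.array([2.0, -3.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0])
y = X @ true_w

def mse(w):
    return float(np.mean((X @ w - y) ** 2))

def train(w, mask, steps=200, lr=0.05):
    """Gradient descent on MSE; re-applying the mask keeps pruned weights at zero."""
    w = w * mask
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = (w - lr * grad) * mask
    return w

# Stand-in for an imperfectly trained dense model.
w_dense = true_w + rng.normal(scale=0.1, size=8)

# Prune: keep only the 3 largest-magnitude weights.
keep = np.argsort(np.abs(w_dense))[-3:]
mask = np.zeros(8)
mask[keep] = 1.0
w_pruned = w_dense * mask

# Fine-tune the surviving weights only.
w_finetuned = train(w_pruned, mask)
print(f"MSE after pruning: {mse(w_pruned):.4f}, after fine-tuning: {mse(w_finetuned):.2e}")
```

The key detail is the mask inside the update step: without it, fine-tuning would silently regrow the pruned weights.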
Lottery Ticket Hypothesis
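Insight: Dense networks contain sparse subnetworks ("winning tickets") that, when reset to their original initialization and trained in isolation, can reach accuracy comparable to the full network
Process: Train the dense network, prune the smallest-magnitude weights, rewind the survivors to their initial values, and retrain only those; repeat for several rounds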
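The train / prune / rewind loop can be sketched on a toy problem. This uses a linear model rather than a neural network, so it only illustrates the mechanics of iterative magnitude pruning with rewinding, not the hypothesis itself; the data, the two rounds, and the 50% per-round prune rate are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16))
true_w = np.zeros(16)
true_w[[1, 5, 9, 13]] = [2.0, -3.0, 1.5, -2.5]
y = X @ true_w

def train(w_init, mask, steps=300, lr=0.05):
    """Masked gradient descent on MSE, starting from w_init."""
    w = w_init * mask
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = (w - lr * grad) * mask
    return w

w0 = rng.normal(scale=0.1, size=16)   # saved initialization, reused every round
mask = np.ones(16)

for _ in range(2):                     # each round prunes half of the surviving weights
    w_trained = train(w0, mask)        # always train from the ORIGINAL init
    alive = np.flatnonzero(mask)
    n_prune = len(alive) // 2
    drop = alive[np.argsort(np.abs(w_trained[alive]))[:n_prune]]
    mask[drop] = 0.0

ticket = train(w0, mask)               # final sparse subnetwork, rewound to w0
print("surviving indices:", np.flatnonzero(mask).tolist())
```

What distinguishes this from plain prune-and-fine-tune is the rewind: every round restarts from `w0` instead of continuing from the trained weights.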
3. Structured vs Unstructured Sparsity
| Type | What's Removed | Hardware Complexity | Typical Speedup | Challenge |
|---|---|---|---|---|
| Unstructured | Individual weights | High (sparse matrix ops, index metadata) | 2-4x possible | Needs sparse-aware hardware or kernels |
| Structured (Channel) | Entire filters/channels | Low (layers stay dense, just smaller) | 1.5-2x | May degrade accuracy more at equal sparsity |
| Block Sparsity | Blocks of weights | Medium (regular pattern) | 2-3x | Balancing block size against accuracy |
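Block sparsity from the last table row can be sketched as magnitude pruning applied to whole tiles. The function name and the use of the Frobenius norm to rank tiles are assumptions for illustration.

```python
import numpy as np

def prune_blocks(w, block, sparsity):
    """Zero the `sparsity` fraction of (block x block) tiles with the smallest Frobenius norm."""
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0, "shape must be divisible by block size"
    tiles = w.reshape(rows // block, block, cols // block, block)
    norms = np.sqrt((tiles ** 2).sum(axis=(1, 3)))        # one norm per tile
    n_drop = int(sparsity * norms.size)
    keep = np.ones(norms.size, dtype=bool)
    keep[np.argsort(norms.ravel())[:n_drop]] = False      # weakest tiles are dropped
    keep = keep.reshape(norms.shape)
    # Expand the per-tile decision back to a per-weight mask.
    mask = np.repeat(np.repeat(keep, block, axis=0), block, axis=1)
    return np.where(mask, w, 0.0)

w = np.arange(1, 17, dtype=float).reshape(4, 4)
w_blocked = prune_blocks(w, block=2, sparsity=0.5)
print(w_blocked)
```

Because zeros arrive in aligned tiles, a kernel can skip a whole tile with one index lookup instead of tracking every individual weight.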
4. Hardware Support for Sparsity
Challenge: Unstructured sparsity requires complex sparse matrix multiply hardware
Solutions:
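- NVIDIA Ampere and later (e.g., A100): 2:4 structured sparsity in sparse tensor cores
- Google TPU: The dense systolic array (MXU) does not skip individual zeros, so it benefits mainly from structured pruning that shrinks matrix dimensions
- Custom accelerators: Dedicated sparse matrix engines; automotive parts such as NVIDIA Orin inherit 2:4 support from their Ampere-based GPU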
5. Inference Sparsity Acceleration
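2:4 Structured Sparsity (NVIDIA): In every contiguous group of 4 weights, at least 2 must be zero
- Hardware support in sparse tensor cores
- Up to 2x math throughput on supported matrix multiplies; end-to-end speedup is usually lower
- Accuracy loss is typically small (often < 0.5% after fine-tuning), but model dependent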
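The 2:4 pattern itself is easy to produce in NumPy, as sketched below. This covers only the pruning mask, under the assumption that groups run along the last axis; the actual speedup depends on vendor libraries and hardware, which this sketch does not touch. The function name is hypothetical.

```python
import numpy as np

def prune_2_of_4(w):
    """Keep the 2 largest-magnitude weights in each contiguous group of 4
    along the last axis and zero the other 2."""
    assert w.shape[-1] % 4 == 0, "last dimension must be a multiple of 4"
    groups = w.reshape(-1, 4)
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]   # two smallest per group
    mask = np.ones(groups.shape, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return np.where(mask, groups, 0.0).reshape(w.shape)

w = np.array([[0.1, -0.9, 0.4, 0.05, 2.0, -0.3, 0.2, 1.0]])
w_24 = prune_2_of_4(w)
print(w_24)
```

Every group of four ends up with exactly two nonzeros, so the sparsity is fixed at 50% regardless of how the magnitudes are distributed.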
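Dynamic Sparsity: Activations are sparse at run time (ReLU outputs are exactly zero for negative inputs), so hardware can skip the work for zero activations; unlike weight sparsity, the pattern changes with every input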
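A small sketch of activation skipping, assuming a single matrix-vector product after a ReLU. The function name and sizes are illustrative; real accelerators do this gating in hardware rather than by gathering columns.

```python
import numpy as np

def matvec_skip_zero_activations(w, x):
    """Compute w @ x using only the columns of w whose activation is nonzero.
    Returns the result and the number of activations actually used."""
    active = np.flatnonzero(x)
    return w[:, active] @ x[active], len(active)

rng = np.random.default_rng(0)
x = np.maximum(rng.normal(size=128), 0.0)   # ReLU output: negatives become exact zeros
w = rng.normal(size=(32, 128))

y, n_active = matvec_skip_zero_activations(w, x)
print(f"used {n_active} of {x.size} activations")
```

The set of active columns has to be recomputed for every input, which is why dynamic sparsity needs run-time support rather than an offline-pruned model.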
6. Real-World Sparsity Examples
MobileNet Pruning:
- Original: 4.2 MB, 100 ms latency on phone
- 80% sparse: 0.8 MB (5x smaller, before index overhead), 40 ms latency (2.5x faster)
- Accuracy: 70% vs. 72% original (2-point drop)
BERT Pruning for Inference:
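- Original: 340M params (BERT-Large), about 1.36 GB in FP32 (4 bytes per parameter)
- 80% sparse: 68M nonzero params, about 272 MB of FP32 values plus index overhead
- Speedup: 3-4x, provided the runtime has sparse-aware kernels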
- Accuracy: Minimal impact on downstream tasks
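The storage arithmetic is worth checking by hand. The sketch below assumes 4 bytes per FP32 weight and, as a rough illustrative assumption, 2 bytes of index metadata per stored nonzero; real sparse formats differ in their overhead.

```python
def dense_bytes(n_params, bytes_per_weight=4):
    """Storage for a dense tensor: every parameter is stored."""
    return n_params * bytes_per_weight

def sparse_bytes(n_params, sparsity, bytes_per_weight=4, bytes_per_index=2):
    """Storage for nonzeros only, plus an assumed per-nonzero index cost."""
    nnz = round(n_params * (1.0 - sparsity))
    return nnz * (bytes_per_weight + bytes_per_index)

n = 340_000_000                       # parameter count from the example above
dense = dense_bytes(n)
values_only = sparse_bytes(n, 0.80, bytes_per_index=0)
with_index = sparse_bytes(n, 0.80)

print(f"dense FP32:          {dense / 1e6:.0f} MB")
print(f"80% sparse, values:  {values_only / 1e6:.0f} MB")
print(f"80% sparse, + index: {with_index / 1e6:.0f} MB")
```

With these assumptions the index overhead eats a visible share of the saving, which is one reason measured compression is lower than the raw sparsity ratio suggests.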
7. Sparsity Production Checklist
- ✅ Profile baseline: Measure accuracy, latency, memory before pruning
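- ✅ Choose pruning method: Magnitude pruning plus fine-tuning is the safest starting point
- ✅ Prefer structured sparsity for efficiency: Runs fast on standard hardware without sparse kernels
- ✅ Fine-tune after pruning: Recover lost accuracy
- ✅ Measure speedup on target hardware: Simulated or theoretical speedup ≠ real-device speedup
- ✅ Consider sparsity-aware training: Apply pruning gradually during training rather than only afterwards
- ✅ Test on production hardware: Confirm that both the speedup and the accuracy hold end to end under real workloads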
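For the measurement items, a minimal timing harness might look like the sketch below. The helper name, warmup count, and repeat count are arbitrary assumptions, and NumPy on a CPU is used only as a stand-in for the real target device.

```python
import statistics
import time

import numpy as np

def median_latency_ms(fn, warmup=3, repeats=20):
    """Median wall-clock latency of fn() in milliseconds, after a few warmup calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))
x = rng.normal(size=256)
w_pruned = np.where(np.abs(w) < 1.0, 0.0, w)   # zeros stored in a dense array

dense_ms = median_latency_ms(lambda: w @ x)
pruned_ms = median_latency_ms(lambda: w_pruned @ x)
print(f"dense: {dense_ms:.4f} ms, pruned (dense kernel): {pruned_ms:.4f} ms")
```

A dense kernel still multiplies by the stored zeros, so the two timings are generally expected to be close; a speedup shows up only once the runtime uses a sparse format or hardware that exploits it.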
Next (Day 10): Processor architectures (Apple, Google, NVIDIA comparison).