AI Chip Day 9 Enhanced — Sparsity & Pruning

1. Why Sparsity Matters

Key insight: Trained neural networks are over-parameterized: many weights are near zero and can be pruned to exactly zero with little accuracy loss.

ResNet-50 layer: about 70% zeros after pruning and fine-tuning
BERT layer: 50-80% zeros after pruning
LLaMA model: around 50% zeros with small quality loss; 90%+ typically costs substantial accuracy

Benefit: Skip multiplications by zero (reduce compute, memory bandwidth, power)
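
As a rough illustration of the benefit, the sketch below counts zeros in a weight matrix and the multiply-accumulates (MACs) an ideal zero-skipping engine would avoid. The helper names and the toy threshold are assumptions for illustration; real speedups depend on the hardware.

```python
import numpy as np

def sparsity(w):
    """Fraction of entries that are exactly zero."""
    return float(np.mean(w == 0))

def dense_and_nonzero_macs(w, batch=1):
    """MACs for y = x @ w.T, with and without skipping zero weights."""
    dense = batch * w.size
    nonzero = batch * int(np.count_nonzero(w))
    return dense, nonzero

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128))
w[np.abs(w) < 1.0] = 0.0  # toy pruning: zero out small-magnitude weights

dense, nonzero = dense_and_nonzero_macs(w)
print(f"sparsity={sparsity(w):.2f}, ideal MAC reduction={dense / nonzero:.2f}x")
```

The "ideal" ratio is an upper bound: it ignores index storage and irregular memory access.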

2. Pruning Methods

Magnitude Pruning

Simple: Set weights whose magnitude falls below a threshold to zero (small magnitude is used as a proxy for low importance)

Structured: Remove entire filters/channels
Unstructured: Remove individual weights
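
A minimal NumPy sketch of both variants. The function names and the L1-norm filter ranking are assumptions chosen for illustration, not a specific library's API.

```python
import numpy as np

def prune_unstructured(w, sparsity):
    """Zero the `sparsity` fraction of individual weights with smallest |w|."""
    k = int(round(sparsity * w.size))
    flat = w.flatten()                     # flatten() always returns a copy
    drop = np.argsort(np.abs(flat))[:k]    # indices of the k smallest magnitudes
    flat[drop] = 0.0
    return flat.reshape(w.shape)

def prune_filters(w, sparsity):
    """Zero whole output filters (axis 0) with the smallest L1 norm."""
    n_out = w.shape[0]
    k = int(round(sparsity * n_out))
    norms = np.abs(w).reshape(n_out, -1).sum(axis=1)
    out = w.copy()
    out[np.argsort(norms)[:k]] = 0.0
    return out

rng = np.random.default_rng(0)
conv_w = rng.normal(size=(8, 4, 3, 3))     # (out_channels, in_channels, kH, kW)
sparse_w = prune_unstructured(conv_w, 0.5)
slim_w = prune_filters(conv_w, 0.25)
print(np.count_nonzero(sparse_w), np.count_nonzero(slim_w))
```

In the structured case the zeroed filters can be physically removed from the tensor, which is why no special hardware is needed.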

Magnitude + Fine-Tuning

Prune, then retrain to recover accuracy

Iterative Pruning:

1. Train model to convergence (FP32)
2. Prune X% smallest weights
3. Fine-tune for Y epochs (lr = 0.1 * original)
4. Repeat until target sparsity is reached or accuracy loss is no longer acceptable

Result: ResNet-50 → 80% sparse, <0.5% accuracy loss
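
The loop can be sketched on a toy problem. Here a noise-free linear regression stands in for the network, and the sizes, learning rate, and the 20%-per-round pruning fraction are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 32))
true_w = np.zeros(32)
true_w[:8] = rng.normal(size=8)            # only 8 of 32 weights matter
y = X @ true_w

def train(w, mask, lr, steps):
    """Gradient descent on mean squared error; pruned weights stay at zero."""
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = (w - lr * grad) * mask
    return w

def mse(w):
    return float(np.mean((X @ w - y) ** 2))

base_lr = 0.05
mask = np.ones(32)
w = train(rng.normal(size=32) * 0.1, mask, base_lr, steps=500)   # step 1: train

target_sparsity, prune_frac = 0.5, 0.2
while 1.0 - mask.mean() < target_sparsity:                       # step 4: repeat
    alive = np.flatnonzero(mask)
    k = max(1, int(prune_frac * alive.size))
    drop = alive[np.argsort(np.abs(w[alive]))[:k]]               # step 2: prune smallest
    mask[drop] = 0.0
    w = train(w * mask, mask, 0.1 * base_lr, steps=200)          # step 3: fine-tune
    print(f"sparsity={1.0 - mask.mean():.2f}  mse={mse(w):.2e}")
```

Pruning a fraction of the remaining weights each round means the sparsity grows in shrinking steps, which gives fine-tuning a chance to recover between rounds.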

Lottery Ticket Hypothesis

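Insight: Dense networks contain sparse subnetworks that, when reset to their original initialization, train in isolation to accuracy comparable to the full network

Process: Train the dense network, prune the smallest-magnitude weights, rewind the surviving weights to their initial values, then retrain only those ("winning tickets")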
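
The lottery-ticket procedure is usually described as: train, build a magnitude mask from the trained weights, then reset the surviving weights to their initial values before retraining. A minimal sketch of the mask-and-rewind step, where `winning_ticket` is a hypothetical helper and the "trained" weights are simulated:

```python
import numpy as np

def winning_ticket(w_init, w_trained, sparsity):
    """Build a mask from TRAINED magnitudes, apply it to the INITIAL weights."""
    k = int(round(sparsity * w_trained.size))
    order = np.argsort(np.abs(w_trained), axis=None)
    mask = np.ones(w_trained.size, dtype=bool)
    mask[order[:k]] = False                # drop the k smallest trained weights
    mask = mask.reshape(w_trained.shape)
    return mask, w_init * mask

rng = np.random.default_rng(0)
w_init = rng.normal(size=(16, 16))
w_trained = w_init + rng.normal(size=(16, 16))   # stand-in for a real training run

mask, ticket = winning_ticket(w_init, w_trained, sparsity=0.8)
print(f"kept {int(mask.sum())} of {mask.size} weights")
# Retraining would start from `ticket` and keep the mask fixed.
```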

3. Structured vs Unstructured Sparsity

Type	What's Removed	Hardware	Speedup	Challenge
Unstructured	Individual weights	Complex (sparse matrix ops)	2-4x possible	Needs special hardware
Structured (Channel)	Entire filters	Simple (skip filters)	1.5-2x	May degrade accuracy
Block Sparsity	Blocks of weights	Medium (regular pattern)	2-3x	Balance complexity/speedup
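
Block sparsity from the table can be sketched as magnitude pruning at block granularity: rank fixed-size tiles by norm and zero the weakest. The 4x4 block size and the function name are assumptions for illustration.

```python
import numpy as np

def prune_blocks(w, block, sparsity):
    """Zero the `sparsity` fraction of (br x bc) blocks with smallest L2 norm."""
    rows, cols = w.shape
    br, bc = block
    assert rows % br == 0 and cols % bc == 0, "shape must divide into blocks"
    tiles = w.reshape(rows // br, br, cols // bc, bc)
    norms = np.sqrt((tiles ** 2).sum(axis=(1, 3)))      # one norm per block
    k = int(round(sparsity * norms.size))
    keep = np.ones(norms.size, dtype=bool)
    keep[np.argsort(norms, axis=None)[:k]] = False
    keep = keep.reshape(norms.shape)
    # Expand the per-block decision back to per-weight resolution.
    mask = np.repeat(np.repeat(keep, br, axis=0), bc, axis=1)
    return w * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))
w_blocked = prune_blocks(w, block=(4, 4), sparsity=0.5)
print(np.count_nonzero(w_blocked))
```

Because zeros come in whole tiles, a kernel only needs one index per block rather than one per weight.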

4. Hardware Support for Sparsity

Challenge: Unstructured sparsity requires complex sparse matrix multiply hardware
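
To see where the complexity comes from, here is a minimal sketch of a CSR (compressed sparse row) matrix-vector product. The helper names are assumptions; the point is the data-dependent gather on `x`, which dense hardware does not need.

```python
import numpy as np

def to_csr(w):
    """Convert a 2-D array to CSR arrays: values, column indices, row pointers."""
    values, col_idx, row_ptr = [], [], [0]
    for row in w:
        nz = np.flatnonzero(row)
        values.extend(row[nz])
        col_idx.extend(nz)
        row_ptr.append(len(values))
    return (np.array(values, dtype=float),
            np.array(col_idx, dtype=np.int64),
            np.array(row_ptr, dtype=np.int64))

def csr_matvec(values, col_idx, row_ptr, x):
    """y = W @ x using only the stored nonzeros."""
    y = np.zeros(len(row_ptr) - 1)
    for r in range(len(y)):
        start, end = row_ptr[r], row_ptr[r + 1]
        # Gather: x is read at irregular, data-dependent positions.
        y[r] = values[start:end] @ x[col_idx[start:end]]
    return y

rng = np.random.default_rng(0)
w = rng.normal(size=(6, 10))
w[rng.random(w.shape) < 0.7] = 0.0         # roughly 70% unstructured zeros
x = rng.normal(size=10)
y = csr_matvec(*to_csr(w), x)
print(np.allclose(y, w @ x))
```

Each row has a different number of nonzeros and a different access pattern, which is what makes load balancing and memory access hard in hardware.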

Solutions:

NVIDIA Ampere (A100): Structured sparsity (2:4) in tensor cores
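Google TPU: Dense BF16 systolic array for matrix multiply (it does not skip zero weights); SparseCore units in TPU v4 target sparse embedding lookups
Custom accelerators: Dedicated sparse matrix engines; NVIDIA Orin (automotive) inherits 2:4 structured sparsity from its Ampere-based GPU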

5. Inference Sparsity Acceleration

2:4 Structured Sparsity (NVIDIA): In every group of 4 consecutive weights, at least 2 must be zero

Hardware support in tensor cores
Up to 2x tensor-core math throughput with the structured pattern (end-to-end speedup is usually lower)
Modest accuracy loss (< 0.5%) after fine-tuning
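
A minimal NumPy sketch of imposing the 2:4 pattern by magnitude: in each group of 4 consecutive weights, keep the 2 largest and zero the rest. This only produces the mask; it is not NVIDIA's tooling, and `prune_2_4` is a hypothetical name.

```python
import numpy as np

def prune_2_4(w):
    """Keep the 2 largest-magnitude weights in each group of 4 along the last axis."""
    assert w.shape[-1] % 4 == 0, "last dimension must be a multiple of 4"
    groups = w.reshape(-1, 4)
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]   # 2 smallest per group
    mask = np.ones(groups.shape, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return np.where(mask, groups, 0.0).reshape(w.shape)

w = np.array([[0.1, -2.0, 0.3, 4.0, 1.0, -0.2, 0.05, -3.0]])
print(prune_2_4(w))
```

The fixed 2-of-4 pattern lets hardware store only the kept values plus a small per-value position index, which is what makes it cheap to support compared with unstructured sparsity.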

Dynamic Sparsity: Activations are often sparse (ReLU outputs zero for negative inputs), so hardware can skip zero activations at run time
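
A small sketch of activation skipping, assuming a single fully connected layer with random data: only the weight columns that meet a nonzero activation contribute to the output, so the rest can be skipped without changing the result.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.maximum(rng.normal(size=256), 0.0)   # ReLU output: negatives become zero
w = rng.normal(size=(64, 256))

active = np.flatnonzero(x)                  # positions known only at run time
y_skip = w[:, active] @ x[active]           # multiply only where x != 0
y_dense = w @ x

print(f"active activations: {active.size}/{x.size}")
print(np.allclose(y_skip, y_dense))
```

Unlike weight sparsity, the zero positions change with every input, so the skipping logic has to run in hardware at inference time.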

6. Real-World Sparsity Examples

MobileNet Pruning:

Original: 4.2MB, 100ms latency on phone
80% sparse: 0.8MB (5x smaller), 40ms latency (2.5x faster)
Accuracy: 70% (vs 72% original, a 2-point drop)

BERT Pruning for Inference:

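Original: 340M params (BERT-Large), about 1.36GB in FP32 (340MB would require INT8)
80% sparse: 68M nonzero params, about 272MB in FP32 plus sparse-index overhead
Speedup: 3-4x, given a runtime or hardware that exploits the sparsity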
Accuracy: Minimal impact on downstream tasks
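
Size estimates like these are parameter count times bytes per weight, plus whatever the sparse format spends on indices. A worked sketch, where the 16-bit index per nonzero is an assumption for illustration (real formats vary):

```python
def dense_mb(params, bytes_per_weight):
    """Dense model size in MB (1 MB = 1e6 bytes)."""
    return params * bytes_per_weight / 1e6

def sparse_mb(params, sparsity, bytes_per_weight, index_bits=0):
    """Size of the nonzero values plus `index_bits` of position data per nonzero."""
    nonzero = params * (1.0 - sparsity)
    return nonzero * (bytes_per_weight + index_bits / 8) / 1e6

params = 340e6
print(f"FP32 dense:            {dense_mb(params, 4):.0f} MB")
print(f"INT8 dense:            {dense_mb(params, 1):.0f} MB")
print(f"FP32, 80% sparse:      {sparse_mb(params, 0.8, 4):.0f} MB")
print(f"FP32, 80% + 16b index: {sparse_mb(params, 0.8, 4, index_bits=16):.0f} MB")
```

Index overhead is why the on-disk saving from unstructured sparsity is smaller than the raw zero fraction suggests.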

7. Sparsity Production Checklist

✅ Profile baseline: Measure accuracy, latency, memory before pruning
✅ Choose pruning method: Magnitude + fine-tune is the safest default
✅ Prefer structured sparsity for efficiency: Regular patterns map to simpler hardware and run faster
✅ Fine-tune after pruning: Recover lost accuracy
✅ Measure speedup on target hardware: Simulation ≠ real device
✅ Sparsity-aware training: Train with pruning in mind from start
✅ Test on production hardware: Confirm the measured speedup is actually realized in deployment
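
A minimal latency-measurement sketch for the "measure on target hardware" items. The helper name, matrix sizes, and the pruning threshold are assumptions. It times a dense matmul against the same matmul with about 80% of the weights zeroed but still stored densely.

```python
import statistics
import time

import numpy as np

def measure_latency_ms(fn, warmup=5, iters=30):
    """Median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 256))
w_dense = rng.normal(size=(256, 256))
w_pruned = np.where(np.abs(w_dense) > 1.28, w_dense, 0.0)   # roughly 80% zeros

dense_ms = measure_latency_ms(lambda: x @ w_dense)
pruned_ms = measure_latency_ms(lambda: x @ w_pruned)
print(f"dense={dense_ms:.3f} ms  pruned (dense storage)={pruned_ms:.3f} ms")
```

On a typical CPU the two timings tend to be similar, since a dense matmul kernel multiplies the zeros anyway. Zeros alone do not give a speedup unless the kernel or hardware exploits them.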

Next (Day 10): Processor architectures (Apple, Google, NVIDIA comparison).