1. Inference vs Training Dataflow
Inference: Read-only weights, forward pass only. Optimize for latency or throughput.
Key differences from training:
- No gradient computation (no backward pass)
- Weights are frozen (no updates during inference)
- Weights can stay resident in fast memory (load once, reuse for every request)
- Activations can be discarded once the layers that consume them have run (no backprop)
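The last two points can be made concrete with a small NumPy sketch. The layer shapes, the ReLU layers, and the function names `forward_inference` and `forward_training` are assumptions chosen for illustration; they are not taken from any particular framework.

```python
import numpy as np

def forward_inference(x, weights):
    """Forward pass that keeps only the current activation alive."""
    act = x
    for w in weights:                     # weights are only read, never updated
        act = np.maximum(act @ w, 0.0)    # the previous activation is no longer referenced
    return act

def forward_training(x, weights):
    """Same math, but every activation is kept for a later backward pass."""
    saved = [x]
    for w in weights:
        saved.append(np.maximum(saved[-1] @ w, 0.0))
    return saved[-1], saved

rng = np.random.default_rng(0)
weights = [rng.standard_normal((64, 64)).astype(np.float32) for _ in range(4)]
x = rng.standard_normal((8, 64)).astype(np.float32)

y_inf = forward_inference(x, weights)
y_train, saved = forward_training(x, weights)

# Inference needs at most one layer input plus one layer output at a time.
inference_peak_bytes = 2 * x.nbytes
# Training holds the input and all four layer outputs until backprop runs.
training_saved_bytes = sum(a.nbytes for a in saved)
print(inference_peak_bytes, training_saved_bytes)
```

Both functions return the same output; only the amount of activation memory that must stay live differs, and that gap grows with network depth.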
2. Memory Hierarchy for Inference
Typical Inference Accelerator Memory:
Input (Image)
↓
┌─────────────────────────┐
│ On-Chip SRAM (10 MB) │ ← Weights held here when they fit
│ │ (loaded once from DRAM)
└─────────────────────────┘
↓
┌─────────────────────────┐
│ Compute Unit │
│ (Systolic Array) │
└─────────────────────────┘
↓
Off-Chip DRAM (8-80 GB)
(backup for weights, activations)
Bandwidth:
- SRAM ↔ Compute: 1-10 TB/s (rarely the bottleneck)
- DRAM ↔ Chip: 100-900 GB/s (the usual bottleneck)
- Optimize: Keep weights in SRAM, minimize DRAM traffic
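To get a feel for why DRAM traffic matters, the following back-of-the-envelope sketch estimates how long it takes to move 10 MB of weights (the SRAM capacity in the diagram) at the low end of each bandwidth range. `transfer_time_ms` is a hypothetical helper, and the model deliberately ignores access latency and contention.

```python
def transfer_time_ms(num_bytes, bandwidth_gb_per_s):
    """Ideal transfer time in milliseconds at a sustained bandwidth."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3

weight_bytes = 10 * 1024**2                       # 10 MB of weights
dram_ms = transfer_time_ms(weight_bytes, 100)     # DRAM, low end: 100 GB/s
sram_ms = transfer_time_ms(weight_bytes, 1000)    # SRAM, low end: 1 TB/s

print(f"DRAM: {dram_ms:.4f} ms per full weight read")
print(f"SRAM: {sram_ms:.4f} ms per full weight read")
```

If the weights had to be re-read from DRAM for every inference, that transfer time would be paid on each request; keeping them in SRAM pays it once.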
3. Compute Dataflow Patterns
Weight Stationary
Load weights once, stream data through.
- Advantage: Minimize weight memory bandwidth
- Use case: Batch inference (weights same for all samples)
Output Stationary
Keep partial results in local memory, accumulate.
- Advantage: Minimize output bandwidth
- Use case: Real-time inference (low latency)
Input Stationary (Data Stationary)
Keep input data local, reuse across weight tiles.
- Advantage: Minimizes input activation bandwidth
- Use case: Convolutions (spatial reuse)
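The three patterns differ only in which operand stays put while the others stream past. A minimal sketch, assuming a plain matrix multiply `Y = X @ W` as the workload: the scalar loops stand in for a single processing element, and the function names are illustrative. Real accelerators tile these loops across many processing elements.

```python
import numpy as np

def weight_stationary(X, W):
    N, K = X.shape
    M = W.shape[1]
    Y = np.zeros((N, M))
    for k in range(K):
        for m in range(M):
            w = W[k, m]                  # weight fetched once and held
            for n in range(N):           # every input row streams past it
                Y[n, m] += X[n, k] * w
    return Y

def output_stationary(X, W):
    N, K = X.shape
    M = W.shape[1]
    Y = np.zeros((N, M))
    for n in range(N):
        for m in range(M):
            acc = 0.0                    # partial sum stays in a local register
            for k in range(K):
                acc += X[n, k] * W[k, m]
            Y[n, m] = acc                # each output is written exactly once
    return Y

def input_stationary(X, W):
    N, K = X.shape
    M = W.shape[1]
    Y = np.zeros((N, M))
    for n in range(N):
        for k in range(K):
            x = X[n, k]                  # input fetched once and held
            for m in range(M):           # reused against a whole row of weights
                Y[n, m] += x * W[k, m]
    return Y

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))
W = rng.standard_normal((4, 5))
results = [f(X, W) for f in (weight_stationary, output_stationary, input_stationary)]
print(all(np.allclose(r, X @ W) for r in results))
```

All three orderings compute the same result; what changes is which operand is fetched once per outer iteration and which ones are re-fetched in the inner loop.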
4. Latency Optimization
Goal: Minimize the time for a single inference (e.g., on-phone face recognition within 50ms).
- Pipeline layers (start next layer before previous finishes)
- Reduce batch size (trade throughput for latency)
- Optimize memory access (minimize stalls)
- Use lower precision (faster ops, less bandwidth)
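As one illustration of the lower-precision point, here is a minimal sketch of symmetric per-tensor INT8 quantization. This is one common scheme, not necessarily what any given accelerator or toolchain uses; `quantize_int8` and `dequantize` are hypothetical helper names.

```python
import numpy as np

def quantize_int8(x):
    """Map float values to int8 using a single scale for the whole tensor."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0   # avoid dividing by zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original float values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_error = float(np.max(np.abs(w - w_hat)))

print(f"bytes: {w.nbytes} -> {q.nbytes}")
print(f"max abs error: {max_error:.5f} (scale = {scale:.5f})")
```

The tensor shrinks by 4x, which cuts both storage and memory traffic, while the rounding error per value stays within about half of one quantization step. Whether that error is acceptable has to be checked against model accuracy.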
5. Throughput Optimization
Goal: Maximize inferences per second (e.g., a datacenter service handling 1000 inferences/sec).
- Large batch sizes (fill compute units)
- Maximize memory bandwidth utilization
- Balance compute and memory (roofline model, covered on Day 16)
- Parallelize across multiple chips (distributed inference)
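The link between batch size and the compute/memory balance can be sketched with a simple roofline calculation. The hardware numbers are assumptions for illustration: a 100 TOPS peak is made up, and 900 GB/s is the upper DRAM figure from section 2. The layer is a hypothetical 4096 × 4096 fully connected layer with 2-byte values.

```python
def arithmetic_intensity(batch, k, m, bytes_per_value=2):
    """Ops per byte moved for a (batch x k) @ (k x m) layer."""
    ops = 2 * batch * k * m                                 # one MAC = 2 ops
    weight_bytes = k * m * bytes_per_value                  # read once per batch
    activation_bytes = batch * (k + m) * bytes_per_value    # inputs in, outputs out
    return ops / (weight_bytes + activation_bytes)

def attainable_ops_per_s(peak_ops_per_s, bandwidth_bytes_per_s, intensity):
    """Roofline: capped by peak compute or by bandwidth x intensity."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * intensity)

PEAK = 100e12         # assumed peak compute, ops/s
BANDWIDTH = 900e9     # assumed DRAM bandwidth, bytes/s
ridge = PEAK / BANDWIDTH   # intensity at which the two limits meet

for batch in (1, 16, 256):
    ai = arithmetic_intensity(batch, 4096, 4096)
    perf = attainable_ops_per_s(PEAK, BANDWIDTH, ai)
    bound = "compute" if ai >= ridge else "memory"
    print(f"batch={batch:3d} intensity={ai:7.2f} ops/B "
          f"attainable={perf / 1e12:6.2f} TOPS ({bound}-bound)")
```

Under these assumptions a batch of 1 performs about one operation per byte and sits far below the ridge point, while a large batch reuses each weight enough to reach the compute roof.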
6. Batch Processing Strategy
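Throughput (samples/sec) = Batch_size / Batch_latency
Compute rate (MACs/sec) = (MACs_per_sample × Batch_size) / Batch_latency
Single sample (batch=1):
ResNet-50 inference: ~4B MACs
Latency: ~100ms on mobile (under-utilized accelerator)
Throughput: 10 samples/sec
Batch of 32 (datacenter):
ResNet-50 inference: 32 × 4B MACs = 128B MACs
Latency: ~1 second
Throughput: 32 samples/sec
Batch of 256 (max GPU utilization):
Total compute: 256 × 4B = 1,024B MACs (~1T MACs)
Latency: ~8 seconds
Throughput: 32 samples/sec, the same as batch 32
→ Throughput has plateaued: latency now grows in proportion to batch size, so the accelerator is already saturated at batch 32 and larger batches only add latency. These numbers alone do not show which limit was hit. Batching amortizes weight traffic, so the remaining ceiling is either compute or activation bandwidth; the roofline model tells which.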
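The arithmetic behind the three cases can be checked with a few lines of Python. The batch sizes and latencies are the figures quoted above; `throughput` is a hypothetical helper.

```python
def throughput(batch_size, batch_latency_s):
    """Samples completed per second when a whole batch finishes together."""
    return batch_size / batch_latency_s

cases = {1: 0.1, 32: 1.0, 256: 8.0}    # batch size -> batch latency in seconds
results = {b: throughput(b, t) for b, t in cases.items()}

for b, t in cases.items():
    print(f"batch={b:3d} latency={t:4.1f}s throughput={results[b]:5.1f} samples/s")
```

Every request in a batch waits for the whole batch to finish, so going from 32 to 256 raises per-request latency from about 1 second to about 8 seconds with no gain in samples per second.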
7. Real-World Inference Examples
Mobile Inference (Apple Neural Engine)
- Batch size: 1 (single image at a time)
- Latency requirement: < 100ms
- Models: MobileNet, SqueezeNet (small models)
- Power budget: < 2W
- Design: Everything optimized for latency, not throughput
Server Inference (NVIDIA GPU, Google TPU)
- Batch size: 32-256 (query batching)
- Latency SLA: < 100ms (end-to-end)
- Throughput: 1000+ inferences/second
- Models: ResNet, BERT, GPT (larger models)
- Power budget: up to ~700W per accelerator (acceptable at datacenter scale)
- Design: Optimized for throughput while meeting latency SLA
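"Optimized for throughput while meeting latency SLA" usually comes down to picking the largest batch that still fits the latency budget. A minimal sketch, assuming a made-up linear latency model (5 ms fixed overhead plus 0.5 ms per sample); in practice the latency curve would come from measurements on the target hardware. `largest_batch_within_sla` is a hypothetical helper.

```python
def largest_batch_within_sla(latency_ms_fn, sla_ms, candidates):
    """Return the largest candidate batch whose latency fits the SLA, or None."""
    feasible = [b for b in candidates if latency_ms_fn(b) <= sla_ms]
    return max(feasible) if feasible else None

def assumed_latency_ms(batch):
    # Hypothetical model: fixed launch overhead plus a per-sample cost.
    return 5.0 + 0.5 * batch

candidates = [1, 2, 4, 8, 16, 32, 64, 128, 256]
best = largest_batch_within_sla(assumed_latency_ms, sla_ms=100.0, candidates=candidates)
print(best, assumed_latency_ms(best))
```

With this model a batch of 128 takes 69 ms and fits the 100 ms SLA, while 256 takes 133 ms and does not. The budget should also leave room for queueing and network time, which this sketch ignores.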
8. Performance Metrics
| Metric | Definition | Inference Focus |
|---|---|---|
| Latency | Time for single inference | Critical for real-time apps |
| Throughput | Inferences per second (batch) | Critical for server/datacenter |
| Energy/inference | Joules per inference | Battery-powered devices |
| Model size | Bytes for all weights | Important for mobile (storage) |
| Precision | Bits per number (FP32, INT8) | Trade accuracy for speed/power |
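Two of these metrics follow directly from numbers already in these notes. The sketch below assumes roughly 25.6M parameters for ResNet-50 (an approximate, commonly quoted figure) and uses the mobile limits from section 7 (2 W, 100 ms) as upper bounds. Both helper names are hypothetical.

```python
def model_size_bytes(num_params, bits_per_weight):
    """Storage needed for the weights alone."""
    return num_params * bits_per_weight // 8

def energy_per_inference_j(avg_power_w, latency_s):
    """Energy = average power x time spent on one inference."""
    return avg_power_w * latency_s

RESNET50_PARAMS = 25_600_000    # approximate parameter count (assumption)

fp32_bytes = model_size_bytes(RESNET50_PARAMS, 32)
int8_bytes = model_size_bytes(RESNET50_PARAMS, 8)
energy_j = energy_per_inference_j(2.0, 0.100)

print(f"FP32: {fp32_bytes / 1e6:.1f} MB, INT8: {int8_bytes / 1e6:.1f} MB")
print(f"Energy upper bound: {energy_j:.2f} J per inference")
```

Under these assumptions the FP32 weights are about ten times larger than a 10 MB on-chip SRAM, which is one reason precision and model size appear as separate rows in the table.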
9. Inference Checklist
- ✅ Identify latency vs throughput target (mobile vs datacenter)
- ✅ Cache weights locally (minimize DRAM bandwidth)
- ✅ Pipeline execution (overlap compute and memory)
- ✅ Choose batch size wisely (affects latency, throughput, memory)
- ✅ Locate the workload on the roofline (memory-bound or compute-bound?)
- ✅ Lower precision where possible (e.g., INT8, after checking the accuracy loss is acceptable)
- ✅ Measure real performance (simulation ≠ silicon)
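For the measurement step, a minimal timing harness might look like the sketch below. It uses only the Python standard library; `benchmark` and `dummy_inference` are hypothetical names, and the dummy workload stands in for a real model call.

```python
import statistics
import time

def benchmark(fn, warmup=3, runs=20):
    """Time repeated calls to fn and report latency percentiles and throughput."""
    for _ in range(warmup):              # warm caches before timing
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    p99_index = min(len(samples) - 1, int(0.99 * len(samples)))
    return {
        "p50_ms": statistics.median(samples) * 1e3,
        "p99_ms": samples[p99_index] * 1e3,
        "throughput_per_s": runs / sum(samples),
    }

def dummy_inference():
    # Placeholder workload; replace with the real model invocation.
    return sum(i * i for i in range(10_000))

stats = benchmark(dummy_inference)
print(stats)
```

Reporting a tail percentile alongside the median matters for latency SLAs, since a good average can hide occasional slow requests.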
Next (Day 4): Training architecture and weight updates.