1. Latency vs Throughput
Latency: Time for single inference (milliseconds)
Throughput: Inferences per second (batch mode)
Mobile Example (Apple Neural Engine):
Single inference latency: 50ms (must be <100ms for real-time)
Batch size: 1 (no batching)
Throughput: 20 inferences/second
Datacenter Example (Google TPU):
Single inference latency: 10ms (with batching overhead)
Batch size: 256 (query batching)
Throughput: 25,000+ inferences/second
The same hardware can be tuned for either metric, but not for both at once: larger batches raise throughput at the cost of per-request latency.
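The two examples above follow from one relation: if a batch of B requests completes every T milliseconds, throughput is B × 1000 / T inferences per second. The sketch below checks the numbers quoted for the mobile and datacenter cases. The function name is illustrative, and the model assumes only one batch is in flight at a time.

```python
def throughput_per_second(batch_size: int, latency_ms: float) -> float:
    """Inferences per second when one batch completes every `latency_ms`.

    Assumes a single batch in flight (no overlap between batches).
    """
    if batch_size < 1 or latency_ms <= 0:
        raise ValueError("batch_size must be >= 1 and latency_ms must be > 0")
    return batch_size * 1000.0 / latency_ms


# Figures taken from the two examples above.
mobile = throughput_per_second(batch_size=1, latency_ms=50)
datacenter = throughput_per_second(batch_size=256, latency_ms=10)

print(f"Mobile:     {mobile:,.0f} inferences/second")      # 20
print(f"Datacenter: {datacenter:,.0f} inferences/second")  # 25,600
```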
2. Latency Optimization
Target: Mobile/real-time inference (< 100ms)
- Smaller models: MobileNet, SqueezeNet (fewer layers)
- Quantization: INT8 (faster, lower precision)
- Pipeline parallelism: Overlap layers
- Low batch size: Batch=1 (minimize latency)
- Specialized hardware: Single-cycle or fused operations (e.g. ReLU applied at almost no extra cost)
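To make the quantization item concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is one common scheme, not necessarily what any particular accelerator or framework uses; the function names and sample weights are illustrative, and a non-empty input is assumed.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127].

    Returns (codes, scale) such that value is approximately code * scale.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate float values from INT8 codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.0, 1.0]  # made-up sample weights
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)  # [50, -127, 0, 100]
print(f"scale={scale:.4f}, max round-trip error={max_error:.6f}")
```

The round-trip error is bounded by half a quantization step (scale / 2), which is the "lower precision" cost paid for the faster integer arithmetic.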
3. Throughput Optimization
Target: Datacenter batch inference (maximize QPS)
- Large models: ResNet-50, BERT (more compute)
- Large batch: Batch=32-256 (amortize kernel launch)
- High utilization: Fill all cores with work
- Memory bandwidth: Optimize data movement
- Multi-chip scaling: Distribute load across GPUs
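The "amortize kernel launch" point can be illustrated with a simple cost model: each batch pays a fixed overhead once, plus a per-sample cost. The overhead and per-sample times below are hypothetical values chosen for illustration, not measurements of any real device.

```python
def batch_latency_ms(batch_size, fixed_overhead_ms=4.0, per_sample_ms=0.5):
    """Time to process one batch: fixed overhead paid once, plus per-sample work.

    Both default costs are hypothetical.
    """
    return fixed_overhead_ms + per_sample_ms * batch_size


def batch_throughput(batch_size, fixed_overhead_ms=4.0, per_sample_ms=0.5):
    """Inferences per second with one batch in flight at a time."""
    latency = batch_latency_ms(batch_size, fixed_overhead_ms, per_sample_ms)
    return batch_size * 1000.0 / latency


for b in (1, 32, 256):
    print(f"batch={b:>3}: {batch_latency_ms(b):6.1f} ms/batch, "
          f"{batch_throughput(b):7.1f} inf/sec")
```

Under this model throughput approaches 1000 / per_sample_ms (2000 inf/sec with the defaults) as the batch grows, so the gain from batch=32 to batch=256 is much smaller than from batch=1 to batch=32, while latency keeps rising linearly.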
4. QoS (Quality of Service) Requirements
| Application | Latency SLA | Throughput | Optimization |
|---|---|---|---|
| Face recognition | < 50ms | Real-time only | Latency |
| Search ranking | < 100ms | 1000+ qps | Both |
| Image classification | < 200ms | 100+ inferences/sec | Throughput |
| Batch processing | None | Maximize | Throughput |
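As a rough sketch, the table can be encoded as data and checked against a measured operating point. The class and function names are illustrative, the measured values are hypothetical, and comparing the SLA against tail latency (rather than the mean) is an assumption the table does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QosRequirement:
    application: str
    max_latency_ms: Optional[float]  # None means no latency SLA
    min_throughput: Optional[float]  # None means no fixed throughput floor


# Rows from the table above.
REQUIREMENTS = [
    QosRequirement("Face recognition", 50, None),
    QosRequirement("Search ranking", 100, 1000),
    QosRequirement("Image classification", 200, 100),
    QosRequirement("Batch processing", None, None),
]


def meets_qos(req, tail_latency_ms, throughput):
    """True if the measured operating point satisfies both limits."""
    latency_ok = req.max_latency_ms is None or tail_latency_ms < req.max_latency_ms
    throughput_ok = req.min_throughput is None or throughput >= req.min_throughput
    return latency_ok and throughput_ok


# Hypothetical measurement for one batched configuration.
measured_latency_ms, measured_throughput = 150.0, 120.0
results = {r.application: meets_qos(r, measured_latency_ms, measured_throughput)
           for r in REQUIREMENTS}
for app, ok in results.items():
    print(f"{app:<22} {'OK' if ok else 'violates SLA'}")
```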
5. Batch Size Trade-offs
Small Batch (batch=1):
Latency: 50ms
Throughput: 20 inf/sec
Utilization: 10% (wastes compute)
Medium Batch (batch=16):
Latency: 150ms (3x higher)
Throughput: ~107 inf/sec (16 / 0.15s, about 5x better)
Utilization: 80%
Large Batch (batch=256):
Latency: 1 second (20x higher)
Throughput: 256 inf/sec (256 / 1s, about 13x better)
Utilization: 95%+
Strategy: Use asynchronous (dynamic) batching: collect requests for up to a fixed window (e.g. 50ms), and dispatch as soon as the batch fills or the window expires
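The batching strategy can be sketched as a small offline simulation: given request arrival times, close a batch when it reaches a maximum size or when the oldest request in it has waited the full window. This is a simplified model, not a production server; `form_batches`, `max_batch`, and `max_wait_ms` are illustrative names, and arrival times are assumed to be sorted.

```python
def form_batches(arrival_ms, max_batch=4, max_wait_ms=50.0):
    """Group sorted arrival times into (dispatch_time_ms, [arrivals]) batches.

    A batch is dispatched when it is full, or when its oldest request
    has waited `max_wait_ms`, whichever comes first.
    """
    batches = []
    current = []
    for t in arrival_ms:
        # The open batch timed out before this request arrived: dispatch it.
        if current and t > current[0] + max_wait_ms:
            batches.append((current[0] + max_wait_ms, current))
            current = []
        current.append(t)
        # The batch is full: dispatch immediately, without waiting out the window.
        if len(current) == max_batch:
            batches.append((t, current))
            current = []
    if current:  # flush the last partial batch at its deadline
        batches.append((current[0] + max_wait_ms, current))
    return batches


arrivals = [0, 10, 20, 30, 100, 170, 180]  # made-up arrival times in ms
batches = form_batches(arrivals)
for dispatch, members in batches:
    waits = [dispatch - t for t in members]
    print(f"dispatch at {dispatch:5.0f} ms, size={len(members)}, "
          f"max wait={max(waits):.0f} ms")
```

Under load the first batch fills and leaves after 30ms; when traffic is sparse, a request waits at most the 50ms window, so the added queueing delay is bounded.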
6. Architecture Tradeoff Examples
Choice 1: Core Count vs Clock Speed
- Few cores, high clock: Better single-threaded latency
- Many cores, lower clock: Better throughput, better performance per watt
Choice 2: Systolic Array Size
- Small (64×64): Low power, fits on mobile
- Large (256×256): Better throughput, more power
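A back-of-the-envelope sketch of the array-size trade-off: an R×C systolic array performs R×C multiply-accumulates per cycle, so peak throughput grows with area, but a small layer may leave most of a large array idle. The 1 GHz clock and the 96×96 weight matrix are hypothetical, and the utilization formula is a simplified tiling model that ignores dataflow and memory effects.

```python
import math


def peak_tops(rows, cols, clock_hz):
    """Peak tera-operations/second, counting one MAC as two operations."""
    return 2 * rows * cols * clock_hz / 1e12


def tile_utilization(k, n, rows, cols):
    """Fraction of array slots doing useful work for a k x n weight matrix.

    The matrix is padded up to a whole number of rows x cols tiles.
    """
    padded = math.ceil(k / rows) * rows * math.ceil(n / cols) * cols
    return (k * n) / padded


CLOCK_HZ = 1e9  # hypothetical clock frequency
for size in (64, 256):
    tops = peak_tops(size, size, CLOCK_HZ)
    util = tile_utilization(96, 96, size, size)  # hypothetical 96x96 layer
    print(f"{size}x{size}: peak {tops:7.3f} TOPS, "
          f"utilization on 96x96 weights: {util:.1%}")
```

In this model the 256×256 array has 16x the peak throughput, but on the small layer it is about 14% utilized versus about 56% for the 64×64 array, which is why peak numbers alone do not settle the choice.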
Choice 3: Precision Support
- FP32 only: Full precision, slow, high power
- INT8 only: Fast, low power, accuracy concerns
- Mixed: Best of both, added complexity
7. Measuring Performance
- Peak FLOPS: Theoretical hardware maximum (rarely reached on real workloads)
- Sustained throughput: Real benchmark on models
- Latency percentiles: p50 (median), p99 (tail)
- Cost per inference: $/inference or energy (joules)/inference
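Two of these metrics are easy to compute from raw measurements. The sketch below uses the nearest-rank definition of a percentile (one of several common definitions) and derives energy per inference from average power and throughput. The latency samples and the 5 W power figure are made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def energy_per_inference_joules(avg_power_watts, throughput_per_sec):
    """Joules per inference = watts / (inferences per second)."""
    return avg_power_watts / throughput_per_sec


# Made-up latency samples (ms): mostly ~50 ms with one slow outlier.
latencies_ms = [48, 50, 51, 49, 52, 50, 47, 53, 50, 180]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"p50 = {p50} ms, p99 = {p99} ms")  # p50 = 50 ms, p99 = 180 ms

# Hypothetical 5 W device sustaining 20 inferences/second.
print(f"{energy_per_inference_joules(5.0, 20.0):.2f} J/inference")  # 0.25
```

The single outlier leaves the median untouched but dominates p99, which is why tail percentiles are reported alongside the median.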
8. Design Tradeoff Checklist
- ✅ Define use case: Mobile, datacenter, or both?
- ✅ Latency requirements: SLA in milliseconds
- ✅ Throughput targets: Inferences/second or QPS
- ✅ Choose model size: Latency vs accuracy
- ✅ Plan batching: Single or multi-sample?
- ✅ Architecture: Systolic size, core count, precision
- ✅ Benchmark: Measure real performance on target hardware
Next (Day 14): Practical design decisions and integration.