AI Chip Day 13 Enhanced — Latency, Throughput & Tradeoffs

1. Latency vs Throughput

Latency: Time for single inference (milliseconds)

Throughput: Inferences per second (batch mode)

Mobile Example (Apple Neural Engine):

Single inference latency: 50ms (must be <100ms for real-time)

Batch size: 1 (no batching)

Throughput: 20 inferences/second

Datacenter Example (Google TPU):

Single inference latency: 10ms (with batching overhead)

Batch size: 256 (query batching)

Throughput: 25,000+ inferences/second

The same hardware can be optimized for either, but not for both equally.
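Both examples follow the same arithmetic: throughput is roughly batch size divided by per-batch latency, assuming only one batch is in flight at a time. A minimal sketch of that calculation (the function name is hypothetical, chosen for illustration):

```python
def throughput_per_second(batch_size: int, batch_latency_ms: float) -> float:
    """Inferences per second when batches are processed one at a time."""
    if batch_latency_ms <= 0:
        raise ValueError("batch_latency_ms must be positive")
    return batch_size * 1000.0 / batch_latency_ms


# Figures taken from the mobile and datacenter examples above.
mobile = throughput_per_second(batch_size=1, batch_latency_ms=50)
datacenter = throughput_per_second(batch_size=256, batch_latency_ms=10)
print(f"mobile: {mobile:.0f} inf/s, datacenter: {datacenter:.0f} inf/s")
```

With the numbers above this gives 20 inf/s for the mobile case and 25,600 inf/s for the datacenter case, matching the quoted 20 and 25,000+.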

2. Latency Optimization

Target: Mobile/real-time inference (< 100ms)

Smaller models: MobileNet, SqueezeNet (fewer parameters and operations)

Quantization: INT8 (faster, lower precision)

Pipeline parallelism: Overlap layers

Low batch size: Batch=1 (minimize latency)

Specialized hardware: Single-cycle operations (e.g., ReLU at effectively no extra cost)
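To make the INT8 item concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is one common scheme, not necessarily the one a given chip uses; the function names and sample weights are assumptions for illustration.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to the range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map INT8 codes back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]  # hypothetical sample values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per value is at most half the scale, which is the "lower precision" cost paid for the faster integer arithmetic.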

3. Throughput Optimization

Target: Datacenter batch inference (maximize QPS)

Large models: ResNet-50, BERT (more compute)

Large batch: Batch=32-256 (amortize kernel launch overhead)

High utilization: Fill all cores with work

Memory bandwidth: Optimize data movement

Multi-chip scaling: Distribute load across GPUs
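The amortization argument can be sketched with a simple linear cost model: each batch pays a fixed launch overhead plus a per-sample compute cost. The 5 ms overhead and 0.1 ms per sample below are assumed values for illustration, not measurements of any real device.

```python
def batch_latency_ms(batch_size, overhead_ms=5.0, per_sample_ms=0.1):
    """Latency of one batch: fixed launch overhead plus per-sample compute."""
    return overhead_ms + per_sample_ms * batch_size


def amortized_throughput(batch_size, overhead_ms=5.0, per_sample_ms=0.1):
    """Inferences per second with one batch in flight at a time."""
    return batch_size * 1000.0 / batch_latency_ms(batch_size, overhead_ms, per_sample_ms)


for batch_size in (1, 32, 256):
    latency = batch_latency_ms(batch_size)
    rate = amortized_throughput(batch_size)
    print(f"batch={batch_size:>3}  latency={latency:.1f} ms  throughput={rate:.0f} inf/s")
```

Under this model, larger batches always raise both throughput and latency, and throughput approaches 1000 / per_sample_ms as the fixed overhead is spread over more samples.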

4. QoS (Quality of Service) Requirements

Application           Latency SLA   Throughput            Optimization
Face recognition      < 50ms        Real-time only        Latency
Search ranking        < 100ms       1000+ QPS             Both
Image classification  < 200ms       100+ inferences/sec   Throughput
Batch processing      None          Maximize              Throughput
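One way to turn a latency SLA into a batching decision is to pick the largest batch size whose estimated latency still fits the SLA. The sketch below assumes a hypothetical latency model (20 ms fixed plus 0.5 ms per sample); in practice the latency function would come from benchmarks on the target hardware.

```python
def largest_batch_within_sla(sla_ms, latency_fn,
                             candidates=(1, 2, 4, 8, 16, 32, 64, 128, 256)):
    """Return the largest candidate batch size meeting the SLA, or None."""
    best = None
    for batch_size in sorted(candidates):
        if latency_fn(batch_size) <= sla_ms:
            best = batch_size
    return best


def estimated_latency_ms(batch_size):
    """Hypothetical latency model, for illustration only."""
    return 20.0 + 0.5 * batch_size


for app, sla_ms in [("Face recognition", 50), ("Search ranking", 100),
                    ("Image classification", 200)]:
    print(app, largest_batch_within_sla(sla_ms, estimated_latency_ms))
```

A result of None means no candidate batch size meets the SLA, which points to a smaller model or faster hardware rather than a batching change.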

5. Batch Size Trade-offs

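Small Batch (batch=1):

Latency: 50ms

Throughput: 20 inf/sec

Utilization: 10% (wastes compute)

Medium Batch (batch=16):

Latency: 150ms (3x higher)

Throughput: 320 inf/sec (16x better)

Utilization: 80%

Large Batch (batch=256):

Latency: 1 second (20x higher)

Throughput: 5000+ inf/sec (250x better)

Utilization: 95%+

Note: The medium and large throughput figures exceed batch size / latency (about 107 and 256 inf/sec), so they presumably assume several batches in flight at once, with latency including time spent queued.

Strategy: Use async batching (wait up to 50ms for a batch to fill, then dispatch; dispatch immediately if the batch fills sooner)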
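The async batching strategy can be illustrated with a small deterministic simulation over request arrival times. This is a sketch under stated assumptions: a batch opens when its first request arrives, and it is dispatched either when it is full or when the wait limit expires. The function name and sample arrivals are hypothetical.

```python
def form_batches(arrivals_ms, max_batch, max_wait_ms):
    """Group arrival timestamps into (dispatch_time_ms, requests) batches."""
    batches = []
    current = []
    for t in sorted(arrivals_ms):
        # The open batch timed out before this request arrived: dispatch it.
        if current and t > current[0] + max_wait_ms:
            batches.append((current[0] + max_wait_ms, current))
            current = []
        current.append(t)
        # The batch is full: dispatch immediately instead of waiting.
        if len(current) == max_batch:
            batches.append((t, current))
            current = []
    if current:
        # Leftover requests go out when their wait limit expires.
        batches.append((current[0] + max_wait_ms, current))
    return batches


batches = form_batches([0, 10, 20, 30, 200], max_batch=4, max_wait_ms=50)
for dispatch_ms, requests in batches:
    print(dispatch_ms, requests)
```

In this example the first four requests fill a batch and leave at 30 ms, while the lone request at 200 ms waits the full 50 ms. No request ever queues longer than max_wait_ms, which bounds the latency added by batching.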

6. Architecture Tradeoff Examples

Choice 1: Core Count vs Clock Speed

Few cores, high clock: Better single-threaded latency

Many cores, lower clock: Better throughput, lower power

Choice 2: Systolic Array Size

Small (64×64): Low power, fits on mobile

Large (256×256): Better throughput, more power
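A larger array only delivers more throughput if the workload fills it. The sketch below uses a simplified tiling model (an assumption, ignoring pipeline fill and drain): a rows × cols weight matrix is split into array-sized tiles, and utilization is the fraction of array cells holding real weights.

```python
import math


def array_utilization(rows, cols, array_size):
    """Fraction of systolic array cells doing useful work under simple tiling."""
    tiles = math.ceil(rows / array_size) * math.ceil(cols / array_size)
    return (rows * cols) / (tiles * array_size * array_size)


# Hypothetical 100x100 layer mapped onto the two array sizes above.
small = array_utilization(100, 100, 64)
large = array_utilization(100, 100, 256)
print(f"64x64: {small:.2f}, 256x256: {large:.2f}")
```

Under this model the small layer keeps about 61% of a 64×64 array busy but only about 15% of a 256×256 array, so the larger array pays off mainly on large layers or large batches.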

Choice 3: Precision Support

FP32 only: Full precision, slow, high power

INT8 only: Fast, low power, accuracy concerns

Mixed: Best of both, added complexity

7. Measuring Performance

Peak FLOPS: Hardware capability (not real-world)

Sustained throughput: Real benchmark on models

Latency percentiles: p50 (median), p99 (tail)

Cost per inference: $/inference or energy (joules)/inference
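Latency percentiles can be computed from measured samples with the nearest-rank method, sketched below. The sample latencies are made up for illustration; they show how a single slow request barely moves p50 but dominates p99.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]


latencies_ms = [12, 11, 13, 12, 14, 11, 12, 13, 95, 12]  # hypothetical measurements
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)
print(f"p50={p50} ms, p99={p99} ms, mean={mean} ms")
```

Here p50 is 12 ms and p99 is 95 ms, while the mean of 20.5 ms describes neither the typical request nor the tail, which is why SLAs are usually stated in percentiles.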

8. Design Tradeoff Checklist

✅ Define use case: Mobile, datacenter, or both?

✅ Latency requirements: SLA in milliseconds

✅ Throughput targets: Inferences/second or QPS

✅ Choose model size: Latency vs accuracy

✅ Plan batching: Single or multi-sample?

✅ Architecture: Systolic size, core count, precision

✅ Benchmark: Measure real performance on target hardware

Next (Day 14): Practical design decisions and integration.
