
Latency, Throughput & Tradeoffs

How AI chip design trades latency against throughput: latency optimization, throughput maximization, QoS requirements, and architectural decisions.

By EcrioniX · Published June 13, 2026 · ~3400 words · 10 min read

1. Latency vs Throughput

Latency: Time to complete a single inference, usually measured in milliseconds

Throughput: Inferences completed per second, usually raised by batching

Mobile Example (Apple Neural Engine):

  • Single inference latency: 50ms (must be <100ms for real-time)
  • Batch size: 1 (no batching)
  • Throughput: 20 inferences/second

Datacenter Example (Google TPU):

  • Single inference latency: 10ms (with batching overhead)
  • Batch size: 256 (query batching)
  • Throughput: 25,000+ inferences/second

The same hardware can be optimized for either, but not both equally.
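Both examples follow the same steady-state arithmetic: throughput is roughly the batch size divided by the time one batch takes. The sketch below reproduces the figures above, assuming a single batch is in flight at a time; the function name is illustrative, not part of any vendor API.

```python
def throughput_per_second(batch_size: int, batch_latency_ms: float) -> float:
    """Inferences per second when one batch completes every `batch_latency_ms`.

    Assumption: only one batch is in flight at a time (no overlap between batches).
    """
    if batch_size < 1 or batch_latency_ms <= 0:
        raise ValueError("batch_size must be >= 1 and latency must be positive")
    return batch_size * 1000.0 / batch_latency_ms


# Figures taken from the mobile and datacenter examples above.
mobile = throughput_per_second(batch_size=1, batch_latency_ms=50)
datacenter = throughput_per_second(batch_size=256, batch_latency_ms=10)
print(f"mobile: {mobile:.0f} inf/s, datacenter: {datacenter:.0f} inf/s")
```

With batch size 1 the only way to raise throughput is to cut latency; with batch size 256 the same 10ms window serves 256 requests.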

2. Latency Optimization

Target: Mobile/real-time inference (< 100ms)

  • Smaller models: MobileNet, SqueezeNet (fewer parameters and operations)
  • Quantization: INT8 (faster, lower precision)
  • Pipeline parallelism: Overlap execution of consecutive layers
  • Low batch size: Batch=1 (minimize latency)
  • Specialized hardware: Single-cycle operations (e.g., ReLU at effectively no extra cost)
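To make the quantization bullet concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is one common scheme, not necessarily what any particular chip or framework uses; the function names and sample weights are illustrative assumptions.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127].

    Assumption: `values` is a non-empty list of floats.
    """
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the INT8 codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]  # hypothetical weights for illustration
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"max round-trip error: {max_error:.5f}")
```

The round-trip error is bounded by half the scale, which is the "lower precision" cost paid for cheaper integer arithmetic.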

3. Throughput Optimization

Target: Datacenter batch inference (maximize QPS)

  • Large models: ResNet-50, BERT (more compute)
  • Large batch: Batch=32-256 (amortize kernel launch overhead)
  • High utilization: Fill all cores with work
  • Memory bandwidth: Optimize data movement
  • Multi-chip scaling: Distribute load across multiple accelerators
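The "amortize kernel launch" point can be sketched with a simple cost model: each batch pays a fixed launch overhead once, plus compute that grows with batch size. The overhead and per-sample values below are hypothetical numbers chosen for illustration, not measurements, and the model assumes compute scales linearly with batch size.

```python
def per_sample_ms(batch_size, launch_overhead_ms, compute_ms_per_sample):
    """Average time per sample when a fixed launch cost is shared by the batch.

    Assumption: compute time grows linearly with batch size.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    total_ms = launch_overhead_ms + batch_size * compute_ms_per_sample
    return total_ms / batch_size


# Hypothetical costs: 0.5 ms launch overhead, 0.125 ms of compute per sample.
costs = {b: per_sample_ms(b, 0.5, 0.125) for b in (1, 32, 256)}
for batch, cost in costs.items():
    print(f"batch={batch:>3}: {cost:.4f} ms per sample")
```

As the batch grows, the per-sample cost approaches the pure compute cost, which is why the gain from batching flattens out at large batch sizes.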

4. QoS (Quality of Service) Requirements

Application          | Latency SLA | Throughput          | Optimization
Face recognition     | < 50ms      | Real-time only      | Latency
Search ranking       | < 100ms     | 1000+ qps           | Both
Image classification | < 200ms     | 100+ inferences/sec | Throughput
Batch processing     | None        | Maximize            | Throughput
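The table can be turned into a simple pass/fail check. The sketch below encodes the four rows as data and tests a measured latency and throughput against them. Checking the SLA against p99 latency is an assumption (the table does not say which percentile applies), and the names `QosTarget`, `TARGETS`, and `meets_qos` are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QosTarget:
    max_latency_ms: Optional[float]  # None means no latency SLA
    min_throughput: Optional[float]  # per second; None means no fixed floor


# Values copied from the table above.
TARGETS = {
    "face_recognition": QosTarget(50, None),
    "search_ranking": QosTarget(100, 1000),
    "image_classification": QosTarget(200, 100),
    "batch_processing": QosTarget(None, None),
}


def meets_qos(app: str, p99_latency_ms: float, throughput: float) -> bool:
    """Return True if the measurements satisfy the application's targets.

    Assumption: the latency SLA is evaluated against p99 latency.
    """
    target = TARGETS[app]
    latency_ok = (target.max_latency_ms is None
                  or p99_latency_ms < target.max_latency_ms)
    throughput_ok = (target.min_throughput is None
                     or throughput >= target.min_throughput)
    return latency_ok and throughput_ok


print(meets_qos("search_ranking", p99_latency_ms=80, throughput=1200))
```

Search ranking is the hard case in the table because it has to pass both checks at once.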

5. Batch Size Trade-offs

Small Batch (batch=1):

  • Latency: 50ms
  • Throughput: 20 inf/sec
  • Utilization: 10% (wastes compute)

Medium Batch (batch=16):

  • Latency: 150ms (3x higher)
  • Throughput: 320 inf/sec (16x better)
  • Utilization: 80%

Large Batch (batch=256):

  • Latency: 1 second (20x higher)
  • Throughput: 5000+ inf/sec (250x better)
  • Utilization: 95%+

Strategy: Use async batching. Wait up to 50ms for a batch to fill, and dispatch it as soon as it is full or the wait expires.
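The async batching strategy can be sketched as a small simulation over request arrival times. This is not a production server: it only shows the dispatch rule (batch full, or maximum wait elapsed since the batch's first request). The function name, parameters, and arrival times are illustrative assumptions.

```python
def form_batches(arrival_ms, max_batch, max_wait_ms):
    """Group sorted arrival times into batches.

    A batch is dispatched when it holds `max_batch` requests or when
    `max_wait_ms` has passed since its first request, whichever comes first.
    Returns a list of (dispatch_time_ms, arrivals_in_batch) tuples.
    """
    batches = []
    current = []
    for t in arrival_ms:
        # The open batch timed out before this request arrived.
        if current and t > current[0] + max_wait_ms:
            batches.append((current[0] + max_wait_ms, current))
            current = []
        current.append(t)
        if len(current) == max_batch:
            batches.append((t, current))  # full: dispatch immediately
            current = []
    if current:
        # Leftover requests go out when their wait window closes.
        batches.append((current[0] + max_wait_ms, current))
    return batches


# Hypothetical arrivals: a burst of four requests, then two isolated ones.
batches = form_batches([0, 5, 10, 12, 100, 160], max_batch=4, max_wait_ms=50)
for dispatch_time, members in batches:
    print(f"dispatch at {dispatch_time} ms: {members}")
```

The burst fills a batch and leaves after 12ms, while isolated requests wait no longer than the 50ms cap, so the added queueing delay is bounded regardless of load.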

6. Architecture Tradeoff Examples

Choice 1: Core Count vs Clock Speed

  • Few cores, high clock: Better single-threaded latency
  • Many cores, lower clock: Better throughput, lower power

Choice 2: Systolic Array Size

  • Small (64×64): Low power, fits on mobile
  • Large (256×256): Better throughput, more power

Choice 3: Precision Support

  • FP32 only: Full precision, slow, high power
  • INT8 only: Fast, low power, accuracy concerns
  • Mixed: Best of both, added complexity
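The systolic array choice can be illustrated with back-of-the-envelope arithmetic. The sketch below estimates peak throughput for the two array sizes and how well a small layer fills each one. The 1 GHz clock and the 100×100 layer are assumptions for illustration, and the model ignores memory bandwidth and pipeline fill time.

```python
import math


def peak_tops(array_dim, clock_ghz):
    """Peak tera-operations per second for an array_dim x array_dim MAC array.

    Counts one multiply-accumulate as 2 operations.
    """
    macs_per_cycle = array_dim * array_dim
    return 2 * macs_per_cycle * clock_ghz * 1e9 / 1e12


def tile_utilization(rows, cols, array_dim):
    """Fraction of MAC cells doing useful work when a rows x cols
    weight matrix is tiled onto the array (simplified model)."""
    tiles = math.ceil(rows / array_dim) * math.ceil(cols / array_dim)
    return (rows * cols) / (tiles * array_dim * array_dim)


for dim in (64, 256):
    # Assumed 1 GHz clock and a hypothetical 100x100 layer.
    tops = peak_tops(dim, clock_ghz=1.0)
    util = tile_utilization(100, 100, dim)
    print(f"{dim}x{dim}: peak {tops:.3f} TOPS, utilization {util:.1%}")
```

Under these assumptions the larger array has 16 times the peak rate, but a small layer leaves most of it idle, which is one reason the larger array pays off mainly for throughput-oriented workloads.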

7. Measuring Performance

  • Peak FLOPS: Theoretical hardware maximum (rarely reached on real workloads)
  • Sustained throughput: Measured rate on real models
  • Latency percentiles: p50 (median), p99 (tail)
  • Cost per inference: $/inference or energy (joules)/inference
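Latency percentiles are straightforward to compute from raw measurements. The sketch below uses the nearest-rank method, which is one of several percentile definitions; the sample latencies are made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


# Hypothetical measurements: mostly fast, with one slow outlier.
latencies_ms = [12, 11, 13, 12, 14, 11, 12, 13, 95, 12]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"p50={p50} ms, p99={p99} ms")
```

A single slow request leaves the median untouched but dominates the tail, which is why SLAs are usually stated against a high percentile rather than the average.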

8. Design Tradeoff Checklist

  • Define use case: Mobile, datacenter, or both?
  • Latency requirements: SLA in milliseconds
  • Throughput targets: Inferences/second or QPS
  • Choose model size: Latency vs accuracy
  • Plan batching: Single or multi-sample?
  • Architecture: Systolic size, core count, precision
  • Benchmark: Measure real performance on target hardware

Next (Day 14): Practical design decisions and integration.