AI Chip Day 3 Enhanced — Inference Architecture & Dataflow

1. Inference vs Training Dataflow

Inference: Read-only weights, forward pass only. Optimize for latency or throughput.

Key differences from training:

No gradient computation (no backward pass)
Weights are frozen (no updates during inference)
Weight storage can be optimized for reads (weights can stay cached on chip for the lifetime of the model)
Activations can be discarded once the next layer has consumed them (no backprop needs them)
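
The activation point can be made concrete with a rough sketch. The tensor sizes below are made-up illustrative values, the function names are hypothetical, and the network is assumed to be a plain sequential chain with no skip connections:

```python
# Hypothetical tensor sizes in bytes along a sequential network:
# the input, followed by the output of each layer (illustrative values only).
activation_bytes = [800_000, 400_000, 200_000, 100_000]

def peak_activation_bytes_training(sizes):
    # Training keeps every tensor alive because the backward pass reads them.
    return sum(sizes)

def peak_activation_bytes_inference(sizes):
    # Inference only needs a layer's input and output at the same time
    # (assumes no skip connections that would keep older tensors alive).
    return max(a + b for a, b in zip(sizes, sizes[1:]))

train_peak = peak_activation_bytes_training(activation_bytes)
infer_peak = peak_activation_bytes_inference(activation_bytes)
print(train_peak, infer_peak)
```

Under these assumptions the inference peak is set by the largest adjacent pair of tensors, while the training peak grows with network depth.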

2. Memory Hierarchy for Inference

Typical Inference Accelerator Memory:

Input (Image)
      ↓
┌─────────────────────────┐
│ On-Chip SRAM (10 MB)    │ ← All weights here
│                         │   (cached from DRAM)
└─────────────────────────┘
      ↓
┌─────────────────────────┐
│ Compute Unit            │
│ (Systolic Array)        │
└─────────────────────────┘
      ↓
Off-Chip DRAM (8-80 GB)
(backup for weights, activations)

Bandwidth:
- SRAM ↔ Compute: 1-10 TB/s (no bottleneck)
- DRAM ↔ Chip: 100-900 GB/s (bandwidth bottleneck)
- Optimize: Keep weights in SRAM, minimize DRAM traffic
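
As a back-of-the-envelope check on these figures, the sketch below compares how long it takes to move a set of weights at the low end of each bandwidth range. The 8 MB model size is a hypothetical value chosen to fit in the 10 MB SRAM, and the helper names are illustrative:

```python
MB = 10**6
GB = 10**9
TB = 10**12

def fits_in_sram(weight_bytes, sram_bytes):
    # True when all weights can stay resident on chip.
    return weight_bytes <= sram_bytes

def stream_time_s(num_bytes, bandwidth_bytes_per_s):
    # Ideal transfer time, ignoring access latency and contention.
    return num_bytes / bandwidth_bytes_per_s

sram_capacity = 10 * MB
weights = 8 * MB  # hypothetical model, small enough to sit on chip

dram_time = stream_time_s(weights, 100 * GB)  # low end of the DRAM range
sram_time = stream_time_s(weights, 1 * TB)    # low end of the SRAM range

print(fits_in_sram(weights, sram_capacity))
print(f"DRAM: {dram_time * 1e6:.0f} us, SRAM: {sram_time * 1e6:.0f} us")
```

With these assumed numbers, re-reading the weights from DRAM on every inference costs about ten times as long as reading them from SRAM, which is the motivation for keeping them on chip.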

3. Compute Dataflow Patterns

Weight Stationary

Load weights once, stream data through.

Advantage: Minimize weight memory bandwidth
Use case: Batch inference (weights same for all samples)
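
A minimal loop-nest sketch of this idea, in plain Python rather than hardware terms. The function name and the load counter are illustrative assumptions; a real systolic array holds many weights at once instead of one:

```python
def weight_stationary_matmul(inputs, weights):
    """inputs: batch x K list of lists, weights: K x N list of lists."""
    batch, K, N = len(inputs), len(weights), len(weights[0])
    outputs = [[0] * N for _ in range(batch)]
    weight_loads = 0
    for k in range(K):
        for n in range(N):
            w = weights[k][n]        # fetched once, then held in place
            weight_loads += 1
            for b in range(batch):   # every sample streams past the held weight
                outputs[b][n] += inputs[b][k] * w
    return outputs, weight_loads

x = [[1, 2], [3, 4], [5, 6]]   # batch of 3 samples, 2 features each
w = [[1, 0, 2], [0, 1, 3]]     # 2 x 3 weight matrix
y, loads = weight_stationary_matmul(x, w)
print(y, loads)
```

Each weight is fetched once regardless of batch size, so the number of weight loads stays at K × N while the useful work grows with the batch.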

Output Stationary

Keep partial results in local memory, accumulate.

Advantage: Minimize output bandwidth (partial sums stay local; each output is written once, fully accumulated)
Use case: Real-time inference (low latency)
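
The same matrix multiply can be sketched with the accumulator held fixed instead. This is an illustrative loop ordering with hypothetical names, not a description of any particular chip:

```python
def output_stationary_matmul(inputs, weights):
    """inputs: batch x K list of lists, weights: K x N list of lists."""
    batch, K, N = len(inputs), len(weights), len(weights[0])
    outputs = [[0] * N for _ in range(batch)]
    output_writes = 0
    for b in range(batch):
        for n in range(N):
            acc = 0                  # partial sum stays in a local accumulator
            for k in range(K):
                acc += inputs[b][k] * weights[k][n]
            outputs[b][n] = acc      # written back once, fully reduced
            output_writes += 1
    return outputs, output_writes

x = [[1, 2], [3, 4], [5, 6]]   # batch of 3 samples, 2 features each
w = [[1, 0, 2], [0, 1, 3]]     # 2 x 3 weight matrix
y, writes = output_stationary_matmul(x, w)
print(y, writes)
```

Every output element is written exactly once, at the cost of re-reading inputs and weights for each accumulation.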

Data Stationary (also called Input Stationary)

Keep input data local, reuse across weight tiles.

Advantage: Minimize input bandwidth (each input is fetched once and reused for every output it contributes to)
Use case: Convolutions (spatial reuse)
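
For the convolution case, spatial reuse can be sketched with a 1-D example in which each input value is fetched once and applied to every output it overlaps. The function name and load counter are illustrative, and the operation is written in the cross-correlation form commonly used in neural networks:

```python
def input_stationary_conv1d(signal, kernel):
    """Valid 1-D convolution: out[o] = sum_j signal[o + j] * kernel[j]."""
    out_len = len(signal) - len(kernel) + 1
    outputs = [0] * out_len
    input_loads = 0
    for i, x in enumerate(signal):        # x is fetched once and held
        input_loads += 1
        for j, w in enumerate(kernel):    # reuse x for each output it touches
            o = i - j
            if 0 <= o < out_len:
                outputs[o] += x * w
    return outputs, input_loads

out, loads = input_stationary_conv1d([1, 2, 3, 4], [1, 2])
print(out, loads)
```

Interior inputs contribute to as many outputs as the kernel has taps, which is the reuse this dataflow exploits.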

4. Latency Optimization

Goal: Complete a single inference in as few milliseconds as possible (e.g., phone face recognition within 50 ms).

Pipeline layers (start next layer before previous finishes)
Reduce batch size (trade throughput for latency)
Optimize memory access (minimize stalls)
Use lower precision (faster ops, less bandwidth)
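
The benefit of pipelining layers can be estimated with a simple timing model. This sketch assumes each layer has its own hardware stage, every layer takes the same time per tile, and a tile depends only on the same tile from the previous layer; all three are simplifications, and the function names are hypothetical:

```python
def sequential_latency(num_layers, num_tiles, time_per_tile):
    # Each layer finishes all of its tiles before the next layer starts.
    return num_layers * num_tiles * time_per_tile

def pipelined_latency(num_layers, num_tiles, time_per_tile):
    # The next layer starts on a tile as soon as the previous layer emits it.
    # Standard pipeline fill-and-drain count: (stages + items - 1) steps.
    return (num_layers + num_tiles - 1) * time_per_tile

seq = sequential_latency(4, 8, 1.0)
pipe = pipelined_latency(4, 8, 1.0)
print(seq, pipe)
```

Under these assumptions a 4-layer network processing 8 tiles drops from 32 time units to 11.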

5. Throughput Optimization

Goal: Maximize inferences per second (e.g., a datacenter service sustaining 1000 inferences/sec).

Large batch sizes (fill compute units)
Maximize memory bandwidth utilization
Balance compute and memory (roofline model, covered Day 16)
Parallelize across multiple chips (distributed inference)
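
The balance between compute and memory can be previewed with the basic roofline bound: attainable performance is the smaller of peak compute and bandwidth times arithmetic intensity. The 100 TOPS peak below is a hypothetical figure, the bandwidth is the top of the DRAM range from section 2, and the function names are illustrative:

```python
def attainable_ops_per_s(peak_ops_per_s, bandwidth_bytes_per_s, ops_per_byte):
    # Roofline bound: limited by compute or by how fast data can be fed.
    return min(peak_ops_per_s, bandwidth_bytes_per_s * ops_per_byte)

def limiting_resource(peak_ops_per_s, bandwidth_bytes_per_s, ops_per_byte):
    if bandwidth_bytes_per_s * ops_per_byte < peak_ops_per_s:
        return "memory"
    return "compute"

peak = 100e12   # hypothetical accelerator: 100 TOPS
bw = 900e9      # 900 GB/s DRAM bandwidth

low = attainable_ops_per_s(peak, bw, 10)     # low reuse, e.g. batch of 1
high = attainable_ops_per_s(peak, bw, 500)   # high reuse, e.g. large batch
print(low, limiting_resource(peak, bw, 10))
print(high, limiting_resource(peak, bw, 500))
```

Larger batches raise arithmetic intensity because each weight fetched is reused across more samples, which moves a workload from the memory-bound side toward the compute-bound side.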

6. Batch Processing Strategy

Batch Throughput (samples/sec) = Batch_size / Total_latency

The numbers below are illustrative.

Single sample (batch=1):
  ResNet-50 inference: ~4B MACs
  Latency: ~100ms on mobile (under-utilized accelerator)
  Throughput: 10 samples/sec

Batch of 32 (datacenter):
  Total compute: 32 × 4B MACs = 128B MACs
  Latency: ~1 second
  Throughput: 32 samples/sec

Batch of 256 (max GPU utilization):
  Total compute: 256 × 4B MACs ≈ 1T MACs
  Latency: ~8 seconds
  Throughput: 32 samples/sec, the same as batch 32

→ Throughput has plateaued: the hardware is already saturated at batch 32 (in this example by memory bandwidth rather than compute), so a larger batch only adds latency.
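
The samples-per-second figures above follow from dividing batch size by batch latency. A small sketch that reproduces them, using the illustrative latencies from this section (the function name is hypothetical):

```python
def throughput_samples_per_s(batch_size, batch_latency_s):
    # Samples completed per second when one batch takes batch_latency_s.
    return batch_size / batch_latency_s

# batch size -> latency in seconds, taken from the example above
cases = {1: 0.1, 32: 1.0, 256: 8.0}
results = {b: throughput_samples_per_s(b, t) for b, t in cases.items()}

for batch, rate in results.items():
    print(f"batch={batch:>3}: {rate:.0f} samples/sec")
```

Going from batch 32 to batch 256 multiplies latency by eight without changing throughput, which is the signature of a saturated accelerator.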

7. Real-World Inference Examples

Mobile Inference (Apple Neural Engine)

Batch size: 1 (single image at a time)
Latency requirement: < 100ms
Models: MobileNet, SqueezeNet (small models)
Power budget: < 2W
Design: Everything optimized for latency, not throughput

Server Inference (NVIDIA GPU, Google TPU)

Batch size: 32-256 (query batching)
Latency SLA: < 100ms (end-to-end)
Throughput: 1000+ inferences/second
Models: ResNet, BERT, GPT (larger models)
Power budget: up to ~700W per accelerator (acceptable at datacenter scale)
Design: Optimized for throughput while meeting latency SLA
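
One way to think about "throughput while meeting latency SLA" is to pick the largest batch whose latency still fits the budget. The sketch below uses a hypothetical linear latency model (a fixed overhead plus a per-sample cost); real hardware would need to be measured, and all names and constants here are assumptions:

```python
def batch_latency_ms(batch_size, fixed_ms=5.0, per_sample_ms=2.0):
    # Hypothetical model: fixed launch overhead plus a linear per-sample cost.
    return fixed_ms + per_sample_ms * batch_size

def largest_batch_within_sla(sla_ms, candidates=(1, 2, 4, 8, 16, 32, 64, 128, 256)):
    # Return the largest candidate batch that meets the SLA, or None.
    feasible = [b for b in candidates if batch_latency_ms(b) <= sla_ms]
    return max(feasible) if feasible else None

best = largest_batch_within_sla(100.0)
print(best, batch_latency_ms(best))
```

With these made-up constants a 100 ms SLA admits a batch of 32 but not 64; the same search applies to any measured latency curve.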

8. Performance Metrics

Metric	Definition	Inference Focus
Latency	Time for single inference	Critical for real-time apps
Throughput	Inferences per second (batch)	Critical for server/datacenter
Energy/inference	Joules per inference	Battery-powered devices
Model size	Bytes for all weights	Important for mobile (storage)
Precision	Bits per number (FP32, INT8)	Trade accuracy for speed/power
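
Two of these metrics reduce to simple arithmetic, sketched below. The parameter count is a hypothetical round number, the 2 W and 50 ms values echo the mobile figures used earlier in these notes, and the function names are illustrative:

```python
def model_size_bytes(num_params, bits_per_weight):
    # Storage for the weights alone, ignoring metadata and padding.
    return num_params * bits_per_weight // 8

def energy_per_inference_j(power_w, latency_s):
    # Energy = average power x time, assuming constant power draw.
    return power_w * latency_s

params = 25_000_000                    # hypothetical model with 25M weights
fp32_size = model_size_bytes(params, 32)
int8_size = model_size_bytes(params, 8)
energy = energy_per_inference_j(2.0, 0.05)

print(fp32_size, int8_size, energy)
```

Moving from FP32 to INT8 cuts weight storage by four, and at a fixed power draw energy per inference scales directly with latency.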

9. Inference Checklist

✅ Identify latency vs throughput target (mobile vs datacenter)
✅ Cache weights locally (minimize DRAM bandwidth)
✅ Pipeline execution (overlap compute and memory)
✅ Choose batch size wisely (affects latency, throughput, memory)
✅ Apply the roofline model (is the workload memory-bound or compute-bound?)
✅ Lower precision where possible (e.g., INT8; check the accuracy trade-off)
✅ Measure real performance (simulation ≠ silicon)
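
To make the precision trade-off concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization. It is one simple scheme among several, and the function names and sample weights are hypothetical:

```python
def quantize_int8(values):
    # Symmetric per-tensor quantization: the largest magnitude maps to 127.
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    # Map the 8-bit integers back to approximate real values.
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.0, 0.731]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q, scale, max_error)
```

The rounding error per weight is at most half the scale, so tensors with a few large outliers lose more precision on their small values; this is the accuracy cost to check before committing to INT8.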

Next (Day 4): Training architecture and weight updates.