Inference Architecture

Complete inference accelerator design. Dataflow optimization, memory hierarchy, batch processing, and real-world performance patterns.

By EcrioniX · Published June 13, 2026 · ~4500 words · 13 min read

1. Inference vs Training Dataflow

Inference: Read-only weights, forward pass only. Optimize for latency or throughput.

Key differences from training:

2. Memory Hierarchy for Inference

Typical Inference Accelerator Memory:

Input (Image)
      ↓
┌─────────────────────────┐
│ On-Chip SRAM (10 MB)    │ ← All weights here
│                         │   (cached from DRAM)
└─────────────────────────┘
      ↓
┌─────────────────────────┐
│ Compute Unit            │
│ (Systolic Array)        │
└─────────────────────────┘
      ↓
Off-Chip DRAM (8-80 GB)
(backup for weights, activations)

Bandwidth:
- SRAM ↔ Compute: 1-10 TB/s (no bottleneck)
- DRAM ↔ Chip: 100-900 GB/s (bandwidth bottleneck)
- Optimize: Keep weights in SRAM, minimize DRAM traffic
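One rough way to see why the DRAM link matters is to compare the time spent moving a layer's weights with the time spent computing on them. The sketch below is a back-of-envelope model, not a simulator. The bandwidths come from the upper ends of the ranges above, while the peak compute rate (`PEAK_MACS_PER_S`) and the example layer are assumed values chosen only for illustration.

```python
def layer_time_s(macs, weight_bytes, bandwidth_bytes_per_s, peak_macs_per_s):
    """Return (compute_time, transfer_time, bound) for one layer.

    Roofline-style estimate: the layer is limited by whichever is slower,
    computing its MACs or moving its weights over the given link.
    """
    compute_time = macs / peak_macs_per_s
    transfer_time = weight_bytes / bandwidth_bytes_per_s
    bound = "compute" if compute_time >= transfer_time else "memory"
    return compute_time, transfer_time, bound


# Assumed example: a fully connected layer with 4096 x 4096 INT8 weights,
# run at batch size 1, so each weight is used for exactly one MAC.
MACS = 4096 * 4096
WEIGHT_BYTES = 4096 * 4096      # 1 byte per INT8 weight
PEAK_MACS_PER_S = 5e12          # hypothetical 5 TMAC/s accelerator

DRAM_BW = 900e9                 # 900 GB/s, upper end of the DRAM range
SRAM_BW = 10e12                 # 10 TB/s, upper end of the SRAM range

from_dram = layer_time_s(MACS, WEIGHT_BYTES, DRAM_BW, PEAK_MACS_PER_S)
from_sram = layer_time_s(MACS, WEIGHT_BYTES, SRAM_BW, PEAK_MACS_PER_S)

print("weights streamed from DRAM:", from_dram[2], "bound")
print("weights resident in SRAM:  ", from_sram[2], "bound")
```

Under these assumptions the same layer is memory bound when its weights stream from DRAM and compute bound when they already sit in SRAM, which is the case for keeping weights on chip.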

3. Compute Dataflow Patterns

Weight Stationary

Load weights once, stream data through.
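As a minimal sketch of this loop order, the function below computes a small matrix product with the weight loops on the outside, so each weight is fetched once and every input row streams past it. The function name, the `weight_loads` counter, and the toy matrices are illustrative assumptions. A real systolic array holds the weights in its processing elements rather than in a Python variable.

```python
def matmul_weight_stationary(x, w):
    """Compute y = x @ w with each weight loaded exactly once.

    x is M x K (one row per input sample), w is K x N (weights).
    Returns (y, weight_loads).
    """
    m_dim, k_dim, n_dim = len(x), len(w), len(w[0])
    y = [[0] * n_dim for _ in range(m_dim)]
    weight_loads = 0
    for k in range(k_dim):
        for n in range(n_dim):
            w_held = w[k][n]         # weight is fetched once and held locally
            weight_loads += 1
            for m in range(m_dim):   # every input row streams past the held weight
                y[m][n] += x[m][k] * w_held
    return y, weight_loads


x = [[1, 2], [3, 4], [5, 6]]   # 3 samples, 2 features each
w = [[1, 0, 2], [0, 1, 3]]     # 2 x 3 weight matrix
y, loads = matmul_weight_stationary(x, w)
print(y)      # [[1, 2, 8], [3, 4, 18], [5, 6, 28]]
print(loads)  # 6, one load per weight regardless of how many samples stream through
```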

Output Stationary

Keep partial results in local memory, accumulate.
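The same matrix product can be reordered so that the reduction loop is innermost. Each output element is then accumulated locally and written back once. This is a sketch only: the `acc` variable stands in for a local accumulator register, and the `output_writes` counter and example matrices are assumptions for illustration.

```python
def matmul_output_stationary(x, w):
    """Compute y = x @ w, finishing each output element before moving on.

    x is M x K, w is K x N. Returns (y, output_writes).
    """
    m_dim, k_dim, n_dim = len(x), len(w), len(w[0])
    y = [[0] * n_dim for _ in range(m_dim)]
    output_writes = 0
    for m in range(m_dim):
        for n in range(n_dim):
            acc = 0                      # partial sum stays in a local accumulator
            for k in range(k_dim):
                acc += x[m][k] * w[k][n]
            y[m][n] = acc                # single write-back per output element
            output_writes += 1
    return y, output_writes


x = [[1, 2], [3, 4], [5, 6]]
w = [[1, 0, 2], [0, 1, 3]]
y, writes = matmul_output_stationary(x, w)
print(y)
print(writes)  # 9, one write per element of the 3 x 3 output
```

The trade-off is that inputs and weights are both re-read for every output element, in exchange for never spilling partial sums to a larger memory.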

Data Stationary

Keep input data local, reuse across weight tiles.
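This pattern is often called input stationary in the accelerator literature. A minimal sketch, assuming the same small matrix product as a stand-in for a layer: each input value is fetched once and reused against a whole row of weights before the next input is loaded. The function name and the `input_loads` counter are hypothetical.

```python
def matmul_input_stationary(x, w):
    """Compute y = x @ w with each input value loaded exactly once.

    x is M x K, w is K x N. Returns (y, input_loads).
    """
    m_dim, k_dim, n_dim = len(x), len(w), len(w[0])
    y = [[0] * n_dim for _ in range(m_dim)]
    input_loads = 0
    for m in range(m_dim):
        for k in range(k_dim):
            x_held = x[m][k]         # input is fetched once and held locally
            input_loads += 1
            for n in range(n_dim):   # reused against every weight in row k
                y[m][n] += x_held * w[k][n]
    return y, input_loads


x = [[1, 2], [3, 4], [5, 6]]
w = [[1, 0, 2], [0, 1, 3]]
y, loads = matmul_input_stationary(x, w)
print(y)
print(loads)  # 6, one load per input value
```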

4. Latency Optimization

Goal: Complete a single inference in as few milliseconds as possible (e.g., phone face recognition within 50 ms).

5. Throughput Optimization

Goal: Process as many inferences per second as possible (e.g., a datacenter service sustaining 1000 inferences/sec).

6. Batch Processing Strategy

Throughput (samples/sec) = Batch_size / Total_latency
Compute rate (MACs/sec) = (Compute_per_sample × Batch_size) / Total_latency

Single sample (batch=1):
  ResNet-50 inference: ~4G MACs
  Latency: ~100 ms on mobile (under-utilized accelerator)
  Throughput: 10 samples/sec

Batch of 32 (datacenter):
  Total compute: 32 × 4G MACs = 128G MACs
  Latency: ~1 second
  Throughput: 32 samples/sec

Batch of 256 (max GPU utilization):
  Total compute: 256 × 4G MACs ≈ 1T MACs
  Latency: ~8 seconds
  Throughput: 32 samples/sec, the same as batch 32
  → Throughput is memory-bandwidth limited, not compute-limited
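The plateau is easy to check by dividing each batch size by its total latency. The sketch below uses only the batch sizes and latencies quoted above. The function name is an assumption, and the figures are the article's illustrative numbers, not measurements.

```python
def throughput_samples_per_s(batch_size, latency_s):
    """Samples completed per second when a whole batch takes `latency_s` seconds."""
    return batch_size / latency_s


# (batch size, total batch latency in seconds), taken from the figures above
scenarios = [(1, 0.1), (32, 1.0), (256, 8.0)]

for batch, latency in scenarios:
    rate = throughput_samples_per_s(batch, latency)
    print(f"batch={batch:>3}  latency={latency:>4.1f} s  throughput={rate:.0f} samples/s")
```

Going from batch 1 to batch 32 more than triples throughput, but going from 32 to 256 multiplies latency by eight and gains nothing. In practice the useful batch size is the smallest one that reaches the plateau.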

7. Real-World Inference Examples

Mobile Inference (Apple Neural Engine)

Server Inference (NVIDIA GPU, Google TPU)

8. Performance Metrics

Metric           | Definition                    | Inference Focus
Latency          | Time for single inference     | Critical for real-time apps
Throughput       | Inferences per second (batch) | Critical for server/datacenter
Energy/inference | Joules per inference          | Battery-powered devices
Model size       | Bytes for all weights         | Important for mobile (storage)
Precision        | Bits per number (FP32, INT8)  | Trade accuracy for speed/power
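Two of these metrics follow directly from simple arithmetic. Model size is parameter count times bits per weight, and energy per inference is average power times latency. The sketch below works through both. The parameter count is an approximate figure for ResNet-50, and the 2 W power draw is a hypothetical value used only for illustration.

```python
def model_size_mb(num_params, bits_per_weight):
    """Weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


def energy_per_inference_j(avg_power_w, latency_s):
    """Energy in joules = average power in watts x latency in seconds."""
    return avg_power_w * latency_s


RESNET50_PARAMS = 25_600_000   # approximate parameter count, assumed for illustration

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8)]:
    print(f"{name}: {model_size_mb(RESNET50_PARAMS, bits):.1f} MB")

# Hypothetical device drawing 2 W on average over a 50 ms inference
print(f"{energy_per_inference_j(2.0, 0.05):.2f} J per inference")
```

Moving from FP32 to INT8 cuts weight storage by a factor of four, which also reduces the DRAM traffic needed to load those weights.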

9. Inference Checklist

Next (Day 4): Training architecture and weight updates.