AI Chip Day 1 Enhanced — Fundamentals of AI Accelerators

1. What is an AI Accelerator?

An AI accelerator (or AI chip) is specialized hardware designed to perform neural network computations far faster and more efficiently than a general-purpose CPU.

Key insight: Neural networks are compute-intensive (trillions of operations), memory-intensive (gigabytes of weights), and parallelizable (matrix multiplications). CPUs are optimized for sequential logic. AI accelerators are optimized for parallel matrix math.

Why not use CPUs?

A CPU can run neural networks, but inefficiently:

CPU core @ 3 GHz: Can do ~3 billion operations per second (scalar, one per cycle, ignoring SIMD and multiple cores)
GPU @ 2 GHz: Can do ~10 trillion operations per second (matrix, with 5000+ cores)
TPU @ 1.5 GHz: Can do ~100+ trillion operations per second (optimized systolic arrays)
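To make the gap concrete, here is a minimal sketch that turns the rough rates above into ideal run times. The one-trillion-operation workload is a hypothetical figure chosen for illustration, and the model ignores memory stalls and utilization losses.

```python
# Rough throughput figures quoted above, in operations per second.
THROUGHPUT_OPS_PER_S = {
    "CPU (scalar, 3 GHz)": 3e9,
    "GPU (2 GHz, 5000+ cores)": 10e12,
    "TPU (1.5 GHz, systolic arrays)": 100e12,
}


def seconds_for(total_ops, ops_per_second):
    """Ideal run time: total work divided by peak rate."""
    return total_ops / ops_per_second


workload_ops = 1e12  # hypothetical workload: one trillion operations
for name, rate in THROUGHPUT_OPS_PER_S.items():
    print(f"{name}: {seconds_for(workload_ops, rate):.4f} s")
```

Real speedups are smaller than these ratios suggest, because no device sustains its peak rate on every layer.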

Accelerators achieve this through:

Parallelism: Thousands of cores working simultaneously
Specialized units: Matrix multiply engines (MAC arrays)
Memory bandwidth: Optimized hierarchy for weight reuse
Data types: Lower precision (INT8, BF16) instead of FP32 or FP64
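The lower-precision point can be illustrated with a small quantization sketch. This assumes symmetric per-tensor INT8 quantization, which is one common scheme and not necessarily what any particular chip uses; the function names and sample weights are made up for the example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the INT8 codes."""
    return [q * scale for q in quantized]


weights = [0.4, -1.0, 0.25, 0.0]  # hypothetical FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)
print([round(r, 4) for r in restored])
```

Each weight now fits in one byte instead of four, at the cost of a rounding error of at most half the scale per value.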

2. Inference vs. Training

Inference: Fast Prediction

Definition: Using a pre-trained model to make predictions on new data.

Characteristics:

Read-only weights: Pre-trained, don't change
One forward pass: Input → hidden layers → output
Low latency required: Milliseconds (for real-time, e.g., image recognition)
Throughput: Process many inputs (batches)

Example: Phone recognizes face (inference on NPU) in 50ms

Training: Learning Weights

Definition: Adjusting weights to minimize loss on labeled data.

Characteristics:

Forward + backward passes: Compute gradients for all weights
Weight updates: Gradient descent, momentum, optimization algorithms
High throughput required: Process thousands of images per second
Memory intensive: Store activations for backward pass

Example: Datacenter trains ResNet-50 on ImageNet (days on a few GPUs, hours on a large GPU cluster)

Inference vs Training Compute:

Inference:
  Forward pass = matrix multiply (weights × input)
  Cost = 2 × matrix_size operations per sample (one multiply and one add per MAC)

Training (per iteration):
  Forward pass = 2 × matrix_size operations
  Backward pass = 4 × matrix_size operations (2 gradient computations: one for the inputs, one for the weights)
  Total = 6 × matrix_size operations per sample

Training is roughly 3x as compute-intensive as inference per sample (rough estimate)
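The 2x / 4x / 6x accounting above can be written as a small helper. This is a sketch for a single fully connected layer, assuming each multiply-accumulate counts as two operations; `fc_layer_ops` is a hypothetical name, and optimizer updates are ignored.

```python
def fc_layer_ops(n_inputs, n_outputs, training=False):
    """Operation count per sample for one fully connected layer."""
    matrix_size = n_inputs * n_outputs
    forward = 2 * matrix_size  # one multiply and one add per weight
    if not training:
        return forward
    backward = 4 * matrix_size  # input gradients + weight gradients
    return forward + backward


inference_ops = fc_layer_ops(1000, 100)
training_ops = fc_layer_ops(1000, 100, training=True)
print(inference_ops, training_ops, training_ops / inference_ops)
```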

3. Compute Patterns in Neural Networks

Pattern 1: Matrix Multiplication (MAC)

The dominant operation (>90% of compute).

Matrix Multiplication: Output = Weight × Input

Fully Connected Layer (1000 inputs → 100 outputs):
  Output[0] = W[0,0]*I[0] + W[0,1]*I[1] + ... + W[0,999]*I[999]
  Output[1] = W[1,0]*I[0] + W[1,1]*I[1] + ... + W[1,999]*I[999]
  ...
  Total: 100 × 1000 = 100,000 multiply-accumulate operations

Systolic Array Advantage:
  Instead of 100,000 sequential operations,
  process 100 rows in parallel → 1,000 cycles instead of 100,000
  ~100x speedup through parallelism
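A plain-Python sketch of the same layer makes the MAC count and the idealized cycle count explicit. The cycle model assumes each parallel row performs one MAC per cycle and ignores pipeline fill and data-loading time, so it is an upper bound on the benefit.

```python
def fully_connected(weights, inputs):
    """Output[r] = sum over c of W[r][c] * I[c]; also returns the MAC count."""
    outputs, macs = [], 0
    for row in weights:
        acc = 0
        for w, x in zip(row, inputs):
            acc += w * x  # one multiply-accumulate
            macs += 1
        outputs.append(acc)
    return outputs, macs


def ideal_cycles(n_outputs, n_inputs, parallel_rows):
    """Cycles needed when `parallel_rows` rows each do one MAC per cycle."""
    row_groups = -(-n_outputs // parallel_rows)  # ceiling division
    return row_groups * n_inputs


print(fully_connected([[1, 2], [3, 4]], [5, 6]))
print(ideal_cycles(100, 1000, 1), ideal_cycles(100, 1000, 100))
```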

Pattern 2: Convolution

Used in image processing (CNNs). Sliding window of weights across spatial dimensions.

Compute: Similar to matrix multiply (can be rewritten as GEMM)
Data reuse: Weights are reused across spatial positions
Optimization: Winograd and FFT-based convolution (fewer multiplications for some kernel sizes)
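The rewrite of convolution as GEMM is often done with an "im2col" step: each sliding-window patch becomes one row of a matrix, and the convolution becomes a matrix-vector product. The sketch below assumes a single channel, stride 1, no padding, and the cross-correlation convention used by most deep learning frameworks; the function names are illustrative.

```python
def im2col(image, k):
    """Unroll every k x k patch of a 2D image into one row."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            rows.append([image[i + di][j + dj]
                         for di in range(k) for dj in range(k)])
    return rows


def conv2d_as_gemm(image, kernel):
    """Convolution expressed as (patch matrix) x (flattened kernel)."""
    k = len(kernel)
    flat_kernel = [v for row in kernel for v in row]
    out_w = len(image[0]) - k + 1
    flat = [sum(p * q for p, q in zip(patch, flat_kernel))
            for patch in im2col(image, k)]
    return [flat[r:r + out_w] for r in range(0, len(flat), out_w)]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
print(conv2d_as_gemm(image, kernel))
```

Note that the same kernel values are reused for every patch, which is the weight reuse mentioned above.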

Pattern 3: Element-Wise Operations

Addition, multiplication, activation functions (ReLU, softmax).

Compute light: Less than 10% of total
Simple hardware: Single ALU per core
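For comparison with the matrix multiply, here is a sketch of two of these operations. ReLU is one comparison per element; softmax is included because the text lists it, although it also needs a sum over the whole vector and so is not purely element-wise.

```python
import math


def relu(xs):
    """One comparison per element, no interaction between elements."""
    return [max(0.0, x) for x in xs]


def softmax(xs):
    """Exponentiate each element, then normalize by the sum."""
    shift = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


print(relu([-1.0, 0.0, 2.5]))
print(softmax([1.0, 2.0, 3.0]))
```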

4. Memory Hierarchy & Bandwidth

Many neural network layers are memory-bound, not compute-bound. The bottleneck is loading weights and activations, not computing.

Arithmetic Intensity = Compute / Memory Access

Example: ResNet-50 layer
  Compute: 100 million MACs
  Weights: 1 million parameters (4 MB @ FP32)
  Arithmetic intensity = 100M MACs / 4 MB = 25 MACs per byte

At 100 GB/s memory bandwidth:
  Memory time = 4 MB / 100 GB/s = 40 microseconds
  Compute time = 100M ops / 10 TOPS = 10 microseconds
  → Memory dominates (40 vs 10)

Solution: Cache weights in fast on-chip memory
  On-chip SRAM: 10 MB, ~1 TB/s bandwidth
  Can sustain compute without memory stalls
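The same arithmetic can be checked with a short roofline-style sketch. It assumes a compute rate of 10 trillion operations per second, which is the rate at which 100 million operations take 10 microseconds, and it counts only weight traffic. `layer_times` is a hypothetical helper name.

```python
def layer_times(macs, weight_bytes, mem_bw_bytes_per_s, peak_ops_per_s):
    """Return (arithmetic intensity, memory time, compute time, bottleneck)."""
    intensity = macs / weight_bytes
    memory_time = weight_bytes / mem_bw_bytes_per_s
    compute_time = macs / peak_ops_per_s
    bound = "memory" if memory_time > compute_time else "compute"
    return intensity, memory_time, compute_time, bound


# Off-chip memory at 100 GB/s
dram = layer_times(macs=100e6, weight_bytes=4e6,
                   mem_bw_bytes_per_s=100e9, peak_ops_per_s=10e12)
# On-chip SRAM at roughly 1 TB/s
sram = layer_times(macs=100e6, weight_bytes=4e6,
                   mem_bw_bytes_per_s=1e12, peak_ops_per_s=10e12)
print(dram)
print(sram)
```

With weights held in on-chip SRAM the memory time drops below the compute time, so the layer becomes compute-bound in this model.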

5. Power Consumption Considerations

At datacenter scale, fleets of accelerators draw megawatts of power. Power is a primary design constraint.

Device	Peak Power	Typical Use	Energy/OP
Smartphone NPU	2-5W	Inference only	10-20 pJ
GPU (consumer)	250-450W	Inference + training	5-10 pJ
TPU (datacenter)	350-450W	Inference + training	3-8 pJ
H100 (NVIDIA)	700W	Training, inference	2-5 pJ
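Energy per operation connects directly to both battery drain and power draw. The sketch below shows the two conversions; the 8-billion-operation workload and the 100 trillion operations per second rate are hypothetical inputs, not measurements of any device in the table.

```python
PICOJOULE = 1e-12  # joules


def energy_joules(total_ops, pj_per_op):
    """Energy for a fixed amount of work."""
    return total_ops * pj_per_op * PICOJOULE


def power_watts(ops_per_second, pj_per_op):
    """Sustained power at a given operation rate."""
    return ops_per_second * pj_per_op * PICOJOULE


# Hypothetical inference of 8 billion operations at 10 pJ per operation
print(energy_joules(8e9, 10))
# Hypothetical chip sustaining 100 trillion ops/s at 5 pJ per operation
print(power_watts(100e12, 5))
```

The second result lands in the hundreds of watts, which is why a few picojoules per operation matter at datacenter rates.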

Power optimization techniques:

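Lower precision: INT8 values are 4x smaller than FP32, which cuts data movement by 4x and makes each arithmetic operation cheaper
Gating idle units: Clock gating, power gating
Voltage scaling: Reduce voltage when frequency is low
Sparsity: Skip multiplications by zero (weights are often sparse after pruning, and activations after ReLU)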
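The sparsity technique can be sketched in software as a dot product that skips zero weights. Real hardware does this with compressed weight formats and dedicated skip logic; this example only shows the reduction in MAC count, and the values are made up.

```python
def sparse_dot(weights, inputs):
    """Dot product that skips zero weights; returns (result, MACs performed)."""
    acc, macs = 0.0, 0
    for w, x in zip(weights, inputs):
        if w == 0:
            continue  # no multiply, no energy spent on this element
        acc += w * x
        macs += 1
    return acc, macs


result, macs = sparse_dot([0, 2, 0, 0.5], [1, 3, 5, 4])
print(result, macs)
```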

6. Real-World Examples

NVIDIA GPU (H100)

Architecture: 132 SMs (Streaming Multiprocessors), ~3.35 TB/s HBM3 memory bandwidth (SXM version)

Specs:

Peak FP32: ~67 TFLOPS
Peak FP8: ~1,980 TFLOPS dense on Tensor Cores (roughly 30x the FP32 rate, from lower precision plus dedicated matrix units)
Memory: 80 GB HBM3
Power: 700W

Use case: Training LLMs (GPT, BERT), inference at scale

Google TPU v4

Architecture: 2 TensorCores, each with four 128×128 systolic arrays (MXUs), plus 32 GB of in-package HBM

Specs:

Peak BF16: 275 TFLOPS
Peak INT8: 275 TOPS (same peak rate as BF16)
Memory bandwidth: 1.2 TB/s (HBM)
Power: ~170W typical, 192W max

Use case: Training large models at Google scale, inference

Apple Neural Engine

Architecture: 16 cores (Apple does not document the internals; a 16×16 MAC array per core is an assumption here)

Specs:

Peak: ~11 TOPS (A14 generation, mobile; Apple does not state the precision)
On-chip memory: Limited but efficient
Power: < 2W (smartphone)

Use case: Face recognition, image processing, on-device inference (privacy)

7. Key Design Trade-Offs

Trade-Off 1: Generality vs Specialization

General-purpose GPU: Can run any parallel algorithm, but spends area and power on flexibility that neural networks do not need

Specialized TPU: Optimized for matrix multiply, but harder to support other workloads

Choice depends on: Workload diversity, product roadmap, time-to-market

Trade-Off 2: Latency vs Throughput

Low latency (mobile): Optimize for fast single inference, sacrifice throughput

High throughput (datacenter): Process batches, optimize pipeline bandwidth

Real example: Phone inference: 50ms single image. Datacenter: 1000 images/second in batches
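A simple model shows why batching raises throughput while hurting latency. It assumes each batch pays a fixed overhead (launch, weight loading) plus a per-image cost; the 8 ms and 0.5 ms figures are hypothetical and chosen only to show the shape of the trade-off.

```python
def batch_stats(batch_size, fixed_overhead_ms, per_image_ms):
    """Return (latency in ms, throughput in images per second) for one batch."""
    latency_ms = fixed_overhead_ms + batch_size * per_image_ms
    throughput = batch_size / (latency_ms / 1000.0)
    return latency_ms, throughput


for batch in (1, 8, 32):
    latency, throughput = batch_stats(batch, fixed_overhead_ms=8.0, per_image_ms=0.5)
    print(f"batch={batch}: latency={latency:.1f} ms, throughput={throughput:.0f} img/s")
```

Larger batches amortize the fixed overhead across more images, so throughput climbs, but every image in the batch waits for the whole batch to finish.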

8. Complete Checklist: AI Chip Design Fundamentals

✅ Understand inference vs training (2 fundamentally different problems)
✅ Identify dominant compute patterns (matrix multiply = 90% of work)
✅ Memory bandwidth is critical (moving data, not arithmetic, is often the bottleneck)
✅ Power is a first-class constraint (not an afterthought)
✅ Parallelism is key (systolic arrays, many cores)
✅ Lower precision trades accuracy for speed/power (INT8 vs FP32)
✅ Specialization pays off (dedicated matrix units improve matrix multiply efficiency, at the cost of flexibility)
✅ Real-world constraints matter (mobile: power, datacenter: throughput)

Next (Day 2): Neural network fundamentals—layers, operations, and math.
