AI Chip Design Day 20

The Full Pipeline

HBM (8 GB)
  ↓ [2 TB/s, 50 ns]
Local Controller
  ↓ [512 GB/s, 5 ns]
Weight SRAM (24 MB)
  ↓ [256 GB/s, 1 ns]
PE Input FIFOs (256 × 32 elements)
  ↓
Systolic Array (256×256)
  ↓
Output Accumulation (256×32 registers)
  ↓ [256 GB/s]
Weight SRAM
  ↓ [512 GB/s]
Local Controller
  ↓ [2 TB/s]
HBM (write results)

Latency vs Throughput

Latency: HBM → Systolic

How long until first result?

HBM access: 50 ns (cold start, no prefetch)
Local controller (router) latency: 5 ns
Weight SRAM load: 10 cycles = 5 ns at 2 GHz
Systolic pipeline: 256 cycles (fill 256×256 array) = 128 ns
Total: 188 ns, or roughly 200 ns from request to first output
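
The latency budget above can be tallied with a few lines of Python. This is a minimal sketch: the stage names and the cycles_to_ns helper are labels chosen here for illustration, and the values are the ones listed above, with the 2 GHz clock taken from the throughput section.

```python
# Cold-start latency budget, using the figures listed above.
CLOCK_GHZ = 2.0  # 2 GHz clock, as stated in the throughput section


def cycles_to_ns(cycles, clock_ghz=CLOCK_GHZ):
    """Convert a cycle count to nanoseconds at the given clock rate."""
    return cycles / clock_ghz


# Stage names are illustrative labels, not identifiers from any real tool.
stages_ns = {
    "hbm_access": 50.0,                  # cold start, no prefetch
    "router": 5.0,                       # local controller hop
    "weight_sram_load": cycles_to_ns(10),
    "systolic_fill": cycles_to_ns(256),  # one cycle per array row
}

total_ns = sum(stages_ns.values())

for name, ns in stages_ns.items():
    print(f"{name:>18}: {ns:6.1f} ns")
print(f"{'total':>18}: {total_ns:6.1f} ns")
```

The systolic fill is about two thirds of the total, so the cold-start latency is set more by the array depth than by the HBM access.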

Throughput: Sustained

How many operations per second once the pipeline is full?

256×256 = 65,536 MACs active
Clock: 2 GHz
Throughput: 65,536 MACs × 2 GHz ≈ 131 trillion MACs/s (131 TFLOPS when each MAC counts as one operation)
Peak: 256 TFLOPS (when 2× values per MAC per cycle; 2 × 131 ≈ 262, rounded here to 256)
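
The throughput arithmetic can be checked directly. This sketch assumes the common convention that one MAC (multiply plus add) counts as two FLOPs. That convention is an assumption made here, not something the text states, but doubling the MAC rate gives about 262 TFLOPS, close to the 256 TFLOPS peak quoted above.

```python
# Sustained throughput of a 256x256 systolic array at 2 GHz.
ROWS, COLS = 256, 256
CLOCK_HZ = 2e9

macs_per_cycle = ROWS * COLS              # every PE performs one MAC per cycle
macs_per_second = macs_per_cycle * CLOCK_HZ

# Assumption: 1 MAC = 2 FLOPs (one multiply and one add).
flops_per_second = 2 * macs_per_second

print(f"MACs per cycle: {macs_per_cycle:,}")
print(f"MAC rate:       {macs_per_second / 1e12:.1f} TMAC/s")
print(f"FLOP rate:      {flops_per_second / 1e12:.1f} TFLOPS (2 FLOPs per MAC)")
```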

Design Constraints

1. Weight Reuse

Loading weights from HBM is slow (50 ns). Solution: keep weights in SRAM longer.

For N consecutive matrix multiplies (batching):
- Load weights once (50 ns)
- Multiply N times using local SRAM (very fast)
- Cost amortized over N operations

Batch size = 64 (typical):
- Cost per multiply: 50 ns / 64 ≈ 0.78 ns
- Systolic latency: 128 ns
- Net latency: 128 ns + 0.78 ns ≈ 128.8 ns (amortized weight load)
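
The amortization argument above can be written as a small function. This is a sketch under the same simplification the text makes: the weight load costs a flat 50 ns, every multiply costs the 128 ns systolic latency, and nothing overlaps. The function name and defaults are chosen here for illustration.

```python
def amortized_latency_ns(batch_size, weight_load_ns=50.0, systolic_ns=128.0):
    """Per-multiply latency when one weight load is shared by a batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return systolic_ns + weight_load_ns / batch_size


for n in (1, 8, 64, 256):
    print(f"batch {n:>3}: {amortized_latency_ns(n):7.2f} ns per multiply")
```

Past a batch of about 64 the weight load contributes less than 1 ns per multiply, so larger batches bring little further gain on this term alone.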

2. Arithmetic Intensity Balance

From the Day 16 roofline: arithmetic intensity (AI, operations per byte moved) must exceed the roofline knee for a kernel to be compute-bound.

Operation	MACs	Bytes	AI (FLOP/B)	Bottleneck
256×256 matmul	16M	256KB	64	Memory
1024×1024 matmul	1B (≈2B FLOPs)	8MB	256	Compute
Activation (ReLU)	N	2N bytes	0.5	Memory
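
The Bottleneck column follows from comparing each AI value with the roofline knee. The sketch below assumes the knee is the 256 TFLOPS peak divided by the 2 TB/s HBM bandwidth, both taken from this page. Day 16's exact knee is not reproduced here, so treat the 128 FLOP/B threshold as an assumption.

```python
# Roofline classification, assuming peak and bandwidth from this page.
PEAK_FLOPS = 256e12       # peak compute, FLOP/s
HBM_BYTES_PER_S = 2e12    # HBM bandwidth, bytes/s


def roofline_knee(peak_flops=PEAK_FLOPS, bandwidth=HBM_BYTES_PER_S):
    """Arithmetic intensity (FLOP/B) at which the two roofs meet."""
    return peak_flops / bandwidth


def attainable_flops(ai, peak_flops=PEAK_FLOPS, bandwidth=HBM_BYTES_PER_S):
    """Roofline model: performance is capped by compute or by memory traffic."""
    return min(peak_flops, ai * bandwidth)


def bottleneck(ai, peak_flops=PEAK_FLOPS, bandwidth=HBM_BYTES_PER_S):
    """Compute-bound only if AI exceeds the knee, otherwise memory-bound."""
    return "Compute" if ai > roofline_knee(peak_flops, bandwidth) else "Memory"


# AI values copied from the table above.
for name, ai in [("256x256 matmul", 64), ("1024x1024 matmul", 256), ("ReLU", 0.5)]:
    print(f"{name:>18}: AI={ai:>5} -> {bottleneck(ai)}-bound, "
          f"{attainable_flops(ai) / 1e12:.0f} TFLOPS attainable")
```

Under these assumptions the classification matches the table: only the 1024×1024 matmul clears the knee.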

3. Padding and Tiling

The systolic array works best when matrix dimensions are multiples of 256:

Input problem: 224×224 (image). Not divisible by 256!

Options:
1. Pad to 256×256 (2.3% extra compute, 14% extra memory I/O)
2. Tile into 224 = 128 + 96 (two separate systolic runs, cache misses)
3. Use a smaller systolic array (64×64) with batching

Production choice: Pad to 256×256 (simpler, accepted overhead)
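
The pad and tile options can be explored with two small helpers. This is an illustrative sketch with function names chosen here. It reports only the per-dimension padding overhead, because how that turns into compute and I/O cost depends on how a layer is mapped onto the array, which the text does not specify.

```python
def pad_to_multiple(n, tile=256):
    """Round n up to the next multiple of tile (integer arithmetic only)."""
    return (n + tile - 1) // tile * tile


def padding_overhead(n, tile=256):
    """Fraction of extra elements added along one dimension by padding."""
    return (pad_to_multiple(n, tile) - n) / n


def split_into_tiles(n, tile):
    """Split a dimension of size n into chunks of at most tile elements."""
    full, rest = divmod(n, tile)
    return [tile] * full + ([rest] if rest else [])


print(pad_to_multiple(224))             # padded size along one dimension
print(f"{padding_overhead(224):.1%}")   # per-dimension overhead
print(split_into_tiles(224, 128))       # option 2: two runs
print(len(split_into_tiles(224, 64)))   # option 3: number of 64-wide tiles
```

For 224 the padded size is 256 and the per-dimension overhead is about 14%, in line with the memory I/O figure quoted for option 1.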

Real Throughput Example

Scenario: ResNet-50 inference on TPU
- 50 layers (mostly 3×3 convolutions)
- Total FLOPs: 4.1 billion
- Peak TFLOPS: 256

Naive estimate: 4.1 GFLOPs / 256 TFLOPS ≈ 16 µs per image, i.e. about 16 ms per 1,000 images
Actual: 28 ms (57% efficiency; 43% of the time is lost to overhead)

Why? Memory I/O, activation functions, non-matmul layers, batching overhead.

Improvements (in production):
- Batch size 128 (increases arithmetic intensity)
- Fused kernels (skip activation I/O)
- Lower precision (INT8)
- Result: 28 ms → 8 ms (3.5× faster)
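
The estimate, efficiency, and speedup figures can be recomputed with three one-line formulas. This is a sketch with helper names chosen here. It treats 16 and 28 as the ideal and measured times in the same unit, and 28 and 8 as the times before and after optimization.

```python
def ideal_time_s(total_flops, peak_flops):
    """Time for the work if the chip ran at peak throughput the whole time."""
    return total_flops / peak_flops


def efficiency(ideal_time, actual_time):
    """Fraction of peak actually achieved (times in the same unit)."""
    return ideal_time / actual_time


def speedup(time_before, time_after):
    """How many times faster the optimized run is."""
    return time_before / time_after


single_pass_us = ideal_time_s(4.1e9, 256e12) * 1e6
print(f"ideal time for one 4.1 GFLOP pass: {single_pass_us:.1f} µs")
print(f"efficiency at 16 ideal vs 28 actual: {efficiency(16, 28):.0%}")
print(f"speedup from 28 to 8: {speedup(28, 8):.1f}x")
```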

Next Phase: Real Implementations

Days 21-25 dive into how Apple, Google, and NVIDIA designed production chips, from research papers to actual silicon.

Memory-Compute Co-Design