HomeDay 20

Memory-Compute Co-Design

Optimizing dataflow end-to-end: from HBM through NoC to systolic arrays. Real-world latency and throughput bottlenecks.

The Full Pipeline

HBM (8 GB) ↓ [2 TB/s, 50 ns] Local Controller ↓ [512 GB/s, 5 ns] Weight SRAM (24 MB) ↓ [256 GB/s, 1 ns] PE Input FIFOs (256 x 32 elements) ↓ Systolic Array (256×256) ↓ Output Accumulation (256×32 registers) ↓ [256 GB/s] Weight SRAM ↓ [512 GB/s] Local Controller ↓ [2 TB/s] HBM (write results)

Latency vs Throughput

Latency: HBM → Systolic

How long until first result?

Throughput: Sustained

How many FLOPS per second once running?

Design Constraints

1. Weight Reuse

Loading weights from HBM is slow (50 ns). Solution: Keep weights in SRAM longer. For N consecutive matrix multiplies (batching): - Load weights once (50 ns) - Multiply N times using local SRAM (very fast) - Cost amortized over N operations Batch size = 64 (typical): - Cost per multiply: 50 ns / 64 ≈ 0.78 ns - Systolic latency: 128 ns - Net latency: 128 ns + 0.78 ns (amortized weight load)

2. Arithmetic Intensity Balance

From Day 16 roofline: AI must exceed knee to be compute-bound.

OperationMACsBytesAI (FLOP/B)Bottleneck
256×256 matmul16M256KB64Memory
1024×1024 matmul2B8MB256Compute
Activation (ReLU)N2N bytes0.5Memory

3. Padding and Tiling

Systolic works best on multiples of 256:

Input problem: 224×224 (image) Not divisible by 256! Options: 1. Pad to 256×256 (2.3% extra compute, 14% extra memory I/O) 2. Tile into 224 = 128 + 96 (two separate systolic runs, cache misses) 3. Use smaller systolic (64×64) with batching Production choice: Pad to 256×256 (simpler, accepted overhead)

Real Throughput Example

Scenario: ResNet-50 inference on TPU - 50 layers (mostly 3×3 convolutions) - Total FLOPs: 4.1 billion - Peak TFLOPS: 256 Naive estimate: 4.1B / 256T = 16 ms Actual: 28 ms (43% efficiency) Why? Memory I/O, activation functions, non-matmul layers, batching overhead. Improvements (in production): - Batch size 128 (increases AI) - Fused kernels (skip activation I/O) - Lower precision (INT8) - Result: 45 ms → 8 ms (3.5× faster)

Next Phase: Real Implementations

Days 21-25 dive into how Apple, Google, NVIDIA designed production chips. From research papers to actual silicon.