The Full Pipeline
HBM (8 GB)
↓ [2 TB/s, 50 ns]
Local Controller
↓ [512 GB/s, 5 ns]
Weight SRAM (24 MB)
↓ [256 GB/s, 1 ns]
PE Input FIFOs (256 x 32 elements)
↓
Systolic Array (256×256)
↓
Output Accumulation (256×32 registers)
↓ [256 GB/s]
Weight SRAM
↓ [512 GB/s]
Local Controller
↓ [2 TB/s]
HBM (write results)
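The bandwidth figures in the diagram imply that the narrowest link limits how fast data can stream toward the array. The following is a minimal sketch of that reasoning, assuming decimal units (1 GB = 10^9 bytes) and ignoring the per-stage latencies; the names `LINKS_BYTES_PER_S`, `bottleneck_link`, and `stream_time_s` are illustrative and do not come from the original design.

```python
# Forward path from the diagram: (link name, bandwidth in bytes per second).
# Assumption: decimal units, 1 GB = 1e9 bytes and 1 TB = 1e12 bytes.
LINKS_BYTES_PER_S = [
    ("HBM -> Local Controller", 2e12),
    ("Local Controller -> Weight SRAM", 512e9),
    ("Weight SRAM -> PE Input FIFOs", 256e9),
]

def bottleneck_link(links):
    """Return the (name, bandwidth) pair with the lowest bandwidth."""
    return min(links, key=lambda link: link[1])

def stream_time_s(num_bytes, links):
    """Steady-state time to push num_bytes through the chain (slowest link wins)."""
    _, slowest = bottleneck_link(links)
    return num_bytes / slowest

name, bw = bottleneck_link(LINKS_BYTES_PER_S)
print(f"slowest link: {name} at {bw / 1e9:.0f} GB/s")
# Streaming 24 MB of weights (24e6 bytes assumed) all the way to the PE FIFOs.
print(f"24 MB stream time: {stream_time_s(24e6, LINKS_BYTES_PER_S) * 1e6:.2f} us")
```

In this simplified model the SRAM-to-FIFO link is the limiter, which is one reason on-chip weight reuse matters more than raw HBM bandwidth.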
Latency vs Throughput
Latency: HBM → Systolic
How long until first result?
- HBM access: 50 ns (cold start, no prefetch)
- Local Controller (router) latency: 5 ns
- Weight SRAM load: 10 cycles = 5 ns
- Systolic pipeline: 256 cycles (fill 256×256 array) = 128 ns
- Total: 50 + 5 + 5 + 128 = 188 ns, or roughly 200 ns from request to first output
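The latency budget above can be reproduced with a few lines of arithmetic. This is a sketch that simply sums the listed stages, assuming the 2 GHz clock quoted in the throughput section; `cycles_to_ns` and the stage names are illustrative.

```python
# Cold-start latency budget, using the per-stage figures from the list above.
CLOCK_HZ = 2e9  # assumption: the 2 GHz clock quoted in the throughput section

def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to nanoseconds at the given clock frequency."""
    return cycles / clock_hz * 1e9

stages_ns = {
    "hbm_access": 50.0,                  # cold start, no prefetch
    "controller": 5.0,                   # router / local controller hop
    "sram_load": cycles_to_ns(10),       # 10 cycles
    "systolic_fill": cycles_to_ns(256),  # fill the 256x256 array
}
total_ns = sum(stages_ns.values())

for stage, ns in stages_ns.items():
    print(f"{stage:14s} {ns:6.1f} ns")
print(f"{'total':14s} {total_ns:6.1f} ns")
```

The systolic fill dominates the budget, so a deeper array raises first-result latency even when the memory path is unchanged.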
Throughput: Sustained
How many operations per second once the pipeline is full?
- 256×256 = 65,536 MACs active
- Clock: 2 GHz
- Throughput: 65,536 MACs × 2 GHz ≈ 131 TMAC/s sustained (131 TFLOPS if one MAC counts as one operation)
- Peak: 2 × 131 ≈ 262 TFLOPS, nominally 256 TFLOPS (when 2× values per MAC per cycle)
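The throughput arithmetic can be checked directly. The sketch below assumes the common convention that one MAC counts as two FLOPs (a multiply plus an add) when converting to a FLOP rate; the variable names are illustrative.

```python
# Throughput of a 256x256 systolic array at 2 GHz.
ROWS, COLS = 256, 256
CLOCK_HZ = 2e9

macs_per_cycle = ROWS * COLS                 # one MAC per PE per cycle
macs_per_second = macs_per_cycle * CLOCK_HZ
# Assumption: 1 MAC = 2 FLOPs (multiply + add).
flops_per_second = 2 * macs_per_second

print(f"MACs per cycle: {macs_per_cycle}")
print(f"MAC rate:  {macs_per_second / 1e12:.1f} TMAC/s")
print(f"FLOP rate: {flops_per_second / 1e12:.1f} TFLOPS")
```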
Design Constraints
1. Weight Reuse
Loading weights from HBM is slow (50 ns).
Solution: Keep weights in SRAM longer.
For N consecutive matrix multiplies (batching):
- Load weights once (50 ns)
- Multiply N times using local SRAM (very fast)
- Cost amortized over N operations
Batch size = 64 (typical):
- Cost per multiply: 50 ns / 64 ≈ 0.78 ns
- Systolic latency: 128 ns
- Net latency per multiply: 128 ns + 0.78 ns ≈ 128.8 ns (systolic latency plus amortized weight load)
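The amortization argument can be written as a one-line formula. This is a sketch under the figures used above (50 ns weight load, 128 ns systolic latency); the function name `amortized_latency_ns` is hypothetical.

```python
def amortized_latency_ns(batch, weight_load_ns=50.0, systolic_ns=128.0):
    """Per-multiply latency when one weight load is shared by `batch` multiplies."""
    if batch < 1:
        raise ValueError("batch must be at least 1")
    return systolic_ns + weight_load_ns / batch

for batch in (1, 8, 64, 256):
    print(f"batch={batch:3d}  per-multiply latency = {amortized_latency_ns(batch):.2f} ns")
```

Past a batch of a few dozen the weight-load term is already under 1% of the total, so larger batches help throughput and arithmetic intensity more than they help this latency figure.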
2. Arithmetic Intensity Balance
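From the Day 16 roofline: arithmetic intensity (AI) must exceed the knee for a kernel to be compute-bound. With the figures above (256 TFLOPS peak, 2 TB/s HBM), the knee sits at 256T / 2T = 128 FLOP/B.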
| Operation | FLOPs | Bytes | AI (FLOP/B) | Bottleneck |
|---|---|---|---|---|
| 256×256 matmul | 34M | 512 KB | 64 | Memory |
| 1024×1024 matmul | 2.1B | 8 MB | 256 | Compute |
| Activation (ReLU) | N | 2N | 0.5 | Memory |
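The AI column can be reproduced with a small roofline check. This sketch rests on assumptions that are not stated in the original: 2 FLOPs per MAC, 8 bytes of traffic per matrix position (for example two FP32 matrices), chosen because it yields AI values of 64 and 256, and a knee computed from the 256 TFLOPS peak and 2 TB/s HBM figures. All function names are illustrative.

```python
PEAK_FLOPS = 256e12   # nominal peak quoted in these notes
HBM_BYTES_PER_S = 2e12

# Roofline knee: intensity at which compute and memory limits meet.
KNEE = PEAK_FLOPS / HBM_BYTES_PER_S  # FLOP per byte

def matmul_intensity(n, bytes_per_elem=4, matrices_moved=2):
    """AI of an n x n x n matmul. Assumption: 2 FLOPs per MAC, 8 B per position."""
    flops = 2 * n ** 3
    bytes_moved = matrices_moved * bytes_per_elem * n * n
    return flops / bytes_moved

def bottleneck(ai, knee=KNEE):
    """Classify a kernel as compute- or memory-bound relative to the knee."""
    return "compute" if ai > knee else "memory"

for label, ai in [
    ("256x256 matmul", matmul_intensity(256)),
    ("1024x1024 matmul", matmul_intensity(1024)),
    ("ReLU", 0.5),  # N operations over 2N bytes
]:
    print(f"{label:18s} AI = {ai:6.1f} FLOP/B -> {bottleneck(ai)}-bound")
```

Under these assumptions AI grows linearly with the matrix dimension (AI = n / 4), which is why larger tiles and batches push a matmul across the knee.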
3. Padding and Tiling
The systolic array works best when problem dimensions are multiples of 256:
Input problem: 224×224 (image)
Not divisible by 256!
Options:
1. Pad to 256×256 (256/224 ≈ 1.14, so 14% more per dimension and about 31% more elements to compute and move)
2. Tile into 224 = 128 + 96 (two separate systolic runs, cache misses)
3. Use smaller systolic (64×64) with batching
Production choice: Pad to 256×256 (simpler, accepted overhead)
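The cost of padding can be estimated from the ratio of padded to original size. A minimal sketch, assuming the overhead scales with the number of padded dimensions; `padded_size` and `padding_overhead` are hypothetical helper names.

```python
import math

def padded_size(n, tile=256):
    """Round n up to the next multiple of the tile (array) dimension."""
    return math.ceil(n / tile) * tile

def padding_overhead(n, tile=256, dims=2):
    """Fractional extra work when `dims` dimensions of size n are padded."""
    return (padded_size(n, tile) / n) ** dims - 1.0

print(padded_size(224))                         # 256
print(f"{padding_overhead(224, dims=1):.1%}")   # 14.3% per dimension
print(f"{padding_overhead(224, dims=2):.1%}")   # 30.6% more elements in 2D
```

The same helpers show why small odd sizes hurt most: a dimension of 257 pads to 512, nearly doubling the work in each padded dimension.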
Real Throughput Example
Scenario: ResNet-50 inference on TPU
- 50 layers (mostly 1×1 and 3×3 convolutions in bottleneck blocks)
- Total FLOPs: 4.1 billion
- Peak TFLOPS: 256
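Naive estimate: 4.1 GFLOP / 256 TFLOPS ≈ 16 µs per image, or about 16 ms for a workload of 1,000 images (4.1 TFLOP)
Actual: 28 ms for that workload (16 / 28 ≈ 57% efficiency, so about 43% of the time is overhead)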
Why? Memory I/O, activation functions, non-matmul layers, batching overhead.
Improvements (in production):
- Batch size 128 (increases AI)
- Fused kernels (skip activation I/O)
- Lower precision (INT8)
- Result: 28 ms → 8 ms (3.5× faster)
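The estimates in this example reduce to three ratios. The sketch below defines them as small helpers; the function names are hypothetical, and the inputs are the 4.1 GFLOP, 256 TFLOPS, 28 ms, and 8 ms figures quoted in this section.

```python
def ideal_time_s(total_flops, peak_flops_per_s):
    """Lower bound on runtime if the array ran at peak the whole time."""
    return total_flops / peak_flops_per_s

def efficiency(ideal, measured):
    """Fraction of peak actually achieved (both arguments in the same unit)."""
    return ideal / measured

def speedup(before, after):
    """How many times faster the optimized run is (same unit for both)."""
    return before / after

# One 4.1 GFLOP forward pass at the nominal 256 TFLOPS peak.
print(f"ideal time: {ideal_time_s(4.1e9, 256e12) * 1e6:.1f} us")
# Measured 28 ms versus optimized 8 ms.
print(f"speedup: {speedup(28.0, 8.0):.1f}x")
```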
Next Phase: Real Implementations
Days 21-25 dive into how Apple, Google, and NVIDIA designed production chips, from research papers to actual silicon.