AI Chip Design Day 2

Why CPU Design Fails for AI

CPUs are optimized for general-purpose compute: branch prediction, caching, speculative execution, out-of-order dispatch. For one task, matrix multiply, all of that machinery is overhead.

Problem 1: Instruction Fetch & Decode

// Every iteration of an inner loop requires:
// Fetch instruction: mov r1, [memory]  (3 cycles)
// Decode: identify what it is             (2 cycles)
// Execute: one multiply                   (1 cycle)

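// For the 16.7M multiplies in a 256×256 matrix multiply (256^3 = 16,777,216):
// in this simplified, unpipelined model the CPU spends 5 cycles on
// fetch and decode per 1 cycle of actual work!
// About 83% of cycles (and, to a first approximation, energy) = instruction overhead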
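
The arithmetic behind these figures can be checked with a minimal sketch. It takes the per-stage cycle counts listed above at face value (3 fetch, 2 decode, 1 execute) and assumes a naive, unpipelined loop with one multiply per instruction. The function names are illustrative only, and a real CPU overlaps these stages, so treat the result as a worst-case teaching model rather than a measurement.

```python
# Simplified per-multiply cycle model using the figures quoted in the notes.
FETCH_CYCLES = 3    # instruction fetch
DECODE_CYCLES = 2   # instruction decode
EXECUTE_CYCLES = 1  # the multiply itself


def matmul_multiplies(n: int) -> int:
    """Scalar multiplies in a naive (n x n) by (n x n) matrix multiply."""
    return n ** 3


def overhead_fraction(fetch: int, decode: int, execute: int) -> float:
    """Share of cycles spent on fetch and decode rather than execution."""
    return (fetch + decode) / (fetch + decode + execute)


n = 256
multiplies = matmul_multiplies(n)
cycles_per_multiply = FETCH_CYCLES + DECODE_CYCLES + EXECUTE_CYCLES
total_cycles = multiplies * cycles_per_multiply
useful_cycles = multiplies * EXECUTE_CYCLES

print(f"multiplies:    {multiplies:,}")
print(f"total cycles:  {total_cycles:,}")
print(f"useful cycles: {useful_cycles:,}")
print(f"overhead:      {overhead_fraction(FETCH_CYCLES, DECODE_CYCLES, EXECUTE_CYCLES):.0%}")
```

With these counts, five of every six cycles go to fetch and decode, which is why a design that issues one instruction for many multiplies changes the picture so much.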

Problem 2: Cache Misses

Matrix multiply access patterns are predictable (row-major, column-major). But a general-purpose cache hierarchy was designed for the irregular access patterns of general workloads (web browsers, databases, C++ algorithms), not for streaming large matrices.

Result: a 40-60% cache miss rate, with about 200 CPU cycles wasted per miss. An NPU sees zero cache misses, because data flows in directly from HBM.
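
To put the miss figures in per-access terms, here is a rough sketch of the expected stall cost, using only the miss rate and miss penalty quoted above. The helper name is hypothetical, and the model assumes every miss pays the full penalty with no overlap or prefetching.

```python
MISS_PENALTY_CYCLES = 200  # cycles lost per miss, as quoted in the notes


def stall_cycles_per_access(miss_rate: float, miss_penalty: int) -> float:
    """Expected cycles lost to cache misses, averaged over all memory accesses."""
    return miss_rate * miss_penalty


# Evaluate both ends of the quoted 40-60% miss-rate range.
for miss_rate in (0.40, 0.60):
    stall = stall_cycles_per_access(miss_rate, MISS_PENALTY_CYCLES)
    print(f"miss rate {miss_rate:.0%}: about {stall:.0f} stall cycles per access")
```

Under these assumptions the average memory access costs far more than the one-cycle multiply it feeds, so the memory system, not the arithmetic, sets the pace.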

Problem 3: Power Per Multiply

CPU: 1 Joule per giga-multiply, i.e. 1 nJ per multiply (huge!)
Custom NPU: 0.001 Joule per giga-multiply, i.e. 1 pJ per multiply (1000× better)
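
As a worked example, the sketch below applies these two energy figures to the 256×256 matrix multiply used earlier. It counts multiplies only (no adds or data movement), and the constant and function names are illustrative assumptions.

```python
CPU_J_PER_GIGA_MULTIPLY = 1.0    # from the notes
NPU_J_PER_GIGA_MULTIPLY = 0.001  # from the notes


def energy_joules(multiplies: int, joules_per_giga_multiply: float) -> float:
    """Energy for a given number of multiplies at a quoted J per 1e9 multiplies."""
    return multiplies / 1e9 * joules_per_giga_multiply


multiplies = 256 ** 3  # naive 256x256 matrix multiply
cpu_j = energy_joules(multiplies, CPU_J_PER_GIGA_MULTIPLY)
npu_j = energy_joules(multiplies, NPU_J_PER_GIGA_MULTIPLY)

print(f"CPU: {cpu_j * 1e3:.2f} mJ")
print(f"NPU: {npu_j * 1e6:.2f} uJ")
print(f"ratio: {cpu_j / npu_j:.0f}x")
```

A single small matrix multiply costs millijoules on the CPU figure versus microjoules on the NPU figure; the gap compounds quickly over the billions of multiplies in a real model.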

Day 3: Energy efficiency metrics and why they matter for design decisions.
