Day 2

The General-Purpose Problem

Why CPU Design Fails for AI

CPUs are optimized for general-purpose compute: branch prediction, caching, speculative execution, out-of-order dispatch. All of that machinery becomes pure overhead for one task: matrix multiply.

Problem 1: Instruction Fetch & Decode

// Every iteration of an inner loop requires:
//   Fetch instruction: mov r1, [memory]   (3 cycles)
//   Decode: identify what it is           (2 cycles)
//   Execute: one multiply                 (1 cycle)
// For the 16.7M multiplies in a 256×256 matrix multiply:
//   the CPU spends 5 cycles on overhead per 1 cycle of actual work!
//   ~83% of energy = instruction overhead (5 of every 6 cycles)
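The tally above can be checked with a few lines of Python. This is only a sketch: it takes the per-instruction cycle counts listed above at face value and models an unpipelined core, which a real CPU is not. The function names are made up for illustration.

```python
# Illustrative cycle counts taken from the text above, not measurements.
FETCH_CYCLES = 3
DECODE_CYCLES = 2
EXECUTE_CYCLES = 1


def overhead_fraction(fetch, decode, execute):
    """Fraction of cycles spent on fetch/decode rather than the multiply."""
    overhead = fetch + decode
    return overhead / (overhead + execute)


def matmul_multiplies(n):
    """Scalar multiplies in a naive (n x n) times (n x n) matrix multiply."""
    return n ** 3


n = 256
multiplies = matmul_multiplies(n)
total_cycles = multiplies * (FETCH_CYCLES + DECODE_CYCLES + EXECUTE_CYCLES)
frac = overhead_fraction(FETCH_CYCLES, DECODE_CYCLES, EXECUTE_CYCLES)

print(f"multiplies:        {multiplies:,}")
print(f"total cycles:      {total_cycles:,}  (unpipelined model)")
print(f"overhead fraction: {frac:.1%}")
```

With these inputs, 256 cubed gives the 16.7M multiplies, and five of every six cycles go to fetch and decode.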

Problem 2: Cache Misses

Matrix multiply access patterns are predictable (row-major, column-major). But a general-purpose cache hierarchy was designed for random access (web browsers, databases, C++ algorithms).

Result: 40-60% cache miss rate. 200 CPU cycles wasted per miss. NPU: zero misses (data flows in from HBM).
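To see how the access pattern alone drives the miss rate, here is a toy cache model. Everything about the cache is an assumption chosen for illustration (direct-mapped, 64 lines of 64 bytes, 32-bit elements), and it is not meant to reproduce the 40-60% figure above. It replays a row walk and a column walk over a row-major 256×256 matrix and applies the 200-cycle miss penalty quoted above.

```python
LINE_BYTES = 64      # assumed cache line size
NUM_LINES = 64       # assumed direct-mapped cache with 64 lines (4 KiB total)
ELEM_BYTES = 4       # assumed 32-bit matrix elements
MISS_PENALTY = 200   # cycles per miss, the figure quoted in the text


def miss_rate(addresses):
    """Replay byte addresses through a toy direct-mapped cache."""
    cache = [None] * NUM_LINES  # each slot remembers which memory line it holds
    misses = 0
    total = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        slot = line % NUM_LINES
        if cache[slot] != line:
            misses += 1
            cache[slot] = line
        total += 1
    return misses / total


def row_walk(n):
    """Visit a row-major n x n matrix one row at a time (unit stride)."""
    for i in range(n):
        for j in range(n):
            yield (i * n + j) * ELEM_BYTES


def column_walk(n):
    """Visit the same matrix one column at a time (stride of n elements)."""
    for j in range(n):
        for i in range(n):
            yield (i * n + j) * ELEM_BYTES


n = 256
row_rate = miss_rate(row_walk(n))
col_rate = miss_rate(column_walk(n))

print(f"row walk:    miss rate {row_rate:.2%}, "
      f"avg stall {row_rate * MISS_PENALTY:.1f} cycles/access")
print(f"column walk: miss rate {col_rate:.2%}, "
      f"avg stall {col_rate * MISS_PENALTY:.1f} cycles/access")
```

In this toy model the row walk misses once per 16 elements, while the column walk misses on every access, because each step jumps a full row ahead and evicts lines before they can be reused. A naive matrix multiply reads one operand by rows and the other by columns, so it mixes both patterns.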

Problem 3: Power Per Multiply

CPU: 1 Joule / 1 Giga-multiply (huge!)
Custom NPU: 0.001 Joule / 1 Giga-multiply (1000× better)
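A quick way to put these figures in context is to apply them to the 256×256 matrix multiply from Problem 1. The sketch below uses only the two energy figures above; the helper name `energy_joules` is made up for illustration.

```python
CPU_J_PER_GIGA_MULTIPLY = 1.0     # figure from the text
NPU_J_PER_GIGA_MULTIPLY = 0.001   # figure from the text


def energy_joules(multiplies, joules_per_giga_multiply):
    """Energy for a given number of multiplies at a given J per 1e9 multiplies."""
    return multiplies / 1e9 * joules_per_giga_multiply


multiplies = 256 ** 3  # scalar multiplies in a 256x256 matrix multiply

cpu_j = energy_joules(multiplies, CPU_J_PER_GIGA_MULTIPLY)
npu_j = energy_joules(multiplies, NPU_J_PER_GIGA_MULTIPLY)

print(f"CPU: {cpu_j * 1e3:.2f} mJ per matmul")
print(f"NPU: {npu_j * 1e6:.2f} uJ per matmul")
print(f"ratio: {cpu_j / npu_j:.0f}x")
```

Per multiply, the same figures work out to about 1 nJ on the CPU and about 1 pJ on the NPU.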

Day 3: Energy efficiency metrics and why they matter for design decisions.