Why AI Chips?

The problem that makes specialized neural processors necessary: why your phone's CPU can't handle AI efficiently, and why Apple, Google, and NVIDIA design custom chips.

The Problem

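In 2023, you almost certainly ran a neural network on your phone. Image recognition and voice transcription run on a special-purpose neural processor built into the phone's chip, not on its general-purpose CPU cores. Even cloud services like ChatGPT run on dedicated accelerators in the data center. Why?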

Answer: Matrix multiplication is roughly 99% of the compute in deep learning. A general-purpose CPU is terrible at it. A GPU is better. A custom neural processor can be 100-1000× better per watt than a CPU.

CPU vs GPU vs NPU

CPU (General Purpose)

What it's good at: Running a single stream of instructions very fast. Complex branching. Low-latency memory access through deep caches.

For AI: Terrible.

// CPU trying to multiply two 256x256 matrices
static float A[256][256], B[256][256], C[256][256];

void matmul(void) {
    for (int i = 0; i < 256; i++) {
        for (int j = 0; j < 256; j++) {
            float sum = 0;
            for (int k = 0; k < 256; k++) {
                sum += A[i][k] * B[k][j];  // Naive estimate: one multiply-add per cycle
            }
            C[i][j] = sum;
        }
    }
}
// Total: 256^3 = 16.7 million cycles
// On a 3 GHz CPU = 5.6 milliseconds

Reality: Modern CPUs run this much faster than the naive estimate, thanks to out-of-order execution, pipelining, and AVX vector instructions. Even so, you get maybe 50-100 GFLOPS for matrix multiply.
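To put the 50-100 GFLOPS figure next to the naive 5.6 ms estimate, here is a small back-of-the-envelope sketch. The function names are made up for illustration, and it assumes the CPU actually sustains the quoted throughput, which real code often does not.

```c
/* Floating-point operations in an n x n times n x n matrix multiply:
   n^3 multiplies plus n^3 adds. */
double matmul_flops(double n) {
    return 2.0 * n * n * n;
}

/* Estimated run time in milliseconds, assuming a sustained
   throughput given in GFLOPS (an idealized assumption). */
double matmul_ms(double n, double gflops) {
    return matmul_flops(n) / (gflops * 1e9) * 1e3;
}
```

Under those assumptions, matmul_ms(256, 100.0) comes out at about 0.34 ms, and matmul_ms(256, 50.0) at about 0.67 ms, so roughly 10× faster than the one-multiply-per-cycle loop.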

GPU (Graphics Processor)

What it's good at: Thousands of identical operations in parallel. Massive memory bandwidth. Designed for matrix math (pixel shaders = SIMD).

For AI: Actually pretty good.

NVIDIA Tesla T4:
• 65 TFLOPS (FP16 on Tensor Cores) for matrix multiply
• 300 GB/sec memory bandwidth
• Can run a large neural network in ~10ms
• But uses 70W of power (hot, needs cooling)

NPU (Neural Processing Unit) — Specialized

What it's good at: Only one thing: matrix multiplication with low-precision values (INT8, BF16). Systolic arrays. Data kept in on-chip memory, so no cache misses.

For AI: Phenomenal.

Apple Neural Engine (A17 Pro):
• 17 TOPS (INT8) for inference
• Uses only 2W of power
• Can run a full language model in ~50ms
• 10× more efficient than a GPU per watt
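The INT8 arithmetic an NPU relies on can be sketched in a few lines. This is a minimal illustration, not how any particular chip or library does it: it assumes simple symmetric quantization with one scale per tensor, and the names quantize and dot_int8 are hypothetical.

```c
#include <stdint.h>

/* Symmetric quantization (assumed scheme): divide by a positive scale,
   clamp to [-127, 127], then round to the nearest integer. */
int8_t quantize(float x, float scale) {
    float r = x / scale;
    if (r > 127.0f) r = 127.0f;
    if (r < -127.0f) r = -127.0f;
    return (int8_t)(r + (r >= 0.0f ? 0.5f : -0.5f));
}

/* Dot product of two INT8 vectors. Products are accumulated in a
   32-bit integer to leave headroom, and the result is converted back
   to a float once at the end using both scales. */
float dot_int8(const int8_t *a, const int8_t *b, int n,
               float scale_a, float scale_b) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return (float)acc * scale_a * scale_b;
}
```

The inner loop is pure 8-bit multiplies and integer adds, which need far less silicon and energy than floating-point units. That is a large part of where the per-watt advantage comes from.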

Matrix Multiplication: The Core Problem

Deep learning is roughly 99% matrix multiply. Everything else (activation functions, normalization, softmax) is noise.

// A transformer layer is basically:
output = softmax(Q @ K.T) @ V   // attention: 2 matrix muls (plus 3 more to project Q, K, V)
output = MLP_ffn(output)        // feed-forward network: 2 matrix muls
// All the rest (softmax, layer norm, etc.) = <1% of compute
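A rough FLOP count makes the "<1%" figure plausible. The sketch below is an estimate under loud assumptions that are not from the text: a feed-forward hidden width of 4×d, an output projection after attention, and guessed per-element costs for softmax, layer norm, and the activation function. All function names are hypothetical.

```c
/* Matrix-multiply FLOPs in one transformer layer.
   n = sequence length, d = model width.
   ASSUMPTION: feed-forward hidden width is 4*d. */
double layer_matmul_flops(double n, double d) {
    double projections = 4.0 * 2.0 * n * d * d;        /* Q, K, V, output */
    double attention   = 2.0 * 2.0 * n * n * d;        /* Q @ K.T, scores @ V */
    double ffn         = 2.0 * 2.0 * n * d * 4.0 * d;  /* two linear layers */
    return projections + attention + ffn;
}

/* Everything else. ASSUMPTION: the per-element costs (5, 8, 8) are
   rough guesses, not measurements. */
double layer_other_flops(double n, double d, double heads) {
    double softmax    = 5.0 * heads * n * n;   /* one score matrix per head */
    double layer_norm = 2.0 * 8.0 * n * d;     /* two layer norms */
    double activation = 8.0 * n * 4.0 * d;     /* nonlinearity in the FFN */
    return softmax + layer_norm + activation;
}

/* Fraction of the layer's FLOPs spent in matrix multiplies. */
double matmul_share(double n, double d, double heads) {
    double m = layer_matmul_flops(n, d);
    return m / (m + layer_other_flops(n, d, heads));
}
```

With illustrative values n = 512, d = 768, and 12 heads, matmul_share comes out above 0.99 under these assumptions. Changing the guessed constants shifts the exact number, but the matrix multiplies still dominate because they scale with d² or n·d per token, not linearly.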

So the question becomes: Can we design a chip that does matrix multiply incredibly fast and efficiently, and nothing else?

Answer: Yes, and it's called a systolic array.
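To make the idea concrete, here is a toy software simulation of a small systolic array. It is a sketch of one common arrangement (output-stationary: each cell keeps one element of C while rows of A flow right and columns of B flow down), not a description of any specific chip. The size N = 4 and the function name are assumptions for illustration.

```c
#define N 4

/* Simulate an N x N output-stationary systolic array computing C = A * B.
   Each cell (i, j) holds a running sum for C[i][j]. Every cycle it takes
   an A value from its left neighbor and a B value from the cell above,
   multiplies them, adds to its sum, and passes both values on.
   Returns the number of cycles used. */
int systolic_matmul(int A[N][N], int B[N][N], int C[N][N]) {
    int a_reg[N][N] = {{0}};  /* A value currently held by each cell */
    int b_reg[N][N] = {{0}};  /* B value currently held by each cell */
    int cycles = 3 * N - 2;   /* time for the last operands to reach cell (N-1, N-1) */

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0;

    for (int t = 0; t < cycles; t++) {
        /* Walk from bottom-right to top-left so each cell reads its
           neighbors' values from the previous cycle. */
        for (int i = N - 1; i >= 0; i--) {
            for (int j = N - 1; j >= 0; j--) {
                int ka = t - i;  /* row i of A is fed in delayed by i cycles */
                int kb = t - j;  /* column j of B is fed in delayed by j cycles */
                int a_in = (j == 0)
                    ? ((ka >= 0 && ka < N) ? A[i][ka] : 0)
                    : a_reg[i][j - 1];
                int b_in = (i == 0)
                    ? ((kb >= 0 && kb < N) ? B[kb][j] : 0)
                    : b_reg[i - 1][j];
                C[i][j] += a_in * b_in;
                a_reg[i][j] = a_in;
                b_reg[i][j] = b_in;
            }
        }
    }
    return cycles;
}
```

In this model a 4×4 array finishes in 10 cycles, against 64 multiply-add steps for the one-at-a-time loop. Each value is read from memory once and then handed from cell to cell, which is why the design avoids most memory traffic.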

Energy Efficiency & Performance

Why a custom chip wins:

Performance Per Watt

Processor             Peak TFLOPS/TOPS   Power   Efficiency (TFLOPS/W)   Use case
Intel Xeon (CPU)      0.1                200W    0.0005                  General compute
RTX 4090 (GPU)        650                450W    1.4                     Gaming + AI (overkill)
Google TPU v4         430                150W    2.9                     Data center AI
Apple Neural Engine   17                 2W      8.5                     Mobile inference
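The efficiency column is just peak throughput divided by power. A tiny sketch, with hypothetical function names, in case you want to plug in other chips:

```c
/* Efficiency in TFLOPS (or TOPS) per watt. */
double tflops_per_watt(double tflops, double watts) {
    return tflops / watts;
}

/* How many times more efficient chip A is than chip B. */
double efficiency_ratio(double tflops_a, double watts_a,
                        double tflops_b, double watts_b) {
    return tflops_per_watt(tflops_a, watts_a) /
           tflops_per_watt(tflops_b, watts_b);
}
```

For example, tflops_per_watt(17.0, 2.0) gives 8.5, matching the Neural Engine row. Keep in mind that peak numbers are quoted at different precisions, so ratios between rows are only a rough guide.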

Real Examples in Your Pocket

iPhone 15 Pro: A17 Pro chip. Neural Engine inside.

All running on 2 watts. Try that on a GPU.

Key Takeaways

Tomorrow (Day 2): The problem in detail—why general-purpose chips will never win at AI.