AI Chip Design Day 1 — Why AI Chips?

The Problem

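In 2023, you almost certainly ran a neural network on your phone. Image recognition, voice transcription, and camera effects run on a dedicated neural processor inside the phone, while cloud services such as ChatGPT run on data-center accelerators. Either way, the work is done by special-purpose hardware, not a general-purpose CPU. Why?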

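Answer: Matrix multiplication is roughly 97% of the compute in deep learning. A general-purpose CPU is terrible at it. A GPU is better. A custom neural processor can be several times more efficient per watt than a GPU, and orders of magnitude more efficient than a CPU.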

CPU vs GPU vs NPU

CPU (General Purpose)

What it's good at: Running a few instruction streams very fast. Complex branching. Irregular memory access, hidden behind large caches.

For AI: Terrible.

// CPU trying to multiply two 256x256 matrices: C = A * B
#define N 256
void matmul(float A[N][N], float B[N][N], float C[N][N]) {
  for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
      float sum = 0.0f;
      for (int k = 0; k < N; k++) {
        sum += A[i][k] * B[k][j];  // One multiply-add per cycle, at best
      }
      C[i][j] = sum;
    }
  }
}
// Total: 256^3 = 16.7 million multiply-adds, so 16.7 million cycles
// On a 3 GHz CPU = 5.6 milliseconds

Reality: Modern CPUs run this much faster than one multiply per cycle (out-of-order execution, pipelining, AVX vector units, multiple cores). Still, you get maybe 50-100 GFLOPS for matrix multiply.
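The cycle count in the comments above can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch, not a benchmark: the function names are made up for this lesson, and it assumes the idealized model where the only cost is the multiply-adds themselves.

```python
def naive_matmul_macs(n: int) -> int:
    """Multiply-accumulate operations for an n x n by n x n matrix multiply."""
    return n ** 3


def ideal_time_ms(macs: int, clock_hz: float, macs_per_cycle: float = 1.0) -> float:
    """Runtime in milliseconds if the chip sustains macs_per_cycle every cycle."""
    cycles = macs / macs_per_cycle
    return cycles / clock_hz * 1e3


macs = naive_matmul_macs(256)
print(macs)                                     # 16777216
print(round(ideal_time_ms(macs, 3e9), 2))       # 5.59
# Hypothetical wider core: 16 multiply-adds per cycle (an assumed figure)
print(round(ideal_time_ms(macs, 3e9, 16), 2))   # 0.35
```

Doubling the matrix size multiplies the work by eight, which is why throughput per cycle matters so much more than clock speed here.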

GPU (Graphics Processor)

What it's good at: Thousands of identical operations in parallel. Massive memory bandwidth. Built for data-parallel math (pixel shaders are SIMD-style programs).

For AI: Actually pretty good.

NVIDIA Tesla T4:
• 65 TFLOPS (FP16 on Tensor Cores) for matrix multiply, about 8 TFLOPS in plain FP32
• 300 GB/sec memory bandwidth
• Can run a large neural network in ~10ms
• But uses 70W of power (hot, needs cooling)

NPU (Neural Processing Unit) — Specialized

What it's good at: Only one thing: matrix multiplication on low-precision values (INT8, BF16). Systolic arrays. No cache misses.

For AI: Phenomenal.

Apple Neural Engine (A17 Pro):
• Up to 35 TOPS (low-precision integer operations) for inference
• Uses only about 2W of power
• Can run a full language model in ~50ms
• Roughly 10× more efficient than a GPU per watt

Matrix Multiplication: The Core Problem

Deep learning is about 97% matrix multiply. Everything else (activation functions, normalization, softmax) is noise by comparison. Attention itself is built almost entirely from matrix multiplies.

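# A transformer layer is basically:
Q, K, V = x @ Wq, x @ Wk, x @ Wv        # 3 matrix muls (projections)
scores = softmax(Q @ K.T / sqrt(d))     # 1 matrix mul
output = (scores @ V) @ Wo              # 2 matrix muls
output = gelu(output @ W1) @ W2         # 2 matrix muls (MLP / FFN)

# All the rest (softmax, layer norm, etc) = a few percent of compute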
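To see where a figure like this comes from, the operations in one layer can be counted by hand. The sketch below is a rough estimate under stated assumptions, not a measurement: the function name is hypothetical, the FFN hidden size is assumed to be 4 × the model width, and the per-element costs for softmax, layer norm, and the activation are guesses.

```python
def transformer_layer_ops(seq_len: int, d_model: int, n_heads: int) -> dict:
    """Rough operation counts for one transformer layer (forward pass).

    Assumptions (illustrative only): FFN hidden size is 4 * d_model, a matmul
    of (m x k) by (k x n) costs 2*m*k*n operations, and the per-element costs
    of softmax, layer norm, and the activation are guessed constants.
    """
    n, d = seq_len, d_model
    matmul = 0
    matmul += 4 * (2 * n * d * d)        # Q, K, V, and output projections
    matmul += 2 * (2 * n * n * d)        # Q @ K.T and scores @ V, summed over heads
    matmul += 2 * (2 * n * d * 4 * d)    # the two FFN matmuls
    other = 0
    other += 5 * n_heads * n * n         # softmax over every attention score
    other += 2 * (8 * n * d)             # two layer norms
    other += 8 * n * 4 * d               # activation on the FFN hidden layer
    return {
        "matmul": matmul,
        "other": other,
        "matmul_share": matmul / (matmul + other),
    }


counts = transformer_layer_ops(seq_len=512, d_model=768, n_heads=12)
print(f"matmul share: {counts['matmul_share']:.1%}")
```

The exact percentage moves with sequence length, model width, and how the non-matmul operations are counted, but under these assumptions the matmul share stays well above 95%.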

So the question becomes: Can we design a chip that does matrix multiply incredibly fast and efficiently, and nothing else?

Answer: Yes, and it's called a systolic array.
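A toy software model makes the idea concrete. The sketch below simulates one common arrangement, an output-stationary array, where each processing element (PE) owns one entry of the result. It is an illustration under simplifying assumptions, not a description of how any particular chip is built, and the function name is made up for this lesson.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A @ B.

    PE (i, j) keeps a running sum for C[i][j]. Each cycle it multiplies the
    value arriving from its left neighbour by the value arriving from above,
    adds the product to its sum, and passes both values along.
    """
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]      # one accumulator per PE
    a_reg = [[0] * m for _ in range(n)]    # values moving left -> right
    b_reg = [[0] * m for _ in range(n)]    # values moving top -> bottom
    cycles = n + m + k - 2                 # last operands reach the far-corner PE
    for t in range(cycles):
        # Shift: walk from the far corner backwards so every PE reads the
        # value its neighbour held during the previous cycle.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j > 0:
                    a_reg[i][j] = a_reg[i][j - 1]
                else:
                    # Row i of A enters delayed by i cycles (the "skew").
                    a_reg[i][j] = A[i][t - i] if 0 <= t - i < k else 0
                if i > 0:
                    b_reg[i][j] = b_reg[i - 1][j]
                else:
                    # Column j of B enters delayed by j cycles.
                    b_reg[i][j] = B[t - j][j] if 0 <= t - j < k else 0
        # Compute: every PE performs one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


C, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C)       # [[19, 22], [43, 50]]
print(cycles)  # 4
```

Each input value is read from memory once and then reused as it hops from PE to PE, which is where the efficiency comes from. In hardware all PEs update at the same time, so the inner Python loops stand in for what is really a single clock cycle.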

Energy Efficiency & Performance

Why a custom chip wins:

✅ No instruction fetch/decode: One operation: MAC (multiply-accumulate)
✅ No cache hierarchy: Data flows directly through systolic array
✅ No branch prediction: No branches at all
✅ Quantization: INT8 instead of FP32 = 4× less data to move, and far less energy per operation
✅ Dataflow architecture: Data streams from one processing element to the next instead of round-tripping through memory
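The quantization point can be sketched in a few lines. This is a minimal example of one simple scheme (symmetric, per-tensor INT8), chosen for illustration; real toolchains use more elaborate calibration, and the function names here are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values) or 1.0   # avoid dividing by zero
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def int8_dot(qa, scale_a, qb, scale_b):
    """Dot product using integer MACs, rescaled to a float once at the end."""
    acc = 0   # hardware would typically use a wider accumulator, e.g. 32-bit
    for x, y in zip(qa, qb):
        acc += x * y
    return acc * scale_a * scale_b


weights = [0.12, -0.50, 0.33, 0.90]
inputs = [1.00, 0.25, -0.75, 0.50]

qw, sw = quantize_int8(weights)
qx, sx = quantize_int8(inputs)

exact = sum(w * x for w, x in zip(weights, inputs))
approx = int8_dot(qw, sw, qx, sx)
print(exact, approx)   # close, but not identical
print(len(weights) * 4, "bytes as FP32 vs", len(qw), "bytes as INT8")
```

The inner loop contains only integer multiplies and adds, which is the operation an NPU's MAC units are built around. The small mismatch between the two printed values is the accuracy cost of quantizing.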

Performance Per Watt

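Processor	Peak (TFLOPS or TOPS)	Power	Eff (per W)	Use Case
Intel Xeon (CPU)	0.1	200W	0.0005	General compute
RTX 4090 (GPU)	650	450W	1.4	Gaming + AI (overkill)
Google TPU v4	275	~170W	1.6	Data center AI
Apple Neural Engine	35	~2W	17.5	Mobile inference

Peak figures are vendor-style numbers at each chip's preferred precision (low-precision tensor math for the accelerators), so treat the comparison as order-of-magnitude only.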
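The efficiency column is just peak throughput divided by power. A minimal sketch of that arithmetic, using two rows from the table above (the function name is made up for this lesson):

```python
def tflops_per_watt(peak_tflops: float, power_watts: float) -> float:
    """Efficiency as peak throughput divided by power draw."""
    return peak_tflops / power_watts


# Two rows copied from the table (peak figures, not sustained performance).
rows = {
    "Intel Xeon (CPU)": (0.1, 200),
    "RTX 4090 (GPU)": (650, 450),
}
for name, (tflops, watts) in rows.items():
    print(f"{name}: {tflops_per_watt(tflops, watts):.4f} TFLOPS/W")
# Intel Xeon (CPU): 0.0005 TFLOPS/W
# RTX 4090 (GPU): 1.4444 TFLOPS/W
```

Peak numbers flatter every chip, so a ratio like this is best read as an upper bound on real-world efficiency.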

Real Examples in Your Pocket

iPhone 15 Pro: A17 Pro chip. Neural Engine inside.

Camera: Real-time portrait mode (AI segmentation)
Siri: On-device voice recognition
Photos: Smart search, object detection
Live Translate: On-device language models

All running on 2 watts. Try that on a GPU.

Key Takeaways

✅ AI is matrix multiplication. 97% of the compute.
✅ General-purpose CPUs are slow. Designed for flexibility, not speed at one task.
✅ GPUs are better but power-hungry. Still fetch instructions, manage caches, handle general compute.
✅ Custom NPUs dominate on efficiency. Do one thing perfectly: MAC operations.
✅ The tradeoff: NPUs can't run general code. But for AI inference, they're unbeatable.

Tomorrow (Day 2): The problem in detail—why general-purpose chips will never win at AI.
