A CPU is a few geniuses. A GPU is a thousand interns doing the same tiny job at once. Race them, change the core count, and launch a real kernel across the cores below.
Both have to apply the same operation to 48 data elements (think: brighten 48 pixels). The CPU has 1 core — it does them one at a time. The GPU has many small cores — it does a whole batch every cycle. Hit Run and count the cycles.
Try it: drag the slider to 48 cores and the GPU finishes the whole array in a single cycle. That's the entire idea of a GPU — trade clever-and-serial for simple-and-massively-parallel.
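The arithmetic behind the race fits in a few lines of Python. This is only a toy model: it assumes every core finishes exactly one element per cycle, with no memory or scheduling cost, and the function name cycles_needed is made up for illustration.

```python
import math


def cycles_needed(n_elements: int, cores: int) -> int:
    """Idealised cycle count: each core handles one element per cycle."""
    if n_elements < 0 or cores < 1:
        raise ValueError("need n_elements >= 0 and cores >= 1")
    # The last batch may be only partly full, so round up.
    return math.ceil(n_elements / cores)


N = 48  # the 48 data elements from the race
for cores in (1, 8, 16, 32, 48):
    print(f"{cores:>2} cores -> {cycles_needed(N, cores)} cycles")
```

Under this model, 32 cores need 2 cycles rather than 1.5, because the second batch still costs a whole cycle even though only 16 of the cores have work to do.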
A CPU is a handful of complex cores with big caches and branch predictors. It is brilliant at one complicated, decision-heavy task done fast: running your OS, logic, "if this then that". Latency-optimised.
A GPU is thousands of small arithmetic units that all run the same instruction on different data at once. It is brilliant at one simple calculation done a million times: pixels, matrices, AI. Throughput-optimised.
Neither is "better" — they're built for opposite goals. The GPU wins only when the work is wide, uniform and independent: every data element needs the same math and doesn't depend on its neighbours.
The magic word is SIMT — Single Instruction, Multiple Threads. You write one tiny program (a kernel), and the GPU runs it on thousands of threads at once — each thread handles one element, identified by its thread ID. Launch the kernel below and watch every thread brighten its own pixel simultaneously.
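The one-program-many-threads idea can be sketched in plain Python. This is a simulation, not GPU code: the launch helper below is hypothetical and runs the threads one after another, where real hardware would run them in parallel. The names brighten_kernel and launch and the brightness amount of 40 are assumptions chosen for illustration.

```python
def brighten_kernel(thread_id, pixels, out, amount):
    """The 'kernel': one tiny program, run once per thread."""
    # Each thread picks its own element using its thread ID.
    if thread_id < len(pixels):  # guard: there may be more threads than pixels
        out[thread_id] = min(255, pixels[thread_id] + amount)


def launch(kernel, n_threads, *args):
    """Hypothetical launcher: a GPU would run these threads simultaneously."""
    for thread_id in range(n_threads):
        kernel(thread_id, *args)


pixels = [i * 5 for i in range(48)]  # 48 pixel values: 0, 5, ..., 235
out = [0] * len(pixels)

# Launch 64 threads for 48 pixels; the 16 spare threads do nothing.
launch(brighten_kernel, 64, pixels, out, 40)
print(out[:4], "...", out[-1])
```

Note that the kernel contains no loop over the data. The loop lives in the launcher, which is the part the hardware replaces with parallel threads.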
GPUs aren't magic speed buttons. They lose when:
When threads take different if paths, the GPU runs them one path at a time (warp divergence), wasting its cores.
That's why your computer has both: a CPU for the smart, sequential, decision-heavy work, and a GPU for the wide, repetitive number-crunching. Graphics, deep learning, physics and crypto live on the GPU; everything else stays on the CPU.
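The cost of warp divergence can be illustrated with a simple serialisation model. The sketch assumes a warp of 32 threads (the size used on NVIDIA GPUs) and assumes that a warp pays the full cost of every branch path that at least one of its threads takes; the function name and the cycle costs are hypothetical.

```python
WARP_SIZE = 32  # threads that execute in lockstep (NVIDIA's warp size)


def warp_cycles(conditions, then_cost, else_cost):
    """Cycles one warp spends on an if/else under a serialisation model.

    conditions[i] is True if thread i takes the 'then' path.
    """
    takes_then = any(conditions)      # at least one thread goes into 'then'
    takes_else = not all(conditions)  # at least one thread goes into 'else'
    # Paths run one after another; threads not on the current path sit idle.
    return (then_cost if takes_then else 0) + (else_cost if takes_else else 0)


uniform = [True] * WARP_SIZE                        # every thread agrees
divergent = [i % 2 == 0 for i in range(WARP_SIZE)]  # threads alternate

print(warp_cycles(uniform, 10, 10))    # 10: only one path is executed
print(warp_cycles(divergent, 10, 10))  # 20: both paths run back to back
```

In this model a single thread that disagrees is enough to make the whole warp pay for both paths, which is why uniform work suits the GPU so much better than branchy work.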
For wide, uniform, independent data, a GPU processes many elements per cycle across thousands of simple cores instead of one at a time. The race above shows it directly.
SIMT stands for Single Instruction, Multiple Threads: one kernel runs on thousands of threads, each on its own data element identified by its thread ID.
A CPU is the better choice for serial, branchy or dependency-heavy work, or for tiny datasets where data-transfer overhead dominates. There, the CPU's strong single-thread performance wins.
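The effect of data-transfer overhead on tiny datasets can be sketched with a back-of-the-envelope cost model. Every number here is an invented placeholder, not a measurement: a fixed 20 µs cost to move data to the GPU, 0.01 µs per element on the CPU and 0.0001 µs per element on the GPU. Only the shape of the comparison matters.

```python
def cpu_time_us(n, per_element_us=0.01):
    """Hypothetical CPU cost: no transfer, slower per element."""
    return n * per_element_us


def gpu_time_us(n, transfer_overhead_us=20.0, per_element_us=0.0001):
    """Hypothetical GPU cost: fixed transfer overhead, faster per element."""
    return transfer_overhead_us + n * per_element_us


def faster_device(n):
    """Return which device wins for n elements under the made-up costs."""
    return "GPU" if gpu_time_us(n) < cpu_time_us(n) else "CPU"


for n in (100, 10_000, 1_000_000):
    print(f"{n:>9} elements -> {faster_device(n)}")
```

With these placeholder costs the CPU wins at 100 elements and the GPU wins at 10,000 and above. Real break-even points depend entirely on the hardware and the workload.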