From rendering pixels to training AI — how the GPU became the most important chip in modern computing.
CPU: few powerful sequential cores | GPU: thousands of small parallel cores
A GPU (Graphics Processing Unit) is a specialized electronic circuit designed to rapidly perform thousands of mathematical operations simultaneously. Originally built to render graphics and video, modern GPUs have become the most important chip for AI, machine learning, scientific simulation, and any workload that can be parallelized.
The key insight: while a CPU is built for sequential speed (doing one complex thing very fast), a GPU is built for parallel throughput (doing thousands of simple things at the same time). This makes GPUs ideal for matrix multiplication — the core operation in neural networks.
Simple analogy: A CPU is like a formula one car — incredibly fast, one task at a time. A GPU is like a cargo ship — slower per unit, but moves 10,000 containers simultaneously. For AI training, you need the cargo ship.
| Property | CPU | GPU |
|---|---|---|
| Core count | 4 – 128 cores | 1,000 – 20,000+ cores |
| Core design | Large, complex (out-of-order, branch predict) | Small, simple (in-order, scalar) |
| Clock speed | 4 – 6 GHz | 1.5 – 3 GHz |
| Cache | Large (up to 256 MB L3) | Smaller (shared per SM) |
| Memory | System RAM (DDR5, 100–200 GB/s) | VRAM — HBM3 up to 3.35 TB/s |
| Best for | OS, apps, sequential logic, low-latency | AI, graphics, video, simulations |
| Programming | C, C++, Python (any language) | CUDA (NVIDIA), ROCm (AMD), OpenCL |
| Power (TDP) | 15 – 400W | 75 – 700W |
| Process node | TSMC 3nm, Intel 20A | TSMC 4N, TSMC 3nm |
A GPU is organized into a hierarchy of compute units. Understanding this hierarchy is essential for chip designers and software engineers working with GPUs.
The smallest compute unit. Performs one FP32 or INT32 operation per clock. Thousands are packed into a GPU die.
Group of 128 CUDA cores + shared memory + warp schedulers + Tensor Cores. The fundamental scheduling unit of NVIDIA GPUs.
32 threads that execute the same instruction in lockstep (SIMT — Single Instruction, Multiple Threads). The GPU's execution granularity.
Specialized units for matrix multiply-accumulate (MMA) operations. Critical for AI training. H100 has 528 Tensor Cores doing 3958 TFLOPS FP8.
Hardware-accelerated ray-box and ray-triangle intersection for photorealistic rendering. Present in RTX series from Turing onwards.
High-Bandwidth Memory stacked on the package (HBM3) or on the board (GDDR6X). H100 uses HBM3 for 3.35 TB/s bandwidth vs 1 TB/s for GDDR6X.
| Architecture | Series | Process | Key Feature | Use Case |
|---|---|---|---|---|
| Hopper | H100, H200 | TSMC 4N | Transformer Engine, NVLink 4 | AI Data Center |
| Blackwell | B100, B200, GB200 | TSMC 4NP | 5th gen Tensor Cores, FP4 | AI Data Center |
| Ada Lovelace | RTX 4090–4060 | TSMC 4N | 3rd gen RT Core, DLSS 3 | Gaming / Prosumer |
| Ampere | A100, RTX 30-series | Samsung 8nm | Multi-instance GPU (MIG) | AI + Gaming |
| Jetson Orin | AGX Orin | TSMC 8nm | Ampere GPU + Cortex-A78AE | Edge AI / Automotive |
| Architecture | Series | Process | Key Feature | Use Case |
|---|---|---|---|---|
| RDNA 4 | RX 9000-series | TSMC 3nm | Ray accelerators, FSR 4 | Gaming |
| CDNA 3 | Instinct MI300X | TSMC 5nm | 192 GB HBM3, unified CPU+GPU die | AI / HPC |
| RDNA 3 | RX 7000-series | TSMC 5nm | Chiplet design | Gaming |
NVIDIA's moat: NVIDIA's dominance isn't just the hardware — it's CUDA. Launched in 2006, CUDA is the software ecosystem (libraries, compilers, frameworks like PyTorch/TensorFlow) that runs on NVIDIA GPUs. AMD has ROCm as an alternative, but CUDA's 15+ year head start in developer mindshare is the real barrier to entry.
| Spec | What it means | Why it matters |
|---|---|---|
| CUDA cores / Shaders | Number of parallel compute units | More = higher throughput for parallel workloads |
| VRAM | Dedicated GPU memory (GB) | Model size limit for AI; texture budget for games |
| Memory Bandwidth | GB/s of data fed to GPU cores | Critical for memory-bound workloads like LLM inference |
| TDP (Watts) | Thermal Design Power = power consumption | Determines cooling, electricity cost, and data center density |
| TFLOPS | Trillion floating-point ops/second | Raw compute throughput — higher = faster AI training |
| Die size (mm²) | Physical area of the chip | Larger die = more cores but lower yield = higher cost |
| Process node | Transistor size (nm) | Smaller = more transistors, better power efficiency |
The rise of deep learning from 2012 onwards was enabled entirely by GPUs. The math of neural networks — forward pass, backpropagation, gradient descent — reduces to matrix multiplications. GPUs can do these 100–1000x faster than CPUs.
From a semiconductor perspective, GPUs are among the most complex chips ever designed:
VLSI challenge: At 800mm², GPU dies are near the reticle limit of EUV lithography (~850mm²). Any defect in such a large die risks scrapping the entire chip — yield management is a major engineering challenge at NVIDIA and AMD.
A GPU (Graphics Processing Unit) is a processor with thousands of small cores designed for parallel computation. Originally for graphics, now essential for AI, ML, and scientific computing.
CPU: few powerful sequential cores (4–128), best for low-latency logic. GPU: thousands of small parallel cores (1000–20000+), best for matrix math and throughput-heavy workloads like AI training.
VRAM is the GPU's dedicated high-speed memory. HBM3 in the H100 delivers 3.35 TB/s bandwidth — 20x faster than system RAM. For AI, VRAM size limits how large a model you can load onto the GPU.
CUDA (Compute Unified Device Architecture) is NVIDIA's platform for GPU programming. It lets developers write C/C++ code that runs on GPU cores. PyTorch, TensorFlow, and most AI frameworks are built on CUDA.
Data center: NVIDIA H100 (training) / H100/MI300X (inference). Consumer: RTX 4090 (24GB). Budget: RTX 4070 Ti. NVIDIA dominates due to the CUDA ecosystem advantage.
NVIDIA RTX 40-series and H100: TSMC 4N (customized 4nm). AMD RDNA 4: TSMC 3nm. AMD MI300X: TSMC 5nm chiplets. Fabricated exclusively at TSMC or Samsung foundries.