What is a GPU?
A Graphics Processing Unit (GPU) runs thousands of threads concurrently, so highly parallel workloads can finish orders of magnitude faster than on a CPU. Learn how streaming multiprocessors, warps, SIMT execution, and the memory hierarchy make GPUs the engine of AI and graphics.
GPU Architecture Overview
CPU vs GPU: Core Architecture
CPU — Few Powerful Cores
8–64 cores, deep out-of-order (OOO) pipeline, large L3 cache, 3–5 GHz, optimized for low latency
GPU — Thousands of Simple Cores
16,896+ CUDA cores, simple pipeline, hide latency by thread switching, optimized for throughput
| Feature | CPU (e.g., i9-13900K) | GPU (e.g., H100) |
|---|---|---|
| Core count | 24 (P+E cores) | 16,896 CUDA cores |
| Clock speed | 5.8 GHz boost | 1.98 GHz |
| FP32 throughput | ~2 TFLOPS | 67 TFLOPS |
| Memory bandwidth | ~90 GB/s (DDR5) | 3.35 TB/s (HBM3) |
| Memory capacity | 192 GB (system RAM) | 80 GB (HBM) |
| Latency (single thread) | ~1 ns | ~100 ns |
| Power | 125W TDP | 700W TDP |
| Best for | OS, web, databases, games | AI training, graphics, HPC |
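The H100 FP32 figure in the table can be roughly reproduced from the core count and clock. The estimate below assumes each CUDA core retires one fused multiply-add per cycle (counted as 2 floating-point operations) at the listed 1.98 GHz clock; it is a theoretical peak, not a measured number.

$$
\text{Peak FP32} \approx 16{,}896\ \text{cores} \times 2\ \frac{\text{FLOP}}{\text{cycle}} \times 1.98 \times 10^{9}\ \frac{\text{cycles}}{\text{s}} \approx 66.9 \times 10^{12}\ \text{FLOP/s} \approx 67\ \text{TFLOPS}
$$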
SIMT Parallel Execution
A GPU launches a grid of thread blocks. Each block runs on one streaming multiprocessor (SM). Within a block, threads are grouped into warps of 32, and all 32 threads in a warp execute the same instruction at the same time. A block of 128 threads therefore runs as 4 warps of 32.
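As a rough illustration of this grouping, the sketch below records which warp and which lane (position inside the warp) each thread of a block belongs to. The names `recordWarpLane`, `warpsPerBlock`, and `warpIdsForBlock` are hypothetical and introduced only for this example; `warpSize` is the CUDA built-in variable. Error checking is omitted for brevity, and a 1D block is assumed.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: number of warps needed for a block (rounds up).
inline int warpsPerBlock(int threadsPerBlock, int warp = 32) {
    return (threadsPerBlock + warp - 1) / warp;
}

// Each thread writes its warp index and lane index within the block.
__global__ void recordWarpLane(int *warpId, int *laneId) {
    int t = threadIdx.x;                     // thread index inside the block (1D block assumed)
    int g = blockIdx.x * blockDim.x + t;     // global index used for the output arrays
    warpId[g] = t / warpSize;                // which warp of the block this thread is in
    laneId[g] = t % warpSize;                // position of the thread inside its warp
}

// Hypothetical host wrapper: launches one block and returns the warp index of every thread.
std::vector<int> warpIdsForBlock(int threads) {
    int *dWarp = nullptr, *dLane = nullptr;
    cudaMalloc(&dWarp, threads * sizeof(int));
    cudaMalloc(&dLane, threads * sizeof(int));
    recordWarpLane<<<1, threads>>>(dWarp, dLane);
    std::vector<int> hWarp(threads);
    // cudaMemcpy on the default stream runs after the kernel has completed.
    cudaMemcpy(hWarp.data(), dWarp, threads * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dWarp);
    cudaFree(dLane);
    return hWarp;
}
```

Calling `warpIdsForBlock(128)` should report warp 0 for threads 0–31, warp 1 for threads 32–63, and so on up to warp 3, matching the 4 warps described above.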
GPU Memory Hierarchy
GPU memory is deeply hierarchical. The fastest memory is closest to the cores. Choosing the right level is critical for performance — a cache miss to global DRAM can cost 600+ cycles.
| Memory Type | Scope | Latency | Key Use |
|---|---|---|---|
| Registers | Per-thread | ~1 cycle | Local variables, loop counters |
| Shared Memory | Per-block (SM) | ~20 cycles | Thread cooperation, tile caching for matmul |
| L2 Cache | All SMs | ~100 cycles | Automatic — reused global data |
| Global (HBM) | Whole GPU | ~600 cycles | Large arrays, model weights, activations |
| Constant Cache | Whole GPU (read-only) | ~4 cycles on a cache hit | Kernel parameters, weights that don't change |
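To make the table more concrete, here is a minimal sketch of a kernel that touches several of these levels: registers for the index, shared memory for a per-block reduction, global memory for input and output, and constant memory for a scale factor. The names `kScale`, `scaledBlockSum`, and `scaledSum` are hypothetical; the sketch assumes 256 threads per block, `n > 0`, and omits error checking.

```cuda
#include <cuda_runtime.h>
#include <vector>

__constant__ float kScale;  // constant memory: read-only inside kernels, written from the host

// Each block sums up to 256 inputs and writes one scaled partial sum.
// Assumes a launch with exactly 256 threads per block.
__global__ void scaledBlockSum(const float *in, float *out, int n) {
    __shared__ float partial[256];                   // shared memory: visible to the whole block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // scalar locals like i normally live in registers
    partial[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // read from global memory
    __syncthreads();
    // Tree reduction carried out entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = kScale * partial[0];  // write to global memory
}

// Hypothetical host wrapper: returns scale * (sum of hIn[0..n)).
float scaledSum(const float *hIn, int n, float scale) {
    const int threads = 256;
    int blocks = (n + threads - 1) / threads;
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, blocks * sizeof(float));
    cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(kScale, &scale, sizeof(float));
    scaledBlockSum<<<blocks, threads>>>(dIn, dOut, n);
    std::vector<float> hOut(blocks);
    cudaMemcpy(hOut.data(), dOut, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    float total = 0.0f;
    for (float v : hOut) total += v;   // finish the sum of per-block results on the host
    cudaFree(dIn);
    cudaFree(dOut);
    return total;
}
```

The reduction reads each input from global memory only once and does all repeated accesses in shared memory, which is the usual reason to stage data there.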
GPU Evolution — From Graphics to AI
| Generation | Year | Key Addition | Peak AI Throughput (FP16 unless noted) |
|---|---|---|---|
| NVIDIA Pascal (GP100) | 2016 | NVLink, FP16 support | 21 TFLOPS |
| NVIDIA Volta (GV100) | 2017 | Tensor Cores (V1) | 112 TFLOPS |
| NVIDIA Ampere (GA100) | 2020 | BF16, TF32, A100 | 312 TFLOPS |
| NVIDIA Hopper (GH100) | 2022 | FP8, Transformer Engine, H100 | 1,979 TFLOPS (sparse); 3,958 TFLOPS at FP8 (sparse) |
| NVIDIA Blackwell (GB200) | 2024 | FP4, NVLink 5, 2× H100 perf | ~8,000 TFLOPS (FP4) |
// CUDA kernel: GPU vector add (each thread handles one element)
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // unique global thread index
    if (i < n) c[i] = a[i] + b[i];                 // guard: the last block may have spare threads
}
// Launch: N = 1,000,000 elements, 256 threads/block → (N + 255) / 256 = 3907 blocks
// d_a, d_b, d_c are device pointers previously allocated with cudaMalloc
vecAdd<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c, N);
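The launch line above assumes the device buffers already exist. A minimal host-side sketch of the surrounding steps (allocate, copy in, launch, copy out, free) might look like the following. The wrapper `vecAddOnGpu` and the helper `blocksFor` are hypothetical names for this example, the kernel is repeated so the block compiles on its own, and error checking is left out.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Ceiling division: number of blocks needed to cover n elements.
inline int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical host wrapper: copies inputs to the GPU, runs vecAdd, returns a + b.
// Assumes a.size() == b.size() and a non-empty input.
std::vector<float> vecAddOnGpu(const std::vector<float> &a, const std::vector<float> &b) {
    int n = static_cast<int>(a.size());
    size_t bytes = n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);
    vecAdd<<<blocksFor(n, 256), 256>>>(dA, dB, dC, n);
    std::vector<float> c(n);
    // This copy is ordered after the kernel on the default stream.
    cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}
```

For one million elements, `blocksFor(1000000, 256)` gives 3907 blocks, of which the last is only partly used; the `i < n` guard in the kernel keeps the spare threads from writing out of bounds.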
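// CUDA shared-memory tiled matrix multiply (simplified); launch with 16x16 thread blocks
#define TILE 16
__global__ void matMul(const float *A, const float *B, float *C, int N) {
    __shared__ float tileA[TILE][TILE], tileB[TILE][TILE]; // shared mem: one tile pair per block
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < (N + TILE - 1) / TILE; t++) {      // round up so a partial last tile is covered
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // cooperative load: each thread fetches one element per tile; out-of-range elements become 0
        tileA[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        tileB[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads(); // wait until all 256 threads in the block have loaded their elements
        for (int k = 0; k < TILE; k++) sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads(); // do not overwrite the tiles until every thread has finished using them
    }
    if (row < N && col < N) C[row * N + col] = sum;        // guard for N not divisible by 16
}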
What is a GPU Used For?
AI & Deep Learning
Training and inference of LLMs (ChatGPT, Gemini). Matrix multiply = the heart of transformers. H100 clusters train GPT-4-scale models.
Gaming & Graphics
Real-time ray tracing, rasterization, shading of millions of pixels at 60–240 FPS. RT Cores trace ray-triangle intersections in hardware.
Scientific HPC
Molecular dynamics, CFD, climate modeling, protein folding (AlphaFold inference is commonly run on NVIDIA GPUs such as the A100). Simulations that take weeks on a CPU can run in hours.
Video Encode/Decode
Hardware NVENC/NVDEC encode 4K H.265 streams without using CUDA cores. Used in streaming platforms and video conferencing.
Cryptocurrency
SHA-256, Ethash (used by Ethereum until its 2022 move to proof of stake), and other PoW hashing algorithms exploit GPU parallelism for mining (though ASICs now dominate Bitcoin).
Security & Cracking
Password hash brute-force, SSL offload, cryptography acceleration. GPUs compute billions of SHA-1 hashes per second.
Frequently Asked Questions
Do you need a GPU to run AI models?
No. Small models run on a CPU. But for training large models or for fast inference, a GPU is essentially required: a transformer that trains in 1 hour on an A100 can take on the order of 100 hours on a high-end CPU. For production inference of GPT-class models, multiple GPUs are often needed just to hold the weights in memory.
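The point about weights not fitting on one GPU is simple arithmetic, sketched below as host-side helpers. The function names are hypothetical, and the estimate counts weights only; activations, KV cache, and optimizer state would add to it.

```cuda
#include <cstdint>

// Bytes needed to store the weights alone.
inline std::uint64_t weightBytes(std::uint64_t numParams, std::uint64_t bytesPerParam) {
    return numParams * bytesPerParam;
}

// Minimum number of GPUs whose combined memory can hold the weights (ceiling division).
inline std::uint64_t minGpusForWeights(std::uint64_t numParams,
                                       std::uint64_t bytesPerParam,
                                       std::uint64_t gpuMemBytes) {
    std::uint64_t need = weightBytes(numParams, bytesPerParam);
    return (need + gpuMemBytes - 1) / gpuMemBytes;
}
```

As an illustrative case, a 175-billion-parameter model stored in FP16 (2 bytes per parameter) needs 350 GB for weights, so `minGpusForWeights(175000000000ULL, 2, 80000000000ULL)` returns 5 for 80 GB GPUs. A 7-billion-parameter model in FP16 needs 14 GB and fits on one.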
What is the difference between a GPU and an NPU?
An NPU (Neural Processing Unit) is an ASIC specifically designed for neural network inference — fixed data flow, INT8/INT4 operations, very low power. A GPU is a general-purpose parallel processor that can do AI but also graphics, compute, and more. NPUs (Apple Neural Engine, Qualcomm Hexagon) are more efficient for on-device inference; GPUs are flexible and reprogrammable.
What is warp divergence and why does it hurt performance?
When threads in the same warp take different branches (e.g., if/else based on thread ID), the GPU must serialize the two paths. With an even split, half the warp idles during each path, which roughly halves throughput for that section of code. The fix: restructure code so all 32 threads in a warp take the same branch, or use predication to avoid branching entirely.
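A minimal sketch of the two cases, using hypothetical kernel names: in `divergentBranch` the condition alternates between neighboring threads, so every warp contains both paths; in `warpUniformBranch` the condition depends only on the warp index, so each warp takes a single path. Note that the second kernel assigns the operations to different elements, so this restructuring only applies when the data layout can be arranged to match. For branches this short the compiler may already emit predicated instructions, so the example shows the pattern rather than a guaranteed slowdown.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Divergent: even and odd threads of the same warp take different branches.
__global__ void divergentBranch(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) x[i] = x[i] * 2.0f;
    else                      x[i] = x[i] + 1.0f;
}

// Warp-uniform: the condition is the same for all 32 threads of a warp.
__global__ void warpUniformBranch(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((threadIdx.x / warpSize) % 2 == 0) x[i] = x[i] * 2.0f;
    else                                   x[i] = x[i] + 1.0f;
}

// Hypothetical host wrapper: runs one of the two kernels on a copy of x and returns the result.
std::vector<float> runBranch(bool uniform, std::vector<float> x) {
    int n = static_cast<int>(x.size());
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    int threads = 64;                              // two warps per block
    int blocks = (n + threads - 1) / threads;
    if (uniform) warpUniformBranch<<<blocks, threads>>>(d, n);
    else         divergentBranch<<<blocks, threads>>>(d, n);
    cudaMemcpy(x.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return x;
}
```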
What is occupancy in CUDA?
Occupancy is the ratio of active warps to the maximum supported warps on an SM. Higher occupancy lets the GPU hide memory latency by switching to another warp while one waits for a DRAM fetch. Occupancy is limited by register usage per thread and shared memory per block — using fewer registers per thread allows more warps to run concurrently.
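One way to estimate this ratio for a specific kernel is the CUDA runtime call `cudaOccupancyMaxActiveBlocksPerMultiprocessor`, which reports how many blocks of a given size can be resident on one SM. The sketch below combines it with device properties to compute a theoretical occupancy; `dummyKernel`, `occupancyRatio`, and `estimateOccupancy` are hypothetical names, device 0 is assumed, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel whose resource usage is being measured.
__global__ void dummyKernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

// Occupancy as defined above: active warps divided by the maximum warps per SM.
inline float occupancyRatio(int activeWarps, int maxWarps) {
    return static_cast<float>(activeWarps) / static_cast<float>(maxWarps);
}

// Theoretical occupancy of dummyKernel on device 0 for a given block size.
float estimateOccupancy(int blockSize) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int blocksPerSm = 0;
    // Last argument: dynamic shared memory per block in bytes (none here).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, dummyKernel, blockSize, 0);
    int warpsPerBlock = (blockSize + prop.warpSize - 1) / prop.warpSize;
    int activeWarps = blocksPerSm * warpsPerBlock;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    return occupancyRatio(activeWarps, maxWarps);
}
```

The result is a theoretical upper bound derived from register, shared memory, and block-size limits; achieved occupancy at run time can be lower and is best measured with a profiler.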