A GPU (Graphics Processing Unit) is a massively parallel processor designed to execute thousands of threads simultaneously. Originally built for rendering triangles (each pixel computed independently), GPUs now accelerate any data-parallel workload: deep learning, scientific simulation, cryptocurrency mining, and video encoding. A modern GPU contains thousands of simple ALU cores grouped into Streaming Multiprocessors (SMs). Each SM schedules threads in groups of 32 (a warp), and the threads of a warp execute in lockstep using SIMT (Single Instruction, Multiple Thread) execution.

What is the difference between GPU and CPU?

A CPU is optimized for low-latency sequential execution: a few powerful cores (4-64) with large caches and out-of-order execution that minimize latency for any single task. A GPU is optimized for high-throughput parallel execution: thousands of simpler cores that hide memory latency by switching between thousands of threads. CPU wins for latency-sensitive serial code. GPU wins for embarrassingly parallel workloads — matrix multiply, convolution, ray tracing — where the same operation repeats over millions of data elements.

What is a CUDA core?

A CUDA core is a single-precision (FP32) floating-point and integer ALU inside an NVIDIA GPU. Each Streaming Multiprocessor (SM) contains a fixed number of CUDA cores that depends on the architecture (e.g., 128 per SM in Hopper and in consumer Ampere parts, 64 per SM in the A100). Each CUDA core executes one operation per clock for one thread of a warp, so 32 cores are needed to issue one warp instruction (32 threads) per clock. An H100 GPU (SXM) has 132 SMs × 128 CUDA cores = 16,896 CUDA cores. Tensor Cores (separate from CUDA cores) are specialized for the matrix multiply-accumulate operations used in AI.
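The core-count arithmetic above can be checked with a small CUDA program. This is a minimal sketch: `totalCudaCores` and `assumedCoresPerSm` are illustrative names that do not appear in the original text. The CUDA runtime reports the SM count and warp size but not the number of cores per SM, so that figure has to come from the architecture documentation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Total FP32 CUDA cores = number of SMs x cores per SM.
// The runtime does not report cores per SM, so the caller supplies it.
int totalCudaCores(int smCount, int coresPerSm) {
    return smCount * coresPerSm;
}

int main() {
    // Worked example from the text: H100 with 132 SMs and 128 cores per SM.
    printf("H100 (figures from the text): %d CUDA cores\n",
           totalCudaCores(132, 128));

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("No CUDA device available; skipping the device query.\n");
        return 0;
    }
    // ASSUMPTION: 128 cores per SM. Replace with the value for your GPU.
    const int assumedCoresPerSm = 128;
    printf("%s: %d SMs, warp size %d, %d CUDA cores if each SM has %d\n",
           prop.name, prop.multiProcessorCount, prop.warpSize,
           totalCudaCores(prop.multiProcessorCount, assumedCoresPerSm),
           assumedCoresPerSm);
    return 0;
}
```

The first line printed is the H100 figure, 16896. The second line depends on the installed GPU and on the assumed cores-per-SM value.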

What is SIMT execution in a GPU?

SIMT (Single Instruction, Multiple Thread) means all 32 threads in a warp execute the same instruction in the same clock cycle. When threads diverge (e.g., different branches in an if/else), the GPU serializes the divergent paths: threads taking the 'if' branch run while the 'else' threads are masked off, then vice versa. This is why GPU code should minimize branching within a warp. SIMT differs from CPU SIMD in that each thread has its own registers and its own logical program counter (tracked per thread in hardware since Volta), so the hardware handles divergence and reconvergence automatically instead of requiring explicit mask management in the code.

What is GPU used for in AI?

In AI, GPUs accelerate matrix multiplications, the core operation in neural networks. Training a large language model requires trillions of multiply-accumulate (MAC) operations; a modern H100 GPU (SXM) delivers roughly 1 petaFLOPS of dense FP16 tensor throughput, about 2 petaFLOPS with structured sparsity, and up to 3.9 petaFLOPS in FP8 with sparsity. Tensor Cores (introduced in Volta) perform a 4×4×4 matrix multiply-accumulate per clock (64 fused multiply-adds), providing roughly an order of magnitude more throughput than CUDA cores for AI workloads. NVLink interconnects allow multiple GPUs to pool memory for models too large for one card.
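One way to reach Tensor Cores from CUDA C++ is the warp-level WMMA API in `mma.h`. The sketch below multiplies two 16×16 half-precision tiles filled with ones and checks that every output element is 16. The kernel and helper names (`fillHalf`, `tileMma`, `runTileMma`) are illustrative. It assumes a GPU of compute capability 7.0 or newer and a matching compile target, for example `nvcc -arch=sm_70`.

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>

using namespace nvcuda;

// Fill a half-precision array on the device with a constant value.
__global__ void fillHalf(half *p, int n, float value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = __float2half(value);
}

// One warp computes D = A * B for a single 16x16x16 tile.
__global__ void tileMma(const half *a, const half *b, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fragA;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fragB;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(fragA, a, 16);      // leading dimension = 16
    wmma::load_matrix_sync(fragB, b, 16);
    wmma::mma_sync(acc, fragA, fragB, acc);    // acc = A * B + acc
    wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
}

// Returns the number of wrong output elements, or -1 on a CUDA error.
int runTileMma() {
    const int n = 16 * 16;
    half *a = nullptr, *b = nullptr;
    float *d = nullptr;
    if (cudaMalloc(&a, n * sizeof(half)) != cudaSuccess ||
        cudaMalloc(&b, n * sizeof(half)) != cudaSuccess ||
        cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return -1;
    fillHalf<<<1, n>>>(a, n, 1.0f);
    fillHalf<<<1, n>>>(b, n, 1.0f);
    tileMma<<<1, 32>>>(a, b, d);               // exactly one warp
    float h[n];
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess) err = cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(a); cudaFree(b); cudaFree(d);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return -1;
    }
    // With all-ones inputs, each output is a sum of 16 products of 1 * 1.
    int bad = 0;
    for (int i = 0; i < n; i++) if (h[i] != 16.0f) bad++;
    return bad;
}

int main() {
    int bad = runTileMma();
    printf("wrong elements: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

All-ones inputs are used so that the result does not depend on the row-major or column-major layout chosen for the fragments. Real code would tile a larger matrix across many warps.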

Massively Parallel Computing

What is a GPU?

A Graphics Processing Unit runs thousands of threads simultaneously: highly parallel work that takes seconds on a CPU can finish in milliseconds on a GPU. Learn how streaming multiprocessors, warps, SIMT execution, and the memory hierarchy make GPUs the engine of AI and graphics.

GPU Architecture Overview

CPU vs GPU: Core Architecture

CPU: Few Powerful Cores

[Diagram: eight CPU cores, numbered 1–8]

8–64 cores, deep out-of-order (OOO) pipeline, large L3 cache, 3–5 GHz, optimized for low latency

GPU: Thousands of Simple Cores

16,896+ CUDA cores, simple pipeline, latency hidden by thread switching, optimized for throughput

Feature	CPU (e.g., i9-13900K)	GPU (e.g., H100)
Core count	24 (P+E cores)	16,896 CUDA cores
Clock speed	5.8 GHz boost	1.98 GHz
FP32 throughput	~2 TFLOPS	67 TFLOPS
Memory bandwidth	~90 GB/s (DDR5)	3.35 TB/s (HBM3)
Memory capacity	192 GB (system RAM)	80 GB (HBM)
Latency (single thread)	~1 ns	~100 ns
Power	125 W base power (253 W max turbo)	700 W TDP
Best for	OS, web, databases, games	AI training, graphics, HPC

SIMT Parallel Execution

A GPU launches a grid of thread blocks. Each block runs on one SM. Within a block, threads are grouped into warps of 32, and all 32 threads in a warp execute the same instruction simultaneously. For example, a block of 128 threads on a single SM is split into 4 warps of 32 threads each.
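The block-to-warp mapping described above can be observed directly. In this sketch, each thread of a 128-thread block records its warp index and its lane (its position within the warp) using the built-in `warpSize` variable. The names `recordWarpLane` and `warpsPerBlock` are illustrative, and unified memory is used only to keep the example short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Warps needed to cover a block, rounded up: a partial warp still
// occupies a full warp slot.
int warpsPerBlock(int threads, int warpSz) {
    return (threads + warpSz - 1) / warpSz;
}

// Each thread of a 1D block records its warp index and lane index.
__global__ void recordWarpLane(int *warpId, int *laneId) {
    int t = threadIdx.x;
    warpId[t] = t / warpSize;
    laneId[t] = t % warpSize;
}

int main() {
    const int threads = 128;
    int *warpId = nullptr, *laneId = nullptr;
    if (cudaMallocManaged(&warpId, threads * sizeof(int)) != cudaSuccess ||
        cudaMallocManaged(&laneId, threads * sizeof(int)) != cudaSuccess) {
        printf("CUDA allocation failed\n");
        return 1;
    }
    recordWarpLane<<<1, threads>>>(warpId, laneId);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("%d threads -> %d warps\n", threads, warpsPerBlock(threads, 32));
    // Print the first and last thread of each warp (warp size is 32).
    for (int t = 0; t < threads; t += 32) {
        printf("thread %3d: warp %d, lane %2d | thread %3d: warp %d, lane %2d\n",
               t, warpId[t], laneId[t], t + 31, warpId[t + 31], laneId[t + 31]);
    }
    cudaFree(warpId);
    cudaFree(laneId);
    return 0;
}
```

Threads 0 to 31 land in warp 0, threads 32 to 63 in warp 1, and so on. A block size that is not a multiple of 32 leaves the last warp partly empty.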

GPU Memory Hierarchy

GPU memory is deeply hierarchical. The fastest memory is closest to the cores. Choosing the right level is critical for performance — a cache miss to global DRAM can cost 600+ cycles.

Registers: 256 KB per SM · <1 cycle
Shared Memory / L1: up to 228 KB of shared memory per SM on H100 (96 KB on V100) · ~20 cycles · programmer-managed
L2 Cache: 50 MB (H100) · ~100 cycles · shared across all SMs
HBM3 / GDDR6X (Global Memory): 80 GB · ~600 cycles · 3.35 TB/s bandwidth
CPU System RAM (via PCIe / NVLink): hundreds of GB · ~10,000 cycles · up to 900 GB/s over NVLink, far less over PCIe (about 64 GB/s per direction for PCIe 5.0 x16)

Memory Type	Scope	Latency	Key Use
Registers	Per-thread	<1 cycle	Local variables, loop counters
Shared Memory	Per-block (SM)	~20 cycles	Thread cooperation, tile caching for matmul
L2 Cache	All SMs	~100 cycles	Automatic — reused global data
Global (HBM)	Whole GPU	~600 cycles	Large arrays, model weights, activations
Constant Cache	All threads (read-only)	~4 cycles on a cache hit	Kernel parameters, weights that don't change
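The sketch below shows three levels of the hierarchy working together in a block-level sum: each thread reads one value from global memory into a register, stages it in shared memory, and the block cooperates on a tree reduction before thread 0 writes one partial sum back to global memory. `blockSum`, `sumOfOnes`, and the `BLOCK` size of 256 are illustrative choices, and unified memory is used only to keep the host code short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256  // threads per block; must be a power of two for this reduction

__global__ void blockSum(const float *in, float *partial, int n) {
    __shared__ float buf[BLOCK];              // shared memory: visible to the block
    int i = blockIdx.x * BLOCK + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;         // global memory -> register
    buf[threadIdx.x] = v;                     // register -> shared memory
    __syncthreads();
    // Tree reduction: halve the number of active threads each step.
    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = buf[0];  // one global write per block
}

// Sums n ones on the GPU; returns -1 on a CUDA error.
float sumOfOnes(int n) {
    int blocks = (n + BLOCK - 1) / BLOCK;
    float *in = nullptr, *partial = nullptr;
    if (cudaMallocManaged(&in, n * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&partial, blocks * sizeof(float)) != cudaSuccess) return -1.0f;
    for (int i = 0; i < n; i++) in[i] = 1.0f;
    blockSum<<<blocks, BLOCK>>>(in, partial, n);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    float total = -1.0f;
    if (err == cudaSuccess) {
        total = 0.0f;
        for (int b = 0; b < blocks; b++) total += partial[b];  // finish on the host
    }
    cudaFree(in);
    cudaFree(partial);
    return total;
}

int main() {
    const int n = 1 << 16;
    printf("sum of %d ones = %.0f\n", n, sumOfOnes(n));
    return 0;
}
```

Each input element is read from global memory once. The 8 reduction steps per block then run entirely in shared memory, which is the reason for staging data there.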

GPU Evolution — From Graphics to AI

Generation	Year	Key Addition	Peak low-precision throughput (FP16 unless noted)
NVIDIA Pascal (GP100)	2016	NVLink, FP16 support	21 TFLOPS
NVIDIA Volta (GV100)	2017	Tensor Cores (V1)	112 TFLOPS
NVIDIA Ampere (GA100)	2020	BF16, TF32, A100	312 TFLOPS
NVIDIA Hopper (GH100)	2022	FP8, Transformer Engine, H100	989 TFLOPS FP16 dense (3,958 TFLOPS for FP8 with sparsity)
NVIDIA Blackwell (GB200)	2024	FP4, NVLink 5, 2× H100 perf	~8,000 TFLOPS (FP4)

// CUDA kernel: GPU vector add (each thread handles one element)
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread index
    if (i < n) c[i] = a[i] + b[i];                  // guard: the last block may run past n
}

// Launch: N = 1,000,000 elements, 256 threads/block → ceil(N/256) = 3907 blocks
vecAdd<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c, N);
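The kernel and launch line above leave out the host side. A minimal complete program might look like the following sketch, which allocates device buffers, copies the inputs, launches the kernel, and verifies the result on the CPU. The helper `gridSizeFor` and the input values 1.0 and 2.0 are illustrative.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Number of blocks needed to cover n elements (integer ceiling division).
int gridSizeFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main() {
    const int n = 1000000;
    const size_t bytes = n * sizeof(float);
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    if (cudaMalloc(&dA, bytes) != cudaSuccess ||
        cudaMalloc(&dB, bytes) != cudaSuccess ||
        cudaMalloc(&dC, bytes) != cudaSuccess) {
        printf("CUDA allocation failed\n");
        return 1;
    }
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = gridSizeFor(n, threads);   // 3907 for n = 1,000,000
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    cudaError_t err = cudaGetLastError();          // catches launch errors
    if (err == cudaSuccess)                        // the copy waits for the kernel
        err = cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int mismatches = 0;
    for (int i = 0; i < n; i++) if (c[i] != 3.0f) mismatches++;
    printf("blocks = %d, mismatches = %d\n", blocks, mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

The last block has only 64 useful threads (1,000,000 = 3906 × 256 + 64), which is why the kernel needs the `i < n` guard.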

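// CUDA shared memory tiled matrix multiply (simplified; launch with 16×16 thread blocks)
#define TILE 16
__global__ void matMul(const float *A, const float *B, float *C, int N) {
    __shared__ float tileA[TILE][TILE], tileB[TILE][TILE];   // shared mem: one tile pair per block
    int row = blockIdx.y*TILE + threadIdx.y;
    int col = blockIdx.x*TILE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < (N + TILE - 1)/TILE; t++) {          // round up so N need not be a multiple of 16
        int aCol = t*TILE + threadIdx.x, bRow = t*TILE + threadIdx.y;
        tileA[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row*N + aCol] : 0.0f;  // cooperative load, zero-padded
        tileB[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow*N + col] : 0.0f;
        __syncthreads();                                     // wait until all 256 threads have loaded
        for (int k = 0; k < TILE; k++) sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();                                     // finish reading before the next tile overwrites
    }
    if (row < N && col < N) C[row*N + col] = sum;            // guard against out-of-range threads
}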

What is GPU Used For?

🧠

AI & Deep Learning

Training and inference of LLMs (ChatGPT, Gemini). Matrix multiply = the heart of transformers. H100 clusters train GPT-4-scale models.

🎮

Gaming & Graphics

Real-time ray tracing, rasterization, shading of millions of pixels at 60–240 FPS. RT Cores accelerate ray-triangle intersection tests in hardware.

🔬

Scientific HPC

Molecular dynamics, CFD, climate modeling, protein structure prediction (AlphaFold inference commonly runs on GPUs such as the A100). Simulations that take weeks on CPU can run in hours.

🎬

Video Encode/Decode

Dedicated hardware blocks handle video: NVENC encodes and NVDEC decodes 4K H.265 streams without using CUDA cores. Used in streaming platforms and video conferencing.

💰

Cryptocurrency

SHA-256, Ethash, and other PoW hashing algorithms exploit GPU parallelism for mining (though ASICs now dominate Bitcoin).

🔐

Security & Cracking

Password hash brute-force, SSL offload, cryptography acceleration. GPUs compute billions of SHA-1 hashes per second.

Frequently Asked Questions

Do you need a GPU to run AI models?

No: small models run on CPU. But for training large models or fast inference, a GPU is essentially required. As a rough illustration, a transformer that trains in 1 hour on an A100 can take on the order of 100 hours on a high-end CPU. For production inference of GPT-class models, you need multiple GPUs just to hold the weights in memory.
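The "multiple GPUs just to hold the weights" point is simple arithmetic: parameter count times bytes per parameter, divided by the memory of one GPU. The sketch below is host-only code that compiles in a `.cu` file. The helper `gpusForWeights` and the 70B and 175B parameter counts are hypothetical examples, and the 80 GB figure is the H100 capacity quoted earlier. It ignores activations, the KV cache, and framework overhead, so real deployments need more.

```cuda
#include <cmath>
#include <cstdio>

// Minimum number of GPUs whose combined memory can hold the weights alone.
int gpusForWeights(double numParams, double bytesPerParam, double gpuMemBytes) {
    return (int)std::ceil(numParams * bytesPerParam / gpuMemBytes);
}

int main() {
    const double hbmBytes = 80e9;   // 80 GB per GPU (H100 figure from the text)
    // Hypothetical model sizes, stored in FP16 (2 bytes per parameter).
    printf("70B params, FP16:  %d GPUs\n", gpusForWeights(70e9, 2.0, hbmBytes));
    printf("175B params, FP16: %d GPUs\n", gpusForWeights(175e9, 2.0, hbmBytes));
    return 0;
}
```

This prints 2 GPUs for the 70B case (140 GB of weights) and 5 GPUs for the 175B case (350 GB of weights).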

What is the difference between a GPU and an NPU?

An NPU (Neural Processing Unit) is an ASIC specifically designed for neural network inference — fixed data flow, INT8/INT4 operations, very low power. A GPU is a general-purpose parallel processor that can do AI but also graphics, compute, and more. NPUs (Apple Neural Engine, Qualcomm Hexagon) are more efficient for on-device inference; GPUs are flexible and reprogrammable.

What is warp divergence and why does it hurt performance?

When threads in the same warp take different branches (e.g., if/else based on thread ID), the GPU must serialize the two paths, and the threads not on the current path sit idle. With an even split and paths of similar length, this roughly halves throughput for the divergent region. The fix: restructure code so all 32 threads in a warp take the same branch, or use predication to avoid branching entirely.
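The effect can be measured with two kernels that do the same total amount of work but split it differently. In this sketch, `divergentBranch` sends odd and even lanes of every warp down different paths, while `warpUniformBranch` chooses the path per warp so all 32 lanes agree. All names and the loop length of 512 are illustrative. The timings depend on the GPU and compiler, and the compiler may predicate or merge short branches, so the gap is not guaranteed to be a factor of two.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two arbitrary, equally long computations standing in for "if" and "else" work.
__device__ float pathA(float x) {
    for (int k = 0; k < 512; k++) x = x * 1.0001f + 0.5f;
    return x;
}
__device__ float pathB(float x) {
    for (int k = 0; k < 512; k++) x = x * 0.9999f - 0.5f;
    return x;
}

// Divergent: even and odd lanes of the same warp take different branches.
__global__ void divergentBranch(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) out[i] = pathA((float)i);
    else                      out[i] = pathB((float)i);
}

// Warp-uniform: the branch depends on the warp index, so all lanes agree.
__global__ void warpUniformBranch(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / warpSize) % 2 == 0) out[i] = pathA((float)i);
    else                                   out[i] = pathB((float)i);
}

// Times one launch of the given kernel in milliseconds using CUDA events.
float timeKernelMs(void (*kernel)(float *), float *out, int blocks, int threads) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    kernel<<<blocks, threads>>>(out);      // warm-up launch, not timed
    cudaEventRecord(start);
    kernel<<<blocks, threads>>>(out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int threads = 256, blocks = 4096;
    float *out = nullptr;
    if (cudaMalloc(&out, (size_t)threads * blocks * sizeof(float)) != cudaSuccess) {
        printf("CUDA allocation failed\n");
        return 1;
    }
    printf("divergent:    %.3f ms\n", timeKernelMs(divergentBranch, out, blocks, threads));
    printf("warp-uniform: %.3f ms\n", timeKernelMs(warpUniformBranch, out, blocks, threads));
    cudaFree(out);
    return 0;
}
```

The two kernels assign the paths to different elements, so their outputs differ. Only the amount of work per kernel is the same, which is what makes the timing comparison meaningful.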

What is occupancy in CUDA?

Occupancy is the ratio of active warps to the maximum number of warps an SM supports. Higher occupancy lets the GPU hide memory latency by switching to another warp while one waits for a DRAM fetch. Occupancy is limited by registers per thread, shared memory per block, and block size; using fewer registers per thread, for example, allows more warps to be resident concurrently.
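The ratio above can be computed for a real kernel with the runtime's occupancy calculator, `cudaOccupancyMaxActiveBlocksPerMultiprocessor`. The sketch below reports theoretical occupancy for several block sizes. The `saxpy` kernel and the `occupancy` helper are illustrative, and the printed values depend on the GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A simple kernel to analyze: y = a * x + y.
__global__ void saxpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Occupancy = resident warps per SM / maximum warps per SM.
double occupancy(int activeBlocksPerSm, int blockSize, int maxThreadsPerSm, int warpSz) {
    int warpsPerBlock = (blockSize + warpSz - 1) / warpSz;
    int activeWarps = activeBlocksPerSm * warpsPerBlock;
    int maxWarps = maxThreadsPerSm / warpSz;
    return (double)activeWarps / maxWarps;
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("No CUDA device available\n");
        return 1;
    }
    const int blockSizes[] = {64, 128, 256, 512, 1024};
    for (int blockSize : blockSizes) {
        int activeBlocks = 0;
        // Resident blocks per SM for this kernel, given its register and
        // shared-memory usage and no dynamic shared memory.
        cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &activeBlocks, saxpy, blockSize, 0);
        if (err != cudaSuccess) {
            printf("CUDA error: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("block size %4d: %2d blocks/SM, occupancy %.2f\n",
               blockSize, activeBlocks,
               occupancy(activeBlocks, blockSize,
                         prop.maxThreadsPerMultiProcessor, prop.warpSize));
    }
    return 0;
}
```

This is theoretical occupancy only. Achieved occupancy at run time can be lower, and higher occupancy does not always mean higher performance once memory latency is already hidden.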