Massively Parallel Computing

What is a GPU?

A Graphics Processing Unit (GPU) keeps thousands of threads in flight at once, so highly parallel work that takes seconds on a CPU can often finish in milliseconds on a GPU. Learn how streaming multiprocessors, warps, SIMT execution, and the memory hierarchy make GPUs the engine of AI and graphics.

SIMT · Warp · SM · CUDA Cores · Tensor Core · GDDR / HBM · Shared Memory · GPU vs CPU

GPU Architecture Overview

Figure: GPU Internal Architecture (Simplified). Graphics Processing Clusters (GPCs) each contain several Streaming Multiprocessors (SMs). Every SM in the figure has the same contents:

CUDA Cores ×128 · FP32 / INT32 ALUs
Tensor Cores ×4 · FP16 matrix MAC
Warp Schedulers ×4
Shared Mem / L1 · 96KB

All SMs share the L2 cache (50MB on H100), which sits in front of HBM3 / GDDR6X global memory (80GB, 3.35TB/s bandwidth on H100).

H100: 132 SMs × 128 CUDA cores = 16,896 CUDA cores total. Each SM runs up to 2048 threads concurrently (64 warps × 32 threads).

CPU vs GPU: Core Architecture

CPU — Few Powerful Cores

[Diagram: eight large cores, labeled Core 1 through Core 8]

8–64 cores, deep out-of-order (OOO) pipelines, large L3 cache, 3–5 GHz, optimized for low latency

GPU — Thousands of Simple Cores

16,896+ CUDA cores, simple pipeline, hide latency by thread switching, optimized for throughput

| Feature | CPU (e.g., i9-13900K) | GPU (e.g., H100) |
|---|---|---|
| Core count | 24 (P+E cores) | 16,896 CUDA cores |
| Clock speed | 5.8 GHz boost | 1.98 GHz |
| FP32 throughput | ~2 TFLOPS | 67 TFLOPS |
| Memory bandwidth | ~90 GB/s (DDR5) | 3.35 TB/s (HBM3) |
| Memory capacity | 192 GB (system RAM) | 80 GB (HBM) |
| Latency (single thread) | ~1 ns | ~100 ns |
| Power | 125W TDP | 700W TDP |
| Best for | OS, web, databases, games | AI training, graphics, HPC |
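
The H100 numbers in the table can be cross-checked with a little arithmetic. The sketch below is in Python rather than CUDA and assumes each CUDA core retires one fused multiply-add (counted as 2 FLOPs) per clock, which is the usual convention behind quoted peak figures. The function name `peak_fp32_tflops` is illustrative, not part of any library.

```python
def peak_fp32_tflops(sms: int, cores_per_sm: int, clock_ghz: float,
                     flops_per_core_per_clock: int = 2) -> float:
    """Theoretical peak FP32 throughput in TFLOPS.

    Assumption: one fused multiply-add (2 FLOPs) per core per clock.
    """
    total_cores = sms * cores_per_sm
    gflops = total_cores * flops_per_core_per_clock * clock_ghz
    return gflops / 1000.0  # GFLOPS -> TFLOPS


h100_cores = 132 * 128                        # SMs x CUDA cores per SM
h100_peak = peak_fp32_tflops(132, 128, 1.98)  # clock taken from the table
bandwidth_ratio = 3350 / 90                   # HBM3 GB/s vs DDR5 GB/s, from the table

print(h100_cores)              # 16896
print(round(h100_peak, 1))     # 66.9
print(round(bandwidth_ratio))  # 37
```

The result lands on the table's 67 TFLOPS, which suggests the quoted figure is simply cores × 2 × clock. Real kernels reach that only when every core issues an FMA on every cycle.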

Interactive: SIMT Parallel Execution

A GPU launches a grid of thread blocks. Each block runs on one SM. Within a block, threads are grouped into warps of 32, and all 32 threads in a warp execute the same instruction at the same time. The demo runs a 128-thread block as 4 warps of 32, one warp after another.

Demo: one thread block of 128 threads, split into 4 warps × 32 threads, running on a single SM. Each warp moves from idle to executing to done.
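
The grouping of threads into warps is plain integer arithmetic. Here is a minimal sketch in Python (not CUDA), assuming a one-dimensional block and a warp size of 32. The helper names `warp_of`, `lane_of`, and `warps_in_block` are hypothetical.

```python
WARP_SIZE = 32  # threads per warp, as described above


def warp_of(thread_idx: int) -> int:
    """Warp index of a thread within its block (1-D block assumed)."""
    return thread_idx // WARP_SIZE


def lane_of(thread_idx: int) -> int:
    """Position of the thread inside its warp (0..31)."""
    return thread_idx % WARP_SIZE


def warps_in_block(block_size: int) -> int:
    """Warps needed for a block; a partial last warp still takes a full warp slot."""
    return (block_size + WARP_SIZE - 1) // WARP_SIZE


block_size = 128
warps = {}
for tid in range(block_size):
    warps.setdefault(warp_of(tid), []).append(tid)

for w, tids in warps.items():
    print(f"warp {w}: threads {tids[0]}..{tids[-1]}")
```

A block of 100 threads would still occupy 4 warps, with 28 lanes of the last warp unused. This is one reason block sizes are usually chosen as multiples of 32.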

GPU Memory Hierarchy

GPU memory is deeply hierarchical. The fastest memory is closest to the cores. Choosing the right level is critical for performance — a cache miss to global DRAM can cost 600+ cycles.

Registers: 256KB per SM · <1 cycle
Shared Memory / L1: 96KB per SM in this simplified model (H100 allows up to 228KB of shared memory per SM) · ~20 cycles · programmer-managed
L2 Cache: 50MB (H100) · ~100 cycles · shared across all SMs
HBM3 / GDDR6X (Global Memory): 80GB · ~600 cycles · 3.35 TB/s bandwidth
CPU System RAM (via PCIe / NVLink): hundreds of GB · ~10,000 cycles · roughly 64 GB/s per direction over PCIe 5.0 x16, up to 900 GB/s over NVLink-C2C (Grace Hopper)

| Memory Type | Scope | Latency | Key Use |
|---|---|---|---|
| Registers | Per-thread | <1 cycle | Local variables, loop counters |
| Shared Memory | Per-block (SM) | ~20 cycles | Thread cooperation, tile caching for matmul |
| L2 Cache | All SMs | ~100 cycles | Automatic caching of reused global data |
| Global (HBM) | Whole GPU | ~600 cycles | Large arrays, model weights, activations |
| Constant Cache | Read-only | ~4 cycles | Kernel parameters, weights that don't change |
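
One way to see why the hierarchy matters is to compare how many FLOPs a kernel performs per byte it moves from global memory against the ratio the hardware can sustain. The sketch below (Python, not CUDA) uses the H100 figures quoted in this article, 67 TFLOPS FP32 and 3.35 TB/s. It is a simplified roofline-style model that ignores caches, and the helper names are illustrative.

```python
PEAK_FLOPS = 67e12        # FP32 peak in FLOP/s (H100 figure quoted in this article)
PEAK_BANDWIDTH = 3.35e12  # HBM3 bandwidth in bytes/s (H100 figure quoted in this article)


def machine_balance() -> float:
    """FLOPs the GPU can execute in the time it takes to move one byte from HBM."""
    return PEAK_FLOPS / PEAK_BANDWIDTH


def attainable_flops(flops_per_byte: float) -> float:
    """Roofline-style bound: bandwidth-limited until intensity reaches the machine balance."""
    return min(PEAK_FLOPS, flops_per_byte * PEAK_BANDWIDTH)


# Vector add on floats: 1 FLOP per element, 12 bytes moved
# (two 4-byte loads and one 4-byte store).
vec_add_intensity = 1 / 12

print(round(machine_balance(), 1))                           # 20.0
print(round(attainable_flops(vec_add_intensity) / 1e12, 2))  # 0.28
```

Under these assumptions a kernel needs about 20 FLOPs per byte of global traffic to become compute-bound. A vector add is limited to well under 1% of peak FP32, which is why reusing data in registers and shared memory pays off.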

GPU Evolution — From Graphics to AI

| Generation | Year | Key Addition | Peak AI throughput (precision) |
|---|---|---|---|
| NVIDIA Pascal (GP100) | 2016 | NVLink, FP16 support | 21 TFLOPS (FP16) |
| NVIDIA Volta (GV100) | 2017 | Tensor Cores (V1) | 112 TFLOPS (FP16 Tensor) |
| NVIDIA Ampere (GA100) | 2020 | BF16, TF32, A100 | 312 TFLOPS (FP16 Tensor) |
| NVIDIA Hopper (GH100) | 2022 | FP8, Transformer Engine, H100 | 3,958 TFLOPS (FP8, sparse) |
| NVIDIA Blackwell (GB200) | 2024 | FP4, NVLink 5, 2× H100 perf | ~8,000 TFLOPS (FP4) |

// CUDA kernel: GPU vector add (each thread handles one element)
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread index
    if (i < n) c[i] = a[i] + b[i];                  // bounds check: the last block may be partly idle
}

// Launch: N = 1,000,000 elements, 256 threads/block → ceil(N/256) = 3907 blocks
// (d_a, d_b, d_c are device pointers previously allocated with cudaMalloc)
vecAdd<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c, N);
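
The block count in the launch line is a ceiling division. Here is a quick check in Python, taking N to be exactly 1,000,000 (an assumption; the snippet above never defines N). The helper name `grid_size` is illustrative.

```python
def grid_size(n: int, threads_per_block: int) -> int:
    """Blocks needed to cover n elements: integer ceiling division, like (N + 255) / 256."""
    return (n + threads_per_block - 1) // threads_per_block


n = 1_000_000
threads_per_block = 256
blocks = grid_size(n, threads_per_block)
launched = blocks * threads_per_block
idle = launched - n  # threads that fail the `i < n` guard and do nothing

print(blocks, launched, idle)  # 3907 1000192 192
```

The 192 surplus threads in the final block are the reason the kernel needs its `if (i < n)` guard.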

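// CUDA shared-memory tiled matrix multiply (simplified): C = A × B for square, row-major N×N matrices
// Launch with dim3 block(TILE, TILE) and dim3 grid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE).
#define TILE 16
__global__ void matMul(const float *A, const float *B, float *C, int N) {
    __shared__ float tileA[TILE][TILE], tileB[TILE][TILE];  // shared mem: one tile of A and one of B per block
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < (N + TILE - 1) / TILE; t++) {       // round up so N need not be a multiple of TILE
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // cooperative load: each thread fetches one element per tile, zero-padding past the matrix edge
        tileA[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        tileB[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();                                    // wait for all 256 threads in the block to finish loading
        for (int k = 0; k < TILE; k++) sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();                                    // keep the tiles intact until every thread has used them
    }
    if (row < N && col < N) C[row * N + col] = sum;         // threads outside the matrix write nothing
}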
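
The payoff of tiling can be estimated by counting global-memory loads. The Python sketch below assumes N is a multiple of the tile size and ignores hardware caches; both are simplifications, and the function names are illustrative.

```python
def naive_global_loads(n: int) -> int:
    """Untiled kernel: each of the n*n outputs reads a full row of A and a full column of B."""
    return n * n * 2 * n


def tiled_global_loads(n: int, tile: int) -> int:
    """Tiled kernel: each block loads one tile of A and one of B per step (n % tile == 0 assumed)."""
    blocks = (n // tile) ** 2
    steps = n // tile
    return blocks * steps * 2 * tile * tile


n, tile = 1024, 16
naive = naive_global_loads(n)
tiled = tiled_global_loads(n, tile)

print(naive // tiled)  # 16
```

Under these assumptions global traffic falls by a factor equal to the tile width, because each value fetched into shared memory is reused by 16 threads instead of being re-read from DRAM.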

What is GPU Used For?

🧠

AI & Deep Learning

Training and inference of LLMs (ChatGPT, Gemini). Matrix multiplication is the heart of transformers, and H100 clusters train GPT-4-scale models.

🎮

Gaming & Graphics

Real-time ray tracing, rasterization, and shading of millions of pixels at 60–240 FPS. RT Cores accelerate ray-triangle intersection tests in hardware.

🔬

Scientific HPC

Molecular dynamics, CFD, climate modeling, protein folding (AlphaFold inference is commonly run on GPUs such as the A100). Simulations that take weeks on a CPU can run in hours.

🎬

Video Encode/Decode

Dedicated NVENC and NVDEC hardware engines encode and decode 4K H.265 streams without using CUDA cores. Used in streaming platforms and video conferencing.

💰

Cryptocurrency

SHA-256, Ethash, and other proof-of-work (PoW) hashing algorithms exploit GPU parallelism for mining (though ASICs now dominate Bitcoin, and Ethereum retired Ethash mining when it moved to proof of stake in 2022).

🔐

Security & Cracking

Password hash brute-force, SSL offload, cryptography acceleration. GPUs compute billions of SHA-1 hashes per second.

Frequently Asked Questions

Do you need a GPU to run AI models?

No. Small models run fine on a CPU. But for training large models or serving them quickly, a GPU is essentially required: a transformer that trains in 1 hour on an A100 can take on the order of 100 hours on a high-end CPU. For production inference of GPT-class models, you need multiple GPUs just to hold the weights in memory.
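
The point about holding the weights in memory is easy to quantify. The Python sketch below assumes 80 GB of memory per GPU (the H100 figure used in this article, taken as 80 × 10^9 bytes) and counts the weights alone, ignoring activations, KV cache, and framework overhead. The parameter counts are hypothetical round numbers, not the sizes of any specific model.

```python
import math


def weight_bytes(params: float, bytes_per_param: int) -> float:
    """Memory taken by the weights alone."""
    return params * bytes_per_param


def gpus_to_hold(params: float, bytes_per_param: int, gpu_memory_gb: float = 80.0) -> int:
    """Minimum GPU count whose combined memory fits the weights (nothing else)."""
    return math.ceil(weight_bytes(params, bytes_per_param) / (gpu_memory_gb * 1e9))


# FP16/BF16 weights use 2 bytes per parameter.
print(gpus_to_hold(7e9, 2))    # 1
print(gpus_to_hold(70e9, 2))   # 2
print(gpus_to_hold(175e9, 2))  # 5
```

In practice the count is higher, since serving also needs room for activations and per-request state.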

What is the difference between a GPU and an NPU?

An NPU (Neural Processing Unit) is an ASIC specifically designed for neural network inference — fixed data flow, INT8/INT4 operations, very low power. A GPU is a general-purpose parallel processor that can do AI but also graphics, compute, and more. NPUs (Apple Neural Engine, Qualcomm Hexagon) are more efficient for on-device inference; GPUs are flexible and reprogrammable.

What is warp divergence and why does it hurt performance?

When threads in the same warp take different branches (e.g., an if/else keyed on thread ID), the GPU serializes the two paths: it runs one path with the non-participating threads masked off, then the other. With an even split and paths of similar length, throughput for that region is roughly halved. The fix: restructure code so all 32 threads in a warp take the same branch, or use predication to avoid branching entirely.
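
Whether a branch diverges depends on how its condition lines up with warp boundaries. The following is a toy model in Python, not a hardware simulator: it counts how many distinct branch outcomes occur inside each 32-thread warp. The helper name `paths_per_warp` is hypothetical.

```python
WARP_SIZE = 32


def paths_per_warp(branch_taken, num_threads: int) -> list:
    """For each warp, count the distinct branch outcomes among its threads.

    1 means the warp stays converged; 2 means both sides of an if/else
    run one after the other, each with part of the warp masked off.
    """
    counts = []
    for start in range(0, num_threads, WARP_SIZE):
        end = min(start + WARP_SIZE, num_threads)
        outcomes = {bool(branch_taken(tid)) for tid in range(start, end)}
        counts.append(len(outcomes))
    return counts


# Divergent: even and odd threads of the same warp disagree.
divergent = paths_per_warp(lambda tid: tid % 2 == 0, 128)
# Converged: the condition is uniform across each warp.
converged = paths_per_warp(lambda tid: (tid // WARP_SIZE) % 2 == 0, 128)

print(divergent)  # [2, 2, 2, 2]
print(converged)  # [1, 1, 1, 1]
```

Both conditions send half of the 128 threads down each path, but only the first makes individual warps execute both paths.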

What is occupancy in CUDA?

Occupancy is the ratio of active warps to the maximum supported warps on an SM. Higher occupancy lets the GPU hide memory latency by switching to another warp while one waits for a DRAM fetch. Occupancy is limited by register usage per thread and shared memory per block — using fewer registers per thread allows more warps to run concurrently.
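
A rough occupancy estimate follows directly from these limits. The Python sketch below assumes the per-SM figures used in this article: 64 resident warps, a 256KB register file (65,536 32-bit registers), and 96KB of shared memory. It ignores allocation granularity and the per-SM block limit, so vendor tools may report lower values. The function name `occupancy` is illustrative.

```python
WARP_SIZE = 32
MAX_WARPS_PER_SM = 64          # 2048 resident threads / 32
REGISTERS_PER_SM = 65536       # 256 KB register file / 4 bytes per register
SHARED_MEM_PER_SM = 96 * 1024  # bytes; the 96 KB figure used in this article


def occupancy(threads_per_block: int, regs_per_thread: int, smem_per_block: int) -> float:
    """Simplified estimate: resident warps divided by the maximum warps per SM."""
    warps_per_block = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE
    by_warps = MAX_WARPS_PER_SM // warps_per_block
    by_regs = REGISTERS_PER_SM // (regs_per_thread * threads_per_block)
    # A block that uses no shared memory is not limited by it.
    by_smem = SHARED_MEM_PER_SM // smem_per_block if smem_per_block else by_warps
    blocks = min(by_warps, by_regs, by_smem)
    return blocks * warps_per_block / MAX_WARPS_PER_SM


print(occupancy(256, 32, 0))          # 1.0
print(occupancy(256, 64, 0))          # 0.5
print(occupancy(256, 32, 48 * 1024))  # 0.25
```

Doubling register use from 32 to 64 per thread halves the number of resident blocks in this model, and a 48KB shared-memory tile per block caps the SM at two blocks.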