EcrioniX/ Digital Electronics/ What is a GPU?
Topic 38 · Digital Electronics

What is a GPU?
Graphics Processing Unit Explained

From rendering pixels to training AI — how the GPU became the most important chip in modern computing.

16,384
CUDA Cores — RTX 4090
80 GB
VRAM — H100 SXM
3.35 TB/s
Memory BW — H100
80B
Transistors — H100
CPU Core 1 Core 2 Core 3 Core 4 4–64 big cores PCIe GPU SM 128 cores HBM / GDDR VRAM — High Bandwidth Memory Thousands of small cores, massive parallelism

CPU: few powerful sequential cores  |  GPU: thousands of small parallel cores

What is a GPU? BASICS

A GPU (Graphics Processing Unit) is a specialized electronic circuit designed to rapidly perform thousands of mathematical operations simultaneously. Originally built to render graphics and video, modern GPUs have become the most important chip for AI, machine learning, scientific simulation, and any workload that can be parallelized.

The key insight: while a CPU is built for sequential speed (doing one complex thing very fast), a GPU is built for parallel throughput (doing thousands of simple things at the same time). This makes GPUs ideal for matrix multiplication — the core operation in neural networks.

Simple analogy: A CPU is like a formula one car — incredibly fast, one task at a time. A GPU is like a cargo ship — slower per unit, but moves 10,000 containers simultaneously. For AI training, you need the cargo ship.

GPU vs CPU — Key Differences COMPARISON

PropertyCPUGPU
Core count4 – 128 cores1,000 – 20,000+ cores
Core designLarge, complex (out-of-order, branch predict)Small, simple (in-order, scalar)
Clock speed4 – 6 GHz1.5 – 3 GHz
CacheLarge (up to 256 MB L3)Smaller (shared per SM)
MemorySystem RAM (DDR5, 100–200 GB/s)VRAM — HBM3 up to 3.35 TB/s
Best forOS, apps, sequential logic, low-latencyAI, graphics, video, simulations
ProgrammingC, C++, Python (any language)CUDA (NVIDIA), ROCm (AMD), OpenCL
Power (TDP)15 – 400W75 – 700W
Process nodeTSMC 3nm, Intel 20ATSMC 4N, TSMC 3nm

GPU Architecture Deep Dive ARCHITECTURE

A GPU is organized into a hierarchy of compute units. Understanding this hierarchy is essential for chip designers and software engineers working with GPUs.

CUDA Core (Shader Unit)

The smallest compute unit. Performs one FP32 or INT32 operation per clock. Thousands are packed into a GPU die.

Streaming Multiprocessor (SM)

Group of 128 CUDA cores + shared memory + warp schedulers + Tensor Cores. The fundamental scheduling unit of NVIDIA GPUs.

Warp

32 threads that execute the same instruction in lockstep (SIMT — Single Instruction, Multiple Threads). The GPU's execution granularity.

Tensor Core

Specialized units for matrix multiply-accumulate (MMA) operations. Critical for AI training. H100 has 528 Tensor Cores doing 3958 TFLOPS FP8.

RT Core (Ray Tracing)

Hardware-accelerated ray-box and ray-triangle intersection for photorealistic rendering. Present in RTX series from Turing onwards.

HBM / GDDR VRAM

High-Bandwidth Memory stacked on the package (HBM3) or on the board (GDDR6X). H100 uses HBM3 for 3.35 TB/s bandwidth vs 1 TB/s for GDDR6X.

NVIDIA GPU Families NVIDIA

ArchitectureSeriesProcessKey FeatureUse Case
HopperH100, H200TSMC 4NTransformer Engine, NVLink 4AI Data Center
BlackwellB100, B200, GB200TSMC 4NP5th gen Tensor Cores, FP4AI Data Center
Ada LovelaceRTX 4090–4060TSMC 4N3rd gen RT Core, DLSS 3Gaming / Prosumer
AmpereA100, RTX 30-seriesSamsung 8nmMulti-instance GPU (MIG)AI + Gaming
Jetson OrinAGX OrinTSMC 8nmAmpere GPU + Cortex-A78AEEdge AI / Automotive

AMD GPU Families AMD

ArchitectureSeriesProcessKey FeatureUse Case
RDNA 4RX 9000-seriesTSMC 3nmRay accelerators, FSR 4Gaming
CDNA 3Instinct MI300XTSMC 5nm192 GB HBM3, unified CPU+GPU dieAI / HPC
RDNA 3RX 7000-seriesTSMC 5nmChiplet designGaming

NVIDIA's moat: NVIDIA's dominance isn't just the hardware — it's CUDA. Launched in 2006, CUDA is the software ecosystem (libraries, compilers, frameworks like PyTorch/TensorFlow) that runs on NVIDIA GPUs. AMD has ROCm as an alternative, but CUDA's 15+ year head start in developer mindshare is the real barrier to entry.

GPU Specs Explained SPECS

SpecWhat it meansWhy it matters
CUDA cores / ShadersNumber of parallel compute unitsMore = higher throughput for parallel workloads
VRAMDedicated GPU memory (GB)Model size limit for AI; texture budget for games
Memory BandwidthGB/s of data fed to GPU coresCritical for memory-bound workloads like LLM inference
TDP (Watts)Thermal Design Power = power consumptionDetermines cooling, electricity cost, and data center density
TFLOPSTrillion floating-point ops/secondRaw compute throughput — higher = faster AI training
Die size (mm²)Physical area of the chipLarger die = more cores but lower yield = higher cost
Process nodeTransistor size (nm)Smaller = more transistors, better power efficiency

GPU in AI and Machine Learning AI/ML

The rise of deep learning from 2012 onwards was enabled entirely by GPUs. The math of neural networks — forward pass, backpropagation, gradient descent — reduces to matrix multiplications. GPUs can do these 100–1000x faster than CPUs.

H100
3,958 TFLOPS FP8
Industry AI Standard
MI300X
192 GB VRAM
Best for LLM Inference
RTX 4090
82.6 TFLOPS FP16
Best Consumer AI GPU
B200
4.5 PFLOPS FP4
Next-Gen Training

How GPU Chips Are Made CHIP DESIGN

From a semiconductor perspective, GPUs are among the most complex chips ever designed:

VLSI challenge: At 800mm², GPU dies are near the reticle limit of EUV lithography (~850mm²). Any defect in such a large die risks scrapping the entire chip — yield management is a major engineering challenge at NVIDIA and AMD.

Frequently Asked Questions FAQ

What is a GPU? +

A GPU (Graphics Processing Unit) is a processor with thousands of small cores designed for parallel computation. Originally for graphics, now essential for AI, ML, and scientific computing.

What is the difference between GPU and CPU? +

CPU: few powerful sequential cores (4–128), best for low-latency logic. GPU: thousands of small parallel cores (1000–20000+), best for matrix math and throughput-heavy workloads like AI training.

What does VRAM mean? +

VRAM is the GPU's dedicated high-speed memory. HBM3 in the H100 delivers 3.35 TB/s bandwidth — 20x faster than system RAM. For AI, VRAM size limits how large a model you can load onto the GPU.

What is CUDA? +

CUDA (Compute Unified Device Architecture) is NVIDIA's platform for GPU programming. It lets developers write C/C++ code that runs on GPU cores. PyTorch, TensorFlow, and most AI frameworks are built on CUDA.

Which GPU is best for AI? +

Data center: NVIDIA H100 (training) / H100/MI300X (inference). Consumer: RTX 4090 (24GB). Budget: RTX 4070 Ti. NVIDIA dominates due to the CUDA ecosystem advantage.

What process node are GPUs made on? +

NVIDIA RTX 40-series and H100: TSMC 4N (customized 4nm). AMD RDNA 4: TSMC 3nm. AMD MI300X: TSMC 5nm chiplets. Fabricated exclusively at TSMC or Samsung foundries.