PCIe — PCI Express
The universal high-speed interconnect inside every modern PC, server, and SoC — powering GPUs, NVMe SSDs, network cards, and AI accelerators with scalable, full-duplex serial lanes.
1. What is PCIe?
PCIe (PCI Express) is a high-speed serial interconnect standard that connects components inside computers — GPUs, NVMe SSDs, network interface cards, sound cards, and more. It was introduced in 2003 (developed primarily by Intel and standardized by the PCI-SIG) to replace the older parallel PCI and AGP buses.
Unlike the original PCI which was a shared parallel bus (all devices competed for the same wires), PCIe is point-to-point — each device gets its own dedicated serial link with no contention. This allows each link to operate at full bandwidth regardless of how many other devices are present.
PCI = shared parallel bus → devices take turns; the total bus bandwidth is split across every device on the bus.
PCIe = point-to-point serial → each device has a dedicated full-duplex link. Adding more devices adds more links, not more contention.
| Feature | Old PCI | PCIe |
|---|---|---|
| Topology | Shared parallel bus | Point-to-point |
| Signal type | Parallel (32-bit or 64-bit) | Serial differential pairs |
| Direction | Half-duplex (shared) | Full-duplex (separate TX/RX) |
| Max bandwidth | 533 MB/s (64-bit, 66 MHz) | ~128 GB/s per direction (x16 Gen 6) |
| Hot-plug | Limited | Native support |
| Power management | Basic | ASPM (L0/L0s/L1/L2) |
2. PCIe Topology
PCIe uses a hierarchical tree topology. Every device in the system is reachable from the CPU through a tree of Root Complex → Switches → Endpoints.
Fig 1 — PCIe system topology: Root Complex → optional Switch → Endpoints
Key topology rules:
- Root Complex (RC) — the host controller, integrated in the CPU or chipset. Owns the top of the tree.
- Switch — a transparent bridge that fans one upstream port into multiple downstream ports. Adds more PCIe slots/ports without changing software visibility.
- Endpoint — a leaf device (GPU, SSD, NIC). Has no downstream ports. Initiates and responds to TLPs.
- Requester ID — every device is addressed by Bus:Device:Function (BDF), assigned during PCI enumeration at boot.
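The BDF triple is usually handled as a single packed 16-bit value, the same format PCIe uses for the Requester ID field in TLP headers (bus in bits [15:8], device in [7:3], function in [2:0]). A minimal sketch in C; the helper name is made up for illustration:

```c
#include <stdint.h>
#include <stdio.h>

/* Pack Bus:Device:Function into the 16-bit form also used as a TLP
 * Requester ID.  Layout: bus [15:8], device [7:3], function [2:0].   */
static uint16_t bdf_pack(uint8_t bus, uint8_t dev, uint8_t fn)
{
    return (uint16_t)((bus << 8) | ((dev & 0x1f) << 3) | (fn & 0x7));
}

int main(void)
{
    /* e.g. the device lspci lists as 03:00.0 */
    printf("BDF 03:00.0 -> 0x%04x\n", bdf_pack(0x03, 0x00, 0x0));
    return 0;
}
```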
3. Lanes — x1, x4, x8, x16
A PCIe lane is one full-duplex serial link. It contains two differential pairs: one for transmit (TX+ / TX−) and one for receive (RX+ / RX−) — 4 wires total per lane. TX and RX operate simultaneously → full-duplex.
Devices use multiple lanes in parallel to multiply bandwidth. The link width is denoted x1, x4, x8, x16 (read "by one", "by four", etc.).
Orange = TX lanes · Blue = RX lanes. Each lane = 4 wires (TX+, TX−, RX+, RX−).
A x4 device fits in a x8 or x16 slot (mechanically larger slots accept smaller devices). The link negotiates down to the device's capability during link training. This is called link width negotiation.
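Conceptually the negotiated width is the widest link both partners support, rounded down to a standard width. The sketch below is a toy model of that rule only; the real link training state machine (LTSSM) also copes with lane reversal, lanes that fail receiver detect, and later retraining:

```c
#include <stdint.h>
#include <stdio.h>

/* Toy model of link width negotiation: both partners advertise a maximum
 * width and the link trains to the widest value both support, rounded
 * down to a standard width.                                              */
static uint8_t negotiate_width(uint8_t slot_max, uint8_t device_max)
{
    static const uint8_t widths[] = { 16, 8, 4, 2, 1 };
    uint8_t max = slot_max < device_max ? slot_max : device_max;

    for (unsigned i = 0; i < sizeof widths; i++)
        if (widths[i] <= max)
            return widths[i];
    return 0;   /* no common width: link training fails */
}

int main(void)
{
    printf("x4 card in x16 slot trains to x%u\n", negotiate_width(16, 4));
    return 0;
}
```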
4. PCIe Generations — Speed Table
Each PCIe generation doubles the per-lane data rate. The raw bit rate (GT/s) is higher than the effective data rate because of line encoding overhead.
| Generation | Year | Rate (per lane) | Encoding | Effective / lane | x16 total |
|---|---|---|---|---|---|
| Gen 1 | 2003 | 2.5 GT/s | 8b/10b (20% overhead) | 250 MB/s | 4 GB/s |
| Gen 2 | 2007 | 5.0 GT/s | 8b/10b (20% overhead) | 500 MB/s | 8 GB/s |
| Gen 3 | 2010 | 8.0 GT/s | 128b/130b (~1.5% overhead) | ~985 MB/s | ~16 GB/s |
| Gen 4 | 2017 | 16.0 GT/s | 128b/130b | ~1.97 GB/s | ~32 GB/s |
| Gen 5 | 2019 | 32.0 GT/s | 128b/130b | ~3.94 GB/s | ~64 GB/s |
| Gen 6 | 2022 | 64.0 GT/s | FLIT / PAM4 | ~7.88 GB/s | ~128 GB/s |
8b/10b (Gen 1–2): every 8 bits of data is sent as 10 bits on the wire — 20% overhead. Gives DC balance and clock recovery but wastes bandwidth.
128b/130b (Gen 3–5): every 128 data bits uses only a 2-bit sync header — just 1.5% overhead. Much more efficient.
FLIT + PAM4 (Gen 6): uses 4-level pulse amplitude modulation (PAM4) to encode 2 bits per symbol, doubling bandwidth without doubling frequency. Fixed-size 256-byte FLITs replace variable TLPs at the physical layer framing level.
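The per-lane and per-link numbers in the table fall out of a one-line calculation: effective bytes per second = raw GT/s × encoding efficiency ÷ 8, times the lane count. A small sketch that reproduces a few rows (TLP/DLLP protocol overhead is ignored, and Gen 6 FLIT/FEC overhead would trim its result slightly):

```c
#include <stdio.h>

/* Effective link bandwidth in GB/s: raw GT/s x encoding efficiency / 8 bits
 * per byte, times the lane count.                                           */
static double link_gbps(double gt_per_s, double enc_eff, int lanes)
{
    return gt_per_s * enc_eff / 8.0 * lanes;
}

int main(void)
{
    printf("Gen1 x1 : %.2f GB/s\n", link_gbps(2.5,  8.0 / 10.0,    1));  /* 0.25  */
    printf("Gen3 x16: %.1f GB/s\n", link_gbps(8.0,  128.0 / 130.0, 16)); /* ~15.8 */
    printf("Gen4 x4 : %.1f GB/s\n", link_gbps(16.0, 128.0 / 130.0, 4));  /* ~7.9  */
    return 0;
}
```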
5. PCIe 3-Layer Architecture
PCIe is organized into three layers, similar to a network stack. Each layer has a distinct role — upper layers don't care how lower layers work.
Fig 2 — PCIe 3-layer architecture: TLPs flow down the TX stack, travel over serial lanes, and up the RX stack
Transaction Layer
The topmost layer. Software interacts with PCIe here through memory-mapped I/O (MMIO) and DMA. It generates and terminates TLPs (Transaction Layer Packets) — the fundamental unit of data exchange. It also manages flow control credits so senders never overwhelm receivers.
Data Link Layer
Provides reliable delivery. It wraps each TLP with a sequence number and LCRC (Link CRC), sends it downstream, and waits for an ACK DLLP from the receiver. If a NAK arrives or a timer expires, it retransmits from the replay buffer. It also generates Flow Control DLLPs to replenish the sender's credit counters.
Physical Layer
The lowest layer — purely electrical and serialization concerns. It takes bits from the Data Link Layer, scrambles them for EMI reduction, applies 8b/10b or 128b/130b encoding, and drives differential signals across the lanes. It also handles link training (negotiating link width and speed at boot), lane reversal, and polarity inversion.
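To make the Data Link Layer's ACK/NAK replay concrete, here is a toy model of a replay buffer: transmitted TLPs are held (tracked here only by sequence number) until the link partner acknowledges them, and a NAK or replay timeout resends whatever is still outstanding. Names, sizes, and the wraparound handling are simplified for illustration:

```c
#include <stdint.h>

#define REPLAY_SLOTS 64   /* illustrative size, not from the spec */

/* Toy model of the Data Link Layer replay buffer. */
struct replay_buffer {
    uint16_t seq[REPLAY_SLOTS];   /* 12-bit sequence numbers           */
    uint16_t head, tail;          /* oldest un-ACKed / next free slot  */
};

void dll_transmit(struct replay_buffer *rb, uint16_t seq)
{
    rb->seq[rb->tail % REPLAY_SLOTS] = seq & 0x0fff;
    rb->tail++;                   /* the TLP also goes out on the wire */
}

/* ACK DLLP: everything up to and including acked_seq was received and can
 * be freed (12-bit sequence wraparound is ignored for simplicity).        */
void dll_ack(struct replay_buffer *rb, uint16_t acked_seq)
{
    while (rb->head != rb->tail &&
           rb->seq[rb->head % REPLAY_SLOTS] <= acked_seq)
        rb->head++;
}

/* NAK DLLP or replay timer expiry: resend every TLP still in the buffer. */
void dll_replay(const struct replay_buffer *rb)
{
    for (uint16_t i = rb->head; i != rb->tail; i++) {
        uint16_t seq = rb->seq[i % REPLAY_SLOTS];
        (void)seq;   /* a real DLL would retransmit the stored TLP here */
    }
}
```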
6. TLP — Transaction Layer Packet
Every read, write, interrupt, or configuration access in PCIe travels as a TLP. A TLP consists of a header (3 or 4 DWORDs), an optional data payload, and an optional ECRC.
Fig 3 — TLP structure (Transaction Layer) wrapped by DLL Sequence Number + LCRC
Header Fields (DW0 common to all TLPs; DW1 shown for request TLPs)
| Bits | Field | Description |
|---|---|---|
| [31:29] | Fmt | Format: 3DW no data / 4DW no data / 3DW with data / 4DW with data |
| [28:24] | Type | TLP type: MRd, MWr, CfgRd0, CfgWr0, Cpl, CplD, Msg… |
| [22:20] | TC | Traffic Class (0–7) — QoS priority |
| [18], [13:12] | Attr | Attributes: ID-Based Ordering, Relaxed Ordering, No-Snoop |
| [9:0] | Length | Payload length in DWORDs (0 = 1024 DW = 4096 bytes) |
| DW1 [31:16] | Requester ID | Bus:Device:Function of the requester (16 bits) |
| DW1 [15:8] | Tag | Outstanding request tag (8-bit = 256 outstanding reads) |
| DW1 [7:0] | BE | Last/First DW Byte Enables ([7:4] last, [3:0] first) — which bytes in the last/first DWORD are valid |
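As a worked illustration of this layout, the sketch below packs DW0 and DW1 for a 3DW Memory Write (Fmt = 010b, Type = 0 0000b). Field positions follow the table; the struct and function names are invented, and the TD/EP/AT bits plus the address DWORD (DW2) are omitted to keep it short:

```c
#include <stdint.h>

struct tlp_hdr { uint32_t dw0, dw1; };

/* Pack DW0/DW1 for a 3DW-with-data Memory Write (32-bit address). */
struct tlp_hdr mwr_header(uint16_t requester_id, uint8_t tag,
                          uint16_t length_dw,
                          uint8_t first_be, uint8_t last_be)
{
    struct tlp_hdr h;

    h.dw0 = (0x2u << 29)                  /* Fmt  = 010b: 3DW header + data */
          | (0x00u << 24)                 /* Type = 0 0000b: Memory Request */
          | (0u << 20)                    /* TC   = 0 (default class)       */
          | (length_dw & 0x3ffu);         /* Length in DW; 0 encodes 1024   */

    h.dw1 = ((uint32_t)requester_id << 16)       /* Bus:Device:Function         */
          | ((uint32_t)tag << 8)                 /* tag (unused for posted MWr) */
          | ((uint32_t)(last_be & 0xfu) << 4)    /* Last DW BE (0 if 1 DW)      */
          | (first_be & 0xfu);                   /* First DW BE                 */
    return h;
}
```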
TLP Types
| TLP Type | Abbrev | Direction | Has Data? | Use |
|---|---|---|---|---|
| Memory Read Request | MRd | Requester → Completer | No | CPU reads GPU VRAM, DMA reads host memory |
| Memory Write Request | MWr | Requester → Completer | Yes | CPU writes to device register / BAR, posted write (no completion) |
| Completion with Data | CplD | Completer → Requester | Yes | Response to MRd — carries the read data back |
| Completion (no data) | Cpl | Completer → Requester | No | Response to I/O or Config Write confirming completion |
| Config Read Type 0 | CfgRd0 | RC → Device | No | Read config space register of a device on same bus |
| Config Write Type 0 | CfgWr0 | RC → Device | Yes | Write config space (BARs, command register, etc.) |
| Message | Msg | Either direction | Optional | MSI interrupt, vendor-defined, power management events |
7. Flow Control
PCIe uses credit-based flow control at the Transaction Layer to prevent buffer overflow. Before sending a TLP, the sender must verify it has enough credits from the receiver. Unlike a simple on/off flow control, this allows the link to stay busy right up to the receiver's capacity.
There are two types of credits per traffic class:
- Header Credits (HdrFC) — one credit per TLP (covers the TLP header)
- Data Credits (DataFC) — one credit per 16 bytes (4 DWORDs) of payload
| Step | What happens |
|---|---|
| 1. Init | At link-up, both sides advertise their initial FC credits via InitFC1 / InitFC2 DLLPs. |
| 2. Check | Before sending a TLP, the sender checks: HdrFC ≥ 1 AND DataFC ≥ ⌈payload_dwords / 4⌉ (data credits are 4-DW units). |
| 3. Consume | Sender decrements its local credit counter by the amount used. |
| 4. Process | Receiver processes the TLP and frees buffer space. |
| 5. Replenish | Receiver sends an UpdateFC DLLP to return consumed credits to the sender. |
Posted requests (MWr, Msg) require no completion TLP, so the sender is never blocked waiting for a response; it only consumes posted credits. Completions work the other way around: a requester advertises infinite completion credits (encoded as a credit value of 0 during FC initialization), because it must always be able to absorb the completions for the non-posted requests it has already issued.
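A minimal sketch of the transmit-side credit gate from the table above, for one TLP category in one traffic class. The counter scheme is simplified; real hardware tracks CREDITS_CONSUMED against the advertised CREDIT_LIMIT using modular arithmetic:

```c
#include <stdbool.h>
#include <stdint.h>

/* Transmit-side credits for one TLP category (e.g. posted).  Real hardware
 * keeps separate pools for posted, non-posted and completion TLPs.         */
struct fc_credits {
    uint32_t hdr_avail;    /* header credits still available            */
    uint32_t data_avail;   /* data credits (1 credit = 4 DW = 16 bytes) */
};

static uint32_t data_credits_needed(uint32_t payload_dwords)
{
    return (payload_dwords + 3) / 4;   /* round up to 4-DW credit units */
}

/* Steps 2-3: check credits and, if enough, consume them and send the TLP. */
bool fc_try_send(struct fc_credits *fc, uint32_t payload_dwords)
{
    uint32_t need = data_credits_needed(payload_dwords);

    if (fc->hdr_avail < 1 || fc->data_avail < need)
        return false;                  /* stall until an UpdateFC arrives */
    fc->hdr_avail  -= 1;
    fc->data_avail -= need;
    return true;
}

/* Step 5: an UpdateFC DLLP from the receiver returns freed credits. */
void fc_update(struct fc_credits *fc, uint32_t hdr, uint32_t data)
{
    fc->hdr_avail  += hdr;
    fc->data_avail += data;
}
```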
8. Configuration Space & BARs
Every PCIe device has a Configuration Space — a standardised register space that the OS uses to discover, configure, and control the device.
- Traditional PCI: 256 bytes per function
- PCIe extended: 4096 bytes per function (adds Extended Capability structures)
Base Address Registers (BARs)
BARs tell the OS where in the system's address space the device's registers and memory live. The BIOS/OS writes a base address into each BAR; from then on, reading/writing that memory address issues a PCIe MMIO TLP to the device.
A GPU typically exposes a small control-register BAR (e.g. BAR0) and a much larger BAR that maps VRAM into the CPU address space (classically a 256 MB aperture, or all of VRAM with Resizable BAR). When the CPU writes a value to an address in BAR0's range, it generates a MWr TLP that travels down the PCIe link to the GPU.
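On Linux, the standard 64-byte config header of any device is readable from user space at /sys/bus/pci/devices/&lt;BDF&gt;/config, which makes the ID registers and BARs easy to inspect. A hedged sketch (the BDF in the path is just an example; substitute one reported by lspci):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Example BDF; substitute a device address reported by lspci. */
    const char *path = "/sys/bus/pci/devices/0000:03:00.0/config";
    uint8_t cfg[64];   /* the standard 64-byte config header */
    FILE *f = fopen(path, "rb");

    if (!f) { perror(path); return 1; }
    if (fread(cfg, 1, sizeof cfg, f) != sizeof cfg) {
        perror("read"); fclose(f); return 1;
    }
    fclose(f);

    uint16_t vendor = cfg[0x00] | (cfg[0x01] << 8);   /* Vendor ID @ 0x00 */
    uint16_t device = cfg[0x02] | (cfg[0x03] << 8);   /* Device ID @ 0x02 */
    uint32_t bar0   = cfg[0x10] | (cfg[0x11] << 8)    /* BAR0      @ 0x10 */
                    | ((uint32_t)cfg[0x12] << 16) | ((uint32_t)cfg[0x13] << 24);

    printf("vendor=%04x device=%04x BAR0=%08x (%s space)\n",
           vendor, device, bar0, (bar0 & 1) ? "I/O" : "memory");
    return 0;
}
```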
Interrupts — MSI and MSI-X
PCIe devices signal interrupts using MSI (Message-Signaled Interrupts) — they write a small TLP (a Memory Write) to a CPU-programmed address. This eliminates the legacy INTx shared interrupt pin and enables up to 2048 independent vectors per function with MSI-X.
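Each entry in the device's BAR-mapped MSI-X table is 16 bytes: Message Address low/high, Message Data, and a Vector Control word whose bit 0 masks the vector. A sketch of that layout (field names are illustrative):

```c
#include <stdint.h>

/* One 16-byte MSI-X table entry.  The OS programs the address/data; the
 * device raises the interrupt by issuing a MWr TLP of msg_data to msg_addr. */
struct msix_entry {
    uint32_t msg_addr_lo;   /* Message Address [31:0]        */
    uint32_t msg_addr_hi;   /* Message Address [63:32]       */
    uint32_t msg_data;      /* payload of the interrupt MWr  */
    uint32_t vector_ctrl;   /* bit 0 = 1 masks this vector   */
};
```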
9. Power Management — ASPM
ASPM (Active State Power Management) defines low-power link states that both sides can enter when the link is idle:
| State | Name | Power | Exit Latency | Description |
|---|---|---|---|---|
| L0 | Active | Full | 0 | Normal fully operational state |
| L0s | Standby | Low | < 1 μs | TX enters low power after a short idle period; fast resume. Each direction of the link can enter L0s independently. |
| L1 | Hibernate | Very low | ~10 μs | Both TX and RX sleep. Requires both sides to negotiate entry. Deeper power savings. |
| L2/L3 | Off | Near zero | ms range | Main power removed. Device needs full re-enumeration to resume. |
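Which of these states the OS allows is governed by the ASPM Control field, bits [1:0] of the Link Control register in the device's PCIe Capability structure. A small decode sketch (the example register value is made up):

```c
#include <stdint.h>
#include <stdio.h>

/* Decode the ASPM Control field: Link Control register bits [1:0].
 * 00b = disabled, 01b = L0s enabled, 10b = L1 enabled, 11b = both.  */
static const char *aspm_policy(uint16_t link_control)
{
    switch (link_control & 0x3) {
    case 0:  return "ASPM disabled";
    case 1:  return "L0s entry enabled";
    case 2:  return "L1 entry enabled";
    default: return "L0s and L1 entry enabled";
    }
}

int main(void)
{
    uint16_t link_control = 0x0042;   /* made-up example register value */
    printf("%s\n", aspm_policy(link_control));
    return 0;
}
```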
10. FAQ
What is PCIe and what does it replace?
PCIe (PCI Express) is a high-speed point-to-point serial interconnect introduced in 2003. It replaced three older parallel bus standards: PCI (general expansion cards), AGP (graphics), and PCI-X (servers). All modern motherboards use PCIe for GPUs, NVMe SSDs, NICs, and Wi-Fi cards.
What is a PCIe lane?
One PCIe lane = two differential pairs: one for TX (TX+/TX−) and one for RX (RX+/RX−) = 4 wires. TX and RX operate simultaneously making it full-duplex. Bandwidth scales linearly: x4 = 4× the bandwidth of x1, x16 = 16×.
What is the Root Complex?
The Root Complex (RC) is the PCIe host controller — typically integrated inside the CPU die in modern Intel/AMD platforms, or in a separate chipset. It is the root of the PCIe tree hierarchy. It creates the memory and I/O address windows that map device BARs into the CPU's address space, and it generates the Configuration requests that enumerate all downstream devices at boot.
What is a TLP and a DLLP?
TLP (Transaction Layer Packet): Generated by the Transaction Layer. Carries application data — reads, writes, completions, config accesses. Has a 3 or 4 DW header + optional payload + optional ECRC.
DLLP (Data Link Layer Packet): Generated by the Data Link Layer. Carries ACK/NAK acknowledgements and Flow Control credit updates. DLLPs are never seen by the Transaction Layer — they are generated and consumed by the DLL.
Why does Gen 3 seem to have almost no encoding overhead vs Gen 1/2?
Gen 1 and Gen 2 use 8b/10b encoding — every 8 bits of data costs 10 bits on the wire (20% overhead). Gen 3 switched to 128b/130b — only a 2-bit sync header per 128 data bits (~1.5% overhead). That is why Gen 3 only needed to raise the raw rate to 8 GT/s (1.6× Gen 2's 5 GT/s) to roughly double effective throughput: the bandwidth reclaimed from encoding overhead supplies the rest.
Can a x4 GPU work in a x16 slot?
Yes. PCIe supports link width negotiation during link training. A x4 device inserted into a x16 slot will train up to x4 — the extra lanes remain unused. The device runs at x4 bandwidth. Conversely, a x16 GPU in a x8 slot runs at x8 speed (many motherboards use x8 electrical on a x16 physical slot for secondary GPU slots).
What is the difference between PCIe and NVMe?
PCIe is the electrical/protocol interconnect — the physical interface and packet protocol. NVMe (Non-Volatile Memory Express) is a logical storage protocol that runs on top of PCIe. An NVMe SSD uses PCIe lanes to carry NVMe commands (read, write, trim) to/from the drive's controller. M.2 NVMe SSDs use x4 PCIe lanes. SATA SSDs, by contrast, use the SATA protocol over the SATA interface — completely separate from PCIe.
What changed in PCIe Gen 6?
Gen 6 (64 GT/s per lane) introduced two major changes:
- PAM4 signaling — instead of two voltage levels (NRZ/PAM2), Gen 6 uses four levels to encode 2 bits per symbol, doubling bandwidth without doubling frequency.
- FLIT (flow control unit) mode — replaces variable-length TLPs at the link framing level with fixed 256-byte FLITs. Forward Error Correction (FEC) is applied per FLIT to compensate for the higher bit-error rate of PAM4: FEC corrects most errors first, with CRC plus the existing ACK/NAK replay remaining as the backstop.