Interconnect Protocol

PCIe — PCI Express

The universal high-speed interconnect inside every modern PC, server, and SoC — powering GPUs, NVMe SSDs, network cards, and AI accelerators with scalable, full-duplex serial lanes.

Point-to-point · Serial · Full-duplex · Gen 1 → Gen 6 · x1 / x4 / x8 / x16

1. What is PCIe?

PCIe (PCI Express) is a high-speed serial interconnect standard that connects components inside computers — GPUs, NVMe SSDs, network interface cards, sound cards, and more. It was introduced in 2003, developed by Intel and standardized by the PCI-SIG, to replace the older parallel PCI and AGP buses.

Unlike the original PCI which was a shared parallel bus (all devices competed for the same wires), PCIe is point-to-point — each device gets its own dedicated serial link with no contention. This allows each link to operate at full bandwidth regardless of how many other devices are present.

Key change from PCI:

PCI = shared parallel bus → devices take turns; the fixed bus bandwidth is divided among all devices attached.

PCIe = point-to-point serial → each device has a dedicated full-duplex link. Adding more devices adds more links, not more contention.

Feature          | Old PCI                     | PCIe
Topology         | Shared parallel bus         | Point-to-point
Signal type      | Parallel (32-bit or 64-bit) | Serial differential (LVDS)
Direction        | Half-duplex (shared)        | Full-duplex (separate TX/RX)
Max bandwidth    | ~1 GB/s (PCI-X 133)         | ~128 GB/s per direction (x16 Gen 6)
Hot-plug         | Limited                     | Native support
Power management | Basic                       | ASPM (L0/L0s/L1/L2)

2. PCIe Topology

PCIe uses a hierarchical tree topology. Every device in the system is reachable from the CPU through a tree of Root Complex → Switches → Endpoints.

Diagram: the CPU / host (Intel / AMD / ARM) contains the Root Complex (RC), which manages Config Space, generates TLPs, owns the PCIe ports and provides the MMIO window. Hanging off it are a GPU endpoint on x16 (~64 GB/s at Gen 5) and a PCIe switch (fan-out: one upstream port, N downstream ports) leading to an NVMe SSD endpoint on x4 (~8 GB/s at Gen 4), a NIC endpoint on x4 (25GbE / 100GbE) and a low-bandwidth capture card endpoint on x1.

Fig 1 — PCIe system topology: Root Complex → optional Switch → Endpoints

Key topology rules:

The Root Complex is the single root of the tree; every downstream device is reachable from it, and it originates the Configuration requests used for enumeration.

A switch always has exactly one upstream port (towards the Root Complex) and one or more downstream ports.

Endpoints (GPU, NVMe SSD, NIC, …) are the leaves of the tree; traffic between two devices is routed up and down through the tree rather than over any shared bus.

3. Lanes — x1, x4, x8, x16

A PCIe lane is one full-duplex serial link. It contains two differential pairs: one for transmit (TX+ / TX−) and one for receive (RX+ / RX−) — 4 wires total per lane. TX and RX operate simultaneously → full-duplex.

Devices use multiple lanes in parallel to multiply bandwidth. The link width is denoted x1, x4, x8, x16 (read "by one", "by four", etc.).

Diagram: the link widths x1, x4, x8 and x16 drawn as parallel TX/RX lane pairs. Each lane is 4 wires (TX+, TX−, RX+, RX−), and bandwidth scales linearly with width: at Gen 5 roughly 4 GB/s for x1, 16 GB/s for x4, 32 GB/s for x8 and 64 GB/s for x16, per direction.

Physical compatibility:

A x4 device fits in a x8 or x16 slot (mechanically larger slots accept smaller devices). The link negotiates down to the device's capability during link training. This is called link width negotiation.
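
A minimal sketch of that outcome, assuming the common case where each partner supports all widths and speeds up to its maximum (real negotiation is done in hardware by the LTSSM state machine; the numbers are only illustrative):

    #include <stdio.h>

    /* Link training settles on the widest width and fastest speed that BOTH
     * link partners support -- modelled here simply as the minimum of the two
     * maxima. Real negotiation happens in the LTSSM hardware, not software. */
    static int min_int(int a, int b) { return a < b ? a : b; }

    int main(void) {
        int slot_width = 16, slot_gen = 5;     /* x16 Gen 5 slot on the board  */
        int dev_width  = 4,  dev_gen  = 4;     /* x4 Gen 4 endpoint (NVMe SSD) */

        printf("link trains to x%d at Gen %d\n",
               min_int(slot_width, dev_width),  /* -> x4    */
               min_int(slot_gen,  dev_gen));    /* -> Gen 4 */
        return 0;
    }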


4. PCIe Generations — Speed Table

Each PCIe generation doubles the per-lane data rate. The raw bit rate (GT/s) is higher than the effective data rate because of line encoding overhead.

Generation | Year | Rate (per lane) | Encoding                   | Effective / lane | x16 total
Gen 1      | 2003 | 2.5 GT/s        | 8b/10b (20% overhead)      | 250 MB/s         | 4 GB/s
Gen 2      | 2007 | 5.0 GT/s        | 8b/10b (20% overhead)      | 500 MB/s         | 8 GB/s
Gen 3      | 2010 | 8.0 GT/s        | 128b/130b (~1.5% overhead) | ~985 MB/s        | ~16 GB/s
Gen 4      | 2017 | 16.0 GT/s       | 128b/130b                  | ~1.97 GB/s       | ~32 GB/s
Gen 5      | 2019 | 32.0 GT/s       | 128b/130b                  | ~3.94 GB/s       | ~64 GB/s
Gen 6      | 2022 | 64.0 GT/s       | FLIT / PAM4                | ~7.88 GB/s       | ~128 GB/s

Encoding explained:

8b/10b (Gen 1–2): every 8 bits of data is sent as 10 bits on the wire — 20% overhead. Gives DC balance and clock recovery but wastes bandwidth.

128b/130b (Gen 3–5): every 128 data bits uses only a 2-bit sync header — just 1.5% overhead. Much more efficient.

FLIT + PAM4 (Gen 6): uses 4-level pulse amplitude modulation (PAM4) to encode 2 bits per symbol, doubling bandwidth without doubling frequency. Fixed-size 256-byte FLITs replace variable TLPs at the physical layer framing level.
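
As a cross-check of the table above, a small self-contained calculation of effective bandwidth from the raw signaling rate and the encoding efficiency (Gen 6 is left out because FLIT framing and FEC account for overhead differently):

    #include <stdio.h>

    /* Effective bandwidth = raw rate (GT/s) x encoding efficiency / 8 bits per byte.
     * GB here means 10^9 bytes, matching the table above. */
    struct gen { const char *name; double gts; double efficiency; };

    int main(void) {
        struct gen gens[] = {
            { "Gen 1",  2.5,   8.0 /  10.0 },   /* 8b/10b    */
            { "Gen 2",  5.0,   8.0 /  10.0 },
            { "Gen 3",  8.0, 128.0 / 130.0 },   /* 128b/130b */
            { "Gen 4", 16.0, 128.0 / 130.0 },
            { "Gen 5", 32.0, 128.0 / 130.0 },
        };
        for (size_t i = 0; i < sizeof gens / sizeof gens[0]; i++) {
            double lane_gbps = gens[i].gts * gens[i].efficiency / 8.0;  /* GB/s per lane */
            printf("%s: %5.2f GB/s per lane, %6.1f GB/s for x16\n",
                   gens[i].name, lane_gbps, lane_gbps * 16.0);
        }
        return 0;
    }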


5. PCIe 3-Layer Architecture

PCIe is organized into three layers, similar to a network stack. Each layer has a distinct role — upper layers don't care how lower layers work.

Diagram: the three layers stacked on both the transmit and receive sides, joined by the serial differential lanes. Transaction Layer: generates / consumes TLPs (Memory, Config, I/O, Message requests), flow control credits, ordering rules (Relaxed Ordering, No-Snoop). Data Link Layer: adds Sequence Number + LCRC to each TLP, runs the ACK/NAK protocol, generates DLLPs (ACK, NAK, FC Update), keeps a replay buffer for retransmission. Physical Layer: serialization / deserialization (SERDES), 8b/10b or 128b/130b encoding, lane polarity / reversal, link training, scrambling, electrical specs.

Fig 2 — PCIe 3-layer architecture: TLPs flow down the TX stack, travel over serial lanes, and up the RX stack

Transaction Layer

The topmost layer. Software interacts with PCIe here through memory-mapped I/O (MMIO) and DMA. It generates and terminates TLPs (Transaction Layer Packets) — the fundamental unit of data exchange. It also manages flow control credits so senders never overwhelm receivers.

Data Link Layer

Provides reliable delivery. It wraps each TLP with a sequence number and LCRC (Link CRC), sends it downstream, and waits for an ACK DLLP from the receiver. If a NAK arrives or a timer expires, it retransmits from the replay buffer. It also generates Flow Control DLLPs to replenish the sender's credit counters.
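
A conceptual sketch of that retry loop (this is not a driver API; the Data Link Layer lives in hardware, and the sizes here are simplified for illustration):

    #include <stdint.h>
    #include <string.h>

    /* Every transmitted TLP gets a 12-bit sequence number and a copy in the
     * replay buffer. An ACK DLLP releases entries up to its sequence number;
     * a NAK (or a replay-timer timeout) retransmits everything still held. */
    #define REPLAY_SLOTS 32
    #define MAX_TLP      4096

    struct dll_tx {
        uint16_t next_seq;                       /* next sequence number (mod 4096) */
        uint16_t oldest_unacked;                 /* first entry still awaiting ACK  */
        uint8_t  replay[REPLAY_SLOTS][MAX_TLP];  /* copies of in-flight TLPs        */
    };

    static void dll_send(struct dll_tx *tx, const void *tlp, size_t len)
    {
        if (len > MAX_TLP) len = MAX_TLP;                 /* sketch-level guard     */
        memcpy(tx->replay[tx->next_seq % REPLAY_SLOTS], tlp, len);
        /* ...prepend next_seq, append LCRC, hand to the Physical Layer...          */
        tx->next_seq = (tx->next_seq + 1) & 0x0FFF;       /* 12-bit sequence space  */
    }

    static void dll_on_ack(struct dll_tx *tx, uint16_t acked_seq)
    {
        tx->oldest_unacked = (acked_seq + 1) & 0x0FFF;    /* free replay entries    */
    }

    static void dll_on_nak(struct dll_tx *tx, uint16_t acked_seq)
    {
        tx->oldest_unacked = (acked_seq + 1) & 0x0FFF;
        /* ...retransmit replay[oldest_unacked .. next_seq-1], oldest first...      */
    }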

Physical Layer

The lowest layer — purely electrical and serialization concerns. It takes bits from the Data Link Layer, scrambles them for EMI reduction, applies 8b/10b or 128b/130b encoding, and drives differential signals across the lanes. It also handles link training (negotiating link width and speed at boot), lane reversal, and polarity inversion.


6. TLP — Transaction Layer Packet

Every read, write, interrupt, or configuration access in PCIe travels as a TLP. A TLP consists of a header (3 or 4 DWORDs), an optional data payload, and an optional ECRC.

Diagram: a TLP consists of a HEADER (3 DW = 96 bits, or 4 DW = 128 bits for 64-bit addressing), an optional DATA PAYLOAD of 0 to 4096 bytes (present for MemWr and Completion with Data, absent for MemRd requests, CfgRd, etc.) and an optional 4-byte ECRC (end-to-end CRC). The Data Link Layer then wraps the whole TLP with a 2-byte Sequence Number in front and a 4-byte LCRC behind.

Fig 3 — TLP structure (Transaction Layer) wrapped by DLL Sequence Number + LCRC

Header Fields (DW0 — common to all TLPs)

Bits           | Field        | Description
[7:5]          | Fmt          | Format: 3DW no data / 4DW no data / 3DW with data / 4DW with data
[4:0]          | Type         | TLP type: MRd, MWr, CfgRd0, CfgWr0, Cpl, CplD, Msg…
[9]            | TC           | Traffic Class (0–7) — QoS priority
[15:10]        | Attr         | Attributes: Relaxed Ordering, No-Snoop, ID-Based Ordering
[25:16]        | Length       | Payload length in DWORDs (0 = 1024 DW = 4096 bytes)
[31:16] of DW1 | Requester ID | Bus:Device:Function of the requester (16 bits)
[15:8] of DW1  | Tag          | Outstanding request tag (8-bit = 256 outstanding reads)
[7:0] of DW1   | BE           | First/Last DW Byte Enable — which bytes in the first/last DWORD are valid
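
To make the DW1 fields concrete, here is a small sketch that unpacks the Requester ID (Bus:Device:Function), Tag and Byte Enables from the second header DWORD; the example value is made up:

    #include <stdint.h>
    #include <stdio.h>

    /* The Requester ID is a 16-bit Bus:Device:Function triple:
     * bus = bits [15:8], device = bits [7:3], function = bits [2:0]. */
    int main(void) {
        uint32_t dw1 = 0x0100400Fu;               /* illustrative value only          */

        uint16_t requester_id = (dw1 >> 16) & 0xFFFF;
        uint8_t  tag          = (dw1 >> 8)  & 0xFF;
        uint8_t  byte_enables = dw1 & 0xFF;       /* Last DW BE [7:4], First DW BE [3:0] */

        printf("requester %02x:%02x.%x  tag %u  BE 0x%02x\n",
               (requester_id >> 8) & 0xFF,        /* bus      */
               (requester_id >> 3) & 0x1F,        /* device   */
               requester_id & 0x7,                /* function */
               tag, byte_enables);
        return 0;
    }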

TLP Types

TLP Type             | Abbrev | Direction             | Has Data? | Use
Memory Read Request  | MRd    | Requester → Completer | No        | CPU reads GPU VRAM, DMA reads host memory
Memory Write Request | MWr    | Requester → Completer | Yes       | CPU writes to a device register / BAR; posted write (no completion)
Completion with Data | CplD   | Completer → Requester | Yes       | Response to MRd, carries the read data back
Completion (no data) | Cpl    | Completer → Requester | No        | Response to an I/O or Config Write confirming completion
Config Read Type 0   | CfgRd0 | RC → Device           | No        | Read a config space register of a device on the same bus
Config Write Type 0  | CfgWr0 | RC → Device           | Yes       | Write config space (BARs, command register, etc.)
Message              | Msg    | Either direction      | Optional  | INTx interrupt emulation, vendor-defined messages, power management events

7. Flow Control

PCIe uses credit-based flow control at the Transaction Layer to prevent buffer overflow. Before sending a TLP, the sender must verify it has enough credits from the receiver. Unlike a simple on/off flow control, this allows the link to stay busy right up to the receiver's capacity.

There are two types of credits per traffic class: header credits (HdrFC), counted per TLP header, and data credits (DataFC), counted in 4-DW (16-byte) units. The credit machinery works in five steps:

Step         | What happens
1. Init      | At link-up, both sides advertise their initial FC credits via InitFC1 / InitFC2 DLLPs.
2. Check     | Before sending a TLP, the sender checks: HdrFC ≥ 1 AND DataFC ≥ ⌈payload_DW / 4⌉.
3. Consume   | The sender decrements its local credit counters by the amount used.
4. Process   | The receiver processes the TLP and frees buffer space.
5. Replenish | The receiver sends an UpdateFC DLLP to return the consumed credits to the sender.

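
A minimal sketch of the sender-side check / consume / replenish steps (real hardware tracks credits with modular "consumed vs. advertised limit" counters; this simplified version treats the counters as credits currently available):

    #include <stdbool.h>
    #include <stdint.h>

    /* Header credits are counted per TLP; data credits in 4-DW (16-byte) units,
     * so a payload of N DWORDs consumes ceil(N / 4) data credits. */
    struct fc_counters {
        uint32_t hdr_credits;     /* credits currently available at the receiver */
        uint32_t data_credits;
    };

    static bool try_send_tlp(struct fc_counters *fc, uint32_t payload_dw)
    {
        uint32_t data_needed = (payload_dw + 3) / 4;   /* round up to 4-DW units */

        if (fc->hdr_credits < 1 || fc->data_credits < data_needed)
            return false;                              /* stall until UpdateFC   */

        fc->hdr_credits  -= 1;                         /* step 3: consume        */
        fc->data_credits -= data_needed;
        return true;                                   /* TLP may be transmitted */
    }

    /* Step 5: an UpdateFC DLLP from the receiver returns freed credits. */
    static void on_update_fc(struct fc_counters *fc, uint32_t hdr, uint32_t data)
    {
        fc->hdr_credits  += hdr;
        fc->data_credits += data;
    }
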
Infinite credits:

An advertised credit value of 0 during FC initialization means infinite credits; requesters typically advertise infinite Completion credits, since they must always be able to absorb the completions for reads they issued. Posted writes (MWr) and Messages need no completion TLP at all, so the sender never waits for a response to them.


8. Configuration Space & BARs

Every PCIe device has a Configuration Space — a standardised register space that the OS uses to discover, configure, and control the device.
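
On PCIe platforms the OS reaches config space through plain MMIO using ECAM (Enhanced Configuration Access Mechanism): every function gets a 4 KB window at a fixed offset from a platform-specific base address reported by firmware (the ACPI MCFG table). A sketch, with a made-up base address:

    #include <stdint.h>

    /* ECAM layout: bus (8 bits) : device (5) : function (3) : register offset (12).
     * ECAM_BASE is platform-specific; the value below is only a placeholder. */
    #define ECAM_BASE 0xE0000000ULL

    static inline uint64_t ecam_addr(uint8_t bus, uint8_t dev,
                                     uint8_t fn, uint16_t offset)
    {
        return ECAM_BASE
             + ((uint64_t)bus << 20)
             + ((uint64_t)(dev & 0x1F) << 15)
             + ((uint64_t)(fn  & 0x07) << 12)
             + (offset & 0xFFF);
    }

    /* e.g. the Vendor ID of bus 0, device 2, function 0 lives at ecam_addr(0, 2, 0, 0x00) */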

Base Address Registers (BARs)

BARs tell the OS where in the system's address space the device's registers and memory live. The BIOS/OS writes a base address into each BAR; from then on, reading/writing that memory address issues a PCIe MMIO TLP to the device.

Example — GPU BAR:

A GPU typically has BAR0 (e.g. 256 MB of GPU control registers) and BAR1 (e.g. 8 GB of VRAM mapped into the CPU's address space). When the CPU writes a value to an address in BAR0's range, it generates an MWr TLP that travels down the PCIe link to the GPU.
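
How does firmware know how large a BAR's window must be? The classic sizing probe: write all 1s to the BAR, read it back, and the device leaves the size-aligned low bits at zero. A sketch that assumes a 32-bit memory BAR and hypothetical config_read32 / config_write32 helpers:

    #include <stdint.h>

    /* Placeholder config-space accessors, not a real API. */
    extern uint32_t config_read32(uint16_t bdf, uint16_t offset);
    extern void     config_write32(uint16_t bdf, uint16_t offset, uint32_t value);

    static uint64_t bar_size(uint16_t bdf, uint16_t bar_offset)
    {
        uint32_t original = config_read32(bdf, bar_offset);

        config_write32(bdf, bar_offset, 0xFFFFFFFFu);      /* probe with all 1s   */
        uint32_t probed = config_read32(bdf, bar_offset);
        config_write32(bdf, bar_offset, original);         /* restore the address */

        probed &= ~0xFu;                  /* mask off the memory BAR flag bits     */
        return probed ? (uint64_t)(~probed + 1) : 0;       /* size = two's complement */
    }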

Interrupts — MSI and MSI-X

PCIe devices signal interrupts using MSI (Message-Signaled Interrupts) — they write a small TLP (a Memory Write) to a CPU-programmed address. This eliminates the legacy INTx shared interrupt pin and enables up to 2048 independent vectors per device with MSI-X.
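
For reference, the layout of one MSI-X table entry (16 bytes, little-endian, stored in one of the device's BARs); raising vector N is exactly the Memory Write described above, using this entry's address and data:

    #include <stdint.h>

    /* One MSI-X table entry: the device raises the vector by issuing an MWr TLP
     * of msg_data to the CPU-programmed address. */
    struct msix_entry {
        uint32_t msg_addr_lo;   /* doorbell address, low 32 bits                   */
        uint32_t msg_addr_hi;   /* high 32 bits                                    */
        uint32_t msg_data;      /* payload the CPU's interrupt controller decodes  */
        uint32_t vector_ctrl;   /* bit 0 = mask this vector                        */
    };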


9. Power Management — ASPM

ASPM (Active State Power Management) defines low-power link states that both sides can enter when the link is idle:

State | Name      | Power     | Exit latency | Description
L0    | Active    | Full      | 0            | Normal, fully operational state.
L0s   | Standby   | Low       | < 1 μs       | TX enters low power after idle; fast resume. Each direction enters L0s independently.
L1    | Hibernate | Very low  | ~10 μs       | Both TX and RX sleep. Requires both sides to negotiate entry. Deeper power savings.
L2/L3 | Off       | Near zero | ms range     | Main power removed. The device needs full re-enumeration to resume.
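
Software opts in to ASPM per link via the ASPM Control field (bits [1:0]) of the Link Control register in the device's PCI Express Capability structure. A sketch using hypothetical config-space helpers:

    #include <stdint.h>

    /* Placeholder accessors for the PCIe Capability, not a real API. */
    extern uint16_t pcie_cap_read16(uint16_t bdf, uint16_t offset);
    extern void     pcie_cap_write16(uint16_t bdf, uint16_t offset, uint16_t value);

    #define PCIE_LINK_CONTROL 0x10   /* offset of Link Control within the PCIe Capability */
    #define ASPM_DISABLED     0x0    /* bits [1:0]: 00 off, 01 L0s, 10 L1, 11 both        */
    #define ASPM_L0S          0x1
    #define ASPM_L1           0x2
    #define ASPM_L0S_L1       0x3

    static void set_aspm(uint16_t bdf, uint16_t policy)
    {
        uint16_t lnkctl = pcie_cap_read16(bdf, PCIE_LINK_CONTROL);
        lnkctl = (uint16_t)((lnkctl & ~0x3u) | (policy & 0x3u));
        pcie_cap_write16(bdf, PCIE_LINK_CONTROL, lnkctl);
    }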

10. FAQ

What is PCIe and what does it replace?

PCIe (PCI Express) is a high-speed point-to-point serial interconnect introduced in 2003. It replaced three older parallel bus standards: PCI (general expansion cards), AGP (graphics), and PCI-X (servers). All modern motherboards use PCIe for GPUs, NVMe SSDs, NICs, and Wi-Fi cards.

What is a PCIe lane?

One PCIe lane = two differential pairs: one for TX (TX+/TX−) and one for RX (RX+/RX−) = 4 wires. TX and RX operate simultaneously making it full-duplex. Bandwidth scales linearly: x4 = 4× the bandwidth of x1, x16 = 16×.

What is the Root Complex?

The Root Complex (RC) is the PCIe host controller — typically integrated inside the CPU die in modern Intel/AMD platforms, or in a separate chipset. It is the root of the PCIe tree hierarchy. It creates the memory and I/O address windows that map device BARs into the CPU's address space, and it generates the Configuration requests that enumerate all downstream devices at boot.

What is a TLP and a DLLP?

TLP (Transaction Layer Packet): Generated by the Transaction Layer. Carries application data — reads, writes, completions, config accesses. Has a 3 or 4 DW header + optional payload + optional ECRC.

DLLP (Data Link Layer Packet): Generated by the Data Link Layer. Carries ACK/NAK acknowledgements and Flow Control credit updates. DLLPs are never seen by the Transaction Layer — they are generated and consumed by the DLL.

Why does Gen 3 seem to have almost no encoding overhead vs Gen 1/2?

Gen 1 and Gen 2 use 8b/10b encoding — every 8 bits of data costs 10 bits on the wire (20% overhead). Gen 3 switched to 128b/130b — only a 2-bit overhead per 128 data bits (~1.5%). That is why Gen 3, despite raising the raw rate only from 5 GT/s to 8 GT/s (1.6×), still roughly doubles Gen 2's effective per-lane throughput: the jump in encoding efficiency makes up the rest.
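
The worked numbers, using the rates from the table in section 4:

    Gen 2: 5.0 GT/s × (8 / 10)    = 4.00 Gb/s ≈ 500 MB/s per lane
    Gen 3: 8.0 GT/s × (128 / 130) ≈ 7.88 Gb/s ≈ 985 MB/s per lane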

Can a x4 GPU work in a x16 slot?

Yes. PCIe supports link width negotiation during link training. A x4 device inserted into a x16 slot will train up to x4 — the extra lanes remain unused. The device runs at x4 bandwidth. Conversely, a x16 GPU in a x8 slot runs at x8 speed (many motherboards use x8 electrical on a x16 physical slot for secondary GPU slots).

What is the difference between PCIe and NVMe?

PCIe is the electrical/protocol interconnect — the physical interface and packet protocol. NVMe (Non-Volatile Memory Express) is a logical storage protocol that runs on top of PCIe. An NVMe SSD uses PCIe lanes to carry NVMe commands (read, write, trim) to/from the drive's controller. M.2 NVMe SSDs use x4 PCIe lanes. SATA SSDs, by contrast, use the SATA protocol over the SATA interface — completely separate from PCIe.

What changed in PCIe Gen 6?

Gen 6 (64 GT/s per lane) introduced two major changes:

  • PAM4 signaling — instead of two voltage levels (NRZ/PAM2), Gen 6 uses four levels to encode 2 bits per symbol, doubling bandwidth without doubling frequency.
  • FLIT mode — replaces variable-length TLP framing with fixed 256-byte FLITs (flow control units). Forward Error Correction (FEC) is applied per FLIT to compensate for the higher bit-error rate of PAM4, shifting much of the reliability burden from ACK/NAK retransmission to up-front FEC correction.