Interconnect Protocol

PCIe — PCI Express

The universal high-speed interconnect inside every modern PC, server, and SoC — powering GPUs, NVMe SSDs, network cards, and AI accelerators with scalable, full-duplex serial lanes.

Point-to-point · Serial · Full-duplex · Gen 1 → Gen 6 · x1 / x4 / x8 / x16

1. What is PCIe?

PCIe (PCI Express) is a high-speed serial interconnect standard that connects components inside computers — GPUs, NVMe SSDs, network interface cards, sound cards, and more. It was introduced in 2003, developed by Intel and standardized by the PCI-SIG, to replace the older parallel PCI and AGP buses.

Unlike the original PCI which was a shared parallel bus (all devices competed for the same wires), PCIe is point-to-point — each device gets its own dedicated serial link with no contention. This allows each link to operate at full bandwidth regardless of how many other devices are present.

Key change from PCI:

PCI = shared parallel bus → devices take turns; the fixed bus bandwidth is divided among all devices attached.

PCIe = point-to-point serial → each device has a dedicated full-duplex link. Adding more devices adds more links, not more contention.

Feature          | Old PCI                     | PCIe
Topology         | Shared parallel bus         | Point-to-point
Signal type      | Parallel (32-bit or 64-bit) | Serial differential (LVDS)
Direction        | Half-duplex (shared)        | Full-duplex (separate TX/RX)
Max bandwidth    | ~1 GB/s (PCI-X 133)         | ~128 GB/s per direction (x16 Gen 6)
Hot-plug         | Limited                     | Native support
Power management | Basic                       | ASPM (L0/L0s/L1/L2)

2. PCIe Topology

PCIe uses a hierarchical tree topology. Every device in the system is reachable from the CPU through a tree of Root Complex → Switches → Endpoints.

Diagram: the CPU / host (Intel / AMD / ARM) contains the Root Complex (RC), which manages Config Space, generates TLPs, owns the PCIe ports and provides the MMIO window. Hanging off it are a GPU endpoint on x16 (~64 GB/s at Gen 5) and a PCIe switch (fan-out: one upstream port, N downstream ports) leading to an NVMe SSD endpoint on x4 (~8 GB/s at Gen 4), a NIC endpoint on x4 (25GbE / 100GbE) and a low-bandwidth capture card endpoint on x1.

Fig 1 — PCIe system topology: Root Complex → optional Switch → Endpoints

Key topology rules:

The Root Complex is the single root of the tree; every downstream device is reachable from it, and it originates the Configuration requests used for enumeration.

A switch always has exactly one upstream port (towards the Root Complex) and one or more downstream ports.

Endpoints (GPU, NVMe SSD, NIC, …) are the leaves of the tree; traffic between two devices is routed up and down through the tree rather than over any shared bus.

3. Lanes — x1, x4, x8, x16

A PCIe lane is one full-duplex serial link. It contains two differential pairs: one for transmit (TX+ / TX−) and one for receive (RX+ / RX−) — 4 wires total per lane. TX and RX operate simultaneously → full-duplex.

Devices use multiple lanes in parallel to multiply bandwidth. The link width is denoted x1, x4, x8, x16 (read "by one", "by four", etc.).

Diagram: the link widths x1, x4, x8 and x16 drawn as parallel TX/RX lane pairs. Each lane is 4 wires (TX+, TX−, RX+, RX−), and bandwidth scales linearly with width: at Gen 5 roughly 4 GB/s for x1, 16 GB/s for x4, 32 GB/s for x8 and 64 GB/s for x16, per direction.

Physical compatibility:

A x4 device fits in a x8 or x16 slot (mechanically larger slots accept smaller devices). The link negotiates down to the device's capability during link training. This is called link width negotiation.
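
A minimal sketch of that outcome, assuming the common case where each partner supports all widths and speeds up to its maximum (real negotiation is done in hardware by the LTSSM state machine; the numbers are only illustrative):

    #include <stdio.h>

    /* Link training settles on the widest width and fastest speed that BOTH
     * link partners support -- modelled here simply as the minimum of the two
     * maxima. Real negotiation happens in the LTSSM hardware, not software. */
    static int min_int(int a, int b) { return a < b ? a : b; }

    int main(void) {
        int slot_width = 16, slot_gen = 5;     /* x16 Gen 5 slot on the board  */
        int dev_width  = 4,  dev_gen  = 4;     /* x4 Gen 4 endpoint (NVMe SSD) */

        printf("link trains to x%d at Gen %d\n",
               min_int(slot_width, dev_width),  /* -> x4    */
               min_int(slot_gen,  dev_gen));    /* -> Gen 4 */
        return 0;
    }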


4. PCIe Generations — Speed Table

Each PCIe generation doubles the per-lane data rate. The raw bit rate (GT/s) is higher than the effective data rate because of line encoding overhead.

Generation | Year | Rate (per lane) | Encoding                   | Effective / lane | x16 total
Gen 1      | 2003 | 2.5 GT/s        | 8b/10b (20% overhead)      | 250 MB/s         | 4 GB/s
Gen 2      | 2007 | 5.0 GT/s        | 8b/10b (20% overhead)      | 500 MB/s         | 8 GB/s
Gen 3      | 2010 | 8.0 GT/s        | 128b/130b (~1.5% overhead) | ~985 MB/s        | ~16 GB/s
Gen 4      | 2017 | 16.0 GT/s       | 128b/130b                  | ~1.97 GB/s       | ~32 GB/s
Gen 5      | 2019 | 32.0 GT/s       | 128b/130b                  | ~3.94 GB/s       | ~64 GB/s
Gen 6      | 2022 | 64.0 GT/s       | FLIT / PAM4                | ~7.88 GB/s       | ~128 GB/s

Encoding explained:

8b/10b (Gen 1–2): every 8 bits of data is sent as 10 bits on the wire — 20% overhead. Gives DC balance and clock recovery but wastes bandwidth.

128b/130b (Gen 3–5): every 128 data bits uses only a 2-bit sync header — just 1.5% overhead. Much more efficient.

FLIT + PAM4 (Gen 6): uses 4-level pulse amplitude modulation (PAM4) to encode 2 bits per symbol, doubling bandwidth without doubling frequency. Fixed-size 256-byte FLITs replace variable TLPs at the physical layer framing level.
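
As a cross-check of the table above, a small self-contained calculation of effective bandwidth from the raw signaling rate and the encoding efficiency (Gen 6 is left out because FLIT framing and FEC account for overhead differently):

    #include <stdio.h>

    /* Effective bandwidth = raw rate (GT/s) x encoding efficiency / 8 bits per byte.
     * GB here means 10^9 bytes, matching the table above. */
    struct gen { const char *name; double gts; double efficiency; };

    int main(void) {
        struct gen gens[] = {
            { "Gen 1",  2.5,   8.0 /  10.0 },   /* 8b/10b    */
            { "Gen 2",  5.0,   8.0 /  10.0 },
            { "Gen 3",  8.0, 128.0 / 130.0 },   /* 128b/130b */
            { "Gen 4", 16.0, 128.0 / 130.0 },
            { "Gen 5", 32.0, 128.0 / 130.0 },
        };
        for (size_t i = 0; i < sizeof gens / sizeof gens[0]; i++) {
            double lane_gbps = gens[i].gts * gens[i].efficiency / 8.0;  /* GB/s per lane */
            printf("%s: %5.2f GB/s per lane, %6.1f GB/s for x16\n",
                   gens[i].name, lane_gbps, lane_gbps * 16.0);
        }
        return 0;
    }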


5. PCIe 3-Layer Architecture

PCIe is organized into three layers, similar to a network stack. Each layer has a distinct role — upper layers don't care how lower layers work.

Diagram: the three layers stacked on both the transmit and receive sides, joined by the serial differential lanes. Transaction Layer: generates / consumes TLPs (Memory, Config, I/O, Message requests), flow control credits, ordering rules (Relaxed Ordering, No-Snoop). Data Link Layer: adds Sequence Number + LCRC to each TLP, runs the ACK/NAK protocol, generates DLLPs (ACK, NAK, FC Update), keeps a replay buffer for retransmission. Physical Layer: serialization / deserialization (SERDES), 8b/10b or 128b/130b encoding, lane polarity / reversal, link training, scrambling, electrical specs.

Fig 2 — PCIe 3-layer architecture: TLPs flow down the TX stack, travel over serial lanes, and up the RX stack

Transaction Layer

The topmost layer. Software interacts with PCIe here through memory-mapped I/O (MMIO) and DMA. It generates and terminates TLPs (Transaction Layer Packets) — the fundamental unit of data exchange. It also manages flow control credits so senders never overwhelm receivers.

Data Link Layer

Provides reliable delivery. It wraps each TLP with a sequence number and LCRC (Link CRC), sends it downstream, and waits for an ACK DLLP from the receiver. If a NAK arrives or a timer expires, it retransmits from the replay buffer. It also generates Flow Control DLLPs to replenish the sender's credit counters.
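
A conceptual sketch of that retry loop (this is not a driver API; the Data Link Layer lives in hardware, and the sizes here are simplified for illustration):

    #include <stdint.h>
    #include <string.h>

    /* Every transmitted TLP gets a 12-bit sequence number and a copy in the
     * replay buffer. An ACK DLLP releases entries up to its sequence number;
     * a NAK (or a replay-timer timeout) retransmits everything still held. */
    #define REPLAY_SLOTS 32
    #define MAX_TLP      4096

    struct dll_tx {
        uint16_t next_seq;                       /* next sequence number (mod 4096) */
        uint16_t oldest_unacked;                 /* first entry still awaiting ACK  */
        uint8_t  replay[REPLAY_SLOTS][MAX_TLP];  /* copies of in-flight TLPs        */
    };

    static void dll_send(struct dll_tx *tx, const void *tlp, size_t len)
    {
        if (len > MAX_TLP) len = MAX_TLP;                 /* sketch-level guard     */
        memcpy(tx->replay[tx->next_seq % REPLAY_SLOTS], tlp, len);
        /* ...prepend next_seq, append LCRC, hand to the Physical Layer...          */
        tx->next_seq = (tx->next_seq + 1) & 0x0FFF;       /* 12-bit sequence space  */
    }

    static void dll_on_ack(struct dll_tx *tx, uint16_t acked_seq)
    {
        tx->oldest_unacked = (acked_seq + 1) & 0x0FFF;    /* free replay entries    */
    }

    static void dll_on_nak(struct dll_tx *tx, uint16_t acked_seq)
    {
        tx->oldest_unacked = (acked_seq + 1) & 0x0FFF;
        /* ...retransmit replay[oldest_unacked .. next_seq-1], oldest first...      */
    }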

Physical Layer

The lowest layer — purely electrical and serialization concerns. It takes bits from the Data Link Layer, scrambles them for EMI reduction, applies 8b/10b or 128b/130b encoding, and drives differential signals across the lanes. It also handles link training (negotiating link width and speed at boot), lane reversal, and polarity inversion.


6. TLP — Transaction Layer Packet

Every read, write, interrupt, or configuration access in PCIe travels as a TLP. A TLP consists of a header (3 or 4 DWORDs), an optional data payload, and an optional ECRC.

Diagram: a TLP consists of a HEADER (3 DW = 96 bits, or 4 DW = 128 bits for 64-bit addressing), an optional DATA PAYLOAD of 0 to 4096 bytes (present for MemWr and Completion with Data, absent for MemRd requests, CfgRd, etc.) and an optional 4-byte ECRC (end-to-end CRC). The Data Link Layer then wraps the whole TLP with a 2-byte Sequence Number in front and a 4-byte LCRC behind.

Fig 3 — TLP structure (Transaction Layer) wrapped by DLL Sequence Number + LCRC

Header Fields (DW0 — common to all TLPs)

Bits           | Field        | Description
[7:5]          | Fmt          | Format: 3DW no data / 4DW no data / 3DW with data / 4DW with data
[4:0]          | Type         | TLP type: MRd, MWr, CfgRd0, CfgWr0, Cpl, CplD, Msg…
[9]            | TC           | Traffic Class (0–7) — QoS priority
[15:10]        | Attr         | Attributes: Relaxed Ordering, No-Snoop, ID-Based Ordering
[25:16]        | Length       | Payload length in DWORDs (0 = 1024 DW = 4096 bytes)
[31:16] of DW1 | Requester ID | Bus:Device:Function of the requester (16 bits)
[15:8] of DW1  | Tag          | Outstanding request tag (8-bit = 256 outstanding reads)
[7:0] of DW1   | BE           | First/Last DW Byte Enable — which bytes in the first/last DWORD are valid
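
To make the DW1 fields concrete, here is a small sketch that unpacks the Requester ID (Bus:Device:Function), Tag and Byte Enables from the second header DWORD; the example value is made up:

    #include <stdint.h>
    #include <stdio.h>

    /* The Requester ID is a 16-bit Bus:Device:Function triple:
     * bus = bits [15:8], device = bits [7:3], function = bits [2:0]. */
    int main(void) {
        uint32_t dw1 = 0x0100400Fu;               /* illustrative value only          */

        uint16_t requester_id = (dw1 >> 16) & 0xFFFF;
        uint8_t  tag          = (dw1 >> 8)  & 0xFF;
        uint8_t  byte_enables = dw1 & 0xFF;       /* Last DW BE [7:4], First DW BE [3:0] */

        printf("requester %02x:%02x.%x  tag %u  BE 0x%02x\n",
               (requester_id >> 8) & 0xFF,        /* bus      */
               (requester_id >> 3) & 0x1F,        /* device   */
               requester_id & 0x7,                /* function */
               tag, byte_enables);
        return 0;
    }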

TLP Types

TLP Type             | Abbrev | Direction             | Has Data? | Use
Memory Read Request  | MRd    | Requester → Completer | No        | CPU reads GPU VRAM, DMA reads host memory
Memory Write Request | MWr    | Requester → Completer | Yes       | CPU writes to a device register / BAR; posted write (no completion)
Completion with Data | CplD   | Completer → Requester | Yes       | Response to MRd, carries the read data back
Completion (no data) | Cpl    | Completer → Requester | No        | Response to an I/O or Config Write confirming completion
Config Read Type 0   | CfgRd0 | RC → Device           | No        | Read a config space register of a device on the same bus
Config Write Type 0  | CfgWr0 | RC → Device           | Yes       | Write config space (BARs, command register, etc.)
Message              | Msg    | Either direction      | Optional  | INTx interrupt emulation, vendor-defined messages, power management events

7. Flow Control

PCIe uses credit-based flow control at the Transaction Layer to prevent buffer overflow. Before sending a TLP, the sender must verify it has enough credits from the receiver. Unlike a simple on/off flow control, this allows the link to stay busy right up to the receiver's capacity.

There are two types of credits per traffic class: header credits (HdrFC), counted per TLP header, and data credits (DataFC), counted in 4-DW (16-byte) units. The credit machinery works in five steps:

Step         | What happens
1. Init      | At link-up, both sides advertise their initial FC credits via InitFC1 / InitFC2 DLLPs.
2. Check     | Before sending a TLP, the sender checks: HdrFC ≥ 1 AND DataFC ≥ ⌈payload_DW / 4⌉.
3. Consume   | The sender decrements its local credit counters by the amount used.
4. Process   | The receiver processes the TLP and frees buffer space.
5. Replenish | The receiver sends an UpdateFC DLLP to return the consumed credits to the sender.

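
A minimal sketch of the sender-side check / consume / replenish steps (real hardware tracks credits with modular "consumed vs. advertised limit" counters; this simplified version treats the counters as credits currently available):

    #include <stdbool.h>
    #include <stdint.h>

    /* Header credits are counted per TLP; data credits in 4-DW (16-byte) units,
     * so a payload of N DWORDs consumes ceil(N / 4) data credits. */
    struct fc_counters {
        uint32_t hdr_credits;     /* credits currently available at the receiver */
        uint32_t data_credits;
    };

    static bool try_send_tlp(struct fc_counters *fc, uint32_t payload_dw)
    {
        uint32_t data_needed = (payload_dw + 3) / 4;   /* round up to 4-DW units */

        if (fc->hdr_credits < 1 || fc->data_credits < data_needed)
            return false;                              /* stall until UpdateFC   */

        fc->hdr_credits  -= 1;                         /* step 3: consume        */
        fc->data_credits -= data_needed;
        return true;                                   /* TLP may be transmitted */
    }

    /* Step 5: an UpdateFC DLLP from the receiver returns freed credits. */
    static void on_update_fc(struct fc_counters *fc, uint32_t hdr, uint32_t data)
    {
        fc->hdr_credits  += hdr;
        fc->data_credits += data;
    }
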
Infinite credits:

An advertised credit value of 0 during FC initialization means infinite credits; requesters typically advertise infinite Completion credits, since they must always be able to absorb the completions for reads they issued. Posted writes (MWr) and Messages need no completion TLP at all, so the sender never waits for a response to them.


8. Configuration Space & BARs

Every PCIe device has a Configuration Space — a standardised register space that the OS uses to discover, configure, and control the device.
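
On PCIe platforms the OS reaches config space through plain MMIO using ECAM (Enhanced Configuration Access Mechanism): every function gets a 4 KB window at a fixed offset from a platform-specific base address reported by firmware (the ACPI MCFG table). A sketch, with a made-up base address:

    #include <stdint.h>

    /* ECAM layout: bus (8 bits) : device (5) : function (3) : register offset (12).
     * ECAM_BASE is platform-specific; the value below is only a placeholder. */
    #define ECAM_BASE 0xE0000000ULL

    static inline uint64_t ecam_addr(uint8_t bus, uint8_t dev,
                                     uint8_t fn, uint16_t offset)
    {
        return ECAM_BASE
             + ((uint64_t)bus << 20)
             + ((uint64_t)(dev & 0x1F) << 15)
             + ((uint64_t)(fn  & 0x07) << 12)
             + (offset & 0xFFF);
    }

    /* e.g. the Vendor ID of bus 0, device 2, function 0 lives at ecam_addr(0, 2, 0, 0x00) */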

Base Address Registers (BARs)

BARs tell the OS where in the system's address space the device's registers and memory live. The BIOS/OS writes a base address into each BAR; from then on, reading/writing that memory address issues a PCIe MMIO TLP to the device.

Example — GPU BAR:

A GPU typically has BAR0 (e.g. 256 MB of GPU control registers) and BAR1 (e.g. 8 GB of VRAM mapped into the CPU's address space). When the CPU writes a value to an address in BAR0's range, it generates an MWr TLP that travels down the PCIe link to the GPU.
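
How does firmware know how large a BAR's window must be? The classic sizing probe: write all 1s to the BAR, read it back, and the device leaves the size-aligned low bits at zero. A sketch that assumes a 32-bit memory BAR and hypothetical config_read32 / config_write32 helpers:

    #include <stdint.h>

    /* Placeholder config-space accessors, not a real API. */
    extern uint32_t config_read32(uint16_t bdf, uint16_t offset);
    extern void     config_write32(uint16_t bdf, uint16_t offset, uint32_t value);

    static uint64_t bar_size(uint16_t bdf, uint16_t bar_offset)
    {
        uint32_t original = config_read32(bdf, bar_offset);

        config_write32(bdf, bar_offset, 0xFFFFFFFFu);      /* probe with all 1s   */
        uint32_t probed = config_read32(bdf, bar_offset);
        config_write32(bdf, bar_offset, original);         /* restore the address */

        probed &= ~0xFu;                  /* mask off the memory BAR flag bits     */
        return probed ? (uint64_t)(~probed + 1) : 0;       /* size = two's complement */
    }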

Interrupts — MSI and MSI-X

PCIe devices signal interrupts using MSI (Message-Signaled Interrupts) — they write a small TLP (a Memory Write) to a CPU-programmed address. This eliminates the legacy INTx shared interrupt pin and enables up to 2048 independent vectors per device with MSI-X.
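
For reference, the layout of one MSI-X table entry (16 bytes, little-endian, stored in one of the device's BARs); raising vector N is exactly the Memory Write described above, using this entry's address and data:

    #include <stdint.h>

    /* One MSI-X table entry: the device raises the vector by issuing an MWr TLP
     * of msg_data to the CPU-programmed address. */
    struct msix_entry {
        uint32_t msg_addr_lo;   /* doorbell address, low 32 bits                   */
        uint32_t msg_addr_hi;   /* high 32 bits                                    */
        uint32_t msg_data;      /* payload the CPU's interrupt controller decodes  */
        uint32_t vector_ctrl;   /* bit 0 = mask this vector                        */
    };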


9. Power Management — ASPM

ASPM (Active State Power Management) defines low-power link states that both sides can enter when the link is idle:

State | Name      | Power     | Exit latency | Description
L0    | Active    | Full      | 0            | Normal, fully operational state.
L0s   | Standby   | Low       | < 1 μs       | TX enters low power after idle; fast resume. Each direction enters L0s independently.
L1    | Hibernate | Very low  | ~10 μs       | Both TX and RX sleep. Requires both sides to negotiate entry. Deeper power savings.
L2/L3 | Off       | Near zero | ms range     | Main power removed. The device needs full re-enumeration to resume.
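
Software opts in to ASPM per link via the ASPM Control field (bits [1:0]) of the Link Control register in the device's PCI Express Capability structure. A sketch using hypothetical config-space helpers:

    #include <stdint.h>

    /* Placeholder accessors for the PCIe Capability, not a real API. */
    extern uint16_t pcie_cap_read16(uint16_t bdf, uint16_t offset);
    extern void     pcie_cap_write16(uint16_t bdf, uint16_t offset, uint16_t value);

    #define PCIE_LINK_CONTROL 0x10   /* offset of Link Control within the PCIe Capability */
    #define ASPM_DISABLED     0x0    /* bits [1:0]: 00 off, 01 L0s, 10 L1, 11 both        */
    #define ASPM_L0S          0x1
    #define ASPM_L1           0x2
    #define ASPM_L0S_L1       0x3

    static void set_aspm(uint16_t bdf, uint16_t policy)
    {
        uint16_t lnkctl = pcie_cap_read16(bdf, PCIE_LINK_CONTROL);
        lnkctl = (uint16_t)((lnkctl & ~0x3u) | (policy & 0x3u));
        pcie_cap_write16(bdf, PCIE_LINK_CONTROL, lnkctl);
    }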

10. FAQ

What is PCIe and what does it replace?

PCIe (PCI Express) is a high-speed point-to-point serial interconnect introduced in 2003. It replaced three older parallel bus standards: PCI (general expansion cards), AGP (graphics), and PCI-X (servers). All modern motherboards use PCIe for GPUs, NVMe SSDs, NICs, and Wi-Fi cards.

What is a PCIe lane?

One PCIe lane = two differential pairs: one for TX (TX+/TX−) and one for RX (RX+/RX−) = 4 wires. TX and RX operate simultaneously making it full-duplex. Bandwidth scales linearly: x4 = 4× the bandwidth of x1, x16 = 16×.

What is the Root Complex?

The Root Complex (RC) is the PCIe host controller — typically integrated inside the CPU die in modern Intel/AMD platforms, or in a separate chipset. It is the root of the PCIe tree hierarchy. It creates the memory and I/O address windows that map device BARs into the CPU's address space, and it generates the Configuration requests that enumerate all downstream devices at boot.

What is a TLP and a DLLP?

TLP (Transaction Layer Packet): Generated by the Transaction Layer. Carries application data — reads, writes, completions, config accesses. Has a 3 or 4 DW header + optional payload + optional ECRC.

DLLP (Data Link Layer Packet): Generated by the Data Link Layer. Carries ACK/NAK acknowledgements and Flow Control credit updates. DLLPs are never seen by the Transaction Layer — they are generated and consumed by the DLL.

Why does Gen 3 seem to have almost no encoding overhead vs Gen 1/2?

Gen 1 and Gen 2 use 8b/10b encoding — every 8 bits of data costs 10 bits on the wire (20% overhead). Gen 3 switched to 128b/130b — only a 2-bit overhead per 128 data bits (~1.5%). That is why Gen 3, despite raising the raw rate only from 5 GT/s to 8 GT/s (1.6×), still roughly doubles Gen 2's effective per-lane throughput: the jump in encoding efficiency makes up the rest.
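
The worked numbers, using the rates from the table in section 4:

    Gen 2: 5.0 GT/s × (8 / 10)    = 4.00 Gb/s ≈ 500 MB/s per lane
    Gen 3: 8.0 GT/s × (128 / 130) ≈ 7.88 Gb/s ≈ 985 MB/s per lane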

Can a x4 GPU work in a x16 slot?

Yes. PCIe supports link width negotiation during link training. A x4 device inserted into a x16 slot will train up to x4 — the extra lanes remain unused. The device runs at x4 bandwidth. Conversely, a x16 GPU in a x8 slot runs at x8 speed (many motherboards use x8 electrical on a x16 physical slot for secondary GPU slots).

What is the difference between PCIe and NVMe?

PCIe is the electrical/protocol interconnect — the physical interface and packet protocol. NVMe (Non-Volatile Memory Express) is a logical storage protocol that runs on top of PCIe. An NVMe SSD uses PCIe lanes to carry NVMe commands (read, write, trim) to/from the drive's controller. M.2 NVMe SSDs use x4 PCIe lanes. SATA SSDs, by contrast, use the SATA protocol over the SATA interface — completely separate from PCIe.

What changed in PCIe Gen 6?

Gen 6 (64 GT/s per lane) introduced two major changes:

  • PAM4 signaling — instead of two voltage levels (NRZ/PAM2), Gen 6 uses four levels to encode 2 bits per symbol, doubling bandwidth without doubling frequency.
  • FLIT mode — replaces variable-length TLP framing with fixed 256-byte FLITs (flow control units). Forward Error Correction (FEC) is applied per FLIT to compensate for the higher bit-error rate of PAM4, shifting much of the reliability burden from ACK/NAK retransmission to up-front FEC correction.