Cache-Coherent Interconnect

CXL – Compute Express Link

CXL is an open industry standard that extends PCIe with cache coherency and memory semantics. It lets CPUs, GPUs, and AI accelerators share a unified coherent memory space — eliminating costly software-managed data copies and enabling memory expansion beyond DIMM slot limits. Built on the PCIe physical layer, CXL reuses existing PCIe connectors and SerDes while adding the coherency intelligence the data center era demands.

Built on PCIe PHY
3 Sub-Protocols: .io / .cache / .mem
CXL 3.0: 256 GB/s (PCIe 6 PHY)
Intel, AMD, Arm, Samsung, Micron
Overview

What is CXL and Why Does It Exist?

Modern workloads — AI training, in-memory databases, scientific computing — hit two walls that PCIe alone cannot solve: coherency and memory capacity.

The Coherency Problem
With plain PCIe, a GPU cannot directly cache host DRAM. Every time a GPU needs host data, the driver must orchestrate a DMA transfer — allocating pinned buffers, issuing DMA commands, synchronizing completion. CXL.cache makes that invisible: the GPU's caches participate in the CPU's MESI-based coherency domain, just like the caches of another CPU socket.
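The difference is easiest to see as code. Below is a minimal sketch of the two flows; pin_buffer, dma_to_device, dma_wait, and launch_kernel are hypothetical stand-ins for a real driver API, stubbed out so the example compiles — only the shape of each path matters.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical driver API, stubbed for illustration only. */
static void *pin_buffer(size_t n)            { return malloc(n); }
static void dma_to_device(void *p, size_t n) { (void)p; (void)n; }
static void dma_wait(void)                   { }
static void launch_kernel(const float *p, size_t n)
{
    printf("kernel(%p, %zu)\n", (const void *)p, n);
}

/* Plain PCIe: driver-orchestrated staging. */
static void pcie_path(const float *host_data, size_t n)
{
    float *pinned = pin_buffer(n * sizeof *pinned);  /* pinned DMA buffer */
    memcpy(pinned, host_data, n * sizeof *pinned);   /* staging copy      */
    dma_to_device(pinned, n * sizeof *pinned);       /* program the engine */
    dma_wait();                                      /* synchronize       */
    launch_kernel(pinned, n);
    free(pinned);
}

/* CXL.cache: the device caches host DRAM coherently, so the kernel can
 * simply dereference the host pointer -- no pinning, copying, or syncing. */
static void cxl_path(const float *host_data, size_t n)
{
    launch_kernel(host_data, n);
}

int main(void)
{
    float data[4] = { 1, 2, 3, 4 };
    pcie_path(data, 4);
    cxl_path(data, 4);
    return 0;
}
```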
The Memory Capacity Wall
A dual-socket server has 24–32 DIMM slots — typically 6–12 TB max. AI models and in-memory databases are outgrowing this. CXL Type 3 memory expansion modules attach to a PCIe slot and appear to the OS as normal system RAM, breaking the DIMM slot ceiling without processor socket changes.
Why PCIe Physical Layer?
PCIe is already in every server. Reusing its physical layer (connector, cable, SerDes, lane training) means CXL devices plug into standard PCIe slots — zero new hardware infrastructure. CXL is negotiated at link bring-up: if both sides support CXL, they upgrade the link; otherwise it falls back to PCIe.
Software Transparency
CXL memory (Type 3) appears as a standard NUMA node to the OS. Applications do not need modification — the kernel's memory allocator can place pages on CXL-attached memory automatically. Similarly, CXL-coherent accelerators (Type 2) can share pointers with CPU code with no explicit marshalling.
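For illustration, here is a minimal sketch of placing a buffer on CXL-attached memory explicitly with libnuma (the kernel can also do this placement automatically, as noted above). It assumes Linux with libnuma (compile with -lnuma) and that the CXL expander enumerated as NUMA node 2 — the node number is platform-specific; check numactl --hardware.

```c
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this kernel\n");
        return 1;
    }

    int cxl_node = 2;            /* assumption: CXL memory is node 2 */
    size_t sz = 1UL << 30;       /* 1 GiB */

    /* Malloc-style allocation, physically backed by the CXL expander. */
    void *buf = numa_alloc_onnode(sz, cxl_node);
    if (!buf) { perror("numa_alloc_onnode"); return 1; }

    /* Ordinary load/store access -- no DMA, no special API. */
    memset(buf, 0, sz);

    numa_free(buf, sz);
    return 0;
}
```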
CXL in one sentence: CXL is PCIe with two superpowers added — coherent caching across the link (CXL.cache) and load/store-accessible device memory (CXL.mem) — delivered over the exact same physical connector.
Numbers

CXL at a Glance

CXL 1.0 published: 2019
Sub-protocols: 3 (.io / .cache / .mem)
Device types: 3 (Type 1 / 2 / 3)
Bandwidth: 256 GB/s bidirectional (CXL 3.0 x16)
Latency: ~100 ns (Type 3) vs ~80 ns (local DRAM)
Physical layer: PCIe (same connector)
Architecture

CXL's Three Sub-Protocols

CXL multiplexes three independent sub-protocols over a single PCIe physical link. Each has its own channel, flow control, and semantics.

[Figure: protocol stacks of a host CPU (CXL root port) and a CXL device (Type 1/2/3), each with CXL.io (config / DMA), CXL.cache (snoop engine / device cache), and CXL.mem (HDM / device DRAM) channels multiplexed over a shared PCIe physical layer (x1–x16 lanes). Not all sub-protocols are active simultaneously; which are active depends on the device type.]

Fig 1 — CXL's three sub-protocols share a single PCIe physical link. CXL.io provides standard PCIe semantics; CXL.cache adds device-side coherent caching; CXL.mem adds host-accessible device memory. Which channels are active depends on the device type.

CXL.io
Non-coherent I/O protocol — functionally identical to PCIe TLPs. Used for device enumeration (PCI config space), MSI/MSI-X interrupt delivery, and DMA. All CXL devices must support CXL.io; it is the mandatory baseline. Backward-compatible with PCIe software stacks.
CXL.cache
Allows the device to cache host memory coherently. The device sends requests (D2H Request) to access host memory; the host's snoop engine responds (H2D Response) with data and coherency state. Uses a modified MESI protocol — the device's cache lines participate in the host's coherency domain without any CPU driver intervention.
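To make the coherency participation concrete, here is a toy snoop responder for a device-side cache line. This is textbook MESI only — the real CXL.cache channels use a richer message set (SnpData, SnpInv, RspIHitSE, ...) than this sketch models.

```c
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

typedef struct {
    mesi_t next;        /* line state after the snoop */
    int    supply_data; /* must the device write dirty data back? */
} snoop_result;

/* Host wants a readable copy: demote M/E to S, flush if dirty. */
static snoop_result snoop_read(mesi_t cur)
{
    switch (cur) {
    case MODIFIED:  return (snoop_result){ SHARED, 1 };
    case EXCLUSIVE: return (snoop_result){ SHARED, 0 };
    default:        return (snoop_result){ cur, 0 };
    }
}

/* Host wants exclusive ownership: invalidate, flush if dirty. */
static snoop_result snoop_invalidate(mesi_t cur)
{
    return (snoop_result){ INVALID, cur == MODIFIED };
}

int main(void)
{
    snoop_result r = snoop_read(MODIFIED);
    printf("M + snoop-read -> %d (supply data: %d)\n", r.next, r.supply_data);
    r = snoop_invalidate(EXCLUSIVE);
    printf("E + snoop-inv  -> %d (supply data: %d)\n", r.next, r.supply_data);
    return 0;
}
```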
CXL.mem
Allows the host CPU to access device-attached memory using load/store instructions — the memory appears as a NUMA node. The host issues M2S (Master-to-Subordinate) requests; the device responds S2M (Subordinate-to-Master) with data. The exposed range is Host-managed Device Memory (HDM), programmed into the device's HDM decoders, which the host discovers via DVSEC capability structures during enumeration.
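On Linux you can see what the kernel enumerated. A minimal sketch, assuming a kernel built with the CXL driver stack, which exposes /sys/bus/cxl/devices (entry names such as mem0, root0, or decoder0.0 vary by kernel version):

```c
#include <dirent.h>
#include <stdio.h>

int main(void)
{
    DIR *d = opendir("/sys/bus/cxl/devices");
    if (!d) { perror("no CXL bus visible"); return 1; }

    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.') continue;   /* skip . and .. */
        printf("cxl device: %s\n", e->d_name);
    }
    closedir(d);
    return 0;
}
```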
Device Classification

CXL Device Types 1, 2, and 3

A device's type is determined by which sub-protocols it implements. This defines its role in the system.

T1
Type 1 Device
Has a cache, no device memory
CXL.io CXL.cache
The device maintains a coherent cache of host memory. It can hold copies of host DRAM lines and participate in snoop traffic. The CPU does NOT access any device-local memory — there is none. Type 1 is ideal for devices that need to process host data locally without expensive copies.
Examples: Smart NICs, FPGAs used as processing accelerators, CXL-attached cryptographic engines
T2
Type 2 Device
Has a cache AND device memory
CXL.io CXL.cache CXL.mem
The most capable type. The device caches host memory AND the host can load/store into device-local memory (HBM, GDDR). Full bidirectional coherency — a CPU thread can write to device memory and a GPU shader can read it without any explicit synchronization primitive. This is the holy grail for zero-copy ML inference.
Examples: AI accelerators, discrete GPUs with CXL, computational storage devices with DRAM
T3
Type 3 Device
Memory only, no device cache
CXL.io CXL.mem
A pure memory expansion device. It has no on-device cache and does not participate in coherency snoops. Its sole purpose is to add addressable memory capacity to the host system — appearing as a NUMA node in the OS. Latency is higher than local DRAM (~100 ns vs ~80 ns) but bandwidth is competitive with a DDR5 channel.
Examples: CXL DRAM expansion modules, Samsung CMM-D/CMM-H modules, Micron CZ120 CXL memory, SK hynix CXL memory modules (CMM-DDR5)
Interview tip: Type 3 is the most commercially deployed in 2024–2025 (memory expansion). Type 2 is the most architecturally interesting (full coherency). When asked "what type is a GPU?", the answer is Type 2 — it has both its own memory (GDDR/HBM) and a cache that can coherently access host DRAM.
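The protocol-to-type mapping is mechanical enough to state in code. A sketch with illustrative capability flags — these are not the real DVSEC bit positions:

```c
#include <stdio.h>

#define CXL_IO    (1u << 0)   /* mandatory baseline for every CXL device */
#define CXL_CACHE (1u << 1)
#define CXL_MEM   (1u << 2)

static const char *cxl_device_type(unsigned caps)
{
    if (!(caps & CXL_IO))
        return "not a CXL device";
    switch (caps & (CXL_CACHE | CXL_MEM)) {
    case CXL_CACHE:            return "Type 1 (coherent cache, no host-visible memory)";
    case CXL_CACHE | CXL_MEM:  return "Type 2 (cache + host-addressable memory)";
    case CXL_MEM:              return "Type 3 (memory expander)";
    default:                   return "CXL.io only";
    }
}

int main(void)
{
    printf("%s\n", cxl_device_type(CXL_IO | CXL_MEM));              /* Type 3 */
    printf("%s\n", cxl_device_type(CXL_IO | CXL_CACHE | CXL_MEM));  /* Type 2 */
    return 0;
}
```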
System Topology

CXL System Topologies

[Figure: Left panel, CXL 1.1 direct attach: a host CPU's CXL root port connects to a Type 1 smart NIC (.io + .cache), a Type 2 AI accelerator (.io + .cache + .mem), and a Type 3 memory expander (.io + .mem). Right panel, CXL 2.0 memory pooling: host CPUs in Server 1 and Server 2 attach through a CXL switch to three 256 GB DDR5 memory pools; a Fabric Manager assigns pools to hosts dynamically (e.g., Host A uses Pools 1+2 while Host B uses Pool 3), reconfigurable at runtime.]

Fig 2 — Left: CXL 1.1 direct attach — a single host connects to Type 1/2/3 devices. Right: CXL 2.0 memory pooling — a CXL switch allows multiple server hosts to share a pool of CXL memory modules, with a Fabric Manager controlling partitioning.

Version History

CXL Versions 1.0 → 3.1

1.0
CXL 1.0
2019 · PCIe 5.0 PHY · 32 GT/s per lane
Initial specification. Defined the three sub-protocols (CXL.io, CXL.cache, CXL.mem) and the three device types. Single host-to-device connection only. No switch support. Established the DVSEC (Designated Vendor-Specific Extended Capability) registers for CXL device enumeration.
1.1
CXL 1.1
2019 · PCIe 5.0 PHY · minor errata release
Bug fixes and clarifications to CXL 1.0. Added viral alert support (error propagation), improved RAS (Reliability, Availability, Serviceability) definitions. Most first-generation CXL IP is implemented against CXL 1.1. Intel Sapphire Rapids (2023) implements CXL 1.1.
2.0
CXL 2.0
2020 · PCIe 5.0 PHY · memory pooling
The major architectural upgrade. Added CXL switching — a CXL switch enables multiple hosts to share a pool of memory devices. Added persistent memory (PMem) support via CXL.mem. Introduced the Fabric Manager (FM) API for dynamic resource allocation. Hot-plug and surprise removal support. Memory interleave across multiple CXL devices.
3.0
CXL 3.0
2022 · PCIe 6.0 PHY (PAM4) · 64 GT/s per lane
Doubled raw bandwidth via the PCIe 6.0 PHY (256 GB/s on x16). Added multi-level switch topologies for large CXL fabric deployment. Introduced peer-to-peer (P2P) coherency — two accelerators can share memory coherently without CPU involvement. Added Back-Invalidate channels to CXL.mem, letting a device invalidate host-cached lines (a prerequisite for large coherent domains). Shared Fabric-Attached Memory (FAM) accessible by multiple hosts simultaneously. Enhanced RAS with link-level integrity.
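Where the headline number comes from: PCIe 6.0 signals 32 GBaud with PAM4 (two bits per symbol), giving 64 GT/s per lane, i.e. about 8 GB/s of payload per lane per direction before FLIT framing and FEC overhead. Across 16 lanes that is roughly 64 × 16 / 8 ≈ 128 GB/s each way; the 256 GB/s figure counts both directions at once.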
3.1
CXL 3.1
2023 · PCIe 6.0 PHY · FAM enhancements
Refined shared-FAM semantics for large-scale memory disaggregation. Added the Trusted Security Protocol (TSP) for confidential computing on CXL memory. Improved Multi-Logical Device (MLD) support, allowing a single physical CXL device to appear as multiple logical devices to different hosts. Targeted at hyperscale data center rack-scale memory architectures.
Comparison

CXL vs PCIe vs DRAM vs NVLink

| Property | Local DRAM | CXL (Type 3) | PCIe (DMA) | NVLink-C2C |
|---|---|---|---|---|
| Coherent? | Yes (native) | Yes (CXL.mem) | No (DMA only) | Yes (NVIDIA) |
| Load/store accessible? | Yes | Yes (NUMA node) | No (requires DMA) | Yes |
| Latency | ~80 ns | ~100–120 ns | ~1–5 µs (DMA round trip) | ~40 ns (on-package) |
| Bandwidth | ~300 GB/s (8-ch DDR5) | 128–256 GB/s (x16, CXL 2/3) | 64–128 GB/s (x16, PCIe 5/6) | ~900 GB/s (NVLink 4) |
| Standard | JEDEC DDR5 | CXL Consortium (open) | PCI-SIG (open) | NVIDIA (proprietary) |
| Physical | DIMM slot | PCIe slot / E3.S / EDSFF | PCIe slot | On-package bumps |
| Use case | All workloads | Memory expansion, AI | GPU compute, NVMe | Grace-Hopper HPC |
Applications

Real-World CXL Use Cases

Memory Capacity Expansion
The dominant 2024–2025 use case. In-memory databases (SAP HANA, Redis, MemSQL) and AI inference servers need more RAM than DIMM slots allow. CXL Type 3 modules (Samsung CMM-D, Micron CZ120) plug into PCIe slots and appear as NUMA nodes — Linux places cold pages there transparently via tiered memory (DAMON/AutoNUMA).
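The same migration the kernel performs automatically can be driven by hand with the move_pages(2) syscall. A minimal sketch, assuming libnuma headers (compile with -lnuma) and that the CXL expander is NUMA node 2 (platform-specific):

```c
#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    long page_sz = sysconf(_SC_PAGESIZE);
    void *page;
    if (posix_memalign(&page, page_sz, page_sz)) return 1;
    ((char *)page)[0] = 1;            /* fault the page in on a local node */

    void *pages[1]  = { page };
    int   nodes[1]  = { 2 };          /* assumption: CXL memory is node 2 */
    int   status[1] = { -1 };

    /* Migrate this page; status[0] returns the resulting node,
     * or a negative errno value if the move failed. */
    if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) != 0)
        perror("move_pages");
    printf("page now on node %d\n", status[0]);

    free(page);
    return 0;
}
```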
AI/ML Accelerator Coherency
Type 2 devices (AI chips, FPGAs) can load model weights from host DRAM without any explicit DMA setup. Zero-copy inference: CPU writes tokenized inputs to a shared buffer, the AI accelerator reads them directly via CXL.cache. Reduces host CPU load and inference latency simultaneously.
Smart NIC / DPU Offload
Type 1 CXL smart NICs cache relevant network packet headers and connection tracking tables from host memory into their on-chip SRAM — coherently. No DMA, no cache invalidation overhead. The NIC sees the same view of connection state as the host kernel's TCP stack.
Rack-Scale Memory Disaggregation
CXL 2.0/3.0 switches enable a rack of compute blades to dynamically share a pool of CXL memory blades. A latency-sensitive job claims more memory at runtime; when done, releases it to another blade. Hyperscalers (Meta, Google, Microsoft) are deploying this to reduce stranded memory waste across server fleets.
FAQ

Frequently Asked Questions

How is CXL different from plain PCIe?

CXL (Compute Express Link) runs on the PCIe physical layer — same connector, same SerDes, same lane widths. But where PCIe only provides non-coherent I/O (you must use DMA to move data), CXL adds two capabilities: CXL.cache lets the device coherently cache host DRAM, and CXL.mem lets the host load/store into device memory as if it were system RAM. The coherency intelligence lives in the CXL protocol layers above the PCIe PHY, negotiated during link initialization.

What do the three sub-protocols each do?

CXL.io is the non-coherent baseline — functionally equivalent to PCIe TLPs for config, DMA, and interrupts. CXL.cache gives the device a coherent window into host memory using a Req/Snoop/Data handshake compatible with MESI. CXL.mem gives the host a load/store window into device-attached memory (HBM, DDR5) that appears as a NUMA node. All three share the same physical PCIe lanes simultaneously, arbitrated by the Flex Bus ARB/MUX layer.

What distinguishes device Types 1, 2, and 3?

Type 1 (CXL.io + CXL.cache): device has a coherent cache of host memory, no device memory accessible by host. Smart NICs, processing FPGAs. Type 2 (all three): device has both a coherent cache and device memory the host can address. AI accelerators, GPUs. Type 3 (CXL.io + CXL.mem): pure memory expansion — the device adds addressable memory capacity to the host as a NUMA node but has no on-device cache. DDR5/LPDDR5 memory expansion modules.

What did CXL 2.0 add?

CXL 2.0 introduced CXL switching — a CXL switch allows multiple host CPUs to share a pool of CXL memory devices, enabling memory disaggregation. A Fabric Manager (FM) API controls how memory pools are dynamically partitioned between hosts at runtime. CXL 2.0 also added persistent memory support (CXL.mem to PMem), hot-plug, surprise removal, and memory interleaving across multiple Type 3 devices.

What's new in CXL 3.0 and 3.1?

CXL 3.0 (2022) moved to the PCIe 6.0 physical layer (PAM4, 64 GT/s per lane) — doubling bandwidth to 256 GB/s on x16. Beyond bandwidth, it added peer-to-peer coherency between accelerators (not just host-to-device), multi-level switch fabric topology, back-invalidation for large coherent domains, and Shared Fabric-Attached Memory (FAM) simultaneously accessible by multiple hosts. CXL 3.1 refined FAM semantics and added confidential computing support.

How does CXL memory appear to the operating system?

CXL Type 3 memory appears to the OS as an additional NUMA (Non-Uniform Memory Access) node — distinct from local DRAM NUMA nodes. The BIOS/UEFI enumerates the CXL device's Host-managed Device Memory (HDM) range and registers it in the ACPI SRAT and HMAT tables. Linux's tiered memory subsystem (using DAMON or AutoNUMA) can automatically migrate hot pages to local DRAM and cold pages to slower CXL memory, making the capacity expansion nearly transparent to applications.
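Because the distance information comes from those ACPI tables, the CXL node is easy to spot programmatically: it sits further away than every local node. A small sketch with libnuma (compile with -lnuma; an ACPI SLIT distance of 10 means local):

```c
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) return 1;

    int max = numa_max_node();
    for (int n = 0; n <= max; n++) {
        /* numa_distance() reports the ACPI SLIT value; 10 = local,
         * CXL expanders typically report a larger number. */
        printf("node %d: distance from node 0 = %d\n",
               n, numa_distance(0, n));
    }
    return 0;
}
```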

How much slower is CXL memory than local DRAM?

Local DDR5 DRAM has ~80 ns read latency. CXL Type 3 memory (the same DRAM behind a CXL controller in a PCIe slot) adds ~20–40 ns of protocol overhead, landing at roughly 100–120 ns total. That is far better than RDMA over Ethernet (~1–5 µs) and sufficient for workloads whose hot working set fits in local DRAM with cold data spilled to CXL. The gap should narrow as CXL controllers and PHY generations mature, and CXL 3.0's added bandwidth helps throughput-bound streaming workloads.
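Those numbers can be checked with a pointer-chase microbenchmark, which measures true dependent-load latency. A sketch assuming libnuma (compile with -lnuma), node 0 = local DRAM, and node 2 = the CXL expander (node numbers are platform-specific):

```c
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1UL << 24)          /* 16M pointers = 128 MiB, defeats the L3 */

static volatile void *sink;    /* keeps the chase from being optimized out */

static double chase_ns(int node)
{
    void **buf = numa_alloc_onnode(N * sizeof(void *), node);
    size_t *perm = malloc(N * sizeof(size_t));
    if (!buf || !perm) { perror("alloc"); exit(1); }

    /* Fisher-Yates shuffle -> one dependent chain through all N slots,
     * so each load's address depends on the previous load. */
    for (size_t i = 0; i < N; i++) perm[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < N; i++)
        buf[perm[i]] = &buf[perm[(i + 1) % N]];
    free(perm);

    struct timespec t0, t1;
    void **p = (void **)buf[0];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++)
        p = (void **)*p;                      /* serialized dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = p;

    numa_free(buf, N * sizeof(void *));
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / N;
}

int main(void)
{
    if (numa_available() < 0) return 1;
    printf("local DRAM (node 0): %.1f ns/load\n", chase_ns(0));
    printf("CXL memory (node 2): %.1f ns/load\n", chase_ns(2)); /* assumed node */
    return 0;
}
```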