CXL (Compute Express Link) is an open industry standard that extends PCIe with cache coherency and memory semantics. It lets CPUs, GPUs, and AI accelerators share a unified coherent memory space, eliminating costly software-managed data copies and enabling memory expansion beyond DIMM slot limits. Built on the PCIe physical layer, CXL reuses existing PCIe connectors and SerDes while adding the coherency intelligence the data center era demands.
Modern workloads — AI training, in-memory databases, scientific computing — hit two walls that PCIe alone cannot solve: coherency and memory capacity.
CXL multiplexes three independent sub-protocols over a single PCIe physical link. Each has its own channel, flow control, and semantics.
Fig 1 — CXL's three sub-protocols share a single PCIe physical link. CXL.io provides standard PCIe semantics; CXL.cache adds device-side coherent caching; CXL.mem adds host-accessible device memory. Which channels are active depends on the device type.
A device's type is determined by which sub-protocols it implements. This defines its role in the system.
Fig 2 — Left: CXL 1.1 direct attach — a single host connects to Type 1/2/3 devices. Right: CXL 2.0 memory pooling — a CXL switch allows multiple server hosts to share a pool of CXL memory modules, with a Fabric Manager controlling partitioning.
| Property | Local DRAM | CXL (Type 3) | PCIe (DMA) | NVLink-C2C |
|---|---|---|---|---|
| Coherent? | Yes (native) | Yes (CXL.mem) | No (DMA only) | Yes (NVLink coherency) |
| Load/Store Accessible? | Yes | Yes (NUMA node) | No (requires DMA) | Yes |
| Latency | ~80 ns | ~100–120 ns | ~1–5 µs (DMA round trip) | ~40 ns (on-package) |
| Bandwidth | ~300 GB/s (DDR5, 8-ch) | 64–128 GB/s per direction (x16, CXL 2.x/3.x) | 64–128 GB/s per direction (x16, PCIe 5/6) | ~450 GB/s per direction, ~900 GB/s total (NVLink 4) |
| Standard | JEDEC DDR5 | CXL Consortium (open) | PCI-SIG (open) | NVIDIA (proprietary) |
| Physical | DIMM slot | PCIe slot / E3.S / EDSFF | PCIe slot | On-package bumps |
| Use case | All workloads | Memory expansion, AI | GPU compute, NVMe | Grace-Hopper HPC |
CXL runs on the PCIe physical layer: same connector, same SerDes, same lane widths. But where PCIe provides only non-coherent I/O (you must use DMA to move data), CXL adds two capabilities: CXL.cache lets the device coherently cache host DRAM, and CXL.mem lets the host load/store into device memory as if it were system RAM. The coherency intelligence lives in the CXL protocol layers above the PCIe PHY, negotiated during link initialization.
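To make the load/store model concrete, here is a minimal sketch assuming the Type 3 region has been bound to Linux device-DAX and shows up at a hypothetical `/dev/dax0.0` (the path, size, and alignment are system-specific). Once mapped, the CXL memory behaves like ordinary cacheable RAM: plain CPU stores and loads, no DMA descriptors, no doorbells.

```c
/* Minimal sketch: direct load/store into CXL device memory.
 * Assumes the Type 3 region is bound to device-DAX and appears
 * as /dev/dax0.0 (hypothetical, system-specific path).
 * Build: cc -o cxl_poke cxl_poke.c
 */
#include <fcntl.h>
#include <inttypes.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 2 * 1024 * 1024;           /* one 2 MiB dax-aligned chunk */
    int fd = open("/dev/dax0.0", O_RDWR);   /* hypothetical dax node */
    if (fd < 0) { perror("open"); return 1; }

    /* With CXL.mem the mapping is plain cacheable system memory. */
    uint64_t *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    p[0] = 0xC0FFEE;                            /* CPU store into device DRAM */
    printf("read back: 0x%" PRIx64 "\n", p[0]); /* CPU load from device DRAM  */

    munmap(p, len);
    close(fd);
    return 0;
}
```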
CXL.io is the non-coherent baseline — functionally equivalent to PCIe TLPs for config, DMA, and interrupts. CXL.cache gives the device a coherent window into host memory using a Req/Snoop/Data handshake compatible with MESI. CXL.mem gives the host a load/store window into device-attached memory (HBM, DDR5) that appears as a NUMA node. All three share the same physical PCIe lanes simultaneously via arbitration in the CXL Flex Bus layer.
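The Req/Snoop/Data handshake is easiest to see in a toy MESI model. The sketch below is illustrative only: the state machine is textbook MESI, and the snoop names are made up for clarity rather than taken from the CXL spec's opcode list.

```c
/* Toy model of a MESI snoop at a CXL.cache device -- illustrative,
 * not the actual CXL wire protocol or opcode names. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;
typedef enum { SNP_DATA, SNP_INV } snoop_t;  /* host wants a copy / wants ownership */

/* Returns the line's next state; *writeback is set when dirty data
 * must travel back to the host on the data channel. */
static mesi_t snoop(mesi_t state, snoop_t req, bool *writeback)
{
    *writeback = (state == MODIFIED);   /* dirty line: supply data to host */
    switch (req) {
    case SNP_DATA:                      /* host read: demote to Shared */
        return (state == INVALID) ? INVALID : SHARED;
    case SNP_INV:                       /* host write: give up the line */
        return INVALID;
    }
    return state;
}

int main(void)
{
    bool wb;
    mesi_t next = snoop(MODIFIED, SNP_DATA, &wb);
    printf("M + SnpData -> %s, writeback=%d\n",
           next == SHARED ? "S" : "?", wb);
    return 0;
}
```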
- Type 1 (CXL.io + CXL.cache): the device coherently caches host memory but exposes no device memory to the host. Smart NICs, processing FPGAs.
- Type 2 (all three protocols): the device has both a coherent cache and device memory the host can address. AI accelerators, GPUs.
- Type 3 (CXL.io + CXL.mem): pure memory expansion. The device adds addressable capacity to the host as a NUMA node but never caches host memory (no CXL.cache). DDR5/LPDDR5 memory expansion modules.
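In practice a device advertises which sub-protocols it implements through a CXL DVSEC in PCIe extended config space. The sketch below walks the extended capability list for a DVSEC (capability ID 0x23) carrying the CXL Consortium vendor ID 0x1E98 and reads the cache/mem-capable bits. The DVSEC ID (0), register offset (0x0A), and bit positions follow my reading of the CXL 2.0 spec; verify them against the spec before relying on this.

```c
/* Sketch: classify a CXL device from its PCIe config space.
 * Usage: sudo ./cxl_type /sys/bus/pci/devices/0000:3a:00.0/config
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static uint32_t rd32(const uint8_t *cfg, unsigned off)
{
    return cfg[off] | cfg[off + 1] << 8 | cfg[off + 2] << 16 |
           (uint32_t)cfg[off + 3] << 24;
}

int main(int argc, char **argv)
{
    uint8_t cfg[4096];
    if (argc != 2) { fprintf(stderr, "usage: %s <config path>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0 || read(fd, cfg, sizeof cfg) != (ssize_t)sizeof cfg) {
        fprintf(stderr, "read failed (root needed for full 4 KiB config space)\n");
        return 1;
    }

    /* Walk the extended capability list starting at 0x100. */
    for (unsigned off = 0x100; off && off < 0xffc; off = rd32(cfg, off) >> 20) {
        if ((rd32(cfg, off) & 0xffff) != 0x23) continue;        /* not a DVSEC */
        if ((rd32(cfg, off + 4) & 0xffff) != 0x1E98) continue;  /* not CXL */
        if ((rd32(cfg, off + 8) & 0xffff) != 0) continue;       /* device DVSEC is ID 0 (assumed) */

        uint16_t cap = cfg[off + 0x0A] | cfg[off + 0x0B] << 8;  /* capability reg (assumed offset) */
        int cache = cap & 1, mem = (cap >> 2) & 1;              /* assumed bit positions */
        if (cache && mem) puts("Type 2 (CXL.cache + CXL.mem)");
        else if (cache)   puts("Type 1 (CXL.cache only)");
        else if (mem)     puts("Type 3 (CXL.mem only)");
        else              puts("CXL.io only");
        return 0;
    }
    puts("no CXL DVSEC found");
    return 0;
}
```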
CXL 2.0 introduced CXL switching — a CXL switch allows multiple host CPUs to share a pool of CXL memory devices, enabling memory disaggregation. A Fabric Manager (FM) API controls how memory pools are dynamically partitioned between hosts at runtime. CXL 2.0 also added persistent memory support (CXL.mem to PMem), hot-plug, surprise removal, and memory interleaving across multiple Type 3 devices.
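The Fabric Manager's job can be pictured with a toy model: a pool of fixed-size memory blocks that the FM reassigns between hosts at runtime. Purely illustrative; the real FM API is a command interface defined by the CXL spec, not these structs.

```c
/* Toy model of CXL 2.0-style memory pooling: a Fabric Manager
 * reassigns fixed-size blocks of a shared pool between hosts.
 * Illustrative only -- not the real FM command interface. */
#include <stdio.h>

#define BLOCKS 8
#define UNOWNED (-1)

static int owner[BLOCKS];                  /* which host owns each block */

static int fm_assign(int block, int host)  /* FM grants a block to a host */
{
    if (owner[block] != UNOWNED) return -1; /* must be released first */
    owner[block] = host;
    return 0;
}

static void fm_release(int block) { owner[block] = UNOWNED; }

int main(void)
{
    for (int b = 0; b < BLOCKS; b++) owner[b] = UNOWNED;

    fm_assign(0, 0); fm_assign(1, 0);      /* host 0 gets two blocks   */
    fm_assign(2, 1);                       /* host 1 gets one          */
    fm_release(1);                         /* host 0 shrinks at runtime... */
    fm_assign(1, 1);                       /* ...and host 1 grows      */

    for (int b = 0; b < BLOCKS; b++)
        printf("block %d -> host %d\n", b, owner[b]);
    return 0;
}
```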
CXL 3.0 (2022) moved to the PCIe 6.0 physical layer (PAM4 signaling, 64 GT/s per lane), doubling raw x16 bandwidth to roughly 128 GB/s per direction, or the 256 GB/s bidirectional figure usually quoted. Beyond bandwidth, it added peer-to-peer coherency between accelerators (not just host-to-device), multi-level switch fabrics, back-invalidation for large coherent domains, and Shared Fabric-Attached Memory (FAM) simultaneously accessible by multiple hosts. CXL 3.1 refined FAM semantics and added confidential computing support.
CXL Type 3 memory appears to the OS as an additional NUMA (Non-Uniform Memory Access) node, distinct from local DRAM NUMA nodes. The BIOS/UEFI enumerates the CXL device's Host-managed Device Memory (HDM) range and registers it in the ACPI SRAT and HMAT tables. Linux's tiered memory subsystem (using DAMON or AutoNUMA) can then automatically migrate hot pages to local DRAM and demote cold pages to the slower CXL tier, making the capacity expansion nearly transparent to applications.
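Applications that want explicit placement instead of kernel tiering can allocate directly on a node with libnuma. A minimal sketch, assuming local DRAM is node 0 and the CXL region shows up as node 2 (node ids are system-specific; check `numactl --hardware` on the target box):

```c
/* Sketch: place cold data on a CXL-backed NUMA node with libnuma.
 * Node 2 as the CXL node is an assumption for illustration.
 * Build: cc -o tier tier.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) { fputs("no NUMA support\n", stderr); return 1; }

    size_t sz = 64UL << 20;                 /* 64 MiB working buffers */
    void *hot  = numa_alloc_onnode(sz, 0);  /* hot data: local DRAM */
    void *cold = numa_alloc_onnode(sz, 2);  /* cold data: CXL node (assumed id) */
    if (!hot || !cold) { fputs("alloc failed\n", stderr); return 1; }

    memset(hot, 0, sz);                     /* touch pages so they actually commit */
    memset(cold, 0, sz);

    printf("max node id: %d\n", numa_max_node());
    numa_free(hot, sz);
    numa_free(cold, sz);
    return 0;
}
```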
Local DDR5 DRAM has ~80 ns read latency. CXL Type 3 memory (the same DRAM behind a CXL controller in a PCIe slot) adds ~20–40 ns of protocol overhead, landing at roughly 100–120 ns total. That is far better than RDMA over Ethernet (~1–5 µs) and sufficient for workloads whose hot working set fits in local DRAM with cold data spilled to CXL. The gap narrows as CXL PHY generations improve, and CXL 3.0's doubled bandwidth also cuts queueing delays for bandwidth-hungry streaming workloads.
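One way to check these numbers on real hardware is a dependent pointer chase, which defeats the prefetcher so every load pays full memory latency. A rough sketch (single run, no cache or TLB control, so treat results as ballpark):

```c
/* Sketch: average load-to-use latency via a dependent pointer chase.
 * Run once against local DRAM and once bound to the CXL node to see
 * the ~20-40 ns protocol overhead. Illustrative, not rigorous. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024 / sizeof(void *))  /* 64 MiB, far beyond LLC */
#define STEPS (16 * 1024 * 1024)

int main(void)
{
    /* Build a random cyclic permutation: each load depends on the
     * previous one, so the hardware prefetcher can't help. */
    void **buf = malloc(N * sizeof(void *));
    size_t *idx = malloc(N * sizeof(size_t));
    for (size_t i = 0; i < N; i++) idx[i] = i;
    for (size_t i = N - 1; i > 0; i--) {       /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);   /* rand() is fine for a sketch */
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < N; i++)
        buf[idx[i]] = &buf[idx[(i + 1) % N]];

    struct timespec t0, t1;
    void **p = buf;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < STEPS; i++)
        p = (void **)*p;                       /* dependent load chain */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per load (last=%p)\n", ns / STEPS, (void *)p);
    free(buf); free(idx);
    return 0;
}
```

Run it twice and compare the per-load figures: once normally, and once bound to the CXL node, e.g. `numactl --membind=2 ./chase` (node id assumed, as above).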