CXL (Compute Express Link) is an open industry standard that extends PCIe with cache coherency and memory semantics. It lets CPUs, GPUs, and AI accelerators share a unified coherent memory space, eliminating costly software-managed data copies and enabling memory expansion beyond DIMM slot limits. Built on the PCIe physical layer, CXL reuses existing PCIe connectors and SerDes while adding the coherency intelligence the data center era demands.
Modern workloads — AI training, in-memory databases, scientific computing — hit two walls that PCIe alone cannot solve: coherency and memory capacity.
CXL multiplexes three independent sub-protocols over a single PCIe physical link. Each has its own channel, flow control, and semantics.
Fig 1 — CXL's three sub-protocols share a single PCIe physical link. CXL.io provides standard PCIe semantics; CXL.cache adds device-side coherent caching; CXL.mem adds host-accessible device memory. Which channels are active depends on the device type.
A device's type is determined by which sub-protocols it implements. This defines its role in the system.
Fig 2 — Left: CXL 1.1 direct attach — a single host connects to Type 1/2/3 devices. Right: CXL 2.0 memory pooling — a CXL switch allows multiple server hosts to share a pool of CXL memory modules, with a Fabric Manager controlling partitioning.
| Property | Local DRAM | CXL (Type 3) | PCIe (DMA) | NVLink-C2C |
|---|---|---|---|---|
| Coherent? | Yes (native) | Yes (CXL.mem) | No (DMA only) | Yes (NVLink coherency) |
| Load/Store Accessible? | Yes | Yes (NUMA node) | No (requires DMA) | Yes |
| Latency | ~80 ns | ~100–120 ns | ~1–5 µs (DMA round trip) | ~40 ns (on-package) |
| Bandwidth | ~300 GB/s (DDR5, 8-ch) | 64–128 GB/s per direction (x16, CXL 2.x/3.x) | 64–128 GB/s per direction (x16, PCIe 5/6) | ~450 GB/s per direction, ~900 GB/s total (NVLink 4) |
| Standard | JEDEC DDR5 | CXL Consortium (open) | PCI-SIG (open) | NVIDIA (proprietary) |
| Physical | DIMM slot | PCIe slot / E3.S / EDSFF | PCIe slot | On-package bumps |
| Use case | All workloads | Memory expansion, AI | GPU compute, NVMe | Grace-Hopper HPC |
CXL runs on the PCIe physical layer: same connector, same SerDes, same lane widths. But where PCIe provides only non-coherent I/O (you must use DMA to move data), CXL adds two capabilities: CXL.cache lets the device coherently cache host DRAM, and CXL.mem lets the host load/store into device memory as if it were system RAM. The coherency intelligence lives in the CXL protocol layers above the PCIe PHY, negotiated during link initialization.
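To make the load/store model concrete, here is a minimal sketch assuming the Type 3 region has been bound to Linux device-DAX and shows up at a hypothetical `/dev/dax0.0` (the path, size, and alignment are system-specific). Once mapped, the CXL memory behaves like ordinary cacheable RAM: plain CPU stores and loads, no DMA descriptors, no doorbells.

```c
/* Minimal sketch: direct load/store into CXL device memory.
 * Assumes the Type 3 region is bound to device-DAX and appears
 * as /dev/dax0.0 (hypothetical, system-specific path).
 * Build: cc -o cxl_poke cxl_poke.c
 */
#include <fcntl.h>
#include <inttypes.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 2 * 1024 * 1024;           /* one 2 MiB dax-aligned chunk */
    int fd = open("/dev/dax0.0", O_RDWR);   /* hypothetical dax node */
    if (fd < 0) { perror("open"); return 1; }

    /* With CXL.mem the mapping is plain cacheable system memory. */
    uint64_t *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    p[0] = 0xC0FFEE;                            /* CPU store into device DRAM */
    printf("read back: 0x%" PRIx64 "\n", p[0]); /* CPU load from device DRAM  */

    munmap(p, len);
    close(fd);
    return 0;
}
```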
CXL.io is the non-coherent baseline — functionally equivalent to PCIe TLPs for config, DMA, and interrupts. CXL.cache gives the device a coherent window into host memory using a Req/Snoop/Data handshake compatible with MESI. CXL.mem gives the host a load/store window into device-attached memory (HBM, DDR5) that appears as a NUMA node. All three share the same physical PCIe lanes simultaneously via arbitration in the CXL Flex Bus layer.
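The Req/Snoop/Data handshake is easiest to see in a toy MESI model. The sketch below is illustrative only: the state machine is textbook MESI, and the snoop names are made up for clarity rather than taken from the CXL spec's opcode list.

```c
/* Toy model of a MESI snoop at a CXL.cache device -- illustrative,
 * not the actual CXL wire protocol or opcode names. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;
typedef enum { SNP_DATA, SNP_INV } snoop_t;  /* host wants a copy / wants ownership */

/* Returns the line's next state; *writeback is set when dirty data
 * must travel back to the host on the data channel. */
static mesi_t snoop(mesi_t state, snoop_t req, bool *writeback)
{
    *writeback = (state == MODIFIED);   /* dirty line: supply data to host */
    switch (req) {
    case SNP_DATA:                      /* host read: demote to Shared */
        return (state == INVALID) ? INVALID : SHARED;
    case SNP_INV:                       /* host write: give up the line */
        return INVALID;
    }
    return state;
}

int main(void)
{
    bool wb;
    mesi_t next = snoop(MODIFIED, SNP_DATA, &wb);
    printf("M + SnpData -> %s, writeback=%d\n",
           next == SHARED ? "S" : "?", wb);
    return 0;
}
```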
- Type 1 (CXL.io + CXL.cache): the device coherently caches host memory but exposes no device memory to the host. Smart NICs, processing FPGAs.
- Type 2 (all three protocols): the device has both a coherent cache and device memory the host can address. AI accelerators, GPUs.
- Type 3 (CXL.io + CXL.mem): pure memory expansion. The device adds addressable capacity to the host as a NUMA node but never caches host memory (no CXL.cache). DDR5/LPDDR5 memory expansion modules.
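In practice a device advertises which sub-protocols it implements through a CXL DVSEC in PCIe extended config space. The sketch below walks the extended capability list for a DVSEC (capability ID 0x23) carrying the CXL Consortium vendor ID 0x1E98 and reads the cache/mem-capable bits. The DVSEC ID (0), register offset (0x0A), and bit positions follow my reading of the CXL 2.0 spec; verify them against the spec before relying on this.

```c
/* Sketch: classify a CXL device from its PCIe config space.
 * Usage: sudo ./cxl_type /sys/bus/pci/devices/0000:3a:00.0/config
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static uint32_t rd32(const uint8_t *cfg, unsigned off)
{
    return cfg[off] | cfg[off + 1] << 8 | cfg[off + 2] << 16 |
           (uint32_t)cfg[off + 3] << 24;
}

int main(int argc, char **argv)
{
    uint8_t cfg[4096];
    if (argc != 2) { fprintf(stderr, "usage: %s <config path>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0 || read(fd, cfg, sizeof cfg) != (ssize_t)sizeof cfg) {
        fprintf(stderr, "read failed (root needed for full 4 KiB config space)\n");
        return 1;
    }

    /* Walk the extended capability list starting at 0x100. */
    for (unsigned off = 0x100; off && off < 0xffc; off = rd32(cfg, off) >> 20) {
        if ((rd32(cfg, off) & 0xffff) != 0x23) continue;        /* not a DVSEC */
        if ((rd32(cfg, off + 4) & 0xffff) != 0x1E98) continue;  /* not CXL */
        if ((rd32(cfg, off + 8) & 0xffff) != 0) continue;       /* device DVSEC is ID 0 (assumed) */

        uint16_t cap = cfg[off + 0x0A] | cfg[off + 0x0B] << 8;  /* capability reg (assumed offset) */
        int cache = cap & 1, mem = (cap >> 2) & 1;              /* assumed bit positions */
        if (cache && mem) puts("Type 2 (CXL.cache + CXL.mem)");
        else if (cache)   puts("Type 1 (CXL.cache only)");
        else if (mem)     puts("Type 3 (CXL.mem only)");
        else              puts("CXL.io only");
        return 0;
    }
    puts("no CXL DVSEC found");
    return 0;
}
```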
CXL 2.0 introduced CXL switching — a CXL switch allows multiple host CPUs to share a pool of CXL memory devices, enabling memory disaggregation. A Fabric Manager (FM) API controls how memory pools are dynamically partitioned between hosts at runtime. CXL 2.0 also added persistent memory support (CXL.mem to PMem), hot-plug, surprise removal, and memory interleaving across multiple Type 3 devices.
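The Fabric Manager's job can be pictured with a toy model: a pool of fixed-size memory blocks that the FM reassigns between hosts at runtime. Purely illustrative; the real FM API is a command interface defined by the CXL spec, not these structs.

```c
/* Toy model of CXL 2.0-style memory pooling: a Fabric Manager
 * reassigns fixed-size blocks of a shared pool between hosts.
 * Illustrative only -- not the real FM command interface. */
#include <stdio.h>

#define BLOCKS 8
#define UNOWNED (-1)

static int owner[BLOCKS];                  /* which host owns each block */

static int fm_assign(int block, int host)  /* FM grants a block to a host */
{
    if (owner[block] != UNOWNED) return -1; /* must be released first */
    owner[block] = host;
    return 0;
}

static void fm_release(int block) { owner[block] = UNOWNED; }

int main(void)
{
    for (int b = 0; b < BLOCKS; b++) owner[b] = UNOWNED;

    fm_assign(0, 0); fm_assign(1, 0);      /* host 0 gets two blocks   */
    fm_assign(2, 1);                       /* host 1 gets one          */
    fm_release(1);                         /* host 0 shrinks at runtime... */
    fm_assign(1, 1);                       /* ...and host 1 grows      */

    for (int b = 0; b < BLOCKS; b++)
        printf("block %d -> host %d\n", b, owner[b]);
    return 0;
}
```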
CXL 3.0 (2022) moved to the PCIe 6.0 physical layer (PAM4 signaling, 64 GT/s per lane), doubling raw x16 bandwidth to roughly 128 GB/s per direction, or the 256 GB/s bidirectional figure usually quoted. Beyond bandwidth, it added peer-to-peer coherency between accelerators (not just host-to-device), multi-level switch fabrics, back-invalidation for large coherent domains, and Shared Fabric-Attached Memory (FAM) simultaneously accessible by multiple hosts. CXL 3.1 refined FAM semantics and added confidential computing support.
CXL Type 3 memory appears to the OS as an additional NUMA (Non-Uniform Memory Access) node, distinct from local DRAM NUMA nodes. The BIOS/UEFI enumerates the CXL device's Host-managed Device Memory (HDM) range and registers it in the ACPI SRAT and HMAT tables. Linux's tiered memory subsystem (using DAMON or AutoNUMA) can then automatically migrate hot pages to local DRAM and demote cold pages to the slower CXL tier, making the capacity expansion nearly transparent to applications.
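Applications that want explicit placement instead of kernel tiering can allocate directly on a node with libnuma. A minimal sketch, assuming local DRAM is node 0 and the CXL region shows up as node 2 (node ids are system-specific; check `numactl --hardware` on the target box):

```c
/* Sketch: place cold data on a CXL-backed NUMA node with libnuma.
 * Node 2 as the CXL node is an assumption for illustration.
 * Build: cc -o tier tier.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) { fputs("no NUMA support\n", stderr); return 1; }

    size_t sz = 64UL << 20;                 /* 64 MiB working buffers */
    void *hot  = numa_alloc_onnode(sz, 0);  /* hot data: local DRAM */
    void *cold = numa_alloc_onnode(sz, 2);  /* cold data: CXL node (assumed id) */
    if (!hot || !cold) { fputs("alloc failed\n", stderr); return 1; }

    memset(hot, 0, sz);                     /* touch pages so they actually commit */
    memset(cold, 0, sz);

    printf("max node id: %d\n", numa_max_node());
    numa_free(hot, sz);
    numa_free(cold, sz);
    return 0;
}
```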
Local DDR5 DRAM has ~80 ns read latency. CXL Type 3 memory (the same DRAM behind a CXL controller in a PCIe slot) adds ~20–40 ns of protocol overhead, landing at roughly 100–120 ns total. That is far better than RDMA over Ethernet (~1–5 µs) and sufficient for workloads whose hot working set fits in local DRAM with cold data spilled to CXL. The gap narrows as CXL PHY generations improve, and CXL 3.0's doubled bandwidth also cuts queueing delays for bandwidth-hungry streaming workloads.
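One way to check these numbers on real hardware is a dependent pointer chase, which defeats the prefetcher so every load pays full memory latency. A rough sketch (single run, no cache or TLB control, so treat results as ballpark):

```c
/* Sketch: average load-to-use latency via a dependent pointer chase.
 * Run once against local DRAM and once bound to the CXL node to see
 * the ~20-40 ns protocol overhead. Illustrative, not rigorous. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024 / sizeof(void *))  /* 64 MiB, far beyond LLC */
#define STEPS (16 * 1024 * 1024)

int main(void)
{
    /* Build a random cyclic permutation: each load depends on the
     * previous one, so the hardware prefetcher can't help. */
    void **buf = malloc(N * sizeof(void *));
    size_t *idx = malloc(N * sizeof(size_t));
    for (size_t i = 0; i < N; i++) idx[i] = i;
    for (size_t i = N - 1; i > 0; i--) {       /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);   /* rand() is fine for a sketch */
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < N; i++)
        buf[idx[i]] = &buf[idx[(i + 1) % N]];

    struct timespec t0, t1;
    void **p = buf;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < STEPS; i++)
        p = (void **)*p;                       /* dependent load chain */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per load (last=%p)\n", ns / STEPS, (void *)p);
    free(buf); free(idx);
    return 0;
}
```

Run it twice and compare the per-load figures: once normally, and once bound to the CXL node, e.g. `numactl --membind=2 ./chase` (node id assumed, as above).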