How CCIX enables CPU–accelerator cache coherency over PCIe — architecture, transaction types, coherency model, and comparison with CXL.
CCIX (Cache Coherent Interconnect for Accelerators) is an open industry standard protocol that adds hardware cache coherency between host CPUs and attached accelerators over a PCIe physical link. Developed by the CCIX Consortium (founded 2016 by AMD, ARM, Huawei, IBM, Mellanox, Qualcomm, Xilinx), it solves the fundamental problem of heterogeneous computing: keeping data consistent between CPU caches and accelerator caches without software intervention.
Without hardware coherency, software must explicitly flush CPU caches before the accelerator reads the data, then write back the accelerator's caches and invalidate stale CPU copies before the CPU reads the results. CCIX performs this bookkeeping in hardware, enabling zero-copy data sharing.
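To make the stale-data problem concrete, here is a minimal Python sketch of a write-back cache sitting in front of shared memory. It is a toy model, not CCIX code: the `WriteBackCache` class, the address `0x1000`, and the values are all hypothetical and exist only to show why the explicit flush is needed when no hardware coherency is present.

```python
class WriteBackCache:
    """Toy write-back cache: writes stay local until flushed to memory."""

    def __init__(self, memory):
        self.memory = memory   # shared backing store (address -> value)
        self.dirty = {}        # locally modified lines not yet written back

    def write(self, addr, value):
        self.dirty[addr] = value   # visible only inside this cache for now

    def read(self, addr):
        return self.dirty.get(addr, self.memory.get(addr))

    def flush(self):
        self.memory.update(self.dirty)   # write every dirty line back
        self.dirty.clear()


memory = {0x1000: 0}
cpu_cache = WriteBackCache(memory)

cpu_cache.write(0x1000, 42)   # CPU produces a value for the accelerator
stale = memory[0x1000]        # accelerator reads memory directly: old value
cpu_cache.flush()             # the explicit software step
fresh = memory[0x1000]        # accelerator now sees the CPU's write
```

In this model the accelerator reads the old value until the flush runs. A coherent interconnect removes the need for that software step by having hardware locate and return the dirty line.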
Key insight: CCIX = PCIe physical layer + ARM CHI-like coherency protocol on top. It reuses the existing PCIe infrastructure (same connectors, same PHY) and adds a new Transaction Layer that carries coherency messages — no new hardware slots needed.
| Property | CCIX | CXL | CAPI (IBM) |
|---|---|---|---|
| Founded | 2016, CCIX Consortium | 2019, Intel-led | 2013, IBM |
| Physical layer | PCIe Gen3/Gen4 | PCIe Gen5/Gen6 | PCIe Gen3+ |
| Protocol basis | ARM CHI-like | CXL.cache, CXL.mem, CXL.io | IBM POWER architecture |
| Coherency model | Full MOESI snoop-based | Full coherency (CXL.cache) | Full coherency |
| Memory semantic | Host memory + device memory | CXL.mem for device memory | Host memory |
| Industry status (2026) | Legacy / limited new adoption | Dominant standard | IBM Power-only |
| Key adopters | AMD EPYC, Xilinx FPGAs, ARM servers | Intel, AMD, NVIDIA, Samsung, all major SoC | IBM Power servers |
Industry shift: CXL has become the dominant cache-coherent interconnect standard. The CXL Consortium, which many CCIX members have joined, has broader industry backing and builds directly on PCI-SIG's PCIe specifications. Designs started in 2024 or later almost exclusively use CXL. CCIX remains relevant for understanding existing AMD EPYC and ARM server deployments.
| Node Type | Role | Example |
|---|---|---|
| RN-F (Request Node - Full) | Fully coherent agent — has a cache, participates in snoops | GPU with cache, FPGA compute engine |
| RN-I (Request Node - I/O) | Non-caching agent: issues reads and writes but holds no cache, so it is never snooped | DMA engine, I/O device |
| HN-F (Home Node - Full) | Point of coherency: receives requests, issues snoops, orders transactions | CPU LLC controller, SoC interconnect (NoC) |
| HN-I (Home Node - I/O) | Manages I/O address space, non-coherent | Peripheral fabric controller |
| SN (Slave Node) | DRAM controller — serves data to HN | DDR5 memory controller |
| Transaction | Direction | Coherency | Purpose |
|---|---|---|---|
| ReadNoSnoop | RN → HN | Non-coherent | DMA-style read, no cache involvement |
| ReadOnce | RN → HN | Coherent, transient | Read data once, don't cache long-term |
| ReadShared | RN → HN | Coherent Shared | Read and cache in Shared state |
| ReadUnique | RN → HN | Exclusive (write intent) | Read with intent to modify — invalidates other copies |
| MakeUnique | RN → HN | Exclusive (upgrade) | Obtain write ownership without a data transfer; used when the whole line will be overwritten |
| WriteNoSnoop | RN → HN | Non-coherent | Non-coherent write to memory |
| WriteUnique | RN → HN | Coherent write | Write from a requester that does not hold the line; HN invalidates other cached copies |
| Evict | RN → HN | Cache management | Notify HN that a clean Shared line is being evicted |
| SnpShared | HN → RN | Snoop | HN-initiated: downgrade cache line to Shared |
| SnpUnique | HN → RN | Snoop | HN-initiated: invalidate cache line (for new exclusive owner) |
| SnpCleanInvalid | HN → RN | Snoop | HN-initiated: writeback dirty data and invalidate |
| State | Meaning | Can Read? | Can Write? | Must Writeback? |
|---|---|---|---|---|
| M — Modified | Only copy, dirty (differs from memory) | Yes | Yes | Yes (on eviction) |
| O — Owned | Dirty, shared with others — owner must supply data on snoop | Yes | No | Yes |
| E — Exclusive | Only copy, clean (matches memory) | Yes | Yes (silent → M) | No |
| S — Shared | Clean, multiple caches may hold | Yes | No (must upgrade) | No |
| I — Invalid | Not present in cache | No (must fetch) | No | No |
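The state table can be encoded directly as data. The sketch below is one possible Python encoding; the names `State`, `PERMISSIONS`, `after_local_write`, and `needs_writeback_on_evict` are illustrative and do not come from the CCIX specification.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    O = "Owned"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


# (can_read, can_write, must_writeback), copied from the table above
PERMISSIONS = {
    State.M: (True, True, True),
    State.O: (True, False, True),
    State.E: (True, True, False),
    State.S: (True, False, False),
    State.I: (False, False, False),
}


def after_local_write(state):
    """Return the state after a local write, or None if the agent
    must first request unique ownership from the home node."""
    can_write = PERMISSIONS[state][1]
    if not can_write:
        return None       # O, S, I: ownership must be requested first
    return State.M        # M stays M; E upgrades silently to M


def needs_writeback_on_evict(state):
    """True if evicting a line in this state must write data back."""
    return PERMISSIONS[state][2]
```

For example, `after_local_write(State.E)` returns `State.M` without any bus traffic, while `after_local_write(State.S)` returns `None`, which signals that an upgrade request is needed first.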
| Layer | CCIX Definition | Standard Equivalent |
|---|---|---|
| Application | Coherent memory transactions | Custom per use case |
| Transaction Layer | CCIX TLP extensions (over PCIe TLP) | PCIe TLP + CCIX header |
| Data Link Layer | PCIe DLLP (unchanged) | PCIe standard |
| Physical Layer | PCIe Gen3/Gen4 SerDes | PCIe standard |
How CCIX reuses PCIe: CCIX support is advertised through a PCIe extended capability structure in configuration space (a Designated Vendor-Specific Extended Capability, DVSEC), which is discovered and configured once the link is up. When both endpoints support CCIX, the link carries CCIX TLPs alongside standard PCIe TLPs, with CCIX traffic on its own virtual channel. No new connectors or cables are required.
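Capability discovery of this kind amounts to walking the PCIe extended capability list. The following Python sketch assumes the CCIX capability is exposed as a DVSEC (extended capability ID 0x0023) and scans a raw configuration-space buffer for one. The function name `find_dvsec` is hypothetical, and `PLACEHOLDER_VENDOR_ID` is deliberately not the real CCIX vendor ID, which should be taken from the specification.

```python
import struct

EXT_CAP_START = 0x100       # PCIe extended capabilities start at this offset
EXT_CAP_ID_DVSEC = 0x0023   # PCIe extended capability ID for DVSEC


def find_dvsec(config, vendor_id):
    """Return config-space offsets of DVSEC structures owned by vendor_id."""
    matches = []
    seen = set()
    offset = EXT_CAP_START
    while offset and offset not in seen and offset + 8 <= len(config):
        seen.add(offset)                        # guard against looping lists
        (header,) = struct.unpack_from("<I", config, offset)
        cap_id = header & 0xFFFF                # bits 15:0
        next_offset = (header >> 20) & 0xFFC    # bits 31:20, dword aligned
        if cap_id == EXT_CAP_ID_DVSEC:
            (dvsec_header1,) = struct.unpack_from("<I", config, offset + 4)
            if (dvsec_header1 & 0xFFFF) == vendor_id:
                matches.append(offset)
        offset = next_offset
    return matches


# Synthetic 4 KiB config space: one unrelated capability, then one DVSEC.
PLACEHOLDER_VENDOR_ID = 0x1234   # hypothetical; not the real CCIX vendor ID
config = bytearray(4096)
struct.pack_into("<I", config, 0x100, 0x0001 | (1 << 16) | (0x140 << 20))
struct.pack_into("<I", config, 0x140, EXT_CAP_ID_DVSEC | (1 << 16))
struct.pack_into("<I", config, 0x144, PLACEHOLDER_VENDOR_ID | (0x10 << 20))

offsets = find_dvsec(config, PLACEHOLDER_VENDOR_ID)
```

On real hardware the buffer would come from the device's configuration space rather than a synthetic `bytearray`, and the fields inside the matched structure would then be decoded according to the CCIX specification.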
CCIX (Cache Coherent Interconnect for Accelerators) is an open standard that extends PCIe with hardware cache coherency between CPUs and accelerators (GPUs, FPGAs, AI chips), eliminating software-managed cache flushes for shared data.
CCIX and CXL both add coherency over PCIe. CCIX (2016) was ARM/AMD-led; CXL (2019) is Intel-led and is now the dominant standard, layered on PCI-SIG's PCIe specifications. CXL has three sub-protocols (CXL.io, CXL.cache, CXL.mem) and broader industry adoption. Designs started in 2024 or later use CXL.
CCIX adopters include AMD EPYC Rome/Milan CPUs, Xilinx Alveo FPGAs (U280, U250), Marvell ThunderX2, Ampere Altra ARM servers, and various AI accelerator ASICs. Many are shifting to CXL for next-generation designs.
CCIX uses a snoop-based MOESI protocol. The CPU's Home Node (HN) tracks which nodes cache which lines. When an accelerator requests a cache line, the HN snoops CPU caches, forces writebacks if needed, then supplies data — all in hardware, transparent to software.
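The flow described above can be sketched as a toy home node that manages a single cache line. This is a simplified Python model under stated assumptions: the `HomeNode` class and its methods are hypothetical, only the M, E, and S states are tracked (a dirty line is written back to memory on a sharing snoop rather than kept in the Owned state), and every other cache is snooped regardless of its state.

```python
class HomeNode:
    """Toy point of coherency for one cache line (M/E/S only)."""

    def __init__(self, value):
        self.memory = value
        self.state = {}   # node -> "M" | "E" | "S" (absent means Invalid)
        self.cache = {}   # node -> cached value
        self.log = []     # snoops and writebacks, in order

    def _snoop_others(self, requester, invalidate):
        for node in list(self.state):
            if node == requester:
                continue
            self.log.append(("SnpUnique" if invalidate else "SnpShared")
                            + " -> " + node)
            if self.state[node] == "M":
                self.memory = self.cache[node]   # forced writeback
                self.log.append("writeback from " + node)
            if invalidate:
                del self.state[node], self.cache[node]
            else:
                self.state[node] = "S"

    def read_shared(self, requester):
        self._snoop_others(requester, invalidate=False)
        self.state[requester] = "S"
        self.cache[requester] = self.memory
        return self.memory

    def read_unique(self, requester):
        self._snoop_others(requester, invalidate=True)
        self.state[requester] = "E"
        self.cache[requester] = self.memory
        return self.memory

    def write(self, node, value):
        if self.state.get(node) not in ("M", "E"):
            self.read_unique(node)   # gain ownership before writing
        self.state[node] = "M"
        self.cache[node] = value


hn = HomeNode(value=0)
hn.write("cpu", 7)               # CPU holds the line Modified; memory still 0
seen = hn.read_shared("accel")   # accelerator read: snoop, writeback, supply
```

After the accelerator's read, both nodes hold the line in the Shared state and memory contains the CPU's value, with no software flush involved. A real implementation would also handle the Owned state, many lines, and concurrent requests.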
CCIX remains relevant for understanding existing AMD EPYC server deployments and Xilinx FPGA accelerator platforms. For new designs, CXL has largely replaced it. Understanding CCIX is still valuable for SoC architects and VLSI engineers working with heterogeneous compute.