CXL Memory Is Rewriting Server Architecture for AI | IRCNF

For most of computing history, memory has been physically attached to the processor that uses it. CPUs have their DIMMs, GPUs have their HBM stacks, and the two pools don't talk to each other efficiently. This architecture worked fine when workloads fit neatly inside a single server's memory budget. AI changed that. Large language model inference at scale can need hundreds of gigabytes to terabytes of memory for the KV cache alone, which can exceed what a single server's accelerators and DIMM slots provide. Compute Express Link (CXL) is the industry's answer to this mismatch, and its adoption is accelerating fast enough to matter for anyone building or buying data center infrastructure in the next two years.

CXL is not a product. It is a protocol: an open interconnect standard built on the PCIe physical layer (PCIe 5.0 for CXL 1.x and 2.0, PCIe 6.0 for CXL 3.0) that lets processors access memory on external devices with ordinary load/store instructions and hardware cache coherence. Latency is higher than for directly attached DRAM, roughly comparable to reaching memory on another CPU socket, but far lower than for any storage device. The practical implication is large: memory can be installed in a CXL memory module on the other side of a PCIe slot, or pooled across an entire rack via a CXL switch, and the CPU addresses it as ordinary system memory.

Three Sub-Protocols, One Use Case Driving Adoption

CXL defines three sub-protocols that serve different functions. CXL.io handles basic device I/O — roughly equivalent to PCIe. CXL.cache allows a device to cache portions of host memory, enabling accelerators like GPUs to access CPU-side data efficiently without explicit data copies. CXL.mem is the one getting the most investment: it lets a host CPU read and write to memory installed on an external CXL device, expanding the effective memory capacity available to any single processor far beyond what motherboard DIMM slots allow.
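The way the three sub-protocols combine into device classes can be sketched in a few lines. The Type 1/2/3 classification comes from the CXL specification; the Python names here (SubProtocol, DEVICE_TYPES, expands_host_memory) are illustrative only and not part of any real library.

```python
from enum import Flag, auto


class SubProtocol(Flag):
    """The three CXL sub-protocols described above."""
    IO = auto()     # CXL.io: discovery, configuration, basic device I/O
    CACHE = auto()  # CXL.cache: device caches host memory coherently
    MEM = auto()    # CXL.mem: host reads and writes device-attached memory


# Device classes and the sub-protocols each one uses.
DEVICE_TYPES = {
    "Type 1": SubProtocol.IO | SubProtocol.CACHE,                    # caching device without host-visible memory
    "Type 2": SubProtocol.IO | SubProtocol.CACHE | SubProtocol.MEM,  # accelerator with its own memory
    "Type 3": SubProtocol.IO | SubProtocol.MEM,                      # memory expander
}


def expands_host_memory(device_type: str) -> bool:
    """True if the host can use the device's memory through CXL.mem."""
    return SubProtocol.MEM in DEVICE_TYPES[device_type]


if __name__ == "__main__":
    for name in DEVICE_TYPES:
        print(name, "expands host memory:", expands_host_memory(name))
```

Memory expansion modules are Type 3 devices: they use CXL.io and CXL.mem but not CXL.cache.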

CXL 1.0 appeared in 2019. CXL 2.0 (2020) added switching and memory pooling: a CXL memory device behind a switch can be partitioned among multiple host processors, with each region assigned to one host at a time. CXL 3.0 (2022) moved to the PCIe 6.0 physical layer and extended this to fabric topologies: multi-level switching where any compute node in a rack can reach any memory module, with coherent multi-host sharing of the same memory and peer-to-peer access. The bandwidth ceiling rose to about 128 GB/s in each direction (256 GB/s bidirectional) for a x16 port in CXL 3.0, a large step for a capacity tier, though still well below what HBM delivers to a GPU.
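The 256 GB/s figure follows from simple lane arithmetic. The sketch below assumes a x16 port and the raw signalling rates of PCIe 5.0 (32 GT/s per lane) and PCIe 6.0 (64 GT/s per lane, the layer CXL 3.0 uses); it ignores encoding and protocol overhead, so delivered throughput will be somewhat lower. The function name is illustrative.

```python
def raw_link_gbytes_per_s(gigatransfers_per_s: float, lanes: int) -> float:
    """Raw bandwidth in one direction, in GB/s.

    Each transfer carries one bit per lane, so divide by 8 to get bytes.
    Encoding, FLIT, and protocol overhead are deliberately ignored.
    """
    return gigatransfers_per_s * lanes / 8


pcie5_x16 = raw_link_gbytes_per_s(32, 16)  # CXL 1.x / 2.0 physical layer
pcie6_x16 = raw_link_gbytes_per_s(64, 16)  # CXL 3.0 physical layer

if __name__ == "__main__":
    print(f"PCIe 5.0 x16: {pcie5_x16:.0f} GB/s per direction")
    print(f"PCIe 6.0 x16: {pcie6_x16:.0f} GB/s per direction, "
          f"{2 * pcie6_x16:.0f} GB/s bidirectional")
```

The headline number counts both directions of the link; traffic that is mostly reads or mostly writes sees closer to half of it.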

Why AI Inference Is the Forcing Function

LLM inference has a specific memory problem that CXL is well-positioned to solve. When a model generates text, it maintains a key-value (KV) cache that stores the attention state for every token in the context window. For a model with a 128K-token context window running on a multi-tenant inference server, the KV cache alone can consume hundreds of gigabytes — dynamically, depending on active sessions.
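A back-of-the-envelope calculation shows how the KV cache reaches that scale. The sketch below uses the standard sizing formula (one key and one value vector per layer, per KV head, per token); the model dimensions are hypothetical, chosen to resemble a large model with grouped-query attention, and are not taken from any specific product.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_value: int = 2) -> int:
    """KV cache size for one sequence.

    The leading factor of 2 accounts for storing both a key and a value
    tensor in every layer. bytes_per_value=2 assumes fp16/bf16 storage.
    """
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


# Hypothetical model: 80 layers, 8 KV heads, head dimension 128, 16-bit values.
CONTEXT_TOKENS = 128 * 1024
per_session = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                             tokens=CONTEXT_TOKENS)

if __name__ == "__main__":
    gib = 2 ** 30
    print(f"One full 128K-token session: {per_session / gib:.0f} GiB")
    print(f"Eight concurrent sessions:   {8 * per_session / gib:.0f} GiB")
```

Under these assumptions a single full-context session needs 40 GiB, so a handful of concurrent long-context sessions is enough to reach hundreds of gigabytes.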

Managing this with GPU HBM is expensive and capacity-constrained. Individual HBM stacks hold tens of gigabytes, and even with several stacks per accelerator, an 8-GPU server offers on the order of a terabyte of GPU memory in total, which must be shared between model weights, activations, and the KV cache. CXL memory expansion offers a cost-effective overflow: KV cache data that doesn't need the raw bandwidth of HBM can live in CXL-attached DRAM at roughly 10–20% of the cost per gigabyte, with access latency around 100–200 nanoseconds and far less bandwidth than HBM. The performance penalty is real but acceptable for data that is accessed infrequently during inference.
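The overflow idea can be illustrated with a toy two-tier store: recently used KV blocks stay in a small fast tier, and the least recently used ones spill to a larger slow tier. This is a minimal sketch under stated assumptions, not how any particular inference engine manages its cache; the class name, the block granularity, and the LRU policy are all choices made for the example, and both tiers are plain Python dictionaries standing in for HBM and CXL-attached DRAM.

```python
from collections import OrderedDict


class TieredKVStore:
    """Toy two-tier block store with LRU spill from fast to slow."""

    def __init__(self, fast_capacity_blocks: int):
        self.fast_capacity = fast_capacity_blocks
        self.fast = OrderedDict()  # stands in for HBM; most recently used last
        self.slow = {}             # stands in for CXL-attached DRAM

    def put(self, block_id, data):
        """Insert or update a block in the fast tier, spilling if needed."""
        self.slow.pop(block_id, None)
        self.fast[block_id] = data
        self.fast.move_to_end(block_id)
        self._spill()

    def get(self, block_id):
        """Return a block, promoting it to the fast tier on a slow-tier hit."""
        if block_id in self.fast:
            self.fast.move_to_end(block_id)
            return self.fast[block_id]
        data = self.slow.pop(block_id)  # raises KeyError for unknown blocks
        self.put(block_id, data)
        return data

    def _spill(self):
        """Move least recently used blocks to the slow tier until we fit."""
        while len(self.fast) > self.fast_capacity:
            victim, data = self.fast.popitem(last=False)
            self.slow[victim] = data


if __name__ == "__main__":
    store = TieredKVStore(fast_capacity_blocks=2)
    for block in ("a", "b", "c"):
        store.put(block, f"kv-{block}")
    print("fast:", list(store.fast), "slow:", list(store.slow))
```

A real system would also weigh the cost of moving a block against the cost of recomputing it, which this sketch leaves out.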

Memory-disaggregated inference — where a pool of CXL memory is shared across multiple GPU servers — takes this further. Instead of each GPU server maintaining its own oversized DRAM buffer, a CXL fabric allows 10 inference servers to share a single 4 TB memory pool that is dynamically allocated based on load. Utilization improves, stranded capacity decreases, and the cost per inference falls.
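The utilization argument comes down to the difference between provisioning every server for its own peak and provisioning one pool for the peak of the combined load. The sketch below makes that concrete with a made-up demand trace; the numbers are purely illustrative and the function names are hypothetical.

```python
def static_capacity(demand_by_server):
    """Capacity needed if every server is sized for its own peak."""
    return sum(max(trace) for trace in demand_by_server)


def pooled_capacity(demand_by_server):
    """Capacity needed if all servers draw from one shared pool."""
    return max(sum(step) for step in zip(*demand_by_server))


# Hypothetical demand in GB for three servers over three time steps.
# Each server peaks at a different time, which is where pooling helps.
demand_gb = [
    [100, 400, 100],
    [400, 100, 100],
    [100, 100, 400],
]

if __name__ == "__main__":
    static = static_capacity(demand_gb)
    pooled = pooled_capacity(demand_gb)
    print(f"static: {static} GB, pooled: {pooled} GB, "
          f"stranded capacity avoided: {static - pooled} GB")
```

The saving disappears when all servers peak at the same moment, so the benefit of pooling depends on how correlated the load is across hosts.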

Who Is Building the Hardware

Samsung's CXL Memory Module DRAM (CMM-D) offers up to 128 GB per module over a PCIe 5.0 x8 link, which caps bandwidth at roughly 32 GB/s in each direction, and is already in qualification with hyperscalers. SK hynix has its own CXL DRAM lineup, with a 128 GB module targeting AI inference servers. Micron entered CXL DRAM production in 2024. All three major DRAM manufacturers are now shipping or qualifying CXL product, so the supply side is maturing.

On the connectivity side, Astera Labs went public in 2024 specifically on the strength of its CXL and PCIe connectivity chips. Its Aries retimers are inside most CXL-capable servers shipping today, and its Leo CXL Memory Connectivity ICs enable memory pooling fabrics at the rack scale. Marvell and Synopsys also supply CXL controller IP that goes into server processors.

Intel Xeon Scalable processors have supported CXL since the Sapphire Rapids generation. AMD EPYC processors added CXL support in the Genoa generation. Arm-based server processors from Ampere and Nvidia's Grace CPU include CXL support. The ecosystem is broad enough that CXL is no longer an exotic option — it is a standard checkbox on enterprise server SKUs.

What Is Available Today vs. What Is Coming

CXL Type 3 memory expansion, in which a memory-only CXL device extends a single host's memory beyond its DIMM slot limits, is the most mature use case and is available in production today. A server with 12 DIMM slots topping out at 3 TB of DDR5 can add another 4 TB via a CXL memory expansion card, which is useful for in-memory databases, large analytics workloads, and LLM KV caches.

CXL memory pooling (multiple hosts sharing a common CXL memory resource) is in customer trials at hyperscalers as of 2025–2026 but is not yet in broad production. The software stack — operating system support for CXL memory tiers, hypervisor integration, memory management policies — is still maturing. Linux kernel support for CXL is improving rapidly (Linux 6.x series has progressively stronger CXL support), but orchestration tooling is behind.
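On the operating-system side, Linux commonly exposes onlined CXL memory as a NUMA node that has memory but no CPUs, and memory tiering policies then act on that node. The sketch below lists CPU-less nodes by reading the kernel's sysfs node directory. It is only a heuristic: a CPU-less node can also be another kind of memory, so the result is a candidate list rather than proof of a CXL device. The function name and its parameter are illustrative.

```python
from pathlib import Path


def cpuless_numa_nodes(sysfs_node_dir: str = "/sys/devices/system/node"):
    """Return sorted IDs of NUMA nodes whose cpulist is empty.

    The directory is a parameter so the function can be pointed at a
    test fixture; on a machine without this sysfs path it returns [].
    """
    root = Path(sysfs_node_dir)
    if not root.is_dir():
        return []
    nodes = []
    for node_dir in root.glob("node[0-9]*"):
        cpulist = node_dir / "cpulist"
        # An empty cpulist means no CPUs are local to this node.
        if cpulist.is_file() and cpulist.read_text().strip() == "":
            nodes.append(int(node_dir.name[len("node"):]))
    return sorted(nodes)


if __name__ == "__main__":
    print("CPU-less NUMA nodes (possible CXL memory):", cpuless_numa_nodes())
```

Tools such as numactl can then bind or interleave an application's allocations across those nodes.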

Full CXL fabric (rack-scale memory disaggregation with multi-host coherent access) remains largely at the hyperscaler proof-of-concept stage. Google, Microsoft, and AWS are all testing CXL fabric architectures internally, but customer-facing deployments are 18–24 months away.

What This Means for Infrastructure Buyers

For organizations buying servers today, CXL Type 3 memory expansion is worth evaluating for specific workloads: in-memory databases like SAP HANA or Redis that need large memory footprints, analytics workloads that don't fit in standard DRAM, and LLM serving infrastructure where KV cache management is a bottleneck.

The economics depend on weighing the premium for CXL-attached DRAM (roughly $10–20 per GB in current modules, compared to $3–5 per GB for standard DDR5 DIMMs) against the alternative, which is buying more servers to get more DIMM slots. For memory-intensive workloads, the consolidation savings typically pay back the CXL premium in 12–18 months.
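The payback claim can be sanity-checked with a one-line model: the up-front premium over DIMM pricing divided by the monthly savings from running fewer servers. The sketch below takes mid-range prices from the figures above; the 4 TB capacity and the monthly savings value are hypothetical inputs chosen for illustration, and a real evaluation would include power, licensing, and performance effects.

```python
def payback_months(capacity_gb: float, cxl_cost_per_gb: float,
                   dimm_cost_per_gb: float, monthly_savings: float) -> float:
    """Months until consolidation savings cover the CXL price premium."""
    if monthly_savings <= 0:
        raise ValueError("monthly_savings must be positive")
    premium = capacity_gb * (cxl_cost_per_gb - dimm_cost_per_gb)
    return premium / monthly_savings


if __name__ == "__main__":
    months = payback_months(
        capacity_gb=4096,       # hypothetical 4 TB expansion
        cxl_cost_per_gb=15.0,   # mid-point of the $10-20 range
        dimm_cost_per_gb=4.0,   # mid-point of the $3-5 range
        monthly_savings=3000.0, # hypothetical saving from fewer servers
    )
    print(f"Payback: {months:.1f} months")
```

With these inputs the premium is about $45,000, so the result is most sensitive to the savings estimate, which is the least certain number.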

For cloud buyers, the more relevant question is when hyperscalers will expose CXL-backed memory tiers as distinct pricing options — allowing customers to specify cheaper, higher-capacity CXL memory for latency-tolerant data alongside fast HBM or DDR5 for latency-critical paths. AWS and Google both have internal CXL programs, and customer-visible features are likely in 2027.

CXL is not a technology looking for a use case. The use case — AI memory expansion — arrived before the hardware was fully ready. The hardware is now catching up, and the next two years will determine whether disaggregated memory becomes a standard feature of AI infrastructure or remains a specialist tool for the largest hyperscalers.
