Why is memory the bottleneck in training AI?

Neural networks must stream billions of weights and activations between memory and the compute units constantly. Moving a number from DRAM can cost far more energy and time than the multiply itself, and the compute units sit idle waiting for data. This data-movement bottleneck, known as the memory wall or von Neumann bottleneck, is why AI accelerators are often limited by memory bandwidth rather than raw arithmetic, and why high-bandwidth memory (HBM) is so critical.

What is a memristor crossbar and how does it do matrix multiply?

A memristor crossbar is a grid of resistive memory devices (such as ReRAM or PCM) whose conductance stores a weight. Applying input voltages to the rows produces, by Ohm's law, currents proportional to voltage times conductance, and by Kirchhoff's current law these sum along each column. The column current is therefore the dot product of the inputs and the stored weights - a full matrix-vector multiply performed in the analog domain in essentially one step, where the data is stored.

Can anything replace HBM for AI today?

Not yet for mainstream large-scale training. HBM remains the workhorse because it is mature, dense and delivers terabytes per second. Alternatives such as compute-in-memory, memristor crossbars, MRAM, photonics and SRAM-only wafer-scale designs are promising and already used in niches (especially low-power edge inference), but face challenges in precision, endurance, capacity or manufacturability. The likely near-term path is HBM plus more on-chip SRAM and processing-in-memory, with analog and photonic approaches maturing over a longer horizon.

What emerging memories are candidates for AI?

Key candidates include ReRAM/memristors and PCM (phase-change memory) for analog in-memory matrix multiply, MRAM (STT/SOT) for fast non-volatile storage, and FeFET/FeRAM for efficient weight storage. Each trades off speed, density, endurance, retention and precision differently, so different technologies suit different parts of the AI memory hierarchy.

Beyond HBM: Are There Memory Alternatives to Train AI? (Deep Guide)

What is compute-in-memory?

Compute-in-memory (CIM), also called in-memory computing, performs the arithmetic - especially multiply-accumulate - inside or right next to the memory array instead of moving data to a separate processor. By doing the math where the data already lives, it largely eliminates the costly data movement that dominates AI energy use. It can be built with SRAM (digital CIM) or with analog non-volatile devices such as memristors arranged in a crossbar.

The problem

Why memory — not compute — is the wall

A modern AI model has hundreds of billions of weights. To compute anything, the chip must stream those weights from memory to the multiply units, over and over. The uncomfortable truth: fetching a number from DRAM can cost far more time and energy than the multiply that uses it. The expensive compute sits idle, starving for data.

This is the memory wall (or von Neumann bottleneck): the separation of "where data lives" (memory) from "where data is processed" (the cores), connected by a pipe that's never fat enough. Today's answer is HBM — stacked DRAM next to the compute die delivering terabytes per second. (For the full picture, see What Is an AI Chip? and why HBM demand exploded.)

But HBM has limits — capacity, cost, power, and the simple fact that data still has to move. So researchers ask a deeper question: what if we changed memory so the data barely moves at all?
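To put rough numbers on the memory wall, here is a back-of-the-envelope sketch. The weight count, byte width, bandwidth and peak throughput are illustrative assumptions, not the specs of any real chip; only the ratio matters.

```python
# Back-of-the-envelope: time to stream weights vs time to do the math.
# All numbers are illustrative assumptions, not specs of a real chip.

def stream_time_s(num_weights: float, bytes_per_weight: float,
                  bandwidth_bytes_per_s: float) -> float:
    """Seconds needed to read every weight once from memory."""
    return num_weights * bytes_per_weight / bandwidth_bytes_per_s

def compute_time_s(num_weights: float, peak_flops: float,
                   flops_per_weight: float = 2.0) -> float:
    """Seconds for one multiply-accumulate (2 FLOPs) per weight."""
    return num_weights * flops_per_weight / peak_flops

weights = 100e9          # assumed: 100 billion weights
bytes_per_weight = 2.0   # assumed: 16-bit weights
bandwidth = 3e12         # assumed: 3 TB/s of memory bandwidth
peak = 1e15              # assumed: 1 PFLOP/s of peak arithmetic

t_mem = stream_time_s(weights, bytes_per_weight, bandwidth)
t_cmp = compute_time_s(weights, peak)
ratio = t_mem / t_cmp
print(f"memory: {t_mem * 1e3:.1f} ms, compute: {t_cmp * 1e3:.2f} ms, "
      f"ratio: {ratio:.0f}x")
```

Under these assumptions, and with each weight used only once per fetch, the arithmetic units spend most of their time waiting. Reusing each fetched weight across a batch of inputs narrows the gap, which is one reason batching matters so much.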


The framing

Why "just add more DRAM" isn't the fix

Scaling conventional DRAM bandwidth gets steadily harder and more power-hungry with each generation. Even HBM, brilliant as it is, still moves every weight off the memory die and across an interposer to reach the multiply-accumulate (MAC) units. The energy of that trip, not the arithmetic, is often the dominant cost. So the alternatives fall into two big ideas:

Move the compute to the data — do the math inside or beside the memory so data barely travels (compute/processing-in-memory).
Use new physics for memory — devices that store a weight and participate in the computation (memristor crossbars), or that are denser/faster/non-volatile (PCM, MRAM).

A third, more conventional track just rebalances the hierarchy — massive on-chip SRAM, 3D stacking, or pooled CXL memory.

Idea 1 — move compute to the data

Compute-in-memory (CIM)

🧮

Compute-in-Memory / In-Memory Computing

Emerging

Instead of reading data out of the memory array into a processor, CIM performs the multiply-accumulate inside the array itself. Because the operands never make the expensive trip to a separate compute unit, it can slash the energy of data movement — often the single biggest win available. It comes in two flavours: digital CIM (built from SRAM bit-cells with added logic) and analog CIM (using the physics of resistive devices — see crossbars below).

CIM is already shipping in edge AI chips for ultra-low-power, always-on inference, where its efficiency matters most. The hard part is precision and integrating it cleanly into existing design flows — but it directly attacks the root cause of the memory wall.
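A toy energy model makes the argument concrete. The per-operation energies below are hypothetical placeholders, chosen only to reflect the claim that moving a weight costs far more than multiplying by it; the function names are likewise made up for this sketch.

```python
# Toy energy model: conventional fetch-then-multiply vs compute-in-memory.
# The per-operation energies are hypothetical placeholders chosen only to
# reflect the text's claim that moving data costs far more than the math.

E_DRAM_FETCH_PJ = 500.0  # assumed energy to fetch one weight from DRAM
E_MAC_PJ = 1.0           # assumed energy of one multiply-accumulate
E_CIM_MAC_PJ = 5.0       # assumed energy of one in-array MAC (incl. readout)

def conventional_energy_pj(num_macs: int, reuse: int = 1) -> float:
    """Each weight is fetched once per `reuse` MACs, then multiplied."""
    fetches = num_macs / reuse
    return fetches * E_DRAM_FETCH_PJ + num_macs * E_MAC_PJ

def cim_energy_pj(num_macs: int) -> float:
    """Weights stay in the array; only the in-array MAC is paid for."""
    return num_macs * E_CIM_MAC_PJ

n = 1_000_000
for reuse in (1, 10, 100):
    conv = conventional_energy_pj(n, reuse)
    print(f"reuse={reuse:>3}: conventional/CIM = {conv / cim_energy_pj(n):.1f}x")
```

In this sketch the advantage shrinks as each fetched weight is reused more often, which fits the picture of CIM paying off most in low-reuse, always-on inference.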

Idea 1, continued

Processing-in-memory (PIM)

🧠

Processing-in-Memory

Emerging / shipping

A close cousin: put simple compute units right next to the DRAM banks (or inside the HBM stack) so each bank can do work locally and only send results out. Samsung's HBM-PIM and UPMEM's DRAM-based PIM modules are real hardware demonstrating this. PIM doesn't replace HBM; it augments it, doing reduction and elementwise work where the data already sits and cutting the traffic over the main bus.

PIM is arguably the most practical near-term alternative because it bolts onto the existing DRAM ecosystem rather than replacing it.
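The traffic saving is easy to see for a reduction. In this minimal sketch the bank count, element count and byte sizes are hypothetical, and `pim_sum` is only a functional model of "sum locally, send partials", not the programming interface of any real PIM product.

```python
# Sketch: bytes crossing the main bus for a sum over data spread across banks.
# Bank count, element count and sizes are hypothetical.

def bus_bytes_without_pim(elements_per_bank: int, banks: int,
                          bytes_per_element: int) -> int:
    """Host does the sum, so every element must cross the bus."""
    return elements_per_bank * banks * bytes_per_element

def bus_bytes_with_pim(banks: int, bytes_per_result: int) -> int:
    """Each bank sums locally and sends a single partial result."""
    return banks * bytes_per_result

def pim_sum(bank_data: list[list[float]]) -> float:
    """Functional model: per-bank partial sums, then a small host-side sum."""
    partials = [sum(bank) for bank in bank_data]   # done beside each bank
    return sum(partials)                           # done on the host

banks = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
assert pim_sum(banks) == sum(x for bank in banks for x in bank)
print(bus_bytes_without_pim(1_000_000, 16, 2), "bytes vs",
      bus_bytes_with_pim(16, 4), "bytes")
```

The result is the same either way; what changes is how many bytes have to leave the memory to get it.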

Idea 2 — new physics

Memristor / ReRAM analog crossbars — the star candidate

This is the one that excites researchers most. Imagine a grid of resistive memory devices (memristors / ReRAM) where each device's conductance stores a weight. Now the laws of physics do the matrix multiply for free:

Apply input voltages to the rows.
Ohm's law (I = V·G): each device produces a current = input voltage × stored conductance (i.e. a multiply).
Kirchhoff's current law: currents sum down each column (i.e. an accumulate).

The current coming out of each column is therefore the dot product of the inputs and the stored weights — a full matrix-vector multiply in a single step, in the analog domain, exactly where the weights are stored. No weights are ever fetched. This is in-memory computing in its purest form.
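The three steps above can be written as a few lines of code. This is an idealised sketch: it assumes no noise, no wire resistance and non-negative weights, and the function name `crossbar_mvm` and the example values are made up for illustration.

```python
# Idealised memristor crossbar: no noise, no wire resistance (assumptions).
# G[i][j] is the conductance (siemens) at row i, column j; V[i] is the
# voltage (volts) applied to row i. Column j collects current I[j] (amperes).

def crossbar_mvm(G: list[list[float]], V: list[float]) -> list[float]:
    rows, cols = len(G), len(G[0])
    assert len(V) == rows, "one input voltage per row"
    I = [0.0] * cols
    for i in range(rows):
        for j in range(cols):
            # Ohm's law per device (multiply), Kirchhoff sum per column (accumulate)
            I[j] += V[i] * G[i][j]
    return I

G = [[1e-6, 2e-6],
     [3e-6, 4e-6],
     [5e-6, 6e-6]]          # 3 rows x 2 columns of conductances
V = [0.1, 0.2, 0.3]         # row voltages
I = crossbar_mvm(G, V)
print(I)                    # column currents, one dot product per column
```

In software this is a double loop; in the crossbar every multiply and every sum happens at once in the physics. Signed weights are commonly represented as the difference between two conductances, which this sketch leaves out.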

Figure 1: A memristor crossbar. Conductances store the weight matrix; input voltages on the rows produce column currents that, in the ideal case, equal the matrix-vector product, computed in analog, in place.

⚠ The catch

Analog is messy: device-to-device variation, noise, limited precision (a few bits), drift, and write endurance all fight you. Converting between analog and digital (ADC/DAC) at the array edges can eat the energy you saved. These are exactly the problems labs (IBM, many universities) are working to tame — and why analog CIM is powerful but not yet mainstream for training.
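A small simulation shows how limited precision and device variation eat into accuracy. It is a sketch under simple assumptions: weights are quantized to evenly spaced levels, variation is Gaussian and multiplicative, and the noise level and vector length are arbitrary. Drift, endurance and ADC/DAC cost are not modelled.

```python
import random

def quantize(w: float, bits: int, w_max: float = 1.0) -> float:
    """Snap a weight in [0, w_max] to one of 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    w = min(max(w, 0.0), w_max)
    return round(w / w_max * levels) / levels * w_max

def noisy_dot(weights, inputs, bits, sigma, rng):
    """Dot product with quantized weights and Gaussian per-device variation."""
    total = 0.0
    for w, x in zip(weights, inputs):
        g = quantize(w, bits) * (1.0 + rng.gauss(0.0, sigma))
        total += g * x
    return total

rng = random.Random(0)  # fixed seed so runs are repeatable
weights = [rng.random() for _ in range(256)]
inputs = [rng.random() for _ in range(256)]
exact = sum(w * x for w, x in zip(weights, inputs))

for bits in (2, 4, 8):
    approx = noisy_dot(weights, inputs, bits, sigma=0.05, rng=rng)
    print(f"{bits}-bit: relative error {abs(approx - exact) / exact:.3%}")
```

Adding bits reduces the quantization part of the error, but the variation term stays, which is why device quality matters as much as the number of levels.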

Idea 2, the devices

Emerging non-volatile memories: PCM, MRAM, FeFET

Several new memory devices can either build those crossbars or replace parts of the hierarchy. Each stores data without power (non-volatile) and trades off differently:

🔥

PCM — Phase-Change Memory

Research/niche

Stores a value in the crystalline vs amorphous state of a material; supports multiple levels, making it a strong candidate for analog in-memory matrix multiply (IBM's analog-AI work uses PCM). Challenge: drift and write energy.
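Drift is often described with an empirical power law, in which conductance decays slowly with the time since programming. The sketch below uses that model; the drift exponent and the programmed conductance are hypothetical values for illustration, not measurements of any device.

```python
# Empirical power-law drift model often used for PCM conductance:
#   G(t) = G0 * (t / t0) ** (-nu)
# The drift exponent nu below is a hypothetical value for illustration.

def drifted_conductance(g0: float, t: float, t0: float = 1.0,
                        nu: float = 0.05) -> float:
    """Conductance at time t (seconds) given g0 measured at time t0."""
    if t < t0:
        raise ValueError("model is defined for t >= t0")
    return g0 * (t / t0) ** (-nu)

g0 = 10e-6  # assumed programmed conductance: 10 microsiemens at t0 = 1 s
for label, t in [("1 s", 1.0), ("1 hour", 3600.0), ("1 day", 86400.0)]:
    g = drifted_conductance(g0, t)
    print(f"after {label}: {g * 1e6:.2f} uS ({g / g0:.1%} of programmed value)")
```

Because a stored weight is a conductance, any drift shifts the weight itself unless the system compensates for it.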

🧲

MRAM — Magnetoresistive RAM (STT/SOT)

Shipping

Stores bits in magnetic orientation: fast, durable, non-volatile. Already used as embedded memory; promising for fast weight storage close to compute. Challenge: density vs DRAM.

⚡

FeFET / FeRAM — Ferroelectric

Research/niche

Stores data as the polarization state of a ferroelectric layer; attractive for compact, low-energy weight storage and CIM cells. Challenge: maturity and CMOS integration.

None of these will simply "replace DRAM" overnight — but each can occupy a part of the AI memory hierarchy where it beats DRAM/SRAM on energy, density or non-volatility.

Idea 3 — rebalance the hierarchy

Go SRAM-heavy: wafer-scale & on-chip everything

🟦

Massive on-chip SRAM / wafer-scale

Shipping

SRAM is the fastest, lowest-energy memory — but it's big per bit, so you normally get little of it. The radical alternative: build a gigantic chip with enormous SRAM so weights live on-chip and HBM is barely needed. Cerebras takes this to the extreme with a wafer-scale engine; Groq uses an SRAM-only deterministic design for fast inference.

The trade-off is obvious — SRAM capacity is limited and expensive, so this suits models (or layers) that fit on-chip, and shines on latency-critical inference more than giant-model training. But it sidesteps the memory wall by making "external memory" almost unnecessary.
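Whether a model "fits on-chip" is simple arithmetic. In this sketch the SRAM capacity per chip is a hypothetical figure, not the spec of any product, and only the weights are counted; activations, caches and optimizer state would add more.

```python
import math

def model_bytes(num_params: float, bytes_per_param: float) -> float:
    """Bytes needed to hold the weights alone (no activations or caches)."""
    return num_params * bytes_per_param

def chips_needed(num_params: float, bytes_per_param: float,
                 sram_bytes_per_chip: float) -> int:
    """Minimum number of chips whose combined SRAM holds all the weights."""
    return math.ceil(model_bytes(num_params, bytes_per_param)
                     / sram_bytes_per_chip)

SRAM_PER_CHIP = 40e9  # hypothetical on-chip SRAM capacity, in bytes

for name, params in [("7B", 7e9), ("70B", 70e9)]:
    size_gb = model_bytes(params, 2.0) / 1e9   # assumed 16-bit weights
    n = chips_needed(params, 2.0, SRAM_PER_CHIP)
    print(f"{name}: {size_gb:.0f} GB of weights -> {n} chip(s)")
```

Training needs considerably more state per weight than inference, which is part of why this approach leans toward inference.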

Idea 2 — even newer physics

Photonics: compute & move data with light

💡

Photonic / optical computing

Long-term

Light can perform matrix multiply (via interferometer meshes) with very little energy spent moving data and with latency set by how long light takes to cross the mesh, and optical interconnect can move data between chips far more efficiently than copper. Silicon photonics is already used for links; a full photonic compute fabric is still maturing.

Photonics targets both halves of the problem, the compute and the data movement, but faces challenges in precision, optical memory, and integrating lasers/modulators at scale.
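To see how an interferometer can "be" a matrix, consider a single Mach-Zehnder interferometer (MZI), a common building block of such meshes. The sketch below models it as two 50:50 beam splitters and two phase shifters acting on a pair of complex light amplitudes. The specific beam-splitter and phase conventions are assumptions (other conventions differ by phase factors), and losses are ignored.

```python
import cmath
import math

Matrix = list[list[complex]]

def matmul2(a: Matrix, b: Matrix) -> Matrix:
    """Product of two 2x2 complex matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def phase(angle: float) -> Matrix:
    """Phase shifter on the first waveguide only."""
    return [[cmath.exp(1j * angle), 0j], [0j, 1 + 0j]]

def mzi(theta: float, phi: float) -> Matrix:
    """Lossless MZI: splitter, internal phase theta, splitter, input phase phi."""
    s = 1 / math.sqrt(2)
    bs = [[s + 0j, 1j * s], [1j * s, s + 0j]]   # ideal 50:50 beam splitter
    return matmul2(bs, matmul2(phase(theta), matmul2(bs, phase(phi))))

def apply(u: Matrix, x: list[complex]) -> list[complex]:
    """Matrix-vector product: output amplitudes for input amplitudes x."""
    return [u[i][0] * x[0] + u[i][1] * x[1] for i in range(2)]

for theta in (0.0, math.pi / 2, math.pi):
    out = apply(mzi(theta, 0.0), [1 + 0j, 0j])   # light enters port 0 only
    powers = [abs(z) ** 2 for z in out]
    print(f"theta={theta:.2f}: output powers {powers[0]:.2f}, {powers[1]:.2f}")
```

Tuning the phases sets the matrix, and the light passing through performs the multiply. A mesh of many such elements can realise larger matrices, and keeping all those phases accurate is where the precision challenge comes from.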

Idea 3, continued

Capacity tricks: CXL pooling & 3D stacking

CXL memory pooling — a standard that lets accelerators share a large pool of memory over a fast link, expanding capacity (great for serving huge models) though not raw bandwidth. A pragmatic, shipping answer to "we need more memory, not necessarily faster."
3D DRAM / stacking compute on memory — stack memory directly on top of logic (true 3D), shortening the distance data travels. HBM is the 2.5D step; full 3D integration is the next frontier and a major focus of advanced packaging.

At a glance

The alternatives compared

Approach	What it changes	Maturity	Main challenge
HBM (baseline)	Fast stacked DRAM	Mainstream	Data still moves; cost/power
Processing-in-memory	Compute beside DRAM banks	Shipping/niche	Programming model, limited ops
Digital CIM (SRAM)	MAC inside SRAM array	Emerging/edge	Precision, design flow
Memristor/ReRAM crossbar	Analog MVM in memory	Research/edge	Noise, precision, endurance, ADC
PCM analog	Multi-level analog weights	Research	Drift, write energy
MRAM	Fast non-volatile storage	Shipping (embedded)	Density vs DRAM
SRAM-only / wafer-scale	Keep weights fully on-chip	Shipping (niche)	Capacity, cost
Photonics	Optical compute & links	Long-term	Precision, integration, optical memory
CXL pooling	More shared capacity	Shipping	Adds capacity, not bandwidth

The honest answer

So, is there an alternative to HBM for training AI?

Short answer: for today's frontier training, no single technology has dethroned HBM — it's mature, dense, and fast, and the whole software/hardware ecosystem is built around it. The realistic near-term path is HBM + more on-chip SRAM + processing-in-memory, with 3D stacking shrinking the distance data travels.

But the most exciting long game is clear: stop moving the data. Compute-in-memory and analog memristor crossbars attack the root cause — and they're already winning in low-power edge inference, the natural beachhead. MRAM is filling fast non-volatile niches now; photonics is the further-out bet on both compute and interconnect.

✅ The takeaway in one line

The best "alternative to memory" for AI isn't a new kind of DRAM — it's not moving the data at all: computing inside the memory. HBM rules training today; in-memory and analog approaches are the future, starting at the edge.

This is a fast-moving research area; specific products and their status change quickly. Treat this as a conceptual map, not a procurement guide — and verify the latest before making decisions.

Reference

FAQ

Why is memory the bottleneck in AI?

Models must stream billions of weights to the compute units constantly, and moving a number from DRAM can cost far more energy than the multiply that uses it. The cores starve for data: that is the memory wall.

What is compute-in-memory?

Doing the multiply-accumulate inside or beside the memory array so data barely moves — the biggest available energy win. Built with SRAM (digital) or analog devices (memristors).

How does a memristor crossbar multiply matrices?

Conductances store the weights; input voltages on rows give currents (Ohm's law = multiply) that sum down columns (Kirchhoff = accumulate). Each column current is a dot product — a matrix-vector multiply in one analog step.

Can anything replace HBM today?

Not for mainstream training yet. HBM is mature and fast. Alternatives (CIM, memristors, MRAM, photonics, SRAM-only) shine in niches, especially edge inference, and are maturing.