AI isn't really limited by compute; it's limited by moving data to and from memory. So the most exciting research isn't a faster GPU. It's rethinking memory itself. Here are the real alternatives, and how close they actually are.
A modern AI model has hundreds of billions of weights. To compute anything, the chip must stream those weights from memory to the multiply units, over and over. The uncomfortable truth: fetching a number from DRAM can cost far more time and energy than the multiply that uses it. The expensive compute sits idle, starving for data.
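To put rough numbers on that starvation, here is a back-of-the-envelope sketch. It assumes a batch size of one, so each weight is fetched once and used for a single multiply-accumulate per pass. The helper `step_times` and every hardware figure below are illustrative assumptions, not the specs of any real chip.

```python
def step_times(num_weights, bytes_per_weight, mem_bandwidth_bytes_s, peak_macs_s):
    """Return (memory_time_s, compute_time_s) for one pass over all weights."""
    # Time to stream every weight from memory once.
    memory_time = num_weights * bytes_per_weight / mem_bandwidth_bytes_s
    # Time to do one multiply-accumulate per weight at peak throughput.
    compute_time = num_weights / peak_macs_s
    return memory_time, compute_time


# Hypothetical figures, chosen only to show the shape of the problem.
mem_t, cmp_t = step_times(
    num_weights=200e9,           # "hundreds of billions" of weights
    bytes_per_weight=2,          # 16-bit weights
    mem_bandwidth_bytes_s=2e12,  # assumed 2 TB/s of memory bandwidth
    peak_macs_s=500e12,          # assumed 500 tera-MACs per second
)
idle_fraction = 1 - cmp_t / mem_t
print(f"memory: {mem_t:.4f} s, compute: {cmp_t:.6f} s, idle: {idle_fraction:.1%}")
```

Under these assumptions the multiply units finish long before the next weights arrive. Batching many inputs per weight fetch narrows the gap, which is one reason the picture differs between training and single-request inference.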
This is the memory wall (or von Neumann bottleneck): the separation of "where data lives" (memory) from "where data is processed" (the cores), connected by a pipe that's never fat enough. Today's answer is HBM: stacked DRAM next to the compute die, delivering terabytes per second. (For the full picture, see What Is an AI Chip? and why HBM demand exploded.)
But HBM has limits: capacity, cost, power, and the simple fact that data still has to move. So researchers ask a deeper question: what if we changed memory so the data barely moves at all?
Scaling conventional DRAM bandwidth gets progressively harder and more power-hungry. Even HBM, brilliant as it is, still moves every weight off the memory die and across an interposer to reach the MACs. The energy of that round trip, not the arithmetic, is the dominant cost. So the alternatives fall into two big ideas:
A third, more conventional track just rebalances the hierarchy: massive on-chip SRAM, 3D stacking, or pooled CXL memory.
Instead of reading data out of the memory array into a processor, compute-in-memory (CIM) performs the multiply-accumulate inside the array itself. Because the operands never make the expensive trip to a separate compute unit, it can slash the energy of data movement, often the single biggest win available. It comes in two flavours: digital CIM (built from SRAM bit-cells with added logic) and analog CIM (using the physics of resistive devices; see crossbars below).
CIM is already shipping in edge AI chips for ultra-low-power, always-on inference, where its efficiency matters most. The hard parts are precision and clean integration into existing design flows, but it directly attacks the root cause of the memory wall.
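One way to picture digital CIM is a bit-serial scheme: the inputs are fed to the array one bit-plane at a time, each cell gates its stored weight with the incoming bit, and the column results are shifted and added. The sketch below is a software model of that idea under stated assumptions (unsigned integers, a hypothetical function name `bit_serial_dot`); real designs differ in how they handle signs, weight bit-slicing and the adder tree.

```python
def bit_serial_dot(inputs, weights, input_bits=4):
    """Dot product of unsigned ints, computed one input bit-plane at a time.

    Each input must fit in `input_bits` bits.
    """
    total = 0
    for b in range(input_bits):
        # Bit b of every input is applied to the array in the same cycle.
        bit_plane = [(x >> b) & 1 for x in inputs]
        # Each cell contributes its stored weight only if its input bit is 1
        # (an AND gate); an adder tree sums the column.
        partial = sum(w for bit, w in zip(bit_plane, weights) if bit)
        # Shift-and-add: bit-plane b carries a weight of 2**b.
        total += partial << b
    return total


inputs = [3, 5, 7, 1]
weights = [2, 4, 6, 8]
print(bit_serial_dot(inputs, weights))
```

The weights never leave the array in this model; only the narrow input bit-planes and the column sums cross its boundary.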
A close cousin is processing-in-memory (PIM): put simple compute units right next to the DRAM banks (or inside the HBM stack) so each bank can do work locally and only send results out. Samsung's HBM-PIM and DRAM-based accelerators like UPMEM are real products demonstrating this. PIM doesn't replace HBM; it augments it, doing reduction/elementwise work where the data already sits and cutting the traffic over the main bus.
PIM is arguably the most practical near-term alternative because it bolts onto the existing DRAM ecosystem rather than replacing it.
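The traffic saving is easy to see in a toy model. The sketch below assumes both operands of a dot product already sit in the DRAM banks, and counts how many words cross the main bus when the host does the math versus when each bank reduces locally. The function names and the bank layout are hypothetical, and real PIM parts expose their own programming models.

```python
def host_dot(banks_w, banks_x):
    """Baseline: every operand crosses the bus and the host does the math."""
    total, words_moved = 0, 0
    for w, x in zip(banks_w, banks_x):
        words_moved += len(w) + len(x)  # both vectors leave the bank
        total += sum(wi * xi for wi, xi in zip(w, x))
    return total, words_moved


def pim_dot(banks_w, banks_x):
    """PIM: each bank reduces locally and sends back one partial sum."""
    partials = [
        sum(wi * xi for wi, xi in zip(w, x)) for w, x in zip(banks_w, banks_x)
    ]
    return sum(partials), len(partials)  # one word per bank crosses the bus


# Four banks, eight words of each operand per bank (toy sizes).
banks_w = [[b * 8 + i for i in range(8)] for b in range(4)]
banks_x = [[1 + (i % 3) for i in range(8)] for _ in range(4)]

host_total, host_words = host_dot(banks_w, banks_x)
pim_total, pim_words = pim_dot(banks_w, banks_x)
print(host_total == pim_total, host_words, pim_words)
```

Both paths return the same answer; the difference is only in how much data has to travel to get it.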
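This is the one that excites researchers most. Imagine a grid of resistive memory devices (memristors / ReRAM) where each device's conductance stores a weight. Now the laws of physics do the matrix multiply for free: input voltages applied to the rows drive a current through each device equal to voltage times conductance (Ohm's law, the multiply), and the currents from every device on a column sum on the shared wire (Kirchhoff's current law, the accumulate).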
The current coming out of each column is therefore the dot product of the inputs and the stored weights: a full matrix-vector multiply in a single step, in the analog domain, exactly where the weights are stored. No weights are ever fetched. This is in-memory computing in its purest form.
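As a minimal numerical sketch of an ideal crossbar: each device contributes a current of conductance times row voltage, and each column wire adds up the contributions of its devices. The function name `crossbar_mvm` and the values are illustrative assumptions, and the model ignores wire resistance, noise and every other non-ideality.

```python
def crossbar_mvm(conductances, voltages):
    """Ideal crossbar read-out.

    conductances[i][j]: device at row i, column j, in siemens.
    voltages[i]: input voltage on row i, in volts.
    Returns the current on each column, in amperes.
    """
    n_cols = len(conductances[0])
    currents = [0.0] * n_cols
    for row_g, v in zip(conductances, voltages):
        for j, g in enumerate(row_g):
            # Ohm's law gives the device current (g * v); Kirchhoff's
            # current law sums it onto the shared column wire.
            currents[j] += g * v
    return currents


G = [[1e-6, 2e-6],
     [3e-6, 4e-6]]   # stored weights as conductances (illustrative)
V = [0.1, 0.2]       # inputs as row voltages (illustrative)
column_currents = crossbar_mvm(G, V)
print(column_currents)
```

In hardware the two nested loops do not exist: every device conducts at once, which is why the whole matrix-vector product takes a single step.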
Analog is messy: device-to-device variation, noise, limited precision (a few bits), drift, and write endurance all fight you. Converting between analog and digital (ADC/DAC) at the array edges can eat the energy you saved. These are exactly the problems labs (IBM, many universities) are working to tame, and why analog CIM is powerful but not yet mainstream for training.
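A small simulation gives a feel for two of these effects. The sketch below models one crossbar column with Gaussian device-to-device variation and a uniform ADC at the column output. The noise model, the 5% spread, the value ranges and the function names are all assumptions for illustration, not measurements of any device.

```python
import random


def quantize(value, full_scale, bits):
    """Uniform ADC: clip to [0, full_scale], round to one of 2**bits levels."""
    levels = 2 ** bits - 1
    clipped = min(max(value, 0.0), full_scale)
    return round(clipped / full_scale * levels) / levels * full_scale


def noisy_column_current(conductances, voltages, sigma, adc_bits, full_scale, rng):
    """One column read-out with per-device variation, then ADC quantization."""
    current = sum(
        g * (1.0 + rng.gauss(0.0, sigma)) * v  # each device is off by a random factor
        for g, v in zip(conductances, voltages)
    )
    return quantize(current, full_scale, adc_bits)


rng = random.Random(0)  # fixed seed so runs are repeatable
g = [rng.uniform(1e-6, 1e-5) for _ in range(64)]  # assumed conductance range
v = [rng.uniform(0.0, 0.2) for _ in range(64)]    # assumed input voltage range
ideal = sum(gi * vi for gi, vi in zip(g, v))
full_scale = 64 * 1e-5 * 0.2                      # largest possible column current

for bits in (4, 6, 8):
    measured = noisy_column_current(g, v, 0.05, bits, full_scale, rng)
    print(bits, "bit ADC, relative error:", abs(measured - ideal) / ideal)
```

Raising the ADC resolution shrinks the quantization error but not the variation term, and higher-resolution converters cost more energy and area, which is the trade-off described above.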
Several new memory devices can either build those crossbars or replace parts of the hierarchy. Each stores data without power (non-volatile) and trades off differently:
Phase-change memory (PCM) stores a value in the crystalline vs amorphous state of a material; it supports multiple levels, making it a strong candidate for analog in-memory matrix multiply (IBM's analog-AI work uses PCM). Challenge: drift and write energy.
MRAM stores bits in magnetic orientation: fast, durable, non-volatile. It is already used as embedded memory and is promising for fast weight storage close to compute. Challenge: density vs DRAM.
Ferroelectric memory stores state in the polarization of a ferroelectric layer; it is attractive for compact, low-energy weight storage and CIM cells. Challenge: maturity and CMOS integration.
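To see why drift matters for analog weights, a commonly used empirical description is a power law in which conductance decays as G(t) = G0 · (t / t0)^(-ν). The sketch below applies that model; the drift exponent `nu`, the reference time and the function name are assumptions chosen for illustration, and real devices vary with material and programmed state.

```python
def drifted_conductance(g0, t_seconds, nu, t0_seconds=1.0):
    """Power-law drift model: G(t) = G0 * (t / t0) ** (-nu), for t >= t0."""
    return g0 * (t_seconds / t0_seconds) ** (-nu)


g0 = 10e-6  # programmed conductance in siemens (illustrative)
nu = 0.05   # assumed drift exponent (illustrative)
for t in (1.0, 1e3, 1e6):
    remaining = drifted_conductance(g0, t, nu) / g0
    print(f"after {t:g} s the stored weight reads as {remaining:.3f} of its value")
```

Because the stored conductance is the weight, any drift shows up directly as a change in the computed dot product unless it is compensated.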
None of these will simply "replace DRAM" overnight, but each can occupy a part of the AI memory hierarchy where it beats DRAM/SRAM on energy, density or non-volatility.
SRAM is the fastest, lowest-energy memory, but it's big per bit, so you normally get little of it. The radical alternative: build a gigantic chip with enormous SRAM so weights live on-chip and HBM is barely needed. Cerebras takes this to the extreme with a wafer-scale engine; Groq uses an SRAM-only deterministic design for fast inference.
The trade-off is obvious: SRAM capacity is limited and expensive, so this suits models (or layers) that fit on-chip, and shines on latency-critical inference more than giant-model training. But it sidesteps the memory wall by making "external memory" almost unnecessary.
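Whether a model "fits on-chip" is simple arithmetic, sketched below. The model size, weight precision and per-chip SRAM capacity are hypothetical placeholders rather than the specs of any real product, and the estimate ignores activations and other on-chip buffers.

```python
def weight_bytes(num_weights, bits_per_weight):
    """Bytes needed to hold the weights, rounded up to a whole byte."""
    return (num_weights * bits_per_weight + 7) // 8


def chips_needed(num_weights, bits_per_weight, sram_bytes_per_chip):
    """Minimum number of chips whose combined SRAM holds all the weights."""
    needed = weight_bytes(num_weights, bits_per_weight)
    return (needed + sram_bytes_per_chip - 1) // sram_bytes_per_chip  # ceiling division


# Hypothetical example: 7 billion 8-bit weights, 256 MB of SRAM per chip.
model_weights = 7_000_000_000
sram_per_chip = 256_000_000
print(weight_bytes(model_weights, 8), "bytes of weights")
print(chips_needed(model_weights, 8, sram_per_chip), "chips to keep them all in SRAM")
```

The same arithmetic shows why lower-precision weights are attractive for SRAM-only designs: halving the bits per weight halves the SRAM needed.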
Light can perform matrix multiply (via interferometer meshes) with near-zero interconnect energy and at the speed of light, and optical interconnect can move data between chips far more efficiently than copper. Silicon photonics is already used for links; a full photonic compute fabric is still maturing.
Photonics targets both halves of the problem, the compute and the data movement, but faces challenges in precision, optical memory, and integrating lasers/modulators at scale. Deep dive: how photonics works
| Approach | What it changes | Maturity | Main challenge |
|---|---|---|---|
| HBM (baseline) | Fast stacked DRAM | Mainstream | Data still moves; cost/power |
| Processing-in-memory | Compute beside DRAM banks | Shipping/niche | Programming model, limited ops |
| Digital CIM (SRAM) | MAC inside SRAM array | Emerging/edge | Precision, design flow |
| Memristor/ReRAM crossbar | Analog MVM in memory | Research/edge | Noise, precision, endurance, ADC |
| PCM analog | Multi-level analog weights | Research | Drift, write energy |
| MRAM | Fast non-volatile storage | Shipping (embedded) | Density vs DRAM |
| SRAM-only / wafer-scale | Keep weights fully on-chip | Shipping (niche) | Capacity, cost |
| Photonics | Optical compute & links | Long-term | Precision, integration, optical memory |
| CXL pooling | More shared capacity | Shipping | Adds capacity, not bandwidth |
Short answer: for today's frontier training, no single technology has dethroned HBM. It's mature, dense, and fast, and the whole software/hardware ecosystem is built around it. The realistic near-term path is HBM + more on-chip SRAM + processing-in-memory, with 3D stacking shrinking the distance data travels.
But the most exciting long game is clear: stop moving the data. Compute-in-memory and analog memristor crossbars attack the root cause, and they're already winning in low-power edge inference, the natural beachhead. MRAM is filling fast non-volatile niches now; photonics is the further-out bet on both compute and interconnect.
The best "alternative to memory" for AI isn't a new kind of DRAM. It's not moving the data at all: computing inside the memory. HBM rules training today; in-memory and analog approaches are the future, starting at the edge.
This is a fast-moving research area; specific products and their status change quickly. Treat this as a conceptual map, not a procurement guide, and verify the latest before making decisions.
Models must stream billions of weights to the compute units constantly, and moving data from DRAM costs more energy than the multiply itself. The cores starve for data: this is the memory wall.
Doing the multiply-accumulate inside or beside the memory array so data barely moves, which is the biggest available energy win. Built with SRAM (digital) or analog devices (memristors).
Conductances store the weights; input voltages on rows give currents (Ohm's law = multiply) that sum down columns (Kirchhoff = accumulate). Each column current is a dot product, so the array performs a matrix-vector multiply in one analog step.
Not for mainstream training yet. HBM is mature and fast. Alternatives (CIM, memristors, MRAM, photonics, SRAM-only) shine in niches, especially edge inference, and are maturing.
Related: What Is an AI Chip? · Systolic Array Lab · AI Semiconductor Boom · GPU Lab