Your CPU core is a sprinter; main memory (DRAM) is a slow postal service. If the core waited on DRAM for every access, a multi-GHz chip would crawl. The cache is the fix, and understanding it is one of the biggest levers for writing fast code on an ARM system.
A modern ARM core can issue an instruction in a fraction of a nanosecond, while a DRAM access costs on the order of 100–300 core cycles. That gap is the memory wall: the processor stalls, starved of data. Caches exist to hide that latency.
Rather than one cache, ARM uses a hierarchy — small/fast near the core, large/slow further out:
| Level | Typical size | Typical latency | Scope |
|---|---|---|---|
| L1 | 32–64 KB | ~1–4 cycles | per core, split I/D |
| L2 | 256 KB–1 MB | ~10–20 cycles | per core or cluster |
| L3 | 2–32 MB | ~30–50 cycles | shared across cores |
| DRAM | GBs | ~100–300 cycles | main memory |
Each level catches what the one above missed. A well-behaved program finds almost everything in L1/L2 and rarely pays the DRAM price.
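To put a rough number on "rarely pays the DRAM price", here is a minimal sketch of an average memory access time (AMAT) calculation. The latencies are midpoints of the ranges in the table; the hit rates and the `amat` helper are illustrative assumptions, not measurements of any real ARM core.

```python
# Average memory access time (AMAT) for a multi-level cache hierarchy.
# Latencies: midpoints of the table's ranges. Hit rates: assumed values.

def amat(levels, dram_latency):
    """levels: list of (latency_cycles, hit_rate), nearest level first.

    Every access that reaches a level pays that level's latency; the
    fraction that misses moves on to the next level, and finally to DRAM.
    """
    total = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for latency, hit_rate in levels:
        total += reach * latency
        reach *= 1.0 - hit_rate
    return total + reach * dram_latency

hierarchy = [(3, 0.95), (15, 0.80), (40, 0.50)]  # L1, L2, L3 (hypothetical)
with_caches = amat(hierarchy, dram_latency=200)
no_caches = amat([], dram_latency=200)

print(f"with caches: {with_caches:.2f} cycles per access")
print(f"DRAM only:   {no_caches:.2f} cycles per access")
```

Under these assumed hit rates the average access costs about 5 cycles instead of 200, even though only 95% of accesses hit in L1. Lowering the L1 hit rate in `hierarchy` shows how quickly the average degrades.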
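Caches would be useless if access were random. They work because real programs have locality: temporal locality (data used recently is likely to be used again soon) and spatial locality (data near a recent access is likely to be used next).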
DRAM is the library across campus. The cache is your desk. You keep the books you're using (temporal) and the next few on the shelf (spatial) right on your desk, so you rarely walk to the library. Write code that "stays on the desk" and it flies.
The cache doesn't store single bytes; it works in cache lines, typically 64 bytes on ARM. Touch one byte that isn't cached (a miss) and the whole 64-byte line containing it is fetched. The other 63 bytes of that line can then be accessed at cache speed (hits): that's spatial locality paying off.
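The effect of line-sized fetches can be sketched with a toy model that counts misses for two traversals of the same array: first row by row (sequential addresses), then column by column (a large stride). Everything here is an assumption for illustration: a fully associative 32 KB cache with LRU replacement, 4-byte elements, and a 1024×64 row-major array. Real ARM caches are set-associative and have prefetchers, so treat the counts as qualitative.

```python
from collections import OrderedDict

LINE_BYTES = 64          # line size stated in the text
CACHE_LINES = 512        # 32 KB / 64 B: an assumed L1 data cache
ELEM_BYTES = 4           # assumed 32-bit elements
ROWS, COLS = 1024, 64    # assumed row-major array (256 KB in total)

def count_misses(addresses):
    """Count misses in a toy fully associative LRU cache."""
    cache = OrderedDict()            # keys are line numbers, oldest first
    misses = 0
    for addr in addresses:
        line = addr // LINE_BYTES    # all bytes in a line share this number
        if line in cache:
            cache.move_to_end(line)  # hit: mark as most recently used
        else:
            misses += 1              # miss: the whole line is fetched
            cache[line] = True
            if len(cache) > CACHE_LINES:
                cache.popitem(last=False)  # evict the least recently used
    return misses

def address(r, c):
    """Byte address of element (r, c) in a row-major array based at 0."""
    return (r * COLS + c) * ELEM_BYTES

row_major = [address(r, c) for r in range(ROWS) for c in range(COLS)]
col_major = [address(r, c) for c in range(COLS) for r in range(ROWS)]

print("row-major misses:   ", count_misses(row_major))
print("column-major misses:", count_misses(col_major))
```

In this model the row-major walk misses once per 16 elements (4,096 misses for 65,536 accesses), while the column-major walk touches a different line on every access and evicts each line before returning to it, so all 65,536 accesses miss.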
Two loops can do the same work at wildly different speeds: one that walks memory sequentially hits on most accesses, while one that jumps between lines can miss on almost every access. This is why data layout often matters more than clever algorithms.
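What happens on a store? It depends on the write policy. A write-through cache updates memory on every store; a write-back cache updates only the cached line, marks it dirty, and writes it to memory when the line is evicted. Write-back is the faster of the two.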
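A minimal sketch of the two common store policies, write-through and write-back, counting how many writes actually reach memory. The `ToyCache` class is hypothetical and deliberately simplified: it never evicts, so dirty lines are written out only by an explicit `flush()`.

```python
class ToyCache:
    """Toy store model; tracks only which lines are dirty (assumed design)."""

    def __init__(self, policy, line_bytes=64):
        if policy not in ("write-through", "write-back"):
            raise ValueError(f"unknown policy: {policy}")
        self.policy = policy
        self.line_bytes = line_bytes
        self.dirty = set()        # line numbers modified but not yet written out
        self.memory_writes = 0    # writes that reached memory

    def store(self, addr):
        if self.policy == "write-through":
            self.memory_writes += 1                   # every store goes to memory
        else:
            self.dirty.add(addr // self.line_bytes)   # just mark the line dirty

    def flush(self):
        """Write out dirty lines, as would happen on eviction."""
        self.memory_writes += len(self.dirty)
        self.dirty.clear()

# 1,000 stores spread over sixteen 4-byte counters that share one 64-byte line.
stores = [(i % 16) * 4 for i in range(1000)]

results = {}
for policy in ("write-through", "write-back"):
    cache = ToyCache(policy)
    for addr in stores:
        cache.store(addr)
    cache.flush()
    results[policy] = cache.memory_writes

print(results)
```

With this access pattern write-through sends all 1,000 stores to memory, while write-back performs a single line write-out at flush time. The two counts are in different units (word writes versus line writes), but the difference in memory traffic is the point.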
At L1, ARM usually splits the cache into a separate I-cache (instructions) and D-cache (data) — a Harvard arrangement that lets the core fetch an instruction and a data word in the same cycle. (A subtle consequence: self-modifying or freshly loaded code needs explicit cache maintenance so the I-cache sees the new bytes.)
In a multicore chip, each core has its own L1 (and often its own L2), so the same address can be cached in several places and the copies could hold different values. Cache coherency hardware (a protocol such as MESI, implemented by snooping over ARM's coherent interconnect) keeps every core seeing one consistent value by invalidating stale copies when a core writes. This is the foundation for the multicore lessons ahead.
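The invalidate-on-write behaviour can be sketched as a small state machine for a single cache line, using the four textbook MESI states (Modified, Exclusive, Shared, Invalid). The `Line` class is a hypothetical teaching model, not a description of how any particular ARM interconnect implements coherency.

```python
class Line:
    """MESI state of one cache line as seen by each core (toy model)."""

    def __init__(self, cores):
        self.state = ["I"] * cores   # every core starts with no copy
        self.writebacks = 0          # dirty data pushed out of a cache

    def read(self, core):
        if self.state[core] != "I":
            return                   # M, E or S: local hit, nothing changes
        holders = [c for c, s in enumerate(self.state) if s != "I"]
        for c in holders:
            if self.state[c] == "M":
                self.writebacks += 1  # dirty copy is supplied before sharing
            self.state[c] = "S"
        # Alone: Exclusive. Otherwise every copy is Shared.
        self.state[core] = "S" if holders else "E"

    def write(self, core):
        for c, s in enumerate(self.state):
            if c != core and s != "I":
                if s == "M":
                    self.writebacks += 1
                self.state[c] = "I"   # invalidate the stale copy
        self.state[core] = "M"

line = Line(cores=2)
trace = []
for op, core in [("read", 0), ("read", 1), ("write", 0), ("read", 1)]:
    getattr(line, op)(core)
    trace.append("".join(line.state))

print(trace)  # state of core 0 and core 1 after each step
```

The trace goes `EI`, `SS`, `MI`, `SS`: the first reader holds the line exclusively, a second reader makes it shared, a write by core 0 invalidates core 1's copy, and core 1's next read forces the modified data to be shared again.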
Caches hide the memory wall by keeping hot data in small fast memory close to the core, organised as a hierarchy (L1→L2→L3→DRAM). They win because of locality, transfer data in 64-byte lines, use write-back for speed, split I/D at L1, and rely on coherency to stay consistent across cores. Write cache-friendly code and everything gets faster.
DRAM is far slower than the core (the memory wall); caches keep hot data close so most accesses are fast.
A cache line is the fixed block (≈64 bytes on ARM) loaded as a unit, exploiting spatial locality.
Write-through updates memory on every store; write-back updates only the cache and flushes dirty lines on eviction (faster).