DAY 22 · SYSTEM & MEMORY

Caches in ARM — and Why They Matter

By EcrioniX · Updated Jun 6, 2026

Your CPU core is a sprinter; main memory (DRAM) is a slow postal service. If the core waited on DRAM for every access, a multi-GHz chip would crawl. The cache is the fix, and understanding it is one of the biggest levers for writing fast code on any ARM system.

1. The memory wall

A modern ARM core can issue an instruction in a fraction of a nanosecond, while a DRAM access costs on the order of a hundred or more core cycles. That gap, which has widened as cores have sped up faster than DRAM, is the memory wall: the processor sits starved, waiting for data. Caches exist to hide that latency.

2. The cache hierarchy

Rather than one cache, ARM uses a hierarchy — small/fast near the core, large/slow further out:

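| Level | Typical size | Typical latency | Scope |
|-------|--------------|-----------------|-------|
| L1 | 32–64 KB | ~1–4 cycles | per core, split I/D |
| L2 | 256 KB–1 MB | ~10–20 cycles | per core or cluster |
| L3 | 2–32 MB | ~30–50 cycles | shared across cores |
| DRAM | GBs | ~100–300 cycles | main memory |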

Each level catches what the one above missed. A well-behaved program finds almost everything in L1/L2 and rarely pays the DRAM price.
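To put a number on "rarely pays the DRAM price", here is a minimal sketch of the standard average memory access time (AMAT) calculation in C. It assumes each level's latency is paid on top of the levels before it, and the names `struct level` and `amat_cycles` are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One level of the hierarchy: its access latency in cycles and the fraction
   of accesses reaching this level that hit here. */
struct level {
    double latency_cycles;
    double hit_rate;   /* 0.0 .. 1.0; the last level (DRAM) must be 1.0 */
};

/* Average memory access time: every access that reaches a level pays its
   latency, and only the misses continue on to the next level. */
double amat_cycles(const struct level *levels, size_t count)
{
    double total = 0.0;
    double reach = 1.0;   /* fraction of all accesses that get this far */

    assert(count > 0 && levels[count - 1].hit_rate == 1.0);
    for (size_t i = 0; i < count; i++) {
        total += reach * levels[i].latency_cycles;
        reach *= 1.0 - levels[i].hit_rate;
    }
    return total;
}
```

As a worked example, take assumed latencies of 4, 12, 40 and 200 cycles (picked from the ranges in the table) and hypothetical hit rates of 95%, 80% and 75% for L1, L2 and L3. The result is 4 + 0.05·12 + 0.05·0.2·40 + 0.05·0.2·0.25·200 = 5.5 cycles on average, far closer to the L1 latency than to DRAM's 200. The hit rates are illustrative only, not measurements of any real core.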

3. Why caches work: locality

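Caches would be useless if access were random. They work because real programs have locality:

Temporal locality: data you used recently is likely to be used again soon.
Spatial locality: data near what you just used is likely to be used next.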

💡 The desk and the library

DRAM is the library across campus. The cache is your desk. You keep the books you're using (temporal) and the next few on the shelf (spatial) right on your desk, so you rarely walk to the library. Write code that "stays on the desk" and it flies.

4. Cache lines, hits & misses

The cache doesn't store single bytes; it works in cache lines, typically 64 bytes on ARM. Touch one byte that isn't cached (a miss) and the whole 64-byte line containing it is fetched. The other 63 bytes of that line are now cheap to access (hits). That is spatial locality paying off.

// Cache-friendly: walks memory in order (uses every byte of each line)
for (i = 0; i < N; i++)
    sum += a[i];

// Cache-hostile: huge stride, so only one element of each fetched line is used
for (i = 0; i < N; i += 4096)
    sum += a[i];

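The cost per access is wildly different: the first loop misses once per 64-byte line and then hits on the rest of it, while the second misses on almost every access. (The strided loop visits far fewer elements, so compare time per access, not total time.) This is why data layout often matters more than clever algorithms.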
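The hit/miss behaviour of the two loops can be checked without a stopwatch. Below is a minimal sketch of a toy direct-mapped cache in C that only counts hits and misses. The 32 KB size, the direct-mapped organisation, the 4-byte element size and all names (`toy_cache`, `cache_access`, `walk_misses`) are assumptions for illustration; real ARM L1 caches are usually set-associative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64u    /* typical ARM cache line size */
#define NUM_LINES  512u   /* 512 x 64 B = 32 KB, an assumed L1 D-cache size */

/* A toy direct-mapped cache: each slot remembers which line it holds. */
struct toy_cache {
    uint64_t line_addr[NUM_LINES];
    int      valid[NUM_LINES];
    unsigned long hits, misses;
};

void cache_reset(struct toy_cache *c)
{
    memset(c, 0, sizeof *c);
}

/* Record one access to byte address addr as a hit or a miss. */
void cache_access(struct toy_cache *c, uint64_t addr)
{
    uint64_t line = addr / LINE_BYTES;           /* which 64-byte line */
    unsigned slot = (unsigned)(line % NUM_LINES);

    if (c->valid[slot] && c->line_addr[slot] == line) {
        c->hits++;
    } else {                                     /* miss: fetch the whole line */
        c->misses++;
        c->valid[slot] = 1;
        c->line_addr[slot] = line;
    }
}

/* Walk an array of n 4-byte elements starting at address 0, stepping by
   `stride` elements, and return the number of misses. */
unsigned long walk_misses(struct toy_cache *c, unsigned long n,
                          unsigned long stride)
{
    assert(stride > 0);
    cache_reset(c);
    for (unsigned long i = 0; i < n; i += stride)
        cache_access(c, (uint64_t)i * 4u);
    return c->misses;
}
```

In this model a sequential walk over 4-byte elements misses once per 16 accesses (64 / 4), a 93.75% hit rate from a cold cache, while a 4096-element stride lands on a different line every time and never hits.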

5. Write policies

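What happens on a store? There are two common policies:

Write-through: every store updates both the cache and memory. Simple, but every store generates memory traffic.
Write-back: a store updates only the cache and marks the line dirty; memory is updated when the dirty line is evicted. Less traffic, so it is the usual choice for speed.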
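The difference between the two common store policies, write-through (memory updated on every store) and write-back (only the cache is updated, and the dirty line is written out on eviction), shows up clearly in memory traffic. The following is a minimal sketch in C that models a single cached line and counts the writes reaching memory; the names `line_model` and `writes_for` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum write_policy { WRITE_THROUGH, WRITE_BACK };

/* A single cached line plus a counter of writes that reached memory. */
struct line_model {
    enum write_policy policy;
    bool dirty;                 /* only ever set under write-back */
    unsigned long mem_writes;   /* traffic that went out to DRAM */
};

void line_init(struct line_model *l, enum write_policy policy)
{
    l->policy = policy;
    l->dirty = false;
    l->mem_writes = 0;
}

/* A store that hits the cached line. */
void line_store(struct line_model *l)
{
    if (l->policy == WRITE_THROUGH)
        l->mem_writes++;        /* memory is updated on every store */
    else
        l->dirty = true;        /* memory is now stale; defer the write */
}

/* The line is evicted to make room for another one. */
void line_evict(struct line_model *l)
{
    if (l->policy == WRITE_BACK && l->dirty) {
        l->mem_writes++;        /* one write-back covers all earlier stores */
        l->dirty = false;
    }
    assert(!l->dirty);          /* nothing stale remains after eviction */
}

/* Memory writes caused by `stores` hits to one line followed by eviction. */
unsigned long writes_for(enum write_policy policy, unsigned long stores)
{
    struct line_model l;
    line_init(&l, policy);
    for (unsigned long i = 0; i < stores; i++)
        line_store(&l);
    line_evict(&l);
    return l.mem_writes;
}
```

In this model, 100 stores to the same line cost 100 memory writes under write-through but a single write under write-back, which is where the speed advantage comes from.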

6. The Harvard split & coherency

At L1, ARM usually splits the cache into a separate I-cache (instructions) and D-cache (data) — a Harvard arrangement that lets the core fetch an instruction and a data word in the same cycle. (A subtle consequence: self-modifying or freshly loaded code needs explicit cache maintenance so the I-cache sees the new bytes.)
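For the freshly loaded or self-modifying code case, here is a minimal sketch in C of the maintenance step, assuming a GCC- or Clang-compatible compiler. `__builtin___clear_cache` is a compiler builtin rather than standard C, and the helper name `publish_code` is hypothetical. The sketch covers only cache maintenance; making the buffer executable is a separate concern not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy freshly generated instruction bytes into dst, then make sure the
   instruction side of the core will see them.
   Assumes GCC or Clang: __builtin___clear_cache is not standard C. */
void publish_code(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);

    /* The stores land in the D-cache; the I-cache may still hold old bytes. */
    memcpy(dst, src, len);

    /* On ARM this amounts to cleaning the D-cache and invalidating the
       I-cache for [dst, dst + len); on targets whose I and D caches are
       kept coherent by hardware it may compile to nothing. */
    __builtin___clear_cache((char *)dst, (char *)dst + len);
}
```

A JIT compiler or loader would call something like this after writing code and before jumping to it.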

In a multicore chip, each core has its own L1 (and often its own L2), so the same address could hold different values in different caches. Cache coherency hardware, meaning a protocol in the MESI family implemented by snooping over ARM's interconnect, keeps every core seeing one consistent value by invalidating stale copies when a core writes. This is the foundation for the multicore lessons ahead.
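The invalidate-on-write behaviour can be sketched as a tiny state machine. The C code below follows textbook MESI for one cache line shared by two cores. It is a simplification for illustration, with hypothetical names (`line_states`, `mesi_read`, `mesi_write`), and does not describe the exact protocol of any particular ARM core or interconnect.

```c
#include <assert.h>

#define NUM_CORES 2

/* MESI states for one cache line in one core's private cache. */
enum mesi { INVALID, SHARED, EXCLUSIVE, MODIFIED };

/* State of a single address in each core's cache. */
struct line_states {
    enum mesi state[NUM_CORES];
};

/* Core `core` reads the line. */
void mesi_read(struct line_states *l, int core)
{
    assert(core >= 0 && core < NUM_CORES);
    if (l->state[core] != INVALID)
        return;                          /* read hit: state unchanged */

    int others_have_copy = 0;
    for (int i = 0; i < NUM_CORES; i++) {
        if (i != core && l->state[i] != INVALID) {
            l->state[i] = SHARED;        /* snoop: M/E holders drop to S
                                            (an M holder also writes back) */
            others_have_copy = 1;
        }
    }
    l->state[core] = others_have_copy ? SHARED : EXCLUSIVE;
}

/* Core `core` writes the line. */
void mesi_write(struct line_states *l, int core)
{
    assert(core >= 0 && core < NUM_CORES);
    for (int i = 0; i < NUM_CORES; i++)
        if (i != core)
            l->state[i] = INVALID;       /* invalidate every stale copy */
    l->state[core] = MODIFIED;
}
```

Starting with the line invalid everywhere: core 0 reads and gets it Exclusive; core 1 reads and both copies become Shared; core 1 writes, so its copy becomes Modified and core 0's is Invalid; core 0 reads again and both end up Shared.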

✅ The mental model

Caches hide the memory wall by keeping hot data in small fast memory close to the core, organised as a hierarchy (L1→L2→L3→DRAM). They win because of locality, transfer data in 64-byte lines, use write-back for speed, split I/D at L1, and rely on coherency to stay consistent across cores. Write cache-friendly code and everything gets faster.

🎯 Day 22 takeaways

Quick check

  1. What two kinds of locality make caches effective?
  2. Why can a 64-byte line make sequential array access so fast?
  3. What problem does cache coherency solve?

FAQ

Why do we need caches?

DRAM is far slower than the core (the memory wall); caches keep hot data close so most accesses are fast.

What is a cache line?

The fixed-size block (typically 64 bytes on ARM) that is loaded as a unit, exploiting spatial locality.

Write-back vs write-through?

Write-through updates memory on every store; write-back updates only the cache and writes dirty lines back to memory when they are evicted (faster).
