Why do processors need caches?

Because main memory is far slower than the CPU. A core can execute an instruction in well under a nanosecond, but a DRAM access takes tens to hundreds of cycles. This gap is the memory wall. A cache is a small, very fast memory close to the core that holds recently used data and data likely to be needed soon, so most accesses are served quickly instead of stalling on DRAM.

What is the difference between write-back and write-through caches?

A write-through cache writes data to both the cache and main memory on every store, keeping memory always up to date but using more bandwidth. A write-back cache writes only to the cache and marks the line dirty, writing it back to memory only when the line is evicted. Write-back is faster and is the common policy, at the cost of more complex coherency.

What is cache coherency?

In a multicore processor each core has its own cache, so the same memory location could have different values in different caches. Cache coherency is the hardware mechanism that ensures all cores see a single consistent value, invalidating or updating other copies when one core writes. It is built from protocols like MESI, carried on ARM by snoop control logic and the coherent interconnect.

DAY 22 · SYSTEM & MEMORY

Caches in ARM — and Why They Matter

Q: What is a cache line?

A cache line is the fixed-size block, commonly 64 bytes on ARM, that the cache loads and stores as a unit. When you read a single byte that misses, the whole line around it is fetched, which exploits spatial locality because nearby data is often used soon after.

By EcrioniX · Updated Jun 6, 2026

Your CPU core is a sprinter; main memory (DRAM) is a slow postal service. If the core waited on DRAM for every access, a multi-GHz chip would crawl. The cache is the fix, and understanding it is one of the biggest levers for writing fast code on any ARM system.

1. The memory wall

A modern ARM core can issue an instruction in a fraction of a nanosecond. A DRAM access takes tens to hundreds of cycles. That growing gap is the memory wall: the processor is starved waiting for data. Caches exist to hide that latency.

2. The cache hierarchy

Rather than one cache, ARM uses a hierarchy — small/fast near the core, large/slow further out:

Level	Typical size	Typical latency	Scope
L1	32–64 KB	~1–4 cycles	per core, split I/D
L2	256 KB–1 MB	~10–20 cycles	per core or cluster
L3	2–32 MB	~30–50 cycles	shared across cores
DRAM	GBs	~100–300 cycles	main memory

Each level catches what the faster level before it missed. A well-behaved program finds almost everything in L1/L2 and rarely pays the DRAM price.
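To put a number on how the hierarchy hides DRAM latency, here is a minimal sketch of the usual average-access-time calculation. The function name `amat_cycles` and the simple chained-miss model (a miss at one level pays the full cost of the next) are assumptions for illustration, and any hit rates fed to it are made-up inputs, not measurements.

```c
#include <stddef.h>

/* Average memory access time in cycles for a chain of cache levels.
 * latency[i]  : hit latency of level i (index 0 is L1)
 * hit_rate[i] : fraction of accesses reaching level i that hit there
 * Assumption: a miss at level i pays latency[i] plus the cost of the
 * next level, and the last level falls through to DRAM. */
double amat_cycles(const double *latency, const double *hit_rate,
                   size_t levels, double dram_latency)
{
    double cost = dram_latency;
    /* Work from the outermost level back towards L1. */
    for (size_t i = levels; i-- > 0; ) {
        cost = latency[i] + (1.0 - hit_rate[i]) * cost;
    }
    return cost;
}
```

With latencies of 4, 15 and 40 cycles (picked from the ranges in the table), assumed hit rates of 95%, 80% and 50%, and a 200-cycle DRAM access, this works out to 4 + 0.05 × (15 + 0.2 × (40 + 0.5 × 200)) = 6.15 cycles per access on average.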

3. Why caches work: locality

Caches would be useless if access were random. They work because real programs have locality:

Temporal locality — if you used an address, you'll likely use it again soon (a loop counter, a hot variable).
Spatial locality — if you used an address, you'll likely use its neighbours soon (walking an array).

💡 The desk and the library

DRAM is the library across campus. The cache is your desk. You keep the books you're using (temporal) and the next few on the shelf (spatial) right on your desk, so you rarely walk to the library. Write code that "stays on the desk" and it flies.
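Spatial locality is easy to see by summing the same 2D array in two different orders. The sketch below assumes a row-major array stored flat in memory; the function names are hypothetical. Both functions return the same sum, but the row-major walk touches neighbouring addresses while the column-major walk jumps `cols` elements between accesses, so on large arrays it will typically run noticeably slower.

```c
#include <stddef.h>

/* Walks memory in address order: consecutive accesses share cache lines. */
long long sum_row_major(const int *m, size_t rows, size_t cols)
{
    long long sum = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            sum += m[r * cols + c];
    return sum;
}

/* Same elements, but each access is `cols` elements away from the last,
 * so neighbouring data in a fetched line is not used before moving on. */
long long sum_col_major(const int *m, size_t rows, size_t cols)
{
    long long sum = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            sum += m[r * cols + c];
    return sum;
}
```

Timing the two on an array much larger than the last-level cache is a simple way to observe the effect on your own hardware.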

4. Cache lines, hits & misses

The cache doesn't store single bytes; it works in cache lines, typically 64 bytes on ARM. Touch one byte that isn't cached (a miss) and the whole 64-byte line containing it is fetched. The other 63 bytes of that line can then be accessed at cache speed (hits). That's spatial locality paying off.

// Cache-friendly: walks memory in order (uses every byte of each line)
for (i = 0; i < N; i++)
    sum += a[i];

// Cache-hostile: huge stride → one useful byte per fetched line
for (i = 0; i < N; i += 4096)
    sum += a[i];

Per access, the cost is wildly different: the first loop misses once per line and then hits on the rest of it, while the second version misses on almost every access. This is why data layout often matters more than clever algorithms.
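The difference between the two loops can be counted directly. This is a minimal sketch that assumes 64-byte lines, an array starting on a line boundary, and an element size that divides 64; `lines_touched` is a hypothetical helper, not part of any library.

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 64u  /* assumed line size */

/* Number of distinct cache lines touched when visiting elements
 * 0, stride, 2*stride, ... below n_elems. Indices only increase,
 * so comparing against the previous line is enough to count them. */
size_t lines_touched(size_t n_elems, size_t stride, size_t elem_size)
{
    if (stride == 0)
        return 0;  /* a zero stride would never advance */

    size_t count = 0;
    size_t last_line = (size_t)-1;  /* sentinel: no line seen yet */
    for (size_t i = 0; i < n_elems; i += stride) {
        size_t line = (i * elem_size) / CACHE_LINE_BYTES;
        if (line != last_line) {
            count++;
            last_line = line;
        }
    }
    return count;
}
```

For 1024 four-byte elements walked in order, the 1024 accesses touch only 64 lines, so 16 accesses share each fetch. With a stride of 4096 over 1,048,576 elements, the 256 accesses touch 256 different lines: one fetch per access.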

5. Write policies

What happens on a store?

Write-through — write to cache and memory every time. Simple, always consistent, but uses more bandwidth.
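Write-back — write only to the cache, mark the line dirty, and write it to memory only on eviction or an explicit clean. Faster and the common choice, but memory can hold stale data in the meantime, so other observers (other cores, DMA devices) depend on coherency hardware or cache maintenance.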
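The bandwidth difference between the two policies can be sketched by counting memory writes for a sequence of stores. The model below is deliberately tiny and entirely hypothetical: a cache holding a single line, with stores identified only by the line they fall in. Real caches hold many lines, but the counting idea is the same.

```c
#include <stddef.h>

/* Write-through: every store is also written to memory. */
size_t mem_writes_write_through(const unsigned *line_ids, size_t n)
{
    (void)line_ids;  /* which line is stored to does not matter here */
    return n;
}

/* Write-back, one-line cache: memory is written only when a dirty
 * line is evicted, plus one final clean so memory ends up current. */
size_t mem_writes_write_back(const unsigned *line_ids, size_t n)
{
    size_t writes = 0;
    int valid = 0, dirty = 0;
    unsigned cached = 0;

    for (size_t i = 0; i < n; i++) {
        if (!valid || cached != line_ids[i]) {
            if (valid && dirty)
                writes++;          /* evict: write the dirty line back */
            cached = line_ids[i];  /* allocate the new line */
            valid = 1;
        }
        dirty = 1;                 /* the store modifies the cached line */
    }
    if (valid && dirty)
        writes++;                  /* final clean of the last dirty line */
    return writes;
}
```

For six stores that hit line 7 four times and then line 9 twice, write-through performs six memory writes while this write-back model performs two.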

6. The Harvard split & coherency

At L1, ARM usually splits the cache into a separate I-cache (instructions) and D-cache (data) — a Harvard arrangement that lets the core fetch an instruction and a data word in the same cycle. (A subtle consequence: self-modifying or freshly loaded code needs explicit cache maintenance so the I-cache sees the new bytes.)
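As a hedged illustration of that maintenance step: GCC and Clang provide the builtin `__builtin___clear_cache(begin, end)`, which makes instruction fetches see bytes recently written to that range. The wrapper name `install_code` is hypothetical, and the sketch assumes the destination buffer has already been mapped as executable elsewhere (not shown); other toolchains or bare-metal code would need their own equivalent.

```c
#include <stddef.h>
#include <string.h>

/* Copy freshly generated or loaded code into place, then make sure
 * the instruction side sees it. Assumes a GCC- or Clang-compatible
 * compiler; making `dst` executable is the caller's job. */
void install_code(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);  /* the new bytes arrive via the D-cache */

    /* Synchronise the instruction cache with the data just written
     * to [dst, dst + len). */
    __builtin___clear_cache((char *)dst, (char *)dst + len);
}
```

Skipping the second step can leave the core executing stale instructions from the I-cache, which is the subtle consequence described above.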

In a multicore chip, each core has its own L1 (and often its own L2), so the same address could hold different values in different caches. Cache coherency hardware (protocols like MESI, carried by ARM's snoop logic and coherent interconnect) keeps every core seeing one consistent value, invalidating stale copies when a core writes. This is the foundation for the multicore lessons ahead.
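To make the invalidation idea concrete, here is a simplified sketch of MESI-style state changes for a single cache line shared by two cores. It is a teaching model under stated assumptions, not how any particular ARM implementation works: the names `line_state`, `core_read` and `core_write` are hypothetical, and data transfer and write-back traffic are left out.

```c
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

enum { NUM_CORES = 2 };

/* State of ONE cache line in each core's private cache.
 * Static storage is zero-initialised, so every copy starts INVALID. */
static mesi_state line_state[NUM_CORES];

/* A core reads the line. */
void core_read(int core)
{
    if (line_state[core] != INVALID)
        return;  /* hit: the state does not change */

    int other_copy = 0;
    for (int i = 0; i < NUM_CORES; i++) {
        if (i != core && line_state[i] != INVALID) {
            /* The other core snoops the read and drops to SHARED
             * (a MODIFIED copy would be written back first). */
            line_state[i] = SHARED;
            other_copy = 1;
        }
    }
    line_state[core] = other_copy ? SHARED : EXCLUSIVE;
}

/* A core writes the line: every other copy becomes stale. */
void core_write(int core)
{
    for (int i = 0; i < NUM_CORES; i++) {
        if (i != core)
            line_state[i] = INVALID;
    }
    line_state[core] = MODIFIED;
}
```

Starting from all-invalid, a read by core 0 leaves it EXCLUSIVE, a read by core 1 moves both to SHARED, and a write by core 0 leaves core 0 MODIFIED and core 1 INVALID, so core 1's next read has to fetch the up-to-date value.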

✅ The mental model

Caches hide the memory wall by keeping hot data in small fast memory close to the core, organised as a hierarchy (L1→L2→L3→DRAM). They win because of locality, transfer data in lines (typically 64 bytes), usually use write-back for speed, split I/D at L1, and rely on coherency to stay consistent across cores. Write cache-friendly code and everything gets faster.

🎯 Day 22 takeaways

The memory wall: DRAM is roughly 100× slower than the core, or worse; caches hide it.
Hierarchy: L1 (fast, small) → L2 → L3 → DRAM (slow, big).
Caches work via temporal & spatial locality; they move 64-byte lines.
Write-back (dirty + flush on evict) is faster than write-through.
L1 is usually split I/D; coherency keeps multicore caches consistent.

Quick check

What two kinds of locality make caches effective?
Why can a 64-byte line make sequential array access so fast?
What problem does cache coherency solve?

FAQ

Why do we need caches?

DRAM is far slower than the core (the memory wall); caches keep hot data close so most accesses are fast.

What is a cache line?

The fixed block (≈64 bytes on ARM) loaded as a unit, exploiting spatial locality.

Write-back vs write-through?

Write-through updates memory on every store; write-back updates only the cache and flushes dirty lines on eviction (faster).
