What is the MESI protocol?

MESI is a cache coherency protocol where each cache line is tagged with one of four states: Modified (this cache has the only, dirty copy), Exclusive (this cache has the only, clean copy), Shared (multiple caches hold the same clean copy), or Invalid (the line is stale). Transitions between these states keep all caches consistent as cores read and write.

Why does ARM need memory barriers?

ARM uses a weakly ordered memory model, meaning the hardware may reorder memory accesses for performance. Most of the time this is invisible, but when multiple cores or devices communicate through shared memory the order can matter. Barriers such as DMB and DSB force the required ordering so that, for example, data is visible before the flag that announces it; ISB separately resynchronizes the instruction stream after a context change.

What is big.LITTLE?

big.LITTLE is ARM's heterogeneous multiprocessing design that combines high-performance big cores with energy-efficient LITTLE cores in one chip. The scheduler runs light or background tasks on the efficient cores to save power and moves demanding work to the powerful cores when needed. The modern evolution is DynamIQ, which allows flexible clusters mixing different core types.

DAY 29 · ADVANCED (64-BIT & BEYOND)

Multicore, Cache Coherency & big.LITTLE

Q: What is cache coherency?

In a multicore processor each core has its own cache, so the same memory address can be cached with different values in different cores. Cache coherency is the hardware mechanism, using protocols such as MESI or MOESI and a coherent interconnect, that ensures every core sees a single consistent value, invalidating or updating other copies whenever one core writes.

By EcrioniX · Updated Jun 6, 2026

Your phone has eight CPU cores; a server SoC can have a hundred. Putting many cores on one chip sounds easy, but making them share memory correctly and efficiently is one of the deepest problems in computer architecture. Today: how ARM cores stay coherent, why memory barriers exist, how locks really work, and the clever big.LITTLE idea that gives you both performance and battery life.

1. From one core to many (SMP)

The simplest multicore arrangement is SMP — Symmetric Multi-Processing: several identical cores share a single view of main memory, and the OS schedules threads across them. More cores means more work done in parallel — if the software is parallel and if the hardware keeps their shared memory consistent. That second "if" is the hard part.

2. The coherency problem

Recall from Day 22 that each core has its own L1 (and often L2) cache. Now imagine two cores both cache address X = 5. Core 0 writes X = 10 into its cache. If nothing intervenes, Core 1 still sees the stale 5 in its own cache. Disaster.

Figure — When Core 0 writes X, the interconnect invalidates other cached copies so Core 1 can't read stale data.

3. Cache coherency & MESI

Cache coherency hardware solves this automatically. Caches snoop a shared coherent interconnect (ARM's CCI/CMN families) and follow a protocol — classically MESI — that tags every cache line with one of four states:

State	Meaning
M — Modified	this cache has the only copy, and it's dirty (newer than memory)
E — Exclusive	this cache has the only copy, and it's clean
S — Shared	multiple caches hold the same clean copy
I — Invalid	the line is stale and must be refetched
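The four states can be made concrete with a toy model. The sketch below tracks the state of a single cache line in each of N caches and applies simplified read and write transitions. The names (`Mesi`, `LineStates`, `read`, `write`) are hypothetical, and real hardware also handles write-backs, evictions and bus arbitration that are left out here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy model of ONE cache line's MESI state in each of N caches.
// Hypothetical names; real hardware keeps this in the cache tags.
enum class Mesi { Modified, Exclusive, Shared, Invalid };

template <std::size_t N>
struct LineStates {
    std::array<Mesi, N> state;

    LineStates() { state.fill(Mesi::Invalid); }

    // Core `c` reads the line.
    void read(std::size_t c) {
        assert(c < N);
        if (state[c] != Mesi::Invalid) return;  // hit: M, E and S are unchanged
        bool others_have_copy = false;
        for (std::size_t i = 0; i < N; ++i) {
            if (i == c || state[i] == Mesi::Invalid) continue;
            others_have_copy = true;
            // A Modified holder would write its data back first;
            // both M and E holders drop to Shared.
            state[i] = Mesi::Shared;
        }
        state[c] = others_have_copy ? Mesi::Shared : Mesi::Exclusive;
    }

    // Core `c` writes the line: every other copy becomes Invalid.
    void write(std::size_t c) {
        assert(c < N);
        for (std::size_t i = 0; i < N; ++i) {
            if (i != c) state[i] = Mesi::Invalid;
        }
        state[c] = Mesi::Modified;
    }
};
```

With two caches, a read by core 0 leaves it Exclusive, a following read by core 1 makes both Shared, and a write by core 0 leaves core 0 Modified and core 1 Invalid.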

When a core writes, the protocol moves its line to Modified and forces every other copy to Invalid — so no one ever reads stale data. (ARM also uses MOESI, which adds an Owned state to share dirty data more efficiently.) All of this is invisible to your code; it Just Works — but it has a performance cost, which is why false sharing (two cores hammering different variables that happen to sit in the same cache line) can quietly wreck scaling.
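False sharing is easiest to see in a data layout. The sketch below contrasts two counters packed side by side with two counters that are each aligned to their own cache line. The 64-byte line size is an assumption (common, but not guaranteed on every core), and the struct and function names are made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Assumed cache-line size; check the target core's documentation.
constexpr std::size_t kCacheLine = 64;

// Both counters are likely to share one cache line, so writes from two
// cores keep invalidating each other's copy of that line.
struct PackedCounters {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Each counter is aligned to its own line, so the two cores stop
// fighting over the same line.
struct PaddedCounters {
    alignas(kCacheLine) std::atomic<long> a{0};
    alignas(kCacheLine) std::atomic<long> b{0};
};

// Two threads each bump their own counter `iters` times; returns the total.
template <typename Counters>
long hammer(Counters& c, long iters) {
    assert(iters >= 0);
    std::thread t1([&] {
        for (long i = 0; i < iters; ++i) c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        for (long i = 0; i < iters; ++i) c.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return c.a.load() + c.b.load();
}
```

Both layouts produce the same totals; the difference only shows up in timing. Running `hammer` with a large iteration count on each struct and comparing the wall-clock time is one way to observe it, though the size of the gap depends on the machine.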

4. Memory ordering & barriers

For speed, ARM uses a weakly ordered memory model: the hardware may reorder loads and stores as long as a single thread's own logic looks correct. On one core that's invisible. Across cores it can break hand-written synchronization: another core might see your "data ready" flag set before the data it guards is actually visible.

Memory barriers enforce the order you need:

Barrier	Forces
DMB	Data Memory Barrier — orders memory accesses before/after it
DSB	Data Synchronization Barrier — waits until prior accesses actually complete
ISB	Instruction Synchronization Barrier — flushes the pipeline so newly-changed context takes effect

// Producer: write data, then publish the flag. Order matters!
STR x1, [x0]    // write the data
DMB ISH         // ensure data is visible BEFORE the flag
STR x2, [x3]    // set "ready" flag

In practice you rarely write barriers by hand — you use the C/C++ atomics or OS primitives that emit them correctly. But knowing they exist explains a whole category of "impossible" multithreading bugs.
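As a rough illustration of the same publish pattern with C++ atomics, the sketch below uses a release store on the flag and an acquire load on the reader side. The names (`payload`, `ready`, `producer`, `consumer`, `publish_once`) are hypothetical. On AArch64 a compiler is free to implement these orderings with barrier instructions or with load-acquire/store-release instructions; which one it picks is not something the code controls.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // plain data being published
std::atomic<bool> ready{false};  // the "ready" flag

void producer(int value) {
    payload = value;                               // write the data
    ready.store(true, std::memory_order_release);  // publish: data is ordered before the flag
}

int consumer() {
    // Spin until the flag is set; the acquire load pairs with the release store.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;  // guaranteed to observe the producer's write
}

// Runs one producer and one consumer thread; returns what the consumer read.
int publish_once(int value) {
    ready.store(false, std::memory_order_relaxed);  // reset before the threads start
    int seen = 0;
    std::thread c([&] { seen = consumer(); });
    std::thread p([&] { producer(value); });
    p.join();
    c.join();
    assert(seen == value);
    return seen;
}
```

If both memory orders were weakened to `std::memory_order_relaxed`, the consumer would no longer be guaranteed to see the payload, which is the "impossible" bug the section describes.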

5. Atomics — how locks are built

A lock or counter shared by cores needs atomic read-modify-write. Classic ARM provides load-exclusive / store-exclusive:

atomic_inc:
    LDXR  w1, [x0]          // load-exclusive: read & mark address
    ADD   w1, w1, #1
    STXR  w2, w1, [x0]      // store-exclusive: succeeds only if no one else wrote
    CBNZ  w2, atomic_inc    // w2 != 0 → someone interfered, retry

If another core wrote to the address in between, the store-exclusive fails and you retry, so the increment only ever lands on the value you actually loaded. ARMv8.1 added the LSE (Large System Extensions) single-instruction atomics (LDADD, SWP, CAS), which scale far better on many-core systems because the operation can be performed near the data rather than bouncing the cache line between cores.
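The two styles have close analogues in C++ atomics. In the sketch below, `increment_with_retry` mirrors the shape of the load-exclusive/store-exclusive loop using `compare_exchange_weak`, while `increment_single` uses `fetch_add`, which a compiler targeting a core with LSE may turn into a single atomic instruction (that depends on compiler flags and is an assumption here). All function names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Retry-loop increment: same shape as the LDXR/STXR loop.
int increment_with_retry(std::atomic<int>& counter) {
    int seen = counter.load(std::memory_order_relaxed);
    // compare_exchange_weak fails if another thread changed the value
    // (or spuriously); on failure it reloads `seen` and we try again.
    while (!counter.compare_exchange_weak(seen, seen + 1,
                                          std::memory_order_relaxed)) {
    }
    return seen + 1;
}

// Single-call increment: one atomic read-modify-write.
int increment_single(std::atomic<int>& counter) {
    return counter.fetch_add(1, std::memory_order_relaxed) + 1;
}

// Several threads increment one shared counter; returns the final value.
int count_in_parallel(int threads, int per_thread) {
    assert(threads > 0 && per_thread >= 0);
    std::atomic<int> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            for (int i = 0; i < per_thread; ++i) {
                // Even threads use the retry loop, odd threads use fetch_add.
                if (t % 2 == 0) increment_with_retry(counter);
                else increment_single(counter);
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter.load();
}
```

Because every increment is atomic, no updates are lost: four threads doing 10,000 increments each always end at 40,000. Replacing the atomic with a plain `int` would make the result unpredictable.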

6. big.LITTLE & DynamIQ

Not all work needs a powerful core. Checking email shouldn't drain the same power as gaming. ARM's answer is heterogeneous multiprocessing:

big cores — wide, out-of-order, high performance, higher power (e.g. Cortex-X/A7xx).
LITTLE cores — smaller, in-order, very energy-efficient (e.g. Cortex-A5xx).

The OS scheduler runs light/background tasks on the efficient cores and migrates demanding threads to the big cores on demand — big performance when you need it, long battery life when you don't. The modern evolution, DynamIQ, puts mixed core types in a shared, coherent cluster with a common L3, allowing flexible combinations (e.g. 1 prime + 3 big + 4 little) — exactly the layout in today's flagship phone SoCs.
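To make the placement idea tangible, here is a deliberately naive sketch: each task carries an estimated CPU demand, and anything at or above a threshold goes to a big core. The `Task` type, the `utilization` field and the 0.6 threshold are all invented for illustration; real schedulers weigh energy models, thermal limits and task history, and work nothing like this simple rule.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class CoreType { Little, Big };

struct Task {
    std::string name;
    double utilization;  // hypothetical estimate of CPU demand, 0.0 to 1.0
};

// Toy policy: demanding tasks go to a big core, everything else to a LITTLE core.
CoreType place(const Task& task, double big_threshold = 0.6) {
    assert(task.utilization >= 0.0 && task.utilization <= 1.0);
    return task.utilization >= big_threshold ? CoreType::Big : CoreType::Little;
}

// Counts how many of the tasks would land on big cores under the toy policy.
int count_on_big(const std::vector<Task>& tasks, double big_threshold = 0.6) {
    int n = 0;
    for (const Task& t : tasks) {
        if (place(t, big_threshold) == CoreType::Big) ++n;
    }
    return n;
}
```

Under this rule a task such as `{"mail sync", 0.1}` stays on a LITTLE core while `{"game render", 0.9}` is placed on a big core.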

💡 A kitchen brigade

big.LITTLE is like a restaurant kitchen: the head chef (big core) handles the complex dishes fast but is expensive to keep busy, while prep cooks (LITTLE cores) handle the steady, simple work cheaply. The manager (scheduler) assigns each task to the right person — and the whole kitchen shares one pantry (coherent memory).

✅ The mental model

Multicore = many cores sharing memory (SMP). Per-core caches need coherency (MESI/MOESI over a snooping interconnect) so no one reads stale data. ARM's weak memory model reorders accesses, so cross-core code needs barriers (DMB/DSB for data ordering, ISB for instruction-stream changes) and atomics (LDXR/STXR or LSE) for correct locks. big.LITTLE/DynamIQ mixes powerful and efficient cores for performance and battery life.

🎯 Day 29 takeaways

SMP: identical cores share memory; the OS schedules across them.
Cache coherency (MESI: Modified/Exclusive/Shared/Invalid) stops stale reads.
Beware false sharing — unrelated vars in one cache line kill scaling.
ARM is weakly ordered; use DMB/DSB barriers for cross-core ordering, and ISB after changing context or code.
Atomics via LDXR/STXR retry loops, or scalable LSE instructions.
big.LITTLE/DynamIQ = heterogeneous big + efficient cores, coherent cluster.

Quick check

What do the four MESI states mean?
Why does a weakly ordered model need memory barriers?
How does store-exclusive (STXR) guarantee atomicity?
What problem does big.LITTLE solve?

FAQ

What is cache coherency?

Hardware that keeps per-core caches consistent so every core sees one value for each address, via protocols like MESI over a coherent interconnect.

What is MESI?

A coherency protocol tagging each cache line Modified, Exclusive, Shared or Invalid to coordinate reads and writes across cores.

Why memory barriers?

ARM may reorder memory accesses; barriers enforce ordering when cores or devices communicate through shared memory: DMB and DSB order data accesses, and ISB resynchronizes the instruction stream.

What is big.LITTLE?

A heterogeneous design pairing high-performance big cores with energy-efficient LITTLE cores; DynamIQ is its flexible modern evolution.
