DAY 29 · ADVANCED (64-BIT & BEYOND)

Multicore, Cache Coherency & big.LITTLE

By EcrioniX · Updated Jun 6, 2026

Your phone has eight CPU cores; a server SoC can have a hundred. Putting many cores on one chip sounds easy, but making them share memory correctly and efficiently is one of the deepest problems in computer architecture. Today: how ARM cores stay coherent, why memory barriers exist, how locks really work, and the clever big.LITTLE idea that gives you both performance and battery life.

1. From one core to many (SMP)

The simplest multicore arrangement is SMP — Symmetric Multi-Processing: several identical cores share a single view of main memory, and the OS schedules threads across them. More cores means more work done in parallel — if the software is parallel and if the hardware keeps their shared memory consistent. That second "if" is the hard part.

2. The coherency problem

Recall from Day 22 that each core has its own L1 (and often L2) cache. Now imagine two cores both cache address X = 5. Core 0 writes X = 10 into its cache. Core 1 still sees the stale 5 in its cache. Disaster.

[Figure: coherent interconnect keeps caches in sync. Core 0 L1 holds X=10 (Modified); Core 1 L1 holds X (Invalid → refetch). Both connect through the coherent interconnect (snoop) to shared L3 / DRAM. Core 0's write invalidates Core 1's stale copy.]
Figure — When Core 0 writes X, the interconnect invalidates other cached copies so Core 1 can't read stale data.

3. Cache coherency & MESI

Cache coherency hardware solves this automatically. Caches snoop a shared coherent interconnect (ARM's CCI/CMN families) and follow a protocol — classically MESI — that tags every cache line with one of four states:

State           Meaning
M (Modified)    this cache has the only copy, and it's dirty (newer than memory)
E (Exclusive)   this cache has the only copy, and it's clean
S (Shared)      multiple caches hold the same clean copy
I (Invalid)     the line is stale and must be refetched

When a core writes, the protocol moves its line to Modified and forces every other copy to Invalid — so no one ever reads stale data. (ARM also uses MOESI, which adds an Owned state to share dirty data more efficiently.) All of this is invisible to your code; it Just Works — but it has a performance cost, which is why false sharing (two cores hammering different variables that happen to sit in the same cache line) can quietly wreck scaling.
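The state changes described above can be sketched as a toy model in C. This is an illustration only, not how any real ARM interconnect is implemented: it tracks a single cache line across a few cores, ignores timing, and all names (`mesi_line`, `mesi_read`, `mesi_write`, `NUM_CORES`) are hypothetical.

```c
#include <assert.h>

/* Toy model only: one cache line, a handful of cores, no timing. */
enum { NUM_CORES = 4 };

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

typedef struct {
    mesi_state state[NUM_CORES]; /* per-core state of the single modelled line */
    int        value[NUM_CORES]; /* per-core cached copy of the data */
    int        memory;           /* backing memory for the line */
} mesi_line;

/* Start with the line uncached everywhere. */
void mesi_init(mesi_line *l, int initial) {
    for (int i = 0; i < NUM_CORES; i++) {
        l->state[i] = INVALID;
        l->value[i] = 0;
    }
    l->memory = initial;
}

/* Read by `core`: on a miss, fetch the line and settle on Shared or Exclusive. */
int mesi_read(mesi_line *l, int core) {
    assert(core >= 0 && core < NUM_CORES);
    if (l->state[core] != INVALID)
        return l->value[core];            /* hit: M, E and S are all readable */

    int others = 0;
    for (int i = 0; i < NUM_CORES; i++) {
        if (i == core || l->state[i] == INVALID) continue;
        if (l->state[i] == MODIFIED)
            l->memory = l->value[i];      /* dirty owner writes back first */
        l->state[i] = SHARED;             /* M or E degrades to S */
        others++;
    }
    l->value[core] = l->memory;
    l->state[core] = others ? SHARED : EXCLUSIVE;
    return l->value[core];
}

/* Write by `core`: invalidate all other copies, then hold the line Modified. */
void mesi_write(mesi_line *l, int core, int v) {
    assert(core >= 0 && core < NUM_CORES);
    for (int i = 0; i < NUM_CORES; i++) {
        if (i == core) continue;
        if (l->state[i] == MODIFIED)
            l->memory = l->value[i];      /* keep the other core's dirty data */
        l->state[i] = INVALID;
    }
    l->value[core] = v;
    l->state[core] = MODIFIED;
}
```

Replaying the X = 5 scenario from section 2 in this model: both cores read X and end up Shared, Core 0's write moves its copy to Modified and Core 1's to Invalid, and Core 1's next read misses and picks up the new value instead of the stale one.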

4. Memory ordering & barriers

For speed, ARM uses a weakly ordered memory model: the hardware may reorder loads and stores as long as a single thread's own logic looks correct. On one core that's invisible. Across cores it can break hand-written synchronization — e.g., another core might see your "data ready" flag set before the data it points to is actually visible.

Memory barriers enforce the order you need:

Barrier   Forces
DMB       Data Memory Barrier: orders memory accesses before/after it
DSB       Data Synchronization Barrier: waits until prior accesses actually complete
ISB       Instruction Synchronization Barrier: flushes the pipeline so newly-changed context takes effect

    // Producer: write data, then publish the flag. Order matters!
    STR  x1, [x0]      // write the data
    DMB  ISH           // ensure data is visible BEFORE the flag
    STR  x2, [x3]      // set "ready" flag

In practice you rarely write barriers by hand — you use the C/C++ atomics or OS primitives that emit them correctly. But knowing they exist explains a whole category of "impossible" multithreading bugs.
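As a minimal sketch of what "use the atomics" looks like, here is the same data-then-flag pattern written with C11 `<stdatomic.h>`. The `mailbox` type and its functions are hypothetical names invented for this example. The release store and acquire load are what tell the compiler to preserve the ordering; on AArch64, compilers typically lower them to store-release / load-acquire instructions or to barrier-based sequences, depending on the target and options.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mailbox: one plain payload guarded by an atomic "ready" flag. */
typedef struct {
    int         data;
    atomic_bool ready;
} mailbox;

void mailbox_init(mailbox *m) {
    m->data = 0;
    atomic_init(&m->ready, false);
}

/* Producer: write the payload, then publish the flag with release ordering,
 * so the payload write cannot become visible after the flag. */
void mailbox_publish(mailbox *m, int value) {
    m->data = value;
    atomic_store_explicit(&m->ready, true, memory_order_release);
}

/* Consumer: acquire-load the flag; the payload is read only if it is set. */
bool mailbox_try_consume(mailbox *m, int *out) {
    assert(out != NULL);
    if (!atomic_load_explicit(&m->ready, memory_order_acquire))
        return false;
    *out = m->data;
    return true;
}
```

In real use the producer and consumer would run on different threads, with the consumer polling or sleeping until `mailbox_try_consume` returns true. If the flag were a plain `bool`, nothing would stop the compiler or the hardware from making the flag visible before the data.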

5. Atomics — how locks are built

A lock or counter shared by cores needs atomic read-modify-write. Classic ARM provides load-exclusive / store-exclusive:

    atomic_inc:
        LDXR  w1, [x0]          // load-exclusive: read & mark address
        ADD   w1, w1, #1
        STXR  w2, w1, [x0]      // store-exclusive: succeeds only if no one else wrote
        CBNZ  w2, atomic_inc    // w2!=0 → someone interfered, retry

If another core touched the address in between, the store-exclusive fails and you retry — guaranteeing atomicity. ARMv8.1 added the LSE (Large System Extensions) single-instruction atomics (LDADD, SWP, CAS) that scale far better on many-core systems by doing the operation near the data rather than bouncing the cache line between cores.
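The same two approaches can be sketched in C11. The function names below are hypothetical. The first function mirrors the load/modify/try-store/retry shape of the LDXR/STXR loop using a weak compare-and-swap; the second leaves the whole operation to a single `atomic_fetch_add`, which a compiler targeting a core with LSE may be able to turn into one LDADD-style instruction, and otherwise into an exclusive loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Retry-loop increment: same shape as the LDXR/STXR loop. Returns the new value. */
int counter_inc_cas(atomic_int *counter) {
    assert(counter != NULL);
    int seen = atomic_load_explicit(counter, memory_order_relaxed);
    /* On failure, the weak CAS refreshes `seen` with the current value. */
    while (!atomic_compare_exchange_weak_explicit(
               counter, &seen, seen + 1,
               memory_order_relaxed, memory_order_relaxed)) {
        /* someone else changed the counter, or the CAS failed spuriously: retry */
    }
    return seen + 1;
}

/* Single-call increment: the compiler picks the instruction sequence. */
int counter_inc_fetch_add(atomic_int *counter) {
    assert(counter != NULL);
    return atomic_fetch_add_explicit(counter, 1, memory_order_relaxed) + 1;
}
```

Relaxed ordering is enough here because only the counter itself must be atomic; a counter that also publishes other data would need acquire/release ordering as in section 4.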

6. big.LITTLE & DynamIQ

Not all work needs a powerful core: checking email shouldn't draw the same power as gaming. ARM's answer is heterogeneous multiprocessing, which pairs high-performance "big" cores with energy-efficient "LITTLE" cores on the same chip.

The OS scheduler runs light/background tasks on the efficient cores and migrates demanding threads to the big cores on demand — big performance when you need it, long battery life when you don't. The modern evolution, DynamIQ, puts mixed core types in a shared, coherent cluster with a common L3, allowing flexible combinations (e.g. 1 prime + 3 big + 4 little) — exactly the layout in today's flagship phone SoCs.
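To make the migration idea concrete, here is a deliberately simplified placement rule in C. It is a toy, not how any real OS scheduler decides: the thresholds, the `core_kind` type and `place_task` are all assumptions made up for this sketch, and real schedulers weigh far more than a single utilisation number.

```c
#include <assert.h>

typedef enum { CORE_LITTLE, CORE_BIG } core_kind;

/* Hypothetical thresholds on recent utilisation, in percent of one core. */
enum { UP_THRESHOLD = 80, DOWN_THRESHOLD = 30 };

/* Choose where a task runs next, given where it runs now and how busy it was. */
core_kind place_task(core_kind current, int util_percent) {
    assert(util_percent >= 0 && util_percent <= 100);
    if (current == CORE_LITTLE && util_percent >= UP_THRESHOLD)
        return CORE_BIG;      /* demanding: migrate up */
    if (current == CORE_BIG && util_percent <= DOWN_THRESHOLD)
        return CORE_LITTLE;   /* light: migrate down to save energy */
    return current;           /* in between: stay put to avoid ping-ponging */
}
```

The gap between the two thresholds is a design choice: a task hovering around a single cut-off would otherwise bounce between core types, and each migration has its own cost.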

💡 A kitchen brigade

big.LITTLE is like a restaurant kitchen: the head chef (big core) handles the complex dishes fast but is expensive to keep busy, while prep cooks (LITTLE cores) handle the steady, simple work cheaply. The manager (scheduler) assigns each task to the right person — and the whole kitchen shares one pantry (coherent memory).

✅ The mental model

Multicore = many cores sharing memory (SMP). Per-core caches need coherency (MESI/MOESI over a snooping interconnect) so no one reads stale data. ARM's weak memory model reorders accesses, so cross-core code needs barriers (DMB/DSB/ISB) and atomics (LDXR/STXR or LSE) for correct locks. big.LITTLE/DynamIQ mixes powerful and efficient cores for performance and battery life.

🎯 Day 29 takeaways

Quick check

  1. What do the four MESI states mean?
  2. Why does a weakly ordered model need memory barriers?
  3. How does store-exclusive (STXR) guarantee atomicity?
  4. What problem does big.LITTLE solve?

FAQ

What is cache coherency?

Hardware that keeps per-core caches consistent so every core sees one value for each address, via protocols like MESI over a coherent interconnect.

What is MESI?

A coherency protocol tagging each cache line Modified, Exclusive, Shared or Invalid to coordinate reads and writes across cores.

Why memory barriers?

ARM may reorder memory accesses; barriers (DMB/DSB/ISB) enforce ordering when cores or devices communicate through shared memory.

What is big.LITTLE?

A heterogeneous design pairing high-performance big cores with energy-efficient LITTLE cores; DynamIQ is its flexible modern evolution.
