Your phone has eight CPU cores; a server SoC can have a hundred. Putting many cores on one chip sounds easy, but making them share memory correctly and efficiently is one of the deepest problems in computer architecture. Today: how ARM cores stay coherent, why memory barriers exist, how locks really work, and the clever big.LITTLE idea that gives you both performance and battery life.
The simplest multicore arrangement is SMP — Symmetric Multi-Processing: several identical cores share a single view of main memory, and the OS schedules threads across them. More cores means more work done in parallel — if the software is parallel and if the hardware keeps their shared memory consistent. That second "if" is the hard part.
Recall from Day 22 that each core has its own L1 (and often L2) cache. Now imagine two cores have both cached address X, which holds 5. Core 0 writes X = 10 into its cache. Unless something intervenes, Core 1 still sees the stale 5 in its own cache. Disaster.
Cache coherency hardware solves this automatically. Caches snoop a shared coherent interconnect (ARM's CCI/CMN families) and follow a protocol — classically MESI — that tags every cache line with one of four states:
| State | Meaning |
|---|---|
| M — Modified | this cache has the only copy, and it's dirty (newer than memory) |
| E — Exclusive | this cache has the only copy, and it's clean |
| S — Shared | multiple caches hold the same clean copy |
| I — Invalid | the line is stale and must be refetched |
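
The four states in the table can be exercised with a toy model. The sketch below tracks a single cache line across N cores and applies simplified plain-MESI rules for reads and writes. The names `LineModel`, `read`, `write` and `mesi_walkthrough` are invented for illustration, and real interconnects handle many more cases (write-backs, evictions, races between requests), so treat this as a teaching aid rather than a description of ARM hardware.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy model of ONE cache line tracked across N cores (not real hardware).
enum class Mesi { Modified, Exclusive, Shared, Invalid };

template <std::size_t N>
struct LineModel {
    std::array<Mesi, N> state;

    LineModel() { state.fill(Mesi::Invalid); }

    // Core `c` reads the line.
    void read(std::size_t c) {
        if (state[c] != Mesi::Invalid) return;  // hit: state unchanged
        bool others_have_copy = false;
        for (std::size_t i = 0; i < N; ++i) {
            if (i == c || state[i] == Mesi::Invalid) continue;
            others_have_copy = true;
            // A Modified holder would write back here; M and E drop to Shared.
            state[i] = Mesi::Shared;
        }
        state[c] = others_have_copy ? Mesi::Shared : Mesi::Exclusive;
    }

    // Core `c` writes the line: it becomes the sole, dirty owner.
    void write(std::size_t c) {
        for (std::size_t i = 0; i < N; ++i) {
            if (i != c) state[i] = Mesi::Invalid;  // invalidate other copies
        }
        state[c] = Mesi::Modified;
    }
};

// Replays the two-core X = 5 -> 10 scenario from the text.
inline void mesi_walkthrough() {
    LineModel<2> x;
    x.read(0);   // Core 0 caches X alone
    assert(x.state[0] == Mesi::Exclusive);
    x.read(1);   // Core 1 caches X too
    assert(x.state[0] == Mesi::Shared && x.state[1] == Mesi::Shared);
    x.write(0);  // Core 0 writes X = 10
    assert(x.state[0] == Mesi::Modified && x.state[1] == Mesi::Invalid);
}
```

Because Core 1's copy ends up Invalid, its next read of X must miss and fetch the new value instead of returning the stale 5.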
When a core writes, the protocol moves its line to Modified and forces every other copy to Invalid — so no one ever reads stale data. (ARM also uses MOESI, which adds an Owned state to share dirty data more efficiently.) All of this is invisible to your code; it Just Works — but it has a performance cost, which is why false sharing (two cores hammering different variables that happen to sit in the same cache line) can quietly wreck scaling.
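One common way to avoid false sharing is to give each hot variable its own cache line. The sketch below pads two per-thread counters to an assumed 64-byte line size; that size is typical for ARM cores but is not guaranteed, and `PaddedCounter` and `run_padded_counters` are illustrative names rather than any library's API.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Assumed line size: 64 bytes is common on ARM cores but not guaranteed.
constexpr std::size_t kCacheLine = 64;

// Each counter is aligned and padded to fill a whole cache line, so two cores
// incrementing different counters never invalidate each other's line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per line");

// Runs two threads, each incrementing its own counter `iterations` times,
// and returns the sum of both counters.
inline long run_padded_counters(long iterations) {
    assert(iterations >= 0);
    PaddedCounter counters[2];
    auto work = [&](int idx) {
        for (long i = 0; i < iterations; ++i) {
            counters[idx].value.fetch_add(1, std::memory_order_relaxed);
        }
    };
    std::thread t0(work, 0), t1(work, 1);
    t0.join();
    t1.join();
    return counters[0].value.load() + counters[1].value.load();
}
```

Removing the `alignas` would leave the result unchanged but let both counters land in one line, so each increment could force the line to migrate between cores. The difference shows up only in timing, which is why the problem is easy to miss.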
For speed, ARM uses a weakly ordered memory model: the hardware may reorder loads and stores to different addresses as long as each core, viewed on its own, still behaves as if its instructions ran in program order. On one core that's invisible. Across cores it can break hand-written synchronization: another core might see your "data ready" flag set before the data that the flag guards is actually visible.
Memory barriers enforce the order you need:
| Barrier | Name and effect |
|---|---|
| DMB | Data Memory Barrier: memory accesses before it are observed before memory accesses after it |
| DSB | Data Synchronization Barrier: execution waits until prior memory accesses have actually completed |
| ISB | Instruction Synchronization Barrier: flushes the pipeline so later instructions are refetched and see context changes (such as system register writes) made before it |
In practice you rarely write barriers by hand — you use the C/C++ atomics or OS primitives that emit them correctly. But knowing they exist explains a whole category of "impossible" multithreading bugs.
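The "data ready" flag pattern can be written safely with C++ atomics. In the sketch below the producer publishes with a release store and the consumer waits with an acquire load; on ARM targets compilers typically lower these to ordered load/store instructions or DMB-based sequences, depending on the architecture version. `Mailbox`, `publish`, `consume` and `handoff` are hypothetical names chosen for this example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names for illustration: `payload` is ordinary data and
// `ready` plays the role of the "data ready" flag.
struct Mailbox {
    int payload = 0;
    std::atomic<bool> ready{false};
};

// Producer: write the data, then publish the flag with release ordering.
// The release store keeps the payload write from moving after the flag write.
inline void publish(Mailbox& box, int value) {
    box.payload = value;
    box.ready.store(true, std::memory_order_release);
}

// Consumer: spin until the flag is seen with acquire ordering, then read.
// The acquire load keeps the payload read from moving before the flag read.
inline int consume(Mailbox& box) {
    while (!box.ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return box.payload;
}

// Runs one producer and one consumer thread and returns what was received.
inline int handoff(int value) {
    Mailbox box;
    int received = -1;
    std::thread consumer([&] { received = consume(box); });
    std::thread producer([&] { publish(box, value); });
    producer.join();
    consumer.join();
    assert(box.ready.load(std::memory_order_relaxed));  // flag was published
    return received;
}
```

If both accesses to `ready` used `std::memory_order_relaxed` instead, the language would no longer guarantee that the consumer sees the payload, and the unsynchronized access to `payload` would be a data race. A weakly ordered machine is where that failure is most likely to show up.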
A lock or counter shared by cores needs an atomic read-modify-write. Classic ARM provides a load-exclusive / store-exclusive pair (LDXR/STXR): the load-exclusive reads the value and marks the address for monitoring, the core computes the new value, and the store-exclusive writes it back only if the mark is still intact.
If another core touched the address in between, the store-exclusive fails and you retry, which guarantees atomicity. ARMv8.1 added the LSE (Large System Extensions) single-instruction atomics (LDADD, SWP, CAS), which scale far better on many-core systems because the operation can be performed near the data rather than bouncing the cache line between cores.
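Both styles have a portable C++ counterpart. The sketch below increments a shared counter in two ways: a compare-exchange loop that retries when another thread got there first, which has the same shape as an exclusive load/store loop, and a single `fetch_add` call. Which instructions the compiler emits depends on the target and its flags, so the mapping to exclusive pairs or LSE atomics is an assumption about typical toolchains. The function names are illustrative.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Retry-loop increment: read, compute, then try to commit. If another thread
// changed the value in between, compare_exchange_weak fails, refreshes
// `expected` with the current value, and the loop tries again.
inline void increment_with_retry(std::atomic<int>& counter) {
    int expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(expected, expected + 1,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
        // `expected` now holds the latest value; loop and retry.
    }
}

// Single-call increment: the toolchain picks the instruction sequence.
inline void increment_single(std::atomic<int>& counter) {
    counter.fetch_add(1, std::memory_order_acq_rel);
}

// Hammers one counter from several threads, alternating the two styles.
inline int count_in_parallel(int threads, int per_thread) {
    assert(threads > 0 && per_thread >= 0);
    std::atomic<int> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, per_thread, t] {
            for (int i = 0; i < per_thread; ++i) {
                if (t % 2 == 0) increment_with_retry(counter);
                else            increment_single(counter);
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter.load();
}
```

No increment is lost however the threads interleave, so `count_in_parallel(4, 1000)` returns 4000. A plain `int` with `++` would offer no such guarantee.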
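Not all work needs a powerful core. Checking email shouldn't drain the same power as gaming. ARM's answer is heterogeneous multiprocessing, marketed as big.LITTLE: high-performance "big" cores paired on the same chip with energy-efficient "LITTLE" cores.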
The OS scheduler runs light/background tasks on the efficient cores and migrates demanding threads to the big cores on demand — big performance when you need it, long battery life when you don't. The modern evolution, DynamIQ, puts mixed core types in a shared, coherent cluster with a common L3, allowing flexible combinations (e.g. 1 prime + 3 big + 4 little) — exactly the layout in today's flagship phone SoCs.
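The placement decision can be pictured with a deliberately tiny model. In the sketch below, the `Task` type, the `load` metric and the 0.8 threshold are all invented for illustration. Real schedulers track utilization over time and weigh energy costs, so this shows only the basic idea of routing heavy work to big cores.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class CoreType { Little, Big };

// Hypothetical task description: `load` is the fraction of a LITTLE core's
// capacity the task has recently needed (0.0 = idle, 1.0 = saturating it).
struct Task {
    std::string name;
    double load;
};

// Toy placement rule with an assumed threshold: tasks that would nearly
// saturate a LITTLE core go to a big core; everything else stays on LITTLE.
inline CoreType place(const Task& task, double up_threshold = 0.8) {
    assert(task.load >= 0.0);
    return task.load >= up_threshold ? CoreType::Big : CoreType::Little;
}

// Applies the rule to a batch of tasks, preserving their order.
inline std::vector<CoreType> place_all(const std::vector<Task>& tasks) {
    std::vector<CoreType> out;
    out.reserve(tasks.size());
    for (const Task& t : tasks) out.push_back(place(t));
    return out;
}
```

With these assumptions, a background mail sync at a load of 0.1 stays on a LITTLE core, while a game render thread at 0.95 is placed on a big core.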
big.LITTLE is like a restaurant kitchen: the head chef (big core) handles the complex dishes fast but is expensive to keep busy, while prep cooks (LITTLE cores) handle the steady, simple work cheaply. The manager (scheduler) assigns each task to the right person — and the whole kitchen shares one pantry (coherent memory).
Multicore = many cores sharing memory (SMP). Per-core caches need coherency (MESI/MOESI over a snooping interconnect) so no one reads stale data. ARM's weak memory model reorders accesses, so cross-core code needs barriers (DMB/DSB/ISB) and atomics (LDXR/STXR or LSE) for correct locks. big.LITTLE/DynamIQ mixes powerful and efficient cores for performance and battery life.
Cache coherency: Hardware that keeps per-core caches consistent so every core sees one value for each address, via protocols like MESI over a coherent interconnect.
MESI: A coherency protocol tagging each cache line Modified, Exclusive, Shared or Invalid to coordinate reads and writes across cores.
Memory barriers: ARM may reorder memory accesses; barriers (DMB/DSB/ISB) enforce ordering when cores or devices communicate through shared memory.
big.LITTLE: A heterogeneous design pairing high-performance big cores with energy-efficient LITTLE cores; DynamIQ is its flexible modern evolution.