You've learned how an ARM chip works from the transistor up to multicore. The final skill is making code fast on it. Performance work is part science, part discipline — and the golden rule is humbling: never optimise by guessing. Today we turn everything from the course into a practical optimisation toolkit, and close out the 30-day journey.
The single most expensive mistake in performance work is optimising the wrong thing. Programs spend most of their time in a tiny fraction of the code, so you must find the real hotspots before touching anything. ARM cores include a PMU (Performance Monitor Unit): hardware counters for cycles, instructions, cache misses, branch mispredictions and TLB misses. Linux perf reads them: perf stat reports whole-run totals for a command, while perf record followed by perf report attributes samples to individual functions.
The counters tell you why code is slow: a high cache-miss rate means you're memory-bound; a high branch-miss rate means unpredictable control flow; low IPC (instructions per cycle) with neither suggests dependency stalls. Diagnosis dictates the fix.
From Day 22: DRAM is ~100× slower than the core. So how you touch memory often matters more than how many instructions you run.
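As a minimal sketch of that idea, assume a hypothetical 2D grid stored row by row in one flat array. The two functions below (their names and signatures are illustrative, not from the course) add up exactly the same elements; the first walks memory in address order, the second strides down columns.

```c
#include <stddef.h>
#include <stdint.h>

/* Row-major walk: the inner loop visits consecutive addresses,
   so every element of each fetched cache line gets used. */
uint64_t sum_row_major(const uint32_t *grid, size_t rows, size_t cols)
{
    uint64_t total = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            total += grid[r * cols + c];
    return total;
}

/* Column-major walk over the same row-major data: the inner loop
   jumps cols elements per step, so on a large grid nearly every
   access lands on a different cache line. */
uint64_t sum_col_major(const uint32_t *grid, size_t rows, size_t cols)
{
    uint64_t total = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            total += grid[r * cols + c];
    return total;
}
```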
Two loops can visit the same array elements and return the same result, yet the one that walks memory in address order can be many times faster purely from cache behaviour. You can build intuition for this in our interactive Cache Simulator.
Modern ARM cores (recall the pipeline, Day 7) guess branch outcomes to keep the pipeline full. A misprediction flushes the pipeline — a dozen-plus wasted cycles. To help it:
Replace hard-to-predict branches with branchless code, such as a conditional select (CSEL, Day 26).
Performance comes from doing more at once, at three levels:
| Level | Mechanism | How to exploit |
|---|---|---|
| Instruction (ILP) | superscalar, out-of-order | avoid long dependency chains; unroll loops |
| Data (SIMD) | NEON / SVE | vectorise array loops (auto or intrinsics) |
| Thread (cores) | multicore | parallelise across cores; avoid false sharing |
Use them together: a well-optimised kernel is multithreaded across cores, vectorised with NEON within each thread, and written so the out-of-order engine never starves.
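To make the ILP row concrete, here is a small sketch of breaking a long dependency chain by unrolling with several accumulators. The function names and the unroll factor of four are assumptions chosen for illustration; the best factor depends on the core and should be measured.

```c
#include <stddef.h>

/* One accumulator: each add depends on the result of the previous one,
   so the loop runs no faster than the latency of a single add. */
float sum_single(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Four accumulators: four independent chains that an out-of-order
   core can keep in flight at the same time. */
float sum_unrolled4(const float *x, size_t n)
{
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    for (; i < n; i++)   /* leftover elements when n is not a multiple of 4 */
        a0 += x[i];
    return (a0 + a1) + (a2 + a3);
}
```

Floating-point addition is not associative, so the two versions can differ in the last bits on general input. That is also why compilers usually will not make this change for floats on their own unless you allow reassociation (for example with -ffast-math).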
Modern compilers are excellent — give them the right flags and clean code:
| Flag | Effect |
|---|---|
| -O2 | strong, safe optimisation — a good default |
| -O3 | adds aggressive vectorisation/inlining (benchmark — not always faster) |
| -mcpu=name | tune scheduling & instructions for a specific core |
| -flto | link-time optimisation across translation units |
Write simple, vectorisable loops and the compiler will often emit better NEON than hand-tuned assembly, and it will keep doing so as it improves. Reserve hand assembly for the rare hotspot the compiler can't crack.
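As a hedged illustration of what "simple and vectorisable" can look like, the sketch below is a plain multiply-add loop over two arrays. The function name is illustrative, and the restrict qualifiers encode an assumption that the two arrays never overlap; if they can, remove restrict and expect the compiler to be more cautious.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for every element.
   restrict is a promise from the caller that x and y do not overlap,
   which removes an aliasing question the vectoriser would otherwise face. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

A loop of this shape (unit stride, no early exit, no calls in the body) is a typical auto-vectorisation candidate at -O3 with -mcpu set for your core. Whether NEON code is actually emitted depends on the compiler and version, so check the generated assembly or a benchmark rather than assuming.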
A good doctor runs tests before prescribing. A bad one guesses. Performance work is the same: the PMU is your diagnostic scan. Most "obvious" optimisations engineers try without measuring make no difference — or make things worse. Measure, fix, measure.
Measure first with perf/PMU, then attack the real bottleneck: cache-friendly access (often the biggest win), predictable/branchless control flow, and parallelism at the ILP, NEON and multicore levels — helped by the right compiler flags. Avoid false sharing and premature micro-optimisation. Fast code is measured code.
Measure with perf and the PMU; optimise only real hotspots.
Thirty lessons ago you started with "what is ARM?" Now you understand the register model, instruction sets, the pipeline, exceptions and interrupts, the NVIC, the memory map and buses, MPU and MMU, caches, TrustZone, CP15, the boot process, AArch64, NEON, floating point, multicore coherency, and how to make it all run fast. That's the full stack of a modern processor. 👏
Where to go next: apply it. Write some AArch64 assembly, profile a real program with perf, or build hardware in our online Verilog simulator. Explore RISC-V vs ARM, AI chips, or the VLSI track to keep going deeper.
Measure with perf/PMU first, then fix the real hotspot: cache-friendly data access, predictable branches, NEON/multicore parallelism, and good compiler flags.
The PMU is the Performance Monitor Unit: a set of hardware counters for cycles, cache misses, branch mispredictions and similar events, read by tools like perf.
DRAM is ~100× slower than the core, so sequential access that stays in cache avoids costly misses and dominates real-world speed.
Use -O2/-O3 for optimisation and auto-vectorisation, -mcpu to target your core, and -flto for cross-file optimisation, and always benchmark the result.