DAY 30 · ADVANCED (64-BIT & BEYOND) · FINALE

Performance & Optimisation on ARM

By EcrioniX · Updated Jun 6, 2026

You've learned how an ARM chip works from the transistor up to multicore. The final skill is making code fast on it. Performance work is part science, part discipline — and the golden rule is humbling: never optimise by guessing. Today we turn everything from the course into a practical optimisation toolkit, and close out the 30-day journey.

1. Rule zero: measure first

The single most expensive mistake in performance work is optimising the wrong thing. Programs spend most of their time in a tiny fraction of the code, so you must find the real hotspots before touching anything. ARM cores include a PMU (Performance Monitor Unit) — hardware counters for cycles, instructions, cache misses, branch mispredictions and TLB misses. Linux perf reads them:

    # Profile where time goes, and why
    perf record ./myapp    # sample the hotspots
    perf report            # see the hottest functions/lines
    perf stat ./myapp      # cycles, IPC, cache-misses, branch-misses

The counters tell you why code is slow: a high cache-miss rate means you're memory-bound; high branch-misses mean unpredictable control flow; low IPC with neither suggests dependency stalls. Diagnosis dictates the fix.
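To make that decision order concrete, here is a minimal sketch in C that turns raw counter totals (such as those printed by perf stat) into a first-guess diagnosis. The struct, the function names and above all the three thresholds are assumptions made for illustration; they are not values recommended by ARM or by perf, and sensible cut-offs vary by core and workload.

```c
#include <stdint.h>

/* Raw counter totals, e.g. copied from `perf stat` output. */
struct pmu_counts {
    uint64_t cycles;
    uint64_t instructions;
    uint64_t cache_refs;
    uint64_t cache_misses;
    uint64_t branches;
    uint64_t branch_misses;
};

enum bottleneck {
    MEMORY_BOUND,
    BRANCH_BOUND,
    DEPENDENCY_BOUND,
    NO_OBVIOUS_STALL
};

/* Illustrative thresholds only (assumptions, tune for your core). */
#define CACHE_MISS_RATE_HIGH  0.10
#define BRANCH_MISS_RATE_HIGH 0.05
#define IPC_LOW               1.0

/* Safe division: a zero denominator yields 0.0 instead of dividing by zero. */
static double ratio(uint64_t num, uint64_t den)
{
    return den ? (double)num / (double)den : 0.0;
}

/* Apply the checks in the order the text gives them. */
enum bottleneck classify(const struct pmu_counts *c)
{
    if (ratio(c->cache_misses, c->cache_refs) > CACHE_MISS_RATE_HIGH)
        return MEMORY_BOUND;
    if (ratio(c->branch_misses, c->branches) > BRANCH_MISS_RATE_HIGH)
        return BRANCH_BOUND;
    if (ratio(c->instructions, c->cycles) < IPC_LOW)
        return DEPENDENCY_BOUND;
    return NO_OBVIOUS_STALL;
}
```

A result like this is only a starting hypothesis: it tells you which part of the perf report to read first, not what the fix is.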

2. Cache-friendly code — usually the #1 win

From Day 22: DRAM is ~100× slower than the core, so a load that misses every cache level can stall for on the order of a hundred cycles. How you touch memory therefore often matters more than how many instructions you run.

    // Cache-friendly: row-major order matches memory layout (fast)
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];

    // Cache-hostile: column-major jumps a full row each step (slow)
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            sum += a[i][j];

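Same result, but the first can be many times faster purely from cache behaviour: it uses every element of each cache line it fetches, while the second uses one element per line before jumping to the next row. You can build intuition for this in our interactive Cache Simulator.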
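If you want to see the gap on your own board, the two loops can be wrapped in a small timing sketch like the one below. The array size (N = 1024) and the helper names fill, sum_row_major, sum_col_major and seconds are all chosen here for illustration, and clock() is a coarse timer, so treat the numbers as a rough indication rather than a benchmark.

```c
#include <stddef.h>
#include <time.h>

#define N 1024              /* assumption: large enough to overflow L1/L2 */

static int a[N][N];         /* static storage, so it does not live on the stack */

/* Fill the matrix with small, deterministic values. */
void fill(void)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] = (int)((i + j) & 0xFF);
}

/* Inner loop walks along a row: consecutive addresses. */
long sum_row_major(void)
{
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Inner loop walks down a column: each step jumps N * sizeof(int) bytes. */
long sum_col_major(void)
{
    long sum = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

/* Run fn once, store its result, and return the CPU time it took in seconds. */
double seconds(long (*fn)(void), long *result)
{
    clock_t t0 = clock();
    *result = fn();
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Call fill() once, then seconds(sum_row_major, &r) and seconds(sum_col_major, &r) from your own main and print both times. At higher optimisation levels the compiler may interchange the loops itself, in which case the two timings converge.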

3. Help the branch predictor

Modern ARM cores (recall the pipeline, Day 7) guess branch outcomes to keep the pipeline full. A misprediction flushes the pipeline, wasting a dozen or more cycles. To help the predictor, keep control flow predictable, or make hot code branchless where you can.
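One way to take a data-dependent branch out of a hot loop is to turn the condition into a mask. The sketch below is illustrative only: both function names are made up for this example, and a compiler may already turn the first version into a conditional select (CSEL on AArch64) or vectorise it on its own, so check the generated code before assuming the rewrite helps.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy version: the `if` is hard to predict when the data is random. */
int64_t sum_above_branchy(const int32_t *v, size_t n, int32_t threshold)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] >= threshold)
            sum += v[i];
    }
    return sum;
}

/* Branchless version: the comparison yields 0 or 1, which is negated into
 * an all-zeros or all-ones mask and ANDed with the value. */
int64_t sum_above_branchless(const int32_t *v, size_t n, int32_t threshold)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t keep = -(int64_t)(v[i] >= threshold);  /* 0 or -1 (all ones) */
        sum += (int64_t)v[i] & keep;
    }
    return sum;
}
```

The branchless form always does the same amount of work per element, so it tends to pay off only when the branch really is unpredictable; the branch-misses counter from perf stat is the way to find out.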

4. Exploit parallelism — ILP, SIMD, threads

Performance comes from doing more at once, at three levels:

Level | Mechanism | How to exploit
Instruction (ILP) | superscalar, out-of-order | avoid long dependency chains; unroll loops
Data (SIMD) | NEON / SVE | vectorise array loops (auto or intrinsics)
Thread (cores) | multicore | parallelise across cores; avoid false sharing

Use them together: a well-optimised kernel is multithreaded across cores, vectorised with NEON within each thread, and written so the out-of-order engine never starves.
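At the ILP level, "avoid long dependency chains" can be sketched with a simple reduction. In the first function every add waits for the previous one; in the second, four independent accumulators give an out-of-order core several adds to overlap. The function names and the unroll factor of four are assumptions for illustration. Note also that splitting a floating-point sum reorders the additions, so the result can differ in the last bits, which is why a compiler will not normally do this for you without a flag such as -ffast-math.

```c
#include <stddef.h>

/* Single accumulator: one long chain of dependent adds. */
float sum_chain(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four accumulators: four shorter, independent chains. */
float sum_unrolled(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)          /* leftover elements when n is not a multiple of 4 */
        s0 += x[i];

    return (s0 + s1) + (s2 + s3);
}
```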

5. Let the compiler do the heavy lifting

Modern compilers are excellent — give them the right flags and clean code:

Flag | Effect
-O2 | strong, safe optimisation; a good default
-O3 | adds aggressive vectorisation/inlining (benchmark it: not always faster)
-mcpu=name | tune scheduling & instructions for a specific core
-flto | link-time optimisation across translation units

Write simple, vectorisable loops and the compiler will often emit better NEON than hand-tuned assembly, and it will keep doing so as it improves. Reserve hand assembly for the rare hotspot the compiler can't crack.
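As a sketch of what "simple and vectorisable" can look like, here is a classic y = a*x + y loop. The function name saxpy is conventional rather than taken from this course. The restrict qualifiers are the programmer's promise that x and y do not overlap, which removes one common obstacle to auto-vectorisation; if you pass overlapping arrays, the behaviour is undefined.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for i in [0, n).
 * Unit stride, no calls, no early exits, no aliasing: an easy target
 * for the auto-vectoriser. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

To check whether the loop was actually vectorised, GCC can report it with -fopt-info-vec and Clang with -Rpass=loop-vectorize; looking at the disassembly for NEON register names is the final word.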

6. Avoid the classic traps
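False sharing, flagged in the parallelism table above, is one such trap: two threads update different variables that happen to sit in the same cache line, so the line bounces between cores even though no data is logically shared. A minimal sketch of the usual layout fix follows. It assumes a 64-byte cache line, which is common on Cortex-A cores but should be checked for your part, and the struct and function names are invented for this example. The code shows only the memory layout, not the threads themselves.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumption: 64-byte lines; verify for your core */

/* Packed: the two counters are 8 bytes apart, so they will almost
 * always land in the same cache line. */
struct counters_packed {
    uint64_t thread0_hits;
    uint64_t thread1_hits;
};

/* Padded: each counter is aligned to its own cache line. */
struct counters_padded {
    alignas(CACHE_LINE) uint64_t thread0_hits;
    alignas(CACHE_LINE) uint64_t thread1_hits;
};

/* Distance in bytes between the two counters in each layout. */
size_t gap_packed(void)
{
    return offsetof(struct counters_packed, thread1_hits)
         - offsetof(struct counters_packed, thread0_hits);
}

size_t gap_padded(void)
{
    return offsetof(struct counters_padded, thread1_hits)
         - offsetof(struct counters_padded, thread0_hits);
}
```

The padded struct is sixteen times larger for the same two counters, so this is a trade of memory for scalability that is worth making only for data that threads really do write concurrently.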

7. The optimisation workflow

  1. Measure — profile with perf/PMU to find the real hotspot.
  2. Understand — is it memory-bound, branch-bound, or compute-bound?
  3. Fix the biggest thing — usually algorithm, then data layout, then vectorisation.
  4. Measure again — confirm the speed-up is real and didn't break correctness.
  5. Repeat until "fast enough" — then stop. Diminishing returns are real.
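Step 4 is the one most often skipped, so here is a small sketch of a "measure again" helper that refuses to report a speed-up unless the optimised kernel returns the same answer as the baseline. All names are hypothetical. sum_skips_tail is deliberately wrong (it forgets the last element when n is odd) to show the kind of bug an unrolled rewrite can introduce.

```c
#include <stddef.h>
#include <time.h>

typedef long (*kernel_fn)(const int *data, size_t n);

/* Reference implementation: simple and obviously correct. */
long sum_baseline(const int *data, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += data[i];
    return s;
}

/* Unrolled by two, with the odd tail handled. */
long sum_unrolled2(const int *data, size_t n)
{
    long s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        s0 += data[i];
        s1 += data[i + 1];
    }
    if (i < n)
        s0 += data[i];
    return s0 + s1;
}

/* Deliberately buggy: same unrolling, but the tail element is dropped. */
long sum_skips_tail(const int *data, size_t n)
{
    long s0 = 0, s1 = 0;
    for (size_t i = 0; i + 2 <= n; i += 2) {
        s0 += data[i];
        s1 += data[i + 1];
    }
    return s0 + s1;
}

/* Run fn `reps` times; return CPU seconds and store the last result. */
static double time_kernel(kernel_fn fn, const int *data, size_t n,
                          int reps, long *out)
{
    long r = 0;
    clock_t t0 = clock();
    for (int k = 0; k < reps; k++)
        r = fn(data, n);
    *out = r;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Returns 0 and stores baseline_time / candidate_time in *speedup when the
 * results match; returns -1 (and leaves *speedup alone) when they differ.
 * A stored speed-up of 0.0 means the run was too short to time. */
int compare_kernels(kernel_fn baseline, kernel_fn candidate,
                    const int *data, size_t n, int reps, double *speedup)
{
    long want, got;
    double tb = time_kernel(baseline, data, n, reps, &want);
    double tc = time_kernel(candidate, data, n, reps, &got);

    if (want != got)
        return -1;
    *speedup = (tc > 0.0) ? tb / tc : 0.0;
    return 0;
}
```

In practice you would run this on realistic input sizes with enough repetitions for clock() to register, and repeat the comparison a few times, since a single timing on a busy system can be noisy.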

💡 Optimise like a doctor, not a guesser

A good doctor runs tests before prescribing. A bad one guesses. Performance work is the same: the PMU is your diagnostic scan. Most "obvious" optimisations engineers try without measuring make no difference — or make things worse. Measure, fix, measure.

✅ The mental model

Measure first with perf/PMU, then attack the real bottleneck: cache-friendly access (often the biggest win), predictable/branchless control flow, and parallelism at the ILP, NEON and multicore levels — helped by the right compiler flags. Avoid false sharing and premature micro-optimisation. Fast code is measured code.

🎯 Day 30 takeaways

🎓 You finished the course!

Thirty lessons ago you started with "what is ARM?" Now you understand the register model, instruction sets, the pipeline, exceptions and interrupts, the NVIC, the memory map and buses, MPU and MMU, caches, TrustZone, CP15, the boot process, AArch64, NEON, floating point, multicore coherency — and how to make it all run fast. That's the full stack of a modern processor. 👏

Where to go next: apply it. Write some AArch64 assembly, profile a real program with perf, or build hardware in our online Verilog simulator. Explore RISC-V vs ARM, AI chips, or the VLSI track to keep going deeper.

Quick check

  1. What's "rule zero" of optimisation, and what tool supports it on ARM?
  2. Why is sequential array access often the biggest single speed-up?
  3. Name the three levels of parallelism you can exploit.
  4. What is false sharing and how do you avoid it?

FAQ

How do I optimise ARM code?

Measure with perf/PMU first, then fix the real hotspot: cache-friendly data access, predictable branches, NEON/multicore parallelism, and good compiler flags.

What is the PMU?

The Performance Monitor Unit — hardware counters for cycles, cache misses, branch mispredictions etc., read by tools like perf.

Why cache-friendly code?

DRAM is ~100× slower than the core, so sequential access that stays in cache avoids costly misses and dominates real-world speed.

Which compiler flags help?

-O2/-O3 for optimisation and auto-vectorisation, -mcpu to target your core, -flto for cross-file optimisation — always benchmarked.
