How do I optimise code for ARM?

Always measure first using a profiler such as perf and the on-chip performance monitor unit (PMU), then focus only on the real hotspots. The biggest wins usually come from cache-friendly data access, helping the branch predictor, vectorising hot loops with NEON, and choosing appropriate compiler flags. Optimise the algorithm before micro-optimising instructions.

What is the PMU on ARM?

The PMU, or Performance Monitor Unit, is hardware built into ARM cores that counts events such as cycles, instructions retired, cache misses, branch mispredictions and TLB misses. Tools like Linux perf read these counters so you can see exactly why code is slow, for example whether it is memory-bound from cache misses or limited by branch mispredictions.

Why is cache-friendly code faster?

Main memory is far slower than the core, so most performance is decided by how often data is found in cache. Accessing memory sequentially and keeping working sets small lets the CPU serve data from fast L1/L2 cache instead of stalling on DRAM. Poor access patterns cause cache misses that can each cost a hundred or more cycles.

What compiler flags improve ARM performance?

Optimisation levels like -O2 and -O3 enable inlining, loop optimisation and auto-vectorisation. Targeting the specific core with -mcpu lets the compiler tune scheduling and use available instructions, and link-time optimisation (-flto) optimises across files. Always benchmark, because -O3 is not always faster than -O2 for every workload.

DAY 30 · ADVANCED (64-BIT & BEYOND) · FINALE

Performance & Optimisation on ARM

By EcrioniX · Updated Jun 6, 2026

You've learned how an ARM chip works from the transistor up to multicore. The final skill is making code fast on it. Performance work is part science, part discipline — and the golden rule is humbling: never optimise by guessing. Today we turn everything from the course into a practical optimisation toolkit, and close out the 30-day journey.

1. Rule zero: measure first

The single most expensive mistake in performance work is optimising the wrong thing. Programs spend most of their time in a tiny fraction of the code, so you must find the real hotspots before touching anything. ARM cores include a PMU (Performance Monitor Unit) — hardware counters for cycles, instructions, cache misses, branch mispredictions and TLB misses. Linux perf reads them:

# Profile where time goes, and why
perf record ./myapp    # sample the hotspots
perf report            # see the hottest functions/lines
perf stat -e cycles,instructions,cache-misses,branch-misses ./myapp    # raw counts, plus IPC

The counters tell you why code is slow: a high cache-miss rate means you're memory-bound; high branch-misses mean unpredictable control flow; low IPC with neither suggests dependency stalls. Diagnosis dictates the fix.
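As a rough illustration of that diagnosis step, the sketch below turns raw counter totals into the three ratios discussed above and picks a likely bottleneck. The struct, the function names and especially the thresholds are hypothetical placeholders chosen for this example; sensible cut-offs depend on the core and the workload.

```c
#include <stdint.h>

/* Raw counter totals, e.g. copied by hand from `perf stat` output. */
struct pmu_counts {
    uint64_t cycles;
    uint64_t instructions;
    uint64_t cache_refs;
    uint64_t cache_misses;
    uint64_t branches;
    uint64_t branch_misses;
};

enum bottleneck {
    MEMORY_BOUND,
    BRANCH_BOUND,
    DEPENDENCY_BOUND,
    NO_OBVIOUS_STALL
};

/* Safe division: a zero denominator yields 0 instead of a fault. */
static double ratio(uint64_t num, uint64_t den) {
    return den ? (double)num / (double)den : 0.0;
}

/* Assumption: the 10%, 5% and 1.0 thresholds are illustrative only. */
enum bottleneck classify(const struct pmu_counts *c) {
    double ipc              = ratio(c->instructions, c->cycles);
    double cache_miss_rate  = ratio(c->cache_misses, c->cache_refs);
    double branch_miss_rate = ratio(c->branch_misses, c->branches);

    if (cache_miss_rate > 0.10)  return MEMORY_BOUND;
    if (branch_miss_rate > 0.05) return BRANCH_BOUND;
    if (ipc < 1.0)               return DEPENDENCY_BOUND;
    return NO_OBVIOUS_STALL;
}
```

The order of the checks mirrors the text: cache misses first, then branch misses, and low IPC is only blamed on dependency stalls when neither of the other two is high.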

2. Cache-friendly code — usually the #1 win

From Day 22: a trip to DRAM costs on the order of 100 core cycles or more, against a handful for an L1 hit. So how you touch memory often matters more than how many instructions you run.

Access sequentially — walk arrays in order so each 64-byte cache line is fully used.
Keep working sets small — block/tile large computations to fit in L1/L2.
Prefer arrays over pointer-chasing structures (linked lists murder the prefetcher).
Structure-of-arrays over array-of-structures when you only use some fields.

// Cache-friendly: row-major order matches memory layout (fast)
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        sum += a[i][j];

// Cache-hostile: column-major jumps a full row each step (slow)
for (j = 0; j < N; j++)
    for (i = 0; i < N; i++)
        sum += a[i][j];

Same result; the first can be many times faster purely from cache behaviour. You can build intuition for this in our interactive Cache Simulator.
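The structure-of-arrays point from the list above can be sketched the same way. The particle types and the N_PARTICLES capacity below are hypothetical, made up for this example. With a 16-byte struct and 64-byte cache lines, the array-of-structures loop uses only a quarter of every line it fetches, while the structure-of-arrays loop uses all of it.

```c
#include <stddef.h>

#define N_PARTICLES 1024   /* hypothetical capacity for the example */

/* Array-of-structures: reading x also drags y, z and mass into cache. */
struct particle {
    float x, y, z, mass;
};

float sum_x_aos(const struct particle *p, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += p[i].x;          /* 4 useful bytes out of every 16 loaded */
    return sum;
}

/* Structure-of-arrays: each field is its own contiguous array. */
struct particles_soa {
    float x[N_PARTICLES];
    float y[N_PARTICLES];
    float z[N_PARTICLES];
    float mass[N_PARTICLES];
};

float sum_x_soa(const struct particles_soa *p, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += p->x[i];         /* every byte of each fetched line is used */
    return sum;
}
```

Both functions return the same value. The layout only pays off when the hot loop touches a subset of the fields; if every loop reads all four, the array-of-structures form is usually fine.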

3. Help the branch predictor

Modern ARM cores (recall the pipeline, Day 7) guess branch outcomes to keep the pipeline full. A misprediction flushes the pipeline — a dozen-plus wasted cycles. To help it:

Make branches predictable — sort/group data so conditions are mostly one way.
Replace small data-dependent branches with branchless code (CSEL, Day 26).
Hoist invariant conditions out of hot loops.
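Here is a minimal sketch of the branchless idea in C, with made-up function names. The second version expresses the choice as a value selection instead of a conditional jump, which an AArch64 compiler will typically lower to CSEL. That lowering is not guaranteed, and a good compiler may already do the same to the first version, so check the disassembly and measure.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy: on random data the `if` is taken unpredictably. */
int64_t sum_above_branchy(const int32_t *v, size_t n, int32_t threshold) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] >= threshold)
            sum += v[i];
    }
    return sum;
}

/* Branchless: always add, but add 0 when the element is below threshold. */
int64_t sum_above_branchless(const int32_t *v, size_t n, int32_t threshold) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t keep = (v[i] >= threshold) ? v[i] : 0;  /* select, not jump */
        sum += keep;
    }
    return sum;
}
```

If the data is sorted or the condition almost always goes one way, the branchy version predicts well and the rewrite may gain nothing.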

4. Exploit parallelism — ILP, SIMD, threads

Performance comes from doing more at once, at three levels:

Level	Mechanism	How to exploit
Instruction (ILP)	superscalar, out-of-order	avoid long dependency chains; unroll loops
Data (SIMD)	NEON / SVE	vectorise array loops (auto or intrinsics)
Thread (cores)	multicore	parallelise across cores; avoid false sharing

Use them together: a well-optimised kernel is multithreaded across cores, vectorised with NEON within each thread, and written so the out-of-order engine never starves.
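The first two levels can be shown in one small kernel. This is a sketch, not a tuned implementation: sum_f32 is a hypothetical name, and the guard macros assume a GCC- or Clang-style toolchain. On AArch64 it uses NEON intrinsics with two independent vector accumulators; elsewhere it falls back to four scalar accumulators, which still shortens the dependency chain.

```c
#include <stddef.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

float sum_f32(const float *x, size_t n) {
    size_t i = 0;
    float total;

#if defined(__aarch64__) && defined(__ARM_NEON)
    /* SIMD + ILP: two 4-lane accumulators that do not depend on each other. */
    float32x4_t acc0 = vdupq_n_f32(0.0f);
    float32x4_t acc1 = vdupq_n_f32(0.0f);
    for (; i + 8 <= n; i += 8) {
        acc0 = vaddq_f32(acc0, vld1q_f32(x + i));
        acc1 = vaddq_f32(acc1, vld1q_f32(x + i + 4));
    }
    total = vaddvq_f32(vaddq_f32(acc0, acc1));   /* horizontal add of lanes */
#else
    /* ILP only: four scalar chains instead of one long one. */
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    total = (a0 + a1) + (a2 + a3);
#endif

    for (; i < n; i++)      /* leftover elements */
        total += x[i];
    return total;
}
```

Splitting the sum changes the order of the floating-point additions, so the result can differ in the last bits from a plain left-to-right loop. The third level, threads, would give each core its own slice of the array and combine the partial sums at the end.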

5. Let the compiler do the heavy lifting

Modern compilers are excellent — give them the right flags and clean code:

Flag	Effect
-O2	strong, safe optimisation — a good default
-O3	adds aggressive vectorisation/inlining (benchmark — not always faster)
-mcpu=name	tune scheduling & instructions for a specific core
-flto	link-time optimisation across translation units

Write simple, vectorisable loops and the compiler will often emit better NEON than hand-tuned assembly, and it will keep doing so as it improves. Reserve hand assembly for the rare hotspot the compiler can't crack.
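As an example of what "simple" means here, this is the kind of loop auto-vectorisers tend to handle well: a counted loop, unit-stride access, no early exits, and restrict to promise that the arrays do not overlap. The saxpy name is the conventional one for this kernel and is not taken from the course.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i]
 * `restrict` tells the compiler x and y never alias, so it does not
 * need runtime overlap checks before vectorising the loop. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

To see whether the loop was actually vectorised, GCC's -fopt-info-vec and Clang's -Rpass=loop-vectorize print a report at compile time. Whether it vectorises at -O2 or only at -O3 depends on the compiler and version.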

6. Avoid the classic traps

False sharing — two threads writing different variables that share one cache line cause constant coherency traffic (Day 29). Pad/align hot per-thread data to separate lines.
Denormals in float-heavy DSP — enable flush-to-zero (Day 28).
Misalignment — keep data naturally aligned to avoid penalties.
Premature micro-optimisation: even an infinite speed-up of a function that's 1% of runtime buys you at most 1%. Fix the algorithm and the hot 90% first.
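The false-sharing fix from the list above can be sketched as a layout change. The type names, MAX_THREADS and the 64-byte CACHE_LINE value are assumptions for this example. 64 bytes matches the line size used earlier in this lesson, but some cores use a different size, so check yours.

```c
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE  64   /* assumption: 64-byte cache lines */
#define MAX_THREADS 4    /* hypothetical thread count */

/* Unpadded: all four counters sit in one 32-byte block, so threads
 * updating "their own" counter still fight over the same cache line. */
struct counters_shared {
    uint64_t hits[MAX_THREADS];
};

/* Padded: alignas forces each counter onto its own cache line. */
struct padded_counter {
    alignas(CACHE_LINE) uint64_t hits;
};

struct counters_padded {
    struct padded_counter per_thread[MAX_THREADS];
};
```

The padded version spends 64 bytes per counter instead of 8. That trade is usually worth it for a handful of hot per-thread variables and not for large arrays.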

7. The optimisation workflow

Measure — profile with perf/PMU to find the real hotspot.
Understand — is it memory-bound, branch-bound, or compute-bound?
Fix the biggest thing — usually algorithm, then data layout, then vectorisation.
Measure again — confirm the speed-up is real and didn't break correctness.
Repeat until "fast enough" — then stop. Diminishing returns are real.
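The "measure again" step can be sketched as a tiny harness that times a baseline and a candidate and hands back both results for comparison. Everything here is hypothetical scaffolding for illustration. It uses the standard clock() call, which is coarse; for real work perf gives far better data.

```c
#include <stddef.h>
#include <time.h>

typedef long (*kernel_fn)(const int *data, size_t n);

/* Returns the best (minimum) CPU time in seconds over `runs` repeats and
 * stores the kernel's result so the caller can check correctness.
 * Taking the minimum filters out noise from other activity on the system. */
double best_seconds(kernel_fn fn, const int *data, size_t n,
                    int runs, long *result_out) {
    double best = -1.0;
    for (int r = 0; r < runs; r++) {
        clock_t start = clock();
        long result = fn(data, n);
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || elapsed < best)
            best = elapsed;
        *result_out = result;
    }
    return best;
}

/* Baseline: one accumulator. */
long sum_baseline(const int *d, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += d[i];
    return s;
}

/* Candidate: two accumulators, plus a fix-up for an odd element count. */
long sum_unrolled(const int *d, size_t n) {
    long a = 0, b = 0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        a += d[i];
        b += d[i + 1];
    }
    if (i < n)
        a += d[i];
    return a + b;
}
```

A typical use is to call best_seconds once with each kernel, compare the two stored results first, and only then compare the times. A faster answer that is wrong is not a speed-up.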

💡 Optimise like a doctor, not a guesser

A good doctor runs tests before prescribing. A bad one guesses. Performance work is the same: the PMU is your diagnostic scan. Most "obvious" optimisations engineers try without measuring make no difference — or make things worse. Measure, fix, measure.

✅ The mental model

Measure first with perf/PMU, then attack the real bottleneck: cache-friendly access (often the biggest win), predictable/branchless control flow, and parallelism at the ILP, NEON and multicore levels — helped by the right compiler flags. Avoid false sharing and premature micro-optimisation. Fast code is measured code.

🎯 Day 30 takeaways

Measure first — use perf + the PMU; optimise only real hotspots.
Cache-friendly sequential access is usually the #1 win.
Make branches predictable or branchless to avoid pipeline flushes.
Stack parallelism: ILP + NEON + multicore.
Use -O2/-O3, -mcpu, -flto; clean loops vectorise well.
Beware false sharing, denormals, misalignment, and premature tuning.

🎓 You finished the course!

Thirty lessons ago you started with "what is ARM?" Now you understand the register model, instruction sets, the pipeline, exceptions and interrupts, the NVIC, the memory map and buses, MPU and MMU, caches, TrustZone, CP15, the boot process, AArch64, NEON, floating point, multicore coherency — and how to make it all run fast. That's the full stack of a modern processor. 👏

Where to go next: apply it. Write some AArch64 assembly, profile a real program with perf, or build hardware in our online Verilog simulator. Explore RISC-V vs ARM, AI chips, or the VLSI track to keep going deeper.

Quick check

What's "rule zero" of optimisation, and what tool supports it on ARM?
Why is sequential array access often the biggest single speed-up?
Name the three levels of parallelism you can exploit.
What is false sharing and how do you avoid it?

FAQ

How do I optimise ARM code?

Measure with perf/PMU first, then fix the real hotspot: cache-friendly data access, predictable branches, NEON/multicore parallelism, and good compiler flags.

What is the PMU?

The Performance Monitor Unit — hardware counters for cycles, cache misses, branch mispredictions etc., read by tools like perf.

Why cache-friendly code?

DRAM is ~100× slower than the core, so sequential access that stays in cache avoids costly misses and dominates real-world speed.

Which compiler flags help?

-O2/-O3 for optimisation and auto-vectorisation, -mcpu to target your core, -flto for cross-file optimisation — always benchmarked.
