DAY 27 · ADVANCED (64-BIT & BEYOND)

NEON & SIMD — Data Parallelism

By EcrioniX · Updated Jun 6, 2026

Your phone decodes 4K video, applies camera filters in real time and runs on-device AI — all without melting the battery. The secret weapon behind much of that is NEON, ARM's SIMD engine. Today you'll understand how doing the same operation on many pieces of data at once turns a slow loop into a blazing one.

1. The problem: scalar code is wasteful

Consider adding two arrays of 1,000 numbers. Ordinary "scalar" code loads one element, adds one element, stores one element — 1,000 times. But the data elements are often small (8- or 16-bit pixels, 32-bit samples) while the processor's datapath is 64 or 128 bits wide. You're driving a truck to deliver one parcel at a time.

SIMD — Single Instruction, Multiple Data — fixes this. One instruction operates on a whole vector of elements packed side by side. Add eight 16-bit numbers in a single instruction and your loop runs roughly 8× fewer iterations. This is data parallelism, and it's distinct from the thread/core parallelism of multicore (Day 29) — here a single core does more per instruction.
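To make the "8× fewer iterations" idea concrete, here is a minimal sketch in C. The function names add_i16_scalar and add_i16_simd are made up for this illustration, and the NEON path is only compiled when the compiler defines __ARM_NEON; on any other target the second function simply falls back to the scalar loop. Treat it as one possible shape for such a loop, not a tuned implementation.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Baseline: one 16-bit add per loop iteration. */
void add_i16_scalar(const int16_t *a, const int16_t *b, int16_t *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = (int16_t)(a[i] + b[i]);
}

/* SIMD version: eight 16-bit adds per iteration where NEON is available. */
void add_i16_simd(const int16_t *a, const int16_t *b, int16_t *c, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 8 <= n; i += 8) {
        int16x8_t va = vld1q_s16(a + i);      /* load 8 lanes from a */
        int16x8_t vb = vld1q_s16(b + i);      /* load 8 lanes from b */
        vst1q_s16(c + i, vaddq_s16(va, vb));  /* 8 adds, then store  */
    }
#endif
    /* Scalar tail: the leftover elements (or everything on non-NEON targets). */
    for (; i < n; i++)
        c[i] = (int16_t)(a[i] + b[i]);
}
```

The scalar tail matters: array lengths are rarely an exact multiple of the lane count, so the last few elements still need the one-at-a-time loop.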

2. NEON: ARM's SIMD engine

NEON (officially Advanced SIMD) is the SIMD extension found in virtually every modern ARM application processor. On AArch64 it adds 32 128-bit vector registers (V0–V31) and lane-based instructions that operate on them.

The same 128-bit register can be sliced different ways depending on your data — that flexibility is the heart of NEON.

Figure: One 128-bit NEON V register sliced into lanes: 16 × 8-bit, 8 × 16-bit, 4 × 32-bit or 2 × 64-bit. One instruction processes every lane at once.

3. Lanes and data types

A 128-bit register can hold:

Lane size | Lanes | Typical data
8-bit | 16 | image pixels, bytes
16-bit | 8 | audio samples, half-floats
32-bit | 4 | RGBA, single-precision floats
64-bit | 2 | doubles, large integers

An instruction like "add, 8 lanes of 16-bit" adds all eight pairs in parallel. The number of lanes is your best-case speed-up factor for that data size; real gains are usually somewhat lower, because loads, stores and memory bandwidth take time too.
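The lane arithmetic is simple enough to check in code. The sketch below uses two helper names invented for this example, lanes_per_register and vector_iterations, and assumes a 128-bit register as described above. It shows how many loop iterations a vectorized loop needs for a given element count.

```c
#include <stddef.h>

/* Number of lanes of the given width in one 128-bit NEON register. */
unsigned lanes_per_register(unsigned lane_bits)
{
    return 128u / lane_bits;
}

/* Iterations needed to cover n elements one full vector at a time,
   rounded up so a partial final vector still counts as one iteration. */
size_t vector_iterations(size_t n, unsigned lane_bits)
{
    size_t lanes = lanes_per_register(lane_bits);
    return (n + lanes - 1) / lanes;
}
```

For the 1,000-element array from section 1, 16-bit data needs 125 vector iterations instead of 1,000 scalar ones, and 32-bit data needs 250.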

4. A worked example: vector add

Compare scalar and NEON adds of four 32-bit integers:

// Scalar: four separate adds
c[0]=a[0]+b[0];
c[1]=a[1]+b[1];
c[2]=a[2]+b[2];
c[3]=a[3]+b[3];

// NEON A64: one add does all four lanes
LD1 {v0.4s}, [x0]          // load 4×32-bit from a
LD1 {v1.4s}, [x1]          // load 4×32-bit from b
ADD v2.4s, v0.4s, v1.4s    // 4 adds in ONE instruction
ST1 {v2.4s}, [x2]          // store 4 results to c

The .4s suffix means "4 lanes of 32-bit (S = single word)". One ADD replaced four. Scale that across a megapixel image and the savings are enormous.
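As a hedged illustration of the image case, here is a sketch that brightens a buffer of 8-bit pixels, 16 at a time. The function name brighten_u8 and the choice of a saturating add (clamping at 255 rather than wrapping around) are assumptions made for this example, not something the lesson prescribes. The NEON path is compiled only when __ARM_NEON is defined; otherwise the scalar loop handles everything.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Add `delta` to every 8-bit pixel, clamping at 255 instead of wrapping. */
void brighten_u8(uint8_t *pixels, size_t n, uint8_t delta)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    const uint8x16_t vdelta = vdupq_n_u8(delta);     /* delta in all 16 lanes */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t v = vld1q_u8(pixels + i);         /* load 16 pixels        */
        vst1q_u8(pixels + i, vqaddq_u8(v, vdelta));  /* saturating add, store */
    }
#endif
    /* Scalar tail with the same clamping behaviour. */
    for (; i < n; i++) {
        unsigned sum = (unsigned)pixels[i] + delta;
        pixels[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

On a one-megapixel greyscale image the vector loop runs about 65,000 times where the scalar loop would run about a million.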

5. Where NEON shines

6. Three ways to use NEON

Method | Effort | Control
Auto-vectorization | just add -O3 | compiler decides; easiest
Intrinsics | C functions (vaddq_s32) | explicit, portable, no asm
Hand-written assembly | highest | maximum control for hot loops

// NEON intrinsics: readable C that maps to NEON instructions
#include <arm_neon.h>

int32x4_t va = vld1q_s32(a);       // load 4 ints
int32x4_t vb = vld1q_s32(b);
int32x4_t vc = vaddq_s32(va, vb);  // 4-lane add
vst1q_s32(c, vc);                  // store 4 results

Most developers start with auto-vectorization (write clean, simple loops and let -O3 vectorize them), then reach for intrinsics on the few hotspots that matter most. Hand assembly is reserved for the last few percent.
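What does a "clean, simple loop" look like in practice? The sketch below is one example; the function name scale_add is hypothetical. It has a plain counted loop, no early exits, and restrict-qualified pointers that promise the compiler the two arrays never overlap, which is often what lets the vectorizer go ahead. Whether a given compiler actually vectorizes it depends on the compiler version and flags, so check rather than assume.

```c
#include <stddef.h>

/* y[i] = alpha * x[i] + y[i] for every element.
   `restrict` (C99) tells the compiler x and y do not alias. */
void scale_add(float *restrict y, const float *restrict x, float alpha, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```

To see what the compiler decided, GCC can report vectorized loops with -fopt-info-vec and Clang with -Rpass=loop-vectorize, both alongside -O3.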

💡 The assembly-line analogy

Scalar code is one worker building one product at a time. SIMD is an assembly line where one command ("attach this part") is executed on 16 products at once. You didn't add workers (cores) — you made each instruction do more. That's why SIMD is cheap performance: same silicon, far more throughput on the right workloads.

7. NEON vs SVE

NEON's vectors are a fixed 128 bits — simple and universal. ARM's newer SVE/SVE2 (Scalable Vector Extension, hinted at in Day 26) makes the vector length variable (128–2048 bits), chosen by the hardware, so one binary automatically uses wider vectors on bigger chips. SVE targets HPC and large ML; NEON remains the everyday workhorse you'll meet first and most often.
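For comparison, here is a rough sketch of the same vector add written in the vector-length-agnostic style that SVE encourages, using the SVE intrinsics from arm_sve.h. The function name add_i32_vla is invented for this example. The SVE path is compiled only when the compiler defines __ARM_FEATURE_SVE (for instance with an -march setting that enables SVE); everywhere else a scalar fallback is used, so read the SVE branch as an illustration rather than tested production code.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* c[i] = a[i] + b[i], without hard-coding how many lanes a vector holds. */
void add_i32_vla(const int32_t *a, const int32_t *b, int32_t *c, size_t n)
{
#if defined(__ARM_FEATURE_SVE)
    /* svcntw() is the number of 32-bit lanes on THIS hardware. */
    for (uint64_t i = 0; i < n; i += svcntw()) {
        /* Predicate: lanes whose index is still below n are active. */
        svbool_t pg = svwhilelt_b32(i, (uint64_t)n);
        svint32_t va = svld1_s32(pg, a + i);
        svint32_t vb = svld1_s32(pg, b + i);
        svst1_s32(pg, c + i, svadd_s32_x(pg, va, vb));
    }
#else
    /* Scalar fallback; unsigned arithmetic gives the same wraparound. */
    for (size_t i = 0; i < n; i++)
        c[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
#endif
}
```

Notice there is no separate tail loop in the SVE branch: the predicate switches off the unused lanes on the final iteration, and the number 4 never appears, which is what lets the same binary use wider vectors on bigger chips.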

✅ The mental model

SIMD does one operation on many data elements at once; NEON is ARM's implementation with 32 × 128-bit V registers sliced into lanes (16×8b, 8×16b, 4×32b, 2×64b). It crushes per-element loops in image, audio, video and ML code. Use it via auto-vectorization, intrinsics, or assembly. SVE is the scalable, variable-length big sibling.

🎯 Day 27 takeaways

Quick check

  1. What does SIMD stand for, and how does it differ from multicore parallelism?
  2. How many 16-bit lanes fit in a 128-bit NEON register?
  3. Name the three ways to get NEON into your program.
  4. What's the key difference between NEON and SVE?

FAQ

What is SIMD?

Single Instruction, Multiple Data — one instruction operates on several packed data elements at once for data parallelism.

What is NEON?

ARM's Advanced SIMD engine: 32 128-bit vector registers (V0–V31) with lane-based instructions for multimedia and ML.

How do I use NEON?

Auto-vectorization with -O3, NEON intrinsics in C, or hand-written assembly for critical loops.

NEON vs SVE?

NEON is fixed 128-bit; SVE has a hardware-defined variable length (128–2048 bits) so one binary scales across chips.
