DAY 27 · ADVANCED (64-BIT & BEYOND)

NEON & SIMD — Data Parallelism

By EcrioniX · Updated Jun 6, 2026

Your phone decodes 4K video, applies camera filters in real time and runs on-device AI — all without melting the battery. The secret weapon behind much of that is NEON, ARM's SIMD engine. Today you'll understand how doing the same operation on many pieces of data at once turns a slow loop into a blazing one.

1. The problem: scalar code is wasteful

Consider adding two arrays of 1,000 numbers. Ordinary "scalar" code loads one element, adds one element, stores one element — 1,000 times. But the data elements are often small (8- or 16-bit pixels, 32-bit samples) while the processor's datapath is 64 or 128 bits wide. You're driving a truck to deliver one parcel at a time.

SIMD — Single Instruction, Multiple Data — fixes this. One instruction operates on a whole vector of elements packed side by side. Add eight 16-bit numbers in a single instruction and your loop runs roughly 8× fewer iterations. This is data parallelism, and it's distinct from the thread/core parallelism of multicore (Day 29) — here a single core does more per instruction.

2. NEON: ARM's SIMD engine

NEON (officially Advanced SIMD) is the SIMD extension found in virtually every modern ARM application processor. It adds:

32 vector registers, V0–V31, each 128 bits wide (in AArch64 these share the floating-point register file from Day 28).
Instructions that treat each register as a packed vector of lanes and operate on all lanes simultaneously.

The same 128-bit register can be sliced different ways depending on your data — that flexibility is the heart of NEON.

Figure — A 128-bit NEON register holds 16×8-bit, 8×16-bit, 4×32-bit or 2×64-bit elements.

3. Lanes and data types

A 128-bit register can hold:

Lane size	Lanes	Typical data
8-bit	16	image pixels, bytes
16-bit	8	audio samples, half-floats
32-bit	4	RGBA, single-precision floats
64-bit	2	doubles, large integers

An instruction like "add, 8 lanes of 16-bit" adds all eight pairs in parallel. The number of lanes is the upper bound on your speed-up for that data size; memory bandwidth and loop overhead usually keep the real gain somewhat below it.
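To make the lane idea concrete, here is a small portable model in plain C. It is only a sketch: `vreg_model`, `lanes_for` and `add_8h` are hypothetical names invented for this illustration, not part of any ARM header, and the loop merely imitates what a single NEON add does in hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one 128-bit V register: the same 16 bytes viewed
   with the four A64 lane arrangements (16b, 8h, 4s, 2d). */
typedef union {
    uint8_t  b[16];  /* 16 lanes of 8 bits  */
    uint16_t h[8];   /*  8 lanes of 16 bits */
    uint32_t s[4];   /*  4 lanes of 32 bits */
    uint64_t d[2];   /*  2 lanes of 64 bits */
} vreg_model;

/* Number of lanes of a given element size in a 128-bit register. */
size_t lanes_for(size_t elem_bytes)
{
    assert(elem_bytes == 1 || elem_bytes == 2 ||
           elem_bytes == 4 || elem_bytes == 8);
    return sizeof(vreg_model) / elem_bytes;
}

/* Scalar model of "add, 8 lanes of 16-bit": each lane wraps on its own,
   and no carry crosses into the neighbouring lane. */
vreg_model add_8h(vreg_model x, vreg_model y)
{
    vreg_model r;
    for (size_t i = 0; i < 8; i++) {
        r.h[i] = (uint16_t)(x.h[i] + y.h[i]);
    }
    return r;
}
```

The point of `add_8h` is that lanes are independent: an overflow in lane 0 wraps inside lane 0 and never disturbs lane 1, which is what separates a packed add from one wide 128-bit add.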

4. A worked example: vector add

Compare scalar and NEON adds of four 32-bit integers:

    // Scalar: four separate adds
    c[0] = a[0] + b[0];
    c[1] = a[1] + b[1];
    c[2] = a[2] + b[2];
    c[3] = a[3] + b[3];

    // NEON A64: one add does all four lanes
    LD1 {v0.4s}, [x0]          // load 4×32-bit from a
    LD1 {v1.4s}, [x1]          // load 4×32-bit from b
    ADD v2.4s, v0.4s, v1.4s    // 4 adds in ONE instruction
    ST1 {v2.4s}, [x2]          // store 4 results to c

The .4s suffix means "4 lanes of 32-bit (S = single word)". One ADD replaced four. Scale that across a megapixel image and the savings are enormous.

5. Where NEON shines

Image & video — colour conversion, scaling, filtering, codecs (H.264/H.265). Our RGB-to-grayscale IP and Sobel edge detector are exactly the kind of per-pixel math NEON accelerates.
Audio & DSP — FIR/IIR filters, FFTs, mixing — vectors of samples processed together.
Machine learning — the multiply-accumulate heart of neural networks; NEON even adds dot-product and 8-bit instructions for quantised inference.
General hotspots — memcpy, string scanning, checksums, anything looping over arrays.
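As one illustration of the per-pixel math listed above, here is a hedged sketch of an RGB-to-grayscale kernel using NEON intrinsics. The function name `rgb_to_gray`, the interleaved RGB888 layout and the integer weights 77/150/29 are assumptions made for this example, not the course's own implementation. The `#if defined(__ARM_NEON)` guard lets the same file build on non-ARM machines, where only the scalar loop runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Convert interleaved RGB888 pixels to 8-bit gray using the integer
   approximation gray = (77*R + 150*G + 29*B) >> 8 (weights sum to 256). */
void rgb_to_gray(const uint8_t *rgb, uint8_t *gray, size_t pixels)
{
    assert(rgb != NULL && gray != NULL);
    size_t i = 0;

#if defined(__ARM_NEON)
    const uint8x8_t wr = vdup_n_u8(77);
    const uint8x8_t wg = vdup_n_u8(150);
    const uint8x8_t wb = vdup_n_u8(29);

    /* Eight pixels per iteration: vld3 de-interleaves R, G and B. */
    for (; i + 8 <= pixels; i += 8) {
        uint8x8x3_t px = vld3_u8(rgb + 3 * i);
        uint16x8_t acc = vmull_u8(px.val[0], wr);  /* 77*R, widened to 16 bits */
        acc = vmlal_u8(acc, px.val[1], wg);        /* += 150*G */
        acc = vmlal_u8(acc, px.val[2], wb);        /* += 29*B  */
        vst1_u8(gray + i, vshrn_n_u16(acc, 8));    /* >> 8, narrow to 8 bits */
    }
#endif

    /* Scalar tail (and the whole job on targets without NEON). */
    for (; i < pixels; i++) {
        const uint8_t *p = rgb + 3 * i;
        gray[i] = (uint8_t)((77u * p[0] + 150u * p[1] + 29u * p[2]) >> 8);
    }
}
```

The products are widened to 16 bits because 255 × 256 = 65,280 still fits in a 16-bit lane, so the accumulation cannot overflow before the final shift narrows it back to 8 bits.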

6. Three ways to use NEON

Method	Effort	Control
Auto-vectorization	just add `-O3`	compiler decides; easiest
Intrinsics	C functions (`vaddq_s32`)	explicit, portable, no asm
Hand-written assembly	highest	maximum control for hot loops

    // NEON intrinsics: readable C that maps to NEON instructions
    #include <arm_neon.h>

    int32x4_t va = vld1q_s32(a);       // load 4 ints
    int32x4_t vb = vld1q_s32(b);
    int32x4_t vc = vaddq_s32(va, vb);  // 4-lane add
    vst1q_s32(c, vc);                  // store 4 results

Most developers start with auto-vectorization (write clean, simple loops and let -O3 vectorize them), then reach for intrinsics on the few hotspots that matter most. Hand assembly is reserved for the last few percent.
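The first two approaches can be sketched side by side for arrays of any length. The function names `add_arrays_auto` and `add_arrays_neon` are hypothetical, and whether the first loop is actually vectorized depends on your compiler and flags, so treat this as a starting point rather than a guarantee. The intrinsics version handles four elements per iteration and finishes any leftover elements with a scalar tail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Auto-vectorization candidate: a simple counted loop, with `restrict`
   promising the compiler that the three arrays do not overlap. */
void add_arrays_auto(const int32_t *restrict a, const int32_t *restrict b,
                     int32_t *restrict c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

/* Intrinsics version: four 32-bit lanes per iteration, then a scalar tail. */
void add_arrays_neon(const int32_t *a, const int32_t *b, int32_t *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    size_t i = 0;

#if defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {
        int32x4_t va = vld1q_s32(a + i);      /* load 4 ints from a */
        int32x4_t vb = vld1q_s32(b + i);      /* load 4 ints from b */
        vst1q_s32(c + i, vaddq_s32(va, vb));  /* 4-lane add, store to c */
    }
#endif

    /* Leftover elements when n is not a multiple of 4
       (or every element on targets without NEON). */
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

Both functions should produce identical results for inputs that do not overflow, so the scalar one doubles as a reference when testing the intrinsics version.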

💡 The assembly-line analogy

Scalar code is one worker building one product at a time. SIMD is an assembly line where one command ("attach this part") is executed on 16 products at once. You didn't add workers (cores) — you made each instruction do more. That's why SIMD is cheap performance: same silicon, far more throughput on the right workloads.

7. NEON vs SVE

NEON's vectors are a fixed 128 bits, which keeps them simple and universal. ARM's newer SVE/SVE2 (Scalable Vector Extension, hinted at in Day 26) makes the vector length variable: the hardware picks any multiple of 128 bits from 128 up to 2048, and code written in a length-agnostic style automatically uses wider vectors on bigger chips without recompiling. SVE targets HPC and large ML; NEON remains the everyday workhorse you'll meet first and most often.
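A hedged sketch of what "length-agnostic" looks like in practice, using the SVE intrinsics from `<arm_sve.h>`. The function name `add_arrays_vla` is hypothetical, and the SVE path is only compiled when the compiler defines `__ARM_FEATURE_SVE` (for example when targeting an SVE-capable architecture); otherwise the plain loop is used. Note that the code never mentions a vector width: it asks the hardware with `svcntw()` and lets a predicate switch off the lanes that would run past the end of the array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* c[i] = a[i] + b[i] for i in [0, n), without assuming a vector length. */
void add_arrays_vla(const int32_t *a, const int32_t *b, int32_t *c, int64_t n)
{
    assert(a != NULL && b != NULL && c != NULL);

#if defined(__ARM_FEATURE_SVE)
    /* svcntw() is the number of 32-bit lanes per vector on this CPU. */
    for (int64_t i = 0; i < n; i += (int64_t)svcntw()) {
        svbool_t  pg = svwhilelt_b32_s64(i, n);  /* lane k active if i + k < n */
        svint32_t va = svld1_s32(pg, a + i);     /* predicated loads */
        svint32_t vb = svld1_s32(pg, b + i);
        svst1_s32(pg, c + i, svadd_s32_x(pg, va, vb));
    }
#else
    /* Fallback for builds without SVE support. */
    for (int64_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
#endif
}
```

Compared with the NEON style, there is no separate scalar tail: the predicate from `svwhilelt` covers the final partial vector, which is a large part of why one binary can run on hardware with different vector lengths.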

✅ The mental model

SIMD does one operation on many data elements at once; NEON is ARM's implementation with 32 × 128-bit V registers sliced into lanes (16×8b, 8×16b, 4×32b, 2×64b). It crushes per-element loops in image, audio, video and ML code. Use it via auto-vectorization, intrinsics, or assembly. SVE is the scalable, variable-length big sibling.

🎯 Day 27 takeaways

SIMD = one instruction, many data elements → data parallelism on a single core.
NEON adds 32 × 128-bit registers (V0–V31) split into lanes.
A 128-bit register holds 16×8b / 8×16b / 4×32b / 2×64b; the lane count is your peak speed-up.
Best for image, audio, video, ML and any array loop.
Use it via auto-vectorization (-O3), intrinsics, or hand assembly.
SVE = scalable, variable-length vectors for HPC/ML.

Quick check

What does SIMD stand for, and how does it differ from multicore parallelism?
How many 16-bit lanes fit in a 128-bit NEON register?
Name the three ways to get NEON into your program.
What's the key difference between NEON and SVE?

FAQ

What is SIMD?

Single Instruction, Multiple Data — one instruction operates on several packed data elements at once for data parallelism.

What is NEON?

ARM's Advanced SIMD engine: 32 128-bit vector registers (V0–V31) with lane-based instructions for multimedia and ML.

How do I use NEON?

Auto-vectorization with -O3, NEON intrinsics in C, or hand-written assembly for critical loops.

NEON vs SVE?

NEON is fixed 128-bit; SVE has a hardware-defined variable length (128–2048 bits) so one binary scales across chips.
