Your phone decodes 4K video, applies camera filters in real time and runs on-device AI — all without melting the battery. The secret weapon behind much of that is NEON, ARM's SIMD engine. Today you'll understand how doing the same operation on many pieces of data at once turns a slow loop into a blazing one.
Consider adding two arrays of 1,000 numbers. Ordinary "scalar" code loads one element, adds one element, stores one element — 1,000 times. But the data elements are often small (8- or 16-bit pixels, 32-bit samples) while the processor's datapath is 64 or 128 bits wide. You're driving a truck to deliver one parcel at a time.
SIMD — Single Instruction, Multiple Data — fixes this. One instruction operates on a whole vector of elements packed side by side. Add eight 16-bit numbers in a single instruction and your loop runs roughly 8× fewer iterations. This is data parallelism, and it's distinct from the thread/core parallelism of multicore (Day 29) — here a single core does more per instruction.
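To make the "eight 16-bit numbers per instruction" idea concrete, here is a minimal sketch in C using NEON intrinsics. The function name add_i16 is hypothetical, and the `#if` guard is an assumption added so the same file still builds (as plain scalar code) on a machine without NEON.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for n 16-bit samples.
 * Sums that overflow 16 bits wrap around on the NEON path. */
void add_i16(const int16_t *a, const int16_t *b, int16_t *out, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    /* Vector part: each iteration handles eight 16-bit lanes. */
    for (; i + 8 <= n; i += 8) {
        int16x8_t va = vld1q_s16(a + i);        /* load 8 elements of a */
        int16x8_t vb = vld1q_s16(b + i);        /* load 8 elements of b */
        vst1q_s16(out + i, vaddq_s16(va, vb));  /* one add, 8 results   */
    }
#endif
    /* Scalar tail: the leftover elements, or the whole array when the
     * NEON path is compiled out. */
    for (; i < n; i++) {
        out[i] = (int16_t)(a[i] + b[i]);
    }
}
```

With n = 1,000 the vector loop runs 125 times instead of 1,000, and the scalar tail has nothing left to do. For a length that is not a multiple of eight, the tail picks up the remainder.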
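NEON (officially Advanced SIMD) is the SIMD extension found in virtually every modern ARM application processor. On 64-bit ARM (AArch64) it adds 32 vector registers of 128 bits each (V0–V31) and lane-based instructions that operate on the elements packed inside them.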
The same 128-bit register can be sliced different ways depending on your data — that flexibility is the heart of NEON.
A 128-bit register can hold:
| Lane size | Lanes | Typical data |
|---|---|---|
| 8-bit | 16 | image pixels, bytes |
| 16-bit | 8 | audio samples, half-floats |
| 32-bit | 4 | RGBA, single-precision floats |
| 64-bit | 2 | doubles, large integers |
An instruction like "add, 8 lanes of 16-bit" adds all eight pairs in parallel. The lane count is the ideal speed-up factor for that data size; real gains are usually somewhat lower, because loads, stores and memory bandwidth still cost time.
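The lane table boils down to one division: 128 bits divided by the lane width. The sketch below spells that out; lanes_for and VECTOR_BITS are hypothetical names, and the static assertions (compiled only when NEON is available, and assuming a C11 compiler) check that the matching intrinsic types really are 16 bytes wide.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
/* Each of these types fills one 128-bit (16-byte) register. */
_Static_assert(sizeof(int8x16_t) == 16, "16 lanes of 8 bits");
_Static_assert(sizeof(int16x8_t) == 16, "8 lanes of 16 bits");
_Static_assert(sizeof(int32x4_t) == 16, "4 lanes of 32 bits");
_Static_assert(sizeof(int64x2_t) == 16, "2 lanes of 64 bits");
#endif

enum { VECTOR_BITS = 128 };  /* fixed NEON register width */

/* Hypothetical helper: how many lanes of lane_bits fit in one register. */
static inline unsigned lanes_for(unsigned lane_bits)
{
    return VECTOR_BITS / lane_bits;
}
```

The type names encode the same table: int16x8_t reads as "16-bit lanes, 8 of them".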
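Compare scalar and NEON adds of four 32-bit integers. Scalar code issues four separate ADD instructions, one per element; NEON issues a single vector instruction such as add v0.4s, v1.4s, v2.4s.
The .4s suffix means "4 lanes of 32-bit" (S = single word). One ADD replaced four. Scale that across a megapixel image and the savings are enormous.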
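The same four-lane add can be sketched from C with the vaddq_s32 intrinsic. The function name add4_i32 is hypothetical, and the scalar branch is an assumption added so the example also builds where NEON is unavailable; it shows the four separate adds that the single vector add stands in for.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[0..3] = a[0..3] + b[0..3]. */
void add4_i32(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
#if defined(__ARM_NEON)
    int32x4_t va = vld1q_s32(a);  /* four 32-bit lanes from a */
    int32x4_t vb = vld1q_s32(b);  /* four 32-bit lanes from b */
    /* On AArch64 this typically becomes one ADD using the .4s arrangement. */
    vst1q_s32(out, vaddq_s32(va, vb));
#else
    /* Scalar version: four separate adds, one per element. */
    for (int i = 0; i < 4; i++) {
        out[i] = a[i] + b[i];
    }
#endif
}
```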
| Method | Effort | Control |
|---|---|---|
| Auto-vectorization | lowest: just build with -O3 | compiler decides what gets vectorized |
| Intrinsics | moderate: call C functions such as vaddq_s32 | explicit choice of instructions, no asm |
| Hand-written assembly | highest | maximum control for hot loops |
Most developers start with auto-vectorization (write clean, simple loops and let -O3 vectorize them), then reach for intrinsics on the few hotspots that matter most. Hand assembly is reserved for the last few percent.
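As a rough illustration of a "clean, simple loop", the sketch below is the kind of code compilers tend to auto-vectorize at -O3. The function scale_bias and its parameters are hypothetical. It assumes the input and output buffers never overlap, which the restrict qualifiers promise to the compiler.

```c
#include <stddef.h>

/* Hypothetical example: out[i] = gain * in[i] + bias over n floats.
 * Why this shape is vectorizer-friendly:
 *   - a simple counted loop with no early exit,
 *   - each iteration is independent of the others,
 *   - restrict tells the compiler in and out do not alias. */
void scale_bias(const float *restrict in, float *restrict out,
                size_t n, float gain, float bias)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = gain * in[i] + bias;
    }
}
```

To see whether the loop was actually vectorized, GCC can report it with -fopt-info-vec and Clang with -Rpass=loop-vectorize. If the report says no, that loop is a candidate for intrinsics.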
Scalar code is one worker building one product at a time. SIMD is an assembly line where one command ("attach this part") is executed on 16 products at once. You didn't add workers (cores) — you made each instruction do more. That's why SIMD is cheap performance: same silicon, far more throughput on the right workloads.
NEON's vectors are a fixed 128 bits: simple and universal. ARM's newer SVE/SVE2 (Scalable Vector Extension, hinted at in Day 26) makes the vector length variable (128–2048 bits, in 128-bit steps), chosen by the hardware, so one binary automatically uses wider vectors on bigger chips. SVE was first aimed at HPC and large ML workloads; NEON remains the everyday workhorse you'll meet first and most often.
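To show what "one binary, any vector length" looks like in practice, here is a hedged sketch of a vector-length-agnostic loop using the SVE intrinsics from arm_sve.h. The function name add_f32_vla is hypothetical, and the scalar fallback is an assumption so the file still builds when SVE is not enabled (for example, without a flag such as -march=armv8-a+sve).

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for n floats. */
void add_f32_vla(const float *a, const float *b, float *out, size_t n)
{
#if defined(__ARM_FEATURE_SVE)
    int64_t len = (int64_t)n;
    /* svcntw() is the number of 32-bit lanes per vector on this chip;
     * the code never hard-codes it. */
    for (int64_t i = 0; i < len; i += (int64_t)svcntw()) {
        /* Predicate: lanes whose index is still below len are active,
         * so the final partial vector needs no separate tail loop. */
        svbool_t pg = svwhilelt_b32_s64(i, len);
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, out + i, svadd_f32_x(pg, va, vb));
    }
#else
    /* Scalar fallback when SVE is not available at compile time. */
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
#endif
}
```

Compare this with a NEON loop: there the step is a fixed lane count (4 for 32-bit floats), while here the step comes from the hardware at run time and the predicate handles the leftover elements.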
SIMD does one operation on many data elements at once; NEON is ARM's implementation with 32 × 128-bit V registers sliced into lanes (16×8b, 8×16b, 4×32b, 2×64b). It crushes per-element loops in image, audio, video and ML code. Use it via auto-vectorization, intrinsics, or assembly. SVE is the scalable, variable-length big sibling.
SIMD: Single Instruction, Multiple Data. One instruction operates on several packed data elements at once for data parallelism.
NEON: ARM's Advanced SIMD engine, with 32 128-bit vector registers (V0–V31) and lane-based instructions for multimedia and ML.
Ways to use NEON: auto-vectorization with -O3, NEON intrinsics in C, or hand-written assembly for critical loops.
NEON vs. SVE: NEON is fixed 128-bit; SVE has a hardware-defined variable length (128–2048 bits) so one binary scales across chips.