Add and subtract are cheap; multiply costs more silicon and time, and divide is so costly that early ARM cores left it out entirely. Today: how ARM multiplies, why divide is special, and the saturating math that keeps DSP from glitching.
| Instr | Computes | Notes |
|---|---|---|
| MUL Rd,Rn,Rm | Rd = Rn × Rm (low 32 bits) | basic 32×32→32 |
| MLA Rd,Rn,Rm,Ra | Rd = Rn × Rm + Ra | multiply-accumulate |
| UMULL RdLo,RdHi,Rn,Rm | 64-bit = Rn × Rm | unsigned long (full result) |
| SMULL RdLo,RdHi,Rn,Rm | 64-bit = Rn × Rm | signed long |
| UMLAL / SMLAL | 64-bit += Rn × Rm | long multiply-accumulate |
Use MUL for a plain product, MLA for "multiply then add" (the heart of filters and dot products), and the long forms when 32×32 can overflow 32 bits and you need the full 64-bit answer.
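To make the difference between the 32-bit and long forms concrete, here is a minimal C sketch that models what each instruction computes. The function names (`mul_lo`, `mla`, `umull`, `smull`) are illustrative, not real intrinsics, and the sketch assumes a typical platform where `int` is 32 bits wide.

```c
#include <stdint.h>

/* Models MUL: only the low 32 bits of the product survive. */
uint32_t mul_lo(uint32_t rn, uint32_t rm) {
    return rn * rm;                 /* unsigned math wraps modulo 2^32 */
}

/* Models MLA: multiply, add an accumulator, still truncated to 32 bits. */
uint32_t mla(uint32_t rn, uint32_t rm, uint32_t ra) {
    return rn * rm + ra;
}

/* Models UMULL: the full 64-bit unsigned product (RdHi:RdLo as one value). */
uint64_t umull(uint32_t rn, uint32_t rm) {
    return (uint64_t)rn * rm;       /* widen first so nothing is lost */
}

/* Models SMULL: the full 64-bit signed product. */
int64_t smull(int32_t rn, int32_t rm) {
    return (int64_t)rn * rm;
}
```

For example, `mul_lo(0x10000, 0x10000)` loses the whole result (the product is exactly 2^32), while `umull` with the same inputs keeps it. That is the case the long forms exist for.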
For a constant multiplier, the barrel shifter (Day 9) is often cheaper: r1 × 5 is just ADD r0, r1, r1, LSL #2, which computes r1 + (r1 × 4). Compilers often prefer shift-and-add for small constants and save MUL for variable multipliers.
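The shift-and-add idea can be sketched in C as follows. `times5` mirrors the ADD with LSL #2 from the text; `times7` is an extra illustration (not from the lesson) of the same trick using a subtract, which corresponds to RSB r0, r1, r1, LSL #3. Both names are hypothetical.

```c
#include <stdint.h>

/* r1 * 5 = r1 + (r1 << 2), mirroring ADD r0, r1, r1, LSL #2. */
uint32_t times5(uint32_t r1) {
    return r1 + (r1 << 2);
}

/* r1 * 7 = (r1 << 3) - r1, mirroring RSB r0, r1, r1, LSL #3. */
uint32_t times7(uint32_t r1) {
    return (r1 << 3) - r1;
}
```

Like MUL, both versions wrap modulo 2^32 on overflow, so they agree with a real multiply for every input.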
Division is hard to build in hardware (it's inherently iterative), so to stay small and fast many early ARM cores had no divide instruction at all. Software handled it:
- Divide by a power of two: use a shift (LSR/ASR, Day 9).
- Everything else: call a library routine (such as __aeabi_idiv) doing iterative long division.

Modern cores do include hardware divide:
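SDIV/UDIV are available on Cortex-M3-class and bigger M-profile cores (M3, M4, M7; the small Cortex-M0/M0+ still omit them), and on modern A-profile / A64. Note there's no remainder instruction: you compute it as a − (a/b)×b with a divide followed by a multiply and subtract (which MLS can fuse into one instruction).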
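The software side can be sketched in C. This is an illustration under stated assumptions, not the actual `__aeabi_idiv` implementation: `udiv_soft` is a plain shift-and-subtract long division (one quotient bit per step, unsigned only), `urem` applies the a − (a/b)×b identity, and `udiv10` shows the multiply-by-reciprocal trick for one constant divisor. All three names are hypothetical.

```c
#include <stdint.h>

/* Shift-and-subtract long division: 32 steps, one quotient bit each.
   The caller must ensure den != 0. rem_out may be NULL. */
uint32_t udiv_soft(uint32_t num, uint32_t den, uint32_t *rem_out) {
    uint32_t quo = 0;
    uint64_t rem = 0;               /* 64-bit so the shift cannot drop a bit */
    for (int bit = 31; bit >= 0; --bit) {
        rem = (rem << 1) | ((num >> bit) & 1u);   /* bring down next bit */
        if (rem >= den) {           /* divisor fits: subtract, set this bit */
            rem -= den;
            quo |= (1u << bit);
        }
    }
    if (rem_out) *rem_out = (uint32_t)rem;
    return quo;
}

/* Remainder without a remainder instruction: a - (a / b) * b. */
uint32_t urem(uint32_t a, uint32_t b) {
    uint32_t q = a / b;             /* the UDIV step */
    return a - q * b;               /* the multiply-and-subtract step */
}

/* Divide by the constant 10 via a reciprocal: 0xCCCCCCCD = ceil(2^35 / 10).
   A 64-bit product (UMULL-style) followed by a shift replaces the divide. */
uint32_t udiv10(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

The loop in `udiv_soft` is why division is called inherently iterative: each quotient bit depends on the remainder left by the previous step.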
Normal arithmetic wraps on overflow: add 1 to the largest value and it flips to the smallest (a huge, ugly jump). In signal processing that wraparound is catastrophic — a loud click, a corrupted sample. Saturating arithmetic instead clamps to the max or min:
normal: 0x7FFFFFFF + 1 = 0x80000000 (wraps to most-negative ❌)
saturate: 0x7FFFFFFF + 1 = 0x7FFFFFFF (clamps to max ✓)
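The two behaviours can be compared in a small C sketch. `add_wrap` and `add_sat` are hypothetical helper names; `add_sat` models the clamping described above by doing the sum in 64 bits and then limiting it. The cast back to `int32_t` in `add_wrap` assumes the usual two's-complement conversion that mainstream compilers provide.

```c
#include <stdint.h>

/* Wrapping add: the sum is taken modulo 2^32, so overflow flips the sign. */
int32_t add_wrap(int32_t a, int32_t b) {
    return (int32_t)((uint32_t)a + (uint32_t)b);
}

/* Saturating add: compute in 64 bits, then clamp to the 32-bit range. */
int32_t add_sat(int32_t a, int32_t b) {
    int64_t sum = (int64_t)a + b;
    if (sum > INT32_MAX) return INT32_MAX;   /* clamp to max */
    if (sum < INT32_MIN) return INT32_MIN;   /* clamp to min */
    return (int32_t)sum;
}
```

With `a = 0x7FFFFFFF` and `b = 1`, `add_wrap` lands on the most-negative value while `add_sat` stays pinned at the maximum, matching the two lines above.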
ARM's DSP instructions do this in hardware:
| Instr | Does |
|---|---|
| QADD Rd,Rn,Rm | saturating add |
| QSUB Rd,Rn,Rm | saturating subtract |
| SSAT / USAT | clamp a value to a given bit-width |
These are central to fixed-point DSP on cores like the Cortex-M4/M7 (which add a DSP extension and SIMD). Audio, control loops and filters rely on saturation to degrade gracefully instead of glitching.
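The clamp-to-bit-width behaviour of SSAT and USAT can be modelled in C like this. The names `ssat` and `usat` here are plain illustrative functions, not compiler intrinsics, and the sketch assumes the width `n` is in the range 1 to 32 for the signed form and 0 to 31 for the unsigned form.

```c
#include <stdint.h>

/* SSAT-style: clamp x to the signed n-bit range [-2^(n-1), 2^(n-1) - 1].
   Assumes 1 <= n <= 32. */
int32_t ssat(int32_t x, unsigned n) {
    int64_t max = ((int64_t)1 << (n - 1)) - 1;
    int64_t min = -max - 1;
    if (x > max) return (int32_t)max;
    if (x < min) return (int32_t)min;
    return x;
}

/* USAT-style: clamp a signed x to the unsigned n-bit range [0, 2^n - 1].
   Assumes 0 <= n <= 31. */
uint32_t usat(int32_t x, unsigned n) {
    int64_t max = ((int64_t)1 << n) - 1;
    if (x < 0) return 0;
    if (x > max) return (uint32_t)max;
    return (uint32_t)x;
}
```

A typical use would be narrowing a 32-bit intermediate result to a 16-bit audio sample with `ssat(acc, 16)`, so an overshoot becomes a flat peak rather than a sign flip.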
MUL/MLA for products (and dot-products); UMULL/SMULL when you need all 64 bits. Divide is costly — avoid it (shift/reciprocal) or use SDIV/UDIV on newer cores. Saturating math (QADD/QSUB/SSAT) clamps instead of wrapping — essential for DSP.
a×b + c in one step?

MUL (low 32), MLA (multiply-accumulate), UMULL/SMULL (full 64-bit unsigned/signed), and UMLAL/SMLAL (64-bit accumulate).
Not on early cores, where software handled it. SDIV/UDIV exist on Cortex-M3/M4/M7-class cores and modern A-profile/A64, but not on the small Cortex-M0/M0+. There's no remainder instruction.
Math that clamps to max/min on overflow instead of wrapping (QADD/QSUB/SSAT) — essential for DSP and fixed-point.