Why did early ARM cores have no divide instruction?

Division is expensive in hardware, so to keep the core small and fast many early ARM cores omitted a divide instruction. Software handled division with library routines, and division by a constant was usually replaced by a multiply-by-reciprocal or shifts. Later cores added hardware divide: SDIV and UDIV are present on Cortex-M3 and the other ARMv7-M/ARMv8-M cores (but not on the ARMv6-M Cortex-M0/M0+), on ARMv7-A cores that implement the optional divide extension, and on all ARMv8-A cores, including A64.

How do you multiply by a constant efficiently on ARM?

For many constants you do not need MUL at all: the barrel shifter lets you combine a shift and an add, so r1 times 5 is r1 plus r1 shifted left by 2, in a single instruction. Compilers often prefer shift-and-add sequences for small constant multiplications because they avoid the multiplier, which took several cycles on older cores (on cores with a single-cycle multiplier the gain is small). Use MUL when the multiplier is a variable.

DAY 13 · THE INSTRUCTION SET

Multiply, Divide & Saturating Arithmetic

Q: What are the ARM multiply instructions?

MUL multiplies two registers and keeps the lower 32 bits. MLA is multiply-accumulate, computing a times b plus c. UMULL and SMULL are the unsigned and signed long multiplies that produce a full 64-bit result in two registers, and UMLAL/SMLAL accumulate into a 64-bit value. These cover most integer multiplication needs.

Q: What is saturating arithmetic?

Saturating arithmetic clamps a result to the maximum or minimum representable value instead of wrapping around on overflow. For example, adding to a value already near the top limit yields the top limit rather than a wrapped negative number. ARM provides saturating instructions such as QADD and QSUB, which are important in DSP and fixed-point processing where wraparound would create severe glitches.

By EcrioniX · Updated Jun 6, 2026

Add and subtract are cheap; multiply is bigger, and divide is so costly that early ARM cores left it out entirely. Today: how ARM multiplies, why divide is special, and the saturating math that keeps DSP from glitching.

1. Multiply instructions

Instr	Computes	Notes
MUL Rd,Rn,Rm	Rd = Rn × Rm (low 32 bits)	basic 32×32→32
MLA Rd,Rn,Rm,Ra	Rd = Rn × Rm + Ra	multiply-accumulate
UMULL RdLo,RdHi,Rn,Rm	64-bit = Rn × Rm	unsigned long (full result)
SMULL RdLo,RdHi,Rn,Rm	64-bit = Rn × Rm	signed long
UMLAL / SMLAL	64-bit += Rn × Rm	long multiply-accumulate

MUL r0, r1, r2 ; r0 = r1 * r2 (low 32 bits)
MLA r0, r1, r2, r3 ; r0 = r1*r2 + r3 (great for dot products)
UMULL r0, r1, r2, r3 ; {r1:r0} = r2 * r3 (full 64-bit)

Use MUL for a plain product, MLA for "multiply then add" (the heart of filters and dot products), and the long forms when 32×32 can overflow 32 bits and you need the full 64-bit answer.
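To make the difference between the 32-bit and 64-bit forms concrete, here is a minimal C sketch that models what each instruction computes. The function names (mul_lo32, mla32, umull64, smull64, dot64) are made up for illustration. A compiler targeting ARM will typically map these patterns to MUL, MLA, UMULL, SMULL and SMLAL, but the exact instruction selection depends on the compiler and core.

```c
#include <stddef.h>
#include <stdint.h>

/* Models MUL: 32x32 -> low 32 bits (unsigned arithmetic wraps modulo 2^32). */
uint32_t mul_lo32(uint32_t a, uint32_t b) {
    return a * b;
}

/* Models MLA: acc + a*b, truncated to 32 bits. */
uint32_t mla32(uint32_t a, uint32_t b, uint32_t acc) {
    return acc + a * b;
}

/* Models UMULL: widen one operand first so the full 64-bit product is kept. */
uint64_t umull64(uint32_t a, uint32_t b) {
    return (uint64_t)a * b;
}

/* Models SMULL: signed 32x32 -> 64. */
int64_t smull64(int32_t a, int32_t b) {
    return (int64_t)a * b;
}

/* Dot product with a 64-bit accumulator: one SMLAL-style step per element. */
int64_t dot64(const int32_t *x, const int32_t *y, size_t n) {
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int64_t)x[i] * y[i];
    }
    return acc;
}
```

For example, 0x10000 × 0x10000 is 2^32: mul_lo32 returns 0 because the result no longer fits in 32 bits, while umull64 returns 0x100000000.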

💡 Often you don't need MUL at all

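For a constant multiplier, the barrel shifter (Day 9) is often at least as fast: r1 × 5 is just ADD r0, r1, r1, LSL #2. Compilers often prefer shift-and-add for small constants and save MUL for variable multipliers. The gain is largest on older cores with a multi-cycle multiplier and small on cores that multiply in a single cycle.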
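The shift-and-add idea can be sketched in C as follows. The helper names (mul5, mul7, mul10) are hypothetical, and the assembly shown in the comments is one plausible encoding, not necessarily what a given compiler emits.

```c
#include <stdint.h>

/* x * 5 = x + (x << 2)          e.g. ADD r0, r1, r1, LSL #2 */
uint32_t mul5(uint32_t x) {
    return x + (x << 2);
}

/* x * 7 = (x << 3) - x          e.g. RSB r0, r1, r1, LSL #3 */
uint32_t mul7(uint32_t x) {
    return (x << 3) - x;
}

/* x * 10 = (x + (x << 2)) << 1  e.g. an ADD with shift, then an LSL */
uint32_t mul10(uint32_t x) {
    return (x + (x << 2)) << 1;
}
```

Constants of the form 2^n + 1 or 2^n − 1 fit in one instruction; others, such as 10, need a short sequence, and at some point a plain MUL becomes the simpler choice.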

2. Division — the expensive one

Division is hard to build in hardware (it's inherently iterative), so to stay small and fast many early ARM cores had no divide instruction at all. Software handled it:

By a constant → replaced with a multiply-by-reciprocal and a shift, or pure shifts for powers of two (LSR/ASR, Day 9).
By a variable → a library routine (e.g. __aeabi_idiv) doing iterative long division.
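Both software strategies can be sketched in C. The first function divides by the constant 10 using the reciprocal 0xCCCCCCCD, which is ceil(2^35 / 10), followed by a shift of 35. The second is a plain shift-and-subtract long division. Both names (udiv10, udiv_soft) are hypothetical, and udiv_soft is only an illustration of the idea, not the actual implementation of __aeabi_idiv or any other runtime routine.

```c
#include <stdint.h>

/* Unsigned divide by 10 via multiply-by-reciprocal:
   0xCCCCCCCD = ceil(2^35 / 10), so (x * 0xCCCCCCCD) >> 35 == x / 10
   for every 32-bit x. The product needs 64 bits (a UMULL on ARM). */
uint32_t udiv10(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}

/* Shift-and-subtract long division: produces one quotient bit per step.
   Assumption: division by zero returns 0 here as a placeholder policy. */
uint32_t udiv_soft(uint32_t n, uint32_t d) {
    if (d == 0) {
        return 0;
    }
    uint32_t q = 0;
    uint64_t r = 0; /* running remainder, always < d after each step */
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u); /* bring down the next bit of n */
        if (r >= d) {
            r -= d;
            q |= 1u << i;
        }
    }
    return q;
}
```

The 32 loop iterations in udiv_soft show why variable division is slow without hardware support, and why compilers go to some length to turn constant divisions into a multiply and a shift.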

Modern cores do include hardware divide:

SDIV r0, r1, r2 ; signed r0 = r1 / r2
UDIV r0, r1, r2 ; unsigned r0 = r1 / r2

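SDIV/UDIV are available on Cortex-M3 and the other ARMv7-M/ARMv8-M cores (the ARMv6-M Cortex-M0/M0+ still lack them), on ARMv7-A cores that implement the optional divide extension, and on all ARMv8-A cores, including A64. Note there's no remainder instruction: you compute it as a − (a/b)×b, with a divide followed by a multiply and subtract (which MLS, or MSUB on A64, fuses into one instruction).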
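The remainder identity is short enough to write out. This is a minimal sketch with a hypothetical helper name (rem_via_div). It assumes the caller has already ruled out b == 0 and the INT32_MIN / −1 case, both of which are undefined in C.

```c
#include <stdint.h>

/* Remainder from a divide, multiply and subtract: a - (a / b) * b.
   Assumption: b != 0 and (a, b) != (INT32_MIN, -1). */
int32_t rem_via_div(int32_t a, int32_t b) {
    int32_t q = a / b; /* the divide (SDIV on cores that have it) */
    return a - q * b;  /* multiply and subtract (a single MLS/MSUB where available) */
}
```

Because C division truncates toward zero, the result carries the sign of the dividend: 17 and 5 give 2, while −17 and 5 give −2, matching the % operator.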

3. Saturating arithmetic

Normal arithmetic wraps on overflow: add 1 to the largest value and it flips to the smallest (a huge, ugly jump). In signal processing that wraparound is catastrophic — a loud click, a corrupted sample. Saturating arithmetic instead clamps to the max or min:

normal: 0x7FFFFFFF + 1 = 0x80000000 (wraps to most-negative ❌)
saturate: 0x7FFFFFFF + 1 = 0x7FFFFFFF (clamps to max ✓)
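The clamping behaviour can be modelled in portable C by doing the sum in 64 bits and then limiting it to the 32-bit range. The names qadd_model and qsub_model are made up for this sketch; it models only the arithmetic result of a saturating add or subtract, not any status flag the hardware may also set.

```c
#include <stdint.h>

/* Saturating signed add: compute in 64 bits, then clamp to int32_t. */
int32_t qadd_model(int32_t a, int32_t b) {
    int64_t sum = (int64_t)a + b;
    if (sum > INT32_MAX) {
        return INT32_MAX;
    }
    if (sum < INT32_MIN) {
        return INT32_MIN;
    }
    return (int32_t)sum;
}

/* Saturating signed subtract, same idea. */
int32_t qsub_model(int32_t a, int32_t b) {
    int64_t diff = (int64_t)a - b;
    if (diff > INT32_MAX) {
        return INT32_MAX;
    }
    if (diff < INT32_MIN) {
        return INT32_MIN;
    }
    return (int32_t)diff;
}
```

With these, qadd_model(0x7FFFFFFF, 1) stays at 0x7FFFFFFF, matching the "saturate" line above, and ordinary sums such as 1 + 2 are unchanged.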

ARM's DSP instructions do this in hardware:

Instr	Does
QADD Rd,Rn,Rm	saturating add
QSUB Rd,Rn,Rm	saturating subtract
SSAT / USAT	clamp a value to a given bit-width

These are central to fixed-point DSP on cores like the Cortex-M4/M7 (which add a DSP extension and SIMD). Audio, control loops and filters rely on saturation to degrade gracefully instead of glitching.
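SSAT and USAT differ from QADD/QSUB in that the clamp width is a parameter. A rough C model, with hypothetical names ssat_model and usat_model, might look like the sketch below. It assumes bits is in 1..32 for the signed form and 0..31 for the unsigned form, and it ignores the optional shift the real instructions can apply to the input.

```c
#include <stdint.h>

/* Clamp x to the signed range of a `bits`-wide value:
   -2^(bits-1) .. 2^(bits-1) - 1. Assumption: 1 <= bits <= 32. */
int32_t ssat_model(int32_t x, unsigned bits) {
    int64_t max = ((int64_t)1 << (bits - 1)) - 1;
    int64_t min = -max - 1;
    if (x > max) {
        return (int32_t)max;
    }
    if (x < min) {
        return (int32_t)min;
    }
    return x;
}

/* Clamp signed x to the unsigned range 0 .. 2^bits - 1.
   Assumption: 0 <= bits <= 31. */
uint32_t usat_model(int32_t x, unsigned bits) {
    int64_t max = ((int64_t)1 << bits) - 1;
    if (x < 0) {
        return 0;
    }
    if (x > max) {
        return (uint32_t)max;
    }
    return (uint32_t)x;
}
```

A typical use is narrowing a 32-bit intermediate to a 16-bit audio sample: ssat_model(40000, 16) gives 32767 rather than a wrapped negative value, and usat_model(300, 8) gives 255 for an 8-bit pixel.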

✅ The mental model

MUL/MLA for products (and dot-products); UMULL/SMULL when you need all 64 bits. Divide is costly — avoid it (shift/reciprocal) or use SDIV/UDIV on newer cores. Saturating math (QADD/QSUB/SSAT) clamps instead of wrapping — essential for DSP.

🎯 Day 13 takeaways

MUL = 32×32→32; MLA = multiply-accumulate; UMULL/SMULL = full 64-bit; UMLAL/SMLAL accumulate.
Constant multiply → prefer shift + add (barrel shifter), not MUL.
Early ARM had no divide; use shifts/reciprocal or SDIV/UDIV on modern cores. No remainder instruction.
Saturating arithmetic (QADD/QSUB/SSAT) clamps on overflow — vital for DSP/fixed-point.

Quick check

Which instruction computes a×b + c in one step?
Why did early ARM cores omit a divide instruction?
What does a saturating add do at the maximum value instead of wrapping?

FAQ

What are the ARM multiply instructions?

MUL (low 32), MLA (multiply-accumulate), UMULL/SMULL (full 64-bit unsigned/signed), and UMLAL/SMLAL (64-bit accumulate).

Does ARM have divide?

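Not on early cores; software handled it. SDIV/UDIV exist on Cortex-M3 and the other ARMv7-M/ARMv8-M cores (not Cortex-M0/M0+), on ARMv7-A cores with the optional divide extension, and on all ARMv8-A/A64 cores. There's no remainder instruction.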

What is saturating arithmetic?

Math that clamps to max/min on overflow instead of wrapping (QADD/QSUB/SSAT) — essential for DSP and fixed-point.
