We've met the registers and the data. Now: how does the CPU execute instructions fast? The answer is the pipeline — an assembly line for instructions, and the single most important idea behind processor performance.
Executing one instruction takes several steps: fetch it from memory, decode what it means, then execute it. Done one at a time, the fetch and decode hardware sits idle while the execute hardware works, and vice versa. Wasteful.
So pipelining overlaps them: while instruction 1 executes, instruction 2 is being decoded, and instruction 3 is being fetched — three instructions in flight at once, every stage busy every cycle.
Think of doing laundry, where one load = wash → dry → fold. Don't wait for load 1 to be folded before washing load 2! Start washing load 2 the moment load 1 moves to the dryer. Each load still takes the same time from start to finish, but you finish a load far more often. That "far more often" is throughput.
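The distinction between latency (how long one instruction takes) and throughput (how many instructions finish per unit of time) is the whole point:
Pipelining doesn't make a single instruction faster; it makes the stream of instructions flow at roughly one per clock. In the ideal case, throughput approaches N times that of one-at-a-time execution for an N-stage pipeline, using largely the same hardware.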
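To put numbers on the latency/throughput difference, here is a minimal sketch. It assumes an idealized pipeline in which every stage takes exactly one cycle and nothing ever stalls; the function names are illustrative and not taken from any library.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles when each instruction finishes every stage before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles for an ideal pipeline: fill it once, then one completion per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def speedup(n_instructions: int, n_stages: int) -> float:
    """Throughput gain of the ideal pipeline over one-at-a-time execution."""
    if n_instructions < 1:
        raise ValueError("need at least one instruction")
    return unpipelined_cycles(n_instructions, n_stages) / pipelined_cycles(
        n_instructions, n_stages
    )


if __name__ == "__main__":
    # 3-stage pipeline, as in the fetch/decode/execute example.
    for n in (1, 3, 100):
        print(n, unpipelined_cycles(n, 3), pipelined_cycles(n, 3), round(speedup(n, 3), 2))
```

A single instruction still takes 3 cycles either way (latency is unchanged), while for long instruction streams the speedup climbs toward the number of stages.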
Early ARM (ARM7) used a 3-stage pipeline:
| Cycle → | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Instr 1 | F | D | E | | |
| Instr 2 | | F | D | E | |
| Instr 3 | | | F | D | E |
F=Fetch · D=Decode · E=Execute. From cycle 3 on, one instruction completes every cycle.
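The staggered pattern in the 3-stage diagram can be generated mechanically. The sketch below assumes an ideal pipeline with no stalls, where instruction i enters the first stage in cycle i; `pipeline_diagram` is a hypothetical helper written for this illustration.

```python
def pipeline_diagram(n_instructions: int, stages=("F", "D", "E")) -> list[list[str]]:
    """Return one row per instruction; row[c] is the stage occupied in cycle c + 1.

    An empty string means the instruction is not in the pipeline that cycle.
    """
    total_cycles = len(stages) + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            # Instruction i (0-based) occupies stage s in cycle i + s (0-based).
            row[i + s] = name
        rows.append(row)
    return rows


if __name__ == "__main__":
    for i, row in enumerate(pipeline_diagram(3), start=1):
        print(f"Instr {i} | " + " | ".join(cell or " " for cell in row))
```

Passing a longer tuple of stage names produces the same diagonal pattern for a deeper pipeline.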
ARM9 moved to a 5-stage pipeline, splitting out memory and writeback:
| Stage | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Names | Fetch | Decode | Execute | Memory | Writeback |
Memory handles load/store data access; Writeback writes the result into the register file. Modern Cortex-A cores go much deeper (10+ stages), and many are also superscalar (several instructions per cycle) and out-of-order, but the principle is identical.
Here's a classic ARM gotcha that the pipeline explains. In the 3-stage pipeline, when an instruction is executing, the PC has already moved on to fetch the instruction two ahead. Since ARM instructions are 4 bytes, the PC reads as the current instruction's address + 8.
This is why PC-relative address calculations account for the offset. (In Thumb state it's +4; the assembler usually handles it for you, but now you know why.)
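The arithmetic is small enough to sketch directly. This is an illustration under simplifying assumptions, not an assembler: the helper names are hypothetical, and the offset function covers ARM state only, because some Thumb instructions also word-align the PC value before using it.

```python
# How far ahead of the executing instruction the PC reads, in bytes.
PC_READ_AHEAD = {"arm": 8, "thumb": 4}


def pc_as_read(instruction_address: int, thumb: bool = False) -> int:
    """Value an instruction observes when it reads the PC."""
    state = "thumb" if thumb else "arm"
    return instruction_address + PC_READ_AHEAD[state]


def arm_pc_relative_offset(instruction_address: int, target_address: int) -> int:
    """Offset to encode (ARM state) so that PC + offset lands on target_address."""
    return target_address - pc_as_read(instruction_address)


if __name__ == "__main__":
    # MOV r0, pc located at 0x8000 in ARM state leaves 0x8008 in r0.
    print(hex(pc_as_read(0x8000)))
    # A PC-relative access at 0x8000 that must reach 0x8010 encodes offset 8, not 16.
    print(arm_pc_relative_offset(0x8000, 0x8010))
```

The addresses here are made up for the example; the point is that the encoded offset is measured from the instruction's address + 8, not from the instruction itself.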
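The dream of "one instruction per cycle" breaks when an instruction can't proceed. These are hazards, and they come in three classic kinds: structural (two instructions need the same hardware at once), data (an instruction needs a result that isn't ready yet), and control (a branch changes which instruction should come next).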
To soften control hazards, modern cores use branch prediction — guessing the branch outcome and speculatively fetching ahead. A correct guess = no penalty; a wrong guess = flush and refill.
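A rough cost model shows why prediction accuracy matters more as the flush gets more expensive. The sketch assumes an otherwise ideal pipeline (one instruction per cycle) in which only mispredicted branches cost extra cycles; the function name and the numbers in the demo are illustrative assumptions, not measurements of any real core.

```python
def effective_cpi(
    branch_fraction: float, mispredict_rate: float, flush_penalty_cycles: int
) -> float:
    """Average cycles per instruction when only mispredicted branches add cycles."""
    if not 0.0 <= branch_fraction <= 1.0:
        raise ValueError("branch_fraction must be between 0 and 1")
    if not 0.0 <= mispredict_rate <= 1.0:
        raise ValueError("mispredict_rate must be between 0 and 1")
    if flush_penalty_cycles < 0:
        raise ValueError("flush_penalty_cycles must be non-negative")
    # Base cost of 1 cycle, plus the flush penalty weighted by how often it occurs.
    return 1.0 + branch_fraction * mispredict_rate * flush_penalty_cycles


if __name__ == "__main__":
    # Illustrative inputs: 20% of instructions are branches, 10% of those mispredicted.
    for penalty in (2, 12):
        print(penalty, round(effective_cpi(0.20, 0.10, penalty), 3))
```

With the same prediction accuracy, the longer flush costs proportionally more cycles per instruction, so a deeper pipeline depends more heavily on a good predictor.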
A pipeline is an instruction assembly line: deeper pipelines = higher clock speeds and throughput, but bigger penalties when a branch is mispredicted (more stages to flush). Processor design is largely the art of keeping the pipeline full.
What is pipelining? Overlapping the fetch/decode/execute steps of consecutive instructions like an assembly line, so a finished instruction emerges roughly every cycle.
What are the pipeline stages? Classic 3-stage: Fetch, Decode, Execute. 5-stage: Fetch, Decode, Execute, Memory, Writeback.
Why does MOV r0, pc give current address + 8? In the 3-stage pipeline the PC has already advanced two 4-byte instructions ahead when the current one executes.