DAY 7 · THE INSTRUCTION SET

The Pipeline — How an Instruction Actually Flows

By EcrioniX · Updated Jun 6, 2026

We've met the registers and the data. Now: how does the CPU execute instructions fast? The answer is the pipeline — an assembly line for instructions, and the single most important idea behind processor performance.

1. The assembly-line idea

Executing one instruction takes several steps: fetch it from memory, decode what it means, then execute it. Done one-at-a-time, the fetch and decode hardware sits idle while the execute hardware works, and vice-versa. Wasteful.

So pipelining overlaps them: while instruction 1 executes, instruction 2 is being decoded, and instruction 3 is being fetched — three instructions in flight at once, every stage busy every cycle.

💡 Laundry analogy

One load = wash → dry → fold. Don't wait for load 1 to be folded before washing load 2! Start washing load 2 the moment load 1 moves to the dryer. Same total time per load, but you finish a load far more often. That "far more often" is throughput.

2. Latency vs throughput

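This distinction is the whole point. Latency is how long one instruction takes to travel from fetch to completion. Throughput is how many instructions complete per unit of time.

Pipelining doesn't make a single instruction faster; it makes the stream of instructions flow at roughly one per clock. In the ideal case, with no stalls, a k-stage pipeline approaches a k-fold throughput gain over finishing each instruction before starting the next, using largely the same functional units.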
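To put numbers on that, here is a minimal sketch of the cycle arithmetic. It assumes an idealized pipeline where every stage takes exactly one cycle and nothing ever stalls; the function names are made up for illustration.

```python
def cycles_unpipelined(n_instructions: int, n_stages: int) -> int:
    """Each instruction runs all its stages before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions: int, n_stages: int) -> int:
    """The first instruction needs n_stages cycles to fill the pipeline;
    after that, one more instruction completes every cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


n, k = 1000, 3  # 1000 instructions through a 3-stage pipeline
plain = cycles_unpipelined(n, k)
piped = cycles_pipelined(n, k)

print(f"latency per instruction: {k} cycles in both cases")
print(f"unpipelined total: {plain} cycles")
print(f"pipelined total:   {piped} cycles")
print(f"speedup: {plain / piped:.2f}x")
```

The latency stays at k cycles either way. Only the total changes, and the speedup creeps toward k as the instruction count grows.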

3. The classic ARM pipelines

Early ARM (ARM7) used a 3-stage pipeline:

Cycle →    1    2    3    4    5
Instr 1    F    D    E
Instr 2         F    D    E
Instr 3              F    D    E

F=Fetch · D=Decode · E=Execute. From cycle 3 on, one instruction completes every cycle.
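The same diagram can be generated by a few lines of code, which makes the rule behind it explicit: instruction i enters stage s in cycle i + s. This is only a sketch of an ideal, stall-free pipeline, and `pipeline_diagram` is a hypothetical helper name.

```python
STAGES = "FDE"  # Fetch, Decode, Execute


def pipeline_diagram(n_instructions: int, stages: str = STAGES) -> list[str]:
    """Return one row per instruction; '.' marks a cycle where that
    instruction is not in the pipeline."""
    total_cycles = len(stages) + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        cells = ["."] * total_cycles
        for s, name in enumerate(stages):
            # Instruction i (0-based) occupies stage s in cycle i + s (0-based).
            cells[i + s] = name
        rows.append("".join(cells))
    return rows


for number, row in enumerate(pipeline_diagram(3), start=1):
    print(f"Instr {number}  {' '.join(row)}")
```

Passing a five-letter stage string such as "FDEMW" gives the same staircase shape for a 5-stage pipeline.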

ARM9 moved to a 5-stage pipeline, splitting out memory and writeback (later cores went deeper still):

Stage    1        2         3          4         5
Name     Fetch    Decode    Execute    Memory    Writeback

Memory handles load/store data access; Writeback writes the result into the register file. Many modern Cortex-A cores go much deeper (10+ stages), are superscalar (several instructions per cycle) and often execute out of order, but the underlying principle is the same.

4. The famous "PC + 8" quirk

Here's a classic ARM gotcha that the pipeline explains. In the 3-stage pipeline, when an instruction is executing, the PC has already moved on to fetch the instruction two ahead. Since ARM instructions are 4 bytes, the PC reads as the current instruction's address + 8. Later cores with deeper pipelines keep the same +8 behavior in ARM state for compatibility.

; if this instruction is at address 0x1000 ...
MOV r0, pc    ; r0 becomes 0x1008, not 0x1000 (+8 due to pipeline)

This is why PC-relative address calculations account for the offset. (In Thumb state it's +4; the assembler usually handles it for you, but now you know why.)
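As a rough illustration of the bookkeeping the assembler does for you, the sketch below models the value the PC reads as, then derives a PC-relative offset from it. The function names and the 0x1020 target are hypothetical, and the model ignores finer details such as the word-alignment of the PC that some Thumb instructions apply.

```python
def pc_as_read(instr_addr: int, thumb: bool = False) -> int:
    """Value seen when an instruction reads the PC:
    its own address + 8 in ARM state, + 4 in Thumb state."""
    return instr_addr + (4 if thumb else 8)


def pc_relative_offset(instr_addr: int, target_addr: int, thumb: bool = False) -> int:
    """Offset to encode so that (PC as read) + offset == target."""
    return target_addr - pc_as_read(instr_addr, thumb)


instr = 0x1000
target = 0x1020  # hypothetical address we want to reach

print(hex(pc_as_read(instr)))                  # ARM state
print(hex(pc_as_read(instr, thumb=True)))      # Thumb state
print(pc_relative_offset(instr, target))       # bytes from PC-as-read to target
```

The offset comes out 8 bytes smaller than the naive `target - instr`, which is exactly the adjustment the +8 rule forces.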

5. When the pipeline stalls — hazards

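The dream of "one instruction per cycle" breaks when an instruction can't proceed. These are hazards, and they come in three classic kinds: structural hazards (two instructions need the same hardware resource in the same cycle), data hazards (an instruction needs a result that an earlier instruction has not produced yet), and control hazards (a branch changes which instruction comes next, so the instructions already fetched behind it may be the wrong ones).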

To soften control hazards, modern cores use branch prediction — guessing the branch outcome and speculatively fetching ahead. A correct guess = no penalty; a wrong guess = flush and refill.
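A back-of-the-envelope model shows how much a wrong guess costs on average. This is a simplified sketch: it assumes an ideal base of one cycle per instruction and a fixed flush penalty, and the branch fraction, misprediction rate and penalty values are invented purely for illustration.

```python
def effective_cpi(branch_fraction: float,
                  mispredict_rate: float,
                  flush_penalty_cycles: int,
                  base_cpi: float = 1.0) -> float:
    """Average cycles per instruction once misprediction flushes are
    added on top of the ideal base CPI."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles


# Illustrative numbers only: 20% of instructions are branches,
# and the predictor guesses wrong 10% of the time.
shallow = effective_cpi(0.20, 0.10, flush_penalty_cycles=2)
deep = effective_cpi(0.20, 0.10, flush_penalty_cycles=15)

print(f"shallow pipeline: {shallow:.2f} cycles per instruction")
print(f"deep pipeline:    {deep:.2f} cycles per instruction")
```

With the same predictor, the larger flush penalty pushes the average from about 1.04 to about 1.30 cycles per instruction, which is why deep pipelines lean so heavily on accurate prediction.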

✅ The mental model

A pipeline is an instruction assembly line: deeper pipelines = higher clock speeds and throughput, but bigger penalties when a branch is mispredicted (more stages to flush). Processor design is largely the art of keeping the pipeline full.

🎯 Day 7 takeaways

Quick check

  1. Does pipelining reduce the latency of a single instruction?
  2. In the 3-stage pipeline, why does MOV r0, pc give current address + 8?
  3. Why does a deeper pipeline make branch mispredictions more expensive?

FAQ

What is pipelining?

Overlapping the fetch/decode/execute steps of consecutive instructions like an assembly line, so a finished instruction emerges roughly every cycle.

What are the ARM pipeline stages?

Classic 3-stage: Fetch, Decode, Execute. 5-stage: Fetch, Decode, Execute, Memory, Writeback.

Why does the PC read as current address + 8?

In the 3-stage pipeline the PC has already advanced two 4-byte instructions ahead when the current one executes.
