What is instruction pipelining?

Pipelining overlaps the steps of executing instructions so that while one instruction is executing, the next is being decoded and a third is being fetched - like an assembly line. Each instruction still takes several cycles end to end (latency), but a finished instruction can come out every cycle (throughput), greatly increasing performance.

What are the stages of the classic ARM pipeline?

The classic 3-stage ARM pipeline (ARM7) has Fetch, Decode and Execute. The 5-stage pipeline (ARM9 and similar) splits this into Fetch, Decode, Execute, Memory and Writeback, adding a stage to access data memory and a stage to write results back to the register file.

Why does the ARM program counter read as the current instruction plus 8?

In the classic 3-stage ARM pipeline, by the time an instruction executes, the PC has already moved on to fetch the instruction two ahead. Since each ARM instruction is 4 bytes, the PC reads as the current instruction's address plus 8. This is a famous quirk of ARM state and is why PC-relative calculations account for the offset.

What is a pipeline hazard?

A hazard is a situation that prevents the next instruction from executing in the following cycle. Data hazards occur when an instruction needs a result not yet produced; control hazards occur on branches when the pipeline has already fetched the wrong instructions; structural hazards occur when two instructions need the same hardware resource. Pipelines handle these with stalls, forwarding and branch prediction.

DAY 7 · THE INSTRUCTION SET

The Pipeline — How an Instruction Actually Flows

By EcrioniX · Updated Jun 6, 2026

We've met the registers and the data. Now: how does the CPU execute instructions fast? The answer is the pipeline — an assembly line for instructions, and the single most important idea behind processor performance.

1. The assembly-line idea

Executing one instruction takes several steps: fetch it from memory, decode what it means, then execute it. Done one at a time, the fetch and decode hardware sits idle while the execute hardware works, and vice versa. Wasteful.

So pipelining overlaps them: while instruction 1 executes, instruction 2 is being decoded, and instruction 3 is being fetched — three instructions in flight at once, every stage busy every cycle.

💡 Laundry analogy

One load = wash → dry → fold. Don't wait for load 1 to be folded before washing load 2! Start washing load 2 the moment load 1 moves to the dryer. Same total time per load, but you finish a load far more often. That "far more often" is throughput.

2. Latency vs throughput

This distinction is the whole point:

Latency — how long one instruction takes start-to-finish (still several cycles).
Throughput — how often a finished instruction pops out. With a full pipeline: one per cycle, even though each took several.

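Pipelining doesn't make a single instruction faster; it makes the stream of instructions flow at roughly one per clock. For a long instruction stream with no stalls, the speedup over finishing each instruction before starting the next approaches the number of stages (about 3× for a 3-stage pipeline).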
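The latency/throughput arithmetic can be sketched in a few lines of Python. This is a minimal idealized model, not a description of any real core: it assumes every stage takes exactly one cycle and that nothing ever stalls. The function names are illustrative.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles when each instruction runs through every stage before the next starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles on an ideal pipeline (assumption: one cycle per stage, no stalls)."""
    if n_instructions == 0:
        return 0
    # The first instruction takes n_stages cycles to travel the whole pipeline;
    # every later instruction finishes one cycle after the one before it.
    return n_stages + (n_instructions - 1)


def speedup(n_instructions: int, n_stages: int) -> float:
    """Ratio of unpipelined to pipelined cycle counts under the ideal model."""
    return unpipelined_cycles(n_instructions, n_stages) / pipelined_cycles(
        n_instructions, n_stages
    )
```

Under these assumptions, `speedup(1, 3)` is exactly 1 (a single instruction gains nothing), while `speedup(1000, 3)` is 3000/1002, just under 3. Latency stays at 3 cycles in both cases; only the rate of completion changes.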

3. The classic ARM pipelines

Early ARM (ARM7) used a 3-stage pipeline:

Cycle →	1	2	3	4	5
Instr 1	F	D	E
Instr 2		F	D	E
Instr 3			F	D	E

F=Fetch · D=Decode · E=Execute. From cycle 3 on, one instruction completes every cycle.
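A diagram like the one above can be generated mechanically. The sketch below is a toy model that assumes one cycle per stage and no stalls; `pipeline_diagram` is a hypothetical helper name, and `.` marks cycles where an instruction is not in the pipeline.

```python
from typing import List


def pipeline_diagram(n_instructions: int, stages: str = "FDE") -> List[str]:
    """Return one text row per instruction showing its stage in each cycle."""
    total_cycles = len(stages) + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        cells = ["."] * total_cycles
        # Instruction i (0-based) enters the first stage in cycle i
        # and advances exactly one stage per cycle.
        for s, name in enumerate(stages):
            cells[i + s] = name
        rows.append(f"Instr {i + 1}: " + " ".join(cells))
    return rows


if __name__ == "__main__":
    for row in pipeline_diagram(3):
        print(row)
```

Passing a longer stage string (one letter per stage) gives the same staircase shape for a deeper pipeline.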

ARM9 moved to a 5-stage pipeline, splitting out memory and writeback:

Stage	1	2	3	4	5
Names	Fetch	Decode	Execute	Memory	Writeback

Memory handles load/store data access; Writeback writes the result into the register file. Modern high-performance Cortex-A cores go much deeper (10+ stages) and are typically superscalar (several instructions per cycle) and out-of-order, but the principle is identical.

4. The famous "PC + 8" quirk

Here's a classic ARM gotcha that the pipeline explains. In the 3-stage pipeline, when an instruction is executing, the PC has already moved on to fetch the instruction two ahead. Since ARM instructions are 4 bytes, the PC reads as the current instruction's address + 8.

; if this instruction is at address 0x1000 ...
MOV r0, pc    ; r0 becomes 0x1008, not 0x1000 (+8 due to pipeline)

This is why PC-relative address calculations account for the offset. Later cores with deeper pipelines keep the same +8 behavior in ARM state for compatibility. (In Thumb state it's +4; the assembler usually handles it for you, but now you know why.)
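To make the offset arithmetic concrete, here is a small Python sketch of what an assembler has to do when it computes a PC-relative offset. It uses only the +8 (ARM state) and +4 (Thumb state) values from the text; the function names are illustrative, and it deliberately ignores encoding-specific details such as alignment or how the offset is stored in the instruction.

```python
# Offset between an instruction's own address and the value it reads from the PC.
PC_READ_OFFSET = {"arm": 8, "thumb": 4}


def pc_as_read(instr_addr: int, state: str = "arm") -> int:
    """Value an instruction at instr_addr sees when it reads the PC."""
    return instr_addr + PC_READ_OFFSET[state]


def pc_relative_offset(instr_addr: int, target_addr: int, state: str = "arm") -> int:
    """Offset that must be added to the PC (as read) to reach target_addr."""
    return target_addr - pc_as_read(instr_addr, state)
```

For the instruction at 0x1000 above, `pc_as_read(0x1000)` gives 0x1008, so reaching data at 0x1010 needs an offset of 8, not 0x10.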

5. When the pipeline stalls — hazards

The dream of "one instruction per cycle" breaks when an instruction can't proceed. These are hazards:

Data hazard — an instruction needs a result the previous one hasn't produced yet. Fixed by forwarding (routing the result early) or, failing that, a stall (bubble).
Control hazard — a branch. The pipeline already fetched the next sequential instructions, but the branch goes elsewhere — so those wrong instructions must be flushed, costing cycles (the branch penalty).
Structural hazard — two instructions need the same hardware resource in the same cycle.
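A data hazard can be spotted by comparing the registers one instruction writes with the registers the next one reads. The Python sketch below is a simplified illustration: it only checks adjacent instructions, and `stall_per_hazard` is an assumed parameter rather than a figure for any real core (0 models forwarding that fully hides the hazard). The `Instr` type and the three-instruction program are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str                 # assembly text, for display only
    dest: Optional[str]       # register written, if any
    srcs: Tuple[str, ...]     # registers read


def raw_hazards(program: List[Instr]) -> List[int]:
    """Indices of instructions that read the register written by the one just before."""
    hazards = []
    for i in range(1, len(program)):
        prev = program[i - 1]
        if prev.dest is not None and prev.dest in program[i].srcs:
            hazards.append(i)
    return hazards


def total_cycles(program: List[Instr], n_stages: int, stall_per_hazard: int) -> int:
    """Ideal pipelined cycle count plus assumed stall cycles for each hazard found."""
    if not program:
        return 0
    ideal = n_stages + len(program) - 1
    return ideal + stall_per_hazard * len(raw_hazards(program))


PROGRAM = [
    Instr("ADD r1, r2, r3", "r1", ("r2", "r3")),
    Instr("SUB r4, r1, r5", "r4", ("r1", "r5")),  # reads r1 right after it is written
    Instr("MOV r6, r7", "r6", ("r7",)),
]
```

Here `raw_hazards(PROGRAM)` flags only the SUB. With an assumed 2-cycle stall on a 5-stage pipeline the program takes 9 cycles instead of the ideal 7; with forwarding modelled as zero stall cycles it is back to 7.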

To soften control hazards, modern cores use branch prediction — guessing the branch outcome and speculatively fetching ahead. A correct guess = no penalty; a wrong guess = flush and refill.
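The cost of mispredictions can be estimated with a standard back-of-the-envelope formula: average cycles per instruction equals the base rate plus the fraction of instructions that are mispredicted branches times the flush penalty. The Python sketch below implements that formula; all the numbers used with it are illustrative assumptions, not measurements of any ARM core.

```python
def effective_cpi(
    branch_fraction: float,
    mispredict_rate: float,
    flush_penalty_cycles: float,
    base_cpi: float = 1.0,
) -> float:
    """Average cycles per instruction once misprediction flushes are included.

    branch_fraction:      share of instructions that are branches (0..1)
    mispredict_rate:      share of branches guessed wrong (0..1)
    flush_penalty_cycles: cycles lost per wrong guess (grows with pipeline depth)
    base_cpi:             cycles per instruction with no flushes (1.0 = ideal)
    """
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles
```

With 20% branches and a 10% misprediction rate (both assumed), a 2-cycle flush gives a CPI of about 1.04, while a 10-cycle flush gives about 1.2. The same predictor accuracy costs five times more on the pipeline with the longer flush.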

✅ The mental model

A pipeline is an instruction assembly line: deeper pipelines = higher clock speeds and throughput, but bigger penalties when a branch is mispredicted (more stages to flush). Processor design is largely the art of keeping the pipeline full.

🎯 Day 7 takeaways

Pipelining overlaps fetch/decode/execute → ~one instruction finishes per cycle.
Latency (per instruction) stays; throughput rises dramatically.
ARM7 = 3-stage (F/D/E); ARM9 = 5-stage (adds Memory, Writeback).
PC reads as current + 8 in classic ARM state because of the pipeline.
Hazards (data, control, structural) stall the pipeline; forwarding & branch prediction fight back.

Quick check

Does pipelining reduce the latency of a single instruction?
In the 3-stage pipeline, why does MOV r0, pc give current address + 8?
Why does a deeper pipeline make branch mispredictions more expensive?

FAQ

What is pipelining?

Overlapping the fetch/decode/execute steps of consecutive instructions like an assembly line, so a finished instruction emerges roughly every cycle.

What are the ARM pipeline stages?

Classic 3-stage: Fetch, Decode, Execute. 5-stage: Fetch, Decode, Execute, Memory, Writeback.

Why is PC current + 8?

In the 3-stage pipeline the PC has already advanced two 4-byte instructions ahead when the current one executes.
