Topic 04 — RTL Design

Pipelining & Throughput Optimization

Pipelining is the single most powerful micro-architectural technique for achieving high clock frequency in digital design. By splitting a long combinational path into shorter stages separated by registers, you increase the rate at which results can be produced — at the cost of a fixed latency increase. Every high-performance ASIC, CPU, and DSP relies on careful pipeline design.


What Is Pipelining?

Without pipelining, a combinational block must fully complete before the next input can be accepted. The clock period must be longer than the longest path through the logic — the critical path. If the critical path is 20 ns, your maximum clock is 50 MHz, no matter how simple most operations are.

Pipelining inserts flip-flop registers between logic stages. Each stage now only needs to complete its own portion of the work within one clock period. If you split that 20 ns path into 4 stages of 5 ns each, your clock can run at 200 MHz — 4× faster. While one instruction is in stage 4, three others are simultaneously in stages 3, 2, and 1.

IF  — Instruction Fetch
ID  — Decode & Register Read
EX  — Execute / ALU
MEM — Memory Access
WB  — Write Back

Key Insight:

Pipelining does not reduce latency — it increases throughput. A 5-stage pipeline takes 5 cycles to produce the first result, but after that, one result comes out every cycle. This is the core tradeoff every RTL designer must understand.

Latency vs Throughput

These two metrics move in opposite directions when pipelining:

Latency

Time for One Result

Latency = N × T_clk where N is the number of pipeline stages. A 5-stage pipeline at 200 MHz = 5 × 5 ns = 25 ns latency. Adding stages always increases latency. Latency matters for real-time control loops and memory access.

Throughput

Results Per Second

In an ideal pipeline, throughput = 1 result/clock = F_clk. A 200 MHz pipeline = 200 million results/sec. Throughput is the primary metric for DSP, networking, ML inference — applications that process continuous data streams.

Pipeline Overhead

Register Setup & Hold

Each pipeline register adds its own setup time, hold time, and clock-to-Q delay. For very deep pipelines, this overhead eats into the gains. Practical stage depths for modern ASIC nodes are 8–20 logic levels per stage.
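To see the diminishing returns concretely, here is a worked example with hypothetical numbers — 0.25 ns of register overhead (T_cq + T_setup) per stage, and 20 ns of total logic split evenly across N stages:

```latex
% Hypothetical numbers: 20 ns total logic, 0.25 ns register overhead per stage
T_{clk}(N) = \frac{T_{logic,total}}{N} + T_{cq} + T_{setup}
           = \frac{20\,\mathrm{ns}}{N} + 0.25\,\mathrm{ns}
% N = 4:  T_clk = 5.25 ns  ->  190 MHz
% N = 10: T_clk = 2.25 ns  ->  444 MHz
% N = 40: T_clk = 0.75 ns  ->  1.33 GHz, but overhead is now a third of the cycle
```

Frequency keeps rising with N, but each doubling of depth buys less, and latency (N × T_clk) keeps growing.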

Clock Frequency

Critical Path Reduction

T_clk ≥ T_cq + T_logic_max + T_setup. By reducing T_logic_max (the longest combinational delay in any one stage), you reduce T_clk and increase frequency. This is the entire point of pipeline stage splitting.

3-Stage Pipeline in Verilog

Below is a classic 3-stage pipeline: load, multiply, accumulate. Each stage is separated by a register bank clocked on the positive edge. Notice how data and valid signals are pipelined together so downstream stages always know whether their data is meaningful.

Verilog — 3-Stage Multiply-Accumulate Pipeline
// 3-stage pipeline: Stage1=load, Stage2=multiply, Stage3=accumulate
module mac_pipeline #(
  parameter WIDTH = 16
) (
  input  wire                    clk, rst_n,
  input  wire                    valid_in,
  input  wire [WIDTH-1:0]      a_in, b_in,
  output reg                     valid_out,
  output reg  [2*WIDTH-1:0] result
);

  // ── Stage 1 registers ──────────────────────
  reg [WIDTH-1:0] s1_a, s1_b;
  reg             s1_valid;

  // ── Stage 2 registers ──────────────────────
  reg [2*WIDTH-1:0] s2_product;
  reg               s2_valid;

  always @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      s1_valid  <= 1'b0;
      s2_valid  <= 1'b0;
      valid_out <= 1'b0;
      result    <= {2*WIDTH{1'b0}};
    end else begin
      // Stage 1: capture inputs
      s1_a     <= a_in;
      s1_b     <= b_in;
      s1_valid <= valid_in;

      // Stage 2: multiply
      s2_product <= s1_a * s1_b;
      s2_valid   <= s1_valid;

      // Stage 3: accumulate (only when the product is valid)
      if (s2_valid)
        result <= result + s2_product;
      valid_out <= s2_valid;
    end
  end
endmodule
Rule:

Always pipeline the valid signal alongside data. A pipeline register without its associated valid bit is a common source of functional bugs that are hard to catch in simulation.

Pipeline Hazards — The Real Challenge

A hazard occurs when the pipeline cannot produce the correct result for every instruction every cycle. The classic three are data, control, and structural hazards; out-of-order machines add the WAW and WAR varieties:

Hazard Type       | Cause                                                               | Example                                              | Solution
Data Hazard (RAW) | Instruction needs a value not yet written by a previous instruction | ADD r1, r2, r3 → SUB r4, r1, r5 (r1 not ready)       | Stall or forward from EX/MEM stage
Control Hazard    | Branch changes the PC before the fetch stage knows the target       | BEQ → next 2 instructions fetched are wrong          | Predict not-taken, flush on mispredict, or delay slots
Structural Hazard | Two instructions need the same hardware resource in the same cycle  | Load and instruction fetch both need the memory port | Separate instruction/data caches, or stall one instruction
WAW Hazard        | Two writes to the same register; the second may complete first      | Out-of-order execution: both write r1                | Scoreboard or register renaming (OOO processors)
WAR Hazard        | Write to a register before a previous instruction reads it          | Out-of-order: write r2 before an older read of r2    | Register renaming (OOO only; not an issue for in-order RTL)

Stalling — Inserting Bubbles

The simplest way to resolve a data hazard is to stall the pipeline — freeze the stages that need data and insert a NOP (bubble) into the stage that would produce wrong results. The stall signal freezes the enable of pipeline registers upstream and clears the register at the hazard point.

Verilog — Stall Logic for Load-Use Hazard
// Detect load-use hazard: ID stage reads a reg that EX stage is loading
wire load_use_hazard;
assign load_use_hazard =
  (ex_is_load) &&
  ((ex_rd == id_rs1) || (ex_rd == id_rs2));

// When hazard detected: stall IF and ID, insert bubble into EX
always @(posedge clk) begin
  if (!rst_n) begin
    if_id_reg  <= '0;
    id_ex_reg  <= '0;
  end else if (!load_use_hazard) begin
    // Normal operation — advance all stages
    if_id_reg  <= if_id_next;
    id_ex_reg  <= id_ex_next;
  end else begin
    // Stall: freeze IF/ID, inject NOP into EX
    if_id_reg  <= if_id_reg;   // hold (stall)
    id_ex_reg  <= '0;          // bubble (NOP)
    // PC also held (not shown)
  end
end
Cost of Stalling:

Every stall cycle reduces effective throughput. A load-use hazard in a 5-stage RISC pipeline adds 1 stall cycle: the dependent instruction takes 2 cycles instead of 1, so a stream in which every load feeds the next instruction drops from CPI 1.0 to CPI 1.5. This is why compilers reorder instructions to hide load latency.

Data Forwarding (Bypassing)

Forwarding avoids stalls by routing the result from a later pipeline stage back to an earlier one without waiting for the write-back stage. Instead of waiting for EX result to flow through MEM → WB → register file before ID can read it, the EX result is forwarded directly to the EX input MUX of the dependent instruction.

Verilog — EX-to-EX Forwarding MUX
// Select correct operand A for EX stage
// Forward from MEM stage (1 cycle old result) or WB stage (2 cycles old)
always @(*) begin
  // Forward A
  if (mem_reg_write && (mem_rd != 0) && (mem_rd == ex_rs1))
    fwd_a = mem_alu_result;           // MEM→EX forward
  else if (wb_reg_write && (wb_rd != 0) && (wb_rd == ex_rs1))
    fwd_a = wb_write_data;            // WB→EX forward
  else
    fwd_a = ex_rs1_data;              // no hazard, use register file

  // Forward B (same logic)
  if (mem_reg_write && (mem_rd != 0) && (mem_rd == ex_rs2))
    fwd_b = mem_alu_result;
  else if (wb_reg_write && (wb_rd != 0) && (wb_rd == ex_rs2))
    fwd_b = wb_write_data;
  else
    fwd_b = ex_rs2_data;
end

Note: forwarding cannot resolve a load-use hazard because the load result is not available until the end of the MEM stage — one cycle too late for the dependent instruction already in EX. A 1-cycle stall is still required in that case, even with full forwarding logic.

Control Hazards & Branch Handling

When a branch instruction is in the EX stage and the outcome is finally known, the IF and ID stages have already fetched and partially decoded 2 wrong instructions. These must be flushed (converted to NOPs) if the branch is taken.

Verilog — Branch Flush Logic
wire branch_taken;
assign branch_taken = ex_is_branch && ex_branch_condition;

always @(posedge clk) begin
  if (branch_taken) begin
    // Flush the two wrongly-fetched instructions
    if_id_reg <= '0;   // bubble
    id_ex_reg <= '0;   // bubble
    // Redirect PC to branch target
    pc        <= ex_branch_target;
  end
end
Predict Not-Taken

Simplest Strategy

Always assume the branch is not taken and continue fetching sequentially. If wrong, flush 2 cycles. Works well for loops (back-edge branches are usually taken, forward branches usually not).

Delay Slots

Architecture Trick

MIPS uses a branch delay slot — the instruction after a branch is always executed regardless of branch outcome. The compiler fills this slot with a useful instruction, hiding the 1-cycle branch penalty without hardware flushing.

Branch Predictor

Hardware Prediction

Modern CPUs use branch prediction tables (BTB, BHT) to predict branch outcomes speculatively. On a correct prediction, zero penalty. On a misprediction, flush all speculative instructions — penalty depends on pipeline depth.
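As a sketch of the hardware-prediction idea, here is a minimal branch history table of 2-bit saturating counters. The module name, port names, and table size are illustrative, not taken from any specific core; real predictors add tagging, global history, and a BTB for the target address.

Verilog — 2-Bit Saturating-Counter BHT (illustrative sketch)
```verilog
// Minimal branch history table: one 2-bit saturating counter per entry,
// indexed by low PC bits. Predicts taken when the counter's MSB is set.
module bht #(
  parameter IDX = 6                         // 2^6 = 64 entries (illustrative)
) (
  input  wire           clk, rst_n,
  input  wire [IDX-1:0] fetch_idx,          // low PC bits of fetched branch
  output wire           predict_taken,
  input  wire           update_en,          // branch resolved in EX
  input  wire [IDX-1:0] update_idx,
  input  wire           actual_taken
);
  reg [1:0] ctr [0:(1<<IDX)-1];
  integer i;

  assign predict_taken = ctr[fetch_idx][1]; // MSB: weakly/strongly taken

  always @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      for (i = 0; i < (1 << IDX); i = i + 1)
        ctr[i] <= 2'b01;                    // start weakly not-taken
    end else if (update_en) begin
      // Saturate: count up on taken, down on not-taken
      if (actual_taken && ctr[update_idx] != 2'b11)
        ctr[update_idx] <= ctr[update_idx] + 2'b01;
      if (!actual_taken && ctr[update_idx] != 2'b00)
        ctr[update_idx] <= ctr[update_idx] - 2'b01;
    end
  end
endmodule
```

The 2-bit hysteresis is the key design choice: a loop branch mispredicted once (on loop exit) stays in a "taken" state, so the next loop entry is still predicted correctly.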

Retiming

Retiming is a synthesis and physical design optimization that moves flip-flops across combinational logic to equalize stage delays, without changing the circuit's functional behavior. It is transparent to the designer — you write RTL, synthesis tools retime automatically if allowed.

Verilog — Unbalanced Pipeline (before retiming)
// Stage 1: 3 ns logic (fast)
// Stage 2: 9 ns logic (slow — critical path)
// Clock must be ≥ 9 ns → max 111 MHz

always @(posedge clk) begin
  s1_out <= fast_logic(in);       // 3 ns
  s2_out <= slow_logic(s1_out);   // 9 ns ← bottleneck
end

// After retiming: tool moves some logic from stage2 to stage1
// Stage 1: 6 ns, Stage 2: 6 ns → clock = 6 ns → max 167 MHz
// Same result, higher frequency — no RTL change needed!
Enabling Retiming:

In Synopsys DC: set_optimize_registers true. In Vivado: enable the retiming option in synthesis settings. Retiming works best when you give the tool enough freedom — avoid constraining registers with (* dont_touch *) attributes unnecessarily.

Pipelining a Multiplier

Suppose a 32-bit combinational multiplier has a critical path of ~15 ns in your target library, and you need a 500 MHz clock — a period of 2 ns. You must pipeline the multiplier into ≥ 8 stages (⌈15 / 2⌉ = 8). Here is a clean 4-stage parameterized pipeline illustrating the structure; the same pattern extends to more stages:

Verilog — 4-Stage Pipelined Multiplier
module pipe_mult #(
  parameter W = 16   // operand width; product is 2W bits
) (
  input  wire              clk, rst_n, valid_in,
  input  wire [W-1:0]  a, b,
  output reg               valid_out,
  output reg  [2*W-1:0] product
);
  // 4 pipeline stages: register operands, partial products, sum, output
  localparam H = W/2;

  reg [W-1:0]   r_a, r_b;       // S1: registered operands
  reg [2*W-1:0] pp_lo, pp_hi;   // S2: partial products
  reg [2*W-1:0] sum;            // S3: combined product
  reg [2:0]     vld;            // valid shift register

  always @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      vld <= 3'b0; valid_out <= 1'b0;
    end else begin
      // Valid propagates as a shift register, one bit per stage
      vld       <= {vld[1:0], valid_in};
      valid_out <= vld[2];

      // S1: register the operands so both partial products
      //     are formed from the same input sample
      r_a <= a;
      r_b <= b;
      // S2: lower and upper partial products
      pp_lo <= r_a[H-1:0] * r_b;
      pp_hi <= r_a[W-1:H] * r_b;
      // S3: combine with shift
      sum <= pp_lo + (pp_hi << H);
      // S4: register output
      product <= sum;
    end
  end
endmodule

Synthesis-Friendly Pipelining Rules

Register Outputs

Always Register Module Outputs

If a module's output feeds another module, register it. This prevents inter-module combinational paths that explode during timing closure and makes the interface timing predictable regardless of the destination.

Reset Strategy

Synchronous Reset for Data Paths

Use synchronous reset for pipeline data registers (cleaner synthesis, no dedicated reset network needed). Use asynchronous reset only for control/valid bits where you must guarantee power-on state regardless of clock state.

Don't Touch

Avoid (* dont_touch *) on Pipeline Regs

Over-constraining pipeline registers with synthesis attributes prevents retiming. Only use dont_touch on registers that have explicit functional meaning (e.g., synchronizer FFs, scan boundary FFs).

Valid Propagation

Pipeline Valid With Data

Always propagate the valid signal through exactly the same number of pipeline stages as the data. Use a shift register of valid bits. Missing valid alignment is one of the most common pipeline bugs in RTL.

Backpressure

Add Ready/Valid Handshake

Real systems need flow control. Use a ready-valid handshake: valid indicates producer has data, ready indicates consumer can accept. Pipeline registers should only advance when both valid and ready are asserted.

Balance Stages

Keep Stage Depths Equal

The clock period is set by the slowest stage. Split unbalanced stages: if one stage has 20 ns delay and another has 4 ns, split the 20 ns stage into two 10 ns stages and insert an extra register — halving the critical path.
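The backpressure rule above can be sketched as a single pipeline stage with a ready/valid handshake. This is a minimal illustrative module (signal names are not from any standard library); it accepts new data only when its register is empty or the consumer is draining it this cycle.

Verilog — Pipeline Stage With Ready/Valid Handshake (sketch)
```verilog
// One pipeline stage with flow control: the register advances only when
// it is empty (!dn_valid) or the downstream consumer accepts (dn_ready).
module rv_stage #(
  parameter W = 32
) (
  input  wire         clk, rst_n,
  input  wire         up_valid,   // producer has data
  output wire         up_ready,   // this stage can accept it
  input  wire [W-1:0] up_data,
  output reg          dn_valid,   // this stage holds valid data
  input  wire         dn_ready,   // consumer can accept it
  output reg  [W-1:0] dn_data
);
  // Accept when empty or draining — the standard one-deep handshake rule
  assign up_ready = !dn_valid || dn_ready;

  always @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      dn_valid <= 1'b0;           // async reset on the valid bit only
    end else if (up_ready) begin
      dn_valid <= up_valid;
      dn_data  <= up_data;
    end
  end
endmodule
```

Chaining N of these gives a fully elastic pipeline; note that up_ready combinationally depends on dn_ready, so long chains may need a skid buffer to break the ready path.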

Pipeline Design Explorer

[Interactive widget, not reproduced here: a 5-stage pipeline execution diagram (IF/ID/EX/MEM/WB, stall cycles in grey) showing which instruction is in which stage each cycle, with calculated latency to first result, clock period, ideal vs. actual throughput, total cycles for N instructions, and efficiency.]
Read After Write (RAW) — Data Hazard
A RAW hazard occurs when instruction B needs to read a register that instruction A is still in the process of writing. In a 5-stage pipeline, A writes at WB (cycle 5) but B tries to read at ID (cycle 3). B reads the old, stale value — a bug.
Example:
ADD r1, r2, r3    // writes r1 at WB (cycle 5)
SUB r4, r1, r5    // needs r1 at ID (cycle 3) — stale!
Solution: Forward the result from EX/MEM/WB stage directly to the EX stage input MUX. If load-use: insert 1 stall cycle first.

Frequently Asked Questions

Does deeper pipelining always mean higher frequency?

Not necessarily. Each pipeline register adds its own setup time and clock-to-Q overhead. Beyond a certain depth, this overhead dominates and frequency stops increasing. Practical ASIC pipelines use 8–20 logic levels per stage.

What is CPI and how does pipelining affect it?

CPI (Cycles Per Instruction) = 1.0 in an ideal pipeline. Hazards increase CPI above 1.0. A pipeline with 0.2 stall cycles per instruction has CPI = 1.2. Performance = Frequency / CPI — both matter equally.
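A quick worked comparison, with purely illustrative numbers, shows why both factors matter:

```latex
% Performance = f / CPI (illustrative numbers, not from any specific core)
\text{Shallow pipeline: } \frac{1.0\,\mathrm{GHz}}{1.2} \approx 833\ \text{MIPS}
\qquad
\text{Deeper pipeline: } \frac{1.5\,\mathrm{GHz}}{1.5} = 1000\ \text{MIPS}
% The deeper pipeline wins on frequency even though hazards raised its CPI.
```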

How does pipelining interact with timing closure in ASIC?

Each pipeline stage should have roughly equal combinational depth. Unbalanced stages mean the slow stage limits frequency. STA flags this as a failing path. The fix is to re-partition logic across stages (manually or via retiming).

Can I pipeline any combinational function?

Almost any purely combinational function can be pipelined. Functions with feedback (like iterative algorithms) require special handling — you must unroll the loop or use multi-cycle paths with handshaking to allow the pipeline to stall while the iterative function completes.