1. Why FIFO Depth Matters
A FIFO (First-In First-Out) buffer decouples a producer from a consumer that operate at different rates, or in different clock domains. The single most critical parameter at design time is its depth — the number of words the buffer can hold simultaneously.
Undersize the FIFO and you get overflow: the writer tries to push a word into a full buffer and data is silently dropped. Oversize it and you waste silicon area and power for unused memory. Both outcomes are expensive after tape-out, so depth must be calculated analytically before RTL is frozen.
The core insight: a FIFO only needs to hold the words that accumulate during the worst-case period when the writer is faster than the reader. Once the burst ends, the reader drains the excess and the fill level returns to zero. The peak fill level, reached at the end of the burst just before the drain begins, is the required depth.
2. Synchronous FIFO — Same Clock Domain
When both the writer and reader share the same clock but the reader cannot accept data every cycle (due to backpressure, protocol overhead, or stall cycles), the depth depends on the duty-cycle mismatch during a burst.
Scenario: Writer writes every cycle, reader reads every N cycles
If the writer pushes one word per clock for a burst of B words, and the reader can only accept a word every N clocks (read rate = 1/N), the fill level grows by (1 − 1/N) words per clock during the burst. Because both sides share one clock, Fwr and Frd below denote effective word rates rather than clock frequencies, so Frd / Fwr = 1/N. The example corresponds to a reader that accepts 2 words every 5 clocks (N = 2.5) on a 100 MHz clock.
| Parameter | Symbol | Example |
|---|---|---|
| Burst length (words) | B | 32 |
| Effective write rate (Mwords/s) | Fwr | 100 |
| Effective read rate (Mwords/s) | Frd | 40 |
| Words written in burst time | B | 32 |
| Words read in same burst time | B × (Frd / Fwr) | 32 × 0.4 = 12.8 |
| Required depth | ceil(B × (1 − Frd/Fwr)) | ceil(32 × 0.6) = 20 |
// Synchronous FIFO depth formula
// Depth = ceil( B × (1 − Frd / Fwr) )
// where B = burst size, Fwr = write freq, Frd = read freq
// Example: B=32, Fwr=100, Frd=40
depth_required = ceil(32 × (1 − 40/100)) = ceil(32 × 0.6) = ceil(19.2) = 20
// round up to next power-of-2 → 32
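The formula can be cross-checked against a cycle-by-cycle count. The following is a minimal Python sketch, not part of the original design flow: the function names `burst_depth` and `simulated_peak` are hypothetical, and the simulation assumes a push and a pop in the same clock leave the fill level unchanged.

```python
from fractions import Fraction
from math import ceil


def burst_depth(burst_words, write_rate, read_rate):
    """Return ceil(B * (1 - Frd / Fwr)), computed with exact fractions."""
    ratio = Fraction(read_rate) / Fraction(write_rate)
    return max(0, ceil(burst_words * (1 - ratio)))


def simulated_peak(burst_words, read_period):
    """Peak fill when the writer pushes every clock and the reader pops
    one word every `read_period` clocks. A push and a pop in the same
    clock leave the fill level unchanged (modelling assumption)."""
    fill = peak = 0
    for cycle in range(burst_words):
        pop = (cycle + 1) % read_period == 0 and fill > 0
        fill += 1 - int(pop)
        peak = max(peak, fill)
    return peak


print(burst_depth(32, 100, 40))  # 20, the table's example
print(burst_depth(32, 4, 1))     # 24, reader pops every 4th clock
print(simulated_peak(32, 4))     # 24, agreeing with the formula
```

Exact fractions are used instead of floats so that a result such as 19.2 is never rounded the wrong way by floating-point error.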
3. Asynchronous FIFO — Clock Domain Crossing
Async FIFOs cross between two completely unrelated clocks. The same burst-analysis approach applies, but now the write and read time bases are physically different. The calculation asks: "during the time it takes to write B words at Fwr, how many words can be read at Frd?"
- Burst write time: T_burst = B / Fwr (seconds)
- Words read in that time: R = T_burst × Frd = B × (Frd / Fwr)
- Peak accumulation: B − R = B × (1 − Frd / Fwr)
The formula is algebraically identical to the synchronous case, but the physical interpretation differs: Fwr and Frd are now asynchronous frequencies that may be completely unrelated (e.g., 83.33 MHz and 27 MHz on a display interface).
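CDC margin: Add 2–4 extra words to account for the 2-FF synchronizer latency. The Gray code pointer synchronized into the other domain is always slightly stale. The writer sees the read pointer as it was about 2 write-clock cycles earlier, so it sees the FIFO as fuller than it really is and asserts full early, which reduces the usable depth. (Symmetrically, the reader sees a stale write pointer and perceives the FIFO as emptier than it is.) A common rule of thumb is a safety margin of ceil(Fwr / Frd) + 2 words.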
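The three-step calculation and the margin rule of thumb can be combined in a few lines. This is an illustrative Python sketch under stated assumptions: the function name `async_depth_with_margin` is hypothetical, the 64-word burst in the second call is an assumed value, and the code expects Fwr > Frd.

```python
from fractions import Fraction
from math import ceil


def async_depth_with_margin(burst_words, f_wr, f_rd):
    """Burst accumulation plus the ceil(Fwr / Frd) + 2 margin from the text."""
    f_wr, f_rd = Fraction(f_wr), Fraction(f_rd)
    t_burst = burst_words / f_wr        # time to write the whole burst
    words_read = t_burst * f_rd         # R = B * (Frd / Fwr)
    accumulation = ceil(burst_words - words_read)
    margin = ceil(f_wr / f_rd) + 2      # rule-of-thumb CDC margin
    return accumulation + margin


# Frequencies only need to share a unit; MHz and kHz are used here.
print(async_depth_with_margin(32, 100, 40))       # 20 + 5 = 25
print(async_depth_with_margin(64, 83330, 27000))  # 44 + 6 = 50
```

The result is the minimum word count before any power-of-2 rounding.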
4. Worst-Case Burst Analysis
The formula above assumes the writer starts immediately and the reader starts reading from cycle 0. Real protocols often have gaps between bursts and acknowledgment latencies. The worst case for FIFO depth is:
- Writer sends maximum burst B back-to-back (no gaps, full throughput)
- Reader is blocked for maximum stall cycles before it begins reading
- Both extremes occur simultaneously (simultaneous worst-case assumption)
On AMBA interfaces, the writer can burst at full clock rate while the reader stalls because a ready signal is de-asserted (WREADY or RREADY in AXI, HREADY in AHB, PREADY in APB). Always design to the combined worst case, not the average case.
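One way to fold an initial reader stall into the burst formula is to subtract the lost read opportunities from the words drained during the burst. The sketch below is an assumption-laden Python model, not a formula stated in the text: `worst_case_depth` and `stall_read_cycles` are hypothetical names, and the stall is measured in read-clock cycles.

```python
from fractions import Fraction
from math import ceil


def worst_case_depth(burst_words, f_wr, f_rd, stall_read_cycles):
    """Peak accumulation when the reader is blocked for
    `stall_read_cycles` read clocks at the start of a full-rate burst."""
    ratio = Fraction(f_rd) / Fraction(f_wr)
    # Words the reader could drain during the burst if it never stalled.
    ideal_reads = burst_words * ratio
    # Each stalled read cycle is one word that does not leave the FIFO.
    actual_reads = max(Fraction(0), ideal_reads - stall_read_cycles)
    return ceil(burst_words - actual_reads)


print(worst_case_depth(32, 100, 40, 0))   # 20, no stall
print(worst_case_depth(32, 100, 40, 4))   # 24, four stalled read cycles
print(worst_case_depth(32, 100, 40, 20))  # 32, reader misses the whole burst
```

When the stall outlasts the burst, the requirement saturates at the full burst length B.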
Practical rule: Calculate the formula depth, add 2 words for CDC margin, then round up to the next power of 2. Add one more power-of-2 step if the interface protocol has unpredictable stall insertion. Doubling the depth usually costs little area relative to the rest of the block, and it eliminates a class of hard-to-reproduce overflow bugs.
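The practical rule reduces to a short helper. This is a sketch in Python with hypothetical names (`next_power_of_two`, `practical_depth`); the default margin of 2 follows the rule as stated.

```python
def next_power_of_two(n):
    """Smallest power of 2 that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def practical_depth(formula_depth, cdc_margin=2, unpredictable_stalls=False):
    """Formula depth + CDC margin, rounded up to a power of 2, with one
    extra doubling when the protocol can insert unpredictable stalls."""
    depth = next_power_of_two(formula_depth + cdc_margin)
    return depth * 2 if unpredictable_stalls else depth


print(practical_depth(20))                             # 22 -> 32
print(practical_depth(32))                             # 34 -> 64
print(practical_depth(20, unpredictable_stalls=True))  # 32 -> 64
```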
5. Power-of-2 Requirement
FIFO depth must be a power of 2 whenever standard (binary-reflected) Gray code pointers are used for CDC. This is not an optional convention; it is a correctness requirement.
Why Gray code needs power-of-2 depth
A Gray code sequence only changes one bit per count. For a standard binary counter of width N, the Gray sequence cycling through 0 → 2^N − 1 → 0 has exactly one bit change at every transition, including the wrap from (2^N − 1) back to 0.
If the depth is not a power of 2 — say 12 — the pointer counter would need to reset at count 12 instead of 16. That non-power-of-2 modulo operation breaks the single-bit-change property at the wrap boundary, causing multi-bit transitions that invalidate the 2-FF synchronizer.
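The wrap-boundary problem is easy to see by counting bit flips. The Python sketch below (hypothetical helper names) converts a binary counter to Gray code with `n ^ (n >> 1)` and reports the largest number of bits that change on any transition, including the wrap back to 0.

```python
def to_gray(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def max_bits_changed(modulus):
    """Largest number of Gray-code bits that flip on any single count
    step of a counter that wraps from modulus - 1 back to 0."""
    worst = 0
    for count in range(modulus):
        nxt = (count + 1) % modulus
        changed = bin(to_gray(count) ^ to_gray(nxt)).count("1")
        worst = max(worst, changed)
    return worst


print(max_bits_changed(16))  # 1: every step, including the wrap, flips one bit
print(max_bits_changed(12))  # 3: the wrap 11 -> 0 goes from 1110 to 0000
```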
| Calculated Depth | Round to | Pointer Width (N+1) | Address Bits (N) |
|---|---|---|---|
| 1–2 | 2 | 2 | 1 |
| 3–4 | 4 | 3 | 2 |
| 5–8 | 8 | 4 | 3 |
| 9–16 | 16 | 5 | 4 |
| 17–32 | 32 | 6 | 5 |
| 33–64 | 64 | 7 | 6 |
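The table rows follow mechanically from the calculated depth. A small Python sketch, with the hypothetical name `pointer_sizing` and a minimum depth of 2 assumed from the first table row:

```python
def pointer_sizing(calculated_depth):
    """Return (rounded depth, pointer width N+1, address bits N)."""
    rounded = max(2, 1 << (calculated_depth - 1).bit_length())
    addr_bits = rounded.bit_length() - 1   # log2 of a power of 2
    return rounded, addr_bits + 1, addr_bits


print(pointer_sizing(12))  # (16, 5, 4)
print(pointer_sizing(33))  # (64, 7, 6)
```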
6. Full and Empty Flag Generation
Getting the full and empty flags wrong is one of the most common FIFO bugs. The N+1 bit pointer scheme (one extra MSB beyond the address width) eliminates the ambiguity:
- Empty: All N+1 bits of the read pointer equal all N+1 bits of the synchronized write pointer → they are at the same absolute position
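- Full: In binary, the lower N bits (address bits) are equal but the MSBs differ → the write pointer has lapped the read pointer exactly once. In Gray code the same condition appears as the top two bits differing while the remaining N−1 bits are equal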
// Full/empty using N+1 bit Gray code pointers
// wptr_gray: write pointer in Gray code (N+1 bits)
// rptr_gray: read pointer in Gray code (N+1 bits)
// wptr_sync: write ptr synchronized into read domain
// rptr_sync: read ptr synchronized into write domain
assign empty = (rptr_gray == wptr_sync);
// Full: top two Gray bits differ, remaining bits match (one full lap)
assign full = (wptr_gray[N]     != rptr_sync[N])   &
              (wptr_gray[N-1]   != rptr_sync[N-1]) &
              (wptr_gray[N-2:0] == rptr_sync[N-2:0]);
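The Gray-code comparisons can be checked exhaustively against the binary definitions of full and empty. The following Python sketch uses hypothetical names and ignores synchronizer delay (both pointers are treated as current); it only confirms that the bit-level conditions are equivalent.

```python
def to_gray(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def flags_agree(addr_bits):
    """True if the Gray-code full/empty tests match the binary ones for
    every pair of (N+1)-bit pointers, with N = addr_bits."""
    ptr_bits = addr_bits + 1
    span = 1 << ptr_bits
    top_two = 0b11 << (ptr_bits - 2)   # mask for the two MSBs
    for w in range(span):
        for r in range(span):
            binary_empty = w == r
            binary_full = (w - r) % span == (1 << addr_bits)
            gray_empty = to_gray(w) == to_gray(r)
            # Full: top two Gray bits inverted, remaining bits equal.
            gray_full = to_gray(w) == (to_gray(r) ^ top_two)
            if (binary_empty, binary_full) != (gray_empty, gray_full):
                return False
    return True


print(all(flags_agree(n) for n in range(2, 6)))  # True
```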
7. Parameterized RTL Implementation
module async_fifo_depth_calc #(
  parameter DATA_W = 8,
  parameter DEPTH  = 16  // must be a power of 2, and at least 4
) (
  input  logic              wr_clk, wr_rst_n, wr_en,
  input  logic [DATA_W-1:0] wr_data,
  input  logic              rd_clk, rd_rst_n, rd_en,
  output logic [DATA_W-1:0] rd_data,
  output logic              full, empty
);
  localparam AW = $clog2(DEPTH);  // address bits
  localparam PW = AW + 1;         // pointer width (N+1)

  logic [DATA_W-1:0] mem [0:DEPTH-1];
  logic [PW-1:0] wbin, wbin_nxt, wgray, wgray_nxt;
  logic [PW-1:0] rbin, rbin_nxt, rgray, rgray_nxt;
  logic [PW-1:0] wgray_s1, wgray_sync;  // write ptr synced to rd_clk
  logic [PW-1:0] rgray_s1, rgray_sync;  // read ptr synced to wr_clk

  // Write domain. The Gray pointer is registered so the rd_clk
  // synchronizer samples a flop output, not glitch-prone XOR logic.
  assign wbin_nxt  = wbin + PW'(wr_en & ~full);
  assign wgray_nxt = wbin_nxt ^ (wbin_nxt >> 1);

  always_ff @(posedge wr_clk or negedge wr_rst_n)
    if (!wr_rst_n) {wbin, wgray} <= '0;
    else           {wbin, wgray} <= {wbin_nxt, wgray_nxt};

  // Memory write, kept out of the reset block so it infers as RAM
  always_ff @(posedge wr_clk)
    if (wr_en & ~full) mem[wbin[AW-1:0]] <= wr_data;

  // Read domain (Gray pointer registered for the same reason)
  assign rbin_nxt  = rbin + PW'(rd_en & ~empty);
  assign rgray_nxt = rbin_nxt ^ (rbin_nxt >> 1);

  always_ff @(posedge rd_clk or negedge rd_rst_n)
    if (!rd_rst_n) {rbin, rgray} <= '0;
    else           {rbin, rgray} <= {rbin_nxt, rgray_nxt};

  assign rd_data = mem[rbin[AW-1:0]];

  // 2-FF synchronizers
  always_ff @(posedge rd_clk or negedge rd_rst_n)
    if (!rd_rst_n) {wgray_s1, wgray_sync} <= '0;
    else           {wgray_s1, wgray_sync} <= {wgray, wgray_s1};

  always_ff @(posedge wr_clk or negedge wr_rst_n)
    if (!wr_rst_n) {rgray_s1, rgray_sync} <= '0;
    else           {rgray_s1, rgray_sync} <= {rgray, rgray_s1};

  // Full/empty flags (the PW-3:0 slice is why DEPTH must be >= 4)
  assign empty = (rgray == wgray_sync);
  assign full  = (wgray == {~rgray_sync[PW-1:PW-2], rgray_sync[PW-3:0]});
endmodule
8. Worked Examples
Example A: AXI crossbar, 200 → 100 MHz
A 200 MHz AXI master bursts 64 beats into a 100 MHz slave domain. Read rate is half the write rate, so the accumulation is 64 × (1 − 100/200) = 32 words. Add CDC margin: 32 + 2 = 34 → round up to the next power of 2: depth = 64.
Example B: PCIe TLP to DDR, 250 → 200 MHz
Write: 250 MHz, Read: 200 MHz, Burst: 512 beats (max TLP). Accumulation = 512 × (1 − 200/250) = 512 × 0.2 = 102.4 → ceil = 103. Add CDC margin: 103 + 3 = 106 → round up to the next power of 2: 128. Taking one more power-of-2 step for unpredictable stalls gives 256, which is consistent with TLP FIFOs commonly being 256–512 deep.
Example C: UART receiver, 16× oversampling
A UART at 115200 baud with a 16× oversampling clock (1.8432 MHz) writes one byte every ~160 clocks, assuming 10-bit frames (8N1: start bit, 8 data bits, stop bit). If the CPU reads via polling with up to 1 ms latency at 48 MHz: bytes arriving in 1 ms = (115200 / 10) × 0.001 ≈ 11.5, so up to 12 bytes. FIFO depth: 16 (includes margin). Classic embedded UART FIFOs are 16 bytes deep, consistent with this estimate.
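The UART estimate can be reproduced numerically. This Python sketch assumes 10 bits per frame (8N1), which is what the ~160 oversampling clocks per byte imply; the function name `uart_fifo_bytes` is hypothetical.

```python
from fractions import Fraction
from math import ceil


def uart_fifo_bytes(baud, latency_s, bits_per_frame=10):
    """Bytes that can arrive while the CPU is away for `latency_s` seconds."""
    bytes_per_second = Fraction(baud, bits_per_frame)
    return ceil(bytes_per_second * Fraction(latency_s))


needed = uart_fifo_bytes(115200, Fraction(1, 1000))
depth = 1 << (needed - 1).bit_length()  # round up to a power of 2
print(needed, depth)                    # 12 16
```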
FIFO Depth in System-Level SoC Design
FIFO Depth as an Architectural Decision, Not Just a Formula
The depth formula gives a mathematical minimum, but production SoC teams treat FIFO depth as an architectural decision that must account for several factors the formula does not capture. First, the burst size B in the formula is the worst-case burst, which must come from the system specification, not from averaging. If the protocol allows bursts of up to 256 beats (the AXI4 maximum), the FIFO must be sized for 256 beats, even if typical traffic is 8–16 beats. Using the "typical" burst in the formula produces an undersized FIFO that overflows in rare but completely legal traffic patterns, leading to data corruption that appears only under stress conditions and is very hard to reproduce in directed simulation. Second, the formula's (1 − Frd/Fwr) term assumes both clocks run at exactly their nominal frequencies. In real SoCs, clocks may be spread-spectrum modulated (SSC) to reduce EMI; an instantaneous frequency deviation of ±0.5% means the effective frequency ratio varies. A FIFO sized with exactly zero margin at nominal frequencies can overflow when the modulation slows the read clock relative to the write clock. SoC specifications often require the FIFO designer to add 10–20% margin above the formula result before rounding up to the next power of 2.
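To see how much a frequency deviation and a percentage margin move the result, the formula can be evaluated at the unfavourable corner. The Python sketch below is illustrative only: it assumes the write clock is at +deviation while the read clock is at −deviation, applies the margin before the final ceil, and uses hypothetical names and example frequencies.

```python
from fractions import Fraction
from math import ceil


def depth_with_ssc(burst_words, f_wr, f_rd, deviation=0, margin=0):
    """Formula depth at the worst frequency corner, scaled by a margin."""
    fastest_write = Fraction(f_wr) * (1 + deviation)
    slowest_read = Fraction(f_rd) * (1 - deviation)
    accumulation = burst_words * (1 - slowest_read / fastest_write)
    return ceil(accumulation * (1 + margin))


dev = Fraction(1, 200)  # 0.5 %
print(depth_with_ssc(256, 250, 200))                       # 52 at nominal
print(depth_with_ssc(256, 250, 200, dev))                  # 54 at the SSC corner
print(depth_with_ssc(256, 250, 200, dev, Fraction(1, 5)))  # 64 with 20 % margin
```

In this example the nominal result of 52 would still round up to 64, but a design whose nominal result sits exactly on a power of 2 has no such slack.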
FIFO Depth and Latency Trade-offs in High-Performance Subsystems
Every FIFO entry adds potential latency to the data path. In a 10-Gbps network interface or a PCIe Gen5 endpoint, the time-of-flight through a single FIFO at nominal clock frequency is small: a full 16-entry FIFO at 500 MHz holds 32 ns of data. But when multiple FIFOs are cascaded along the data path (from PHY receive to protocol layer to DMA write buffer), the accumulated latency becomes significant. A design with 6 cascaded FIFOs of average depth 8 adds up to 6 × 8 / (clock frequency) of worst-case queuing latency when every FIFO is full: at 1 GHz that is 48 ns, and at 250 MHz it is 192 ns. In latency-sensitive applications like low-latency trading systems, industrial real-time control, or automotive ADAS, these latency contributions must be explicitly tracked in the system latency budget. The FIFO depth is sized not just for overflow prevention but also for the maximum tolerable latency: a FIFO that prevents overflow but adds 2 µs of worst-case queuing delay may fail the system's latency requirement even though it never loses data.
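The latency figures above follow from entries divided by clock frequency. A minimal Python sketch, with the hypothetical name `queue_latency_ns` and the assumption that one entry drains per clock:

```python
from fractions import Fraction


def queue_latency_ns(entries, clock_mhz):
    """Worst-case time in ns to drain `entries` words at one word per clock."""
    return Fraction(entries * 1000, clock_mhz)


print(queue_latency_ns(16, 500))      # 32
print(queue_latency_ns(6 * 8, 1000))  # 48
print(queue_latency_ns(6 * 8, 250))   # 192
```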
Verification Strategy for FIFO Depth Correctness
A FIFO sized for a 16-entry maximum burst will correctly handle a 16-entry burst in directed simulation, but this does not prove the depth is adequate under all legal protocol conditions. The comprehensive verification approach uses three layers. First, directed tests cover the exact worst-case scenario: back-to-back writes at maximum frequency with the reader fully stalled for the full burst duration, verifying that the FULL flag asserts exactly at depth and no data is lost. Second, constrained-random tests generate random burst sizes, inter-burst gaps, and clock ratios (within the specified range), running for millions of cycles to search for overflow conditions. Third, formal verification can prove exhaustively that the FIFO never overflows given a bounded set of assumptions about burst size and clock ratio; the formal tool produces a counter-example if the depth is insufficient for any reachable input sequence. Production ASIC teams typically require at least the first two layers before taping out, and use formal verification for safety-critical FIFOs in automotive and aerospace applications where a single dropped word may cause a system-safety violation.
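The first two layers can be prototyped at the spreadsheet level before any testbench exists. The Python sketch below is a deliberately simplified single-clock abstraction, not an RTL verification environment: the names are hypothetical, the reader pops one word every `read_period` cycles after an initial stall, and each burst is assumed to start with an empty FIFO.

```python
import random


def peak_fill(burst, read_period, stall):
    """Peak occupancy for one burst: the writer pushes every cycle, the
    reader is stalled for `stall` cycles, then pops every `read_period`."""
    fill = peak = 0
    for cycle in range(burst):
        due = cycle >= stall and (cycle - stall + 1) % read_period == 0
        pop = due and fill > 0
        fill += 1 - int(pop)
        peak = max(peak, fill)
    return peak


def random_peak(trials, max_burst, max_period, max_stall, seed=1):
    """Largest peak seen over randomly drawn legal traffic parameters."""
    rng = random.Random(seed)
    worst = 0
    for _ in range(trials):
        burst = rng.randint(1, max_burst)
        period = rng.randint(2, max_period)
        stall = rng.randint(0, max_stall)
        worst = max(worst, peak_fill(burst, period, stall))
    return worst


depth = peak_fill(32, 4, 4)               # directed worst case: 25 words
observed = random_peak(10_000, 32, 4, 4)  # constrained-random search
assert observed <= depth, "random traffic exceeded the directed worst case"
print(depth)
```

A model like this only sanity-checks the sizing arithmetic; it says nothing about pointer synchronization or flag timing, which still need RTL simulation or formal proof.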