1. Why FIFO Depth Matters
A FIFO (First-In First-Out) buffer decouples a producer from a consumer that operate at different rates, or in different clock domains. The single most critical parameter at design time is its depth — the number of words the buffer can hold simultaneously.
Undersize the FIFO and you get overflow: the writer tries to push a word into a full buffer and data is silently dropped. Oversize it and you waste silicon area and power for unused memory. Both outcomes are expensive after tape-out, so depth must be calculated analytically before RTL is frozen.
The core insight: a FIFO only needs to hold the words that accumulate during the worst-case period when the writer is faster than the reader. Once the burst ends, the reader drains the excess and the fill level returns to zero. The peak fill level, reached at the end of the burst just before draining begins, is the required depth.
2. Synchronous FIFO — Same Clock Domain
When both the writer and reader share the same clock but the reader cannot accept data every cycle (due to backpressure, protocol overhead, or stall cycles), the depth depends on the duty-cycle mismatch during a burst.
Scenario: Writer writes every cycle, reader reads every N cycles
If the writer pushes one word per clock for a burst of B words, and the reader can only accept a word every N clocks (read rate = 1/N), the fill level grows by (1 − 1/N) words per clock during the burst.
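To make the growth rate concrete, here is a minimal cycle-by-cycle sketch of that scenario. It assumes an idealized FIFO in which a write and a read can complete in the same clock, and a reader that pops on every N-th clock of the burst. The function name `peak_fill` is illustrative and does not come from any library.

```python
def peak_fill(burst: int, n: int) -> int:
    """Peak occupancy when the writer pushes every clock for `burst` clocks
    and the reader pops one word on every n-th clock (idealized model)."""
    if burst < 0 or n < 1:
        raise ValueError("burst must be >= 0 and n must be >= 1")
    fill = 0
    peak = 0
    for clock in range(1, burst + 1):
        fill += 1                      # writer pushes one word this clock
        if clock % n == 0 and fill > 0:
            fill -= 1                  # reader pops on every n-th clock
        peak = max(peak, fill)         # occupancy at the end of the clock
    return peak


if __name__ == "__main__":
    # Compare the simulation with the closed form B - floor(B / N),
    # which equals ceil(B * (1 - 1/N)) for integer B and N.
    for burst, n in [(32, 4), (10, 3), (32, 2)]:
        print(burst, n, peak_fill(burst, n), burst - burst // n)
```

In this model the simulated peak matches ceil(B × (1 − 1/N)); a real reader with a different phase or a registered output may be off by one word, which is one reason to add margin later.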
| Parameter | Symbol | Example |
|---|---|---|
| Burst length (words) | B | 32 |
| Write rate (one word per clock at 100 MHz) | Fwr | 100 Mwords/s |
| Effective read rate (reader accepts 2 words every 5 clocks) | Frd | 40 Mwords/s |
| Words written in burst time | B | 32 |
| Words read in same burst time | B × (Frd / Fwr) | 32 × 0.4 = 12.8 |
| Required depth | ceil(B × (1 − Frd/Fwr)) | ceil(32 × 0.6) = 20 |
    // Synchronous FIFO depth formula
    // Depth = ceil( B × (1 − Frd / Fwr) )
    // where B = burst size, Fwr = write freq, Frd = read freq
    // Example: B=32, Fwr=100, Frd=40
    depth_required = ceil(32 × (1 − 40/100))
                   = ceil(32 × 0.6)
                   = ceil(19.2)
                   = 20   // round up to next power-of-2 → 32
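The formula is easy to script for design-space exploration. The sketch below is one way to do it, using integer arithmetic so that floating-point rounding cannot push a result across a ceil boundary. `required_depth` and `next_pow2` are illustrative names, and both rates are assumed to be integers in the same unit.

```python
def required_depth(burst: int, f_wr: int, f_rd: int) -> int:
    """ceil(B * (1 - Frd / Fwr)) computed with integers only.

    f_wr and f_rd are integer rates in the same unit (for example kHz).
    Returns 0 when the reader keeps up with the writer.
    """
    if burst < 0 or f_wr <= 0 or f_rd <= 0:
        raise ValueError("burst must be >= 0 and both rates must be > 0")
    if f_rd >= f_wr:
        return 0
    numerator = burst * (f_wr - f_rd)
    return -(-numerator // f_wr)       # integer ceiling division


def next_pow2(n: int) -> int:
    """Smallest power of two that is >= n (returns 1 for n <= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


if __name__ == "__main__":
    depth = required_depth(32, 100, 40)
    print(depth, next_pow2(depth))
```

For non-integer frequencies such as 83.33 MHz, pass both values in a finer unit (kHz or Hz) so they stay integers.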
3. Asynchronous FIFO — Clock Domain Crossing
Async FIFOs cross between two completely unrelated clocks. The same burst-analysis approach applies, but now the write and read time bases are physically different. The calculation asks: "during the time it takes to write B words at Fwr, how many words can be read at Frd?"
- Burst write time: T_burst = B / Fwr (seconds)
- Words read in that time: R = T_burst × Frd = B × (Frd / Fwr)
- Peak accumulation: B − R = B × (1 − Frd / Fwr)
The formula is algebraically identical to the synchronous case, but the physical interpretation differs: Fwr and Frd are now asynchronous frequencies that may be completely unrelated (e.g., 83.33 MHz and 27 MHz on a display interface).
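CDC margin: Add 2–4 extra words to account for the 2-FF synchronizer latency. The Gray code pointer synchronized into the other domain is always slightly stale. The write side sees the read pointer about 2 write-clock cycles late, so it perceives the FIFO as fuller than it really is and asserts full early; the read side likewise sees a stale write pointer and perceives the FIFO as emptier than it is. Both errors are safe (no overflow or underflow), but the early full flag reduces the usable depth. A more conservative rule of thumb is to add ceil(Fwr / Frd) + 2 words as a safety margin.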
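As a rough illustration of the burst analysis together with the margin rule, the sketch below walks through the display-interface frequencies mentioned above (83.33 MHz and 27 MHz). The burst length of 64 words is a made-up value chosen only for the example, the function name is hypothetical, and the margin uses the ceil(Fwr / Frd) + 2 rule of thumb.

```python
from fractions import Fraction
import math


def async_depth_with_margin(burst: int, f_wr_hz: int, f_rd_hz: int) -> dict:
    """Follow the burst analysis step by step with exact rational arithmetic."""
    if burst < 0 or f_wr_hz <= 0 or f_rd_hz <= 0:
        raise ValueError("burst must be >= 0 and both clocks must be > 0")
    t_burst = Fraction(burst, f_wr_hz)                    # seconds to write the burst
    words_read = t_burst * f_rd_hz                        # R = B * (Frd / Fwr)
    accumulation = max(Fraction(0), burst - words_read)   # B - R
    margin = math.ceil(Fraction(f_wr_hz, f_rd_hz)) + 2    # rule-of-thumb CDC margin
    return {
        "t_burst_ns": float(t_burst * 10**9),
        "words_read": float(words_read),
        "accumulation": math.ceil(accumulation),
        "margin": margin,
        "depth": math.ceil(accumulation) + margin,        # before power-of-2 rounding
    }


if __name__ == "__main__":
    # Hypothetical 64-word burst from an 83.33 MHz writer to a 27 MHz reader.
    print(async_depth_with_margin(64, 83_330_000, 27_000_000))
```

With these inputs the accumulation is 44 words and the margin is 6, giving 50 words before rounding up to a power of 2.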
4. Worst-Case Burst Analysis
The formula above assumes the writer starts immediately and the reader starts reading from cycle 0. Real protocols often have gaps between bursts and acknowledgment latencies. The worst case for FIFO depth is:
- Writer sends maximum burst B back-to-back (no gaps, full throughput)
- Reader is blocked for maximum stall cycles before it begins reading
- Both extremes occur simultaneously (simultaneous worst-case assumption)
In AXI, AHB, and APB interfaces, the writer can push data at full clock rate while the reader stalls on a de-asserted ready signal (an AXI channel READY, HREADY, or PREADY respectively). Always design to the combined worst case, not the average case.
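One way to fold an initial reader stall into the burst formula is sketched below. It assumes a simple model that the text does not spell out: the reader is blocked for `stall_cycles` write clocks at the start of the burst, so every word written in that window accumulates, and the rest of the burst then accumulates at the usual B × (1 − Frd / Fwr) rate. The function name and the model are assumptions for illustration.

```python
def worst_case_depth(burst: int, f_wr: int, f_rd: int, stall_cycles: int) -> int:
    """Peak fill for a back-to-back burst when the reader starts late.

    f_wr and f_rd are integer rates in the same unit; stall_cycles is
    measured in write clocks from the first written word.
    """
    if burst < 0 or stall_cycles < 0 or f_wr <= 0 or f_rd <= 0:
        raise ValueError("invalid burst, stall, or rate")
    stalled_words = min(burst, stall_cycles)   # written while the reader is blocked
    remaining = burst - stalled_words          # written while the reader is draining
    if f_rd >= f_wr:
        return stalled_words                   # fill only shrinks after the stall
    numerator = remaining * (f_wr - f_rd)
    return stalled_words + (-(-numerator // f_wr))   # ceiling division


if __name__ == "__main__":
    for stall in (0, 10, 40):
        print(stall, worst_case_depth(32, 100, 40, stall))
```

A stall longer than the burst means the whole burst must fit in the FIFO, which is why the result saturates at B.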
Practical rule: Calculate the formula depth, add 2 words for CDC margin, then round up to the next power of 2. Add one more power-of-2 step if the interface protocol has unpredictable stall insertion. A depth that is 2× oversized costs only a few percent more area on a modern process but eliminates a class of hard-to-reproduce overflow bugs.
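The practical rule can be captured in a few lines. This is a sketch under the rule exactly as stated above; the function and parameter names are illustrative, and the default margin of 2 words is the value from the rule.

```python
def practical_depth(formula_depth: int, cdc_margin: int = 2,
                    unpredictable_stalls: bool = False) -> int:
    """Formula depth + CDC margin, rounded up to a power of 2,
    with one optional extra power-of-2 step."""
    if formula_depth < 0 or cdc_margin < 0:
        raise ValueError("depth and margin must be non-negative")
    padded = formula_depth + cdc_margin
    depth = 1 if padded <= 1 else 1 << (padded - 1).bit_length()
    if unpredictable_stalls:
        depth *= 2                     # one more power-of-2 step
    return depth


if __name__ == "__main__":
    print(practical_depth(20))                              # 22 rounds up to 32
    print(practical_depth(32))                              # 34 rounds up to 64
    print(practical_depth(32, unpredictable_stalls=True))   # 64 doubled to 128
```

Note how a formula depth that is already a power of 2 (32) still jumps to the next step once the margin is added.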
5. Power-of-2 Requirement
FIFO depth must be a power of 2 whenever standard binary-reflected Gray code pointers are used for CDC. This is not an optional convention; it is a correctness requirement for that encoding.
Why Gray code needs power-of-2 depth
A Gray code sequence only changes one bit per count. For a standard binary counter of width N, the Gray sequence cycling through 0 → 2^N − 1 → 0 has exactly one bit change at every transition, including the wrap from (2^N − 1) back to 0.
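If the depth is not a power of 2, say 12, the pointer counter would need to reset at count 12 instead of 16. That non-power-of-2 modulo operation breaks the single-bit-change property at the wrap boundary: a 4-bit counter wrapping from 11 (Gray 1110) to 0 (Gray 0000) flips three bits at once. The synchronizer can sample such a multi-bit transition partway through and deliver a pointer value that was never actually present, which defeats the purpose of the 2-FF synchronizer.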
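The wrap-boundary problem is easy to check in software. The sketch below converts a wrapping binary counter to binary-reflected Gray code and reports the largest number of bits that flip on any single increment; the helper names are illustrative.

```python
def to_gray(value: int) -> int:
    """Binary-reflected Gray code of a non-negative integer."""
    return value ^ (value >> 1)


def max_bits_changed(modulus: int) -> int:
    """Largest number of Gray-code bits that flip between consecutive
    counts of a counter that wraps at `modulus`, including the wrap."""
    if modulus < 2:
        raise ValueError("modulus must be at least 2")
    worst = 0
    for count in range(modulus):
        nxt = (count + 1) % modulus
        flipped = bin(to_gray(count) ^ to_gray(nxt)).count("1")
        worst = max(worst, flipped)
    return worst


if __name__ == "__main__":
    print(max_bits_changed(16))   # 1: every transition flips a single bit
    print(max_bits_changed(12))   # 3: the 11 -> 0 wrap flips three bits
```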
| Calculated Depth | Round to | Pointer Width (N+1) | Address Bits (N) |
|---|---|---|---|
| 1–2 | 2 | 2 | 1 |
| 3–4 | 4 | 3 | 2 |
| 5–8 | 8 | 4 | 3 |
| 9–16 | 16 | 5 | 4 |
| 17–32 | 32 | 6 | 5 |
| 33–64 | 64 | 7 | 6 |
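The rows of the table follow directly from the bit length of the calculated depth. A small helper, with an illustrative name, that reproduces them might look like this; it assumes a minimum rounded depth of 2, as in the first row.

```python
def pointer_sizing(calculated_depth: int) -> tuple[int, int, int]:
    """Return (rounded_depth, address_bits, pointer_width) for a calculated depth."""
    if calculated_depth < 1:
        raise ValueError("calculated_depth must be at least 1")
    address_bits = max(1, (calculated_depth - 1).bit_length())   # N
    rounded_depth = 1 << address_bits                            # next power of 2
    return rounded_depth, address_bits, address_bits + 1         # pointer is N+1 bits


if __name__ == "__main__":
    for depth in (2, 3, 20, 33, 64):
        print(depth, pointer_sizing(depth))
```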
6. Full and Empty Flag Generation
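Incorrect full and empty flags are a classic source of FIFO bugs. With pointers that are only N bits wide, equal read and write pointers could mean either full or empty. The N+1 bit pointer scheme (one extra MSB beyond the address width) eliminates the ambiguity: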
- Empty: All N+1 bits of the read pointer equal all N+1 bits of the synchronized write pointer → they are at the same absolute position
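- Full: In binary, the lower N bits (address bits) are equal but the MSBs differ → the write pointer has lapped the read pointer exactly once. When the comparison is done on Gray-coded pointers, the same condition appears as the top two bits differing and the remaining N−1 bits being equal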
    // Full/empty using N+1 bit Gray code pointers
    // wptr_gray: write pointer in Gray code (N+1 bits)
    // rptr_gray: read pointer in Gray code (N+1 bits)
    // wptr_sync: write ptr synchronized into read domain
    // rptr_sync: read ptr synchronized into write domain
    assign empty = (rptr_gray == wptr_sync);

    // Full: MSBs differ, lower bits match (one full lap)
    assign full = (wptr_gray[N]     != rptr_sync[N])   &
                  (wptr_gray[N-1]   != rptr_sync[N-1]) &
                  (wptr_gray[N-2:0] == rptr_sync[N-2:0]);
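The relationship between the binary and Gray forms of the full condition can be checked exhaustively for small pointers. The sketch below is a software model, not RTL. It assumes N address bits and N+1 bit pointers, ignores synchronizer delay, and uses illustrative helper names.

```python
def to_gray(value: int) -> int:
    """Binary-reflected Gray code of a non-negative integer."""
    return value ^ (value >> 1)


def full_binary(wptr: int, rptr: int, n: int) -> bool:
    """Binary pointers: address bits equal, wrap bit (MSB) differs."""
    addr_mask = (1 << n) - 1
    return (wptr & addr_mask) == (rptr & addr_mask) and (wptr >> n) != (rptr >> n)


def full_gray(wgray: int, rgray: int, n: int) -> bool:
    """Gray pointers: top two bits differ, remaining n-1 bits equal."""
    top_two = 0b11 << (n - 1)          # bits n and n-1 of an (n+1)-bit pointer
    return wgray == (rgray ^ top_two)


def flags_agree(n: int) -> bool:
    """True if both full tests agree for every pair of (n+1)-bit pointers."""
    span = 1 << (n + 1)
    return all(
        full_binary(w, r, n) == full_gray(to_gray(w), to_gray(r), n)
        for w in range(span)
        for r in range(span)
    )


if __name__ == "__main__":
    print(all(flags_agree(n) for n in range(1, 6)))
```

The two tests agree because inverting the MSB of a binary pointer inverts the top two bits of its Gray encoding and leaves the other bits unchanged.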
7. Parameterized RTL Implementation
    module async_fifo_depth_calc #(
      parameter DATA_W = 8,
      parameter DEPTH  = 16  // must be a power of 2, minimum 4
    ) (
      input  logic              wr_clk, wr_rst_n, wr_en,
      input  logic [DATA_W-1:0] wr_data,
      input  logic              rd_clk, rd_rst_n, rd_en,
      output logic [DATA_W-1:0] rd_data,
      output logic              full, empty
    );
      localparam AW = $clog2(DEPTH);  // address bits (N)
      localparam PW = AW + 1;         // pointer width (N+1)

      logic [DATA_W-1:0] mem [0:DEPTH-1];
      logic [PW-1:0] wbin, wgray, wbin_next;
      logic [PW-1:0] rbin, rgray, rbin_next;
      logic [PW-1:0] wgray_s1, wgray_sync;  // write ptr synced to rd_clk
      logic [PW-1:0] rgray_s1, rgray_sync;  // read ptr synced to wr_clk

      // Write domain. The Gray pointer is registered so that only flop
      // outputs (no combinational XOR glitches) reach the synchronizer.
      assign wbin_next = wbin + 1'b1;

      always_ff @(posedge wr_clk or negedge wr_rst_n)
        if (!wr_rst_n) begin
          wbin  <= '0;
          wgray <= '0;
        end else if (wr_en & !full) begin
          wbin  <= wbin_next;
          wgray <= wbin_next ^ (wbin_next >> 1);
        end

      // Memory write in its own block, without reset, so it can map to RAM
      always_ff @(posedge wr_clk)
        if (wr_en & !full) mem[wbin[AW-1:0]] <= wr_data;

      // Read domain (Gray pointer registered for the same reason)
      assign rbin_next = rbin + 1'b1;

      always_ff @(posedge rd_clk or negedge rd_rst_n)
        if (!rd_rst_n) begin
          rbin  <= '0;
          rgray <= '0;
        end else if (rd_en & !empty) begin
          rbin  <= rbin_next;
          rgray <= rbin_next ^ (rbin_next >> 1);
        end

      assign rd_data = mem[rbin[AW-1:0]];  // combinational read

      // 2-FF synchronizers
      always_ff @(posedge rd_clk or negedge rd_rst_n)
        if (!rd_rst_n) {wgray_s1, wgray_sync} <= '0;
        else           {wgray_s1, wgray_sync} <= {wgray, wgray_s1};

      always_ff @(posedge wr_clk or negedge wr_rst_n)
        if (!wr_rst_n) {rgray_s1, rgray_sync} <= '0;
        else           {rgray_s1, rgray_sync} <= {rgray, rgray_s1};

      // Full/empty flags
      assign empty = (rgray == wgray_sync);
      // Full: top two Gray bits inverted, remaining bits equal
      assign full  = (wgray == {~rgray_sync[PW-1:PW-2], rgray_sync[PW-3:0]});
    endmodule
8. Worked Examples
Example A: AXI crossbar, 200 → 100 MHz
A 200 MHz AXI master bursts 64 beats into a 100 MHz slave domain. Read rate is half the write rate, so the accumulation is 64 × (1 − 100/200) = 32 words. Add CDC margin: 32 + 2 = 34, then round up to the next power of 2: depth = 64.
Example B: PCIe TLP to DDR, 250 → 200 MHz
Write: 250 MHz, Read: 200 MHz, Burst: 512 beats (max TLP). Accumulation = 512 × (1 − 200/250) = 512 × 0.2 = 102.4 → ceil = 103. With CDC margin: 103 + 3 = 106 → round up to 128. The DDR side can stall unpredictably (for example during refresh), so the extra power-of-2 step from the practical rule gives 256. TLP FIFOs are commonly 256–512 deep for this reason.
Example C: UART receiver, 16× oversampling
A UART at 115200 baud with 16× oversampling clock (1.8432 MHz) writes one byte every ~160 clocks (10 bits per 8N1 frame × 16). If the CPU reads via polling with up to 1 ms latency at 48 MHz: bytes arriving in 1 ms = 115200 / 10 / 1000 ≈ 11.5, so up to 12 bytes. FIFO depth: 16 (includes margin). Classic embedded UART FIFOs are 16 bytes for this reason.
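The byte budget for a polled receiver can be reproduced in a few lines. The sketch assumes 8N1 framing (10 bits on the wire per byte) and treats the polling latency as the only source of delay; the function name and parameters are illustrative.

```python
def uart_fifo_bytes(baud: int, latency_us: int, bits_per_frame: int = 10) -> int:
    """Bytes that can arrive while the CPU is away for `latency_us` microseconds.

    bits_per_frame defaults to 10, which assumes 8N1 framing
    (start bit + 8 data bits + stop bit).
    """
    if baud <= 0 or latency_us < 0 or bits_per_frame <= 0:
        raise ValueError("invalid baud rate, latency, or frame size")
    numerator = baud * latency_us
    denominator = bits_per_frame * 1_000_000
    return -(-numerator // denominator)    # integer ceiling division


if __name__ == "__main__":
    needed = uart_fifo_bytes(115200, 1000)
    depth = 1 << (needed - 1).bit_length() if needed > 1 else 1
    print(needed, depth)                   # 12 bytes needed, 16 after rounding
```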