In-depth explanations of hardware architectures that chip designers encounter — from FPGAs and GPUs to processor design. These articles answer the "how does it actually work" questions that go beyond what textbooks cover.
RTL design does not happen in a vacuum. The signals you implement at the register level are serving a system-level function — storing state in an FPGA fabric, feeding tensor data to a GPU shader array, or routing memory requests between cores. Understanding what sits above and below your RTL block changes how you make design decisions: which protocol to use, how much latency is acceptable, what the power envelope is, and why a particular interface was chosen over the alternatives.
Nearly every ASIC design goes through FPGA prototyping before tapeout. The SoC RTL runs on a Xilinx UltraScale or Intel Agilex FPGA for software bring-up, driver validation, and protocol testing, months before the first silicon returns from the fab. Understanding FPGA architecture (CLBs, BRAMs, DSPs, PLLs, I/O banks) is essential for writing RTL that maps cleanly and meets FPGA timing. FPGA timing closure is harder in some ways than ASIC: there is no custom cell sizing or drive strength optimization, and logic and routing delays are set by the fixed fabric rather than by a layout you control. Writing synthesizable RTL for FPGA teaches habits that translate directly to better ASIC RTL: avoiding latches, using synchronous resets, minimizing logic depth per pipeline stage.
A modern data-center GPU (NVIDIA H100, AMD Instinct MI300X) contains well over a hundred Streaming Multiprocessors (SMs, called Compute Units on AMD parts), each with 64 to 128 FP32 lanes (CUDA cores in NVIDIA's terminology), shared memory, register files, and tensor or matrix cores for matrix operations. The die area is dominated by compute, unlike a CPU where caches and control logic consume most of the area. RTL engineers designing AI accelerators need to understand GPU-style architectures: warp scheduling, SIMD execution, shared memory bank conflicts, coalesced memory access, and the memory hierarchy from registers through L1/L2 caches to HBM. The same parallelism principles apply to custom NPU and TPU RTL designs, which are typically written in SystemVerilog and verified against behavioral references.
Deep-dive explanations of hardware architectures that complement RTL and VLSI design knowledge.
Everything a chip design intern needs on day one: full VLSI design flow (RTL→Synthesis→P&R→Tapeout), Linux command reference, gvim cheatsheet, SVN vs Git side-by-side, EDA tools (Synopsys/Cadence/Siemens), protocol quick notes (I2C, SPI, UART, AXI, APB, PCIe), Tcl scripting basics, and a searchable 70-entry acronym glossary.
How FPGAs work internally: CLBs, LUTs, flip-flops, BRAMs, DSPs, PLLs. How the bitstream configures the routing fabric. FPGA vs ASIC tradeoffs — NRE cost, performance, power. Major FPGA families (Xilinx/AMD, Intel/Altera, Microchip). When to use FPGA vs ASIC in a product design.
GPU vs CPU architecture: why thousands of simple cores beat a few complex ones for data-parallel workloads. Streaming Multiprocessors, warp scheduling, SIMD execution, shared memory, tensor cores. How GPUs connect to the CPU via PCIe and NVLink. Memory hierarchy from registers to HBM.
What really happens between pressing the power button and seeing your desktop — from the CPU's first instruction out of ROM, through BIOS/UEFI, MBR, bootloader, to the OS kernel loading. Includes a real working x86 boot sector you can run in QEMU.
Light travels about 10 cm. A 3 GHz CPU completes one full clock cycle. Six billion transistors switch state. This deep-dive unpacks what your processor does in the time it takes light to cross your hand: pipeline stages, memory latency, and why silicon can't get faster.
Claude AI's 12-month outlook for NVIDIA, AMD, Intel, Qualcomm, TSMC, Broadcom, and ARM. Bull/bear cases, conviction scores, key catalysts, risk factors, and the AI market trends shaping the semiconductor supercycle. Educational analysis — not financial advice.
NRE cost breakdown, performance gap analysis, power efficiency trade-offs, design flow comparison, and a decision framework for choosing between custom silicon and reconfigurable logic, with real-world examples from Apple, NVIDIA, and Google.
Modern chip design does not happen at a single level of abstraction. An RTL engineer writing SystemVerilog for an AXI4 DMA controller is working within a larger system that includes a host CPU, possibly an FPGA prototype platform, and downstream compute engines that may be GPUs or custom NPU accelerators. Understanding each layer — what it does, how it is architected, and where it sits in the data flow — is essential for making correct design decisions at the RTL level.
An FPGA (Field-Programmable Gate Array) implements digital logic through a matrix of configurable logic blocks (CLBs) connected by a programmable routing fabric. Each CLB contains lookup tables (LUTs) — small memory arrays that implement arbitrary Boolean functions — D flip-flops for sequential logic, carry chains for arithmetic, and multiplexers. The routing fabric consists of wire segments and programmable switch matrices that connect CLBs to each other and to I/O banks.
When you synthesize RTL for an FPGA, the EDA tool (Vivado, Quartus, Libero) maps your logic to LUTs, places the resulting cells on the physical fabric, and routes connections through the switch matrix. The final configuration is stored as a bitstream, from several megabytes up to tens of megabytes for a large device, that is loaded into SRAM configuration cells at power-on. This is why SRAM-based FPGAs are volatile: they must be reconfigured from flash or a host processor every time power is applied. Flash-based families, such as Microchip's, are the exception and retain their configuration without power.
FPGA designs face constraints that differ from ASIC design. LUT delays are fixed and cannot be tuned through cell sizing, so critical path optimization relies on pipelining and logic restructuring. BRAM resources are finite and shared, so a design that over-allocates BRAM may not fit. DSP slices perform a multiply-accumulate in hardened logic, with operand widths that depend on the family (for example 27×18 on Xilinx UltraScale, and 18×19 or 27×27 on recent Intel devices). RTL written to use them explicitly, via instantiation or inference-friendly multiply patterns, typically closes timing at a much higher clock rate and uses far fewer fabric resources than an equivalent LUT-based multiplier.
A GPU achieves its throughput by executing thousands of threads simultaneously using a SIMD-style model that NVIDIA calls SIMT (Single Instruction, Multiple Threads). The fundamental execution unit is the warp (NVIDIA, 32 threads) or wavefront (AMD, 64 threads on the Instinct line): a group of threads that execute the same instruction in lockstep across different data elements. A Streaming Multiprocessor (SM) on an NVIDIA H100 contains 128 FP32 CUDA cores, 4 tensor cores (for matrix multiplications), 256 KB of combined L1 cache and shared memory (up to 228 KB of it configurable as shared memory), and register file storage for thousands of concurrent threads.
Memory bandwidth is the dominant performance limiter for many GPU workloads. Modern AI training GPUs (H100, MI300X) use HBM (High Bandwidth Memory): DRAM dies stacked with through-silicon vias and mounted next to the GPU die inside the same package, delivering 3–5 TB/s of aggregate memory bandwidth, roughly two orders of magnitude beyond what a single DDR5 DIMM provides. For RTL engineers designing AI accelerator SoCs, the key architectural lessons from GPU design are: keep data close to compute using large on-chip SRAM, pipeline deeply to hide memory latency, and structure the dataflow so that data reuse maximizes arithmetic intensity (compute operations per byte of memory traffic).
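The link between arithmetic intensity and achievable throughput can be sketched with the roofline model, which is a standard way to reason about this trade-off rather than something the text above prescribes. The peak compute and bandwidth figures below are hypothetical round numbers, not the specification of any particular GPU.

```python
def attainable_tflops(peak_tflops: float, bandwidth_tbs: float,
                      intensity_flop_per_byte: float) -> float:
    """Roofline bound: throughput is capped by compute or by memory traffic,
    whichever limit is lower. TB/s times FLOP/byte gives TFLOP/s."""
    return min(peak_tflops, bandwidth_tbs * intensity_flop_per_byte)


# Hypothetical accelerator: 1000 TFLOP/s peak compute, 3 TB/s of HBM bandwidth.
PEAK_TFLOPS = 1000.0
BANDWIDTH_TBS = 3.0

for intensity in (1.0, 10.0, 100.0, 1000.0):
    bound = attainable_tflops(PEAK_TFLOPS, BANDWIDTH_TBS, intensity)
    limiter = "memory" if bound < PEAK_TFLOPS else "compute"
    print(f"{intensity:7.1f} FLOP/byte -> {bound:7.1f} TFLOP/s ({limiter}-bound)")
```

Under these assumed numbers a kernel needs more than about 333 operations per byte before it stops being memory-bound, which is why on-chip reuse buffers matter so much.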
Before any ASIC tapes out, its RTL almost always runs on FPGAs. A typical pre-tapeout flow uses 4–16 FPGAs (often Xilinx UltraScale+ or Intel Stratix 10) networked together to emulate the full chip at reduced speed (10–50 MHz versus the target 1–4 GHz on silicon). Software drivers, firmware, and application stacks run against this FPGA prototype, exposing software-visible bugs months before silicon is available.
This FPGA prototyping phase is where many RTL bugs are caught: clock domain crossing (CDC) bugs that only show up on real hardware, FSM states that are never reached in RTL simulation, and interface protocol violations that the simulator abstracted away. RTL engineers who understand FPGA architecture write better RTL: they know which constructs synthesize efficiently, which clock structures require specific primitives, and how to partition a large design across multiple FPGAs using FPGA-to-FPGA serial links.
FPGAs are preferred for low-to-medium volume (< 100K units), frequent field updates, prototyping, and time-sensitive projects that cannot wait 12–18 months for ASIC tapeout. ASICs win on unit cost (for volumes > 1M), power efficiency (commonly 3–5× or better than an FPGA implementing the same logic function), and performance (2–3 GHz clocks vs 300–600 MHz for FPGA). Between those volumes the choice depends on NRE, unit price, and schedule. Some SoCs also embed FPGA (eFPGA) blocks for the portions that must stay programmable after tapeout.
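The volume thresholds above follow from a simple break-even calculation between one-time NRE and per-unit cost. The sketch below is illustrative only: the dollar figures are hypothetical placeholders chosen to make the arithmetic easy to follow, not data from any real project.

```python
import math


def total_cost(nre: float, unit_cost: float, volume: int) -> float:
    """One-time engineering cost plus per-unit cost at a given volume."""
    return nre + unit_cost * volume


def break_even_volume(asic_nre: float, asic_unit: float,
                      fpga_nre: float, fpga_unit: float) -> int:
    """Smallest volume at which the ASIC total cost is no higher than the FPGA's."""
    if asic_unit >= fpga_unit:
        raise ValueError("ASIC never breaks even unless its unit cost is lower")
    return max(0, math.ceil((asic_nre - fpga_nre) / (fpga_unit - asic_unit)))


# Hypothetical numbers: $5M ASIC NRE at $10/unit vs $100K FPGA NRE at $60/unit.
volume = break_even_volume(5_000_000, 10, 100_000, 60)
print(f"break-even at {volume:,} units")
print(f"ASIC total: ${total_cost(5_000_000, 10, volume):,.0f}")
print(f"FPGA total: ${total_cost(100_000, 60, volume):,.0f}")
```

With these assumed inputs the crossover lands just under 100K units; a real decision would also weigh schedule, power, and the risk of a re-spin.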
CPUs minimize latency for a single thread: 4–64 large cores, deep out-of-order execution pipelines, and 30–100 MB of L3 cache that keeps frequently used data close to the cores. GPUs maximize throughput: thousands of smaller cores, in-order execution within each warp, and high-bandwidth memory for streaming large datasets. Neural network inference maps naturally to GPUs because matrix-vector multiplications are data-parallel; OS scheduling and database queries do not, because they have irregular control flow and data dependencies that defeat SIMD execution.
Designing AI accelerator chips requires understanding warp occupancy, shared memory bank conflicts, tensor core data formats (FP16, BF16, INT8, FP8), and how the memory hierarchy is organized. RTL for a custom NPU essentially reimplements these concepts in custom silicon: a systolic array for matrix multiply (like Google's TPU), scratchpad SRAM for reuse buffers, and DMA engines to feed compute from HBM. The GPU serves as the reference architecture from which custom accelerators borrow proven ideas.
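To make the systolic array idea concrete, here is a minimal cycle-level sketch of an output-stationary array, where each processing element PE(i, j) keeps one accumulator and operands arrive on a skewed schedule. This dataflow is an assumption chosen for simplicity; it is not a description of Google's TPU, and the function name is hypothetical.

```python
def systolic_matmul(a, b):
    """Model an output-stationary systolic array computing a @ b.

    With skewed operand injection, PE(i, j) sees a[i][k] and b[k][j]
    together on cycle t = i + j + k and adds their product to its
    local accumulator. Returns (result, cycles_used).
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    acc = [[0] * cols for _ in range(rows)]
    total_cycles = rows + cols + inner - 2
    for t in range(total_cycles):
        for i in range(rows):
            for j in range(cols):
                k = t - i - j  # operand pair reaching PE(i, j) this cycle
                if 0 <= k < inner:
                    acc[i][j] += a[i][k] * b[k][j]
    return acc, total_cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, "in", cycles, "cycles")
```

The cycle count grows with the sum of the dimensions rather than their product, which is the property that makes the structure attractive in silicon.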
ASIC engineers need FPGA knowledge because nearly every production ASIC goes through an FPGA prototyping phase. RTL that doesn't synthesize cleanly on FPGA (due to latches, asynchronous logic, or non-standard clock structures) creates prototyping problems that delay software bring-up. Additionally, writing FPGA-friendly RTL from the start (synchronous resets, single-clock FSMs, inference-compatible multipliers) produces better ASIC RTL as a side effect. Many VLSI engineers spend 30–40% of their time on FPGA-related tasks even when their final target is a fully custom ASIC.
Standard DDR5 DRAM uses a 64-bit wide data interface per DIMM (organized as two 32-bit subchannels) and delivers roughly 40–50 GB/s at common speed grades. HBM2E and HBM3 use a 1024-bit wide bus per stack: multiple DRAM dies are stacked and connected by through-silicon vias (TSVs) to a base die that sits beside the SoC, typically on a silicon interposer, delivering 400–900 GB/s per stack. For SoC designers, HBM integration requires a dedicated PHY that is co-designed with the package, strict die-to-die signal integrity constraints, and careful power delivery for hundreds of amperes at low voltage. LPDDR4/5 occupies a different point: its narrow 16-bit channels run at high per-pin data rates and low power, total bandwidth scales with the number of channels the SoC integrates, and it avoids the complexity of HBM TSV stacking, which makes it the standard choice for mobile and edge AI chips.
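The bandwidth figures above come from multiplying bus width by per-pin data rate. As a rough sketch, using representative speed grades (DDR5-6400, HBM2E at 3.6 Gb/s per pin, HBM3 at 6.4 Gb/s per pin) that are assumptions for illustration rather than a statement about any specific product:

```python
def peak_bandwidth_gbs(bus_width_bits: int, data_rate_gbps_per_pin: float) -> float:
    """Theoretical peak bandwidth in GB/s: data pins times Gb/s per pin, over 8."""
    return bus_width_bits * data_rate_gbps_per_pin / 8


# (interface name, data bus width in bits, per-pin data rate in Gb/s)
interfaces = [
    ("DDR5-6400 DIMM", 64, 6.4),
    ("HBM2E stack", 1024, 3.6),
    ("HBM3 stack", 1024, 6.4),
]

for name, width, rate in interfaces:
    print(f"{name:15s} {peak_bandwidth_gbs(width, rate):7.1f} GB/s")
```

These are theoretical peaks; sustained bandwidth is lower once refresh, bank conflicts, and read/write turnaround are accounted for.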
A k-input LUT is a small 2^k × 1 SRAM array. For a 6-input LUT (the standard in modern FPGAs), the SRAM has 64 cells. Each cell corresponds to one row of the truth table, one for each of the 64 possible combinations of inputs A through F. The synthesis tool fills the 64 SRAM cells with the output values from the truth table of the target function. At runtime, the 6-bit input vector acts as an address into this SRAM, and the pre-programmed output value appears after a short combinational delay; the LUT itself is not clocked. Because any Boolean function of up to 6 variables can be expressed as a 64-entry truth table, a single 6-LUT can implement any 6-input combinational logic function: an AND gate, an XOR chain, a multiplexer, or an arbitrary expression.
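The truth-table view of a LUT can be modeled in a few lines. This is a behavioral sketch only; the class name `Lut` and its methods are hypothetical, and the fill step stands in for what a synthesis tool would write into the bitstream.

```python
from itertools import product


class Lut:
    """Behavioral model of a k-input LUT: a 2^k x 1 table addressed by its inputs."""

    def __init__(self, k, func):
        self.k = k
        # "Synthesis": evaluate the target function for every input combination
        # and store one output bit per truth-table row. The first input is the
        # most significant address bit.
        self.cells = [int(bool(func(*bits))) for bits in product((0, 1), repeat=k)]

    def read(self, *inputs):
        """'Runtime': use the input vector as an address into the table."""
        if len(inputs) != self.k:
            raise ValueError(f"expected {self.k} inputs")
        address = 0
        for bit in inputs:
            address = (address << 1) | (bit & 1)
        return self.cells[address]


# The same hardware structure implements very different functions.
xor6 = Lut(6, lambda a, b, c, d, e, f: a ^ b ^ c ^ d ^ e ^ f)
and6 = Lut(6, lambda a, b, c, d, e, f: a & b & c & d & e & f)
print(len(xor6.cells), xor6.read(1, 0, 1, 0, 0, 0), and6.read(1, 1, 1, 1, 1, 1))
```

Only the stored table differs between the two instances, which mirrors how one physical LUT can implement any 6-input function.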
Modern SoC design is as much about moving data as computing it. An RTL engineer who understands only the compute blocks — CPU cores, NPU systolic arrays, DSPs — without understanding how data flows between them, through caches, across bus fabrics, and to and from off-chip memory, is working with an incomplete mental model. The memory hierarchy and interconnect fabric define the system's latency, throughput, and power budget as surely as the compute blocks do.
A CPU-class SoC typically implements three levels of cache. L1 cache is private per-core, split into instruction and data caches (typically 32–64 KB each), runs at the full core clock, and has access latency of 4–5 cycles. L2 cache is also private per-core (256 KB–2 MB), unified for instructions and data, and adds 10–15 cycles of additional latency. L3 cache (also called LLC, Last Level Cache) is shared across cores, ranging from 8 MB to 64 MB on a multi-core die. Cache coherency protocols such as MESI (Modified, Exclusive, Shared, Invalid) ensure that when one core writes to a cache line, other cores that hold copies of the same line invalidate them, so their next read misses and fetches the new data. For RTL engineers designing cache controllers in directory-based systems, the coherency directory (the structure that tracks which cores hold copies of each cache line) is the most complex component to implement correctly because it must handle interleaved requests from multiple cores atomically.
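The MESI transitions can be sketched for a single cache line shared by several cores. This is a simplified model under stated assumptions: requests are handled one at a time, write-backs and data movement are not modeled, and the class and method names are hypothetical.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


class CacheLine:
    """Tracks the MESI state of one cache line in each core's private cache."""

    def __init__(self, n_cores):
        self.state = [State.I] * n_cores

    def read(self, core):
        if self.state[core] is not State.I:
            return  # read hit: no state change
        sharers = [c for c, s in enumerate(self.state)
                   if c != core and s is not State.I]
        for c in sharers:
            self.state[c] = State.S  # an M or E holder downgrades to Shared
        self.state[core] = State.S if sharers else State.E

    def write(self, core):
        for c in range(len(self.state)):
            if c != core:
                self.state[c] = State.I  # invalidate every other copy
        self.state[core] = State.M


line = CacheLine(n_cores=2)
line.read(0)
print([s.name for s in line.state])
line.read(1)
print([s.name for s in line.state])
line.write(1)
print([s.name for s in line.state])
```

A real controller must also serialize requests that race with each other, which is where most of the implementation difficulty lies.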
In a simple SoC, bus arbitration is handled by an AXI crossbar interconnect: N masters (CPU, DMA, GPU) connect to M slaves (DRAM controller, PCIe, peripheral bus) through a switch matrix that can route any master to any slave simultaneously as long as no two masters access the same slave concurrently. A 4×4 AXI crossbar has 16 internal paths and can sustain 4 simultaneous independent transfers, a typical arrangement for a mid-complexity SoC. As the number of masters and slaves grows beyond 8×8, the crossbar becomes expensive in area and routing congestion, and the industry has shifted to Network-on-Chip (NoC) topologies: mesh, torus, or ring fabrics where packets are routed hop-by-hop through routers. Arm's CoreLink CMN mesh interconnects, built on the AMBA 5 CHI (Coherent Hub Interface) protocol, are an example of a coherent NoC fabric used in server-class SoCs; Arm's CoreLink CCI-500 is a smaller coherent interconnect for mobile-class SoCs. For the RTL engineer, the key implication is that latency from master to slave is not fixed: it depends on hop count, congestion, and arbitration at each router. The AXI protocol's back-pressure mechanism (the VALID/READY handshake) must also be handled correctly to avoid deadlock when multiple agents compete for the same downstream path; in particular, a source must not wait for READY before asserting VALID.
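The crossbar behavior described above, parallel transfers to different slaves and serialization on the same slave, can be sketched as one arbitration cycle. Round-robin arbitration per slave is an assumption made here for illustration; real AXI interconnects arbitrate each channel separately and may use other policies. The function name is hypothetical.

```python
from typing import Dict, List, Optional


def crossbar_cycle(requests: List[Optional[int]],
                   rr_pointer: List[int]) -> Dict[int, int]:
    """Arbitrate one cycle of an N x M crossbar.

    requests[m] is the slave index master m wants this cycle, or None if idle.
    rr_pointer[s] is the master that has top priority at slave s; it is
    updated in place so the most recent winner gets lowest priority next time.
    Returns a mapping of slave -> granted master.
    """
    n_masters = len(requests)
    grants: Dict[int, int] = {}
    for slave in range(len(rr_pointer)):
        for offset in range(n_masters):
            master = (rr_pointer[slave] + offset) % n_masters
            if requests[master] == slave:
                grants[slave] = master
                rr_pointer[slave] = (master + 1) % n_masters
                break
    return grants


pointers = [0, 0, 0, 0]
# Four masters targeting four different slaves: all four transfers proceed.
print(crossbar_cycle([0, 1, 2, 3], pointers))
# Masters 0 and 1 both target slave 0: only one is granted per cycle.
print(crossbar_cycle([0, 0, None, 3], pointers))
print(crossbar_cycle([0, 0, None, 3], pointers))
```

The losing master simply keeps its request asserted and wins on a later cycle, which is the back-pressure the handshake has to tolerate.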
Thermal and power constraints drive the clock architecture of every modern SoC. A smartphone SoC (Apple A-series, Qualcomm Snapdragon) operates within a 5–10W thermal envelope while running multiple high-frequency domains simultaneously: a set of high-performance CPU cores at 3–4 GHz, efficiency cores at 1–2 GHz, an integrated GPU at 600–1200 MHz, an NPU at 1–2 GHz, a modem DSP, an image signal processor, and dozens of always-on peripheral controllers. These domains are architecturally independent: each has its own PLL, voltage regulator, and power state machine. Clock domain crossings between them are handled by asynchronous FIFOs, handshake synchronizers, or gated bridges that assert power-down isolation before the source domain loses power. The power management unit (PMU) orchestrates these transitions: it sequences voltage ramps, asserts and deasserts reset, enables and gates clocks, and controls the retention state of SRAM cells that must preserve content across a power-down event. RTL engineers designing any block that crosses a power domain boundary must understand this sequencing. A block that releases its isolation cells before its voltage is stable, or that loses its clock before its FIFO is drained, will produce corrupted state that is often unreproducible in ordinary RTL simulation, because the simulator models no power behavior unless it is run in a power-aware mode driven by a UPF power intent description.
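One detail behind the asynchronous FIFOs mentioned above is worth illustrating. A common technique, assumed here and not stated in the text, is to pass read and write pointers across the clock boundary in Gray code, so that only one bit changes per increment and a synchronizer can never sample a wildly wrong value. A minimal sketch of the encoding:

```python
def bin_to_gray(n: int) -> int:
    """Convert a binary count to Gray code: adjacent counts differ in one bit."""
    return n ^ (n >> 1)


def gray_to_bin(g: int) -> int:
    """Invert the Gray encoding by XOR-folding all right shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


DEPTH = 16  # hypothetical pointer range; a power of two keeps the wrap one-bit too
for i in range(DEPTH):
    nxt = (i + 1) % DEPTH
    changed = bin(bin_to_gray(i) ^ bin_to_gray(nxt)).count("1")
    print(f"{i:2d} -> {bin_to_gray(i):04b}  bits changed on increment: {changed}")
```

Because every increment, including the wrap from 15 back to 0, flips exactly one bit, a pointer sampled mid-transition resolves to either the old or the new value.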
The physical realization of RTL requires a sequence of tools that transforms a behavioral description into a geometrically exact layout of silicon polygons. Synthesis (Synopsys Design Compiler, Cadence Genus) maps RTL to a technology-specific standard cell library, producing a gate-level netlist. Floorplanning places the major functional blocks on the die area and establishes the power grid. Placement (Cadence Innovus, Synopsys ICC2) assigns each standard cell to a specific location. Clock tree synthesis builds balanced trees that deliver the clock signal to every flip-flop with controlled skew and insertion delay. Routing (also Innovus/ICC2) draws the metal wire paths between all cells. At each step, Static Timing Analysis (STA) verifies that no setup or hold violation exists across all corners (process, voltage, temperature). Design Rule Check (DRC) verifies that all geometries meet the foundry's minimum spacing, width, and enclosure rules. Layout vs Schematic (LVS) verifies that the physical layout matches the netlist it was supposed to implement. Only when all three checks are clean can the GDSII file be submitted for tapeout — after which the masks are written, the wafers are fabricated, and the physical silicon is cut, packaged, and tested. This 18–24 month journey from RTL to silicon is why the RTL engineer's first-pass design quality matters so much: a bug found at synthesis costs hours to fix; the same bug found at silicon costs months of re-spin time and millions of dollars in mask tooling.
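The setup and hold checks that STA performs on each path reduce to two inequalities, evaluated at every corner. The sketch below uses the textbook single-cycle formulation; all delay values are hypothetical, in nanoseconds, and positive skew is taken to mean that the capture clock arrives later than the launch clock.

```python
def setup_slack(period, clk_to_q, logic_max, t_setup, skew=0.0):
    """Data must arrive t_setup before the next capture edge."""
    return (period + skew) - (clk_to_q + logic_max + t_setup)


def hold_slack(clk_to_q, logic_min, t_hold, skew=0.0):
    """Data must stay stable for t_hold after the same capture edge."""
    return (clk_to_q + logic_min) - (t_hold + skew)


def violations(corners, period, skew):
    """Return (corner, check, slack) for every check with negative slack."""
    failed = []
    for name, d in corners.items():
        s = setup_slack(period, d["clk_to_q"], d["logic_max"], d["t_setup"], skew)
        h = hold_slack(d["clk_to_q"], d["logic_min"], d["t_hold"], skew)
        if s < 0:
            failed.append((name, "setup", s))
        if h < 0:
            failed.append((name, "hold", h))
    return failed


PERIOD_NS = 1.0  # hypothetical 1 GHz target
CORNERS = {  # hypothetical delays for one register-to-register path
    "slow": {"clk_to_q": 0.10, "logic_max": 0.78, "logic_min": 0.30,
             "t_setup": 0.06, "t_hold": 0.03},
    "fast": {"clk_to_q": 0.05, "logic_max": 0.40, "logic_min": 0.02,
             "t_setup": 0.03, "t_hold": 0.04},
}

print(violations(CORNERS, PERIOD_NS, skew=0.02))
print(violations(CORNERS, PERIOD_NS, skew=0.05))
```

In this example the path is clean with 20 ps of skew but fails hold at the fast corner with 50 ps, which illustrates why every corner has to be checked and why clock tree skew is controlled so tightly.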