Each day's lesson runs 4,000–5,000 words and includes diagrams, waveforms, equations, real chip examples, and production checklists.
⚡
Day 1
FPGA vs GPU vs CPU for AI Inference
Why FPGAs win on latency and power. Architecture comparison, roofline model applied to inference hardware, real benchmarks: Xilinx Alveo vs A100 GPU.
FPGA Architecture · Roofline Model · Latency · Power Efficiency
Start Day 1 →
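As a small preview of the roofline model named in Day 1: attainable throughput is bounded by the lower of the compute roof and the memory roof (bandwidth times arithmetic intensity). The sketch below is only an illustration; the function names and every number passed to them are hypothetical and are not measurements of any Alveo or A100 device.

```python
def attainable_throughput(peak_ops_per_s, mem_bw_bytes_per_s, ops_per_byte):
    """Roofline bound: the lower of the compute roof and the memory roof.

    All arguments are illustrative assumptions, not vendor figures.
    """
    memory_roof = mem_bw_bytes_per_s * ops_per_byte
    return min(peak_ops_per_s, memory_roof)


def ridge_point(peak_ops_per_s, mem_bw_bytes_per_s):
    """Arithmetic intensity (ops/byte) at which a kernel stops being memory-bound."""
    return peak_ops_per_s / mem_bw_bytes_per_s


if __name__ == "__main__":
    # Hypothetical device: 1 Tops/s peak compute, 10 GB/s memory bandwidth.
    peak, bw = 1e12, 1e10
    for intensity in (1, 10, 100, 1000):
        print(intensity, attainable_throughput(peak, bw, intensity))
    print("ridge point:", ridge_point(peak, bw))
```

A kernel whose arithmetic intensity sits below the ridge point is limited by memory bandwidth; above it, by peak compute.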
🔢
Day 2
Fixed-Point Arithmetic & Quantization
Why FP32 is wasteful on FPGA. INT8/INT4 Q-format, quantization error analysis, scale factors, overflow handling, and a pipelined 8-bit MAC unit in Verilog.
Fixed-Point · INT8 · Quantization · Q-Format
Start Day 2 →
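To give a feel for the scale factors and overflow handling listed for Day 2, here is a minimal sketch of symmetric INT8 quantization with saturation. It is not taken from the course material: the function names, the scale value, and the choice of saturating (rather than wrapping) on overflow are assumptions made for illustration.

```python
def quantize_int8(x, scale):
    """Map a real value to a signed 8-bit integer: q = round(x / scale), saturated.

    Python's round() rounds halves to the nearest even integer; hardware
    designs often use a different rounding mode.
    """
    q = round(x / scale)
    return max(-128, min(127, q))  # saturate instead of wrapping on overflow


def dequantize(q, scale):
    """Recover an approximate real value from the quantized integer."""
    return q * scale


if __name__ == "__main__":
    scale = 0.5  # hypothetical scale factor
    for x in (0.3, 1.0, 100.0, -100.0):
        q = quantize_int8(x, scale)
        print(x, q, dequantize(q, scale))
```

For values inside the representable range, the round-trip error is at most half the scale factor; values outside it clip to -128 or 127.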
✖️
Day 3
Matrix Multiply Accelerator
Tiled GEMM engine: 4×4 MAC array, DSP48 packing (2× throughput), output-stationary dataflow, tiling loop controller, and complete Verilog + testbench.
MAC Array · DSP48 Packing · GEMM · Tiling
Start Day 3 →
🔲
Day 4
Systolic Array Architecture
TPU-style systolic array on FPGA. Weight-stationary dataflow, PE design, skewed activation feeding, timing diagram, and 4×4 Verilog using generate/genvar.
Systolic Array · Weight-Stationary · PE Design · TPU-style
Start Day 4 →
🧠
Day 5
Convolution Engine for CNNs
2D convolution hardware: sliding window logic, line buffers, kernel weight storage, im2col transformation, and a full 3×3 conv engine with BRAM buffering.
Conv2D · Line Buffer · im2col · BRAM
Coming Soon
💾
Day 6
Memory Architecture — BRAM vs DDR
On-chip BRAM vs off-chip DDR4. Memory bandwidth bottleneck analysis, ping-pong (double) buffering, and an AXI4 master interface for DDR access.
BRAM · DDR4 · AXI4 · Ping-Pong
Coming Soon
📊
Day 7
Activation Functions in Hardware
ReLU, Leaky ReLU, Sigmoid, Softmax — synthesizable hardware. LUT-based approximations, CORDIC for complex functions, pipelined activation units.
ReLU · Sigmoid · CORDIC · LUT Approx
Coming Soon
🏊
Day 8
Pooling Layers & Normalization
Max pooling, average pooling, global average pooling hardware. Batch normalization approximation, running mean/variance, and fused BN+ReLU design.
Max Pooling · Batch Norm · Fused Layers · Hardware BN
Coming Soon
🚀
Day 9
Pipelining & Parallelism
Layer-by-layer pipelining, inter-layer FIFOs, throughput vs latency trade-off, data parallelism vs model parallelism, and pipeline stall analysis.
Pipelining · FIFO · Parallelism · Throughput
Coming Soon
Day 10
Write CNN layers in C++ and synthesize them to RTL with Vitis HLS. PIPELINE, UNROLL, and ARRAY_PARTITION pragmas, co-simulation, and latency/area reports.
Vitis HLS · C++ to RTL · Pragmas · Co-sim
Coming Soon
🎯
Day 11
Vitis AI — DPU Deployment
AMD/Xilinx DPU IP core, vai_q_pytorch quantization, Vitis AI compiler, runtime on Zynq/Kria. Deploy ResNet-50 on an FPGA with under 10 ms latency.
Vitis AI · DPU · Quantization · ResNet
Coming Soon
🌿
Day 12
Power Optimization for Edge AI
Dynamic power reduction on FPGA: clock gating, partial reconfiguration, voltage scaling, sleep modes. Targets a power budget under 5 W for IoT and automotive edge AI.
Clock Gating · Partial Reconfig · Edge AI · 5W Budget
Coming Soon
🔄
Day 13
Transformer Attention on FPGA
Self-attention mechanism in hardware: QKV matrix multiply, softmax approximation, multi-head attention parallelism. BERT/ViT inference on Alveo.
Attention · Transformer · BERT · Multi-Head
Coming Soon
📈
Day 14
Benchmarking & Profiling
Measure FPGA inference throughput (TOPS), latency, power. Compare against GPU/CPU baselines. MLPerf benchmark methodology for edge AI systems.
TOPS · MLPerf · Benchmarks · Profiling
Coming Soon
🏭
Day 15
Production Edge AI Systems
Real-world deployment: Xilinx Kria KV260, Intel Arria + OpenVINO, FPGA in autonomous vehicles (Mobileye), medical imaging, and HFT applications.
Kria KV260 · OpenVINO · Autonomous · Production
Coming Soon