
Vitis HLS
C++ to RTL for CNNs

Write neural network layers in readable C++ and let HLS generate the RTL. Master PIPELINE, UNROLL, ARRAY_PARTITION and DATAFLOW — the four pragmas that decide whether your accelerator hits 2 TOPS or 0.2 TOPS.

By EcrioniX Engineering Team · Published June 15, 2026 · ~5,400 words · 17 min read

1. Why HLS Changed FPGA AI Forever

Hand-writing RTL for a full CNN is brutal. A single ResNet-50 accelerator in raw Verilog is tens of thousands of lines, months of work, and painful to modify when the model changes. High-Level Synthesis (HLS) flips this: you describe the algorithm in C++, and the tool generates the RTL — letting you iterate on architecture in hours instead of weeks.

The catch — and the entire skill of HLS — is that naive C++ produces terrible hardware. A plain triple-nested convolution loop synthesizes to a slow, sequential machine. The pragmas are how you tell HLS to build parallel, pipelined hardware. Getting them right is the difference between a 0.2 TOPS toy and a 2 TOPS production engine.

The HLS Flow — C++ to Bitstream
[Figure: C++ + pragmas → Vitis HLS (C synthesis) → RTL (Verilog/VHDL) → Vivado (place and route) → Bitstream. C-sim and co-sim verify at the HLS stage (fast). Iterate architecture at the C++ level: minutes per build vs. weeks of hand-RTL.]

2. Naive Convolution — and Why It's Slow

Here is a textbook convolution in plain C++. It's correct — and it produces dreadful hardware. HLS schedules every loop iteration sequentially, so each multiply waits for the previous one.

conv_naive.cpp
// Naive 3×3 convolution: CORRECT but SLOW hardware
#include <cstdint>

#define IMG  32
#define CIN  16
#define COUT 16

void conv_naive(
    int8_t  in [CIN][IMG][IMG],
    int8_t  w  [COUT][CIN][3][3],
    int32_t out[COUT][IMG][IMG])
{
    // Only interior pixels (1..IMG-2) are written; the border is skipped.
    for (int co = 0; co < COUT; co++)
      for (int y = 1; y < IMG-1; y++)
        for (int x = 1; x < IMG-1; x++) {
            int32_t acc = 0;
            for (int ci = 0; ci < CIN; ci++)
              for (int ky = 0; ky < 3; ky++)
                for (int kx = 0; kx < 3; kx++)
                    acc += in[ci][y+ky-1][x+kx-1] * w[co][ci][ky][kx];
            out[co][y][x] = acc;
        }
}
// Result: ~1 MAC issued every few cycles. One shared multiplier,
// one BRAM port, deeply sequential. ResNet would take seconds/frame.

Correct ≠ Fast

This compiles, simulates, and gives the right answer — but with no pragmas, HLS builds a single-multiplier state machine that processes one MAC at a time. The whole craft of HLS is transforming this same C++ into parallel hardware using directives, without changing the math.
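To put a number on "slow", it helps to count the work in conv_naive. The sketch below is plain C++ rather than HLS code. The helper names and the fixed cycles-per-MAC cost model are assumptions made for illustration, not figures taken from a synthesis report.

```cpp
#include <cassert>
#include <cstdint>

// Number of multiply-accumulates in a 3x3 convolution that skips the
// one-pixel border, matching the loop bounds of conv_naive.
std::uint64_t conv_mac_count(int img, int cin, int cout) {
    assert(img >= 3 && cin > 0 && cout > 0);
    const std::uint64_t interior =
        static_cast<std::uint64_t>(img - 2) * static_cast<std::uint64_t>(img - 2);
    return interior * static_cast<std::uint64_t>(cout)
                    * static_cast<std::uint64_t>(cin) * 9u;
}

// Cycles for a fully sequential datapath, assuming (hypothetically) that
// every MAC costs a fixed number of cycles and nothing overlaps.
std::uint64_t sequential_cycles(std::uint64_t macs, int cycles_per_mac) {
    assert(cycles_per_mac >= 1);
    return macs * static_cast<std::uint64_t>(cycles_per_mac);
}
```

For the 32×32, 16-in, 16-out layer above this gives 30 × 30 × 16 × 16 × 9 = 2,073,600 MACs. If each MAC took 3 cycles, that would be about 6.2 million cycles for one small layer, which is the gap the pragmas are meant to close.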

3. The Four Pragmas That Matter

Ninety percent of HLS performance comes from four pragmas. Understand exactly what each one does to the generated hardware.

Pragma | What it does | Cost | When to use
PIPELINE | Overlaps loop iterations → new input every II cycles | Registers | Inner-ish loop you want streaming at II=1
UNROLL | Replicates the loop body N× to run in parallel | N× compute (DSPs) | Small loops (kernel taps, channels)
ARRAY_PARTITION | Splits an array into banks/registers for parallel access | More BRAM/regs | Feeding an unrolled loop without port stalls
DATAFLOW | Runs functions concurrently via stream FIFOs | FIFO BRAM | Top level: chain conv→pool→act

3.1 PIPELINE — the throughput pragma

Place #pragma HLS PIPELINE on a loop and HLS schedules a new iteration every II cycles. With II=1, after the pipeline fills, you get one result per clock — exactly the streaming behavior from Day 9.

No Pipeline vs PIPELINE II=1
[Figure: with no pragma, three 5-cycle iterations run back to back (3 iters = 15 cycles). With PIPELINE II=1, a new iteration starts every cycle (3 iters = 7 cycles). Same C++ loop; PIPELINE overlaps iterations for a speedup approaching depth/II.]
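The 15-versus-7 cycle comparison follows from a simple formula: a pipelined loop needs one full iteration depth to fill, then one more II per remaining iteration. Here is a minimal sketch in plain C++; the function names are illustrative, and real reports may add a few cycles for loop entry and exit.

```cpp
#include <cassert>

// Cycles to run n iterations back to back, each taking `depth` cycles.
long unpipelined_cycles(long depth, long n) {
    assert(depth >= 1 && n >= 0);
    return depth * n;
}

// Cycles for a pipelined loop: the first iteration takes `depth` cycles,
// and each later iteration starts `ii` cycles after the previous one.
long pipelined_cycles(long depth, long ii, long n) {
    assert(depth >= 1 && ii >= 1 && n >= 0);
    if (n == 0) return 0;
    return depth + ii * (n - 1);
}
```

With depth = 5, II = 1 and 3 iterations this gives 5 + 2 = 7 cycles against 15 unpipelined. For the 900 interior pixels of a 32×32 image the counts are 904 against 4,500, so the speedup approaches depth/II as the trip count grows.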
pipeline example
// Pipeline the spatial loop → one output pixel per cycle
// (fragment: sits inside the outer co loop of conv_naive)
for (int y = 1; y < IMG-1; y++)
  for (int x = 1; x < IMG-1; x++) {
#pragma HLS PIPELINE II=1        // <-- the magic line
      int32_t acc = 0;
      for (int ci = 0; ci < CIN; ci++)
        for (int ky = 0; ky < 3; ky++)
          for (int kx = 0; kx < 3; kx++)
              acc += in[ci][y+ky-1][x+kx-1] * w[co][ci][ky][kx];
      out[co][y][x] = acc;
  }
// HLS auto-unrolls loops *below* a PIPELINE pragma. So the ci/ky/kx
// loops fully unroll → 16×3×3 = 144 MACs must run in one cycle.
// That needs 144 parallel array reads → see ARRAY_PARTITION next.

3.2 UNROLL — the parallelism pragma

#pragma HLS UNROLL replicates a loop body so all iterations run at once. Unrolling the 16 input channels turns 16 sequential MACs into 16 parallel ones — 16× the DSPs, 16× the throughput on that loop.

unroll example
int32_t acc = 0;
for (int ci = 0; ci < CIN; ci++) {
#pragma HLS UNROLL               // 16 channels computed in parallel
    acc += in[ci][y][x] * w[co][ci][ky][kx];
}
// Before: 16 cycles (one MAC at a time)
// After:  1 cycle, 16 DSPs + an adder tree to sum the 16 products
//
// Partial unroll if full is too costly:
//   #pragma HLS UNROLL factor=4   → 4 parallel, 4 iterations
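To see what a partial unroll does to the loop, you can write the transformation by hand. The sketch below is ordinary C++ with no pragmas; CIN_DEMO and both function names are made up for this example. It shows roughly the structure that UNROLL factor=4 asks the tool to build: four independent multiplies per trip feeding a small adder tree.

```cpp
#include <cassert>
#include <cstdint>

constexpr int CIN_DEMO = 16;   // hypothetical channel count for the demo

// Rolled reference: one MAC per loop iteration.
std::int32_t dot_rolled(const std::int8_t* a, const std::int8_t* b) {
    assert(a != nullptr && b != nullptr);
    std::int32_t acc = 0;
    for (int ci = 0; ci < CIN_DEMO; ci++)
        acc += a[ci] * b[ci];
    return acc;
}

// Hand-written equivalent of a factor-4 unroll: the loop runs 4 times and
// each trip holds 4 independent multiplies plus a two-level adder tree.
std::int32_t dot_unrolled4(const std::int8_t* a, const std::int8_t* b) {
    static_assert(CIN_DEMO % 4 == 0, "trip count must be a multiple of the factor");
    assert(a != nullptr && b != nullptr);
    std::int32_t acc = 0;
    for (int ci = 0; ci < CIN_DEMO; ci += 4) {
        const std::int32_t p0 = a[ci + 0] * b[ci + 0];
        const std::int32_t p1 = a[ci + 1] * b[ci + 1];
        const std::int32_t p2 = a[ci + 2] * b[ci + 2];
        const std::int32_t p3 = a[ci + 3] * b[ci + 3];
        acc += (p0 + p1) + (p2 + p3);
    }
    return acc;
}
```

Both functions return the same value for any input, which is the point: unrolling changes the hardware structure, not the math.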

3.3 ARRAY_PARTITION — feeding the parallel hardware

Here's the trap that catches every HLS beginner: you unroll a loop to get 144 parallel MACs, but the input array lives in a BRAM with only 2 ports. HLS can only read 2 values per cycle, so your 144 MACs stall to II=72. ARRAY_PARTITION splits the array into many small memories (or pure registers) so all 144 reads happen at once.

Why Partitioning Is Mandatory After UNROLL
[Figure: a single 2-port BRAM holding weights[144] allows only 2 reads per cycle, so 144 MACs stall at II=72. With ARRAY_PARTITION complete, the array becomes 144 registers, all readable in the same cycle, and the 144 MACs run at II=1. Unroll without partition is the #1 reason HLS designs are slow.]
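The II=72 figure is just a ceiling division of reads by available ports. The helper below is a rough sketch of that bound in plain C++. The function name is illustrative, and it assumes the reads spread evenly across banks, which the scheduler does not guarantee.

```cpp
#include <cassert>

// Lower bound on II imposed by memory ports alone: each cycle the memory
// can serve banks * ports_per_bank reads, so `reads` accesses need at
// least ceil(reads / (banks * ports_per_bank)) cycles.
int min_ii_from_ports(int reads, int banks, int ports_per_bank) {
    assert(reads >= 1 && banks >= 1 && ports_per_bank >= 1);
    const int per_cycle = banks * ports_per_bank;
    return (reads + per_cycle - 1) / per_cycle;
}
```

One dual-port BRAM gives min_ii_from_ports(144, 1, 2) = 72, while 144 single-read registers give 1. Intermediate choices trade memory for speed: 4 dual-port banks would still leave the loop at II=18.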
array_partition example
int8_t win [CIN][3][3];   // sliding window buffer
int8_t wgt [CIN][3][3];   // kernel weights

// Partition fully so all CIN×3×3 elements read in one cycle
#pragma HLS ARRAY_PARTITION variable=win complete dim=0
#pragma HLS ARRAY_PARTITION variable=wgt complete dim=0

// Partition options:
//   complete         → every element its own register (max parallel)
//   cyclic factor=4  → round-robin into 4 banks
//   block  factor=4  → contiguous chunks into 4 banks
// Use 'cyclic' when an unrolled loop strides through the array.
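The difference between cyclic and block comes down to which bank each index lands in. The following sketch models that mapping in plain C++. BankAddr, cyclic_map and block_map are names invented here, and the block mapping assumes equal-sized chunks rounded up.

```cpp
#include <cassert>

struct BankAddr {
    int bank;     // which physical memory the element lives in
    int offset;   // address inside that memory
};

// cyclic: consecutive elements go to consecutive banks (round-robin).
BankAddr cyclic_map(int index, int factor) {
    assert(index >= 0 && factor >= 1);
    return { index % factor, index / factor };
}

// block: the array is cut into `factor` contiguous chunks.
BankAddr block_map(int index, int size, int factor) {
    assert(index >= 0 && index < size && factor >= 1);
    const int chunk = (size + factor - 1) / factor;   // elements per bank
    return { index / chunk, index % chunk };
}
```

For a 16-element array and factor 4, a loop unrolled by 4 that reads indices 4, 5, 6 and 7 hits banks 0, 1, 2 and 3 under cyclic, so all four reads can proceed together. Under block all four indices fall in bank 1 and compete for its two ports.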

3.4 DATAFLOW — the architecture pragma

#pragma HLS DATAFLOW is how you build the layer-pipelined accelerator from Day 9 directly in C++. Each layer is a function; HLS connects them with hls::stream FIFOs and runs them all concurrently.

dataflow top — cnn_accel.cpp
#include <hls_stream.h>
#include <cstdint>

// CIN, IMG and COUT as defined in conv_naive.cpp; the four stage
// functions are assumed to be declared elsewhere.
void cnn_accel(int8_t in[CIN][IMG][IMG], int8_t out_cls[COUT]) {
#pragma HLS DATAFLOW                     // run all stages concurrently
    hls::stream<int8_t> s1, s2, s3;
#pragma HLS STREAM variable=s1 depth=512
#pragma HLS STREAM variable=s2 depth=512
#pragma HLS STREAM variable=s3 depth=128

    conv_layer (in, s1);        // produces into s1
    relu_pool  (s1, s2);        // consumes s1, produces s2 (runs concurrently!)
    conv_layer2(s2, s3);        // consumes s2, produces s3
    fc_softmax (s3, out_cls);   // final classifier
}
// While relu_pool drains s1, conv_layer keeps filling it with new data,
// and can move on to the next frame before the later stages finish.
// This is the Day 9 dataflow architecture, written in pure C++.
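A quick way to reason about what DATAFLOW buys you is to compare frame intervals. The sketch below is an idealized model in plain C++, not a simulation of the tool. The function names and any per-stage cycle counts you feed it are hypothetical, and it assumes the FIFOs are deep enough that no stage stalls on a full or empty stream.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Without DATAFLOW the stages run one after another, so a new frame can
// start only after all of them have finished.
long sequential_interval(const std::vector<long>& stage_cycles) {
    assert(!stage_cycles.empty());
    return std::accumulate(stage_cycles.begin(), stage_cycles.end(), 0L);
}

// With DATAFLOW the stages overlap, and in steady state the slowest
// stage sets the interval between frames.
long dataflow_interval(const std::vector<long>& stage_cycles) {
    assert(!stage_cycles.empty());
    return *std::max_element(stage_cycles.begin(), stage_cycles.end());
}
```

With made-up stage costs of 1024, 256, 1024 and 64 cycles, the sequential interval is 2368 cycles per frame and the dataflow interval is 1024. The model also shows where to spend effort next: only speeding up the slowest stage improves the dataflow number.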

4. A Complete Pipelined Conv Layer in HLS

Putting all four pragmas together — this is a production-style streaming convolution function with channel-parallel MACs at II=1.

conv_layer.cpp (full)
#include <hls_stream.h>
#include <hls_vector.h>
#include <ap_int.h>

#define IMG  32
#define CIN  16
#define COUT 16

typedef ap_int<8>  data_t;
typedef ap_int<32> acc_t;
// One stream word carries every channel of a pixel, so each pipeline
// iteration needs a single read and a single write. A scalar stream
// delivers one element per cycle and would cap the loop at II=CIN.
typedef hls::vector<data_t, CIN>  in_vec_t;
typedef hls::vector<data_t, COUT> out_vec_t;

void conv_layer(
    hls::stream<in_vec_t>  &in_stream,
    data_t                  weights[COUT][CIN][3][3],
    hls::stream<out_vec_t> &out_stream)
{
    // co, ci, ky and kx are all unrolled, so every weight is read each
    // cycle: partition all four dimensions, not just dims 2-4.
#pragma HLS ARRAY_PARTITION variable=weights complete dim=0

    static data_t lb0[CIN][IMG], lb1[CIN][IMG];   // 2 line buffers
#pragma HLS ARRAY_PARTITION variable=lb0 complete dim=1
#pragma HLS ARRAY_PARTITION variable=lb1 complete dim=1

    data_t win[CIN][3][3];
#pragma HLS ARRAY_PARTITION variable=win complete dim=0

    // Note: the first two rows and columns see a partially filled window;
    // a full design would mask or pad those outputs.
    for (int y = 0; y < IMG; y++) {
        for (int x = 0; x < IMG; x++) {
#pragma HLS PIPELINE II=1                         // 1 output pixel / cycle
            in_vec_t px_vec = in_stream.read();

            // shift the 3×3 window and update line buffers (per channel)
            for (int ci = 0; ci < CIN; ci++) {
#pragma HLS UNROLL                                // all channels parallel
                data_t px = px_vec[ci];
                win[ci][0][0]=win[ci][0][1]; win[ci][0][1]=win[ci][0][2]; win[ci][0][2]=lb0[ci][x];
                win[ci][1][0]=win[ci][1][1]; win[ci][1][1]=win[ci][1][2]; win[ci][1][2]=lb1[ci][x];
                win[ci][2][0]=win[ci][2][1]; win[ci][2][1]=win[ci][2][2]; win[ci][2][2]=px;
                lb0[ci][x]=lb1[ci][x];
                lb1[ci][x]=px;
            }

            // COUT output filters
            out_vec_t y_vec;
            for (int co = 0; co < COUT; co++) {
#pragma HLS UNROLL                                // all output channels parallel
                acc_t acc = 0;
                for (int ci = 0; ci < CIN; ci++)
                  for (int ky = 0; ky < 3; ky++)
                    for (int kx = 0; kx < 3; kx++)
                        acc += win[ci][ky][kx] * weights[co][ci][ky][kx];
                // requantize + ReLU inline, saturating to the int8 range
                acc_t q = acc >> 8;
                y_vec[co] = (q < 0) ? data_t(0) : ((q > 127) ? data_t(127) : data_t(q));
            }
            out_stream.write(y_vec);
        }
    }
}
// Parallelism = CIN(16) × COUT(16) × 9 taps = 2304 MACs/cycle at II=1
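The line-buffer shuffle is the part of this function most worth checking in isolation. The sketch below is a single-channel software model in plain C++, without HLS headers. W, Window3x3 and window_matches_image are names made up for this check, and the 8×8 size is arbitrary. It replays the same shift-and-rotate steps and confirms that, once two rows and two columns have arrived, the window holds the 3×3 patch ending at the current pixel.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int W = 8;   // hypothetical image width and height for the demo
using Image = std::array<std::array<std::int8_t, W>, W>;

// Single-channel model of the two line buffers and the 3x3 window.
struct Window3x3 {
    std::int8_t lb0[W] = {};      // holds row y-2
    std::int8_t lb1[W] = {};      // holds row y-1
    std::int8_t win[3][3] = {};

    // Push the pixel at column x (pixels arrive in raster order).
    void push(int x, std::int8_t px) {
        assert(x >= 0 && x < W);
        for (int r = 0; r < 3; r++) {   // shift the window one column left
            win[r][0] = win[r][1];
            win[r][1] = win[r][2];
        }
        win[0][2] = lb0[x];             // new right-hand column
        win[1][2] = lb1[x];
        win[2][2] = px;
        lb0[x] = lb1[x];                // rotate the line buffers
        lb1[x] = px;
    }
};

// True if, after every push at (y, x) with y >= 2 and x >= 2,
// win[ky][kx] equals img[y-2+ky][x-2+kx].
bool window_matches_image(const Image& img) {
    Window3x3 w;
    for (int y = 0; y < W; y++)
        for (int x = 0; x < W; x++) {
            w.push(x, img[y][x]);
            if (y < 2 || x < 2) continue;   // window not yet full
            for (int ky = 0; ky < 3; ky++)
                for (int kx = 0; kx < 3; kx++)
                    if (w.win[ky][kx] != img[y - 2 + ky][x - 2 + kx])
                        return false;
        }
    return true;
}
```

One consequence worth noting: the window is centred on pixel (y-1, x-1), so the streamed output lags the input by one row and one column. A golden model for the streaming layer has to account for that offset.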

5. C/RTL Co-Simulation

HLS's killer feature: verify in C++ first. You write a C testbench, run it in seconds (C simulation), then run C/RTL co-simulation which feeds the same vectors through the generated RTL and checks they match — all without leaving HLS.

testbench.cpp
// tb.cpp: golden-model check
#include <cstdio>
#include <cstdint>

// N and the buffer types are placeholders; size them to your layer.
#define N (16 * 32 * 32)             // COUT × IMG × IMG outputs to compare

static int8_t hls_out[N];            // filled by the HLS function
static int8_t golden_out[N];         // filled by the reference model

int main() {
    // 1. generate random input
    // 2. run conv_layer()   (the HLS function)
    // 3. run conv_golden()  (a plain reference in C++)
    // 4. compare element-by-element
    int errors = 0;
    for (int i = 0; i < N; i++)
        if (hls_out[i] != golden_out[i]) errors++;
    printf("Mismatches: %d\n", errors);
    return errors ? 1 : 0;   // 0 = PASS → run csim, then cosim
}

// Vitis HLS flow:
//   vitis_hls -f run.tcl
//     csim_design    → fast functional check (C only)
//     csynth_design  → generate RTL + reports
//     cosim_design   → RTL simulated against same testbench
//     export_design  → package as IP for Vivado
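The testbench names conv_golden() without showing it. One possible shape for it, as a sketch, is below. It mirrors the math and loop bounds of conv_naive from section 2 using flat std::vector buffers. The signature, the flat row-major layout and the count_mismatches helper are assumptions for illustration. A reference for the streaming conv_layer in section 4 would also need the same shift, ReLU and window alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference 3x3 convolution over flat, row-major buffers.
//   in: [cin][img][img]   w: [cout][cin][3][3]   out: [cout][img][img]
// Border outputs stay 0, mirroring the loop bounds of conv_naive.
std::vector<std::int32_t> conv_golden(const std::vector<std::int8_t>& in,
                                      const std::vector<std::int8_t>& w,
                                      int img, int cin, int cout) {
    assert(in.size() == static_cast<std::size_t>(cin) * img * img);
    assert(w.size() == static_cast<std::size_t>(cout) * cin * 9);
    std::vector<std::int32_t> out(static_cast<std::size_t>(cout) * img * img, 0);
    for (int co = 0; co < cout; co++)
      for (int y = 1; y < img - 1; y++)
        for (int x = 1; x < img - 1; x++) {
            std::int32_t acc = 0;
            for (int ci = 0; ci < cin; ci++)
              for (int ky = 0; ky < 3; ky++)
                for (int kx = 0; kx < 3; kx++)
                    acc += in[(ci * img + (y + ky - 1)) * img + (x + kx - 1)]
                         * w[((co * cin + ci) * 3 + ky) * 3 + kx];
            out[(co * img + y) * img + x] = acc;
        }
    return out;
}

// Number of positions where two result buffers differ.
int count_mismatches(const std::vector<std::int32_t>& a,
                     const std::vector<std::int32_t>& b) {
    assert(a.size() == b.size());
    int errors = 0;
    for (std::size_t i = 0; i < a.size(); i++)
        if (a[i] != b[i]) errors++;
    return errors;
}
```

Keeping the reference this plain is deliberate: it has no pragmas, no streams and no cleverness, so when it disagrees with the HLS function the bug is almost certainly on the HLS side.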

6. Reading the Synthesis Report

After csynth_design, the report tells you if your pragmas worked. The two numbers that matter most: II (did the pipeline hit 1?) and resource usage (did it fit?).

Report Metric | What it means | Target
Latency (cycles) | Total cycles for one call | As low as possible
Interval / II | Cycles between new inputs | 1 for streaming
DSP | Multipliers used | < device budget
BRAM_18K | Block RAMs used | < device budget
FF / LUT | Registers / logic | < device budget
Timing (ns) | Achieved clock period | ≤ target

Reading "II violation" Warnings

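If the report shows II=72 when you asked for II=1, the log states the reason. It is almost always one of two things: a load that cannot be scheduled because of limited memory ports (a BRAM port conflict; add ARRAY_PARTITION), or a carried dependence between iterations (typically an accumulator; use partial sums). The exact wording of the warning varies by tool version. The synthesis log is your debugging roadmap; never ignore it.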
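The "use partial sums" fix deserves a concrete shape. A carried dependence usually appears when each add needs the previous add's result and the adder takes more than one cycle, as floating-point adders typically do. The sketch below is plain C++ showing the restructuring. LANES = 4 is an arbitrary choice for the demo; in a real design you would pick it to cover the adder latency and fully partition the partial-sum array.

```cpp
#include <cassert>

constexpr int LANES = 4;   // hypothetical: chosen to cover the adder latency

// Single accumulator: every add depends on the one just before it.
float sum_single(const float* x, int n) {
    assert(x != nullptr && n >= 0);
    float acc = 0.0f;
    for (int i = 0; i < n; i++) acc += x[i];
    return acc;
}

// Partial sums: iteration i depends only on iteration i - LANES, which
// gives a pipelined adder LANES cycles before its result is reused.
float sum_partial(const float* x, int n) {
    assert(x != nullptr && n >= 0);
    float part[LANES] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int i = 0; i < n; i++) part[i % LANES] += x[i];
    float acc = 0.0f;
    for (int l = 0; l < LANES; l++) acc += part[l];   // final reduction
    return acc;
}
```

Because floating-point addition is not associative, the two versions can differ in the last bits on general inputs, so compare with a tolerance in the testbench. For integer accumulators the results are identical.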

7. HLS vs Hand-Written RTL

Aspect | Vitis HLS | Hand-written RTL
Development speed | Days (C++ + pragmas) | Weeks to months
Verification | C-sim in seconds | RTL sim, slower
Peak performance | ~85-95% of hand-RTL | 100% (full control)
Architecture changes | Edit C++, re-synth | Rewrite RTL
Learning curve | Pragma intuition | Deep RTL expertise
Best for | Most AI accelerators today | Critical IP, last 10%

The Industry Reality

Most modern FPGA AI accelerators — including the building blocks inside Vitis AI's DPU (Day 11) — are developed substantially in HLS. Hand-RTL is reserved for the most performance-critical kernels. Knowing HLS pragmas well is now a core, highly employable VLSI/AI skill.

Day 10 — Key Takeaways

Next — Day 11: Vitis AI & the DPU — deploy a real ResNet-50 on a Xilinx Kria board using the pre-built Deep Learning Processing Unit, model quantization, and the Vitis AI runtime.
