Write neural network layers in readable C++ and let HLS generate the RTL. Master PIPELINE, UNROLL, ARRAY_PARTITION and DATAFLOW — the four pragmas that decide whether your accelerator hits 2 TOPS or 0.2 TOPS.
Hand-writing RTL for a full CNN is brutal. A single ResNet-50 accelerator in raw Verilog is tens of thousands of lines, months of work, and painful to modify when the model changes. High-Level Synthesis (HLS) flips this: you describe the algorithm in C++, and the tool generates the RTL — letting you iterate on architecture in hours instead of weeks.
The catch — and the entire skill of HLS — is that naive C++ produces terrible hardware. A plain triple-nested convolution loop synthesizes to a slow, sequential machine. The pragmas are how you tell HLS to build parallel, pipelined hardware. Getting them right is the difference between a 0.2 TOPS toy and a 2 TOPS production engine.
Here is a textbook convolution in plain C++. It's correct — and it produces dreadful hardware. HLS schedules every loop iteration sequentially, so each multiply waits for the previous one.
// Naive 3×3 convolution: CORRECT but SLOW hardware
#include <cstdint>
#define IMG 32
#define CIN 16
#define COUT 16
void conv_naive(
    int8_t  in [CIN][IMG][IMG],
    int8_t  w  [COUT][CIN][3][3],
    int32_t out[COUT][IMG][IMG])
{
    for (int co = 0; co < COUT; co++)
        for (int y = 1; y < IMG-1; y++)
            for (int x = 1; x < IMG-1; x++) {
                int32_t acc = 0;
                for (int ci = 0; ci < CIN; ci++)
                    for (int ky = 0; ky < 3; ky++)
                        for (int kx = 0; kx < 3; kx++)
                            acc += in[ci][y+ky-1][x+kx-1] * w[co][ci][ky][kx];
                out[co][y][x] = acc;   // border pixels are never written
            }
}
// Result: ~1 MAC issued every few cycles. One shared multiplier,
// one BRAM port, deeply sequential. ResNet would take seconds/frame.
This compiles, simulates, and gives the right answer, but with no pragmas HLS builds a single-multiplier state machine that processes one MAC at a time. The whole craft of HLS is transforming this same C++ into parallel hardware using directives, without changing the math.
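To put a number on "deeply sequential", the work in `conv_naive` can be counted directly from its loop bounds. The sketch below is back-of-envelope arithmetic only; `total_macs` and `cycles_needed` are hypothetical helper names, and one MAC per cycle is assumed as the best case for the naive design (the comment above suggests the real figure is worse).

```cpp
#include <cstdint>

// Layer dimensions copied from the naive example (K is the 3x3 kernel size).
constexpr int IMG = 32, CIN = 16, COUT = 16, K = 3;

// Multiply-accumulates performed by conv_naive; the 1-pixel border is skipped.
constexpr std::int64_t total_macs() {
    return std::int64_t(COUT) * (IMG - 2) * (IMG - 2) * CIN * K * K;
}

// Cycles needed if the hardware completes `macs_per_cycle` MACs every clock.
constexpr std::int64_t cycles_needed(std::int64_t macs_per_cycle) {
    return (total_macs() + macs_per_cycle - 1) / macs_per_cycle;
}
```

Under these assumptions `total_macs()` is 2,073,600, so a one-MAC-per-cycle machine needs at least that many cycles for this single small layer, while `cycles_needed(2304)` drops to 900. That gap is what the pragmas are meant to close.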
Ninety percent of HLS performance comes from four pragmas. Understand exactly what each one does to the generated hardware.
| Pragma | What it does | Cost | When to use |
|---|---|---|---|
| PIPELINE | Overlaps loop iterations → new input every II cycles | Registers | Inner-ish loop you want streaming at II=1 |
| UNROLL | Replicates the loop body N× to run in parallel | N× compute (DSPs) | Small loops (kernel taps, channels) |
| ARRAY_PARTITION | Splits an array into banks/registers for parallel access | More BRAM/regs | Feeding an unrolled loop without port stalls |
| DATAFLOW | Runs functions concurrently via stream FIFOs | FIFO BRAM | Top level: chain conv→pool→act |
Place #pragma HLS PIPELINE on a loop and HLS schedules a new iteration every II cycles. With II=1, after the pipeline fills, you get one result per clock — exactly the streaming behavior from Day 9.
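The payoff of a low II can be estimated with the usual pipelined-loop formula: the first iteration takes the full pipeline depth, and every later iteration starts II cycles after the previous one. The sketch below is an approximation for intuition, not what the tool reports; `pipelined_cycles`, `sequential_cycles`, and the depth of 10 used in the note are illustrative assumptions.

```cpp
#include <cstdint>

// Approximate cycles for a pipelined loop: the first result emerges after
// `depth` cycles, then one new iteration starts every `ii` cycles.
constexpr std::int64_t pipelined_cycles(std::int64_t trip_count, int ii, int depth) {
    return trip_count == 0 ? 0 : depth + std::int64_t(ii) * (trip_count - 1);
}

// The same loop with no overlap: every iteration occupies the full depth.
constexpr std::int64_t sequential_cycles(std::int64_t trip_count, int depth) {
    return trip_count * depth;
}
```

For the 30×30 = 900 interior pixels of one output channel and an assumed depth of 10 cycles, `pipelined_cycles(900, 1, 10)` gives 909 cycles against 9,000 without overlap. The same loop stuck at II=72 would need 64,738.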
// Pipeline the spatial loop → one output pixel per cycle
for (int co = 0; co < COUT; co++)
    for (int y = 1; y < IMG-1; y++)
        for (int x = 1; x < IMG-1; x++) {
            #pragma HLS PIPELINE II=1   // <-- the magic line
            int32_t acc = 0;
            for (int ci = 0; ci < CIN; ci++)
                for (int ky = 0; ky < 3; ky++)
                    for (int kx = 0; kx < 3; kx++)
                        acc += in[ci][y+ky-1][x+kx-1] * w[co][ci][ky][kx];
            out[co][y][x] = acc;
        }
// HLS auto-unrolls loops *below* a PIPELINE pragma. So the ci/ky/kx
// loops fully unroll → 16×3×3 = 144 MACs must run in one cycle.
// That needs 144 parallel array reads → see ARRAY_PARTITION next.
#pragma HLS UNROLL replicates a loop body so all iterations run at once. Unrolling the 16 input channels turns 16 sequential MACs into 16 parallel ones: 16× the DSPs, 16× the throughput on that loop.
int32_t acc = 0;
for (int ci = 0; ci < CIN; ci++) {
    #pragma HLS UNROLL   // 16 channels computed in parallel
    acc += in[ci][y][x] * w[co][ci][ky][kx];
}
// Before: 16 cycles (one MAC at a time)
// After:  1 cycle, 16 DSPs + an adder tree to sum the 16 products
//
// Partial unroll if full is too costly:
//   #pragma HLS UNROLL factor=4   → 4 parallel, 4 iterations
Here's the trap that catches every HLS beginner: you unroll a loop to get 144 parallel MACs, but the input array lives in a BRAM with only 2 ports. HLS can only read 2 values per cycle, so your 144 MACs stall to II=72. ARRAY_PARTITION splits the array into many small memories (or pure registers) so all 144 reads happen at once.
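The II=72 figure is simply reads per iteration divided by available ports, rounded up. The sketch below models that lower bound; `min_ii` is a hypothetical helper, and it assumes 2 ports per memory bank and that partitioning spreads the reads evenly across banks (a real schedule can be worse if several reads land in the same bank).

```cpp
// Lower bound on II imposed by memory ports alone.
// reads_per_iteration: array reads one pipelined iteration must perform
// banks:               number of independent memories after partitioning
// ports_per_bank:      2 for a dual-port BRAM (assumed default)
constexpr int min_ii(int reads_per_iteration, int banks, int ports_per_bank = 2) {
    return (reads_per_iteration + banks * ports_per_bank - 1)
         / (banks * ports_per_bank);
}
```

With one unpartitioned array, `min_ii(144, 1)` is 72. Splitting into 4 banks only improves that to 18, and it takes at least 72 dual-port banks (or full partitioning into registers) before the bound reaches 1.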
int8_t win [CIN][3][3]; // sliding window buffer
int8_t wgt [CIN][3][3]; // kernel weights
// Partition fully so all CIN×3×3 elements read in one cycle
#pragma HLS ARRAY_PARTITION variable=win complete dim=0
#pragma HLS ARRAY_PARTITION variable=wgt complete dim=0
// Partition options:
// complete → every element its own register (max parallel)
// cyclic factor=4 → round-robin into 4 banks
// block factor=4 → contiguous chunks into 4 banks
// Use 'cyclic' when an unrolled loop strides through the array.
#pragma HLS DATAFLOW is how you build the layer-pipelined accelerator from Day 9 directly in C++. Each layer is a function; you connect them with hls::stream FIFOs and HLS runs them all concurrently.
#include <cstdint>
#include <hls_stream.h>
void cnn_accel(int8_t in[CIN][IMG][IMG], int8_t out_cls[COUT]) {
    #pragma HLS DATAFLOW   // run all stages concurrently
    hls::stream<int8_t> s1, s2, s3;
    #pragma HLS STREAM variable=s1 depth=512
    #pragma HLS STREAM variable=s2 depth=512
    #pragma HLS STREAM variable=s3 depth=128
    conv_layer (in, s1);        // produces into s1
    relu_pool  (s1, s2);        // consumes s1, produces s2 (runs concurrently!)
    conv_layer2(s2, s3);        // consumes s2, produces s3
    fc_softmax (s3, out_cls);   // final classifier
}
// While relu_pool drains s1, conv_layer is already filling it with the
// next frame. This is the Day 9 dataflow architecture, from pure C++.
Putting the pragmas together: this is a production-style streaming convolution function, ready to drop into a DATAFLOW region, with channel-parallel MACs at II=1.
#include <hls_stream.h>
#include <ap_int.h>
#define IMG 32
#define CIN 16
#define COUT 16
typedef ap_int<8> data_t;
typedef ap_int<32> acc_t;
// A stream delivers one element per cycle, so each element carries every
// channel of a pixel. With a scalar stream, 16 reads per pixel would cap II at 16.
struct pix_in_t  { data_t ch[CIN];  };
struct pix_out_t { data_t ch[COUT]; };
void conv_layer(
    hls::stream<pix_in_t>  &in_stream,
    data_t weights[COUT][CIN][3][3],
    hls::stream<pix_out_t> &out_stream)
{
    // The co loop below is unrolled too, so COUT must be partitioned as well:
    // dim=0 splits every dimension, making all 2304 weights readable per cycle.
    #pragma HLS ARRAY_PARTITION variable=weights complete dim=0
    static data_t lb0[CIN][IMG], lb1[CIN][IMG];   // 2 line buffers
    #pragma HLS ARRAY_PARTITION variable=lb0 complete dim=1
    #pragma HLS ARRAY_PARTITION variable=lb1 complete dim=1
    data_t win[CIN][3][3];
    #pragma HLS ARRAY_PARTITION variable=win complete dim=0
    for (int y = 0; y < IMG; y++) {
        for (int x = 0; x < IMG; x++) {
            #pragma HLS PIPELINE II=1   // 1 output pixel / cycle
            pix_in_t px = in_stream.read();   // one read brings all CIN channels
            // shift the 3×3 window and update line buffers (per channel)
            for (int ci = 0; ci < CIN; ci++) {
                #pragma HLS UNROLL   // all channels parallel
                win[ci][0][0]=win[ci][0][1]; win[ci][0][1]=win[ci][0][2]; win[ci][0][2]=lb0[ci][x];
                win[ci][1][0]=win[ci][1][1]; win[ci][1][1]=win[ci][1][2]; win[ci][1][2]=lb1[ci][x];
                win[ci][2][0]=win[ci][2][1]; win[ci][2][1]=win[ci][2][2]; win[ci][2][2]=px.ch[ci];
                lb0[ci][x]=lb1[ci][x]; lb1[ci][x]=px.ch[ci];
            }
            // COUT output filters
            pix_out_t res;
            for (int co = 0; co < COUT; co++) {
                #pragma HLS UNROLL   // all output channels parallel
                acc_t acc = 0;
                for (int ci = 0; ci < CIN; ci++)
                    for (int ky = 0; ky < 3; ky++)
                        for (int kx = 0; kx < 3; kx++)
                            acc += win[ci][ky][kx] * weights[co][ci][ky][kx];
                // requantize + ReLU inline (the cast to data_t wraps; no saturation)
                res.ch[co] = (acc < 0) ? (data_t)0 : (data_t)(acc >> 8);
            }
            out_stream.write(res);   // one write carries all COUT channels
        }
    }
}
// Parallelism = CIN(16) × COUT(16) × 9 taps = 2304 MACs/cycle at II=1
HLS's killer feature: verify in C++ first. You write a C testbench, run it in seconds (C simulation), then run C/RTL co-simulation, which feeds the same vectors through the generated RTL and checks they match, all without leaving HLS.
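The reference side of such a testbench is just the naive loop nest again, since it never needs to be fast. Below is a minimal sketch of what a golden model could look like. The signature, the flat row-major layouts, and the runtime dimensions are assumptions made so that tiny cases can be checked by hand; it mirrors the ReLU and `>> 8` requantization of `conv_layer`, including the wrapping narrow to 8 bits, and leaves border pixels at zero. It does not model the pixel ordering or window latency of the streaming version.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference model. Flat row-major layouts:
//   in  [cin][h][w]    wgt [cout][cin][3][3]    out [cout][h][w]
std::vector<std::int8_t> conv_golden(const std::vector<std::int8_t>& in,
                                     const std::vector<std::int8_t>& wgt,
                                     int cin, int cout, int h, int w)
{
    assert(in.size()  == static_cast<std::size_t>(cin) * h * w);
    assert(wgt.size() == static_cast<std::size_t>(cout) * cin * 9);

    std::vector<std::int8_t> out(static_cast<std::size_t>(cout) * h * w, 0);
    for (int co = 0; co < cout; co++)
        for (int y = 1; y < h - 1; y++)
            for (int x = 1; x < w - 1; x++) {
                std::int32_t acc = 0;
                for (int ci = 0; ci < cin; ci++)
                    for (int ky = 0; ky < 3; ky++)
                        for (int kx = 0; kx < 3; kx++)
                            acc += in[(ci * h + (y + ky - 1)) * w + (x + kx - 1)]
                                 * wgt[((co * cin + ci) * 3 + ky) * 3 + kx];
                // ReLU, then requantize by >> 8; the narrowing cast wraps
                std::int32_t q = (acc < 0) ? 0 : (acc >> 8);
                out[(co * h + y) * w + x] = static_cast<std::int8_t>(q);
            }
    return out;
}
```

A 3×3 single-channel image filled with 32 and a kernel filled with 16 accumulates 9 × 512 = 4608 at the centre pixel, which requantizes to 18. Hand-checkable cases like that are worth running before trusting the model on random data.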
// tb.cpp: golden-model check
#include <cstdio>
#include <cstdint>
const int N = 16 * 32 * 32;   // COUT × IMG × IMG outputs (assumed buffer size)
static int8_t hls_out[N], golden_out[N];
int main() {
    // 1. generate random input
    // 2. run conv_layer()   (the HLS function)
    // 3. run conv_golden()  (a plain reference in C++)
    // 4. compare element-by-element
    int errors = 0;
    for (int i = 0; i < N; i++)
        if (hls_out[i] != golden_out[i]) errors++;
    printf("Mismatches: %d\n", errors);
    return errors ? 1 : 0;   // 0 = PASS; a raw count could wrap to 0 in the exit status
}
// Vitis HLS flow:
//   vitis_hls -f run.tcl
//   csim_design    → fast functional check (C only)
//   csynth_design  → generate RTL + reports
//   cosim_design   → RTL simulated against same testbench
//   export_design  → package as IP for Vivado
After csynth_design, the report tells you if your pragmas worked. The two numbers that matter most: II (did the pipeline hit 1?) and resource usage (did it fit?).
| Report Metric | What it means | Target |
|---|---|---|
| Latency (cycles) | Total cycles for one call | As low as possible |
| Interval / II | Cycles between new inputs | 1 for streaming |
| DSP | Multipliers used | < device budget |
| BRAM_18K | Block RAMs used | < device budget |
| FF / LUT | Registers / logic | < device budget |
| Timing (ns) | Achieved clock period | ≤ target |
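If the report shows II=72 when you asked for II=1, HLS prints the reason in an II-violation warning. It is almost always one of two things: a load that cannot be scheduled because of limited memory ports (a BRAM port conflict → add ARRAY_PARTITION on the array the message names), or a carried dependence between loop iterations (typically an accumulator → use partial sums). The synthesis log is your debugging roadmap; never ignore it.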
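The partial-sums fix can be shown in plain C++. A carried dependence bites when the add takes several cycles (floating-point accumulation is the usual case), because iteration i+1 needs the result of iteration i. Rotating across several independent accumulators means each one is updated only every few iterations, which gives the adder time to finish. The sketch below is illustrative: `sum_partial` and `LANES = 4` are assumed names and values, with `LANES` meant to be at least the adder latency in cycles.

```cpp
#include <cassert>
#include <cstddef>

constexpr int LANES = 4;   // hypothetical: choose >= adder latency in cycles

// Sum n floats through LANES independent accumulators.
// In HLS, `partial` would be fully partitioned and the main loop pipelined.
float sum_partial(const float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    float partial[LANES] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < n; i++)
        partial[i % LANES] += data[i];   // each lane is updated every LANES-th iteration
    float total = 0.0f;
    for (int l = 0; l < LANES; l++)      // short final reduction
        total += partial[l];
    return total;
}
```

Reordering floating-point additions can change rounding slightly, so the golden model should either use the same order or compare with a tolerance. Integer accumulators like the `acc_t` used in this lesson usually add in a single cycle and rarely need this treatment.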
| Aspect | Vitis HLS | Hand-written RTL |
|---|---|---|
| Development speed | Days (C++ + pragmas) | Weeks-months |
| Verification | C-sim in seconds | RTL sim, slower |
| Peak performance | ~85-95% of hand-RTL | 100% (full control) |
| Architecture changes | Edit C++, re-synth | Rewrite RTL |
| Learning curve | Pragma intuition | Deep RTL expertise |
| Best for | Most AI accelerators today | Critical IP, last 10% |
Many modern FPGA AI accelerators are developed substantially in HLS, with hand-RTL reserved for the most performance-critical kernels. Knowing HLS pragmas well is now a core, highly employable VLSI/AI skill.
Next — Day 11: Vitis AI & the DPU — deploy a real ResNet-50 on a Xilinx Kria board using the pre-built Deep Learning Processing Unit, model quantization, and the Vitis AI runtime.