What is Vitis HLS and why use it for neural networks?

Vitis HLS (High-Level Synthesis) is AMD/Xilinx's tool that converts C/C++ code into synthesizable RTL (Verilog/VHDL). For neural networks it lets you describe convolution, pooling, and FC layers in readable C++ loops, then use pragmas to control parallelism — dramatically faster development than hand-writing RTL, while still reaching high performance through directives like PIPELINE and UNROLL.

What does the PIPELINE pragma do in HLS?

#pragma HLS PIPELINE tells HLS to overlap loop iterations so a new iteration starts every II (initiation interval) cycles, ideally II=1. Without it, a loop runs one iteration to completion before the next begins. PIPELINE is the single most important pragma for throughput — it turns a sequential loop into a streaming pipeline that accepts a new input each clock.

What is the DATAFLOW pragma used for in HLS CNNs?

#pragma HLS DATAFLOW lets multiple functions (e.g. conv, pool, activation) run concurrently as a task-level pipeline connected by hls::stream FIFOs. While the pool function processes layer 1's output, the conv function already works on the next frame. This is how HLS builds a full layer-pipelined CNN accelerator — the same dataflow architecture from Day 9, but generated from C++.

What is ARRAY_PARTITION in Vitis HLS?

#pragma HLS ARRAY_PARTITION splits an array into multiple smaller memories (or individual registers) so several elements can be accessed in the same cycle. BRAM has only 1-2 ports, so an unrolled loop needing 9 array reads per cycle would stall — partitioning the array into 9 banks (or completely into registers) removes the bottleneck and enables II=1.

Vitis HLS for Neural Networks — C++ to RTL, Pragmas & CNN Accelerators

1. Why HLS Changed FPGA AI Forever

Hand-writing RTL for a full CNN is brutal. A single ResNet-50 accelerator in raw Verilog is tens of thousands of lines, months of work, and painful to modify when the model changes. High-Level Synthesis (HLS) flips this: you describe the algorithm in C++, and the tool generates the RTL — letting you iterate on architecture in hours instead of weeks.

The catch — and the entire skill of HLS — is that naive C++ produces terrible hardware. A plain triple-nested convolution loop synthesizes to a slow, sequential machine. The pragmas are how you tell HLS to build parallel, pipelined hardware. Getting them right is the difference between a 0.2 TOPS toy and a 2 TOPS production engine.

Figure: The HLS flow, from C++ to bitstream

2. Naive Convolution — and Why It's Slow

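Here is a textbook convolution in plain C++. It's correct, and it produces dreadful hardware. With no directives, HLS executes the loop nest essentially one MAC at a time: every product feeds the same accumulator, and every operand comes from a single memory. Recent Vitis HLS versions may auto-pipeline the innermost loop, but that still leaves you at roughly one MAC per cycle at best.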

conv_naive.cpp

// Naive 3×3 convolution — CORRECT but SLOW hardware
#include <cstdint>   // int8_t, int32_t
#define IMG  32
#define CIN  16
#define COUT 16

void conv_naive(
    int8_t  in  [CIN][IMG][IMG],
    int8_t  w   [COUT][CIN][3][3],
    int32_t out [COUT][IMG][IMG])
{
  for (int co = 0; co < COUT; co++)
    for (int y = 1; y < IMG-1; y++)
      for (int x = 1; x < IMG-1; x++) {
        int32_t acc = 0;
        for (int ci = 0; ci < CIN; ci++)
          for (int ky = 0; ky < 3; ky++)
            for (int kx = 0; kx < 3; kx++)
              acc += in[ci][y+ky-1][x+kx-1] * w[co][ci][ky][kx];
        out[co][y][x] = acc;
      }
}
// Result: ~1 MAC issued every few cycles. One shared multiplier,
// one BRAM port, deeply sequential. ResNet would take seconds/frame.

Correct ≠ Fast

This compiles, simulates, and gives the right answer — but with no pragmas, HLS builds a single-multiplier state machine that processes one MAC at a time. The whole craft of HLS is transforming this same C++ into parallel hardware using directives, without changing the math.
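To put a number on "one MAC at a time", here is a small sketch that counts the multiply-accumulates in `conv_naive` and the minimum cycle count for a given amount of parallelism. The helper names `total_macs` and `min_cycles` are made up for this illustration, and the result is a lower bound that ignores pipeline fill and memory stalls.

```cpp
#include <cstdint>

// Loop bounds copied from conv_naive.cpp; K is the kernel size.
constexpr int IMG = 32, CIN = 16, COUT = 16, K = 3;

// Every interior output pixel of every output channel needs CIN*K*K MACs.
constexpr std::int64_t total_macs() {
  return std::int64_t{COUT} * (IMG - 2) * (IMG - 2) * CIN * K * K;
}

// Lower bound on cycles when `macs_per_cycle` MACs are issued each clock
// (rounded up). macs_per_cycle = 1 models the fully sequential design.
constexpr std::int64_t min_cycles(std::int64_t macs_per_cycle) {
  return (total_macs() + macs_per_cycle - 1) / macs_per_cycle;
}
```

With these bounds the layer needs 2,073,600 MACs. That is about 2 million cycles at one MAC per cycle, 14,400 cycles with 144 parallel MACs, and 900 cycles with 2,304.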

3. The Four Pragmas That Matter

Ninety percent of HLS performance comes from four pragmas. Understand exactly what each one does to the generated hardware.

Pragma	What it does	Cost	When to use
`PIPELINE`	Overlaps loop iterations → new input every II cycles	Registers	Inner-ish loop you want streaming at II=1
`UNROLL`	Replicates the loop body N× to run in parallel	N× compute (DSPs)	Small loops (kernel taps, channels)
`ARRAY_PARTITION`	Splits an array into banks/registers for parallel access	More BRAM/regs	Feeding an unrolled loop without port stalls
`DATAFLOW`	Runs functions concurrently via stream FIFOs	FIFO BRAM	Top level — chain conv→pool→act

3.1 PIPELINE — the throughput pragma

Place #pragma HLS PIPELINE on a loop and HLS schedules a new iteration every II cycles. With II=1, after the pipeline fills, you get one result per clock — exactly the streaming behavior from Day 9.

Figure: No pipeline vs PIPELINE II=1

pipeline example

// Pipeline the spatial loop → one output pixel per cycle
// (fragment: sits inside the co loop of conv_naive, which defines co)
for (int y = 1; y < IMG-1; y++)
  for (int x = 1; x < IMG-1; x++) {
    #pragma HLS PIPELINE II=1     // <-- the magic line
    int32_t acc = 0;
    for (int ci = 0; ci < CIN; ci++)
      for (int ky = 0; ky < 3; ky++)
        for (int kx = 0; kx < 3; kx++)
          acc += in[ci][y+ky-1][x+kx-1] * w[co][ci][ky][kx];
    out[co][y][x] = acc;
  }
// HLS auto-unrolls loops *below* a PIPELINE pragma. So the ci/ky/kx
// loops fully unroll → 16×3×3 = 144 MACs must run in one cycle.
// That needs 144 parallel array reads → see ARRAY_PARTITION next.
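The effect of II on total loop time can be estimated with the usual pipeline formula, cycles ≈ depth + II × (trips − 1). The sketch below is a back-of-the-envelope model, not something the tool emits; the function names and the pipeline depth of 10 in the example are assumptions.

```cpp
#include <cstdint>

// Cycles for `trips` iterations of a body that takes `depth` cycles,
// with no overlap between iterations.
constexpr std::int64_t sequential_cycles(std::int64_t trips, std::int64_t depth) {
  return trips * depth;
}

// Same loop when pipelined: the first result appears after `depth` cycles,
// then one more result every `ii` cycles.
constexpr std::int64_t pipelined_cycles(std::int64_t trips, std::int64_t depth,
                                        std::int64_t ii) {
  return trips == 0 ? 0 : depth + ii * (trips - 1);
}
```

For the 30×30 interior pixels (900 trips) and an assumed depth of 10, the unpipelined loop takes 9,000 cycles, II=1 takes 909, and II=72 takes 64,738. A large achieved II is worse than the unpipelined loop, which is why a failed II target matters so much.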

3.2 UNROLL — the parallelism pragma

#pragma HLS UNROLL replicates a loop body so all iterations run at once. Unrolling the 16 input channels turns 16 sequential MACs into 16 parallel ones — 16× the DSPs, 16× the throughput on that loop.

unroll example

int32_t acc = 0;
for (int ci = 0; ci < CIN; ci++) {
  #pragma HLS UNROLL              // 16 channels computed in parallel
  acc += in[ci][y][x] * w[co][ci][ky][kx];
}
// Before: 16 cycles (one MAC at a time)
// After:  1 cycle, 16 DSPs + an adder tree to sum the 16 products
//
// Partial unroll if full is too costly:
//   #pragma HLS UNROLL factor=4   → 4 parallel, 4 iterations
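To see what "16 DSPs + an adder tree" means structurally, the sketch below writes the unrolled dot product out by hand in plain C++: 16 independent multiplies followed by a 4-level pairwise reduction. It is a software model of the shape HLS is expected to build, not tool output, and the names `dot_adder_tree`, `dot_sequential` and the self-check data are hypothetical.

```cpp
#include <array>
#include <cstdint>

constexpr int CIN = 16;
using Vec = std::array<std::int8_t, CIN>;

// Reference: one accumulator, one MAC after another.
std::int32_t dot_sequential(const Vec& a, const Vec& w) {
  std::int32_t acc = 0;
  for (int i = 0; i < CIN; i++) acc += std::int32_t{a[i]} * w[i];
  return acc;
}

// Unrolled shape: CIN independent products, then log2(CIN) adder levels.
std::int32_t dot_adder_tree(const Vec& a, const Vec& w) {
  std::array<std::int32_t, CIN> p{};
  for (int i = 0; i < CIN; i++)                      // 16 parallel multipliers
    p[i] = std::int32_t{a[i]} * w[i];
  for (int stride = CIN / 2; stride >= 1; stride /= 2)  // 8, 4, 2, 1
    for (int i = 0; i < stride; i++)
      p[i] += p[i + stride];
  return p[0];
}

// Self-check on made-up data: both orderings give the same integer sum.
bool adder_tree_matches_sequential() {
  Vec a{}, w{};
  for (int i = 0; i < CIN; i++) {
    a[i] = static_cast<std::int8_t>(i * 9 - 70);
    w[i] = static_cast<std::int8_t>(50 - i * 7);
  }
  return dot_adder_tree(a, w) == dot_sequential(a, w);
}
```

Integer addition is associative, so the reordering is exact. With floating-point data the tree can round differently from the sequential sum.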

3.3 ARRAY_PARTITION — feeding the parallel hardware

Here's the trap that catches every HLS beginner: you unroll a loop to get 144 parallel MACs, but the input array lives in a BRAM with only 2 ports. HLS can only read 2 values per cycle, so your 144 MACs stall to II=72. ARRAY_PARTITION splits the array into many small memories (or pure registers) so all 144 reads happen at once.

Figure: Why partitioning is mandatory after UNROLL

array_partition example

int8_t win [CIN][3][3];     // sliding window buffer
int8_t wgt [CIN][3][3];     // kernel weights

// Partition fully so all CIN×3×3 elements read in one cycle
#pragma HLS ARRAY_PARTITION variable=win complete dim=0
#pragma HLS ARRAY_PARTITION variable=wgt complete dim=0

// Partition options:
//   complete           → every element its own register (max parallel)
//   cyclic   factor=4  → round-robin into 4 banks
//   block    factor=4  → contiguous chunks into 4 banks
// Use 'cyclic' when an unrolled loop strides through the array.
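The difference between cyclic and block comes down to which bank each index lands in. The sketch below models that mapping and checks whether the `factor` consecutive elements read by an unrolled loop in one cycle hit distinct banks. The helper names are invented for this example, and the model assumes one access per bank per cycle.

```cpp
#include <vector>

// Bank that element i lands in, for an n-element array split `factor` ways.
constexpr int cyclic_bank(int i, int factor) { return i % factor; }
constexpr int block_bank(int i, int n, int factor) {
  return i / ((n + factor - 1) / factor);   // contiguous chunks of ceil(n/factor)
}

enum class Scheme { Cyclic, Block };

// True when elements base .. base+factor-1 all fall in different banks.
// Assumes base + factor <= n.
bool conflict_free(Scheme s, int base, int n, int factor) {
  std::vector<bool> used(factor, false);
  for (int k = 0; k < factor; k++) {
    int i = base + k;
    int b = (s == Scheme::Cyclic) ? cyclic_bank(i, factor)
                                  : block_bank(i, n, factor);
    if (used[b]) return false;              // two reads collide on one bank
    used[b] = true;
  }
  return true;
}
```

For a 16-element array and factor 4, a loop that reads elements 0 to 3 together is conflict-free under cyclic but not under block, where all four sit in bank 0.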

3.4 DATAFLOW — the architecture pragma

#pragma HLS DATAFLOW is how you build the layer-pipelined accelerator from Day 9 directly in C++. Each layer is a function; HLS connects them with hls::stream FIFOs and runs them all concurrently.

dataflow top — cnn_accel.cpp

#include <hls_stream.h>

void cnn_accel(int8_t in[CIN][IMG][IMG], int8_t out_cls[COUT]) {
  #pragma HLS DATAFLOW            // run all stages concurrently

  hls::stream<int8_t> s1, s2, s3;
  #pragma HLS STREAM variable=s1 depth=512
  #pragma HLS STREAM variable=s2 depth=512
  #pragma HLS STREAM variable=s3 depth=128

  conv_layer (in, s1);           // produces into s1
  relu_pool  (s1, s2);           // consumes s1, produces s2  (runs concurrently!)
  conv_layer2(s2, s3);           // consumes s2, produces s3
  fc_softmax (s3, out_cls);      // final classifier
}
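// While relu_pool drains s1, conv_layer keeps filling it with later pixels,
// and with the next frame once the current one is done. Stage bodies are
// omitted here and their signatures are schematic. This is the Day 9
// dataflow architecture, written in plain C++.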
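The producer/consumer discipline that DATAFLOW relies on can be tried out in ordinary C++ before any HLS headers are involved. In the sketch below `std::queue` is only a software stand-in for `hls::stream`, and the three stages (`scale_stage`, `relu_pool_stage`, `sink_stage`) are hypothetical placeholders rather than real layers. Each FIFO has exactly one writer and one reader, and every value is consumed once, in order.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

using Fifo = std::queue<std::int32_t>;   // stand-in for hls::stream<int32_t>

// Stage 1: scale every sample (placeholder for a conv layer).
void scale_stage(const std::vector<std::int32_t>& in, Fifo& out) {
  for (std::int32_t v : in) out.push(v * 3);
}

// Stage 2: 2:1 max-pool over consecutive pairs, then ReLU.
void relu_pool_stage(Fifo& in, Fifo& out) {
  while (in.size() >= 2) {
    std::int32_t a = in.front(); in.pop();
    std::int32_t b = in.front(); in.pop();
    std::int32_t m = a > b ? a : b;
    out.push(m > 0 ? m : 0);
  }
}

// Stage 3: drain the last FIFO into a result vector.
std::vector<std::int32_t> sink_stage(Fifo& in) {
  std::vector<std::int32_t> out;
  while (!in.empty()) { out.push_back(in.front()); in.pop(); }
  return out;
}

// Top level: same shape as cnn_accel, stages chained through FIFOs.
std::vector<std::int32_t> run_chain(const std::vector<std::int32_t>& in) {
  Fifo s1, s2;
  scale_stage(in, s1);
  relu_pool_stage(s1, s2);
  return sink_stage(s2);
}
```

In software the stages run one after another. In hardware they would run concurrently, and the FIFO depths decide how far a producer can run ahead of its consumer.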

4. A Complete Pipelined Conv Layer in HLS

Putting all four pragmas together — this is a production-style streaming convolution function with channel-parallel MACs at II=1.

conv_layer.cpp (full)

#include <hls_stream.h>
#include <ap_int.h>
#define IMG 32
#define CIN 16
#define COUT 16

typedef ap_int<8>  data_t;
typedef ap_int<32> acc_t;
// One stream word packs all channels of a pixel (8 bits each). A FIFO moves
// one word per cycle, so per-channel reads or writes could not reach II=1.
typedef ap_uint<8*CIN>  in_word_t;
typedef ap_uint<8*COUT> out_word_t;

void conv_layer(
    hls::stream<in_word_t>  &in_stream,
    data_t  weights[COUT][CIN][3][3],
    hls::stream<out_word_t> &out_stream)
{
  // co, ci, ky and kx are all unrolled, so every weight is read each cycle
  #pragma HLS ARRAY_PARTITION variable=weights complete dim=0

  static data_t lb0[CIN][IMG], lb1[CIN][IMG];   // 2 line buffers
  #pragma HLS ARRAY_PARTITION variable=lb0 complete dim=1
  #pragma HLS ARRAY_PARTITION variable=lb1 complete dim=1
  data_t win[CIN][3][3];
  #pragma HLS ARRAY_PARTITION variable=win complete dim=0

  for (int y = 0; y < IMG; y++) {
    for (int x = 0; x < IMG; x++) {
      #pragma HLS PIPELINE II=1                  // 1 input pixel / cycle

      in_word_t in_word = in_stream.read();      // single wide FIFO read

      // shift the 3×3 window and update line buffers (per channel)
      for (int ci = 0; ci < CIN; ci++) {
        #pragma HLS UNROLL                       // all channels parallel
        data_t px = in_word.range(8*ci+7, 8*ci);
        win[ci][0][0]=win[ci][0][1]; win[ci][0][1]=win[ci][0][2]; win[ci][0][2]=lb0[ci][x];
        win[ci][1][0]=win[ci][1][1]; win[ci][1][1]=win[ci][1][2]; win[ci][1][2]=lb1[ci][x];
        win[ci][2][0]=win[ci][2][1]; win[ci][2][1]=win[ci][2][2]; win[ci][2][2]=px;
        lb0[ci][x]=lb1[ci][x]; lb1[ci][x]=px;
      }

      if (y >= 2 && x >= 2) {                    // window full, centred on (y-1, x-1)
        out_word_t out_word = 0;
        // COUT output filters
        for (int co = 0; co < COUT; co++) {
          #pragma HLS UNROLL                     // all output channels parallel
          acc_t acc = 0;
          for (int ci = 0; ci < CIN; ci++)
            for (int ky = 0; ky < 3; ky++)
              for (int kx = 0; kx < 3; kx++)
                acc += win[ci][ky][kx] * weights[co][ci][ky][kx];
          // requantize (scale by 2^-8), ReLU, saturate to int8, then pack
          acc_t q = acc >> 8;
          data_t y8 = (q < 0) ? (data_t)0 : (q > 127) ? (data_t)127 : (data_t)q;
          out_word.range(8*co+7, 8*co) = y8;
        }
        out_stream.write(out_word);              // single wide FIFO write
      }
    }
  }
}
// Parallelism = CIN(16) × COUT(16) × 9 taps = 2304 MACs/cycle at II=1.
// Emits (IMG-2)×(IMG-2) output words, matching the interior pixels of conv_naive.
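The line-buffer and window shuffle in `conv_layer` is the part most worth checking in isolation. The sketch below replays the same update order for a single channel in plain C++ and compares every full window against direct image indexing. The image size, the test pattern and the function name `count_matching_windows` are assumptions made for the check. The window is treated as valid only once `y >= 2` and `x >= 2`.

```cpp
#include <cstdint>

constexpr int W = 8, H = 6;   // hypothetical small image for the check

// Streams a W x H image in raster order through two line buffers and a
// 3x3 window, and counts how many full windows match direct indexing.
int count_matching_windows() {
  std::int8_t img[H][W];
  for (int y = 0; y < H; y++)
    for (int x = 0; x < W; x++)
      img[y][x] = static_cast<std::int8_t>((y * W + x) * 7 % 101);

  std::int8_t lb0[W] = {}, lb1[W] = {};   // hold rows y-2 and y-1
  std::int8_t win[3][3] = {};
  int matches = 0;
  for (int y = 0; y < H; y++)
    for (int x = 0; x < W; x++) {
      std::int8_t px = img[y][x];         // stands in for in_stream.read()
      for (int r = 0; r < 3; r++) {       // shift window one column left
        win[r][0] = win[r][1];
        win[r][1] = win[r][2];
      }
      win[0][2] = lb0[x];                 // new right-hand column
      win[1][2] = lb1[x];
      win[2][2] = px;
      lb0[x] = lb1[x];                    // rotate the line buffers
      lb1[x] = px;
      if (y < 2 || x < 2) continue;       // window not yet full
      bool ok = true;
      for (int r = 0; r < 3; r++)
        for (int c = 0; c < 3; c++)
          ok = ok && (win[r][c] == img[y - 2 + r][x - 2 + c]);
      if (ok) matches++;
    }
  return matches;
}
```

All (H−2) × (W−2) = 24 full windows match, so the window at stream position (y, x) covers rows y−2..y and columns x−2..x, which is a 3×3 patch centred on (y−1, x−1).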

5. C/RTL Co-Simulation

HLS's killer feature: verify in C++ first. You write a C testbench, run it in seconds (C simulation), then run C/RTL co-simulation which feeds the same vectors through the generated RTL and checks they match — all without leaving HLS.

testbench.cpp

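// tb.cpp: golden-model check (skeleton)
#include <cstdio>
#include <cstdint>

const int N = 16 * 30 * 30;   // values to compare (example: COUT × 30 × 30 interior)
static int8_t hls_out[N], golden_out[N];

int main() {
  // 1. generate random input
  // 2. run conv_layer() (the HLS function), store results in hls_out
  // 3. run conv_golden() (a plain reference in C++), store results in golden_out
  // 4. compare element-by-element
  int errors = 0;
  for (int i = 0; i < N; i++)
    if (hls_out[i] != golden_out[i]) errors++;

  printf("Mismatches: %d\n", errors);
  return errors ? 1 : 0;   // 0 = PASS; csim and cosim treat nonzero as FAIL
}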
// Vitis HLS flow:
//   vitis_hls -f run.tcl
//     csim_design        → fast functional check (C only)
//     csynth_design      → generate RTL + reports
//     cosim_design       → RTL simulated against same testbench
//     export_design      → package as IP for Vivado

6. Reading the Synthesis Report

After csynth_design, the report tells you if your pragmas worked. The two numbers that matter most: II (did the pipeline hit 1?) and resource usage (did it fit?).

Report Metric	What it means	Target
Latency (cycles)	Total cycles for one call	As low as possible
Interval / II	Cycles between new inputs	1 for streaming
DSP	Multipliers used	< device budget
BRAM_18K	Block RAMs used	< device budget
FF / LUT	Registers / logic	< device budget
Timing (ns)	Achieved clock period	≤ target
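The Interval and clock figures translate directly into throughput. The sketch below does that arithmetic. The function names are made up, and the 200 MHz clock used in the example is an assumption, not a figure from any report.

```cpp
// Frames per second = clock frequency / interval (cycles between new inputs).
constexpr double frames_per_second(double clock_mhz, long interval_cycles) {
  return clock_mhz * 1e6 / static_cast<double>(interval_cycles);
}

// Peak tera-operations per second, counting one MAC as two operations.
constexpr double peak_tops(double clock_mhz, long macs_per_cycle) {
  return 2.0 * macs_per_cycle * clock_mhz * 1e6 / 1e12;
}
```

A 32×32 layer streaming at II=1 has an interval of about 1,024 cycles, which at an assumed 200 MHz is roughly 195,000 layer passes per second. At the same clock, 2,304 MACs per cycle peak at about 0.92 TOPS.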

Reading "II violation" Warnings

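If the report shows II=72 when you asked for II=1, the log states the reason. The two most common are a memory-port conflict (a message along the lines of "Unable to schedule 'load' operation ... due to limited memory ports", fixed with ARRAY_PARTITION) and a loop-carried dependency (along the lines of "Unable to enforce a carried dependence constraint", typically an accumulator, fixed with partial sums). Exact wording varies between tool versions. The report log is your debugging roadmap; never ignore it.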
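The partial-sums remedy for a carried dependency looks like this in plain C++. It is a minimal sketch: `LANES`, `N` and the function names are assumptions, and the pragmas you would add in HLS appear only as comments. Whether the tool then reaches the requested II depends on the adder latency and on its dependence analysis.

```cpp
constexpr int N = 64;       // hypothetical vector length
constexpr int LANES = 4;    // hypothetical number of partial sums

// Single accumulator: every add waits for the previous iteration's result.
float sum_single(const float* v, int n) {
  float acc = 0.0f;
  for (int i = 0; i < n; i++) acc += v[i];
  return acc;
}

// Partial sums: each lane is touched only every LANES iterations, so a
// multi-cycle adder has time to finish before its lane is needed again.
float sum_partial(const float* v, int n) {
  float part[LANES] = {};   // in HLS: ARRAY_PARTITION variable=part complete
  for (int i = 0; i < n; i++) {
    // in HLS: PIPELINE II=1 on this loop
    part[i % LANES] += v[i];
  }
  float acc = 0.0f;
  for (int l = 0; l < LANES; l++) acc += part[l];   // short final reduction
  return acc;
}

// Self-check on small integer-valued data, where both orders are exact.
bool partial_sums_match() {
  float v[N];
  for (int i = 0; i < N; i++) v[i] = static_cast<float>(i % 7 - 3);
  return sum_partial(v, N) == sum_single(v, N);
}
```

For general floating-point data the two versions can differ in the last bits because the additions are reordered, so compare against the golden model with a tolerance.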

7. HLS vs Hand-Written RTL

Aspect	Vitis HLS	Hand-written RTL
Development speed	Days (C++ + pragmas)	Weeks-months
Verification	C-sim in seconds	RTL sim, slower
Peak performance	~85-95% of hand-RTL	100% (full control)
Architecture changes	Edit C++, re-synth	Rewrite RTL
Learning curve	Pragma intuition	Deep RTL expertise
Best for	Most AI accelerators today	Critical IP, last 10%

The Industry Reality

Many modern FPGA AI accelerators are developed substantially in HLS, with hand-written RTL reserved for the most performance-critical kernels. Knowing HLS pragmas well is now a core, highly employable VLSI/AI skill.

Day 10 — Key Takeaways

✅ HLS turns C++ into RTL — iterate architecture in hours, not weeks
✅ Naive C++ = slow hardware; pragmas create the parallelism
✅ PIPELINE II=1 → one result per cycle (the throughput pragma)
✅ UNROLL → replicate loop body for parallel MACs (costs DSPs)
✅ ARRAY_PARTITION → split arrays so unrolled loops don't stall on BRAM ports
✅ DATAFLOW + hls::stream → layer-pipelined accelerator from C++ (Day 9 architecture)
✅ C/RTL co-sim verifies the generated RTL against your C testbench
✅ Read the report: II=1 and within-budget resources = success
✅ HLS reaches ~85-95% of hand-RTL performance at a fraction of the effort

Next — Day 11: Vitis AI & the DPU — deploy a real ResNet-50 on a Xilinx Kria board using the pre-built Deep Learning Processing Unit, model quantization, and the Vitis AI runtime.
