Fellow Rustaceans, embark on a journey with me, into the enchanted world of hardware design.

With FPGAs, we’ll transcend the limitations of traditional CPUs, unlocking faster and more efficient calculations. Our journey will be a perilous one, as the language of FPGAs and the common programming languages of our world are vastly different. It would take a mighty sorcerer to bridge that gap. Maybe even several. The path ahead is unclear, but we shall not be deterred.

In the labyrinthine passages of the internet, we shall search for answers. We will encounter PandA, an experienced high-level synthesizer. This powerful entity can translate the ancient languages of C/C++ into descriptive forms that the FPGAs can understand. Yet, even PandA is unable to understand our native tongue, Rust. Fear not, brave adventurer, for we have the aid of a mighty and swift wyvern of LLVM. It set out to create a common language, not to be natively spoken by anyone, but to be understood by all. We will use the language of LLVM to communicate with PandA, and see how it fares with our Rust compared to its familiar C/C++. So gather your courage and sharpen your wits, for our journey into the mystical world of HLS, FPGAs, CPUs, and LLVM is about to begin.

Introduction

A few weeks ago, I decided to explore FPGAs and how Rust could be used to program them. In this post, I document my first explorations into HLS and Rust. We will learn what HLS is, how it works, and how we can use it to synthesize Rust into circuits.

Background

I will assume that you have some knowledge of Rust and systems programming. I will try to give you a short intro on what FPGAs are and how they are usually used, but I will not go into too much detail.

What even is an FPGA?

CPUs process their instructions one by one. They are purpose-built machines that are carefully designed for processing lots of instructions in sequence. Traditional programming languages reflect this design for the most part: they make it easy for humans to write programs that execute sequentially, one instruction at a time. There are approaches to parallelism, but they are mostly limited to either running multiple threads of execution from top to bottom simultaneously, or to SIMD-style instructions that perform the same operation on a fixed number of data elements at once.

Field programmable gate arrays (FPGAs) on the other hand are not designed to process one instruction at a time. They are not even designed to process instructions at all. As the name field programmable gate array implies, it is just a lot of programmable logic gates with programmable connections between them. So while a CPU is a circuit designed to do one thing, an FPGA is a circuit designed to emulate other circuits.

For example, you can define the logic gates for your CPU and program the FPGA with that design, so it will behave exactly like a CPU and can process instructions. Such a 'soft' CPU will be bigger and therefore slower than a real CPU because a circuit emulated on an FPGA takes up more space than an application-specific circuit. This is however really useful for prototyping because you can test the CPU design in hardware with external peripherals, like memory and I/O devices.

In the past, FPGAs were mostly used like this, for prototyping in hardware development. Designers described a circuit in a hardware description language (HDL) and simulated it as a first stage of verification. If that worked, they deployed the design to an FPGA and tested it there. If everything worked as expected, the design was produced as integrated circuits.

This is still the most common use case for FPGAs, but it is not the only one. In recent years, the usage of FPGAs as pseudo-general-purpose computational accelerators has become more relevant. Here you do not use them to prototype circuits that will eventually be taped out, but as the final platform. FPGAs used in this context are known as computational FPGAs. This is somewhat comparable to the use of GPUs as computing accelerators. But where GPUs excel at tasks that perform the same operations in parallel on massive amounts of data, FPGAs can be used for computations with irregular parallelism and a mostly static structure. Unlike with GPUs, we are not yet entirely sure what an appropriate abstraction for the computational pattern of computational FPGAs looks like.

The world hasn’t settled yet on a succinct description of the fundamental computational pattern that FPGAs are supposed to be good at. But it has something to do with potentially-irregular parallelism, data reuse, and mostly-static data flow.

— Adrian Sampson
FPGAs Have the Wrong Abstraction, 2019

Until that is settled, computational FPGAs will probably stay a niche application. There are currently multiple projects that try to find an answer to this problem, by developing a new type of programming language for designing these hardware accelerators. We will learn more about that in the next sections.

How are FPGAs programmed?

Usually, logic design is done in a hardware description language (HDL). Currently, there are two major ones, Verilog and VHDL. VHDL has a slightly higher level of abstraction and some features that make it easier to manage bigger projects. In some ways, the relation between Verilog and VHDL is comparable to the relation between C and C++ in terms of features and abstraction. Both can model the structure of hardware equally well, so the choice is mostly a matter of personal preference. In this post, we will focus on Verilog, because it is the default language of most tools.

Most commonly, HDLs are used to describe circuits at the register-transfer level (RTL) of abstraction. At RTL, we describe registers that can hold state and the combinational (time-independent) logic that connects them. In HDLs, we can wrap these RTL descriptions of logic in modules. Modules can expose input and output ports that connect the contained logic to the outside world. We can then connect these modules to form a larger circuit. This is the basic structure of an HDL design.

Note

There also is rust-hdl, an HDL that you can write in Rust. Its capabilities are comparable to other HDLs like Verilog or VHDL. When using it, you write Rust structs that represent HDL modules. You can then annotate these structs with macros, and rust-hdl will generate Verilog from them. It does not do high-level synthesis, but it allows you to write, simulate, and verify RTL logic directly in Rust. If you are interested in FPGA programming, you should definitely check it out.

Let us look at a simple example of a counter in Verilog. On every clock pulse, it counts up by one and wraps around to zero after reaching 255:

Interface of the counter
Figure 1. Interface of the counter
Verilog source of the counter
module counter( clock, count );
  input   clock;        // single-bit clock input
  output  [7:0] count;  // 8-bit counter value

  reg     [7:0] internal_counter;
  initial internal_counter = 8'b0;

  // The output port is directly driven by the internal register.
  assign count = internal_counter;

  // Increment on every rising clock edge; wraps to 0 after 255.
  always @ (posedge clock)
  begin
    internal_counter <= internal_counter + 1;
  end

endmodule

The connections to the outside are an input port named clock and an output port named count. clock is just a single bit, but count is 8 bits wide. Internally, we also have a register named internal_counter that stores the current count. First, we tell Verilog that the output port should be directly connected to the internal counter register. Then we define that the value of internal_counter plus 1 gets assigned to internal_counter every time the clock input goes from zero (low) to one (high).
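
To connect this back to the note about rust-hdl above, the same counter could look roughly like this in rust-hdl. Take this as an unverified sketch: it assumes rust-hdl's DFF register primitive exposes clock, d and q signals and that Bits<8> values can be incremented with + 1u32:

use rust_hdl::prelude::*;

// Sketch of the 8-bit counter from above, written with rust-hdl.
// Unverified assumptions: DFF exposes `clock`, `d` and `q`,
// and Bits<8> supports addition with a u32 literal.
#[derive(LogicBlock, Default)]
struct Counter {
    pub clock: Signal<In, Clock>,
    pub count: Signal<Out, Bits<8>>,
    counter: DFF<Bits<8>>,
}

impl Logic for Counter {
    #[hdl_gen]
    fn update(&mut self) {
        // Feed the register from the module clock.
        self.counter.clock.next = self.clock.val();
        // Next state: current value plus one; wraps after 255 like the Verilog version.
        self.counter.d.next = self.counter.q.val() + 1u32;
        // Drive the output port from the register.
        self.count.next = self.counter.q.val();
    }
}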

Can’t we just use a normal programming language?

As we have seen earlier, HDLs mostly operate at the register-transfer level (RTL) of abstraction. High-level synthesis (HLS) is a process that enables the design of digital circuits in a high-level language, such as C or C++. HLS allows designers to write their design at a higher level of abstraction, which can be more natural and familiar to software developers.

The biggest downside of using HLS is that the generated circuits are usually a lot less efficient than circuits designed directly at RTL. It is also somewhat unpredictable which optimizations will and will not be beneficial during HLS, because the optimizations that improve performance on a CPU are not always beneficial for synthesizing circuits. Writing an application at this level and letting the HLS tools figure out the details is considered bad design practice in hardware design. This is the opposite of software programming, where the compiler is expected to do the heavy lifting in terms of speed and optimization, so the programmer can focus on the behavioral application logic. Computational FPGA programming usually accepts this tradeoff, because bigger FPGAs are usually cheaper than the time it would take to design the circuit at RTL.

The source languages used for HLS are usually limited subsets of C/C++ with some additional restrictions. In this post, we will look into Rust as a source language for HLS. Rust could be a good fit for HLS because it already has some restrictions for dealing with memory that could prove beneficial for HLS. For example, its semantics define memory and ownership more clearly than C/C++.

There are several ways toolchains for HLS from Rust can be built with currently existing software. However, it is currently not known how viable these different toolchains are for practical applications, and specifically which restrictions apply. It is also not known how much they can benefit from the additional information provided by Rust's semantics.

Are there other alternatives?

For this kind of computational programming, a new style of programming language is currently being created, known as accelerator design languages (ADLs). They are somewhat similar to HLS, but where HLS uses a programming language as an ad-hoc ADL, real ADLs are designed from the ground up to accommodate the needs of hardware design, while still offering a level of abstraction comparable to programming languages. As a tradeoff, they do not attempt to enable the design of arbitrary hardware. The two big areas of complexity that ADLs try to abstract away are physicality and time.

Physicality. Perhaps the most fundamental thing about hardware is that computation uses finite, physical resources to do its work. When an accelerator performs a floating-point multiplication, for example, that has to happen somewhere - it needs to dedicate an FPU to that computation. And meanwhile, it needs to take care not to use the same FPU for something else at the same time. While CPUs of course also need to allocate computational resources to computations, they do it implicitly - whereas accelerator designs have to manage it as a first-class design concern.

Time. While parallel software has to deal with a notion of logical time, such as a happens-before relation for ordering events between threads, hardware accelerators need to contend with physical time in the form of clock cycles. Cycle-level timing adds a dimension to ADLs’ complexity, but it also gives accelerators the unique ability to ensure determinism: that a given computation takes 100 cycles, for example, with no exceptions. This level of control is an important advantage over even high-performance CPUs. For many use cases such as data center networking, 100 deterministic cycles might be preferable to a CPU that takes 80 cycles most of the time but 500 cycles in rare exceptions.

As mentioned above, HLS can be viewed as abusing a normal programming language to express something it is not meant to express. While a real ADL is probably better suited for generating serious hardware, for smaller tasks or as part of a bigger design, HLS can still be a good alternative. The big benefit of HLS over dedicated ADLs is that it enables developers to write their code in a language they are already familiar with. In addition, the tooling for traditional programming languages is more mature than for ADLs.

While it is probable that in the future ADLs will become more popular, for now, HLS remains the most common way to write high-level code for FPGAs.

Note

If you want to know more about ADLs, I recommend reading Adrian Sampson's blog posts on the topic.

Why use Rust instead of C/C++?

Rust seems better suited as an ad-hoc ADL than C/C++, because its memory semantics already touch on some of the problems ADLs address. For example, Rust allows either one mutable reference or multiple read-only references to a variable at a time, never both. This could help with the problem of physicality, because there can only be one thing writing to a variable at any given time.
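
A tiny example of these rules in action (the names are made up for illustration):

fn main() {
    let mut value = 0;
    // Any number of shared (read-only) borrows may coexist...
    let r1 = &value;
    let r2 = &value;
    println!("{r1} {r2}");
    // ...but a mutable borrow must be the only active borrow.
    let writer = &mut value;
    *writer += 1;
    // Using `r1` down here would make the compiler reject the mutable
    // borrow above with error[E0502].
    // println!("{r1}");
}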

Another benefit that could come from using Rust specifically is the already mentioned rust-hdl project, which allows writing RTL modules in Rust at the same level of abstraction as other HDLs. It would be really interesting if we could use HLS to turn normal Rust functions into modules for rust-hdl. That way, all hardware and accelerator design could be done in a single Rust project. For now, we will not focus on this, but if we get a successful proof-of-concept for HLS from Rust, we could look into it.

Setting up the toolchain

So now you should have a bit of background on the field we are going to explore. Our goal for this post is to get a basic HLS toolchain for Rust up and running. Additionally, we want to understand how the results compare to HLS from C/C++.

The PandA project maintains and develops a framework for research in the Hardware/Software Co-Design area. As part of that project, they publish an HLS tool called bambu. While bambu is mostly used for research, it is also one of the most complete free and open-source HLS tools available.

It can use the clang or GCC compilers as a frontend and synthesize their output to Verilog. Because clang is based on LLVM, bambu can also load LLVM intermediate representation (LLVM IR) directly. The Rust compiler is based on LLVM as well and has the option to output LLVM IR, which we can then use as the input for bambu. This way, it is possible to perform high-level synthesis on Rust code.

The toolchain
Figure 2. The toolchain

As you can see, our toolchain uses rustc to compile Rust code to LLVM IR and then uses bambu to generate Verilog. We could then use an RTL synthesis tool like yosys to generate gate-level logic. This representation is also known as a netlist, because it lists the nets of connections between the gates. In the final step, we could use a place-and-route tool to compute the actual layout of the gates and connections on the FPGA. The resulting file could then be loaded onto an FPGA. In this post, we only focus on the first steps.

Using bambu for C++

Bambu is built to synthesize C/C++ code.

Let us look at this C++ function that finds the minimum and maximum in an array:

void min_max_cpp(int* numbers, int numbers_length, int* out_max, int* out_min)
{
   int local_max = numbers[0];
   int local_min = numbers[0];
   int i = 0;
   for(i = 0; i < numbers_length; i++)
   {
      if(numbers[i] > local_max)
      {
         local_max = numbers[i];
      }
      if(numbers[i] < local_min)
      {
         local_min = numbers[i];
      }
   }
   *out_max = local_max;
   *out_min = local_min;
}

It is C-like code that mutates its inputs. The function takes a pointer to an array of integers, the number of elements in the array, and two pointers to integers. The function finds the smallest and largest value of the array and writes them to the memory locations pointed to by out_max and out_min.

We synthesize the function using bambu and use its integrated test runner to check the function against some test cases. Later, we will use the same test cases for the Rust version of the function. The Rust function requires some level of optimization to be enabled, and we want this test to be comparable with that, so we set the optimization level to -O2, the lowest level that also works for the Rust version. The test contains 21 test cases, because we test the performance of the function with 0 to 20 elements in the array.

$ nix run github:zebreus/bachelor-thesis#bambu -- src/min_max_cpp.cpp --top-fname=min_max_cpp --generate-tb=src/testbench.xml --simulate --simulator=VERILATOR -v2 --compiler=I386_CLANG12 -O2
  Scheduling Information of function min_max_cpp:
    Number of control steps: 15
    Minimum slack: 0.37826399699999946
    Estimated max frequency (MHz): 103.93134873875212

  Module binding information for function min_max_cpp:
    Number of modules instantiated: 112
    Number of performance conflicts: 12
    Estimated resources area (no Muxes and address logic): 1034
    Estimated area of MUX21: 72
    Total estimated area: 1106
    Estimated number of DSPs: 0

  Total cycles             : 363 cycles
  Number of executions     : 21
  Average execution        : 17 cycles

The most important aspects of the output are the average number of clock cycles and the total area used. The area is an important metric because bigger FPGAs are more expensive. The number of clock cycles is also important because it is a reasonable indicator of how fast the function is.

We tell bambu to use Verilator to execute the tests. Verilator is not a traditional simulator but a compiler, as it converts the Verilog model back to C++ that can be executed. This way, the tests run a lot faster, but we primarily use it because it is the most convenient option. Verilator does not simulate timings beyond clock cycles, so the results are not completely representative of the actual hardware. That said, comparing clock cycles should be sufficient for now. Bambu also creates an estimate of the maximum clock frequency the design can run at. Later, we will use that in conjunction with the clock cycles to create a more accurate estimate of the performance.

Looking at the generated interface

The interface of the generated Verilog will look something like this:

Interface generated from C++ function
Figure 3. Interface generated from C++ function

The input ports are shown on the left side and the output ports are on the right side of the module.

We can see that the module has 3 input ports named clock, reset and start. These do pretty much what you would expect them to do. The reset signal resets the module, and the start signal starts it. The module will not start until the start signal is high and the reset signal is low. The module will then run until it is done and the start signal goes low again. The clock signal is used to clock the module; when we talk about cycles later, we mean clock cycles.

The module has an output port named done. I assume that it will be high when the module has finished.

Then there are some input and output ports prefixed with M. They are used to connect the module to external memory. This is necessary because the input to our function is a pointer, so the module also needs access to the memory it points to.

Finally, we have the four inputs of our function as 32-bit wide signals. I assume that the input for numbers_length passes in the length as a 32-bit integer at the start. The other inputs are pointers to the memory locations where the module can find the array and where it should write the two output values.

All of these are just educated guesses for now, because we did not look at the generated Verilog.

Limitations when using bambu with Rust

When using Rust with bambu, we need to take a few things into account.

First, bambu cannot synthesize anything that unwinds the stack, at least when using the LLVM IR backend. Bambu also has no way of synthesizing a function that terminates the program (like std::process::exit). This is a bit unfortunate, because unwinding and terminating are the two ways Rust deals with panics. So we must make sure that our code cannot panic. When optimizations are disabled, the Rust compiler sometimes inserts exception handlers into the LLVM IR, even though they are not used. On higher optimization levels, the compiler is smart enough to remove those exception handlers.

Second, our function cannot use I/O or most other impure features. This includes things like the println! macro. This does not affect our example, but keep it in mind when writing your own code.

Third, we need to make sure that the function interface is preserved. We can specify that we want the name to be preserved by attaching the #[no_mangle] attribute to our function. Otherwise, our function will be named something like _ZN17min_max_rust12min_max_rust17h0f0776ba5e267ab5E instead of min_max_rust. We also need a C-compatible interface, so we mark the function as extern "C". While this is not strictly necessary, it is needed in our case, because we use bambu's testbench generator to generate test cases for our function, and the testbench generator only supports calling our function through a C interface.
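
As a minimal illustration of how the two relate (the function names here are made up):

// `extern "C"` only fixes the calling convention; without #[no_mangle]
// the symbol is still mangled to something like _ZN4demo13still_mangled17h...E.
pub extern "C" fn still_mangled(x: i32) -> i32 {
    x
}

// With #[no_mangle] the symbol keeps its plain name, so we can pass it
// to bambu via --top-fname.
#[no_mangle]
pub extern "C" fn kept_name(x: i32) -> i32 {
    x
}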

Fourth, bambu does not support all LLVM intrinsics. Intrinsics are something like the standard library of LLVM IR. Most notably, bambu does not support the llvm.vector.reduce intrinsics, which Rust uses for vectorizing loops. There are probably other unsupported intrinsics, but I have not encountered any more yet. If we encounter an unsupported intrinsic, our easiest option is to try another optimization level or to see if the Rust compiler has an option to disable the use of that intrinsic. If that does not work, we can try to modify the generated LLVM IR to remove the intrinsic. A more advanced option is to write an LLVM pass that replaces the intrinsic with something bambu supports. We won't need that in this blog post, but it is something to keep in mind.

Converting the min_max example to Rust

Translating min_max_cpp to Rust is not very difficult. We need to access the input array through raw pointer offsets (or get_unchecked on a slice) instead of indexing with array[i], because normal indexing is always bounds checked. If we used the checked version, our code could panic, which is not synthesizable. Finally, we need to mark the function as #[no_mangle], so that the function name is preserved.

/// Minmax function that is as similar as possible to the equivalent C++ function
#[no_mangle]
pub unsafe extern "C" fn min_max_rust(
    numbers: *mut i32,
    numbers_length: i32,
    out_max: &mut i32,
    out_min: &mut i32,
) {
    let mut local_max = *numbers.offset(0);
    let mut local_min = *numbers.offset(0);
    for i in 0..numbers_length {
        if *numbers.offset(i as isize) > local_max {
            local_max = *numbers.offset(i as isize);
        }
        if *numbers.offset(i as isize) < local_min {
            local_min = *numbers.offset(i as isize);
        }
    }
    *out_max = local_max;
    *out_min = local_min;
}

If we compile that code to LLVM IR and try to use bambu with it, we get an error that llvm.vector.reduce.smax.v4i32 is not supported. As mentioned above, bambu does not support LLVM vector intrinsics. They probably get inserted by the LLVM loop vectorizer. We can disable that vectorization pass by passing -C no-vectorize-loops to rustc.

By default, the Rust compiler generates code that unwinds the stack on panic. The generated LLVM IR will also have an exception-handling personality function attached to every function. This is not synthesizable. We tell the compiler to instead terminate the program on panic by passing -C panic=abort to rustc.

Along the same lines, we should make sure that debug assertions are disabled. Debug assertions add extra safety checks, which can make it easier to find and fix bugs, but these checks usually manifest as panics. We can disable them explicitly by passing the -C debug-assertions=off flag; on every optimization level other than 0 they are disabled by default.

We also need to disable overflow checks, because they will generate panics if overflows occur during arithmetic operations. We can do that by passing -C overflow-checks=off to rustc.
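
As a small illustration of what these checks do (a made-up function, not part of the min/max code):

// With -C overflow-checks=on this panics with "attempt to add with overflow"
// when x == 255; with -C overflow-checks=off it wraps around to 0 instead,
// which contains no panic path and is therefore synthesizable.
#[no_mangle]
pub extern "C" fn increment(x: u8) -> u8 {
    x + 1
}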

It is probably a good idea to tell the Rust compiler that we are not targeting a specific CPU architecture by passing -C target-cpu=generic.

Recent versions of rustc generate LLVM IR with opaque pointers (ptr instead of i32*) by default. This is not supported by bambu, because it uses an old LLVM version, so we need to pass -C llvm-args="--opaque-pointers=false" to rustc.

The final command to compile the Rust code to LLVM IR is:

$ rustc --emit=llvm-ir --crate-type=lib src/min_max_rust.rs -o min_max_rust.ll -C opt-level=0 -C overflow-checks=off -C no-vectorize-loops -C target-cpu=generic -C panic=abort -C llvm-args="--opaque-pointers=false"

Bambu fails to synthesize the Rust code if all optimizations are disabled. Enabling at least some level of optimization in rustc, or a level higher than 1 in bambu, fixes the problem. I assume this is because rustc still generates exception handlers in the LLVM IR when optimizations are disabled. We will set the optimization level in bambu to -O2, because that way we can compare the results with the C++ function, which used the same level.

Synthesizing and testing the rust function
$ rustc --emit=llvm-ir --crate-type=lib -C overflow-checks=off -C no-vectorize-loops -C target-cpu=generic -C panic=abort -C opt-level=0 src/min_max_rust.rs -o build/min_max_rust_intro.ll
$ nix run github:zebreus/bachelor-thesis#bambu -- build/min_max_rust_intro.ll --top-fname=min_max_rust --generate-tb=src/testbench_rust.xml --simulate --simulator=VERILATOR -v2 --compiler=I386_CLANG12 -O2
  Scheduling Information of function min_max_rust:
    Number of control steps: 15
    Minimum slack: 0.37826399699999946
    Estimated max frequency (MHz): 103.93134873875212

  Module binding information for function min_max_rust:
    Number of modules instantiated: 112
    Number of performance conflicts: 12
    Estimated resources area (no Muxes and address logic): 1034
    Estimated area of MUX21: 72
    Total estimated area: 1106
    Estimated number of DSPs: 0

  Total cycles             : 363 cycles
  Number of executions     : 21
  Average execution        : 17 cycles

It is quite interesting to see that the synthesis results for the Rust version are identical to those of the C++ version. My best guess is that rustc and clang generated nearly identical LLVM IR, because the function is the same.

I would have assumed that the Rust version would be slower, because the C++ version had optimizations enabled during compilation and synthesis, while we only enabled synthesis optimizations for the Rust version. Also, bambu was designed to synthesize C++ code, so I imagined it might be better at that than at synthesizing Rust. My best guess is that bambu treats both versions just as LLVM IR, so it has no special knowledge about C++. We will do a more detailed comparison of the C++ and Rust versions later, where we also include the GCC backend; maybe that yields different results.

Converting it to more idiomatic Rust

While the previous example technically was Rust, it was not very Rust-like.

Here is a more idiomatic Rust version of the same function, using slices and iterators.

#[repr(C)]
pub struct MinMax {
    pub max: i32,
    pub min: i32,
}

#[no_mangle]
pub unsafe extern "C" fn min_max_rust_idiomatic(numbers: *mut i32, numbers_length: i32) -> MinMax {
    // Reinterpret the raw pointer as a slice so we can use safe iterators.
    let slice = std::slice::from_raw_parts_mut(numbers, numbers_length as usize);

    // Note: folding from {max: 0, min: 0} differs slightly from the C++
    // version, which starts from the first element of the array.
    slice.iter().fold(MinMax { max: 0, min: 0 }, |mut acc, &x| {
        if x > acc.max {
            acc.max = x;
        }
        if x < acc.min {
            acc.min = x;
        }
        acc
    })
}

The new version makes use of Rust’s standard library and functional constructs such as iterators and the fold method. Functional programming is widely used in Rust and considered idiomatic because it is expressive and encourages best practices like immutability and explicitness. It is also safer than manually indexing into the input pointer with the offset method as in the previous version. Additionally, the use of a struct to return the minimum and maximum values is a more natural way to represent the output in Rust, as opposed to using two separate output pointers.

Usually, idiomatic Rust code is at least as fast as C-like Rust code, because the compiler can do more optimizations on it. So let's see if that holds here. Bambu cannot synthesize this function if the LLVM IR was not optimized by rustc, so we will set rustc's optimization level to 1. This makes the result less comparable to the previous one, but it shouldn't make a big difference, because Rust only does basic optimizations at that level.
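
As an aside on that first claim, here is a classic example of why iterators can optimize better (made-up helper functions, unrelated to our min/max code):

// Indexed version: each `a[i]` carries a potential panic path.
pub fn sum_indexed(a: &[i32]) -> i32 {
    let mut sum = 0;
    for i in 0..a.len() {
        sum += a[i]; // bounds check (often optimized out here, but not guaranteed)
    }
    sum
}

// Iterator version: the bounds are encoded once, so there are
// no per-element bounds checks and no panic path.
pub fn sum_iter(a: &[i32]) -> i32 {
    a.iter().sum()
}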

Synthesizing and testing the idiomatic Rust function
$ rustc --emit=llvm-ir --crate-type=lib -C overflow-checks=off -C no-vectorize-loops -C target-cpu=generic -C panic=abort -C opt-level=1 src/min_max_rust_idiomatic.rs -o build/min_max_rust_idiomatic_intro.ll
$ nix run github:zebreus/bachelor-thesis#bambu -- build/min_max_rust_idiomatic_intro.ll --top-fname=min_max_rust_idiomatic --generate-tb=src/testbench_rust_idiomatic.xml --simulate --simulator=VERILATOR -v2 --compiler=I386_CLANG12 -O2
  Scheduling Information of function min_max_rust_idiomatic:
    Number of control steps: 7
    Minimum slack: 2.6688639990000045
    Estimated max frequency (MHz): 136.40450809582526

  Module binding information for function min_max_rust_idiomatic:
    Number of modules instantiated: 29
    Number of performance conflicts: 0
    Estimated resources area (no Muxes and address logic): 215
    Estimated area of MUX21: 0
    Total estimated area: 215
    Estimated number of DSPs: 0

  Total cycles             : 462 cycles
  Number of executions     : 21
  Average execution        : 22 cycles

The results look promising so far. This version needs a few more cycles per execution, but it can run at a higher estimated frequency and is way smaller than the previous two. That could however be because the Rust compiler was set to a higher optimization level. Let us see what happens when we synthesize our functions with all optimizations enabled.

Evaluating the different designs

For the comparison, we will use the same test cases as before. We will test every version of the function once with optimizations set for maximum speed and once with optimizations set for minimum size. For the C++ version we will also compare how the results change when we use GCC instead of clang.

We will consider the synthesis as a black box, so we will not compare the generated code. We will only compare the execution time and resource usage of the generated code.

C++ with clang and -O5
$ nix run github:zebreus/bachelor-thesis#bambu -- src/min_max_cpp.cpp --top-fname=min_max_cpp --generate-tb=src/testbench.xml --simulate --simulator=VERILATOR -v2 --compiler=I386_CLANG12 -O5
  Scheduling Information of function min_max_cpp:
    Number of control steps: 15
    Minimum slack: 0.37826399699999946
    Estimated max frequency (MHz): 103.93134873875212

  Module binding information for function min_max_cpp:
    Number of modules instantiated: 112
    Number of performance conflicts: 12
    Estimated resources area (no Muxes and address logic): 1034
    Estimated area of MUX21: 72
    Total estimated area: 1106
    Estimated number of DSPs: 0

  Total cycles             : 363 cycles
  Number of executions     : 21
  Average execution        : 17 cycles
C++ with clang and -Os
$ nix run github:zebreus/bachelor-thesis#bambu -- src/min_max_cpp.cpp --top-fname=min_max_cpp --generate-tb=src/testbench.xml --simulate --simulator=VERILATOR -v2 --compiler=I386_CLANG12 -Os
  Scheduling Information of function min_max_cpp:
    Number of control steps: 9
    Minimum slack: 2.6688639990000045
    Estimated max frequency (MHz): 136.40450809582526

  Module binding information for function min_max_cpp:
    Number of modules instantiated: 25
    Number of performance conflicts: 0
    Estimated resources area (no Muxes and address logic): 208
    Estimated area of MUX21: 70
    Total estimated area: 278
    Estimated number of DSPs: 0

  Total cycles             : 485 cycles
  Number of executions     : 21
  Average execution        : 23 cycles
C++ with GCC and -O5
$ nix run github:zebreus/bachelor-thesis#bambu -- src/min_max_cpp.cpp --top-fname=min_max_cpp --generate-tb=src/testbench.xml --simulate --simulator=VERILATOR -v2 --compiler=I386_GCC8 -O5
  Scheduling Information of function min_max_cpp:
    Number of control steps: 9
    Minimum slack: 3.0950000000000006
    Estimated max frequency (MHz): 144.82259232440262

  Module binding information for function min_max_cpp:
    Number of modules instantiated: 19
    Number of performance conflicts: 0
    Estimated resources area (no Muxes and address logic): 189
    Estimated area of MUX21: 70
    Total estimated area: 259
    Estimated number of DSPs: 0

  Total cycles             : 525 cycles
  Number of executions     : 21
  Average execution        : 25 cycles
C++ with GCC and -Os
$ nix run github:zebreus/bachelor-thesis#bambu -- src/min_max_cpp.cpp --top-fname=min_max_cpp --generate-tb=src/testbench.xml --simulate --simulator=VERILATOR -v2 --compiler=I386_GCC8 -Os
  Scheduling Information of function min_max_cpp:
    Number of control steps: 9
    Minimum slack: 3.0950000000000006
    Estimated max frequency (MHz): 144.82259232440262

  Module binding information for function min_max_cpp:
    Number of modules instantiated: 15
    Number of performance conflicts: 0
    Estimated resources area (no Muxes and address logic): 172
    Estimated area of MUX21: 70
    Total estimated area: 242
    Estimated number of DSPs: 0

  Total cycles             : 546 cycles
  Number of executions     : 21
  Average execution        : 26 cycles
Rust with -O5 and -C opt-level=3
$ rustc --emit=llvm-ir --crate-type=lib -C overflow-checks=off -C no-vectorize-loops -C target-cpu=generic -C panic=abort -C opt-level=3 src/min_max_rust.rs -o build/min_max_rust_speed.ll
$ nix run github:zebreus/bachelor-thesis#bambu -- build/min_max_rust_speed.ll --top-fname=min_max_rust --generate-tb=src/testbench_rust.xml --simulate --simulator=VERILATOR -v2 --compiler=I386_CLANG12 -O5
  Scheduling Information of function min_max_rust:
    Number of control steps: 14
    Minimum slack: 0.37826399699999946
    Estimated max frequency (MHz): 103.93134873875212

  Module binding information for function min_max_rust:
    Number of modules instantiated: 116
    Number of performance conflicts: 12
    Estimated resources area (no Muxes and address logic): 1190
    Estimated area of MUX21: 72
    Total estimated area: 1262
    Estimated number of DSPs: 0

  Total cycles             : 342 cycles
  Number of executions     : 21
  Average execution        : 16 cycles
Rust with -Os and -C opt-level=s
$ rustc --emit=llvm-ir --crate-type=lib -C overflow-checks=off -C no-vectorize-loops -C target-cpu=generic -C panic=abort -C opt-level=s src/min_max_rust.rs -o build/min_max_rust_size.ll
$ nix run github:zebreus/bachelor-thesis#bambu -- build/min_max_rust_size.ll --top-fname=min_max_rust --generate-tb=src/testbench_rust.xml --simulate --simulator=VERILATOR -v2 --compiler=I386_CLANG12 -Os
  Scheduling Information of function min_max_rust:
    Number of control steps: 8
    Minimum slack: 2.6688639990000045
    Estimated max frequency (MHz): 136.40450809582526

  Module binding information for function min_max_rust:
    Number of modules instantiated: 28
    Number of performance conflicts: 0
    Estimated resources area (no Muxes and address logic): 238
    Estimated area of MUX21: 68
    Total estimated area: 306
    Estimated number of DSPs: 0

  Total cycles             : 464 cycles
  Number of executions     : 21
  Average execution        : 22 cycles
Idiomatic Rust with -Os and -C opt-level=s
$ rustc --emit=llvm-ir --crate-type=lib -C overflow-checks=off -C no-vectorize-loops -C target-cpu=generic -C panic=abort -C opt-level=s src/min_max_rust_idiomatic.rs -o build/min_max_rust_idiomatic_size.ll
$ nix run github:zebreus/bachelor-thesis#bambu -- build/min_max_rust_idiomatic_size.ll --top-fname=min_max_rust_idiomatic --generate-tb=src/testbench_rust_idiomatic.xml --simulate --simulator=VERILATOR -v2 --compiler=I386_CLANG12 -Os
  Scheduling Information of function min_max_rust_idiomatic:
    Number of control steps: 7
    Minimum slack: 2.6688639990000045
    Estimated max frequency (MHz): 136.40450809582526

  Module binding information for function min_max_rust_idiomatic:
    Number of modules instantiated: 27
    Number of performance conflicts: 0
    Estimated resources area (no Muxes and address logic): 202
    Estimated area of MUX21: 0
    Total estimated area: 202
    Estimated number of DSPs: 0

  Total cycles             : 462 cycles
  Number of executions     : 21
  Average execution        : 22 cycles
Idiomatic Rust with -O5 and -C opt-level=3
$ rustc --emit=llvm-ir --crate-type=lib -C overflow-checks=off -C no-vectorize-loops -C target-cpu=generic -C panic=abort -C opt-level=3 src/min_max_rust_idiomatic.rs -o build/min_max_rust_idiomatic_speed.ll
$ nix run github:zebreus/bachelor-thesis#bambu -- build/min_max_rust_idiomatic_speed.ll --top-fname=min_max_rust_idiomatic --generate-tb=src/testbench_rust_idiomatic.xml --simulate --simulator=VERILATOR -v2 --compiler=I386_CLANG12 -O5
  Scheduling Information of function min_max_rust_idiomatic:
    Number of control steps: 13
    Minimum slack: 0.37826399699999946
    Estimated max frequency (MHz): 103.93134873875212

  Module binding information for function min_max_rust_idiomatic:
    Number of modules instantiated: 98
    Number of performance conflicts: 6
    Estimated resources area (no Muxes and address logic): 861
    Estimated area of MUX21: 69
    Total estimated area: 930
    Estimated number of DSPs: 0

  Total cycles             : 322 cycles
  Number of executions     : 21
  Average execution        : 15 cycles

Which design is the smallest?

While the idiomatic version is not faster than the non-idiomatic version, it takes up the smallest area of all synthesized designs. It is also more readable and easier to understand.

The area of our design is an important metric, because bigger circuits mean that we need a bigger, more expensive FPGA. Conversely, the more area we have left, the more other things we can fit on the FPGA. So the area is basically the cost of our design. While the exact amount of resources the circuit takes up depends on how we synthesize the hardware design, bambu gives us an estimate of how much area the design will use. The following chart shows the area of the designs we synthesized:

The designs seem to be divided into two groups: the small designs (~250 logic elements) and the big designs (~1000 logic elements). The big designs are the clang and Rust designs optimized for speed. The small designs are all the circuits optimized for size, plus the GCC design optimized for speed.

It appears that GCC always generates small designs. This could be because both rustc and clang use LLVM as a backend, so their output is probably optimized similarly. GCC does not use LLVM but has its own optimization passes, so it probably optimizes differently. It will be interesting to see whether the GCC design optimized for speed still has speed comparable to the other speed-optimized designs, despite its significantly smaller footprint.

The versions optimized for size are smaller than the versions optimized for speed. As expected, the idiomatic Rust design is smaller than the C-style Rust design at both optimization levels. This is probably because the compiler can do more optimizations on the idiomatic version.

The smallest design of the bunch is the idiomatic Rust design optimized for size.

Which design is the fastest?

We tested each design with test cases ranging from 0 to 20 elements. The following chart shows the average number of cycles for each circuit.

We can see that the designs optimized for speed are a bit faster than the designs optimized for size. An exception to this is, again, the GCC design optimized for speed, which requires a number of cycles comparable to the LLVM-based designs optimized for size. It is still a bit faster than the GCC design optimized for size. It seems like the GCC backend for bambu prioritizes size over speed.

The GCC designs take the most cycles. Unexpectedly, even the GCC design optimized for speed requires more cycles than the LLVM designs optimized for size. But the GCC design optimized for speed is also smaller than the clang and C-style Rust designs optimized for size, so maybe it could have been expected that it is slower too.

Again, the idiomatic Rust design performs about the same as or a bit better than the C-style Rust design at both optimization levels.

We can see that the designs divide into two groups again. The low-cycle designs require between 15 and 17 cycles, and the high-cycle designs require between 22 and 26 cycles. These two groups correspond exactly to the two groups we saw in the area chart: the low-cycle designs are the big designs, and the high-cycle designs are the small designs.

If we look at a more detailed chart for every design we can see that they divide into two groups:

While they all behave quite similarly for fewer than four inputs, they start to diverge as the length of the input array increases. The small designs seem to require around \$2 * "input_length"\$ cycles, while the big designs only take around \$1 * "input_length"\$ cycles.

For the small designs, the number of cycles increases by two for each additional input element. The big designs behave similarly, but on every fourth additional input element the number of cycles decreases by two instead. I speculate that the small designs only have one operation that processes a single input element. The big designs could, in addition to that, have an operation that processes four elements at once, which they use for the bulk of the computation. If the input length is not a multiple of four, they have to process the last few elements one by one. This could be the result of LLVM unrolling the loop when set to optimize for speed instead of size.

The big designs containing one operation that processes one element and one that processes four would also explain why they are roughly five times larger: they include the main logic of the loop body 1+4 times, while the small designs include it only once. They are probably not exactly five times bigger, because some control logic stays the same, or only doubles, for the designs with two operations. There are probably also optimizations that can be applied when the same operation is instantiated multiple times, leading to additional space savings.

Note

Here it seems like the loop is converted into two operations, one that implements a single iteration and one that implements four iterations. It would be interesting to have the ability to specify in the Rust code how big these chunks should be. Maybe bambu even has this feature for C++ already; I have not looked into it yet.

Which design clocks the highest?

Besides the number of cycles a design takes, the other relevant metric for speed is the frequency at which those cycles can be executed. While the real maximum frequency is highly dependent on the hardware and the actual placement and routing of the logic, bambu generates an estimate of the maximum frequency for each design. The following chart shows the maximum frequency of each design:

Interestingly, the GCC designs top this chart. While they performed quite poorly on the other metrics, they can run at a slightly higher clock frequency than the LLVM designs. The LLVM designs optimized for size are the second fastest, while the ones optimized for speed are the slowest.

Which design is the best?

There seems to be a connection between the size and the frequency of the designs: the smaller the design, the higher its clock frequency. This compensates a bit for the fact that the smaller designs on average require more cycles per execution. To get the estimated maximum performance for each design, we divide the maximum frequency by the average number of cycles per execution for our test cases:

As expected, this shows that the performance gap between the big designs and the small designs is not as large as the cycle counts alone initially suggested.

We can see that the big designs are still a bit faster, but not by as much as we expected. The big clang design is even slower than some small designs.
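
To make this concrete with bambu's own estimates from above: the idiomatic Rust design optimized for speed manages about 103.93 MHz / 15 cycles ≈ 6.9 million executions per second, while its size-optimized counterpart reaches about 136.40 MHz / 22 cycles ≈ 6.2 million. That is a difference of only around 10%, even though the cycle counts differ by almost 50%.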

If we plot this against the area, we can see that the difference in performance (\$ +- 20% \$) is quite small compared to the difference in area (\$ +- 500% \$).

It is interesting to see that there are only two useful designs. If we want maximum speed, we should choose the idiomatic Rust design optimized for speed; if we want the smallest size, the choice would be the idiomatic Rust design optimized for size. The other designs are not viable options, because one of these two designs will always beat them on either metric.

To get the most out of our FPGA, we want to maximize space efficiency. We can calculate it by dividing the executions per second by the area. As this metric is quite abstract, we will only compare it relative to the other designs:

Note

You can click the bars in the chart to set the baseline to compare against.

If we compare the designs for space efficiency, we can see that the idiomatic Rust version is the most space-efficient design. Surprisingly enough, the GCC designs do quite well here too.
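
To put rough numbers on that, using the estimates from above: the size-optimized idiomatic Rust design achieves about 6.2 million executions per second on an area of 202, roughly 31000 executions per second per area unit. The speed-optimized variant only reaches about 6.9 million / 930 ≈ 7500, and the two GCC designs land at around 22000 to 23000.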

The big designs are not very space-efficient, which was to be expected, as they sacrifice space for speed. For this problem, the tradeoff is not worth it, because finding the minimum/maximum element parallelizes well: we could simply run two of the small designs simultaneously and get the same performance as a big design in a smaller area. We would need to implement some wrapper logic for that, but it would probably be quite small compared to the rest of the design. A sketch of the idea follows below.
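
To sketch the idea in software terms (plain Rust, just to illustrate the dataflow; in hardware, the two halves would be two small module instances running in parallel, and the merge step would be the wrapper logic):

// Conceptual sketch: split the input between two min/max units and merge.
fn min_max_parallel(numbers: &[i32]) -> (i32, i32) {
    let (left, right) = numbers.split_at(numbers.len() / 2);
    // Each `find` call stands in for one small min/max design.
    let find = |chunk: &[i32]| {
        chunk
            .iter()
            .fold((i32::MIN, i32::MAX), |(max, min), &x| (max.max(x), min.min(x)))
    };
    let (left_max, left_min) = find(left);
    let (right_max, right_min) = find(right);
    // The merge step corresponds to the wrapper logic mentioned above.
    (left_max.max(right_max), left_min.min(right_min))
}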

Conclusion

It seems like high-level synthesis from Rust is possible. In our simple example, it even performed slightly better than the usual input languages like C/C++. In future work, we could try to synthesize more complex designs and see if the results hold up, and experiment with further optimizations to squeeze out better results.

The process could benefit from some kind of annotations in the Rust code to help the HLS tool synthesize the code better. I don't know if bambu supports that, but I will look into it in the future. Currently, all the parameter names get lost in the process; it would be nice if we could keep them. Also, we have no idea how the designs actually work, as we treated the generated Verilog code as a black box. It would probably be interesting to look at the generated RTL and try to understand how the designs work.

We could try to use a different HLS tool, like Vivado HLS, which is probably way more mature and capable than bambu. The main restriction is probably that it needs to be able to take LLVM IR as input. I am not sure if there is another free and open-source HLS tool that can process LLVM IR. It may be feasible to generate C code from the LLVM IR and then try using that with a C HLS tool.


And so, our first journey into the realm of high-level synthesis comes to a close, and we have only scratched the surface. We have discovered that Rust, with the aid of the powerful PandA and the wyvern of LLVM, unlocks some of the potential of FPGAs. But there is still much to be explored, and our quest for faster and more efficient calculations continues.

In the next chapter of our journey, we will delve deeper into the mystical world of HLS and FPGAs. We will put our newfound knowledge to the test by crafting a more complex design, a simple blinker, in Rust. Dangers still lurk in the shadows, and our work is far from done. But we'll be ready, for our wit and cunning will guide us through the maze of memory, and the challenges posed by switch and match will only make our journey all the more exhilarating.

Stay tuned, dear reader, as in just a week’s time, the next chapter of our journey will be available, and I promise it will be just as thrilling.