The HLS approach to FPGA development is to only abstract portions of the application that can be easily expressed in a C/C++ environment. The HLS tool flow is available for essentially any BittWare board through the use of Vivado (Xilinx) or Intel (Quartus) tools.
To succeed with HLS, it’s important to recognize the portions of your application that will be a good fit. Guidelines include:
- Target uses are, generally, IP blocks that are defined in a high-level language to begin with. A math algorithm would work well or, as in our RSS block, some network protocol processing.
- Another category of uses are blocks that are poorly defined and thus may require multiple rounds of implementation. The biggest benefit here is allowing the HLS tool to automatically pipeline the resulting native FPGA code, often emitting fewer stages than quickly hand-coding a pipeline will. Also, when it comes time to modify the hand-coded pipeline, delay changes on one parallel path can have a ripple effect on everything. Using the HLS tool to automatically pipeline a second time, from scratch, eliminates that headache.
- Finally, an HLS flow makes it easier to port code between FPGA brands and speed grades. This is because HLS automatically generates the appropriate number of pipeline stages—something you need to manually specify when working with Verilog or VHDL.
The current limitations of HLS are clearly that its scope is limited to IP blocks. The application team would still require RTL for other components, although leveraging something like BittWare’s SmartNIC Shell for the RTL parts, a user may be able to define their unique application entirely in HLS. It should also be noted that HLS is a poor choice for the most simplistic of codes or larger designs that consist mostly of pre-optimized components.
Our Application: Networking RSS on an FPGA
What is RSS? RSS stands for “Receiver Side Scaling.” It is a hashing algorithm to efficiently distribute network packets across multiple CPUs. RSS is a feature on modern Ethernet cards, and generally implements the specific Toeplitz hash defined by Microsoft.
The environment that hosts our RSS application is BittWare’s SmartNIC Shell. The SmartNIC Shell is designed to give users head start when building an FPGA-based networking application. It provides users an optimized FPGA-based 100G Ethernet pipeline, including DPDK for host interaction. All the user needs to do is drop in their application as an IP block.
In this case, BittWare was the user as well, having created as our application an FPGA implementation of RSS. The team creating RSS using the traditional RTL approach and the HLS team both were able to use SmartNIC Shell as their FPGA Ethernet framework and concentrate on the RSS application itself.
BittWare’s RSS Implementation
Our FPGA-based RSS implementation is specifically based on C code found in the DPDK source tree, with the test function for that code also available in the tree. Our RSS application also uses a 64-entry indirection table instead of the more common 128-entry table. What is important for this HLS study is that the function we are moving into the FPGA starts off defined in C. That meets our number one criterion for HLS success—a definition in C or C++.
Grouping Packets Using Tuples
The goal of the the RSS function is to distribute packets among CPUs, keeping related streams of packets together. Different Toeplitz key sets provide different distribution patterns. However, no matter the key set, our RSS function uses each packet’s source and destination IP address and source and destination ports as input. These four components combined are called a 4-tuple.
Note that for our RSS application we are assuming the 4-tuple was already parsed and added to the packet’s metadata. Another SmartNIC Shell module handles this packet classification function. We call that module our “parser” and will be the subject of a separate BittWare white paper.
Our RSS implementation currently accepts a 96-bit field for classification—enough for the 4-tuple of IPv4 source/destination and port. The parser provides zero for fields not available in the packet; if a packet does not include any IP payload, the full 96-bit tuple field is zero.
Many RSS implementations use a 5-tuple instead of a 4-tuple. Doing that would require an additional 8-bits to accommodate the protocol number. HLS users of RSS can easily accommodate that change with minor source code changes. This ability to quickly adapt from 4-tuple to 5-tuple is an example of the number two criteria for HLS success – a requirement for multiple rounds of implementation.