Newby, Gregory B.; El-Ghazawi, Tarek; El-Araby, Esam; Taher, Mohamed; Abouellail, Mohamed. 2007. "High Level Programming for Reconfigurable Computers: Readiness for High Performance Computing." Third Reconfigurable Supercomputing Conference. Manchester, England, March 28-29.

This paper describes a hands-on investigation of high-level programming environments for reconfigurable computing. Our main goal was to assess the relative merits of the different programming environments, which represented the best of those currently available. Our evaluation was in the context of high performance computing, in which software developers would not necessarily be expected to be facile with low-level languages for FPGAs (i.e., Verilog or VHDL). A range of programming tasks was investigated, including implementation of common benchmarks and library functions, as well as an attempt to accelerate a real-world supercomputing application. In this abstract, we focus on our methodology for fundamental performance measures and evaluations.

We considered three representative commercially available high-level tools: Impulse-C, Mitrion-C, and DSPLogic. All were evaluated on our Cray XD1, with software design, simulation, and compilation on Windows workstations or the XD1, as required by each tool. These tools were selected to represent imperative programming, functional programming, and schematic programming. In spite of the disparity in concepts behind these tools, our methodology was able to uncover the basic differences among them and assess their comparative performance, utilization, and ease of use.

In order to create programs for reconfigurable computing from high-level languages (HLLs), some sort of compiler (perhaps with multiple elements, pre-processing steps, etc.) is used.
Unlike compilers with a single hardware target, such as when gcc generates an executable binary for an i386-compatible Linux system, programming environments for HLLs and reconfigurable computing need to compile for multiple targets. They use a special compiler module provided by the FPGA vendor, and must address several tasks beyond ordinary compilation:

* Design partitioning between the microprocessor system and the reconfigurable processor (i.e., deciding which parts of the program run on each);
* Specification of communication and synchronization between both systems, including such tasks as reading and writing shared memory or accessing external devices;
* Design partitioning and communication within the reconfigurable processor system (i.e., among multiple FPGAs); and
* Defining the sequence of FPGA reconfigurations.

The general structure of the XD1 we used is as follows: one chassis houses six compute nodes. Each node has two AMD Opterons at 2.4 GHz and RapidArray Processors (RAPs) that handle communication. The two Opterons on a node are connected via AMD's HyperTransport with a bandwidth of 3.2 GB/s, forming a 2-way SMP. Our XD1 featured one chassis in which all six nodes had an application acceleration processor (FPGA).

Three workloads were selected for implementation on the XD1 using the selected tools. The first workload was a simple pass-through implementation that read input from the microprocessor and wrote it back unchanged; it was used to measure the overhead caused by each tool on the FPGA with respect to the percent of circuitry utilized, as well as to measure the clocking rates reached by each tool. This provided an initial idea of the performance for each tool. The second application was a discrete wavelet transform (DWT). DWT is composed of two FIR filters and two down-samplers. The two filters were preloaded with the high-pass and low-pass coefficients defining the particular wavelet used for the transform. The last application was the data encryption standard (DES) algorithm. DES takes a 64-bit plaintext block (data) and a 64-bit key as inputs and generates a 64-bit ciphertext block (encrypted data). DES consists of 16 identical rounds supplemented by a few auxiliary transformations.
We also assessed the programming model, which is the abstract view of the hardware presented to the programmer by the programming tool. A programming model thus defines which parts of the hardware architecture become visible to the programmer and fall under his or her direct control. In a reconfigurable computer, the programming model determines whether (and how) the programmer can control data transfers between the FPGA and the onboard memory, the FPGA and the microprocessor memory, and the FPGA and the microprocessor. The XD1 provides a number of transfer modes between the microprocessor and the FPGA, depending on the initiator of the transfer. The microprocessor can read from and/or write to the FPGA's local memory space (i.e., internal registers, internal BRAMs, and external memory). Additionally, the FPGA can read from and/or write to the microprocessor's memory. The tools differ in the explicitness or implicitness with which they expose these transfers: some hide the details of implementing such transfers, while others leave them as the responsibility of the programmer. For example, DSPlogic implicitly handles the transfer scenarios using the best mode (i.e., a write-only architecture), guaranteeing the highest possible throughput.

The metrics we used to evaluate the efficiency of the tools were the synthesized clock frequency, resource usage (slice utilization), and measured end-to-end throughput. The full paper presents detailed findings for the three tools across the different workloads. It also discusses our less structured investigations, in which the different tools were used to implement real-world applications. These included some standard benchmarks for I/O, sorting/searching, and random number generation, as well as an investigation into the suitability of FPGAs for the Smith-Waterman (S-W) algorithm. An important aspect of the S-W work was that our collaborators at the US National Institutes of Health provided detail on real-world criteria for benchmarking.
We discovered that, despite common knowledge that FPGAs excel at this task, the task as typically implemented does not meet the needs of real-world users of the algorithm. Instead, real-world use requires a feedback mechanism that is impractical for current generations of FPGAs due to their limited circuitry.

Another aspect of our investigation was to accelerate some supercomputing applications in active use at our center. The key findings here were that high-end applications (running on hundreds to thousands of processors, typically utilizing MPI for communication among processors) are already highly optimized and quite complex, which makes them difficult to port to FPGAs. Lessons learned included the following. First, the designer must keep in mind that optimized hardware C is not the same as optimized software C. More generally, we believe the extent to which relatively good performance is achievable using HLLs is tempered by the many large applications already seeing widespread use in the high-performance computing world.

For the near future, we see two areas of great utility for these HLLs. The first is to implement relatively small programs, either from scratch or by porting, to gain better performance through hardware acceleration. This seems to be the emphasis of the tools we examined. A second target for HLL application development, which we see as crucial for widespread adoption of hardware acceleration in high-performance computing, is to develop libraries of common functions (e.g., BLAS and LAPACK). Such functions would make the use of FPGAs as coprocessors much more transparent and less laborious. This approach, already being undertaken in the GPU world as well as by ClearSpeed, seems promising for high-performance computing.