Newby, Gregory B.; El-Ghazawi, Tarek; El-Araby, Esam; Taher, Mohamed; Abouellail, Mohamed. 2007. "High Level Programming for Reconfigurable Computers: Readiness for High Performance Computing." Third Reconfigurable Supercomputing Conference. Manchester, England, March 28-29.

This paper describes a hands-on investigation of high-level programming environments for reconfigurable computing. Our main goal was to assess the relative merits of the different programming environments, which represented the best of those currently available. Our evaluation was in the context of high performance computing, in which software developers would not necessarily be expected to be facile with low-level languages for FPGAs (i.e., Verilog or VHDL). A range of programming tasks was investigated, including implementation of common benchmarks and library functions, as well as an attempt to accelerate a real-world supercomputing application. In this abstract, we focus on our methodology for fundamental performance measures and evaluations.

We considered three representative commercially available high-level tools: Impulse-C, Mitrion-C, and DSPLogic. All were evaluated on our Cray XD1, with software design, simulation, and compilation on Windows workstations or the XD1, as required by each tool. These tools were selected to represent imperative programming, functional programming, and schematic programming. In spite of the disparity in concepts behind these tools, our methodology was able to uncover the basic differences among them and assess their comparative performance, utilization, and ease of use.

In order to create programs for reconfigurable computing from high-level languages (HLLs), some sort of compiler (perhaps with multiple elements, pre-processing steps, etc.) is used.
Unlike compilers with a single hardware target, such as when gcc generates an executable binary for an i386-compatible Linux system, programming environments for HLLs and reconfigurable computing need to compile for multiple targets. They use a special compiler module provided by the FPGA vendor, and must address several tasks beyond ordinary compilation:

* Design partitioning between the microprocessor system and the reconfigurable processor (i.e., deciding which parts of the program run on each);
* Specification of communication and synchronization between both systems, including such tasks as reading and writing shared memory or accessing external devices;
* Design partitioning and communication within the reconfigurable processor system (i.e., among multiple FPGAs); and
* Defining the sequence of FPGA reconfigurations.

The general structure of the XD1 we used is as follows: one chassis houses six compute nodes. Each node has two AMD Opterons at 2.4 GHz and RapidArray Processors (RAPs) that handle communication. The two Opterons on a node are connected via AMD's HyperTransport with a bandwidth of 3.2 GB/s, forming a 2-way SMP. Our XD1 featured one chassis in which all six nodes had an application acceleration processor (FPGA).

Three workloads were selected for implementation on the XD1 using the selected tools. The first workload was a simple pass-through implementation that read input from the microprocessor and wrote it back unchanged; it was used to measure the overhead caused by each tool on the FPGA with respect to the percent of circuitry utilized, as well as to measure the clocking rates reached by each tool. This provided an initial idea of the performance for each tool. The second application was a discrete wavelet transform (DWT). DWT is composed of two FIR filters and two down-samplers. The two filters were preloaded with the high-pass and low-pass coefficients defining the particular wavelet used for the transform. The last application was the data encryption standard (DES) algorithm. DES takes a 64-bit plaintext block (data) and a 64-bit key as inputs and generates a 64-bit ciphertext block (encrypted data). DES consists of 16 identical rounds supplemented by a few auxiliary transformations.
We also assessed the programming model, which is the abstract view of the hardware presented to the programmer by the programming tool. A programming model thus defines which parts of the hardware architecture become visible to the programmer and fall under his or her direct control. In a reconfigurable computer, the programming model determines whether (and how) the programmer can control data transfers between the FPGA and the onboard memory, the FPGA and the microprocessor memory, and the FPGA and the microprocessor. The XD1 provides a number of transfer modes between the microprocessor and the FPGA, depending on the initiator of the transfer. The microprocessor can read from and/or write to the FPGA's local memory space (i.e., internal registers, internal BRAMs, and external memory). Additionally, the FPGA can read from and/or write to the microprocessor's memory. The tools differ in the explicitness or implicitness with which they expose these transfers: some hide the details of implementing such transfers, while others leave them as the responsibility of the programmer. For example, DSPlogic implicitly handles the transfer scenarios using the best mode (i.e., a write-only architecture), guaranteeing the highest possible throughput.

The metrics we used to evaluate the efficiency of the tools were the synthesized clock frequency, resource usage (slice utilization), and measured end-to-end throughput. The full paper presents detailed findings for the three tools across the different workloads. It also discusses our less structured investigations, in which the different tools were used to implement real-world applications. These included some standard benchmarks for I/O, sorting/searching, and random number generation, as well as an investigation into the suitability of FPGAs for the Smith-Waterman (S-W) algorithm. An important aspect of the S-W work was that our collaborators at the US National Institutes of Health provided detail on real-world criteria for benchmarking.
We discovered that, despite common knowledge that FPGAs excel at this task, the task as typically implemented does not meet the needs of real-world users of the algorithm. Instead, real-world use requires a feedback mechanism that is impractical for current generations of FPGAs due to their limited circuitry.

Another aspect of our investigation was to accelerate some supercomputing applications in active use at our center. The key findings here were that high-end applications (running on hundreds to thousands of processors, typically utilizing MPI for communication among processors) are already highly optimized and quite complex, which makes them difficult to port to FPGAs. Lessons learned included the following. First, the designer must keep in mind that optimized hardware C is not the same as optimized software C. More generally, we believe the extent to which relatively good performance is achievable using HLLs is tempered by the many large applications already seeing widespread use in the high-performance computing world.

For the near future, we see two areas of great utility for these HLLs. The first is to implement relatively small programs, either from scratch or by porting, to gain better performance through hardware acceleration. This seems to be the emphasis of the tools we examined. A second target for HLL application development, which we see as crucial for widespread adoption of hardware acceleration in high-performance computing, is to develop libraries of common functions (e.g., BLAS and LAPACK). Such functions would make the use of FPGAs as coprocessors much more transparent and less laborious. This approach, already being undertaken in the GPU world as well as by ClearSpeed, seems promising for high-performance computing.