Execution of memory-bound workloads on GPUs via software managed cache

We present a technique for designing memory-bound algorithms with high data reuse on Graphics Processing Units (GPUs) equipped with close-to-ALU software-managed memory. The approach is based on the efficient use of this memory through the implementation of a software-managed cache. We also present an analytical model for performance analysis of such algorithms.
We apply this technique to the implementation of a GPU-based solver for the sum-product, or marginalize a product of functions (MPF), problem, which arises in a wide variety of real-life applications in artificial intelligence, statistics, image processing, and digital communications. Our motivation for accelerating MPF originated in the analysis of genetic diseases, which in some cases requires years to complete on modern CPUs. Computing MPF is similar to computing a chain product of multi-dimensional matrices, but is more difficult because of its complex data-dependent access pattern, high data reuse, and low compute-to-memory access ratio.
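As a rough illustration of what an MPF computation looks like, and why it resembles a matrix product, the sketch below multiplies two small functions stored as lookup tables and sums out their shared variable. It is a toy under stated assumptions, not the solver presented in the talk: the function name `mpf`, the table layout, and the example values are all hypothetical.

```python
from itertools import product


def mpf(f, g, x_vals, y_vals, z_vals):
    """Return h(x, z) = sum over y of f(x, y) * g(y, z).

    f and g are hypothetical lookup tables keyed by value tuples.
    The shared variable y is marginalized (summed) out.
    """
    return {
        (x, z): sum(f[(x, y)] * g[(y, z)] for y in y_vals)
        for x, z in product(x_vals, z_vals)
    }


# Two small functions over binary variables (illustrative values only).
vals = (0, 1)
f = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
g = {(0, 0): 5.0, (0, 1): 6.0, (1, 0): 7.0, (1, 1): 8.0}

h = mpf(f, g, vals, vals, vals)
```

With two-variable functions this reduces to an ordinary 2x2 matrix product. Every entry of `f` and `g` is read more than once, which hints at the data reuse mentioned above.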
The user-managed cache yields an order-of-magnitude speedup over the version without such a cache. Overall, our implementation of MPF on an NVIDIA GTX285 achieves up to a 40-fold speedup (double precision) over an Intel Core2 Duo CPU at 2.40 GHz with a 2 MB L2 cache. It saturates the double-precision unit of the GPU (effective performance of 78 GFLOP/sec), with single-precision performance as high as 104 GFLOP/sec. In log scale, the speedup over the CPU version is up to 3,000.
Finally, we present preliminary results from running a distributed version of the MPF solver on the heterogeneous TSUBAME supercomputer, using 120 GPUs and 1024 CPUs.
A preliminary version of this work was presented at ICS08 in Greece. Joint work with John Owens, UC Davis.

Speaker Details

Mark Silberstein: http://www.cs.technion.ac.il/~marks
I’m a final-year PhD student at the Technion, Israel Institute of Technology, supervised by D. Geiger and A. Schuster. My research focuses on high-performance parallel and distributed computing, with applications to genetic linkage analysis. The outcome of my work is Superlink-online (bioinfo.cs.technion.ac.il/superlink-online), a distributed system for genetic linkage analysis used by geneticists worldwide. The system employs thousands of non-dedicated computers to run the analysis in parallel, and has recently been extended to work on GPUs. Before starting my PhD, I worked at the IBM Haifa Research Lab in the Grid computing group.

Date: November 20, 2009
Speakers: Mark Silberstein
Affiliation: Technion, Israel Institute of Technology

- Jeff Running
