Sponge: Portable Stream Programming on Graphics Engines

  • Amir Hormati,
  • Mehrzad Samadi,
  • Mark Woh,
  • Trevor Mudge,
  • Scott Mahlke

Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2011)

Graphics processing units (GPUs) provide a low-cost platform for accelerating high performance computations. The introduction of new programming languages, such as CUDA and OpenCL, makes GPU programming attractive to a wide variety of programmers. However, programming GPUs is still a cumbersome task for two primary reasons: tedious performance optimizations and lack of portability. First, optimizing an algorithm for a specific GPU is a time-consuming task that requires a thorough understanding of both the algorithm and the underlying hardware. Unoptimized CUDA programs typically achieve only a small fraction of the peak GPU performance. Second, GPU code lacks efficient portability, as code written for one GPU can be inefficient when executed on another. Moving code from one GPU to another while maintaining the desired performance is a non-trivial task, often requiring significant modifications to account for the hardware differences. In this work, we propose Sponge, a compilation framework for GPUs using synchronous data flow streaming languages. Sponge is capable of performing a wide variety of optimizations to generate efficient code for graphics engines. Sponge alleviates the problems associated with current GPU programming methods by providing portability across different generations of GPUs and CPUs, and a better abstraction of the hardware details, such as the memory hierarchy and threading model. Using streaming, we provide a write-once software paradigm and rely on the compiler to automatically create optimized CUDA code for a wide variety of GPU targets. Sponge’s compiler optimizations improve the performance of the baseline CUDA implementations by an average of 3.2x.
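To make the streaming model concrete, the sketch below shows a hypothetical synchronous data-flow filter (pop four floats, push their average per firing) written directly as a CUDA kernel, roughly the kind of mapping from filter firings to GPU threads that a compiler like Sponge automates. The filter, its pop/push rates, and the launch parameters are illustrative assumptions, not Sponge's actual generated code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define POP_RATE  4   // items consumed per filter firing
#define PUSH_RATE 1   // items produced per filter firing

// Hypothetical filter: each thread executes one firing on its own slice of
// the input stream. The mapping of firings to threads/blocks and the memory
// layout are exactly the target-specific decisions a compiler would make.
__global__ void avg_filter(const float *in, float *out, int firings)
{
    int f = blockIdx.x * blockDim.x + threadIdx.x;
    if (f >= firings) return;

    float sum = 0.0f;
    for (int i = 0; i < POP_RATE; ++i)
        sum += in[f * POP_RATE + i];
    out[f * PUSH_RATE] = sum / POP_RATE;
}

int main()
{
    const int firings = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in,  firings * POP_RATE  * sizeof(float));
    cudaMallocManaged(&out, firings * PUSH_RATE * sizeof(float));
    for (int i = 0; i < firings * POP_RATE; ++i) in[i] = (float)i;

    // Illustrative launch configuration; choosing it per target is part of
    // what an auto-tuning stream compiler would handle.
    int threads = 256;
    int blocks  = (firings + threads - 1) / threads;
    avg_filter<<<blocks, threads>>>(in, out, firings);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);   // expect 1.5 for inputs 0,1,2,3
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```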