Naga K. Govindaraju, Jim Gray, Ritesh Kumar, and Dinesh Manocha
GPUTeraSort sorts billion-record wide-key databases using the data and task parallelism on the graphics processing unit(GPU)to perform memory-intensive and compute-intensive tasks while the CPU performs I/O and resource management. It exploits both the high-bandwidth GPU memory interface and the lower-bandwidth CPU main memory interface to achieve higher aggregate memory bandwidth than purely CPU-based algorithms. It also pipelines disk transfers to achieve near-peak I/O performance. GPUTera-Sort is a two-phase task pipeline: (1) read disk, build keys, sort using the GPU, generate runs, write disk, and (2) read, merge, write. We tested the performance of GPUTeraSort on billion-record files using the standard Sort benchmark. In practice, a 3 GHz Pentium IV PC with 265 NVIDIA 7800 GT GPU is significantly faster than optimized CPU-based algorithms on much faster processors, sorting 60GB for a penny; the best reported PennySort price-performance. These results suggest that a GPU co-processor can significantly improve performance on large data processing tasks.