Optimizing Data Shuffling in Data-Parallel Computation by Understanding User-Defined Functions

Map/Reduce style data-parallel computation is characterized by the extensive use of user-defined functions for data processing and relies on data-shuffling stages that prepare data partitions for parallel computation. Instead of treating user-defined functions as ``black boxes'', we propose to analyze those functions to turn them into ``gray boxes'' that expose opportunities to optimize data shuffling. We identify useful functional properties for user-defined functions, and propose SUDO, an optimization framework that reasons about data-partition properties, functional properties, and data shuffling. We have assessed this optimization opportunity on over 10,000 data-parallel programs used in production SCOPE clusters, and designed a framework that is incorporated it into the production system. Experiments with real SCOPE programs on real production data have shown that this optimization can save up to 47% in terms of disk and network I/O for shuffling, and up to 48% in terms of cross-pod network traffic.

The content will be available after NSDI 2012.

main.pdf
PDF file

Publisher  

Details

TypeTechReport
NumberMSR-TR-2012-28
> Publications > Optimizing Data Shuffling in Data-Parallel Computation by Understanding User-Defined Functions