Automating Distributed Partial Aggregation

SOCC 2014: 5th ACM Symposium on Cloud Computing, Seattle, Washington |

Published by ACM - Association for Computing Machinery

Partial aggregation is of great importance in many distributed data-parallel systems. Most notably, it is commonly applied by MapReduce programs to optimize I/O by successively aggregating partially reduced results into a final result, as opposed to aggregating all input records at once. In spite of its importance, programmers currently enable partial aggregation by tediously encoding their reduce functionality into separate reduce and combine functions. This is error prone and often leads to missed optimization opportunities.

This paper proposes an algorithm that automatically verifies if the original monolithic reduce function of a MapReduce program is eligible for partial aggregation, and if so, synthesizes enabling partial aggregation code. The key insight behind this algorithm is a novel necessary and sufficient condition for when partial aggregation is applicable to a reduce function. This insight provides us with a formal foundation for an automaton, which derives a satisfiability problem that can be fed into a standard SMT solver. By doing so, we transform the problem of synthesis into a program inversion problem, which is however nondeterministic. Although such inversion is hard to solve in general, we observe that most reducers in practical distributed computing contexts can be classified into a few categories for which we can design efficient synthesis algorithms. Finally, we build and evaluate a prototype of our method to demonstrate its feasibility in the SCOPE distributed data-parallel system.