PASS project is a continuing collaboration with the Cosmos team that aims to improve SCOPE script correctness and performance using program analysis techniques, following the inter-disciplinary research direction, among program language, system and database research.
SUDO: Optimizing Data Shuffling in Data-Parallel Computation by Understanding User-Defined Functions
Map/Reduce style data-parallel computation is characterized by the extensive use of user-defined functions for data processing and relies on data-shuffling stages to prepare data partitions for parallel computation. Instead of treating user-defined functions as "black boxes", we propose to analyze those functions to turn them into "gray boxes" that expose opportunities to optimize data shuffling. We identify useful functional properties for user-defined functions, and propose SUDO, an optimization framework that reasons about data-partition properties, functional properties, and data shuffling. We have assessed this optimization opportunity on over 10,000 data-parallel programs used in production SCOPE clusters, and designed a framework that is incorporated it into the production system. Experiments with real SCOPE programs on real production data have shown that this optimization can save up to 47% in terms of disk and network I/O for shuffling, and up to 48% in terms of cross-pod network traffic.
PeriSCOPE: Spotting Code Optimizations in Data-Parallel Pipelines
To minimize the amount of data-shuffling I/O that occurs between the pipeline stages of a distributed data-parallel program, its procedural code must be optimized with full awareness of the pipeline that it executes in. Unfortunately, neither pipeline optimizers nor traditional compilers examine both the pipeline and procedural code of a data-parallel program so programmers must either hand-optimize their program across pipeline stages or live with poor performance. To resolve this tension between performance and programmability, this paper describes PeriSCOPE, which automatically optimizes a data-parallel program's procedural code in the context of data flow that is reconstructed from the program's pipeline topology.
Such optimizations eliminate unnecessary code and data, perform early data filtering, and calculate small derived values (e.g., predicates) earlier in the pipeline, so that less data---sometimes much less data---is transferred between pipeline stages.
We describe how PeriSCOPE is implemented and evaluate its effectiveness on real production jobs.
Cybertron: Awareness between Data and Computation in Data-Parallel Program
Data transferring during shuffling, data de-serialization before real computation and data serialization after real computation are the most significant cost of distributed data-parallel jobs. The transferring and (de-)serialization of unused data and redundant data is totally a waste of network resource and CPU time. The (de-)serialization of data is another waste if real computation can be directly operated atop the serialized data. The broadly usage of sophisticated user defined functions (UDFs) hinders the reasoning of data usage thus blocks the optimization by both programmer and existing optimizer. We present system Cybertron to automatically analyze the data usage information, remove the unused data and encode the used data by providing holistic semantic-aware encoding techniques; and Cybertron further automatically identify real computation which can be operated atop the encoded data format to eliminate the unnecessary data (de-)serialization.
Automatic Partial Aggregation for MapReduce
Partial aggregation is commonly applied to MapReduce-style programs to optimize network I/O by transmitting partially reduced results, which are successively combined into a final result, rather than a whole list of values, which are reduced all at once into one result. Programmers currently enable partial aggregation by manually encoding their reduce functionality into separate reduce and combine functions. However, we have found through an empirical study that such a manual encoding is expensive and error prone.
We propose automatically verifying whether a given MapReduce program, with its original monolithic reduce function, is eligible for partial aggregation, and then if it is, synthesizing the appropriate enabling code. To accomplish this, we prove a necessary and sufficient condition for when partial aggregation is applicable to a MapReduce program, which provides a theoretical foundation for our solution. We next use this condition to show that verification is a standard program verification problem and that synthesis is essentially a nondeterministic program inversion problem that is not well studied. However, we observe that most MapReduce programs can be classified into three non-trivial categories for which we can design special-purpose synthesis algorithms. We have implemented a prototype of our method to evaluate it over simple benchmark programs, showing that for all programs, verification and synthesis can succeed within 10 and seconds, respectively.
SCA: Static Code Analysis for SCOPE Scripts
The execution of failed SCOPE jobs due to improper programming is a huge waste of computation resource. By identifying the frequent failure patterns, like column index out of range and containers collecting data from input rowset, we propose static analysis rules to detect such defects in compile time, with corresponding warnings or even errors to programmers.
- Chang Liu, Jiaxing Zhang, Hucheng Zhou, Sean McDirmid, Zhenyu Guo, and Thomas Moscibroda, Automating Distributed Partial Aggregation, SOCC, November 2014
- Tian Xiao, Zhenyu Guo, Hucheng Zhou, Jiaxing Zhang, Xu Zhao, Chencheng Ye, Xi Wang, Wei Lin, Wenguang Chen, and Lidong Zhou, Cybertron: Pushing the Limit on I/O Reduction in Data-Parallel Programs, OOPSLA, June 2014
- Tian Xiao, Jiaxing Zhang, Hucheng Zhou, Zhenyu Guo, Sean McDirmid, Wei Lin, Wenguang Chen, and Lidong Zhou, Nondeterminism in MapReduce Considered Harmful? An Empirical Study on Non-commutative Aggregators in MapReduce Programs, (ICSE SEIP) Software Engineering in Practice , 13 April 2014
- Sihan Li, Hucheng Zhou, Haoxiang Lin, Tian Xiao, Haibo Lin, Wei Lin, and Tao Xie, A Characteristic Study on Failures of Production Distributed Data-Parallel Programs, in ICSE (SEIP track). Best paper!, ACM/IEEE, 22 May 2013
- Zhenyu Guo, Xuepeng Fan, Rishan Chen, Jiaxing Zhang, Hucheng Zhou, Sean McDirmid, Chang Liu, Wei Lin, Jingren Zhou, and Lidong Zhou, Spotting Code Optimizations in Data-Parallel Pipelines through PeriSCOPE, in OSDI, USENIX, 8 October 2012
- Jiaxing Zhang, Hucheng Zhou, Rishan Chen, Xuepeng Fan, Zhenyu Guo, Haoxiang Lin, Jack Y.Li, Wei Lin, Jingren Zhou, and Lidong Zhou, Optimizing Data Shuffling in Data-Parallel Computation by Understanding User-Defined Functions, in NSDI, USENIX, 25 April 2012
- Jiaxing Zhang, Hucheng Zhou, Rishan Chen, Xuepeng Fan, Zhenyu Guo, Haoxiang Lin, Jack Y. Li, Wei Lin, Jingren Zhou, and Lidong Zhou, Optimizing Data Shuffling in Data-Parallel Computation by Understanding User-Defined Functions, no. MSR-TR-2012-28, April 2012