Kaushik Rajan, Kapil Vaswani, Dharmesh Kakadia, Subru Krishnan, and Carlo Curino
Systems such as Pig, Hive, FlumeJava, Scope and DryadLINQ provide programmers with declarative abstractions for querying large amounts of data, and a runtime system that can execute these queries on a cluster of machines. However, this level of abstraction comes at a cost: the inability to understand, predict and optimize performance. In this paper, we propose a performance modeling approach for predicting the execution time of distributed queries. Our modeling approach is based on the critical path method. We examine how distributed query engines exploit data-parallelism, and we accurately model features common to several query engines, such as locality-aware scheduling, asynchronous buffered writes, pipelined data transfers and replication. We perform a detailed evaluation of our performance models on both DryadLINQ and Hive. We find that the models predict execution time to within 32% of the actual execution time over a range of TPC-H and real-world queries. We demonstrate how the models can be used to allocate resources and optimize queries in a cloud environment.
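To make the critical path method concrete, the following is a minimal sketch of the general technique, not the performance model developed in this paper. It assumes a query plan can be represented as a DAG of stages with a known per-stage duration; the function name `critical_path`, the stage names, and the durations are all hypothetical, and the sketch ignores the engine features named above (locality, buffering, pipelining, replication).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path(durations, deps):
    """Return (total_time, path) for the longest dependency chain.

    durations: dict mapping stage -> estimated run time (hypothetical units)
    deps: dict mapping stage -> set of stages that must finish first
    """
    finish = {}  # earliest finish time of each stage
    prev = {}    # predecessor on the longest chain into each stage
    graph = {stage: deps.get(stage, ()) for stage in durations}
    # static_order() yields every stage after all of its predecessors.
    for stage in TopologicalSorter(graph).static_order():
        start, best = 0.0, None
        for d in deps.get(stage, ()):
            if finish[d] > start:
                start, best = finish[d], d
        finish[stage] = start + durations[stage]
        prev[stage] = best
    # The stage that finishes last ends the critical path.
    last = max(finish, key=finish.get)
    total = finish[last]
    path = []
    while last is not None:
        path.append(last)
        last = prev[last]
    return total, path[::-1]


# Hypothetical plan: two scans feed a join, followed by an aggregate.
durations = {"scan_a": 40, "scan_b": 15, "join": 25, "aggregate": 5}
deps = {"join": {"scan_a", "scan_b"}, "aggregate": {"join"}}
total, path = critical_path(durations, deps)
print(total, path)
```

In this toy plan the shorter scan (`scan_b`) is off the critical path, so speeding it up would not change the predicted completion time; only stages on the returned path bound the estimate.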