Performance Modeling and Optimization of Distributed Queries

kaushik rajan, kapil vaswani, and ankush desai


Systems such as Pig, Hive, FlumeJava, Scope and DryadLINQ

provide programmers with declarative abstractions for querying

large amount of data, and a runtime system that can execute

these queries on a cluster of machines. However, this

level of abstraction comes at a cost – the inability to understand,

predict and optimize performance. In this paper, we

propose a performance modeling approach for predicting

the execution time of distributed queries. Our modeling approach

is based on the critical path method.We examine how

distributed query engines exploit data-parallelism and accurately

model features common to several query engines like

locality aware scheduling, asynchronous buffered writes,

pipelined data transfers and replication.We perform detailed

evaluation of our performance models on both DryadLINQ

and Hive. We find that the models can accurately predict

execution time to within 32% of the actual execution time

over a range of TPC-H and real world queries. We demonstrate

how the models can be used to allocate resources and

optimize queries in a cloud environment.


Publication typeTechReport
> Publications > Performance Modeling and Optimization of Distributed Queries