
Big-data processing is performed by composing a sequence of data-transformation steps into a processing pipeline, which is then executed on a distributed system by exploiting parallelism at scale. Many decisions made by end-users (e.g., resource provisioning in the cloud) and by cluster-management frameworks (e.g., meeting job SLOs) require a good understanding of the performance characteristics of such pipelines. The goal of Perforator is to enable such performance-driven decisions.

Data-size estimation for big-data pipelines

Good estimates of the data sizes processed through a big-data pipeline are crucial for efficient resource management. A priori estimates are needed for applications such as execution-time prediction, SLA-driven scheduling, and resource provisioning.

Data-size estimation for big-data pipelines is challenging because (1) big-data query languages support arbitrary user-defined functions (UDFs); we observe that about 75% of production pipelines contain UDFs; and (2) the storage layers used in big-data systems are non-relational and support custom serialization, compression, and storage formats, so data sizes depend not only on the number of rows but also on the format used.

We propose a black-box technique for directly estimating the size of data read, written, and shuffled by different stages of a big-data pipeline. The estimator uses curve fitting to learn the relationship between the data sizes at the input and output of operators by observing their behavior on random samples of the actual data. As with any estimation problem, there is a trade-off between the cost and the accuracy of estimation; we explore this trade-off and arrive at a design that is well suited to big-data systems. We implement our estimator as part of the Hive query engine and plug it into multiple big-data systems (Hadoop, Tez, Impala, Spark).
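As a minimal sketch of the curve-fitting step, assuming (hypothetically) that an operator's output size grows roughly linearly with the fraction of input sampled, one could fit a simple model to a few sampled runs and extrapolate to the full input. The sample fractions and output sizes below are made-up illustrative numbers, not measurements from the system:

```python
import numpy as np

# Hypothetical observations: run the operator on random samples of the input
# and record (sample fraction, observed output size in MB). These numbers are
# synthetic, chosen only to illustrate the extrapolation.
sample_fracs = np.array([0.01, 0.02, 0.05, 0.10])
output_mb    = np.array([13.0, 25.0, 61.0, 121.0])

# Fit an affine model output = a * fraction + b by least squares.
a, b = np.polyfit(sample_fracs, output_mb, deg=1)

# Extrapolate to the full input (fraction = 1.0) for the a-priori estimate.
predicted_full_mb = a * 1.0 + b
print(round(predicted_full_mb, 1))  # 1201.0 for this synthetic data
```

In practice the estimator would choose among several candidate models (constant, linear, superlinear) per operator, since UDFs and joins can expand or shrink data in different ways.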

Evaluation on BigBench and production big-data pipelines shows that this technique can estimate data sizes accurately while expending as little as 5% of the resource-time product needed to run the actual pipeline. We predict data sizes to within a factor of 10 in 97% of cases and significantly outperform the current estimator used by the Hive optimizer. When employed for resource allocation, our estimator matches the costs chosen by an oracle to within 7%.

Predictive performance models for SLA-aware cluster management

Sharing large big-data clusters is an effective way to amortize costs and maximize utility. However, this requires large production jobs with strict SLAs to execute side-by-side with latency-sensitive interactive computations. Currently deployed scheduling mechanisms, such as fairness and priorities, are insufficient to guarantee production SLAs, minimize best-effort job latency, and maintain high cluster utilization.

To address this problem we propose a system that: 1) automatically infers the resource needs of a broad class of production jobs by means of profiling and white-box modeling, 2) dynamically requests resources and copes with changing cluster conditions by renegotiating the resources dedicated to each job, and 3) enhances Hadoop admission control to guarantee that enough resources are set aside to meet all SLAs.
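A hedged sketch of step (3): assuming a job's inferred needs are expressed as a resource "skyline" (containers required per time step), admission control can check that the skyline fits under the remaining cluster capacity over the requested window before setting those resources aside. The function and numbers below are illustrative, not the actual Hadoop/YARN API:

```python
# Illustrative admission control over a discrete planning horizon.
# free_capacity[t] is the number of unreserved containers at time step t.

def try_admit(free_capacity, skyline, start):
    """Admit the job if its skyline fits starting at `start`; reserve if so."""
    window = free_capacity[start:start + len(skyline)]
    if len(window) < len(skyline):
        return False  # job would run past the planning horizon
    if any(need > free for need, free in zip(skyline, window)):
        return False  # admitting it could violate an already-admitted SLA
    for t, need in enumerate(skyline):
        free_capacity[start + t] -= need  # set the resources aside
    return True

# 6-step planning horizon on a hypothetical 100-container cluster.
capacity = [100] * 6
print(try_admit(capacity, [40, 60, 40], 0))  # True: fits, capacity reserved
print(try_admit(capacity, [70, 70], 1))      # False: only 40 free at t=1
```

A real reservation system would additionally search over feasible start times and support malleable allocations, but the core feasibility check has this shape.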

We integrate our system into Hadoop, leveraging recent advances in the YARN resource manager, and demonstrate broad applicability (Hive, Oozie, MapReduce, Pig) and high precision when handling production workloads from popular benchmarks such as TPC-H and BigBench.

Dharmesh Kakadia