Comet: Batched Stream Processing in Data Intensive Distributed Computing

Performance and resource optimization is an important research problem in data intensive distributed computing. We present a new batched stream processing model that captures query

correlations to expose I/O and computation redundancies for optimizations. The model is inspired by our empirical study on a trace from a production large-scale data processing cluster, which reveals significant redundancies caused by strong temporal and spatial correlations among queries.

We have developed Comet, a query processing system that embraces the batched stream processing model for optimizations. We have integrated Comet with DryadLINQ. With its roots in query optimizations for database systems, Comet enables a set of new heuristics and opportunities tailored for distributed computing in DryadLINQ. Optimizations in Comet are effective. The evaluation of a micro-benchmark on a 40-machine cluster shows a 42% reduction in total machine time and over 40% reduction in total I/O. Our simulation on a real trace covering over 19 million machine hours shows an estimated I/O saving of over 50%.

socc22-he.pdf
PDF file

Publisher  Microsoft Research
© 2009 Microsoft Corporation. All rights reserved.

Details

TypeTechReport
NumberMSR-TR-2009-180

Previous Versions

Bingsheng He, Mao Yang, Zhenyu Guo, Rishan Chen, Wei Lin, Bing Su, Hongyi Wang, and Lidong Zhou. Wave Computing in the Cloud, USENIX, April 2009.

> Publications > Comet: Batched Stream Processing in Data Intensive Distributed Computing