Comet: Batched Stream Processing for Data Intensive Distributed Computing

Bingshen He, Mao Yang, Zhenyu Guo, Rishan Chen, Bing Su, Wei Lin, and Lidong Zhou

Abstract

Batched stream processing is a new distributed data processing paradigm that models recurring batch computations on incrementally bulk-appended data streams. The model is inspired by our empirical study on a trace from a large-scale production data-processing cluster; it allows a set of effective query optimizations that are not possible in a traditional batch processing model.

We have developed a query processing system called Comet that embraces batched stream processing and integrates with DryadLINQ. We used two complementary methods to evaluate the effectiveness of optimizations that Comet enables. First, a prototype system deployed on a 40-node cluster shows an I/O reduction of over 40% using our benchmark. Second, when applied to a real production trace covering over 19 million machine-hours, our simulator shows an estimated I/O saving of over 50%.

Details

Publication typeProceedings
Published inACM Symposium on Cloud Computing 2010
PublisherAssociation for Computing Machinery, Inc.

Previous versions

Bingsheng He, Mao Yang, Zhenyu Guo, Rishan Chen, Bing Su, Wei Lin, and Lidong Zhou. Comet: Batched Stream Processing in Data Intensive Distributed Computing, Microsoft Research, December 2009.

> Publications > Comet: Batched Stream Processing for Data Intensive Distributed Computing