Rhea: Automatic IO Filtering for Optimizing Cloud Analytics

Duration  00:05:37

Date recorded  14 November 2012

Hadoop is the de facto standard for processing large datasets. In close collaboration with the Microsoft Hadoop team, we are developing an Azure service to accelerate Hadoop jobs. We take advantage of the observation that many Hadoop jobs are very selective and operate on just a fraction of their input data. In this cross-group project, we have used static analysis techniques to examine the map phase of a job and automatically extract a filter that identifies the rows and columns of the input data that the job actually uses. Instead of sending all data from Azure storage to the compute cluster, we automatically identify and send only the subset of interest. Using our filters on some example jobs, we have reduced network overheads by a factor of 5 and job completion times by a factor of 3 to 4.
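The idea of a row and column filter can be illustrated with a minimal Python sketch. It is not Rhea's implementation: the filter below is written by hand, whereas Rhea extracts it automatically through static analysis of the job's map phase. The sample data, the toy `mapper`, and the names `row_filter`, `KEPT_COLUMNS`, and `storage_side_filter` are all hypothetical. The sketch only shows that the map function produces the same output on the filtered data while fewer bytes need to leave storage.

```python
import csv
import io

# Hypothetical input: comma-separated rows of (date, region, product, units).
RAW = """2012-11-01,EU,widget,4
2012-11-01,US,gadget,7
2012-11-02,EU,gadget,2
2012-11-02,US,widget,9
"""


def mapper(row):
    """Toy map function: emits (product, units) for EU rows only."""
    if row[1] == "EU":
        yield row[2], int(row[3])


def row_filter(row):
    """Hand-written stand-in for an extracted row filter.

    It keeps exactly the rows for which the mapper can emit output.
    """
    return row[1] == "EU"


# Columns the mapper reads (region, product, units); the date is never used.
KEPT_COLUMNS = (1, 2, 3)


def storage_side_filter(text):
    """Drop uninteresting rows and blank out unused columns.

    Unused columns are emptied rather than removed, so the column
    positions the unmodified mapper relies on stay the same.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        if row_filter(row):
            writer.writerow(
                [value if i in KEPT_COLUMNS else "" for i, value in enumerate(row)]
            )
    return out.getvalue()


def run_map(text):
    """Run the toy mapper over every row of the given CSV text."""
    results = []
    for row in csv.reader(io.StringIO(text)):
        results.extend(mapper(row))
    return results


filtered = storage_side_filter(RAW)
```

Here `run_map(filtered)` returns the same pairs as `run_map(RAW)`, and `filtered` is shorter than `RAW`, which is the property that lets the filter run next to storage without changing the job's result.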

©2012 Microsoft Corporation. All rights reserved.