Big Data Analytics Systems

Published by Cambridge University Press | 2014

Performing timely analysis on huge datasets is the central promise of big data analytics. To cope with the high volumes of data to be analyzed, computation frameworks have resorted to “scaling out”: parallelizing analytics so that they execute seamlessly across large clusters. These frameworks automatically decompose analytics jobs into a directed acyclic graph (DAG) of small tasks, and then aggregate the intermediate results from the tasks to obtain the final result. Their ability to do so relies on an efficient scheduler and a reliable storage layer that distributes the datasets across different machines.
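
As a rough illustration of this pattern, the following minimal Python sketch runs a word-count job as a two-stage DAG: several independent leaf tasks each process one partition of the data, and a single downstream task aggregates their intermediate results. The function names (`partition`, `count_words_task`, `aggregate_task`, `run_job`) are hypothetical and chosen for this example, and a local thread pool stands in for the cluster; it is not the API of any particular framework.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_partitions):
    """Split the input into roughly equal slices, one per leaf task."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words_task(lines):
    """Leaf task: produce an intermediate result for one partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def aggregate_task(partials):
    """Downstream task: merge the intermediate results into the final result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_job(records, num_partitions=4):
    """Run a two-stage DAG: parallel leaf tasks, then one aggregation task."""
    parts = partition(records, num_partitions)
    # Assumption: threads stand in for cluster machines in this sketch.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(count_words_task, parts))
    return aggregate_task(partials)


if __name__ == "__main__":
    data = ["big data analytics", "data analytics systems", "big clusters"]
    print(run_job(data, num_partitions=2))
```

In a real system the scheduler decides which machine runs each task, and the storage layer determines where each partition lives; the sketch leaves both out and keeps only the task structure.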

In this chapter, we survey these two aspects, scheduling and storage, which form the foundations of modern big data analytics systems. We describe their key principles and how these principles are realized in widely deployed systems.