Making Big Data Analytics Interactive and Real-Time
Reynold Xin, University of California, Berkeley
The rapid growth in data volumes requires new computer systems that scale out across hundreds of machines. While early frameworks, such as MapReduce, handled large-scale batch processing, the demands on these systems have also grown. Users quickly needed to run (1) more interactive ad-hoc queries, (2) more complex multi-pass algorithms (e.g. machine learning and graph processing), and (3) real-time processing on large data streams.
In this talk, we present a single abstraction, resilient distributed datasets (RDDs), that supports all of these emerging workloads by providing efficient and fault-tolerant in-memory data sharing. We have used RDDs to build a stack of computing systems including the Spark parallel engine, Shark SQL processor, Spark Streaming engine, and a new graph processing engine called GraphX. Spark and Shark can run machine learning algorithms and interactive queries up to 100x faster than Hadoop MapReduce, while Spark Streaming enables fault-tolerant stream processing at significantly higher scales than were possible before. More importantly, however, RDDs show that it is possible to *combine* these seemingly diverse workloads in a single processing engine, a powerful feature that we believe will be critical in future data analytics applications.