Intro - By Volker Markl
Part 1 - Query Optimization with MapReduce Functions, Kostas Tzoumas
Abstract: Many systems for big data analytics employ a data flow programming abstraction to define parallel data processing tasks. In this setting, custom operations expressed as user-defined functions are very common. We address the problem of performing data flow optimization at this level of abstraction, where the semantics of operators are not known. Traditionally, query optimization is applied to queries with known algebraic semantics. In this work, we find that a handful of properties, rather than a full algebraic specification, suffice to establish reordering conditions for data processing operators. We show that these properties can be accurately estimated for black box operators using a shallow static code analysis pass based on reverse data and control flow analysis over the general-purpose code of their user-defined functions. We design and implement an optimizer for parallel data flows that does not assume knowledge of semantics or algebraic properties of operators. Our evaluation confirms that the optimizer can apply common rewritings such as selection reordering, bushy join order enumeration, and limited forms of aggregation push-down, hence yielding similar rewriting power as modern relational DBMS optimizers. Moreover, it can optimize the operator order of non-relational data flows, a unique feature among today's systems.
Part 2 - Spinning Fast Iterative Data Flows, Stephan Ewen
Abstract: Parallel data flow systems are a central part of most analytic pipelines for big data. The iterative nature of many analysis and machine learning algorithms, however, is still a challenge for current systems. While certain types of bulk iterative algorithms are supported by novel data flow frameworks, these systems cannot exploit computational dependencies present in many algorithms, such as graph algorithms. As a result, these algorithms are inefficiently executed and have led to specialized systems based on other paradigms, such as message passing or shared memory. We propose a method to integrate 'incremental iterations', a form of workset iterations, with parallel data flows. After showing how to integrate bulk iterations into a dataflow system and its optimizer, we present an extension to the programming model for incremental iterations. The extension alleviates for the lack of mutable state in dataflows and allows for exploiting the 'sparse computational dependencies' inherent in many iterative algorithms. The evaluation of a prototypical implementation shows that those aspects lead to up to two orders of magnitude speedup in algorithm runtime, when exploited. In our experiments, the improved dataflow system is highly competitive with specialized systems while maintaining a transparent and unified data flow abstraction.
Part 3 - A Taxonomy of Platforms for Analytics on Big Data, Thomas Bodner
Abstract: Within the past few years, industrial and academic organizations designed a wealth of systems for data-intensive analytics including MapReduce, SCOPE/Dryad, ASTERIX, Stratosphere, Spark, and many others. These systems are being applied to new applications from diverse domains other than (traditional) relational OLAP, making it difficult to understand the tradeoffs between them and the workloads for which they were built. We present a taxonomy of existing system stacks based on their architectural components and the design choices made related to data processing and programmability to sort this space. We further demonstrate a web repository for sharing Big Data analytics platform information and use cases. The repository enables researchers and practitioners to store and retrieve data and queries for their use case, and to easily reproduce experiments from others on different platforms, simplifying comparisons.