Overview of “Big Data” Research at TU Berlin

Intro – By Volker Markl

Part 1 – Query Optimization with MapReduce Functions, Kostas Tzoumas

Abstract: Many systems for big data analytics employ a data flow programming abstraction to define parallel data processing tasks. In this setting, custom operations expressed as user-defined functions are very common. We address the problem of performing data flow optimization at this level of abstraction, where the semantics of operators are not known. Traditionally, query optimization is applied to queries with known algebraic semantics. In this work, we find that a handful of properties, rather than a full algebraic specification, suffice to establish reordering conditions for data processing operators. We show that these properties can be accurately estimated for black box operators using a shallow static code analysis pass based on reverse data and control flow analysis over the general-purpose code of their user-defined functions. We design and implement an optimizer for parallel data flows that does not assume knowledge of semantics or algebraic properties of operators. Our evaluation confirms that the optimizer can apply common rewritings such as selection reordering, bushy join order enumeration, and limited forms of aggregation push-down, hence yielding similar rewriting power as modern relational DBMS optimizers. Moreover, it can optimize the operator order of non-relational data flows, a unique feature among today’s systems.

Part 2 – Spinning Fast Iterative Data Flows, Stephan Ewen

Abstract: Parallel data flow systems are a central part of most analytic pipelines for big data. The iterative nature of many analysis and machine learning algorithms, however, is still a challenge for current systems. While certain types of bulk iterative algorithms are supported by novel data flow frameworks, these systems cannot exploit computational dependencies present in many algorithms, such as graph algorithms. As a result, these algorithms are inefficiently executed and have led to specialized systems based on other paradigms, such as message passing or shared memory. We propose a method to integrate “incremental iterations”, a form of workset iterations, with parallel data flows. After showing how to integrate bulk iterations into a dataflow system and its optimizer, we present an extension to the programming model for incremental iterations. The extension alleviates for the lack of mutable state in dataflows and allows for exploiting the “sparse computational dependencies” inherent in many iterative algorithms. The evaluation of a prototypical implementation shows that those aspects lead to up to two orders of magnitude speedup in algorithm runtime, when exploited. In our experiments, the improved dataflow system is highly competitive with specialized systems while maintaining a transparent and unified data flow abstraction.

Part 3 – A Taxonomy of Platforms for Analytics on Big Data, Thomas Bodner

Abstract: Within the past few years, industrial and academic organizations designed a wealth of systems for data-intensive analytics including MapReduce, SCOPE/Dryad, ASTERIX, Stratosphere, Spark, and many others. These systems are being applied to new applications from diverse domains other than (traditional) relational OLAP, making it difficult to understand the tradeoffs between them and the workloads for which they were built. We present a taxonomy of existing system stacks based on their architectural components and the design choices made related to data processing and programmability to sort this space. We further demonstrate a web repository for sharing Big Data analytics platform information and use cases. The repository enables researchers and practitioners to store and retrieve data and queries for their use case, and to easily reproduce experiments from others on different platforms, simplifying comparisons.

Speaker Details

Dr. Markl is currently working for the Bavarian Research Center for Knowledge-Based Systems in Munich. He is heading an international research effort to investigate with industry partners the application of multidimensional access methods to relational database systems. The participating commercial partners are SAP AG, NEC, Hitachi, Teijin Systems Technology, TransAction Software, GfK, the European Union, Microsoft. The FORWISS team includes 4 researchers and an average of 10 master students of the Munich University of Technology.

Volker Markl is a graduate of the Munich University of Technology. He completed his Ph.D. thesis in Computer Science in March 1999 under the supervision of Rudolf Bayer. His dissertation was in “Relational Queries Processing Using a Multidimensional Access Technique.”He earned a degree in Business Administration from the University Hagen, Germany in 1995. His research interests are on physical data modeling and query optimization but also include data warehousing, electronic commerce and web based systems.

Dr. Markl’s professional experience include software engineer for a virology laboratory, as part of his military service; instructor for a computer school; and consultant for a forwarding agency. He was awarded a sponsorship by Siemens AG, Munich and also worked as an international intern with Benefit Panel Services, Los Angeles.

Kostas Tzoumas is a postdoctoral researcher co-leading the Stratosphere research project at the Technische Universität Berlin. He received his PhD from Aalborg University in 2011 with a thesis on discovering and exploiting correlations for query optimization. He was a visiting researcher at the University of Maryland, College Park, and an intern at Microsoft Research. He received a Diploma in Electrical and Computer Engineering from the National Technical University of Athens in 2007. His research interests are centered around systems for data analytics, including query processing and optimization in massively parallel environments.

Stephan Ewen is a research associate at the department for Database Systems and Information Management (DIMA) at the Technische Universität Berlin. He is working on the Stratosphere Project that aims at creating a versatile and efficient analytics engine for deep analysis of Big Data on cloud platforms. Within the project, Stephan works on the system’s data flow programming abstraction, the data flow optimization and the parallel runtime system. Prior to joining the DIMA group, Stephan completed the “Applied Computer Science” program at the University of Cooperative Education Stuttgart jointly with IBM Germany and got his Diploma from the University of Stuttgart. In the course of his studies, Stephan Ewen worked, among others, for the IBM Almaden Research Centre and the IBM Development Laboratory Böblingen.