Data Storage and Processing in Parallel Array Engines

Scientists today are able to generate data at unprecedented scale and rate. For example, the Large Synoptic Survey Telescope (LSST) project announced that it will be producing approximately 30 TB of data per night in a few years. In many fields of science, multidimensional arrays rather than flat tables are the standard data type because data values are associated with coordinates in space and time. For example, images in astronomy are 2D arrays of pixel intensities, and climate and ocean models use arrays or meshes to describe 3D regions of the atmosphere and oceans. As a result, scientists need powerful tools to help them manage these massive arrays.

In this talk, I will focus on the challenges of building parallel array data management systems that facilitate massive-scale data analytics over arrays. In particular, I will present the AscotDB system, a collaboration of an interdisciplinary team of astronomy and database experts. Our goal is to answer one question: What would be the most transformative tool for processing the next-generation telescope image collections, such as the one that LSST will produce? In AscotDB, we integrated several pieces of technology: the SciDB open-source array engine for data storage and processing, Ascot for graphical data exploration, and Python for easy programmatic access. Together, these three pieces provide a compelling and powerful environment for the exploration, analysis, visualization, and sharing of large astronomical datasets. In the context of the AscotDB project, and also motivated by other array-processing applications, I will describe some of the critical challenges in building a parallel array management system and the way we addressed them. In particular, I will present three major components of an array engine that I tackled during my Ph.D., in the context of the following projects: 1) ArrayStore: efficient storage management mechanisms for storing arrays on disk; 2) TimeArr: efficient support for updates and data versioning; 3) ArrayLoop: native support for efficient iterative computations.

Speaker Details

Emad Soroush is a Ph.D. student in the Computer Science & Engineering department at the University of Washington, under the supervision of Dr. Magdalena Balazinska. He received his bachelor's degree from Sharif University of Technology in Iran and his master's degree from the University of Victoria in Canada. He is a member of the database group and a member of the SciDB group. His area of expertise is database management systems. During his Ph.D. studies, he built new tools to facilitate massive-scale data analytics over array data types. (http://homes.cs.washington.edu/~soroush/)

Date:
Speakers:
Emad Soroush
Affiliation:
University of Washington