Yogesh Simmhan, Catharine van Ingen, Roger Barga, Alex Szalay, and Jim Heasley
15 September 2009
The pervasive availability of scientific data from sensors and field observations poses a challenge to the data valets responsible for accumulating and managing it in data repositories. Science collaborations, big and small, are standing up repositories built on commodity clusters that need to ingest data reliably and continuously and ensure its availability to a wide user community. Workflows provide several benefits for modeling data-intensive science applications, and many of these benefits carry over effectively to managing data ingest pipelines. But workflows are not a panacea in themselves: data valets need to consider several issues when designing workflows that behave reliably on fault-prone hardware while retaining the consistency of the scientific data, and when selecting workflow frameworks that support these requirements. In this paper, we propose workflow design models for reliable data ingest in a distributed environment and identify workflow framework features that support resilience. We illustrate these using the data ingest pipeline for the Pan-STARRS sky survey, one of the largest digital surveys, which accumulates 100 TB of data annually and in which these concepts are applied.
© 2008 Microsoft Corporation. All rights reserved.
Yogesh Simmhan, Catharine van Ingen, Roger Barga, Alex Szalay, and Jim Heasley. Building Reliable Data Pipelines for Managing Community Data using Scientific Workflows, IEEE, 9 December 2009.