Reliable Management of Community Data Pipelines using Scientific Workflows

Yogesh Simmhan, Catharine van Ingen, Roger Barga, Alex Szalay, and Jim Heasley

Abstract

The pervasive availability of scientific data from sensors and field observations poses a challenge to the data valets responsible for accumulating and managing it in data repositories. Science collaborations, big and small, are standing up repositories built on commodity clusters that need to reliably ingest data constantly and ensure its availability to a wide user community. Workflows provide several benefits for modeling data-intensive science applications, and many of these benefits carry over to managing data ingest pipelines. But using workflows is not a panacea in itself: data valets need to consider several issues when designing workflows that behave reliably on fault-prone hardware while retaining the consistency of the scientific data, and when selecting workflow frameworks that support these requirements. In this paper, we propose workflow design models for reliable data ingest in a distributed environment and identify workflow framework features to support resilience. We illustrate these using the data ingest pipeline for the Pan-STARRS sky survey, one of the largest digital surveys, which accumulates 100 TB of data annually and where these concepts are applied.

Details

Publication type: TechReport
Number: MSR-TR-2009-125
Publisher: Microsoft

Previous versions

Yogesh Simmhan, Catharine van Ingen, Roger Barga, Alex Szalay, and Jim Heasley. Building Reliable Data Pipelines for Managing Community Data using Scientific Workflows, IEEE, 9 December 2009.
