
POSTDOC RESEARCHER
eScience Group
Microsoft Research eScience Group
Los Angeles Office
1100 Glendon Ave, Suite 1080,
Los Angeles, CA 90024
Redmond Office
B99/4611, 14820 NE 36th St,
Redmond, WA 98052
+1 (540) 449-4770 [Cell]
+1 (425) 538-6245 [Work/FAX]
+1 (425) 704-8891 [Redmond]
yoges@microsoft.com
« Background • Research • Publications • Talks • Projects »
Background
I am interested in end-to-end management of data in distributed systems such as Grids and Clouds. These data management problems are particularly challenging for scientific datasets, where managing metadata, searching and discovering data, tracking data provenance, evaluating data quality, and archiving experimental results are of particular importance. I am also interested in scientific workflow frameworks, which are becoming standard actors that consume and generate scientific data and help model complex, data-intensive scientific experiments in silico that can be executed in the Cloud.
Research Interests
Data Provenance
Scientific Workflows
Publications
- Yogesh Simmhan, Maria Nieto-Santisteban, Roger Barga, Tamas Budavari, Laszlo Dobos, Nolan Li, Michael Shipway, Alexander S. Szalay, Ani Thakar, Jan Vandenberg, Alainna Wonders, Sue Werner, Richard Wilton, Dan Fay, Michael Thomassy, Catharine van Ingen, Jim Heasley, and Conrad Holmberg, GrayWulf: Scalable Software Architecture for Data Intensive Computing, in Hawaii International Conference on System Sciences (HICSS), Computer Society Press, January 2009
- Maria Nieto-Santisteban, Yogesh Simmhan, Roger Barga, Laszlo Dobos, Jim Heasley, Conrad Holmberg, Nolan Li, Michael Shipway, Alexander S. Szalay, Catharine van Ingen, and Sue Werner, Pan-STARRS: Learning to Ride the Data Tsunami, December 2008
- Yogesh Simmhan, Roger Barga, Catharine van Ingen, Ed Lazowska, and Alex Szalay, On Building Scientific Workflow Systems for Data Management in the Cloud, in IEEE eScience Conference, December 2008
- Roger Barga, Jared Jackson, Nelson Araujo, Dean Guo, Nitin Gautam, and Yogesh Simmhan, The Trident Scientific Workflow Workbench, in IEEE eScience Conference, December 2008
- Yogesh Simmhan, Roger Barga, and Catharine van Ingen, Automatic Provenance Recording for Scientific Data using Trident, in American Geophysical Union (AGU) Fall Meeting, December 2008
- Maria A. Nieto-Santisteban, Tamas Budavari, Laszlo Dobos, Nolan Li, Michael Shipway, Alexander Szalay, Ani Thakar, Suzanne Werner, Richard Wilton, Yogesh Simmhan, Catharine van Ingen, Jim Heasley, and Conrad Holmberg, GrayWulf: Conquering Astronomical Databases, in Astronomical Data Analysis Software and Systems (ADASS), Astronomical Society of the Pacific, November 2008
- Yogesh Simmhan, End-to-End Scientific Data Management Using Workflows, in Scientific Workflows Workshop, IEEE Congress on Services, IEEE Computer Society, Los Alamitos, CA, USA, July 2008
- Roger S. Barga, Dan Fay, Dean Guo, Steven Newhouse, Yogesh Simmhan, and Alex Szalay, Efficient scheduling of scientific workflows in a high performance computing cluster, in Challenges of Large Applications in Distributed Environments (CLADE), ACM, New York, NY, USA, June 2008
Selected Talks
On End-to-End Scientific Data Management using Workflows. Invited talk at the Scientific Workflows Workshop, 2008.
Transforming Scientific Research through Cloud Technology. Talk at the Indian Institute of Science, Bangalore, 2008.
Cloud Computing: A Technical Overview. Tutorial at the MSR eScience Workshop, Indianapolis, 2008. [Slides] [C# Code] [Java Code]
Awards
Supercomputing 2008 Storage Challenge Winner. GrayWulf: Scalable Cluster Architecture for Data Intensive Computing. Alexander Szalay, Maria Nieto-Santisteban, Jan Vandenberg, Alainna Wonders, Randal Burns, Eric Perlman, Ani Thakar, Mike McCarty and Dean Zariello (Johns Hopkins University); Gordon Bell, Tony Hey, Roger Barga, Yogesh Simmhan and Catharine van Ingen (Microsoft Research); Michael Thomassy and Lubor Kollar (Microsoft Corporation); Robert Grossman, David Hanley, Yunhong Gu and Michael Sabala (University of Illinois at Chicago); Jim Heasley (University of Hawaii); and Tim Carrol, Eric Barnes and Mike Rowland (Dell, Inc.)
Service
Provenance Challenge Workshop, 2009. Organizing Committee member.
Workshop on Semantic Web and Provenance Management, 2009. Program Committee member.
Scientific Workflow Workshop, 2007-09. Program Committee member.
Current Projects
Pan-STARRS
The Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) is the next generation of digital sky surveys, building on the success of the Sloan Digital Sky Survey (SDSS). Equipped with the world’s largest digital camera, this next-generation system leverages SQL Server 2008, Windows Workflow Foundation and the Trident Scientific Workbench to handle the much larger data volume generated by Pan-STARRS (30 TB/year) and to make that data available to astronomers promptly (incrementally updated each week).
This project is in collaboration with Alex Szalay of Johns Hopkins University and Jim Heasley of University of Hawai’i. My role is to incorporate scientific workflows that automate the data pipeline, which continuously loads processed telescope detections into science-ready databases.
Trident Scientific Workflow Workbench
The Trident Workbench provides a rich set of tools to run scientific workflows in the Cloud. Built on top of the Windows Workflow Foundation runtime, Trident adds tools such as a visual workflow composer, service registry, provenance tracking and integration with the Windows HPC scheduler that make it an effective workbench for eScience in the Cloud. Originally designed for the NEPTUNE Oceanography project, Trident is now being generalized to other scientific domains and is being used in the Pan-STARRS project.
The Trident project is led by Roger Barga at Microsoft Research. I am working on the provenance collection aspects within Trident and on driving the design of the framework through its application in Pan-STARRS.
Karma3 Provenance Framework
The Karma provenance framework began as part of my Ph.D. research to build an effective, lightweight provenance collection system for scientific workflows, and it was applied to the LEAD meteorology project. Development on Karma continues, funded by an NSF SDCI grant to make Karma general purpose and to use the captured provenance to automate workflow composition. This work will also make Karma compatible with the emerging Open Provenance Model specification.
This project is in collaboration with Beth Plale and David Leake of Indiana University and Dennis Gannon of Microsoft Research. I am a Co-PI on the NSF SDCI award.
Semantic Provenance in Life Sciences Grid
The Life Science Grid (LSG) is an open-source plugin framework from Eli Lilly that allows researchers in the Life Science domain to use information services, encapsulated as plugins, in a collaborative manner to perform scientific research and discovery. This project extends LSG by capturing semantic provenance on user interactions with the information sources through LSG. This provenance helps track research direction, supports collaborative research, and provides a rich source for data mining. The project uses Karma for the provenance capture and S-OGSA for semantic annotations and querying.
This project is in collaboration with Beth Plale of Indiana University and Carole Goble of University of Manchester, and sponsored by Eli Lilly Pharmaceuticals.



