POSTDOC RESEARCHER
Microsoft External Research
eScience Group
Los Angeles Office
1100 Glendon Ave, Suite 1080,
Los Angeles CA 90024 [Map]
Redmond Office
B99/4611, 14820 NE 36th St,
Redmond, WA 98052 [Map]
+1 (540) 449-4770 [Cell]
+1 (425) 538-6245 [Work/FAX]
+1 (425) 704-8891 [Redmond]
« Background • Publications • Talks • Projects »
Background
I am interested in end-to-end management of data on distributed platforms such as clusters, Clouds and Grids. Data management is particularly challenging in data-intensive sciences, where querying metadata, managing data repositories, tracking provenance, and archiving experimental results are important to the scientific process. I work on scientific workflow frameworks that address some of these challenges and provide a platform for in silico experiments. Clouds are emerging as a feasible alternative for scalable scientific analyses, and I am exploring the scope of eScience applications in the Cloud and the middleware needed to support them. Applying these tools to support science is a key goal, and I work, or have worked, with scientists in the meteorology, astronomy and genomics domains in this pursuit.
Publications
2009
- Yogesh Simmhan, Catharine van Ingen, Roger Barga, Alex Szalay, and Jim Heasley, Building Reliable Data Pipelines for Managing Community Data using Scientific Workflows, in eScience Conference, IEEE, 9 December 2009
- Yogesh Simmhan, Roger Barga, Catharine van Ingen, Ed Lazowska, and Alex Szalay, Building the Trident Scientific Workflow Workbench for Data Management in the Cloud, in International Conference on Advanced Engineering Computing and Applications in Sciences (ADVCOMP), IEEE, October 2009
- Beth Plale, Bin Cao, Girish Subramanian, Carole Goble, Paolo Missier, and Yogesh Simmhan, Semantically Annotated Provenance in the Life Science Grid, in Semantic Web in Provenance Management (SWPM) Workshop, October 2009
- Bin Cao, Beth Plale, Girish Subramanian, Ed Robertson, and Yogesh Simmhan, Provenance Information Model of Karma Version 3, in International Workshop on Scientific Workflows (SWF), IEEE, July 2009
- Yogesh Simmhan, Maria Nieto-Santisteban, Roger Barga, Tamas Budavari, Laszlo Dobos, Nolan Li, Michael Shipway, Alexander S. Szalay, Ani Thakar, Jan Vandenberg, Alainna Wonders, Sue Werner, Richard Wilton, Dan Fay, Michael Thomassy, Catharine van Ingen, Jim Heasley, and Conrad Holmberg, GrayWulf: Scalable Software Architecture for Data Intensive Computing, in Hawaii International Conference on System Sciences (HICSS), Computer Society Press, January 2009
2008
- Maria Nieto-Santisteban, Yogesh Simmhan, Roger Barga, Laszlo Dobos, Jim Heasley, Conrad Holmberg, Nolan Li, Michael Shipway, Alexander S. Szalay, Catharine van Ingen, and Sue Werner, Pan-STARRS: Learning to Ride the Data Tsunami, in Microsoft eScience Workshop, December 2008
- Yogesh Simmhan, Roger Barga, Catharine van Ingen, Ed Lazowska, and Alex Szalay, On Building Scientific Workflow Systems for Data Management in the Cloud, in IEEE eScience Conference, December 2008
- Yogesh Simmhan, Roger Barga, and Catharine van Ingen, Automatic Provenance Recording for Scientific Data using Trident, in American Geophysical Union (AGU) Fall Meeting, December 2008
- Roger Barga, Jared Jackson, Nelson Araujo, Dean Guo, Nitin Gautam, and Yogesh Simmhan, The Trident Scientific Workflow Workbench, in IEEE eScience Conference, IEEE, December 2008
- Maria A. Nieto-Santisteban, Tamas Budavari, Laszlo Dobos, Nolan Li, Michael Shipway, Alexander Szalay, Ani Thakar, Suzanne Werner, Richard Wilton, Yogesh Simmhan, Catharine van Ingen, Jim Heasley, and Conrad Holmberg, GrayWulf: Conquering Astronomical Databases, in Astronomical Data Analysis Software and Systems (ADASS), Astronomical Society of the Pacific, November 2008
- Yogesh Simmhan, End-to-End Scientific Data Management Using Workflows, in Scientific Workflows Workshop, IEEE Congress on Services, IEEE Computer Society, Los Alamitos, CA, USA, July 2008
- Roger S. Barga, Dan Fay, Dean Guo, Steven Newhouse, Yogesh Simmhan, and Alex Szalay, Efficient scheduling of scientific workflows in a high performance computing cluster, in Challenges of Large Applications in Distributed Environments (CLADE), ACM, New York, NY, USA, June 2008
Selected Talks
- On End-to-End Scientific Data Management using Workflows. Invited talk at the Scientific Workflows Workshop, Honolulu, 2008.
- Cloud Computing: A Technical Overview. Tutorial at the MSR eScience Workshop, Indianapolis, 2008. [Slides] [Video]
- Transforming Scientific Research through Cloud Technology. Talk at the Indian Institute of Science, Bangalore, 2008.
Current & Recent Projects
BioInformatics & Cloud Computing
Coming soon... Microsoft Biology Initiative; Windows Azure; DryadLINQ
Pan-STARRS
The Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) is the next generation of digital sky surveys, building on the success of the Sloan Digital Sky Survey (SDSS). Equipped with the world’s largest digital camera, the system leverages SQL Server 2008, Windows HPC Clusters, Windows Workflow Foundation and the Trident Scientific Workflow Workbench to handle the much larger data volume generated by Pan-STARRS (30 TB/year) and to make that data available to astronomers promptly (incrementally updated each week).
This project is in collaboration with Alex Szalay of Johns Hopkins University and Jim Heasley of the University of Hawai’i. I was actively involved in incorporating scientific workflows that reliably automate the data pipeline, which continuously brings processed telescope detections into science-ready databases.
Trident Scientific Workflow Workbench
The Trident Workbench provides a rich set of tools to run scientific workflows in the Cloud. Built on top of the Windows Workflow Foundation runtime, Trident adds tools such as a visual workflow composer, a service registry, provenance tracking and integration with the Windows HPC scheduler, which make it an effective workbench for eScience in the Cloud. Originally designed for the NEPTUNE Oceanography project, Trident is now being generalized to other scientific domains and is being used in the Pan-STARRS project.
The Trident project was led by Roger Barga at Microsoft Research.
New: Download and try the Project Trident CTP!
Karma3 Provenance Framework
The Karma provenance framework was initiated as part of my Ph.D. research to build an effective, lightweight provenance collection system for scientific workflows, and was applied to the LEAD meteorology project. Development on Karma v3 continues, funded by an NSF SDCI grant, to make Karma general purpose and to use the captured provenance to automate workflow composition. This work will also make Karma compatible with the emerging Open Provenance Model specification.
This project is in collaboration with Beth Plale and David Leake of Indiana University and Dennis Gannon of Microsoft Research. I am a Co-PI on the NSF SDCI award.
Semantic Provenance in the Life Science Grid
The Life Science Grid (LSG) is an open-source plugin framework from Eli Lilly that allows researchers in the Life Science domain to use information services, encapsulated as plugins, in a collaborative manner to perform scientific research and discovery. This project extends LSG by capturing semantic provenance on user interactions with the information sources through LSG. The captured provenance helps track research direction, supports collaborative research and provides a rich source for data mining. The project uses Karma for provenance capture and S-OGSA for semantic annotations and querying.
This project was in collaboration with Beth Plale of Indiana University and Carole Goble of the University of Manchester, and was sponsored by Eli Lilly Pharmaceuticals.
Service
Recent Service
- FGCS Special Issue on Using the Open Provenance Model to Address Interoperability Challenges. Guest Editor. [CFP Open till Dec/15 '09][txt | pdf | submit]
- Provenance Challenge 3 Workshop, 2009. Organizing Committee member.
- Workshop on the role of Semantic Web in Provenance Management (SWPM), 2009. Program Committee member.
- Scientific Workflow Workshop, 2007-09. Program Committee member.
- Manuscript Reviewer for Journals: IEEE T-KDE, T-SC, T-ASE; FGCS; CPE.
- Member: IEEE, ACM, ASHG
Past Interns
- Girish Subramanian, Indiana University (Summer 2009)
Awards
Supercomputing 2008 Storage Challenge Winner. GrayWulf: Scalable Cluster Architecture for Data Intensive Computing. Alexander Szalay, Maria Nieto-Santisteban, Jan Vandenberg, Alainna Wonders, Randal Burns, Eric Perlman, Ani Thakar, Mike McCarty and Dean Zariello (Johns Hopkins University); Gordon Bell, Tony Hey, Roger Barga, Yogesh Simmhan and Catharine van Ingen (Microsoft Research); Michael Thomassy and Lubor Kollar (Microsoft Corporation); Robert Grossman, David Hanley, Yunhong Gu and Michael Sabala (University of Illinois at Chicago); Jim Heasley (University of Hawaii); and Tim Carrol, Eric Barnes and Mike Rowland (Dell, Inc.)



