Scientific applications have diverse data and computation needs that scale from desktop to supercomputers. Besides the nature of the application and the domain, the resource needs for the applications also vary over time—as the collaboration and the data collections expand, or when seasonal campaigns are undertaken. Cloud computing offers a scalable, economic, on-demand model well-matched to evolving eScience needs.
Azure Ocean — A Sea of Data in the Cloud
The Ocean Observatories Initiative (OOI) is an NSF funded program to establish the ocean observing infrastructure of the 21st century benefiting research and education. The OOI program promises to deliver a cabled observatory to study environmental processes of the ocean at various scales, from coastal shelf-slope exchange processes to the deep ocean. The OOI’s data distribution network lies at the heart of its architecture and will enable a multitude of science and education applications, ranging from data analysis, to processing, and visualization. The magnitude of the data from the cabled observatory, along with the complexity of scientific analysis and diverse user base, demands a cloud computing platform. We illustrate Windows Azure for storage of OOI data collections and generation of data visualizations on demand, which are delivered to a rich client that is running the COVE ocean visualization toolkit. The result is that an ocean science researcher anywhere in the world can access a sea of ocean data on Azure to carry out their research from their local client machine. This project is a collaborative effort between the Cloud Computing Futures group of XCG (Microsoft Research) and the eScience Institute of the University of Washington.
Bioinformatics Computation in the Cloud
(Click to view a larger image)The field of Bioinformatics has experienced a deluge of vast amounts of Genomics data in recent years, necessitating the use of massive computation and data repository resources in order to make use of those data. By combining the power of rich client applications with that of the scalability and flexibility of the cloud, the proliferation of research in this area is better enabled. We have shown that, when combining the Excel 2010, an extremely rich experience is enabled on a desktop or even laptop environment. By then combining this client experience with a seamless connection to deep genomic database search capabilities provided on Azure via the AzureBlast service, the researcher is able to discover new insights and explore areas not otherwise possible when using these technologies in isolation. This project is a collaborative effort among multiple organizations at Microsoft, including major portions of which are available as open source.
ModisAzure — Azure Service for Remote Sensing Geoscience
ModisAzure is a science pipeline for download, initial processing, and reduction of satellite imagery. It is developed by Microsoft Research, UVa, UCB. The service dramatically lowers resource and complexity barriers to use satellite imagery for terrestrial hydrology and geoscience. Common imagery location is determined and uploaded from diverse sources into Azure blob store. Common reprojection and harmonization is performed at scale in Azure to produce science-ready imagery with the same length, time and quality attributes. Optional scientist-provided reduction algorithm (.NET, Java, or MatLab) can also be run in the Cloud. This provides on-demand scalability beyond the local desktop or cluster.
The ModisAzure service is in use now to compute 10 year continental scale water balance for North America. Per year:
- 500 GB (~60K files) upload of 9 different source imagery products from 15 different locations
- 400 GB reprojected harmonized imagery consuming ~3500 cpu hours
- 5 GB reduced science result leveraging reported field data aggregates consuming ~3600 cpu hour
Additional science requests to expand the service to cover Europe and to include additional source imagery products and formats is pending.
- Catharine van Ingen
- Jie Li
- Marty Humphrey (UVA)
- Youngryel Ryu (UCB)
- Deb Agarwal (BWC/LBL)
- Keith Jackson (BL)
- Jay Borenstein (Stanford)
- Team SICT: Vlad Andrei, Klaus Ganser, Samir Selman, Nandita Prabhu (Stanford)
- Team Nimbus: David Li, Sudarshan Rangarajan, Shantanu Kurhekar, Riddhi Mittal (Stanford)
Tools to Transform the Potential of Cloud to Reality for Research
Cloud holds promise for science users across the computing spectrum, from desktop users who need on-demand compute power to HPC cluster users who wish to scale out further. The potential for effective research in the Azure Cloud will come to fruition only when there is a better understanding of its capabilities (and limitations) and tools are available to make it accessible to scientists. We present work that addresses these two issues to enable research across desktop client and the Cloud.
AzureScope is an online resource for practitioners and architects to understand the performance characteristics of Azure. This helps them select applications that are best suited to leverage the Azure platform. Besides repeatable micro-benchmarks that measure the performance characteristics of storage services like Blobs, Tables, Queue, XDrive and SQL Azure, it also monitors the healthy status of Azure fabric services. At a higher level, we report benchmarks of common distributed algorithms like Matrix Multiplication, Gigasort and Bio-Sequence Comparison, and identify best practices for running applications on Azure.
The Generic Worker framework makes the Azure platform more accessible to desktop clients. Instead of requiring science users to write code to deploy their existing desktop applications to Azure, we provide simple tools that register .NET, Java, MatLab and commandline applications to “Generic” workers in Azure. Usually, activating the remote cloud applications from desktop clients like workflows requires additional code for parameter passing through queues and file transfer between desktop and cloud storage. Our framework provides tools for transparent, on-demand invocation of registered cloud applications without writing code from commandline, workflows or APIs to bridge the gap between client and the cloud.
- Jie Li, Deb Agarwal, Marty Humphrey, Catharine van Ingen, Keith Jackson, and Youngryel Ryu, eScience in the Cloud: A MODIS Satellite Data Reprojection and Reduction Pipeline in the Windows Azure Platform, in IPDPS, IEEE, 2010
- Jie Li, Youngryel Ryu, Keith Jackson, Deb Agarwal, Marty Humphrey, and Catharine van Ingen, A Cloud Computing Service for Remote Sensing Data Reprojection and Science Reduction, in American Geophysical Union Fall Meeting, December 2009
- Yogesh Simmhan, Catharine van Ingen, Girish Subramanian, and Jie Li, Bridging the Gap between the Cloud and an eScience Application Platform, no. MSR-TR-2009-2021, 15 November 2009
- Girish Subramanian and Yogesh Simmhan, Tools for Genome Haplotyping in the Windows Azure Cloud, in Microsoft Research eScience Workshop, Microsoft Research, 16 October 2009