Scientific Data Intensive Computing Workshop 2004

A Suite of Web-Services for Wavelet-based Analysis of Large Atmospheric Datasets
Cyrus Shahabi (shahabi@usc.edu)

Modern scientific data analysis systems need to perform complex statistical queries on very large multidimensional datasets; thus, a number of multivariate statistical methods must be supported. Moreover, the desired accuracy varies by application, user, and/or dataset, and it can often be traded off for faster response time. These characteristics lead us to believe that the wavelet transform, with its inherent multi-resolution property, will become a likely tool for future scientific database query processing. We are building a general system that uses the wavelet decomposition of multidimensional data not only to speed up aggregate queries but also to support data-mining functionality.

Our implementation efforts are focused on a NASA/JPL application called "GENESIS". This project was funded by NASA under the ESIP program from 1998 to 2003 and is now funded under the REASoN program until 2008.

Currently, our 3-tier architecture provides flexible data access to Level 1, 2a and 2b products of GPS occultation data from all three NASA receivers on CHAMP, SAC-C and GRACE. The middle tier of our architecture consists of a suite of web-services for wavelet transformation, update and range-aggregate queries. We are extending our web-services to make them more I/O efficient, and we are adding new query services to support more complex statistical queries on the dataset.
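The accuracy-for-speed trade-off described in this abstract can be illustrated with a small sketch. It is not the GENESIS implementation: it uses a one-dimensional orthonormal Haar transform, keeps only the largest coefficients, and answers a range-sum by reconstructing the data, whereas the system described above works on multidimensional data and its query algorithm is not given here. All function names are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_transform(values):
    """Orthonormal Haar decomposition; len(values) must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    coeffs = [float(v) for v in values]
    length = n
    while length > 1:
        half = length // 2
        sums = [(coeffs[2 * i] + coeffs[2 * i + 1]) / SQRT2 for i in range(half)]
        diffs = [(coeffs[2 * i] - coeffs[2 * i + 1]) / SQRT2 for i in range(half)]
        coeffs[:length] = sums + diffs   # coarse part first, detail part after it
        length = half
    return coeffs

def inverse_haar(coeffs):
    """Invert haar_transform, rebuilding one resolution level per pass."""
    out = list(coeffs)
    length = 1
    while length < len(out):
        merged = []
        for s, d in zip(out[:length], out[length:2 * length]):
            merged.append((s + d) / SQRT2)
            merged.append((s - d) / SQRT2)
        out[:2 * length] = merged
        length *= 2
    return out

def keep_largest(coeffs, k):
    """Zero all but the k largest-magnitude coefficients (lossy summary)."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:k])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]

def range_sum(coeffs, lo, hi):
    """Range-aggregate (sum) over the inclusive index range [lo, hi]."""
    return sum(inverse_haar(coeffs)[lo:hi + 1])
```

Calling `range_sum(haar_transform(data), lo, hi)` gives the exact sum up to floating-point rounding, while `range_sum(keep_largest(haar_transform(data), k), lo, hi)` gives an approximation whose quality depends on `k`. A smaller `k` means fewer coefficients to store and read, which is the kind of accuracy versus response-time choice the abstract refers to.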

Artificial Intelligence techniques for distributed workflow management
Yolanda Gil (gil@isi.edu)
http://pegasus.isi.edu

Yolanda Gil is Associate Division Director at the Information Sciences Institute of the University of Southern California, and Research Associate Professor in the Computer Science Department. She received her M.S. and Ph.D. degrees in Computer Science from Carnegie Mellon University. Since joining USC's Information Sciences Institute in 1992, Dr. Gil has formed a group that researches various aspects of Interactive Knowledge Capture (http://www.isi.edu/ikcap). Her research interests include interactive knowledge capture, intelligent user interfaces, knowledge-based reasoning, grid computing, and the semantic web. An area of active work is workflow management in grids, including the integration of AI planning techniques with existing grid services (http://www.isi.edu/ikcap/cognitive-grids). This work has resulted in the development of a system called Pegasus (http://pegasus.isi.edu) for automatic generation of executable grid workflows. Pegasus has been used to analyze data collected from the Laser Interferometer Gravitational Wave Observatory (LIGO), the largest single enterprise ever undertaken by the US National Science Foundation, which aims to detect the gravitational waves predicted by Einstein's theory of relativity. She was Program Chair of the Intelligent User Interfaces (IUI) Conference in 2002, and was co-founder and conference co-chair of the First International Conference on Knowledge Capture (K-CAP) in 2001. Dr. Gil was recently elected to the Council of the American Association for Artificial Intelligence (AAAI).

WSRF.NET: Grid Computing on .NET supporting the Web Services Resource Framework
Marty Humphrey (Humphrey@cs.virginia.edu)

Grid Computing is often defined as a virtual machine abstraction over resources in multiple administrative domains. The Open Grid Services Architecture (OGSA) is the overall architectural vision of Grid Computing; it draws heavily on both Web Services and the traditional approach to Grid Computing that has been developed in academia and national labs (as exemplified by Globus). The Web Services Resource Framework (WSRF) is a specific rendering of individual services in OGSA. WSRF.NET is a set of software libraries, tools and applications that implement the WSRF specifications on top of .NET. WSRF.NET allows easy authoring of WSRF-compliant services and clients and integrates many Microsoft technologies, such as WSE and ADO.NET. WSRF.NET builds upon the success of our OGSI.NET project.

A Metadata Catalog Service (MCS) for the Grid
Ewa Deelman (deelman@isi.edu)

MCS is a metadata catalog service that stores descriptive information (metadata) about logical data items. MCS has been developed as part of the Grid Physics Network (GriPhyN) and NVO projects. The aim of these projects is to support large-scale scientific experiments. MCS is a standalone catalog that stores information about logical data items (e.g., files). It also allows users to aggregate the data items into collections. MCS provides system-defined as well as user-defined attributes for logical items and collections. One distinguishing characteristic of MCS is that users can dynamically define and add metadata attributes. MCS can also provide the names of the user-defined attributes. As a result, different MCS instances can be created with alternative contents. MCS has been implemented to run on top of standard web services or on top of the OGSA-DAI grid service. In the latter case, MCS leverages OGSA-DAI's authentication capabilities to provide secure access to the metadata.
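The data model described above (logical items, collections, system-defined attributes, and attributes that users define at run time) can be pictured with a small in-memory sketch. This is not the MCS API: the class name, method names, and the two example system attributes are assumptions made purely for illustration, and the real service runs behind web or grid services rather than in-process.

```python
class ToyMetadataCatalog:
    """In-memory stand-in for a metadata catalog (hypothetical, not MCS itself)."""

    # Assumed system-defined attributes; the real MCS set is not listed here.
    SYSTEM_ATTRIBUTES = {"creator": str, "version": int}

    def __init__(self):
        self._user_attributes = {}   # attribute name -> expected Python type
        self._items = {}             # logical name -> {attribute: value}
        self._collections = {}       # collection name -> set of logical names

    def define_attribute(self, name, value_type):
        """Dynamically add a user-defined attribute to this catalog instance."""
        if name in self.SYSTEM_ATTRIBUTES or name in self._user_attributes:
            raise ValueError(f"attribute {name!r} already exists")
        self._user_attributes[name] = value_type

    def user_defined_attributes(self):
        """Report the names of the user-defined attributes."""
        return sorted(self._user_attributes)

    def add_item(self, logical_name, collection=None, **attributes):
        """Register a logical item, optionally placing it in a collection."""
        known = {**self.SYSTEM_ATTRIBUTES, **self._user_attributes}
        for name, value in attributes.items():
            if name not in known:
                raise KeyError(f"undefined attribute {name!r}")
            if not isinstance(value, known[name]):
                raise TypeError(f"{name!r} expects {known[name].__name__}")
        self._items[logical_name] = dict(attributes)
        if collection is not None:
            self._collections.setdefault(collection, set()).add(logical_name)

    def find(self, collection=None, **conditions):
        """Return logical names whose attributes equal all given conditions."""
        if collection is not None:
            names = self._collections.get(collection, set())
        else:
            names = self._items
        return sorted(
            n for n in names
            if all(self._items[n].get(k) == v for k, v in conditions.items())
        )
```

Because each instance carries its own set of user-defined attributes, two catalogs built from this class can hold quite different metadata, which mirrors the remark that different MCS instances can be created with alternative contents.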

Condor Project
(condor-admin@cs.wisc.edu)

The goal of the Condor Project is to develop, implement, deploy, and evaluate mechanisms and policies that support High Throughput Computing (HTC) on large collections of distributively owned computing resources. Guided by both the technological and sociological challenges of such a computing environment, the Condor Team has been building software tools that enable scientists and engineers to increase their computing throughput.

Middleware Support for Data Driven Applications in Science, Engineering, and Biomedicine
Tahsin Kurc (kurc@bmi.osu.edu)

The multiscale computing laboratory is part of the Biomedical Informatics Department at the Ohio State University. The goals of the lab are to develop the middleware technology and techniques to enable management, sharing, and manipulation of data at multiple scales across heterogeneous, dynamic collections of storage and computation systems. Some of the target application areas include:

  1. large scale, collaborative biomedical studies that reference and integrate molecular, clinical, and image data,
  2. management and processing of very large biomedical image datasets,
  3. analysis and simulation of oil reservoirs and data driven control of oil reservoir management,
  4. analysis of multi-resolution, multiple-grid simulation datasets.

Our research targets techniques and tools for optimized storage, indexing, retrieval, and processing of large datasets spread across many distributed storage systems. These efforts have led to the development of several middleware systems:

DataCutter is a component-based middleware framework designed to provide support for user-defined processing of large multi-dimensional datasets across a wide-area network. In DataCutter, application processing structure is implemented as a network of components, referred to as filters. The DataCutter runtime system supports combined use of task and data parallelism, and execution of filters on heterogeneous collections of storage and compute clusters in a distributed environment. Processing, network and data copying overheads are minimized by the ability to create multiple copies of filters and place them on different platforms.
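The filter-network idea can be sketched in a few lines. This is only an analogy under stated assumptions, not the DataCutter API: threads inside one process stand in for filters placed on different hosts, in-memory queues stand in for the streams between filters, and the names `run_pipeline` and `_worker` are hypothetical.

```python
import queue
import threading

_END = object()  # end-of-stream marker passed along each stream

def _worker(func, inbox, outbox):
    """One copy of a filter: read items, apply func, write results."""
    while True:
        item = inbox.get()
        if item is _END:
            return
        outbox.put(func(item))

def run_pipeline(stages, source):
    """Run a linear network of filters.

    stages: list of (function, number_of_copies) pairs, in stream order.
    source: iterable of input data items.
    Returns the outputs of the last filter (order is not guaranteed).
    """
    streams = [queue.Queue() for _ in range(len(stages) + 1)]
    groups = []
    for i, (func, copies) in enumerate(stages):
        group = [
            threading.Thread(target=_worker, args=(func, streams[i], streams[i + 1]))
            for _ in range(copies)
        ]
        for t in group:
            t.start()
        groups.append(group)

    for item in source:
        streams[0].put(item)

    # Close the network stage by stage: one end marker per filter copy,
    # then wait so that everything the stage produced is already downstream.
    for i, group in enumerate(groups):
        for _ in group:
            streams[i].put(_END)
        for t in group:
            t.join()

    results = []
    while not streams[-1].empty():
        results.append(streams[-1].get())
    return results
```

For example, `run_pipeline([(lambda x: x * x, 3), (lambda x: x + 1, 2)], range(10))` runs three copies of the first filter and two of the second. All copies of a filter share one input stream, so work is divided among them (data parallelism) while the two filters run concurrently (task parallelism).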

STORM is service-based middleware, implemented using DataCutter, that is designed as a set of coupled services that collectively support execution of SQL-style SELECT queries on scientific datasets stored in files on distributed storage systems. These services provide support for two operations. 1) Selection of the data of interest. The data is selected based on either the values of particular attributes or ranges of attribute values (i.e., range queries). The selection operation can also involve user-defined filtering operations. 2) Transfer of data from storage nodes to compute nodes for processing. If the data analysis program runs on a cluster, STORM supports application-specific partitioning and parallel transfer of data elements to the destination processors.
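The two steps described for STORM, selection followed by partitioned transfer, can be sketched as follows. This is a single-process illustration under assumptions, not STORM's interface: records are plain dictionaries, the function names `select` and `partition` are hypothetical, and the buckets returned by `partition` stand in for data sent to destination processors.

```python
def select(records, ranges, user_filter=None):
    """Yield records whose attributes fall inside every given range.

    ranges: {attribute: (low, high)} with inclusive bounds (a range query).
    user_filter: optional predicate applied after the range test.
    """
    for rec in records:
        in_range = all(lo <= rec[attr] <= hi for attr, (lo, hi) in ranges.items())
        if in_range and (user_filter is None or user_filter(rec)):
            yield rec

def partition(records, n_procs, key=None):
    """Split selected records into one bucket per destination processor.

    key: optional application-specific function mapping a record to an
    integer; without it, records are dealt out round-robin.
    """
    buckets = [[] for _ in range(n_procs)]
    for i, rec in enumerate(records):
        dest = (key(rec) if key is not None else i) % n_procs
        buckets[dest].append(rec)
    return buckets
```

A call such as `partition(select(records, {"x": (2, 7)}), 2)` roughly corresponds to a SELECT with a range predicate on `x` whose results are spread over two compute nodes. Passing a `key` function models application-specific partitioning, for example sending all records from the same spatial block to the same processor.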

Mobius is a middleware framework designed for efficient metadata and data management in dynamic, distributed environments. Mobius provides a set of generic services and protocols to support distributed creation, versioning, and management of database schemas, on-demand creation of databases, federation of existing databases, and querying of data in a distributed environment. Its services employ XML schemas to represent metadata definitions and XML documents to represent and exchange metadata instances.

These frameworks have been employed in a number of application projects, including a virtual microscope system for remote pathology, analysis of multi-terabyte datasets in oil reservoir management studies, and analysis of large scale biomedical image data.

Windows HPC Community Portal
A community Windows HPC website, founded in 1999 and funded by MSR, that provides information about the use of Windows HPC, user guides, and development and benchmarking data.

.NET/GRID development for engineering applications.

Examples include:

  • Engineering design optimisation portal
  • Wind Tunnel Grid project, using SQL Server, .NET Web Services and Globus.
  • Flight simulation and aircraft design web services development, e.g. www.futureflight.org & www.sfsim.com

Windows HPC Tools, Libraries and Information
Cornell Theory Center
This site is intended to provide a comprehensive list of all tools, libraries and information for Windows High-Performance Computing.

SkyServer.Org
This site is intended to support people wishing to create a SkyServer or SkyNode. It contains the latest MySkyServer downloads - see MySkyServer

Sloan Digital Sky Survey
This website presents data from the Sloan Digital Sky Survey, a project to make a map of a large part of the universe.

Microsoft Web Sites

.NET Framework Developer Center
Home for developer information on the .NET Framework on MSDN

Using the .NET Framework 
Explore in-depth technical information and get started building the next generation of applications and XML Web services using the .NET Framework

Patterns & Practices 
Patterns & practices provide proven architectures, production quality code, and lifecycle best practices. Microsoft patterns & practices guides contain specific recommendations illustrating how to design, build, deploy, and operate architecturally sound solutions to challenging business and technical scenarios. They offer deep technical guidance based on real-world experience that goes far beyond white papers to help enterprise IT professionals, information workers, and developers quickly deliver sound solutions.

Migrating Applications to .NET 
Help on migrating applications

Writing High-Performance Managed Applications : A Primer
Learn about the .NET Framework's Common Language Runtime from a performance perspective. Learn how to identify managed code performance best practices and how to measure the performance of your managed application.

.NET Performance Center
Contains information on logging, tracing, profiling and other diagnostic techniques for analyzing and monitoring your .NET applications.

Microsoft Enterprise Instrumentation Framework 
The Microsoft Enterprise Instrumentation framework (EIF) enables applications built on the .NET Framework to be instrumented for manageability in a production environment. This framework provides an extensible event schema and unified API which leverages existing eventing, logging and tracing mechanisms built into Windows, including WMI, the Windows Event Log, and Windows Event Tracing. An application instrumented with this framework can publish a broad spectrum of information such as errors, warnings, audits, diagnostic events, and business-specific events. In addition, Enterprise Instrumentation enables tracing by business-process or application service, and can provide statistics such as average execution time for a given process or service.

SQL Server 2005 
Information on the new and enhanced features available in SQL Server 2005, formerly codenamed "Yukon".

Data Access and Storage Developer Center 
Portal site to provide you with the technical information you need to create solutions for storing and manipulating data

Windows HPC site
This site discusses Microsoft HPC solutions and demonstrates how Windows Server 2003 is the premier platform for customers seeking performance, scalability, and reliability.

Microsoft Research Downloads 
Demos & Downloads from various groups/projects in Microsoft Research - contains TerraService code

Web Services Developer Center
Portal site for information on XML Web Services

Advanced Web Services Developer Center
Information on how Web services will evolve over time. The Web services architecture defines a framework that augments the basic Web service with generic higher-level services like security, reliability, and transactions, which are required by many distributed applications and are not specific to a particular problem domain.

Web Services Enhancements (WSE)
Web Services Enhancements for Microsoft .NET (WSE) is a supported add-on to Microsoft Visual Studio .NET and the Microsoft .NET Framework that provides developers with the latest advanced Web services capabilities to keep pace with the evolving Web services protocol specifications.

Indigo
Indigo is a new breed of communications infrastructure built around the Web services architecture. Advanced Web services support in Indigo provides secure, reliable and transacted messaging along with interoperability. Indigo's service-oriented programming model is built on the .NET Framework and simplifies development of connected systems. Indigo unifies a broad array of distributed systems capabilities in a composable and extensible architecture, spanning transports, security systems, messaging patterns, encodings, network topologies and hosting models. Indigo will be an integral capability of Windows Longhorn and will also be supported on Windows XP and Windows Server 2003.

MSDN Code Sample Center
A centralized location that provides links to code samples and sample applications for developer-related products and technologies.

GotDotNet: The Microsoft .NET Framework Community
.NET Framework Community Site - includes user sample code, workspaces, articles

Channel 9 MSDN Developer Blog site

Relevant Publications

"There Goes the Neighborhood: Relational Algebra for Spatial Data Search", Alexander S. Szalay, Gyorgy Fekete, Wil O'Mullane, Aniruddha R. Thakar, Gerd Heber, Arnold H. Rots, MSR-TR-2004-32, April 2004

"The Revolution in Database Architecture," Extended abstract of keynote talk at ACM SIGMOD 2004, Paris, France, June 2004, also MSR-TR-2004-31, March 2004

"Extending the SDSS Batch Query System to the National Virtual Observatory Grid", Maria A. Nieto-Santisteban, William O'Mullane, Jim Gray, Nolan Li, Tamas Budavari, Alexander S. Szalay, Aniruddha R. Thakar, MSR-TR-2004-12, February 2004

"Scientific Data Federation", J. Gray, A. S. Szalay, The Grid 2: Blueprint for a New Computing Infrastructure, I. Foster, C. Kesselman, eds, Morgan Kaufmann, 2003, pp 95-108.

"Data Mining the SDSS SkyServer Database," J. Gray, A.S. Szalay, A. Thakar, P. Kunszt, C. Stoughton, D. Slutz, J. vandenBerg, Distributed Data & Structures 4: Records of the 4th International Meeting, pp 189-210, W. Litwin, G. Levy (eds), Paris, France, March 2002, Carleton Scientific 2003, ISBN 1-894145-13-5, also MSR-TR-2002-01, Jan. 2002

"The Sloan Digital Sky Survey Science Archive: Migrating a Multi-Terabyte Astronomical Archive from Object to Relational DBMS", A.R. Thakar, A.S. Szalay, P.Z. Kunszt, J. Gray, May 2003, Computing in Science and Engineering, V5.5, Sept 2003, IEEE Press, pp. 16-29

"SkyQuery: A Web Service Approach to Federate Databases", T. Malik, A.S. Szalay, T. Budavari, A.R. Thakar, 2003, Proceedings of CIDR 2003 (Conference on Innovative Data Systems Research 2003), Asilomar CA.

"Spatial Clustering of Galaxies in Large Datasets," Alexander S. Szalay, Tamas Budavari, Andrew Connolly, Jim Gray, Takahiko Matsubara, Adrian Pope and Istvan Szapudi, SPIE Astronomy Telescopes and Instruments, 22-28 August 2002, Waikoloa, Hawaii.

"Web Services for the Virtual Observatory," Alexander S. Szalay, Tamas Budavari, Tanu Malik, Jim Gray, and Ani Thakar, SPIE Astronomy Telescopes and Instruments, 22-28 August 2002, Waikoloa, Hawaii.

"Petabyte Scale Data Mining: Dream or Reality?," Alexander S. Szalay; Jim Gray; Jan vandenBerg, SPIE Astronomy Telescopes and Instruments, 22-28 August 2002, Waikoloa, Hawaii.

"Online Scientific Data Curation, Publication, and Archiving," Jim Gray; Alexander S. Szalay; Ani R. Thakar; Christopher Stoughton; Jan vandenBerg, SPIE Astronomy Telescopes and Instruments, 22-28 August 2002, Waikoloa, Hawaii.

"The World Wide Telescope: An Archetype for Online Science," Jim Gray; Alex Szalay, Microsoft Research TR 2002-75, pp 4, CACM, Vol. 45, No. 11, pp. 50-54, Nov. 2002

"The World Wide Telescope," Szalay, A.S., Gray, J., Science, V.293, pp. 2037-2038, 14 Sept 2001. (MSR-TR-2001-77)

"Large Databases in Astronomy," Szalay, A.S., Gray, J., Kunszt, P., Thakar, A. and Slutz, D., Mining the Sky, Proceedings of MPA/ESO/MPE workshop, Springer, pp. 99-118 (2001).