External Research & Programs
eScience  

Awarded Projects in eScience

Learn more about the eScience projects awarded by Microsoft Research.


2004 Microsoft Research eScience and SciData Awards

eScience Projects

Advanced Biomedical Computing Systems for Cancer Research
May Dongmei Wang, Georgia Institute of Technology and Emory University

According to the American Cancer Society, almost 560,000 people died of cancer in 2002. A significant portion of this mortality could be prevented if cancer could be diagnosed and treated at an early stage. The identification of cancer molecular signatures is a key step for early detection and diagnosis and will provide molecular targets for new approaches in prevention and therapy. However, a major roadblock is the lack of computational tools that can integrate and interpret the large amounts of molecular and anatomical cancer data.

In collaboration with Winship Cancer Institute, my group is developing a computation-based cancer research system. The system consists of databases, cluster-based computing, and immersive visualization. With this system, we will be able to integrate large amounts of genomic, proteomic, and molecular/organ imaging data obtained from cultured cancer cells, clinical tissue specimens, and solid tumors to analyze and guide clinical cancer research.

The long-term goal is to develop biocomputing technologies for “individualized oncology” in which individual cancer patients can be diagnosed and treated based on their unique genetic profiles. The practical outcome of this research is an advanced software system that takes advantage of leading-edge computation and visualization tools and is available for immediate commercialization.

http://www.miblab.gatech.edu

Parallel Analysis and Visualization of Astronomical Data in SQL Databases
Thomas R. Quinn, University of Washington

The retrieval and analysis of astronomical data from archives and catalogs worldwide is being facilitated by the development of the US National Virtual Observatory. Making the data available via SQL Server, as has been done with the Sloan Digital Sky Survey/SkyServer project, has not only enhanced scientific productivity but has also made state-of-the-art astronomical data accessible to educators and the general public.

Many interesting questions can be easily posed as an SQL query (e.g., find all galaxies with a certain color); however, many of the questions that scientists ask of this data cannot be so easily expressed (e.g., find all galaxies with another galaxy nearby) or require significant computational and communication resources for which the database is not, in general, optimized. We have developed a framework for parallel analysis and visualization of astrophysical simulation data on compute clusters. It is designed to interactively perform computationally intensive analysis on the large datasets produced by massively parallel simulations. We will extend the capabilities of this tool to interface with SQL databases to allow parallel analysis of any dataset (such as the SDSS) hosted on SQL Server. In doing so, we will combine the accessibility, interoperability, and ease of use of an SQL database with the computational power of cluster computing, so that more sophisticated and computationally intensive analysis can be easily performed on the data. Although our immediate target is astronomical datasets, the tool could be used for any scientific dataset. The resulting tool will be distributed under an Open Source license.
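The contrast between the two kinds of question can be made concrete with a small sketch. It is illustrative only: it uses Python's built-in sqlite3 module rather than SQL Server, and the table name `galaxy`, its columns, and the flat-sky distance approximation are all assumptions made for the example, not the SDSS schema.

```python
import sqlite3


def build_catalog(rows):
    """Create an in-memory catalog; the schema here is hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE galaxy ("
        "id INTEGER PRIMARY KEY, ra_deg REAL, dec_deg REAL, g REAL, r REAL)"
    )
    conn.executemany("INSERT INTO galaxy VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def red_galaxies(conn, min_color):
    """Easy case: a color cut is a predicate on a single row."""
    cur = conn.execute(
        "SELECT id FROM galaxy WHERE g - r > ? ORDER BY id", (min_color,)
    )
    return [row[0] for row in cur]


def galaxies_with_neighbor(conn, radius_deg):
    """Harder case: a self-join that compares every pair of rows.

    Uses a flat-sky approximation for brevity; without spatial indexing
    the cost grows quadratically with catalog size.
    """
    cur = conn.execute(
        """
        SELECT DISTINCT a.id
        FROM galaxy AS a
        JOIN galaxy AS b ON a.id <> b.id
        WHERE (a.ra_deg - b.ra_deg) * (a.ra_deg - b.ra_deg)
            + (a.dec_deg - b.dec_deg) * (a.dec_deg - b.dec_deg) <= ? * ?
        ORDER BY a.id
        """,
        (radius_deg, radius_deg),
    )
    return [row[0] for row in cur]
```

The first query touches each row once, while the second compares pairs of rows, which is the kind of workload the abstract describes as poorly matched to a database that has not been optimized for it.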

http://hpcc.astro.washington.edu/nchilada/bin/view/Nchilada/WebHome

Pictorial Query Specification for Searching a Spatially Referenced Breast Cancer Image Database
Hanan Samet, University of Maryland

Breast cancer is one of the leading causes of death in women. Computer-aided detection (CAD) and pre-screening can be used to increase the effectiveness of radiologists and to avoid missed diagnoses. Alternative medical imaging approaches, such as ultrasound or MRI, could be more effective than mammography at detecting cancers or evaluating malignancy in certain groups of women. A large database of medical images with analysis is required to help train and test the CAD and pre-screening systems. A database with images from multiple technologies such as mammograms, MRI, and ultrasound will also enable research into the effectiveness and usefulness of each technique for cancer screening and the determination of malignancy. We propose to create and store this database using SQL Server and to use it to provide doctors with a Web-based query tool in which queries to access the data are specified pictorially. Our tool will enable users to easily create complex queries to find images with similar parts, characteristics, or sections by specifying the spatial interrelationships between the features, which involve both distance and direction. Our small pilot study using some of these features found 90 percent of the malignant cases, comparing favorably with radiologists and others in this field. With this system, we hope to improve on these results and be able to diagnose malignancy without surgery.

http://www.cs.umd.edu/~hjs/

A Comprehensive Protein Database Indexed by Spatial Motifs “MotifSpace”
Wei Wang, University of North Carolina at Chapel Hill

We propose to build and disseminate a comprehensive database of candidate spatial protein motifs based on our recently developed data mining algorithms. Motifs are recurring structural elements associated with specific protein functions. The classic experimental approach to finding spatial motifs is time consuming and labor intensive. We envision our database as a tool to accelerate this discovery process by orders of magnitude. Many scientific areas stand to benefit from the creation of a spatial motif database including pharmacology, biochemistry, genetics, and phylogenetics.

Our approach to identifying candidate spatial motifs searches for subgraphs that recur frequently within graph representations of proteins belonging to a given family but occur infrequently outside of the family. We will index each protein structure in the Protein Data Bank according to the candidate motifs it contains and the protein families each motif is common to. We will also construct indices to support queries of more general spatial properties and motif co-occurrence patterns within protein subsets. We will develop an extensive graphical query and visualization backend to our database that will allow for the visual inspection and comparison of proteins in the context of the associated motifs and protein peptide sequences. This interface will be implemented as a client-side application and developed using the Microsoft .NET Framework.
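The selection criterion and the index described above can be sketched as follows. This is a simplified illustration, not the project's mining algorithm: it assumes that subgraph mining has already reduced each protein to a set of opaque motif labels, and the function names, the `min_support` and `max_background` thresholds, and the dictionary layout are all hypothetical.

```python
from collections import defaultdict


def family_specific_motifs(protein_motifs, protein_family, family,
                           min_support, max_background):
    """Select motifs frequent inside `family` and infrequent outside it.

    protein_motifs: dict mapping protein id -> set of motif labels
    protein_family: dict mapping protein id -> family name
    min_support:    minimum fraction of family members containing the motif
    max_background: maximum fraction of non-members containing the motif
    """
    inside = [p for p, f in protein_family.items() if f == family]
    outside = [p for p, f in protein_family.items() if f != family]
    if not inside:
        return set()
    candidates = set().union(*(protein_motifs[p] for p in inside))
    selected = set()
    for motif in candidates:
        support = sum(motif in protein_motifs[p] for p in inside) / len(inside)
        if outside:
            background = (
                sum(motif in protein_motifs[p] for p in outside) / len(outside)
            )
        else:
            background = 0.0
        if support >= min_support and background <= max_background:
            selected.add(motif)
    return selected


def build_motif_index(protein_motifs):
    """Inverted index: motif label -> set of proteins that contain it."""
    index = defaultdict(set)
    for protein, motifs in protein_motifs.items():
        for motif in motifs:
            index[motif].add(protein)
    return dict(index)
```

With such an index, a co-occurrence query reduces to intersecting the protein sets of two motifs.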

A Parallel Cross-Match Engine for Astronomy
Maria Nieto-Santisteban, Johns Hopkins University

The project will develop a scalable SQL Server cluster capable of running parallel joins between very large catalogs in astronomical databases. As a proof of concept, we will cross-match existing catalogs with cardinalities of a billion rows, a task exceeding the capabilities of current tools. Data will be automatically partitioned across servers, so joins and scanning queries will be executed in parallel. We expect performance that scales linearly with the number of servers, up to the scale of Beowulf clusters. We will experiment with different interconnect mechanisms and partitioning approaches. Users will interact with this parallel framework through a Web services front end.
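A minimal sketch of how a partitioned cross-match might work is given below. It is an illustration under stated assumptions, not the project's design: catalogs are lists of hypothetical `(id, ra_deg, dec_deg)` tuples, the partitioning is by declination zone (one of several possible schemes), and a thread pool stands in for separate servers.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def zone_of(dec_deg, zone_height):
    """Index of the declination stripe containing dec_deg."""
    return int(math.floor((dec_deg + 90.0) / zone_height))


def partition(catalog, zone_height):
    """Group (id, ra_deg, dec_deg) tuples by declination zone."""
    zones = {}
    for obj in catalog:
        zones.setdefault(zone_of(obj[2], zone_height), []).append(obj)
    return zones


def match_zone(zone, zones_a, zones_b, radius):
    """Match one zone of catalog A against the same and adjacent zones of B."""
    pairs = []
    for a in zones_a.get(zone, []):
        for neighbor in (zone - 1, zone, zone + 1):
            for b in zones_b.get(neighbor, []):
                if angular_sep_deg(a[1], a[2], b[1], b[2]) <= radius:
                    pairs.append((a[0], b[0]))
    return pairs


def cross_match(cat_a, cat_b, radius, zone_height=1.0, workers=4):
    """Cross-match two catalogs zone by zone; each zone is an independent task."""
    if radius > zone_height:
        raise ValueError("radius must not exceed zone_height")
    zones_a = partition(cat_a, zone_height)
    zones_b = partition(cat_b, zone_height)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(
            lambda z: match_zone(z, zones_a, zones_b, radius), sorted(zones_a)
        )
        return sorted(pair for chunk in chunks for pair in chunk)
```

Because the match radius is no larger than a zone, every candidate partner lies in the same or an adjacent zone, so the per-zone tasks share no state and could in principle be assigned to different machines.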

Although the focus will be on cross-matches, the system will be generic and capable of processing other astronomy inquiries. The research will build on existing elements, the SkyServer database, the OpenSkyQuery Web service-based cross-match engine, and the CasJobs workbench environment, all developed in collaboration with and supported by Microsoft Research. This framework will also benefit the 100 TB database system that we are building in collaboration with mechanical engineers to analyze numerical simulations of turbulent flows. Finally, a successful demonstration of such a parallel system opens up new possibilities in tackling similar database intensive problems in other scientific areas like particle physics and genomics.

SQL.CT: Using Database Systems for Remote, Web-Based Visualization of Tomographic Data
Julian Humphries, Greg Lavender, and Shirley Cohen, The University of Texas at Austin

We propose to build a prototype system that illustrates the benefits of combining database systems and volume rendering visualization for tomographic data. The goal of the project is to demonstrate how the organizational, indexing, and parallelism capabilities of a database system can optimize the overall rendering process. The work is driven by the genuine scientific problems of our Digimorph user community and has potential for widespread dissemination within the medical imaging, paleontology, and computational fluid dynamics communities.

PetaByte Data Management and Analysis Services for Data-Driven Science
Johannes Gehrke, Cornell University

We propose to implement data management and analysis services for two scientific applications: a large-scale astronomical survey and high-energy physics. In both cases the amount of data is very large (petabytes and hundreds of terabytes, respectively), and there is a large user community that depends on fast access to its data.

We will implement a new system for these users using Microsoft tools: SQL Server relational database with middleware using ASP.NET and a Web Services interface. We have promising initial results, and there is complementary funding for infrastructure for the astronomy project. With funding from this grant we could fully develop the required infrastructure for data management and analysis, making the system available for scientific analysis. Success of this project will demonstrate to scientists the benefits of designing and implementing their systems using Microsoft tools.

.NET-enabled Data Management for Adaptive Parallel Numerical Applications on Computational Grids
Sathish Vadhiyar, Indian Institute of Science

Due to the large load dynamics and high fault rates of Grid resources, it is necessary to migrate parallel numerical applications executing over the Grid to different sets of resources during the various stages of application execution, both to withstand faults and to make use of “better” resources. When a parallel problem is remotely invoked, the input data for the problem is staged from the Web client’s host to the remote computational resources. Similarly, when an executing parallel application is migrated across different Grid resources during various stages of its execution, the intermediate data corresponding to those stages is stored across different Grid resources. It is necessary to manage this input and intermediate data efficiently and to reuse it for future problem invocations and when the application is migrated across system resources. In .NET-DapPerAppCG, a comprehensive set of solutions will be developed to efficiently manage and utilize input and intermediate parallel data for adaptive parallel Web services executing on computational Grid resources.

Various .NET mechanisms will be utilized to achieve the goals of the project, namely UDDI for managing and discovering parallel data distributed across Grid resources, Microsoft SQL Server for maintaining metadata about the scattered data segments, and SOAP-based protocols for remote data staging. Since large data sizes are considered the major bottleneck in the performance of current scientific applications, the solutions developed in the project will significantly reduce or eliminate data movement costs, thereby improving the overall performance of the parallel Web services.

Sangam
Shahram Ghandeharizadeh, University of Southern California (USC)

Anxiety (or stress) disorders are the most common mental illness in America, affecting about 19.1 million American adults. Some types of these disorders can be associated with other illnesses, such as eating disorders, depression, or drug dependency. Many scientists devote considerable energy to understanding the causal mechanisms underlying this important clinical problem, as is apparent from communities such as the Endocrine Society, which currently numbers 11,000 members in 80 countries. Such scientists have obtained much data suggesting that anxiety disorders are caused by dysfunction within specific brain circuits, but the precise relationships between these circuits and the way in which they are recruited by stress signals are unclear. Understanding this is critical for treating stress disorders.

http://dblab.usc.edu/Sangam

The Gateway to Biology Pathways
Keyuan Jiang, Purdue University Calumet

The functional interactions of biological components are conceived in the form of networks of biological pathways. The databases of known biological pathways provide an invaluable source of information for elucidating biological activities in a living organism. However, the lack of a unified way to query pathways poses a challenge to biologists in interpreting pathway data. In this eScience Applications project, we propose to develop a Web application, called “The Gateway to Biological Pathways,” to aggregate and unify the existing pathway databases and provide Web services for querying the aggregated datasets based upon the open standard for pathway data interchange BioPAX Level 1. A desktop application will be created to consume the Web services and generate graphical views of biological pathways.

Web Service Multimodal Tools for Strategic Biodiversity Research, Assessment and Monitoring
Claudia Bauzer Medeiros, UNICAMP

This is a joint proposal from Computer Science and Biodiversity researchers at the University of Campinas (UNICAMP), Brazil. Its goal is to provide scientists who work on biodiversity issues with a system that supports exploratory queries over heterogeneous biodiversity data sources. These sources include images (e.g., photographs of living beings or their habitats), geographic data (e.g., maps of the regions where these beings have been found), ontologies, and domain-specific metadata (e.g., habitat and ecosystem descriptions).

Unlike other biodiversity data management systems, queries will be multimodal, supporting content-based predicates (typical of image databases), spatial predicates (as in geographic databases), and traditional textual predicates (over data, ontologies, and metadata). Data sources will be accessed via Web services, whereas query formulation, pre-processing, and result visualization will run as a client-side application.

System development will take advantage of previous experience acquired in software and database development for biodiversity projects. Some of the proposed functionalities have already been tested in prototypes, in research and teaching environments. Though a generic system, it will be validated with data on Brazilian butterfly and fly species from UNICAMP’s Institute of Biology.

http://www.lis.ic.unicamp.br/projects/webios/

SciData Projects

Dynameomics: Internet Database and Web Portal for Molecular Dynamics Simulations of Proteins
Valerie Daggett, University of Washington

Publicly accessible scientific databases are essential for the dissemination and sharing of cutting-edge knowledge and information obtained from experiment and theory. In our field of protein science, the Protein Data Bank (PDB) has been a tremendously useful repository of experimentally derived, static protein structures that have stimulated many important scientific discoveries. While the utility of static physical representations of proteins is not in doubt, these molecules are fluid in vivo, so there is a larger universe of knowledge to be tapped regarding the dynamics of proteins. We propose to construct a complementary database composed of molecular dynamics (MD) structures for representatives of all protein folds, an effort we are calling dynameomics. We are simulating the native (biologically active) state and complete unfolding pathways by MD, the time-dependent integration of the classical equations of motion for molecular systems. Of the approximately 1130 known non-redundant folds, we have simulated the first 30, which represent about 50% of all proteins, using symmetric multiprocessing (SMP) high-performance computing (HPC) clusters. With the results of these simulations stored in an appropriate form, we and others will be able to quickly and easily mine for the patterns and general features of protein dynamics, folding, and unfolding. The resulting information will be used to improve protein structure prediction algorithms and drug design methods.
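To illustrate what "time-dependent integration of the classical equations of motion" means in practice, the sketch below integrates a single particle in one dimension. It is a toy example under stated assumptions: the abstract does not name an integrator, so velocity Verlet is used here simply because it is a common choice, and the harmonic force, function names, and parameters are hypothetical.

```python
def velocity_verlet(x, v, force, mass, dt, steps):
    """Integrate m * d2x/dt2 = force(x) and return the (x, v) trajectory."""
    a = force(x) / mass
    trajectory = [(x, v)]
    for _ in range(steps):
        # Advance the position using the current velocity and acceleration.
        x = x + v * dt + 0.5 * a * dt * dt
        # Recompute the acceleration at the new position.
        a_new = force(x) / mass
        # Advance the velocity with the average of old and new accelerations.
        v = v + 0.5 * (a + a_new) * dt
        a = a_new
        trajectory.append((x, v))
    return trajectory


def total_energy(x, v, mass, k):
    """Kinetic plus potential energy of a harmonic oscillator with spring constant k."""
    return 0.5 * mass * v * v + 0.5 * k * x * x
```

A real MD simulation applies the same update to every atom in three dimensions, with forces derived from a molecular force field; the stored sequence of positions is the kind of trajectory data such a database would hold.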

http://depts.washington.edu/daglab

Large-Scale Integration of Different Data Modalities for Computational Medical Sciences
Mark Garbey, University of Houston

The main objective of this proposal is to build an infrastructure to serve a community of users with interests in biomedical data processing. The philosophy of this project is based on two premises: (a) data analysis takes priority over computation, which can be provided by other existing infrastructures, and (b) a common software environment will facilitate our work and speed up our research by merging several types of data into a common framework.

Our goal is to merge three main channels of information, coming from a (stereo) video camera, state-of-the-art infrared thermal imaging, and dense-array electroencephalography, in order to achieve breakthroughs in the computational tracking of human learning, the monitoring of human physiology at a distance, and multimodal face recognition and facial expression analysis, and eventually to provide a software environment that enhances the analysis of human behavior.

SCORM Public-Access Repository
Carlos Alberto Cobos Lozada, Universidad del Cauca

The main goal of this project is to promote the use of online education in Colombia by offering a repository of sharable content objects that can be managed through the Internet. The metadata, based on SCORM 1.2, will be stored in a SQL Server database, the contents will be stored in an IIS virtual directory, and we will use Visual Studio .NET to develop the interfaces for accessing and manipulating the repository. One of the interfaces will be a set of XML Web Services, so that the repository can be accessed from any Learning Management System (LMS); the others will be Web and Windows Pocket PC interfaces.

http://www.prometeo.unicauca.edu.co/scorm

OpenArXiv = arXiv + RDBMS + Web Services
Dongwon Lee, Penn State University

The arXiv is a popular scientific digital library, and it has been the major forum for dissemination of scientific results in disciplines such as Physics, Mathematics, Nonlinear Sciences, Computer Science, and Quantitative Biology. It currently contains about 300,000 scientific publications in various formats (e.g., PS, PDF, DOC, TEX).

The OpenArXiv project aims to significantly improve the arXiv digital library in two ways: (1) by exploiting the state-of-the-art database techniques available in Microsoft SQL Server, we will build a large-scale scientific digital library solely using an RDBMS, and (2) by utilizing the standard XML-based Web Services paradigm and the Microsoft .NET Framework, we will build a programmable interface to arXiv so that not only human users but also software agents can freely access the contents of arXiv in many applications.

http://openarxiv.ist.psu.edu/

Migrating E-Transit Databases and Web Services to a TerraService Model
Uma Shama and Lawrence J. Harman, Bridgewater State College

This project provides an opportunity to use the latest hardware and software capabilities to design Web mapping services that will accommodate increasing demand and increase the productivity of consumers of Web-based public services. The Internet-based mapping applications accomplished under the umbrella of www.e-transit.org have become increasingly useful to transportation coordinators throughout the Commonwealth of Massachusetts as they assist consumers in finding “transit first” solutions to jobs and job training. On Cape Cod, the Internet-based transit planner with real-time bus mapping and estimated time of arrival (ETA) prediction has shown significant growth in usage by local, national, and international tourists attempting to navigate the transit system of this popular transit destination. Lastly, the GeoGraphics Lab developed a mapping application using historic maps and historic and current aerial photography to monitor the changes in land use associated with the nation’s largest public works project: the Central Artery/Tunnel project in Boston. As these Web-based mapping and database applications gain national attention and increasing utility, it is important to maintain the state of the art in hardware and software capability to ensure that the consumer doesn’t experience degraded system performance. The objective of this research is to develop and deploy Microsoft SQL Server, MapPoint Web Service, XML, SOAP, and .NET-based e-transit applications that assist low-income job seekers and/or persons with disabilities in accessing transit information on the Internet, and to upgrade the Central Artery/Tunnel transportation and land development Web application entitled “Boston through Time.”

www.e-transit.org

Web Service Access to Streaming NEXRAD Level II Radar Data
Beth Plale, Indiana University

Over the past 7-8 years, the weather research community has built a sophisticated network of sensing devices for gathering information about weather conditions. For instance, the Collaborative Radar Acquisition Field Test (CRAFT) established the first distribution network for WSR-88D (NEXRAD) Level II radar data. Today its successor, Integrated Radar Data Services (IRaDS), gathers and disseminates data from 130 radar sites worldwide over Internet2. The Unidata IDD data dissemination system involves 150 research institutions in the dissemination of 30 different weather product types, including upper air balloon data, satellite data, and gridded model data.

The weather data from these and other sources is generally open and available to the public and has been for some time, yet it remains obscure to the community of severe storm researchers beyond a couple of sophisticated centers. We believe one reason for this is difficulty of access. Access generally requires highly skilled and knowledgeable experts deploying customized programs to convert the data from specialized formats before it can be analyzed and understood. Linked Environments for Atmospheric Discovery (LEAD), an NSF-funded large-scale ITR project building cyberinfrastructure for severe storm forecasting, aims to improve access through a grid service architecture that gives the severe storm researcher and educator access to data products, services, and processes. Plale is a PI on LEAD and Alameda is a Senior Scientist.

But we believe access to the streaming data, such as the Level II radar data, could take a revolutionary step forward by using database technology to store and serve the streaming data. Atmospheric science data gathered by weather researchers has not traditionally resided in databases. We believe this has to do with the timeliness demands on data use and with the continuous generation of voluminous amounts of data, since the atmosphere is continuously scanned, sensed, and monitored in real time for weather phenomena. User interest is highest in the most recently generated data, that is, data less than 10 minutes old, but drops off rapidly as the data ages. Once data ages beyond use in a current forecast, interest tends to cluster around the formation and occurrence of severe storms. Longitudinal studies that span a block of time are rarer, but we believe interest in longitudinal studies would increase if access to the data were improved. While rapid generation rates of voluminous amounts of data may have deterred people from database solutions in the past, high-performance database and server technology exists today at a price point that makes such solutions more feasible.

©2008 Microsoft Corporation. All rights reserved. Terms of Use |Trademarks |Privacy Statement