Microsoft Research hosted a two-day e-science workshop on Thursday,
October 6, 2005 and Friday, October 7, 2005 in Redmond, Washington. This
workshop was a follow-on to the successful SciData 2004 Workshop.
The eScience Workshop provided a unique opportunity to learn about and influence what is
happening in the realm of data intensive scientific computing within Microsoft.
Attendees learned first hand from early adopters using Microsoft Windows,
Microsoft .NET, Microsoft SQL Server, and Web services in these problem spaces,
and explored in depth how modern database technologies and techniques are
being applied to scientific computing. By providing a forum for scientists and
researchers to share their experiences and expertise with the wider academic and
research communities, this workshop fostered collaboration, facilitated the
sharing of software components and techniques, and established a vision for
Microsoft Windows and .NET in data intensive scientific computing.
|
|

|
Thursday,
October 6, 2005 |
8:35-9:00 |
Welcome
Dan Fay and Jim Gray, Microsoft Research
Webcast:
eScience The Revolution Is Starting |
9:00-10:30 |
Web Services |
|
Predicting
Tornados with Data Driven Workflows: Building a Service
Oriented Grid Architecture for Mesoscale Meteorology
Research
Dennis Gannon, Indiana University
Each year the loss of life and property due to mesoscale
storms such as tornados and hurricanes is substantial.
The current state of the art in predicting severe storms
like tornados is based on static, fixed resolution
forecast simulations and storm tracking with advanced
radar. This approach cannot predict these
storms with the accuracy needed to save lives. What is
needed is the ability to do on-the-fly data mining of
instrument data and to use the acquired information to
launch adaptive workflows that can dynamically marshal
resources to run ensemble simulations on demand. These
workflows need to be able to monitor the simulations
and, where possible, retarget radars to gather more data
to initialize higher resolution models that can focus
the predictions. This scenario is not possible now, but
it is the goal of the NSF LEAD project. To address these
problems we have built a service oriented architecture
that allows us to dynamically schedule remote data
analysis and computational experiments. The Grid of
resources used includes machines at Indiana, Alabama,
NCSA, Oklahoma, UNC, and UCAR/Unidata, and soon TeraGrid.
The user's gateway to the system is a Web portal and a
set of desktop client tools. Five primary persistent
services are used to manage the workflows: a metadata
repository called MyLEAD that keeps track of each user's
work, a WS-Eventing-based pub-sub notification system, a
BPEL-based workflow engine, a Web service registry for
soft-state management of services, and the portal server.
An application factory service is used by the portal to
create transient instances of the data mining and
simulation applications that are orchestrated with the
BPEL workflows. As the workflows execute, they publish
status metadata via the notification system to the user's
MyLEAD space. The talk will present several open
research challenges that are common to many e-Science
efforts.
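The notification pattern in this abstract, where running workflows publish status metadata to a per-user space, can be illustrated with a minimal in-process sketch. It is a toy under stated assumptions, not the LEAD implementation: the names `NotificationBroker`, `MetadataSpace`, and `run_workflow`, the topic strings, and the two workflow steps are all hypothetical, and plain Python objects stand in for the WS-Eventing service and the MyLEAD repository.

```python
from collections import defaultdict


class NotificationBroker:
    """Toy publish-subscribe broker (hypothetical stand-in for a WS-Eventing service)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every callback registered for this topic.
        for callback in self._subscribers[topic]:
            callback(message)


class MetadataSpace:
    """Toy per-user metadata store (hypothetical stand-in for a MyLEAD space)."""

    def __init__(self, user):
        self.user = user
        self.records = []

    def record(self, message):
        self.records.append(message)


def run_workflow(broker, user, steps):
    """Run each step in order and publish a status event after it finishes."""
    topic = f"workflow/{user}"
    for name, step in steps:
        result = step()
        broker.publish(topic, {"step": name, "status": "done", "result": result})


broker = NotificationBroker()
space = MetadataSpace("alice")
broker.subscribe("workflow/alice", space.record)

# Hypothetical two-step workflow: mine instrument data, then run a simulation.
steps = [
    ("data_mining", lambda: "storm signature found"),
    ("simulation", lambda: "forecast ready"),
]
run_workflow(broker, "alice", steps)
```

After the run, `space.records` holds one status entry per step, in execution order. The workflow never references the metadata store directly, which is the decoupling a pub-sub layer is meant to provide.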
Webcast:
Predicting Tornados with Data Driven Workflows: Building a Service Oriented Grid Architecture for Mesoscale Meteorology Research |
|
Exposing
the National Water Information System to GIS Through Web
Services
Jonathan Goodall, Duke University
The National Water Information System (NWIS) is a
hydrology data repository with stream flow, water
quality, and groundwater observations maintained by the
United States Geological Survey (USGS). The database
includes 1.5 million monitoring stations in the United
States and Puerto Rico, some with nearly 100 years of
data. A Web service was developed using Visual Basic
.NET to better expose this national-scale data resource
to client applications within the hydrologic community.
One such client application is an extension to ArcMap
that was developed by the author for plotting
time series and performing basic water balance analysis
within a mapping environment. The plotting extension was
originally created to read from local databases,
requiring the user to manually download time series and
format them into a certain database structure. Now that
the software has been extended to consume the NWIS Web
service, it is possible to create “on-the-fly” plots of
hydrologic observations for any station within the nation.
Webcast:
Exposing the National Water Information System to GIS Through Web Services |
|
Grid
Computing Using .NET
Marty Humphrey, University of Virginia
The broad goal of our WSRF.NET project at the University
of Virginia is to facilitate Grid computing on the .NET
platform. In this talk, we give an update on our
progress in exploiting and extending the .NET/Windows
platform for Grid Computing - including our recent
support for GridFTP on .NET, an OGSA-based Authorization
Service based on Windows, and our alternative software
stack for OGSA-based grids (based on WS-Transfer, WS-Eventing,
etc.). The talk culminates with a live demo of how we
have integrated this support for Grids on .NET/Windows
with the Globus toolkit to form the basis for the UVa
Campus Grid (UVaCG).
Webcast:
Grid Computing Using .NET
|
10:30-11:00 |
Break |
11:00-12:30 |
Web Services & Computations |
|
Web
Services for HPC — Making Seamless Computing a Reality
David Lifka, Cornell Theory Center
Seamless HPC has been a goal of Computer and
Computational Scientists for over a decade. Allowing
researchers to focus on their research and not the
quirks of complex HPC environments has been a dream
waiting for a solution. Today the solution exists but
many still don't know how to apply the tools (Web
Services and SQL databases) to the problems. This talk
will discuss several applications of Web Services and
SQL databases to real world HPC applications including
solutions for embarrassingly parallel tasks, wide area
distributed computing applications, and eScience/Data
Intensive Computing. Real examples of applied solutions
for each of these HPC problems will be presented.
Examples of programming techniques to support the use of
Web services installed on hundreds of distributed
computing and data resources will also be presented.
Webcast:
Web Services for HPC — Making Seamless Computing a Reality |
|
Computational Data Grid for Scientific and Biomedical
Applications
Marc Garbey and
Victoria Hilford, University of Houston
The goal of this project is to develop a Microsoft
Windows-based Computer Grid infrastructure that will
support high performance scientific computing and
integration of multi-source biometric applications. The
University of Houston Microsoft Windows-based Computer
Grid (WING) includes not only the Computer Science and
the Technology Department networks, but also includes
nodes in China, Germany, and several other countries.
The total amount of available storage exceeds 4
Terabytes. Four specific biomedical applications
developed at University of Houston are the basis of this
project:
- Computational tracking of Human Learning using
Functional Brain Imaging
- Monitoring Human Physiology at a Distance by using
Infrared Technology
- Multimodal Face Recognition and Facial Expression
Analysis
- Relating Video, Thermal Imaging, and EEG Analysis —
integrate and analyze simultaneously recorded brain
activity, infrared images, and 3D video
This Biomedical Data Grid project meets the following
technical requirements:
- Rapid application development (use of the Microsoft
Visual Studio .NET technology)
- Visual modeling interfaces (forms driven Graphical
User Interfaces)
- Database Connectivity (interface with Microsoft SQL
Server 2005)
- Query support (clients can store, update, delete,
retrieve database metadata)
- Context-sensitive, role-based access (Microsoft
Windows Server 2003, ASP.NET)
- Robust security (HIPAA compliance through
Microsoft's Authentication and Authorization from IIS
and ASP.NET)
- Connectivity to other biomedical resources (PACS,
DICOM, XML)
The Biomedical Data Grid application is developed
using Microsoft Windows Server 2003, Microsoft Virtual
Server 2005, Microsoft Visual Studio .NET Beta 2, and
Microsoft SQL Server 2005. A Web client will be able
to securely upload biomedical files to a Web server
while metadata related to these files will be stored in
the SQL Server 2005 database for the purpose of
querying, data mining, etc. Post-processing and
simulation steps on biomedical data will use a
Master node Web Service that automatically distributes a
large set of parameter or sensitivity analysis tasks to
Slave nodes on the Computing Grid. We will give an
overview of our project and provide a few examples of
our biomedical applications.
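The distribution step described above, in which a master node hands a large set of parameter tasks to other nodes, can be sketched in a few lines. This is only an illustration under assumptions: a local thread pool stands in for Web service calls to remote grid nodes, and `simulate` and `distribute` are hypothetical names, not part of the project.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(parameter):
    """Hypothetical stand-in for one post-processing or simulation task."""
    return parameter ** 2


def distribute(parameters, worker_count=4):
    """Master role: hand each parameter to a worker and collect results in input order."""
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        return list(pool.map(simulate, parameters))


# A small parameter sweep; a real sensitivity analysis would use many more values.
sweep = [0.5, 1.0, 1.5, 2.0]
results = distribute(sweep)
```

Because each task depends only on its own parameter, the tasks can run on any node in any order; `pool.map` returns the results in the order the parameters were submitted.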
Webcast:
Computational Data Grid for Scientific and Biomedical Applications |
|
SETI@home
and Public Participation Distributed Computing
Dan Werthimer, University of California at Berkeley
Werthimer will discuss the possibility of life in the
universe and the search for radio signals from other
civilizations. SETI@home analyzes data from the world's
largest radio telescope using desktop computers from
five million volunteers in 226 countries. SETI@home
participants have contributed two million years of
computer time and have formed Earth's most powerful
supercomputer. Users have the small but captivating
possibility that their computer will detect the first signal
from a civilization beyond Earth. Werthimer will also
discuss plans for
future SETI experiments, petaop signal processing, and
open source code for public participation distributed
computing (BOINC, the Berkeley Open Infrastructure for
Network Computing).
Webcast:
SETI@home and Public Participation Distributed Computing |
12:30-2:00 |
Lunch and
Tools Discussion — Engineering tools for eScience
Simon Cox, University of Southampton |
2:00-3:30 |
Applications #1 |
|
Making
NEXRAD Precipitation Data Available to the Hydrology
Community
Tomislav Urban, University of Texas at Austin
Next Generation Doppler Radar (NexRad) has made it
possible to collect high-resolution precipitation
data across the country that is of high value to
hydrologists studying, among other things, flooding,
evaporation, and drought. These data, however, are
available only in non-standard binary formats and in file
structures not conducive to the types of queries
typically performed in the domain, so they have been
difficult to integrate into the hydrologist's research. For
example, a hydrologist may seek data
for a single variable over a small geographical entity
for a fairly long temporal extent, on the order of
months or years, whereas the files are
typically available for one-hour periods extending over
a large region or even over the entire country. Since
the level of IT support for these researchers can be
low, this has presented an impediment to the ready
access to NexRad data. This project seeks to provide
simple Web application- and Web service–based access to
these data in whatever spatial and temporal extents are
most convenient to the user. By storing the data in SQL
Server, we are able to quickly generate output files for
precisely the variables, geographies, times and formats
that are required. Additionally, as we are using the
ArcHydro schema developed by our partners at the Center
for Research in Water Resources (CRWR), also at the
University of Texas, the data can be easily output as
geo-referenced points or polygons allowing the user to
bring to bear an array of GIS-based analytic tools
already generally available. Looking ahead, we see this
collection growing into a major repository of
hydrology-related data including stream and rain gauge
point data and water quality.
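The mismatch described in this abstract, where files arrive one hour at a time for a whole region while a hydrologist wants one location over a long period, is what loading the data into a relational table resolves. The sketch below illustrates the idea under assumptions: Python's built-in `sqlite3` stands in for SQL Server, the table and column names are hypothetical and unrelated to the ArcHydro schema, and the rainfall values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE precipitation (
        hour    TEXT    NOT NULL,  -- timestamp of the hourly file
        cell_id INTEGER NOT NULL,  -- hypothetical grid-cell identifier
        rain_mm REAL    NOT NULL,
        PRIMARY KEY (cell_id, hour)
    )
    """
)

# Each "file" covers every cell for one hour, mirroring the layout described above.
hourly_files = {
    "2005-06-01T00:00": {101: 0.0, 102: 1.2, 103: 0.4},
    "2005-06-01T01:00": {101: 0.5, 102: 2.0, 103: 0.0},
    "2005-06-01T02:00": {101: 1.5, 102: 0.3, 103: 0.0},
}
for hour, grid in hourly_files.items():
    conn.executemany(
        "INSERT INTO precipitation (hour, cell_id, rain_mm) VALUES (?, ?, ?)",
        [(hour, cell_id, rain) for cell_id, rain in grid.items()],
    )


def series_for_cell(cell_id, start, end):
    """Return (hour, rain_mm) rows for one cell over an inclusive time range."""
    rows = conn.execute(
        "SELECT hour, rain_mm FROM precipitation"
        " WHERE cell_id = ? AND hour BETWEEN ? AND ? ORDER BY hour",
        (cell_id, start, end),
    )
    return rows.fetchall()


series = series_for_cell(101, "2005-06-01T00:00", "2005-06-01T02:00")
```

Putting `cell_id` first in the primary key means the time series for one cell can be read from the index directly, rather than by scanning every hourly snapshot.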
Webcast:
Making NEXRAD Precipitation Data Available to the Hydrology Community |
|
Integration
and Visualization in Bioinformatics
Mehmet Dalkilic, Indiana University
One of the greatest benefits of escience
— the use of
distributed computing and data resources for scientific discovery
— is the opportunity for scientists to begin
working with data sets that would have been too large to
work with otherwise and, consequently, ask questions
that would not have been possible. There are many
obvious challenges escience faces because of its
distributed nature, but there are other challenges that, while not
uniquely escientific, remain sufficiently
domain sensitive that solutions do not seem easily
shareable. One particularly difficult problem is
integration — how to coherently bring together
disparate, massive data sets. Focus has been generally
placed on the physical layer, borrowing from the three
layers of data modeling, where details of implementation
predominate. This problem will likely continue, though
there is some hope leveraging smart architectures like
smart clients. Logical integration — how to meaningfully
bring together massive, disparate data sets
— from the
scientist's perspective is even more challenging.
Another challenge of escience is creating meaningful,
interactive visualizations of massive data sets. A
direct benefit of this kind of visualization is allowing
the scientist to freely explore in a setting that is
more familiar and intuitive. In this presentation we
will discuss three ongoing projects, CATPA (Curation and
Alignment Tool for Protein Analysis), INGeNE
(Integrated, Gene Network Explorer), and SNPEx (SNP
Explorer) that address the challenges of integration and
visualization. CATPA is a smart client application that
allows for the curation of protein families at the
residue level, including deletions. Interaction is done
visually. INGeNE is an application that allows for
functional genomic discovery by building networks of
relationships where an edge is determined by a
combination of microarray data, protein-protein data,
gene-gene interaction data, and phenotypic expression
data. SNPEx is an application that includes a novel
algorithm to find the most informative set of tagging
SNPs. Additionally, we decided to implement SNPEx in
both Java/MySQL and C#/SQL Server 2000 to compare
performance of the two systems and found the latter to be
superior in our suite of tests.
Webcast:
Integration and Visualization in Bioinformatics |
|
The WiFi
eTransit Village
Uma Shama, Bridgewater State College
As a foundational research project of the Federal
Transit Administration, the GeoGraphics Laboratory at
Bridgewater State College has developed a Web-based
transit technology prototype focused on the needs of the
consumer to access safe and secure transit service while
also providing for enhanced personal productivity and
travel assistance while on-board the transit vehicle and
at the bus stop. The project takes advantage of emerging
community-wide outdoor Internet connectivity and very
large scale data storage as the enabling technology for
a full-featured e-transit village. The project uniquely
uses the wireless local area network infrastructure (WLAN)
and international standards (WiFi or wireless fidelity
802.11b) to demonstrate customer-oriented applications
of transit technology for community transportation
providers. The transit technology prototype uses the
campus transit system for Bridgewater State College
provided by the Brockton (MA) Area Transit Authority
(BAT) and the surrounding New England village of
Bridgewater, Massachusetts. Proof of concept milestones
to date include GPS-based automatic vehicle location
mapping transit vehicles with a one-second refresh rate
using Microsoft’s Web service and transmission of
video from the transit vehicle with GPS date/time and
latitude/longitude at simultaneous one-second intervals
for real-time Internet display and archiving on
Microsoft's custom-built 2-terabyte server. Research
continues in developing an opportunistic approach to
optimizing reacquisition of access points and
reauthorization of wireless local area network security
from a moving transit vehicle. A field operational test
of the proof of concept is planned for 2006 with an
opportunity for deployment by the local sponsoring
transit authority in 2007.
Webcast:
The WiFi eTransit Village |
3:30-4:00 |
Break |
4:00-5:30 |
Client Applications |
|
Streamlining Scientific Research via Electronic
Laboratory Notebooks and Wireless Sensors
Patrick Anquetil, MIT BioInstrumentation Lab
This talk will discuss the use of computing to assist
research in an academic laboratory environment. Two
projects conducted at the MIT BioInstrumentation
Laboratory within the framework of the MIT/Microsoft iCampus project will be discussed. These two projects
are named iLabNotebook and iDat.
The iLabNotebook is an experiment in which we attempt to
replace traditional laboratory notebooks with Windows XP
powered Tablet PCs. This new computing platform offers a
multimedia environment for scientists and students to
document their work and conduct scientific research. The
virtual laboratory notebook empowers researchers not
only to record experimental procedures digitally but
also to add multiple data-format content to a lab
notebook page. In addition these electronic notebooks
can be easily searched, backed-up, transported and
shared amongst colleagues worldwide. Evaluation of this
technology was conducted for a one year period among
fourteen scientists at MIT.
The goal of the iDat project is to develop Web-based
wireless iDat sensors specifically designed as
multidisciplinary educational tools to teach
instrumentation to students in a diverse range of
fields, including physical sciences, engineering,
biological science and neuroscience. Imagine you are a
curious student keen to tie the theoretical knowledge
you have acquired in your undergraduate courses to real
measurements. For example, you might be interested in
measuring finger acceleration during piano playing,
measuring the forces and accelerations involved in
playing tennis, or measuring heart rate, peripheral body
temperature, and foot-pedal force while riding a
bicycle. At present, it is essentially impossible for
students to make such measurements. iDat is a major
educational initiative where, for the first time,
measurement of a huge range of phenomena will become
very easy, both in terms of use and cost.
Webcast:
Streamlining Scientific Research via Electronic Laboratory Notebooks and Wireless Sensors |
|
NeuroScholar: A Practical Solution Addressing
Information Overload in Systems-Level Neuroscience
Gully Burns, University of Southern California
Systems-level neuroscience lacks a formal theoretical
structure, relying on argumentation based on
experimental findings expressed in the primary
literature. Theoretical models may typically be
represented as summary diagrams in a paper’s discussion.
Within a subject as complex and multifaceted as
neuroscience, this lack of formalization leads
inevitably to problems of information overload for
individual researchers as it is a significant challenge
to manage and manipulate large volumes of information
from a distributed resource such as the literature and
scientists’ own individual records. We present “NeuroScholar,”
a knowledge base desktop application that specifically
targets literature- and laboratory-based information,
providing a structured knowledge engineering approach
for neuroscience. It provides a general object-oriented
data model to encapsulate complex data into entities,
and a graph-theoretical approach that represents
relations between entities as edges between nodes in a
graph. The system has frameworks for unit testing,
plugins (to embed external applications within
NeuroScholar), proxies (to export NeuroScholar's
knowledge management capabilities to external
applications) and knowledge acquisition based on
questionnaires. Specialized plugins include an
annotation mechanism for PDF files (built with
Multivalent, a third party library); an electronic
laboratory notebook component; an annotation
mechanism for vector graphics; and NeuARt, a
neuroanatomical data viewer based on standard atlases
that can also use the proxy framework to act as a
standalone neuroanatomical data management tool. The
knowledge acquisition subsystem provides an easy way to
link free-form document annotation with structured
knowledge representations for specific types of
experiment. We are applying the system directly in two
systems-level neuroscience laboratories, one focused on
neuroanatomy, the other on neuroendocrinology. It is
anticipated that NeuroScholar may provide a platform for
theoretical research in neuroscience by delivering
knowledge engineering capabilities directly to
experimental scientists to facilitate analysis and
communication.
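The data model outlined in this abstract, with entities that encapsulate data and relations represented as edges between nodes, can be pictured with a small sketch. It is a hypothetical illustration only: the `Entity` and `KnowledgeGraph` classes, their fields, and the relation names are assumptions and do not reflect NeuroScholar's actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A node: one encapsulated piece of knowledge (hypothetical fields)."""

    name: str
    kind: str


@dataclass
class KnowledgeGraph:
    """Entities as nodes, named relations as directed edges."""

    edges: dict = field(default_factory=dict)

    def relate(self, source, relation, target):
        # Store each edge as a (relation, target) pair under its source node.
        self.edges.setdefault(source, []).append((relation, target))

    def related(self, source, relation):
        return [t for r, t in self.edges.get(source, []) if r == relation]


# Placeholder entities; none of these refer to real publications or records.
paper = Entity("Placeholder journal article", "publication")
finding = Entity("Region A projects to region B", "experimental finding")
note = Entity("Placeholder notebook page", "laboratory record")

graph = KnowledgeGraph()
graph.relate(finding, "reported_in", paper)
graph.relate(finding, "supported_by", note)
```

With this structure, `graph.related(finding, "reported_in")` returns the publication node, so literature-based and laboratory-based evidence for the same finding can be reached from one place.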
Webcast:
NeuroScholar: A Practical Solution Addressing Information Overload in Systems-Level Neuroscience |
|
WorldWind
Patrick Hogan, NASA
NASA World Wind, a Smart Client application built
almost effortlessly on the .NET platform, lets you zoom
from satellite altitude into any place on Earth.
Leveraging Landsat satellite imagery and Shuttle Radar
Topography Mission data, World Wind lets you experience
Earth terrain in visually rich 3D, just as if you were
really there. Virtually visit any place in the
world. Look across the Andes, into the Grand Canyon,
over the Alps, or along the African Sahara.
NASA World Wind is a free and open source application,
providing an excellent opportunity to understand and
work with Smart Client architecture and the .NET
framework, be it an academic exercise in understanding
the technology or to better appreciate development of
scientific research tools. All data leveraged is in the
public domain. The technology allows for implementing a
variety of formats, including ESRI Shapefiles, and
server protocols (for example, WMS).
Webcast:
WorldWind |
5:30-7:00 |
Break |
7:00-8:30 |
Dinner and Talks |
|
Computationally-intensive biomedical research
projects supported by the National Institutes of Health
Milton Corn, M.D., NIH
The need for computational partnerships in biomedical
research has increased sharply in recent years as the
Human Genome project and other high-throughput
biomedical research have underscored important new
requirements for data processing, information retrieval,
database design, data mining, and quantitative biology.
At the National Library of Medicine as well as a number
of other Institutes at the National Institutes of Health
campus, research funding opportunities increasingly
require significant computational expertise, and
specifically require applicants to include in the
project collaborations between biologists and
computational experts. This talk will provide a survey
of current computationally-intensive opportunities at
NIH, suggestions for computer scientists and engineers
looking for biomedical partners, and some guidance about
the NIH grant processes.
Webcast:
Computationally-intensive
biomedical research projects supported by the National
Institutes of Health |
|
Creating the Personal Supercomputer
Kyril Faenov, Microsoft
As computing power has increased so have the
complexities of our computer simulations. We're at a
point now where many scientists, engineers, and
researchers are hitting the upper limit of their high
end workstations, further driving the need for
supercomputing resources. Microsoft's goal in entering
the high performance computing space is to enable what
we call “personal supercomputing” which sounds like an
oxymoron. What we want to do is move supercomputing
resources out of distant labs and bring them closer to
the people that use those resources. In most cases it
would be a workgroup sized system with 32 or 64 nodes,
but in the most extreme case, the personal
supercomputing case, it would mean a small 4-8 node
cluster sitting in a scientist's office running off 15
amp wall power. Come hear why we think this is the
direction of supercomputing and how we'll make it a
reality.
Webcast:
Creating the Personal Supercomputer |
|
Using .NET and Web Services to build an e-Science
Application: Looking for White Dwarfs
Savas Parastatidis
The Web Services Grid Application Framework (WS-GAF)
project (Jan 2004 - Jan 2005) aimed to demonstrate the
value of using standard, widely-accepted, well-supported
Web Services technologies for scientific and commercial
Internet-scale (a.k.a. “Grid”) applications. The
scientific application developed as part of this project
is a tool aimed at astronomers who wish to combine and analyse information from the SuperCOSMOS (UK) and Sloan
Digital Sky Survey (US) scientific archives. This
presentation will discuss the WS-GAF approach to
building Internet-scale applications, the steps followed
in creating a tool for scientists, and the
implementation challenges and solutions.
Webcast:
Using .NET and Web Services to build an e-Science Application: Looking for White Dwarfs |
|
|
|
Friday, October 7, 2005 |
8:30-9:00 |
Cyberinfrastructure for E-Science
Tony Hey, Microsoft
Webcast:
Cyberinfrastructure for E-Science |
9:00-10:30 |
Data & Databases |
|
The Gateway to Biological Pathways: A Platform to
Enable Semantic Web-Based Biological Pathway Datasets
Keyuan Jiang, Purdue
Biological pathways represent our current
understanding of biological processes. A large amount of
biological pathway data has been accumulated either by
curation of the scientific literature or by automatic
machine inference of high-throughput laboratory
experiments. There exist over 180 biological pathway
databases, covering metabolic pathways, signal
transductions, protein-protein interactions, and
regulatory pathways. The data have been collected by
diverse research organizations with particular
interests, various techniques, incompatible schemas, and
different access methods. Biologists utilize the pathway
data to formulate hypotheses, verify experiment results,
and share research outcomes. Due to the incompatibility,
depth and breadth of database coverage, it is not
uncommon for biologists to query multiple datasets, a
time-consuming and error-prone process, to address
intriguing biological problems.
The Gateway to Biological Pathways project leverages the
BioPAX standard in storing and providing pathways
datasets consumable by Semantic Web applications, and
offers a unified interface to query biological pathway
data. The proposed BioPAX standard provides a common
format for exchanging biological pathway datasets. The
BioPAX ontology, written in the W3C-recommended Web Ontology
Language (OWL), supports the vision of the Semantic Web.
With BioPAX, a pathway is composed of a number of
entities and relationships among the entities.
In the Gateway application, the pathway entities are the
basic data unit that is naturally stored in its XML
format in a native XML datatype column of a SQL Server
2005 database. The support of native XML format eases
the database design by which the number of tables can be
reduced while relationships among pathway entities are
still maintained. Storing XML data in an XML datatype
column provides an efficient way of accessing and
processing data. The XQuery support provided facilitates diverse
search functionality over the XML datasets. The
Gateway application provides a Web service by which
biological pathways can be queried, and the data returned
are in BioPAX format. In addition, the HTTP GET and POST
methods are implemented for directly querying the
pathway data. The pathway datasets of E. coli and Human
from BioCyc are currently available at the Gateway, and
more data are to be added. A client capable of consuming
the BioPAX format data is being developed for
visualizing and navigating biological pathways.
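The idea of keeping pathway entities and their relationships as XML and querying them with path expressions can be sketched as follows. This is an illustration under assumptions: Python's `xml.etree.ElementTree` and its limited XPath support stand in for SQL Server's XML datatype and XQuery, and the element and attribute names are a simplified, hypothetical format rather than the real BioPAX vocabulary, which is OWL-based.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical pathway fragment (not actual BioPAX markup).
PATHWAY_XML = """
<pathway name="example-pathway">
  <entity id="e1" kind="protein">Enzyme A</entity>
  <entity id="e2" kind="smallMolecule">Substrate B</entity>
  <entity id="e3" kind="smallMolecule">Product C</entity>
  <relationship kind="catalysis" source="e1" target="e2"/>
  <relationship kind="conversion" source="e2" target="e3"/>
</pathway>
"""

root = ET.fromstring(PATHWAY_XML.strip())


def entities_of_kind(kind):
    """Return the names of all entities with the given kind attribute."""
    return [e.text for e in root.findall(f"./entity[@kind='{kind}']")]


def targets_of(entity_id):
    """Follow relationships outward from one entity and return the target names."""
    names = {e.get("id"): e.text for e in root.findall("./entity")}
    links = root.findall(f"./relationship[@source='{entity_id}']")
    return [names[link.get("target")] for link in links]
```

Here `entities_of_kind("smallMolecule")` selects entities by attribute, and `targets_of("e1")` follows a relationship from one entity to another, without the entities being split across separate relational tables.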
Webcast:
The Gateway to Biological Pathways: A Platform to Enable
Semantic Web-Based Biological Pathway Datasets |
|
Environmental Science from Satellites
Jeff Dozier, University of California at Santa Barbara
Imagery from Earth-orbiting satellites provides a rich
but voluminous source of raw data for scientific
investigation of environmental processes and trends.
Analyses of the data are, however, generally outside the
traditional realm of image processing. Instead, we
think of an image as a geospatial raster of radiometric
values, and an image's resolution includes spatial,
spectral, radiometric, and temporal attributes.
Translation of images into a suite of geophysical
products requires technologies and procedures that
support extensive computation and spatial operations on
large objects, along with mechanisms to track the legacy
of computations performed and allow revisiting as
algorithms change.
Webcast:
Environmental Science from Satellites |
|
DopplerSource: .NET Framework for Accessing Doppler Radar
Data
Beth Plale, Indiana University
Doppler radar data, which has proven its value in
meteorology research, has tremendous potential for use
in many other research endeavors if only it weren't so
difficult to work with. In DopplerSource we are removing
the hurdles that prevent broader use of the data through
a service-based framework for storing, operating on, and
serving the data. The 130 WSR-88D (Doppler) radars
located throughout the United States generate Level II
data continuously 24x7. The data has been valuable in
many aspects of meteorology research and education, for
instance, for the real time warning of hazardous spring
and winter weather, for initializing numerical weather
prediction models, and for verifying the occurrence of
past events, such as the location of damaging hail. But
it has broader potential. Level II data is used in bird
and insect migration studies, bird strike avoidance,
urban pollution transport, and the tracking of hazardous
atmospheric releases. This larger goal of facilitating
additional avenues of science cannot be fully realized
without significant improvements in the accessibility
and availability of the data over what exists today.
In this project partially funded through Microsoft
e-Science, we are constructing a .NET framework for
storing, operating on, and serving NEXRAD Level II data
and the knowledge products derived from the data. Our
pilot project is aimed at the six nearest radars
surrounding Bloomington, Indiana. The project focus
areas are in:
- Storing and indexing large volumes of streaming data
using a SQL Server database
- Generating metadata on-the-fly to describe data and
capture features of time-sequence in which the data
arrived
- Simple retrieval of Doppler data through a
spatial-temporal interface. The user selects a region of
interest, and specifies a temporal range.
- Support services to query, process, clean, filter,
and fuse data on the fly
- Authentication mechanisms to avoid denial of service
abuse by over-taxing the computational resource
- Scalability: a level of performance that balances
continuous input stream arrival, computationally intense
user services, and rich query access over highly
correlated temporal and spatial data
- Log analysis to characterize arrival and anticipate
user workload. Logs from related meteorology services
are used to analyze patterns of use that allow us to better
anticipate future usage patterns
The storage needs for the pilot radars alone are
substantial. The 6 radars generate 27.5 TB per year of
raw Level II data that can be compressed to 1/25th size,
requiring 1TB/yr of storage. A useful transformation of
the data is into the binary netCDF format. The converted
data adds another 2.5 TB/year. The arriving data
products are tagged with metadata to facilitate
searching. The metadata needs for the pilot data
products are estimated at 170GB/yr. The knowledge
products generated on-demand by statistical analysis and
data mining services are estimated at 0.5 TB/yr. This
places the total storage need at 4.5TB/year of data. The
tools used include a Web service framework (.NET), a
database management system (SQL Server), an XML metadata
schema (leveraging LEAD Metadata Schema from the NSF
LEAD project), and Integrated Radar Data Services (IRaDS)
support for the Doppler streams. The hardware testbed
includes 16 dual Opterons with 16GB RAM each, a 3.5 TB
SAN storage array, a dual Opteron, 4GB RAM, 2TB RAID 1
disk, Windows 2003 as the database server, and the
Indiana University MDSS fault tolerant mass store server
with a collective 1 Petabyte of storage.
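The yearly storage figures quoted in this abstract can be tallied in a few lines. The Python sketch below is only a back-of-the-envelope check: the variable names are illustrative, decimal units (1 TB = 1000 GB) are assumed, and the component values are taken as stated above. The components sum to roughly 4.3 TB/year, so the quoted 4.5 TB/year presumably includes rounding or headroom (an assumption, not something the abstract states).

```python
# Back-of-the-envelope tally of the yearly storage figures quoted in the abstract.
# All numbers come from the text; the names are illustrative only.
RAW_LEVEL2_TB = 27.5      # raw Level II data from the 6 pilot radars, per year
COMPRESSION_FACTOR = 25   # raw data compresses to 1/25th of its size

components_tb = {
    # 27.5 / 25 = 1.1 TB, which the text rounds to 1 TB
    "compressed_level2": RAW_LEVEL2_TB / COMPRESSION_FACTOR,
    "netcdf": 2.5,
    "metadata": 0.170,    # 170 GB expressed in TB (decimal units assumed)
    "knowledge_products": 0.5,
}

total_tb = sum(components_tb.values())

for name, size in components_tb.items():
    print(f"{name:>20}: {size:5.2f} TB/yr")
print(f"{'total':>20}: {total_tb:5.2f} TB/yr")
```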
Webcast:
DopplerSource: .NET Framework for Accessing Doppler Radar
Data |
10:30-11:00 |
Break |
11:00-1:00 |
Data & Databases |
|
CasJobs and
MyDB for the Virtual Observatory: Towards Distributed
Asynchronous Web Services for Data Intensive Science
Ani Thakar, Johns Hopkins University
The Sloan Digital Sky Survey (SDSS) Catalog Archive
Server (CAS) provides online access to the multi-TB SQL
Server-based SDSS Science Archive via the
SkyServer Web portal.
This synchronous, ASP-based Web access is fine for
casual and quick queries that request moderately sized
resultsets, but for the data intensive queries that are
necessary for serious research with the SDSS archive, we
have developed CasJobs, a C#.NET batch query workbench
Web service that provides asynchronous queue-based
access to the SDSS CAS and a personal SQL Server
database for every user (MyDB) to save their query
results. I will describe and briefly demo the batch
query workbench, and discuss the future of CasJobs: a
distributed CasJobs/MyDB for the Virtual Observatory
(VO). CasJobs provides two modes of query execution:
quick (synchronous) and batch (queued) execution. Quick
queries are limited to 1 minute execution time and can
be run even without login, while batch queries (which
require login) are virtually unlimited. Results from
batch queries are routed to the user's MyDB by default.
Users can then preview and download these results at
their convenience and in their chosen format (ASCII/CSV,
binary or XML). They can also share their MyDB tables
with other collaborators, and use them in other queries
and stored procedures to perform complex data intensive
tasks like neighbor searches and cross-matches.
Distributed CasJobs will require distributed security
and storage. The international VO community is
converging on the VOStore standard which will
essentially combine Web services security with MyDB-like
data stores accessible via asynchronous Web services.
VOStores will also enable distributed Web services like
Open SkyQuery
to operate asynchronously so that large cross-matches
between catalogs can be performed on demand.
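The two execution modes described in this abstract can be illustrated with a small dispatcher. This is a minimal Python sketch under stated assumptions, not the CasJobs implementation (which the abstract says is written in C#): the `Job` and `Workbench` classes, the `submit` method, and the MyDB table naming are all hypothetical. Only the rules themselves (a 1-minute limit and no login for quick queries; login required and results routed to MyDB for batch queries) come from the text.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional

QUICK_TIME_LIMIT_S = 60  # quick queries are limited to 1 minute (from the abstract)


@dataclass
class Job:
    sql: str
    user: Optional[str]              # None means an anonymous (not logged in) user
    mode: str                        # "quick" or "batch"
    time_limit_s: Optional[int]      # None means no fixed limit
    output_table: Optional[str] = None  # hypothetical MyDB destination (batch only)


@dataclass
class Workbench:
    batch_queue: deque = field(default_factory=deque)

    def submit(self, sql: str, mode: str, user: Optional[str] = None) -> Job:
        if mode == "quick":
            # Quick queries run synchronously and may be anonymous.
            return Job(sql, user, mode, time_limit_s=QUICK_TIME_LIMIT_S)
        if mode == "batch":
            if user is None:
                raise PermissionError("batch queries require login")
            # Batch queries are queued; results go to the user's MyDB by default.
            table = f"MyDB.{user}.result_{len(self.batch_queue)}"  # hypothetical naming
            job = Job(sql, user, mode, time_limit_s=None, output_table=table)
            self.batch_queue.append(job)
            return job
        raise ValueError(f"unknown mode: {mode!r}")


workbench = Workbench()
quick_job = workbench.submit("SELECT TOP 10 * FROM PhotoObj", "quick")
batch_job = workbench.submit("SELECT * FROM PhotoObj", "batch", user="alice")
```

Keeping the queue separate from the synchronous path mirrors the design point made in the abstract: short exploratory queries stay responsive while long-running work is deferred.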
Webcast:
CasJobs and MyDB for the Virtual Observatory: Towards
Distributed Asynchronous Web Services for Data Intensive
Science |
|
A Platform for Computational Comparative Genomics on
the Web
Sun Kim, Indiana University
We have been developing a Web-based system for comparing
multiple genomes,
PLATCOM,
where users can choose genomes and perform analysis of
the selected genomes with a suite of computational
tools. PLATCOM is built on internal databases such as
GenBank, COG, KEGG, and the Pairwise Comparison Database
(PCDB), which contains all pairwise comparisons (97,034 entries)
of protein sequence files (.faa) and whole genome
sequence files (.fna) of 312 replicons. The pre-computed
PCDB makes it possible to complete genome analysis very
fast even on the Web, so that users can choose any
combination of genomes and analyze them with data mining
tools. Genome comparison requires combining many
sequence analysis tools. However, combining multiple
tools for sequence analysis requires a significant
amount of programming work and knowledge of each tool,
so it is very challenging to provide a service for
comparing genomes on the Web by using standard sequence
analysis tools. To make genome comparison practical
on the Web, well-defined data mining concepts and tools
are very important, since they can make genome comparison
much easier. It is also important that the data mining
tools for genome comparison be scalable. We have
been developing such scalable tools: a sequence
clustering algorithm BAG, a metabolic pathway analysis
tool MetaPath, a gene fusion event detection tool
FuzFinder, a gene neighborhood navigation tool OperonViz,
an algorithm for mining correlated gene sets MCGS, a
genome sequence alignment tool GAME, a multiple genome
sequence alignment algorithm by clustering local matches
mgAlign, and a pairwise genome visualization tool COMPAM.
The analysis results are summarized with visualization
tools. We are currently working on integrating the data
mining modules such that users can combine these in a
very flexible way. In addition to sequence data, PLATCOM
will include more data types such as gene expression
data.
Webcast:
A Platform for Computational Comparative Genomics on the
Web |
|
Querying Breast Cancer Image Databases
Hanan Samet, University of Maryland
Breast cancer remains a leading cause of cancer deaths
among women in many parts of the world. In the United
States alone, over forty thousand women die of the
disease each year. Mammography is currently the most
cost-effective method for early detection of breast
cancer. Alternative medical imaging approaches such as
ultrasound or MRI could be more effective than
mammography at detecting cancers or evaluating
malignancy in certain types of women. A database with
images from multiple technologies like mammograms, MRI,
PET, and ultrasound will enable research into the
effectiveness and usefulness of each technique at cancer
screening and the determination of malignancy. We
created this database with Microsoft SQL Server and will
use it to build a Web-based query tool that gives doctors
access to the data via Web services. Doctors will also be able to
find cases similar to a current patient thereby
improving the accuracy of the diagnosis. This database
will be an invaluable tool for the improvement of
computer aided detection (CAD) techniques by providing
quality data sets, the storage of feature sets for
comparison, a tool for the complex combination of
features through spatial relationships and across
images, and built-in statistical analysis. We will
develop a pictorial query specification system for this
tool that will enable users to specify queries by
identifying the desired features, shapes, or
characteristics and specifying the spatial relationship
between them using distance and direction. Additionally,
the secure data storage and retrieval enables
long-distance, electronic image transmission (telemammography/teleradiology)
for clinical consultations. The database will match
images from the same patient, thus improving the
capability of comparing images through time, which will
enable the determination of extremely early cancerous
indicators, and thus hopefully improve the cancer
survival rate.
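The idea of specifying a query by feature type plus distance and direction can be sketched as follows. This is a hypothetical Python illustration, not the authors' system: the `Feature` record, the feature labels, the coordinate units, and the brute-force search are all assumptions, and directions are measured counter-clockwise from the +x axis in a y-up convention. A production system would more likely rely on a spatial index than on the pairwise scan shown here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    image_id: str
    kind: str   # e.g. "mass" or "calcification" (illustrative labels)
    x: float    # position in image coordinates (units are an assumption)
    y: float


def bearing_deg(a: Feature, b: Feature) -> float:
    """Direction from a to b in degrees, in [0, 360), measured from the +x axis."""
    return math.degrees(math.atan2(b.y - a.y, b.x - a.x)) % 360.0


def find_pairs(features, kind_a, kind_b, max_distance, direction_deg, tolerance_deg):
    """Return (a, b) pairs in the same image where b lies within max_distance
    of a and roughly in the requested direction from a."""
    pairs = []
    for a in features:
        if a.kind != kind_a:
            continue
        for b in features:
            if b is a or b.kind != kind_b or b.image_id != a.image_id:
                continue
            dist = math.hypot(b.x - a.x, b.y - a.y)
            # Smallest angular difference between actual and requested direction.
            diff = abs((bearing_deg(a, b) - direction_deg + 180.0) % 360.0 - 180.0)
            if dist <= max_distance and diff <= tolerance_deg:
                pairs.append((a, b))
    return pairs


# Made-up example data: which calcifications lie within 5 units to the
# "east" (direction 0 degrees, +/- 15 degrees) of a mass in the same image?
features = [
    Feature("img1", "mass", 10.0, 10.0),
    Feature("img1", "calcification", 14.0, 10.0),
    Feature("img1", "calcification", 10.0, 30.0),
    Feature("img2", "calcification", 12.0, 10.0),
]
matches = find_pairs(features, "mass", "calcification", 5.0, 0.0, 15.0)
```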
Webcast:
Querying Breast Cancer Image Databases |
|
SANGAM: A
System for Integrating Web Services to Investigate
Stress-Circuitry-Gene Coupling
Shahram Ghandeharizadeh, University of Southern
California
In 1993, NIH launched the Human Brain Project (HBP) to
develop and support neuroinformatics as a new science to
make experimental data pertaining to the brain publicly
available on the Internet. The success of HBP is
demonstrated by the
Society for
Neuroscience maintaining a directory of 83 databases
and 48 knowledge bases developed and maintained by
different academic, government, and commercial
institutions. A challenge is how to integrate data from
these diverse sources to answer a scientific enquiry.
SANGAM focuses on this challenge from the perspective of
Stress-Circuitry-Gene coupling. It strives to address
the following scientific question: Does every type of
stress stimulus recruit the same set of brain circuits
and activate the same genes, or do such circuits and
genes vary across different stressors? An answer to this
question helps clinicians and drug manufacturers to
develop better treatments and drugs for stress
disorders. Currently, a prototype of SANGAM is
operational and in use by our neuroscientists. A key
insight from developing SANGAM is a general framework
for neuroscience information integration consisting of 3
components: Run-time integration (RTI), Plan Composition
(PLC), and Schema and Data Mapper (SDM). We present an
overview of these components along with performance
results from both centralized and distributed (using
WSE 2.0) implementations of the RTI component.
Webcast:
SANGAM: A System for Integrating Web Services to
Investigate Stress-Circuitry-Gene Coupling |
1:00-2:00 |
Lunch & Discussion |
2:00-4:00 |
Workflow |
|
Bio-Workflow Using BizTalk
Paul Roe, Queensland University of Technology
Workflow is an important enabling technology for
eScience. Research into workflow systems for eScience
has yielded several specialized workflow engines. We
have been investigating the nature of scientific
workflow and experimenting with the BizTalk business
integration server to support scientific workflows. In
particular we have built a simple Web portal for
bioinformatics which uses BizTalk as the underlying
workflow engine. The portal has a novel Web based
interface making it accessible to a wide variety of
users. In this presentation we will describe the overall
system and demonstrate some simple workflows.
Webcast:
Bio-Workflow Using BizTalk |
|
Developing GEMSTONE, a Next Generation
Cyberinfrastructure
Karan Bhatia, San Diego Supercomputer Center
We are developing an integrated framework for accessing
grid resources that supports scientific exploration,
workflow capture and replay, and a dynamic
services-oriented architecture. The framework, called
GEMSTONE for “grid enabled molecular science through
online networked environments,” provides researchers in
the molecular sciences with a tool to discover remote
grid application services and compose them as
appropriate to the chemical and physical nature of the
problem at hand. The initial set of application services
to date includes molecular quantum and classical
chemistries (GAMESS, APBS, Polyrate), along with
supporting services for visualization (QMView),
databases, auxiliary chemistry services, as well as
documentation and education materials.
This presentation will focus on the technologies used to
build the GEMSTONE frontend: a rich-client application
built using the Mozilla framework that provides access
to remote registries for application discovery using RSS,
dynamically loaded user interfaces using XUL, and
visualization services (both local and remote) using SVG,
Flash, and OpenGL. The GEMSTONE frontend supports the
GSI-based secure Web services infrastructure being
created by the National Biomedical Computation Resource
(NBCR), an NIH funded center at UCSD, and supports the
Grid Account Management Architecture (GAMA) for
credential management. The remote Web services support
large-scale clusters for parallel and high-throughput
jobs and provide science-oriented strong datatypes for
semantic composition. Finally, GEMSTONE adds a workflow
composition tool, based on the Informnet engine, that
composes existing Web services into workflows that are
accessible as new Web services.
Webcast:
Developing GEMSTONE, a Next Generation
Cyberinfrastructure |
|
A Web Interface to Large, High-Resolution X-Ray Computed
Tomography Data Sets
Julian Humphries, University of Texas at Austin
High-resolution X-ray computed tomography (HRXCT)
provides highly detailed three-dimensional data on the
exterior form and interior structure of solid objects.
The data produced by the UTCT Lab facility at the
University of Texas at Austin are HRXCT scans of various
biological organisms, ranging from dinosaurs to mice and
geological objects including meteorites and deep sea
cores. The Digital Library of Vertebrate Morphology (or
Digimorph Project)
has for the last eight years acquired and scanned some
of the world's most spectacular organisms. To date,
these data have been released as highly compressed
renderings in the form of movies and Web-sized versions
of the data. In order to provide access to full-sized
datasets and enhance research tools for viewing these
data we have developed UTCT. These data sets are large
(1-4 GB in size) and the display and dissemination of
these datasets is a challenge. We have built a SQL
Server-based system that hosts metadata and raw imagery
and allows rapid and flexible access to volumetric
data.
The visualization options on the Web site range from
simple to complex. Users can currently choose from a
“light-table” viewer or a Java slice viewer from this
site. They can also download all or parts of the data at
multiple bit-depths and file formats. Finally, users
will soon have the option to remotely volume render
these data using one of several tools being developed.
One approach uses the Meshviewer/Vista combination on
Maverick, a TeraGrid-funded visualization system, to
remotely render their volume using a VNC client and a
high-speed Internet connection. Other possibilities are
also under development. The combination of these options
gives users a rich set of tools for exploring the data.
Webcast:
A Web Interface to Large, High-Resolution X-Ray Computed
Tomography Data Sets |
|
The Zecosystem: Cyberinfrastructure Education and Discovery
for the Next Generation
Krishna P.C. Madhavan, Purdue University
Learning experiences of the future will be
multi-sensory, engage technologies and significant
computational power continuously and invisibly, and will
be completely engaging. The Zecosystem will offer
cyber-services that incorporate science, technology,
engineering, and mathematics concepts into students'
everyday experiences seamlessly. Through this project,
we expect to transform common day-to-day student
activities such as gaming, eating at the cafeteria, or
visiting the library into learning experiences. Our
vision is to develop a Cyberinfrastructure Education
Ecosystem where learning co-exists with students'
lifestyles, technology choices, and emerging national cyberinfrastructure. To this end, we will leverage
significant on-going R&D in computational
infrastructure, middleware, and science gateways funded
by the National Science Foundation (NSF) and other
industrial partners at Purdue University.
Our goal is to leverage the national cyberinfrastructure
effort for day-to-day discovery and learning practices.
Given the emergence of highly cross-disciplinary areas
such as nanotechnology, bioinformatics, and
computational science as critical for scientific
progress, teaching and learning at colleges and
universities can no longer be locked behind
computational walls. Furthermore, several national
reports have identified the dire need to train and
develop the next generation of students to take up
careers in science, technology, and engineering. We
strongly believe that in order to reach the current
generation of students (the aptly labeled Gen-Z), information
technology needs to be at the heart of educational
efforts and play more than an add-on role. Simply put,
we need to rethink education ground-up.
The goal of the first phase of the project is to develop
a robust set of Cyberinfrastructure (CI) Education
Services that will extend the capabilities of existing
and emerging science gateways such as the nanoHUB to a
mobile environment. We are developing a series of Web
services that plug into middleware and infrastructure
layers currently being developed and deployed at Purdue
University. These services will be made available to the
larger scientific and education community, while
simultaneously consumed to develop new cutting-edge
CI-services focused on and tailored to students. This
project will allow students to deploy large
computational jobs to the national cyberinfrastructure
from their cell phones, PDAs, gaming environments, and
other mobile devices.
In this talk, we will focus on the vision of the
Zecosystem and provide concrete arguments derived
from both the scientific discovery and the
pedagogical viewpoints. We will also provide
examples of various project elements that are already in
progress. In many cases, the prototypes are expected to
be ready by the end of the coming academic year. All of
the projects that will be highlighted are funded either
by large NSF grants or by support from our
industrial partners.
Webcast:
The Zecosystem: Cyberinfrastructure Education and
Discovery for the Next Generation |
4:00-4:30 |
Workshop Wrap-up |
|
5:00-7:00 |
Light dinner and drinks in the Elk River room |
The Microsoft Research eScience Workshop was held at the Marriott Redmond
Town Center Hotel in Redmond, Washington.
Marriott Redmond Town Center Hotel
7401 164th Avenue NE
Redmond, WA 98052
For more information or if you have any questions, send an e-mail message to
escience@microsoft.com.
|