Open Source eScience Geospatial Visualization Using .NET Technology
Patrick Hogan, NASA
The need for massive communication and dynamic sharing of scientific data has never been greater than it will be in the world
that awaits our children. The ability to integrate, analyze, and exchange both local and global information is critical to
maximizing our understanding of our circumstances, whether for ground-truthing of satellite data (Earth’s carbon budget),
coalescing field data for regional projections (North Africa to North India locust intervention), or simply innovative analyses
coming from world-wide access to global data, and whether it be on behalf of academia, governments, or enfranchised
individuals from the global community. This realm of scientific understanding needs the kind of innovation that comes from
coding environments that provide the greatest opportunity for the development of solution-based technology. Competition in
this realm should be based purely on results engendered by access to the scientific data. The .NET programming environment
provides a compelling solution for scientific endeavors to maximize solution-based analyses and it also equally serves the
geospatial visualization technology needed to effectively share this information.
Optimizing Life Sciences Data Transfer to Mobile Devices
Greg Quinn, University of California, San Diego
Within the past few years, numerous cell phone platforms have come to market that provide more than sufficient technical
capability to enable advanced information visualization. Accompanying these advances in telecommunications hardware is the
increasing maturity and capability of Smart Phone operating systems such as Windows Mobile 6.0. This has led to the increasing
dependence of people from all walks of life on their cell phone to provide not only telecommunications functionality but also
Internet-based information access and entertainment capability. Here we describe work in progress to use the Windows
Communication Foundation capability in the .NET Framework version 3.0 to efficiently serve bioinformatics data on the fly to
Smart Phone devices running the Windows Mobile operating system. We will also discuss the use of binary-formatted data
transfer as a means to increase the download and processing efficiency of Protein Data Bank (PDB) data stored in a Microsoft
SQL Server database.
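To illustrate why a binary encoding can shrink coordinate data, here is a minimal sketch in Python. It is not the WCF-based service described above: the atom coordinates, the function names, and the wire layout (a 32-bit count followed by three 32-bit floats per atom) are all assumptions made for illustration only.

```python
import struct

# Hypothetical atom records: (x, y, z) coordinates in angstroms.
atoms = [(11.104, 13.207, 2.100), (12.560, 13.329, 2.283), (13.035, 14.687, 1.772)]


def encode_text(coords):
    """Fixed-width text, one atom per line, three 8.3f columns (assumed layout)."""
    return "\n".join(f"{x:8.3f}{y:8.3f}{z:8.3f}" for x, y, z in coords).encode("ascii")


def encode_binary(coords):
    """Little-endian atom count followed by three 32-bit floats per atom."""
    payload = struct.pack("<I", len(coords))
    for xyz in coords:
        payload += struct.pack("<3f", *xyz)
    return payload


def decode_binary(payload):
    """Inverse of encode_binary: read the count, then 12 bytes per atom."""
    (count,) = struct.unpack_from("<I", payload, 0)
    return [struct.unpack_from("<3f", payload, 4 + 12 * i) for i in range(count)]


text_bytes = encode_text(atoms)
binary_bytes = encode_binary(atoms)
decoded = decode_binary(binary_bytes)
```

In this toy case the binary payload is shorter than the text one and can be decoded without parsing numbers from strings. Whether that holds for real PDB records depends on the fields carried and the precision required.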
Smart Irrigation Control based on Cognitive Wireless Sensor Networks
Supratik Mukhopadhyay, Utah State University; Krishna Shenai, University of Toledo; Ramesh Bharadwaj, NRL
World demand for fresh water is increasing, and competition for allocation of water between the urban and agricultural sectors
is rapidly growing in arid and semi-arid climates. This has brought an emphasis on intensive water management to achieve
greater system efficiencies, especially in irrigated agriculture in arid regions such as the western US. Further, studies by the FAO
(Food and Agriculture Organization) and others predict that in the coming 20 years, this competition for water will present
potentially serious economic, political, and social problems for much of the population in both the urban and rural areas of
developing countries, especially in the arid and semi-arid regions of the world. We present a novel irrigation control system to
intelligently and reliably manage large soil and water ecological systems for environmental and agricultural applications.
Reliability is an important concern in precise monitoring and control of soil and water properties, since any malfunction can
result in financial as well as environmental disaster. Our controller consists of novel sensors and uses state-of-the-art distributed
information fusion and networking technologies for multi-zone implementation. It integrates intelligent sensor coordination
and data fusion techniques to access, retrieve, process, and communicate with disparate wireless sensors in an ad-hoc manner
to deliver reliable dynamic decisions and provide adequate information management. Our approach reduces the
hardware cost by almost a factor of 10 and removes the main bottleneck in irrigation control arising from wired sensors. In
addition, it provides a smart control mechanism with formal reliability guarantees that is reconfigurable at runtime in response
to changing requirements.
Programming in the Large: Integrating Simulation and Visualization
Christoph Hoffmann, Voicu Popescu, Purdue University
Visualization is a core task in scientific computations, and in interdisciplinary settings it becomes even more important in view
of the need to communicate insights across disciplinary expertise in the team. We explain how to integrate state-of-the-art
finite element analysis and visualization systems. Instead of replicating functionality of one system in the other, we federate the
systems by automated translation of FEA results into a form suitable for the animation/visualization system. This includes
bridging the gap between different geometry conceptualizations, inverting and visually concretizing abstractions convenient for
FEA, deriving visualization strategies that scale with the number of simulation elements and states, and placing the simulation
results in the context of the surrounding scene. We demonstrate our approach with the recently completed simulation and
animation of the crash of AA-11 into the North Tower of the World Trade Center, a video that has been downloaded more than
1.3M times to date. We discuss some of the research issues that arose and describe some of the benefits for the FEA when
high-end visualization is considered part of the effort. In the broader context, our work finds applications in VR training, in
forensics, and in communicating with a wide audience outside of the scientific community.
Declarative and Efficient Querying on Biological Datasets
Jignesh Patel, University of Michigan
Modern life sciences explorations often need to analyze and manage large volumes of complex biological data. Unfortunately,
existing life sciences applications often employ awkward procedural querying methods and use query evaluation algorithms
that do not scale as the data size increases. For example, data is often stored in flat files and queries are expressed and
evaluated by programs written in Python. The perils of employing such procedural querying methods are well known to a
database audience, namely a) severely limiting the ability to rapidly express complex queries, and b) often resulting in very
inefficient query plans as sophisticated query optimization and evaluation methods are not employed. The problem is likely to
get worse in the future as many life sciences datasets are growing at a rate faster than Moore's Law. Furthermore, the queries
that scientists want to pose are also rapidly increasing in their complexity. The focus of this talk is on a database approach to
querying biological datasets. The talk describes ongoing work in the Periscope project in which we are developing a system for
declarative and efficient querying on biological graphs and sequence databases. This talk will also highlight how these database
methods allow a scientist to work in a loop of a) first posing queries, b) viewing the results, c) then refining and reposing a
modified query, and d) continuing through this iterative process until an answer has been found. The efficiency of the system
enables the scientist to explore even large biological databases in real time.
Creating and Querying Workflows by Analogy
Claudio Silva, Juliana Freire, Carlos Scheidegger, David Koop, Huy Vo; University of Utah
Workflow systems have recently emerged as an alternative to ad-hoc approaches to constructing computational tasks widely
used in the scientific community. These systems can capture complex analysis processes at various levels of detail and
systematically capture the provenance information necessary for reproducibility, result publication, and sharing. Although the
benefits of using workflow systems are well known, the fact that workflows are hard to create and maintain has been a major
barrier to wider adoption of the technology in the scientific domain. Constructing complex analysis processes requires expertise
both in the domain of the data being explored and in using a number of different analysis and visualization tools.
Furthermore, the path from “data to insight” requires a laborious trial-and-error process, where users successively assemble,
modify, and execute multiple workflows. We advocate a data-centric view of workflow-based computational processes, where
the workflows and information about their evolution are stored, along with their impact on the data they manipulate. This
information captures detailed provenance of the steps followed in exploratory processes. We propose a new framework that
lets users explore and re-use this detailed provenance information through intuitive interfaces. Our framework consists of two
key components: a query-by-example interface for querying workflows whereby users query workflows through the same
familiar interface they use to create them; and a mechanism for semi-automatically creating and refining workflows by
analogy, without requiring users to directly manipulate or edit the workflow specifications. In this talk, we will describe the
framework and demonstrate its use in VisTrails (www.vistrails.org), a publicly-available open-source system.
Scientific and Technological Challenges in Developing a Real-Time Syndromic Surveillance
System
Vicki Hertzberg, Douglas Lowery-North, Walter Orenstein, James Buehler, Lance Waller, Eugene Agichtein; Emory
University
Rapid detection of disease outbreaks and response to cases is an important public health function. Definitive diagnoses and
subsequent reporting can lag initial case presentation by days or weeks, a critical weakness in outbreak detection. In addition,
timely notification of outbreaks to healthcare providers by a central public health authority is also crucial. However, the best
strategies for such notification have not been determined. We describe here the potential for developing a real-time syndromic
surveillance (SS) system using three healthcare systems in a large urban area with a reciprocal interface from the state public health (PH) agency.
These systems cover patients presenting in the hospital emergency departments (four adult, three pediatric) and primary care
clinics as well as related laboratory and radiology orders. This system presents many scientific and technological challenges.
How can we best integrate data sets within and between systems rapidly? Is there benefit to monitoring the health status of a
particularly vulnerable population served by one of the hospitals? What tools are necessary to detect “blips” suggesting
events of interest? Can we automate epidemiologic investigation of such events? Can we apply performance improvement
tactics to reduce waste and improve value in SS data collection, analysis, and reporting? How can free text records, such as
dictations, be utilized to improve sensitivity and positive predictive value of SS? How can we best give meaningful real time
feedback to clinicians regarding PH alert information? What is the most valuable information to provide to these clinicians?
What are the most valuable actions for providers to accomplish with such information? Should space be reserved in electronic
RAY: A System Supporting Multiple Contending Scanning Queries on Large Scientific Data
Sets
Robert Grossman, Dave Hanley, University of Illinois; Jennifer Schopf, Argonne National Laboratory
Many applications perform queries to large scientific data sets that involve scanning the entire data set in the sense that each
record must be checked to see if a given condition is satisfied. In contrast, there is often an implicit assumption by the database
developers that latency must be optimized, and an expectation that data is indexed in such a way that a relatively small amount
of the data needs to be retrieved in order to satisfy the query. We are interested in the case seen by applications including
SDSS, BLAST, and others in which there are multiple contending scanning queries and the end user wishes to optimize total
throughput. In this paper, we define a system called RAY that collects scanning queries as they arrive, presents them with the
entire database chunk by chunk, and releases them after the entire database has been scanned, thereby increasing the
performance of multiple contending scanning queries by reducing the number of aggregate disk reads. We present
experimental studies using a large astronomy data set from the Sloan Digital Sky Survey and realistic queries from that
experiment that touch varying amounts of data, from 100% down to 20%. We show that RAY is significantly faster than directly
passing the queries to the database. When 100% of the data is touched this can be true even when there is no contention, and
for less data touched in the scan, RAY can achieve better performance for as few as 2 or 3 contending scanning queries.
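The scan-sharing idea described above can be sketched as follows. This is a minimal illustration in Python, not the RAY implementation: the names `chunked` and `shared_scan`, the predicate-style queries, and the in-memory list standing in for the database are all hypothetical. The sketch only shows that each chunk is read once, however many queries are waiting.

```python
from typing import Callable, Dict, Iterable, List, Sequence

Record = dict
Query = Callable[[Record], bool]  # hypothetical: a scanning query is a row predicate


def chunked(records: Sequence[Record], chunk_size: int) -> Iterable[Sequence[Record]]:
    """Yield the data set chunk by chunk (stands in for sequential disk reads)."""
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]


def shared_scan(records: Sequence[Record], queries: Dict[str, Query], chunk_size: int = 2):
    """Run all collected queries in a single pass; return (results, chunks_read)."""
    results: Dict[str, List[Record]] = {name: [] for name in queries}
    chunks_read = 0
    for chunk in chunked(records, chunk_size):
        chunks_read += 1  # one read serves every pending query
        for name, predicate in queries.items():
            results[name].extend(r for r in chunk if predicate(r))
    return results, chunks_read


# Toy catalogue with made-up magnitudes; two contending scanning queries.
sky = [{"id": i, "mag": m} for i, m in enumerate([14.2, 19.8, 21.5, 16.0, 22.1])]
queries = {
    "bright": lambda r: r["mag"] < 17.0,
    "faint": lambda r: r["mag"] > 21.0,
}
results, chunks_read = shared_scan(sky, queries, chunk_size=2)
```

Running the two queries separately would read every chunk twice. With the shared pass the number of chunk reads stays fixed as more queries join, which is the source of the aggregate disk-read savings the abstract refers to.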
Controlled Sharing of Scientific Data using SecPAL
Marty Humphrey, Sang-Min Park, Jun Feng, Norm Beekwilder, Glenn Wasson, Jason Hogg, Brian LaMacchia, Blair
Dillaway; University of Virginia
Access control policy languages today are generally one of two extremes: either extremely simplistic, or overly complex and
challenging for even security experts to use. In this presentation, we explicitly identify requirements for an access control policy
language for scientific data and then consider six specific data access use-cases that have been problematic in multi-institutional
collaborations: attribute-based access, role-based access, “role-deny” access, impersonation-based access,
delegation-based access, and capability-based access. We evaluate the Microsoft Research Security Policy Assertion Language
(SecPAL) against those requirements, specifically in the context of these six use-cases involving GridFTP.NET. We find that, while
some of these six use-cases are individually possible via existing authorization systems, SecPAL uniquely offers a
single approach that meets the requirements of a multi-institutional access control policy language, thereby creating support
for a wide range of expanded scenarios for controlled sharing of scientific data.
Science 2.0
Bora Zivkovic, Public Library of Science
Online technologies are fundamentally changing the world of science: how research is performed, how science is taught and
communicated, and how scientists' networks are formed. The meteoric rise in the number, quality and prestige of Open Access
journals, the rise in interest in Open Notebook Science, the proliferation of science blogs, the increased use of existing social networks (e.g.,
Facebook) and the formation of science-specific networks (e.g., Postgenomic, Connotea) all contribute to big changes in the
structure of the scientific enterprise that upset the traditional model.
Online Notes-Taking-Sharing System
C. Augusto Casas, St Thomas Aquinas College
Taking notes is the most common activity of students in the classroom. College students' use of technology has increased
significantly in the last several years. Students now attend class armed with PDAs, laptops and especially cell phones. These last
devices are more than telephones. Cell phones include calculators, web browsers, instant messaging software, phone books,
digital cameras, video players, and games. Research conducted by the author found that students can benefit
academically from such technology. More specifically, class experiments demonstrated that when students use personal computers to take
and share notes, class participation and test scores increase. Microsoft Office Live Meeting was used as the underlying
technology. Lectures were given to students divided into two groups. One group shared notes with Live Meeting. The other group
took notes individually. A day after the lecture both groups took the same test. The experiment was conducted multiple times
with different pools of students. Results showed that students using the notes-taking-sharing system were more actively
engaged in class and scored better on the test. The results were consistent across all groups tested. With the Live Meeting
system, each student was assigned a section of an online whiteboard. Each student took notes in her/his area while looking at
the notes taken by classmates. At the end of the lecture, students who used the online system could save and keep a copy of the
online whiteboard. The experiments showed that students are more likely to engage in class and less likely to be distracted with
other activities when they are working within this collaborative environment. The next research phase intends to determine if
such a system helps disadvantaged students.
Understanding Computational Requirements for Preservation and Reconstruction of
Computer-Assisted Decision Processes
Peter Bajcsy, Sang-Chul Lee, NCSA/UIUC
We discuss the problem of understanding computational requirements for preservation of computer-aided decisions.
Computer-aided decisions increasingly impact our society. These decisions have to be documented semi-automatically and the
electronic records have to be appraised and understood in terms of the preservation and reconstruction cost. Currently there is
no simulation framework that could support understanding and forecasting of computational requirements for preservation
purposes. Our objective has been to develop such an exploratory simulation framework that allows archivists and other users to
explore and evaluate computational costs as a function of several key preservation variables of appraised records. Thus, the
application of our simulation framework is in supporting investigations of preservation tradeoffs and improving appraisals of
electronic records. We first outline such prototype simulation software called Image Provenance To Learn (IP2Learn) that has
been developed for a class of computer-aided decisions based on visual image inspection. The current software enables users to
explore some of the tradeoffs related to (1) information granularity (category and level of detail), (2) representation of
provenance information, (3) compression, (4) encryption, (5) watermarking and steganography, (6) information gathering
mechanism, and (7) final report content (level of detail) and its format. The simulation software consists of Image Viewer (visual
inspection of images), Event Tracker (information gathering), Event Reviewer (decision reconstruction), and Final Report Editor
(semi-automatic report generation). We will also illustrate example tradeoff studies using IP2Learn for a specific image
inspection task.
Rapid Adoption of Visualization Cyberinfrastructure in the Atmospheric Sciences
Classroom
David Lee, Perry Samson, Erik Hofer, University of Michigan
In early 2007 the Department of Atmospheric, Oceanic and Space Sciences (AOSS) and the School of Information (SI) at the
University of Michigan collaborated on the installation of a 50 million pixel OptIPortal, or tiled display, utilizing OptIPuter
technologies for applications spanning high-resolution image exploration to multi-modal atmospheric visualizations. In addition
to research and persistent display tasks, the OptIPortal was incorporated into the undergraduate curriculum by requiring students to use
the display to demonstrate their understanding of principles in atmospheric sciences. This presentation discusses the rapid
adoption of ultra-high-resolution visualization cyberinfrastructure in a classroom setting. The AOSS student group
demonstrated the ability to effectively utilize advanced cyberinfrastructure using the interfaces provided by a software stack,
enabling them to rapidly prototype compelling applications that take advantage of the high resolution display despite the
technical complexity of the system. Utilizing these tools, the students produced projects ranging from conventional PowerPoint
presentations, to distributed and parallel rendering of movie files, to dynamic multi-modal and multi-resolution weather
visualizations to aid in the prediction or understanding of atmospheric phenomena. In analysis of their achievements,
observations and interactions with the student group provided insight into how the OptIPuter software driving the tiled display
enabled students to rapidly prototype meaningful visualizations aiding their course projects. Considering these results we are
optimistic that these experiences point to the feasibility and utility of the introduction of OptIPortals to the classroom as well as
lessons for the next generation of control software for high resolution displays.
Grid2Win: Porting gLite to Windows-based Platforms
Fabio Scibilia, Dario Russo, INFN-Catania
The grid paradigm has emerged as the next step in the evolution of distributed computing. The gLite middleware
(http://www.glite.org) is one of the most popular grid middleware stacks and is developed in the context of the EGEE project
(http://www.eu-egee.org), which built the largest grid infrastructure for e-Science in the world. At present, gLite essentially runs on
Linux platforms, and this has up to now kept Microsoft Windows users and applications out of the EGEE infrastructure. The aim
of the Grid2Win project is to port basic gLite services to run under MS-Windows to give Windows users access to grid facilities as
well as to make possible the integration of Windows applications with the grid. Among all gLite services, we focus on the User
Interface (UI), which is the set of command line tools to access the grid resources, and the Computing Element (CE), which is the
grid service managing the computing power of the grid. Each CE wraps a Local Resource Management System (LRMS) exploiting
its computing power. Using Cygwin as a POSIX emulation environment, we successfully ported the gLite User Interface to run
under MS-Windows XP and developed a GUI on top of it. Moreover, we ported the Torque/MAUI (free release of the PBS job
scheduler) based CE as the first Windows CE. Encouraged by the results obtained, we also successfully managed to integrate
Microsoft Compute Cluster Server (CCS) into gLite as the first Windows-native LRMS recognized by gLite. The presentation will review
the activities carried out so far as well as future plans.
ChemXSeer: An eChemistry Web Search Engine and Repository
C. Lee Giles, Prasenjit Mitra, Levent Bolelli, Xiaonan Lu, Ying Liu, Anuj Jaiswal, Kun Bai, Bingjun Sun, James Z. Wang, Karl
Mueller, William Brouwer, James Kubicki, Barbara Garrison, Joel Bandstra, Pennsylvania State University
In chemistry, the growth of data has been explosive, and timely, effective information and data access is critical. We propose
the NSF-funded ChemXSeer architecture, a portal for academic researchers in environmental chemistry, which integrates the
scientific literature with experimental, analytical and simulation datasets. ChemXSeer will be comprised of information crawled
from the web, manual submission of scientific documents and user submitted datasets as well as scientific documents and
metadata provided by major publishers. Information crawled by ChemXSeer from the web and user submitted data will be
publicly accessible whereas access to publisher resources can be provided by linking to their respective sites. Thus, instead of
being a fully open search engine and repository, ChemXSeer will be a hybrid, limiting access to some resources. ChemXSeer
intends to offer some unique aspects of search not yet present in other scientific search services. We are developing algorithms
for the extraction of tables, figures, equations and formulae from scientific documents enabling users to search on those fields.
ChemXSeer intends to provide search features including: full-text search; author, affiliation, title and venue search; figure
and table search; equation and formulae search; citation and acknowledgement search; and citation linking and statistics. For
dataset search, we are developing tools that automatically annotate published data representations such as figures, and that
permit researchers to annotate their datasets by providing both document-level and attribute-level metadata in OAI-PMH
format to facilitate searching data more effectively both at the attribute and semantic levels, browsing datasets, and linking to
existing scientific literature and other datasets.
Design and Synthesis of Minimal and Persistent Protein Complexes
David Green, Steven Skiena, Stony Brook University
A major problem in synthetic biology is the tendency of bacterial systems to eliminate any genes that do not directly benefit the
organism, as a result of natural selection favoring shorter genome lengths, which can be replicated more quickly. We are
working on advances in computational protein and gene design that directly address this problem. We have previously
demonstrated an algorithm capable of creating the shortest nucleotide sequence that encodes two given proteins, taking
advantage of multiple reading frames and the redundancy of the genetic code. We also have expertise in computational
approaches to the redesign of proteins to satisfy particular functions. We are currently working to integrate these technologies
in achieving two particular goals. The first involves the interleaving of an antibiotic resistance gene with a particular protein
whose expression is desired. Challenging bacteria containing this construct with the appropriate antibiotic will lead to a
selective pressure to keep the inserted gene; as the sequence of the protein of interest overlaps this coding sequence, the
deletion of the desired protein from the genome will be avoided. Secondly, we are developing methods to directly reduce the
coding length for a given protein, taking a two-step approach: (1) redesign a multi-domain protein consisting of a single
polypeptide sequence into a protein complex; (2) overlap the coding sequences of the two components, leading to a
substantially reduced length of DNA that codes for a functionally equivalent protein. Our approach integrates protein design,
coding-sequence optimization, and validation in an experimental context to address a major problem in the long-term viability of
synthetic biological networks. We will present our initial results in targeting these problems.
Computational Biology Applications Suite for High Performance Computing (BioHPC.net)
Jaroslaw Pillardy, Cornell University
One of the challenges of High Performance Computing (HPC) is user accessibility. At the Cornell University Computational
Biology Service Unit, which is also a Microsoft HPC institute, we have developed a computational biology application suite that
allows researchers from biological laboratories to submit their jobs to the parallel cluster through an easy-to-use web interface.
Through this system, we are providing users with popular bioinformatics tools including BLAST, HMMER, InterproScan, MrBayes,
and others. The system is flexible and can be easily customized to include other software. It is also scalable; the installation on our
servers currently processes approximately 10,000 job submissions per year, many of them requiring massively parallel
computations. It also has a built in user management system which can limit software and/or database access to specified
users. TAIR, the major database of the plant model organism Arabidopsis, and SGN, the international tomato genome database,
are both using our system for storage and data analysis. The suite will be released along with its source code this year. The
system consists of a web server running the interface (ASP.NET C#), Microsoft SQL server (ADO.NET), compute cluster running
Microsoft Windows, ftp server and file server. Users can interact with their jobs and data by a web browser, ftp or e-mail.
Remote HPC clusters can be accessed via JSDL protocol. The interface is accessible at http://BioHPC.net/.
Accelerating Scientific Computations using a GPU: Fast N-Body Simulation with CUDA
Jan Prins, University of North Carolina, Chapel Hill; Lars Nyland, Mark Harris, Nvidia Corp.
Acceleration of computational kernels using a GPU is becoming simpler using improved GPU programming models. We examine
the all-pairs computational kernel for N-body simulation and its implementation using the NVIDIA CUDA programming model.
We show how the parallelism available in the all-pairs computational kernel can be expressed in the CUDA model and how
various parameters can be chosen to effectively engage the full resources of the first GPU to support the CUDA model, the
NVIDIA GeForce 8800 GPU. We report on the performance of a familiar N-body kernel for astrophysical simulations. For this
problem the GeForce 8800 calculates over 10 billion interactions per second performing 100 integration time steps per second
to simulate a system with 10,000 bodies. At 20 flops per interaction, this corresponds to a sustained performance in excess of
200 gigaflops. This is close to the theoretical peak performance of the GeForce 8800 GPU. The all-pairs approach is typically
used as a kernel to determine the forces in close-range interactions. The all-pairs method is then combined with a faster
method based on a far-field approximation of longer range forces, which is only valid between parts of the system that are well
separated. In all cases, a fast all-pairs kernel is essential to the overall performance of the N-body simulation.
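The all-pairs computation and the throughput arithmetic quoted above can be sketched as follows. This is a plain, sequential Python illustration and not the CUDA kernel from the talk: the function names, the softening length, and the unit choice G = 1 are assumptions. The softening term is a common device in astrophysical codes, used here so the i = j term contributes zero force without a branch.

```python
import math


def all_pairs_accelerations(positions, masses, softening=1e-3, G=1.0):
    """O(N^2) gravitational accelerations; every body interacts with every body.

    `softening` and `G` are illustrative assumptions, not values from the talk.
    """
    eps2 = softening * softening
    accelerations = []
    for xi, yi, zi in positions:
        ax = ay = az = 0.0
        for (xj, yj, zj), mj in zip(positions, masses):
            dx, dy, dz = xj - xi, yj - yi, zj - zi
            dist2 = dx * dx + dy * dy + dz * dz + eps2  # softened squared distance
            scale = G * mj / (dist2 * math.sqrt(dist2))  # G * m_j / r^3
            ax += dx * scale
            ay += dy * scale
            az += dz * scale
        accelerations.append((ax, ay, az))
    return accelerations


def sustained_gflops(n_bodies, steps_per_second, flops_per_interaction):
    """Flop rate implied by N^2 interactions per integration time step."""
    interactions_per_second = n_bodies * n_bodies * steps_per_second
    return interactions_per_second * flops_per_interaction / 1e9


# Two unit masses one unit apart pull on each other with equal and opposite force.
acc = all_pairs_accelerations([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 1.0])

# Round figures from the abstract: 10,000 bodies, 100 steps/s, 20 flops/interaction.
gflops = sustained_gflops(10_000, 100, 20)
```

With the round figures from the abstract, 10,000 bodies give 10^8 pairwise interactions per step, so 100 steps per second is 10^10 interactions per second, and at 20 flops each that is 200 gigaflops. In the CUDA model the outer loop over bodies is the part that would be distributed across threads, since each body's acceleration can be accumulated independently.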
Virtual Institute for Integrative Biology (VIIB): an eScience Paradigm for Latin America
David Holmes, Life Science Foundation; Fernando González-Nilo, Center for Bioinformatics and Molecular Simulation; Raúl
Isea, Apartado Postal 40336
This presentation examines the case of the Virtual Institute for Integrative Biology (VIIB) as a Latin American paradigm for
achieving global collaborative eScience. Biology has emerged as one of the major areas of focus of scientific research
worldwide, providing new challenges in eScience and grid computing. Whereas major efforts to meet these challenges have
been mounted in various parts of the world, less appears to have been accomplished in Latin America and the VIIB was
developed to fill this need. The scientific agenda of the VIIB includes: construction and operation of databases for comparative
genomics of particular relevance to Latin America, bioinformatics services and protein simulations for biotechnological and
medical applications. Human resource development through shared teaching, co-sponsored students and seminars is also an
integral component of the collaborative effort. eScience challenges include: connectivity concerns, high performance
computing (HPC) limitations, development of a customized Grid framework, language issues, maintenance of open access
without compromising security and the dissemination of scientific and technical information. Finally, it was recognized that
computational frameworks and flexible workflows were required to efficiently exploit shared resources without causing
impediments to the user who has little interest in the underlying information technology (IT). Overall, the VIIB has proved an
effective way for small teams to transcend the critical mass problem, to overcome geographic limitations and to harness the
power of large scale, collaborative science; as such, it may prove a useful model for promoting additional eScience initiatives in
Latin America and other emerging regions.
eScience in Biomedical Engineering Research: Cancer Modeling and Simulation
Nahuel Olaiz, Esteban Mocskos, Mariano Perez Rodriguez, Lucas Colombo, Alejandro Soba, Cecilia Suarez, Graciela
Gonzalez, University of Buenos Aires; Luis Nuñez, Argonne National Laboratory; Marcelo Risk, Guillermo Marshall,
University of Buenos Aires
Here we describe an application in biomedical engineering. In cancer tumor drug treatment nothing can reach tumor cells
without passing through the vessel wall and the interstitial matrix. Physicochemical and physiological barriers could hinder the
main transport mechanisms, thus leading to heterogeneous therapeutic agent accumulation and some cells remaining
untreated. Use of electric currents in chemotherapy greatly enhances drug transport and delivery. Cancer electrochemical
treatment consists in the passage of an electric current, whether direct (EChT) or micro-/nano-pulsed (ECT), through two or
more electrodes inserted locally in the tumor tissue. Extreme pH changes at tissue level (EChT) or the creation of membrane
porous channels at the cell level (facilitating penetration of anticancer drugs into the cell, ECT), are the main tumor regression
mechanisms. We study tumor drug transport for cancer treatment with nanoparticles (loaded with therapeutic agents) during
EChT and ECT through a combined modeling methodology: in vivo with BALB/c mice bearing a subcutaneous tumor, in vitro
with multi-cellular spheroids and collagen gels, and in silico using the Nernst-Planck, Poisson and Navier-Stokes equations for
ion transport, electric field distribution and fluid flow, respectively. The main goal is to find nano-particle/drug combinations,
electric field intensities and pulse frequencies that optimize tumor treatment. In this interdisciplinary approach we use
web-based I-labs for confocal and fluorescence microscopy image processing, and HPC on a low-latency cluster under the MS
CCS platform. Preliminary results suggest that using nano charged drugs and tuned electrical fields significantly increases drug
Measuring Circadian Activity Rhythms for Home Healthcare: Clinical Potentials and Home Automation Benefit
Gilles Virone
This summary presents a custom Software for Automatic Measurement of Circadian Activity Deviation called SAMCAD. The
primary goal of this software is to extract, from raw activity data collected through passive monitoring, Circadian Activity
Rhythms (CAR) or home human behaviors, for various types of populations who may benefit from a home assistive technology.
Based on a pattern mining algorithm, SAMCAD establishes the life rhythm of a resident in approximately three weeks from
empirical observations, then tracks any behavioral changes that occur during daily life at home. Early clinical trials
show the potential to detect chronic pathologies such as urinary infections or to evaluate cognitive decline or rehabilitation
treatments. Knowledge of life habits, given by a derived type of CAR activity pattern based on the user's presence in every
room, also makes it possible to set up various home automation functions such as power management. For example, half-duplex radio
transmissions, which are in heavy demand during long-term in-home wireless activity monitoring in sensor networks, can be
efficiently regulated for energy saving by mapping motes' behavior to the resident's behavior, while preserving a high quality of
monitoring. Detecting deviations in these home behaviors, part of the CAR model, can also be useful in the field of
privacy, to reinforce rule-based systems dealing with dynamic Role-Based Access Control. Privileges to access personal medical
data belong first to patients. However, they may be willing to automatically grant permissions to caregivers in case of short-term
at-risk situations (falls, cardiac arrests), or in longer situations involving abnormal CAR behavioral context. Such behavioral
anomalies, which may be indicative of a cognitive decline, can be used to alert caregivers to investigate.
Computational insights into the social life of zebras
Tanya Berger-Wolf, University of Illinois at Chicago; Daniel Rubenstein, Princeton University; Mayank Lahiri, Chayant
Tantipathananandh, University of Illinois at Chicago; David Kempe, University of Southern California; Habiba Habiba,
University of Illinois at Chicago; Jared Saia, University of New Mexico
Computation has fundamentally changed the way we study nature. Recent breakthroughs in data collection technology, such as
GPS and other mobile sensors, are giving biologists access to data about wild populations that are orders of magnitude richer
than any previously collected. Such data offer the promise of answering some of the big ecological questions about animal
populations. Unfortunately, in this domain, our ability to analyze data lags substantially behind our ability to collect it. In
particular, interactions among individuals are often modeled as social networks where nodes represent individuals and an edge
exists if the corresponding individuals have interacted during the observation period. The model is essentially static in that the
interactions are aggregated over time and all information about the time and ordering of social interactions is discarded. We
show that such traditional social network analysis methods may result in incorrect conclusions on dynamic data about the
structure of interactions and the processes that spread over those interactions. We have extended computational methods for
social network analysis to explicitly address the dynamic nature of interactions among individuals. We have developed
techniques for identifying persistent communities, influential individuals, and extracting patterns of interactions in dynamic
social networks. We will present our approach and demonstrate its applicability by analyzing interactions among zebra
populations and identifying how the structure of interactions changes with demographic status.
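The difference between a static, time-aggregated network and a dynamic one can be illustrated with a small sketch. This is not the authors' method; it is a minimal example, assuming contacts are given as hypothetical (time, individual, individual) tuples, that compares what can spread from one individual in the aggregated graph with what can spread along time-respecting paths.

```python
from collections import defaultdict
from itertools import groupby


def static_reachable(contacts, source):
    """Individuals reachable from source once all contacts are aggregated
    into a single undirected graph (time and ordering are discarded)."""
    adjacency = defaultdict(set)
    for _, a, b in contacts:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, stack = {source}, [source]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


def temporal_reachable(contacts, source):
    """Individuals reachable from source along time-respecting paths.
    Contacts are processed one time step at a time; something passed on
    at step t can only travel further at a later step."""
    reached = {source}
    for _, step in groupby(sorted(contacts), key=lambda c: c[0]):
        newly_reached = set()
        for _, a, b in step:
            if a in reached:
                newly_reached.add(b)
            if b in reached:
                newly_reached.add(a)
        reached |= newly_reached
    return reached


# Hypothetical observations: B meets C first, and only later does A meet B.
contacts = [(1, "B", "C"), (2, "A", "B")]
print(static_reachable(contacts, "A"))    # aggregated graph links A to C
print(temporal_reachable(contacts, "A"))  # in time order, A cannot reach C
```

In the aggregated view A appears connected to C through B, but because the B-C contact happened before the A-B contact, nothing originating at A could have reached C. This is the kind of incorrect conclusion about spreading processes that a static model can produce.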
Time-Space Continuity of Daily Maps of Fractional Snow Cover and Albedo from MODIS
Jeff Dozier, James Frew, University of California, Santa Barbara
Using reflectance values from the 7 MODIS “land” bands with 250 or 500m resolution, along with a 1km cloud product, we
estimate the fraction of each 500m pixel that snow covers, along with the albedo of that snow. Such products are then used in
hydrologic models in several mountainous basins. The daily products have data gaps and errors because of cloud cover and
sensor viewing geometry. Rather than make users interpolate and filter these patchy daily maps without completely
understanding the retrieval algorithm and instrument properties, we use the daily time series in an intelligent way to improve
the estimate of the measured snow properties for a particular day. We use a combination of noise filtering, snow/cloud
discrimination, and interpolation and smoothing to produce our best estimate of the daily snow cover and albedo. We consider
two modes: one is the “predictive” mode, whereby we estimate the snow-covered area and albedo on that day using only the
data up to that day; the other is the “retrospective” mode, whereby we reconstruct the history of the snow properties for a
previous period.
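The distinction between the two modes can be sketched in a few lines. The example below is a deliberate simplification, not the retrieval or smoothing procedure described above: it assumes a single pixel's daily snow fraction with cloud-obscured days marked as None, fills gaps in "predictive" mode by carrying the last valid value forward, and in "retrospective" mode by linear interpolation between valid days on both sides.

```python
def predictive_fill(series):
    """Fill gaps using only data up to each day: carry the most recent
    valid observation forward. Leading gaps stay None."""
    filled, last_valid = [], None
    for value in series:
        if value is not None:
            last_valid = value
        filled.append(last_valid)
    return filled


def retrospective_fill(series):
    """Fill gaps using data on both sides of each gap: linear
    interpolation between the surrounding valid observations.
    Trailing gaps fall back to the predictive estimate."""
    filled = predictive_fill(series)
    valid_days = [i for i, value in enumerate(series) if value is not None]
    for left, right in zip(valid_days, valid_days[1:]):
        for day in range(left + 1, right):
            weight = (day - left) / (right - left)
            filled[day] = (1 - weight) * series[left] + weight * series[right]
    return filled


# Hypothetical snow fraction for one pixel; None marks cloud-obscured days.
observed = [0.8, None, None, 0.2]
print(predictive_fill(observed))
print(retrospective_fill(observed))
```

The predictive estimate holds the last clear-sky value until new data arrive, whereas the retrospective estimate can use the later observation to reconstruct a gradual decline across the gap.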
A Swiss-Army Knife for Parallel Sequence-Search in Biomedical Informatics
Jeremy Archuleta, Wuchun Feng, Eli Tilevich, Virginia Polytechnic Institute and State University
The biomedical and life sciences communities make heavy use of BLAST (Basic Local Alignment Search Tool) to characterize an
unknown sequence by comparing it against a database of known sequences. The similarity between pairs of sequences enables
biologists to detect evolutionary relationships and infer biological properties of the unknown sequence. For example, it can be
used for phylogenetic profiling, bacterial genome annotation, and pathogen detection. Unfortunately, BLAST has proven to be
too slow to keep up with the current rate of sequence acquisition. Searching for a given sequence against the nucleotide
database takes nearly three times longer today than it did in 2004 despite faster hardware. Thus, we created mpiBLAST, a novel
parallelization of BLAST that runs on many OS platforms, including Microsoft Windows. mpiBLAST can deliver super-linear
speed-up and scale to tens of thousands of processors due to an array of integrated features including database and query
segmentation, advanced job scheduling and load balancing, and parallel I/O. Currently, mpiBLAST v1.4 delivers 305-fold speedup
when running on a 128-processor cluster. By abstracting the execution characteristics of sequence-search algorithms such as
BLAST, mpiBLAST has evolved to efficiently transform any given serial sequence-search tool into a parallel one, thus delivering
the above performance to an entire class of sequence-search algorithms. This new version of mpiBLAST (v2.0) achieves the
above by utilizing “mixing layers” to separate functionality into complementary modules and “refined roles” within each layer
to improve the inherently modular design, thus enhancing maintenance and extensibility, e.g., allow advanced algorithmic
features to be developed and incorporated while routine maintenance of the code base persists.
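The super-linear claim can be checked from the figures quoted above: a 305-fold speed-up on 128 processors is more than one unit of speed-up per processor. The sketch below computes that ratio and also shows the general idea of database segmentation as a simple round-robin split. The function names and the toy database are hypothetical and do not reflect mpiBLAST's actual interfaces.

```python
def parallel_efficiency(speedup, processors):
    """Speed-up per processor; values above 1.0 indicate super-linear scaling."""
    return speedup / processors


def segment_database(sequences, n_fragments):
    """Round-robin split of a sequence database into fragments, so that
    each worker searches only its own, smaller fragment."""
    return [sequences[i::n_fragments] for i in range(n_fragments)]


efficiency = parallel_efficiency(305, 128)
print(f"efficiency = {efficiency:.2f}")  # greater than 1.0, i.e. super-linear

# Toy database of five hypothetical sequence identifiers split across two workers.
print(segment_database(["s0", "s1", "s2", "s3", "s4"], 2))
```

One common explanation for efficiency above 1.0 in this setting is that each database fragment is small enough to fit in a worker's memory, which avoids the disk I/O that slows a serial search of the whole database.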
Global Climate Warming in the Machine Room
Wuchun Feng, Virginia Polytechnic Institute and State University
For decades now, the notion of performance has been synonymous with speed. For example, while the performance of
supercomputers running our n-body cosmology code may have improved nearly 10,000-fold since 1992, the performance
per watt has improved only 300-fold and the performance per square foot only 65-fold. The “mere” 300-fold increase in
performance per watt implies that supercomputers are not making as significant advances in power efficiency as in
performance; similarly, the relatively minuscule 65-fold increase in performance per square foot (or alternatively,
performance per square meter) means that advances in space efficiency, when compared to performance, have been virtually
non-existent. These smaller gains in efficiency oftentimes result in the design and construction of new machine rooms, and in
some cases, require the construction of entirely new buildings. Unfortunately, this particular focus has led to the emergence of
supercomputers that consume egregious amounts of electrical power and produce so much heat that extravagant cooling
facilities must be constructed to ensure proper operation. In addition, the emphasis on speed as the performance metric has
adversely affected other performance metrics, e.g., reliability. As a consequence, all of the above has contributed to an
extraordinary increase in the total cost of ownership (TCO) of a supercomputer. Therefore, we espouse the importance of being
green in high-performance computing and even argue for a complementary list to the TOP500: The Green500 List.
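The figures quoted above imply how much more power and floor space the faster machines need. As a rough worked calculation, assuming the three ratios describe the same machines over the same period, dividing the performance gain by the gain in performance per watt (or per square foot) gives the implied growth in power draw (or footprint).

```python
# Ratios quoted in the abstract ("nearly 10,000-fold" is treated as 10,000 here).
performance_gain = 10_000
performance_per_watt_gain = 300
performance_per_sqft_gain = 65

# performance / (performance / watt) = watts, so the ratio of the gains
# is the implied growth in power draw; likewise for floor space.
implied_power_growth = performance_gain / performance_per_watt_gain
implied_space_growth = performance_gain / performance_per_sqft_gain

print(f"implied power growth: about {implied_power_growth:.0f}-fold")
print(f"implied floor-space growth: about {implied_space_growth:.0f}-fold")
```

Under these assumptions the machines draw roughly 33 times more power and occupy roughly 150 times more floor space, which is consistent with the need for new machine rooms and cooling facilities described above.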
E-Malaria: Getting into the Blood of Young Scientists
Jeremy Frey, University of Southampton
The e-Malaria project aimed to bring together 16-18 year old school students with university researchers to explain aspects of
computational drug design using the example of the hunt for new anti-malarial drugs. Malaria kills a child every thirty seconds,
and 40% of the world’s population lives in countries where the disease is endemic. Resistance to existing drugs is increasing and
with global warming the range of the malaria carrying mosquitoes is expected to increase, so there is a growing need for new
drug compounds. The challenge was presented to school students, who used a distributed drug search and selection system via
a web interface to design potential drugs to act on the DHFR enzyme. The project makes use of industrial code for the docking
study (“GOLD” from CCDC) and as such presents valuable lessons in how to achieve the integration of industrial programs into a
“free” outreach environment. The results of the trials are displayed in an accessible manner, giving students an opportunity for
discussion and debate both with peers and university researchers, to learn about computational drug design and chemistry in
general. The initial outreach project was extended to provide a similar challenge for undergraduate chemists as part of a
chemical informatics course. For this course, more complex design and modeling challenges were devised that used the same
e-Malaria core programs, but at a level relevant to more advanced chemical skills. The types of problems devised will be
illustrated in the presentation.
Xbox Science: Video Games Where Everybody Wins!
Leonard McMillan, University of North Carolina at Chapel Hill
What if solving nature's puzzles was entertaining as well as fulfilling? Would you rather play a first-person shooter, or be the
first person to figure out a gene's function? Or is it possible to do both? This is the challenge that I gave a class of graduate
students. We explored the potential of game interfaces, game-design principles, and game production approaches for
constructing bioinformatics tools. You might ask why? 1) Set-top Supercomputers. The most powerful computer in most homes
today is a video-game console. Today's machines boast multiple cores and 100+ MFlop performance with high-end graphics.
Moreover, at $299, they represent one of the best MFlop per dollar ratios in history. 2) Most bioinformatics applications stink.
Typical bioinformatics tools require their user to be literate in statistics, computer science, and biology. Imagine if, in order to
drive a car, you had to simultaneously be a test-driver, mechanic, and combustion engineer. This is what is expected of today's
biologists. Lab software focuses on function and features rather than usability. In contrast, video game manuals are seldom
read. Is it possible to build scientific tools that are usable by anyone? Can we make them fun? 3) Leverage an insatiable
resource. Can we harness the minds and reflexes of the billion-plus gamers worldwide to find cures for disease with incentives
of being a high scorer rather than securing drug-patent rights? Many of the tasks confronted by biologists amount to
combinatorial puzzles, not unlike the game "Bejeweled". A biologist may spend years searching for patterns within a gene
expression array. What if hundreds of gamers joined in, and explored their datasets in parallel? In this talk, I will share our
experiences in writing video games with a purpose. This will include discussions of some of the underlying biology, as well as
game demonstrations.
Green Computing: A Power-Aware Run-Time System for Datacenter Environments
Wuchun Feng, Virginia Polytechnic Institute and State University
Since the advent of the computer, performance has always been defined with respect to speed. As a consequence,
microprocessor vendors have not only doubled the number of transistors (and speed) every 18-24 months, but they have also
doubled the power densities. Consequently, keeping a datacenter environment functioning properly requires continual cooling
and exhaust, thus resulting in substantial operational costs, e.g., the annual cost of powering and cooling computer servers
worldwide is fast approaching the annual spending on new machines. In addition, the increase in power densities has led to a
decrease in system reliability, thus leading to lost productivity. To address these problems in the datacenter, we present a
power-aware scheduling algorithm that automatically and transparently adapts its voltage and frequency settings to achieve
significant power reduction and energy savings with minimal impact on the performance of datacenter workloads. We evaluate
our power-aware scheduling algorithm on actual platforms based on AMD and Intel processors, which support PowerNow! and
demand-based switching, respectively. For sequential and parallel scientific workloads in datacenters, the energy savings
average 20% and 25%, respectively, with maximum energy savings reaching as high as 70%. The energy savings for business
workloads in datacenters is even higher given their transaction-based execution profiles.
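The trade-off behind voltage and frequency scaling can be sketched with a textbook model. This is not the scheduling algorithm evaluated above. It assumes dynamic power proportional to V² · f (static power is ignored), a hypothetical parameter beta for the fraction of run time that scales with CPU frequency, and made-up voltage/frequency pairs. The selector picks the lowest-energy setting whose slowdown stays within a given bound.

```python
def run_time(t_at_fmax, f, f_max, beta):
    """Run time at frequency f, where beta is the fraction of the run time
    at f_max that scales with CPU frequency (1.0 = fully CPU-bound)."""
    return t_at_fmax * (beta * f_max / f + (1.0 - beta))


def energy(t_at_fmax, f, v, f_max, beta):
    """Relative energy: dynamic power (assumed proportional to V^2 * f)
    multiplied by run time. Units are arbitrary."""
    return (v * v * f) * run_time(t_at_fmax, f, f_max, beta)


def pick_setting(settings, t_at_fmax, beta, max_slowdown):
    """Choose the (frequency, voltage) pair with the lowest energy among
    those whose run time stays within (1 + max_slowdown) of the fastest."""
    f_max = max(f for f, _ in settings)
    limit = t_at_fmax * (1.0 + max_slowdown)
    allowed = [(f, v) for f, v in settings
               if run_time(t_at_fmax, f, f_max, beta) <= limit]
    return min(allowed, key=lambda s: energy(t_at_fmax, s[0], s[1], f_max, beta))


# Hypothetical (GHz, volts) operating points.
settings = [(2.0, 1.4), (1.6, 1.2), (1.0, 1.0)]

# A mostly memory-bound job tolerates a lower frequency within a 5% slowdown bound.
print(pick_setting(settings, t_at_fmax=100.0, beta=0.1, max_slowdown=0.05))
# A fully CPU-bound job must stay at the highest frequency to meet the same bound.
print(pick_setting(settings, t_at_fmax=100.0, beta=1.0, max_slowdown=0.05))
```

The model shows why workloads that spend much of their time waiting on memory or I/O offer the largest savings: lowering the frequency barely lengthens their run time while the V² · f term drops sharply.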
Frontiers in metadata management for e-Science applications: the S-OGSA approach
Oscar Corcho, Paolo Missier, Pinar Alper, Sean Bechhofer, Carole Goble; University of Manchester
eScience applications are usually characterized by their distributed and knowledge-intensive nature, which poses interesting
new metadata management challenges, such as metadata distribution across application components, access control,
evolution, etc. Given the role of metadata in these applications, we think that it should be treated as a first class entity,
coexisting with other entities in the system (Web services, datasets, sensors, documents, etc.). This shift in the treatment of
metadata allows dealing appropriately with the previous challenges. This is what we propose in the S-OGSA architecture (which
stands for Semantically-enriched Open Grid Service Architecture, originally proposed as a semantic extension of Grid
applications), and what we have implemented in its supporting reference technological infrastructure. In S-OGSA, metadata can
refer to any first-class entity that an application is dealing with (services invoked by a workflow engine, datasets, sensors,
scientific documents, etc.), and it can be represented in multiple forms (natural language documentation, user-defined tags,
ontology instances, etc.). Metadata is stored in metadata containers, called Semantic Bindings, which are linked to the entities
that they refer to and which can be accessed either independently or jointly in a system, regardless of their physical
distribution. Access control can be applied with different levels of granularity, since Semantic Bindings may contain small or
large pieces of metadata from a specific resource, and metadata lifetime can be managed by means of appropriate event-driven
notification mechanisms that trigger transitions between metadata states. We describe the main design principles of S-OGSA
and how they can be applied in different e-Science scenarios, with examples of a prototype developed in the domain of satellite
image quality analysis.
Model and Architecture for Policy-Based Governance
Munindar Singh, Yathiraj Udupi, North Carolina State University
Collaboration among peers is common in large-scale scientific computing (as in production grids). Often, resources (e.g., data,
compute servers) need to be shared among multiple parties in a manner that respects both the overall needs of the collective
and the individual. The famous example of preemptive scheduling is a case in point. Currently, computational support for
collaborative resource sharing is inadequate. A common approach is to apply policy engines. This poses two challenges. One,
when autonomous peers interact, a centralized policy engine cannot make decisions for all of them. Two, current approaches
lack a deep conceptual model of how collaboration takes place in scientific computing (or service engagements broadly). We
define Governance as the process by which peers achieve agreement about how they will administer themselves. We contrast
governance with management, which (as the current mindset) applies to a superior managing his or her subordinates -- clearly
inapplicable among peers. We have developed a conceptually well-grounded approach for Governance. This models
organizations based upon our formalization of commitments. Each organization is defined in terms of the standing
commitments among its members. These commitments constrain the members' behaviors. Organizations can enter into
contracts with one another. Our conceptual model includes a rich vocabulary by which interactions among peers (such as for
administering organizations) can be captured, and appropriate policies stated for each peer to satisfy both collective and
individual needs. This is how we achieve policy-based governance. A multi-agent prototype demonstrates our model and
architecture. Our research seeks to capture important technical properties of policy-based governance. This presentation
summarizes work previously reported in AAAI 06 and SCC 06 and 07.
High Performance Computing Mortgage Pricing Project
Richard Buttimer, The University of North Carolina at Charlotte
Mortgages are one of the major fixed-income investment classes in the U.S. They are held by financial institutions, pension
funds, mutual funds, and hedge funds. They are also frequently held in the investment portfolio of non-financial firms.
Mortgages are an extremely complex financial instrument for a variety of reasons: they are long-lived, they are extremely
interest rate sensitive, and they have embedded within them the borrower's options to default and prepay. In practice,
mortgage pricing is nearly always done through very lengthy and computationally-intensive Monte Carlo simulation. Microsoft,
RENCI, and UNC Charlotte are working together to develop a mortgage pricing system utilizing the Microsoft Hosted High
Performance Computing system. This system will initially be used in advanced MBA courses. Students in these courses will be
assigned the task of managing simulated mortgage portfolios similar to those held by large money-center banks. They will
utilize the pricing model to determine not only the prices of the securities they hold, but also their risk characteristics. The
system will also provide prices and risk characteristics for a variety of alternative investment and hedging vehicles. This system
will provide the students with a near "real world" mortgage portfolio management experience. Microsoft, RENCI, and UNC
Charlotte will each gain experience with hosted high-performance computing applications. Although the system will initially
utilize a publicly-available model, the Office of Thrift Supervision (OTS) regulatory model, the model could potentially be
expanded to be a commercially viable system.
Informative Robotic Sensing for Environmental Applications
Amarjeet Singh, Maxim Batalin, William Kaiser; University of California, Los Angeles
Networked InfoMechanical Systems (NIMS) provide a family of robotic platforms for diverse environment monitoring
applications. We provide an overview of these systems and their applicability through several real world sensing campaigns that
provided scientists with the data at a scale and resolution that was not previously possible. The new class of observational
methods is also supported by experimental design that optimizes measurement fidelity by combining knowledge of
measurement objectives, phenomena models, and system constraints. We have developed and demonstrated the generally
applicable Iterative experimental Design for Environmental Applications (IDEA) methods and systems to efficiently use
distributed sensing and computing for understanding the high spatial and temporal variability associated with environmental
applications. Next, we model the observed natural system as a Gaussian Process and present a resource-cost-aware informative
path planning approach. In this approach, we compute a set of most informative observation locations that can be visited by
the mobile robot with a constraint on the upper bound of the resource capacity of the robot, such as limited sensing time or
limited battery capacity. For this NP-hard problem, we provide strong approximation guarantees for the single-robot scenario
and extend it to multiple robots with a near-optimal approximation guarantee. The NIMS family of sensing systems,
together with a systematic experimental design approach that also involves phenomena modeling, enabled the first high
resolution imaging of several important scientific phenomena such as contaminant concentration and algal bloom dynamics.
This work is currently being applied to survey entire river systems in interdisciplinary investigations providing scientists with
important new characterization of primary national water resources.
Enhanced kNN-QSAR Modeling of Aquatic Toxicity of Diverse Organic Compounds Tested
by Fathead Minnows
Lin Ye, Hao Zhu, Alexander Golbraikh, Alexander Tropsha, University of North Carolina at Chapel Hill
Predictive models for acute fish toxicity (96 hour fathead minnow LC50) have been developed. A dataset consisting of 587
molecules with experimentally determined LC50 values was compiled. The entire dataset was randomly divided into a modeling
set (470 compounds) and an external validation set (117 compounds), and this procedure was repeated ten times to generate 10
modeling-validation set pairs. Molecular descriptors were calculated by Dragon and MolConnZ software for all compounds in
every subset. Each modeling set was split into multiple training-test sets using a diversity sampling approach. QSAR models
were developed for individual training sets by kNN methods and the resulting models were validated using the respective test
sets. The models that satisfied the cutoff (both leave-one-out cross-validation Q² for the training set and linear fit R² for the
test set greater than 0.6) were kept. All the successful models were used to make the consensus prediction of the external
validation set. The statistical results of all 10 external validation experiments were similar (R² ranged from 0.67 to 0.83, Mean
Absolute Error (MAE) ranged from 0.46 to 0.66). The results were improved by removing outliers of the modeling set compounds
in the chemical space before model development: for the external validation sets, R² ranged between 0.76 and 0.82,
and MAE between 0.41 and 0.44.
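The leave-one-out acceptance criterion can be made concrete with a small sketch. It is a simplified illustration, not the authors' kNN-QSAR procedure (which also involves descriptor selection and consensus prediction): it assumes an unweighted k-nearest-neighbour regressor on Euclidean distance and computes the cross-validated Q² as 1 - PRESS / SS, where PRESS is the sum of squared leave-one-out errors and SS is the sum of squared deviations from the mean. The descriptor and activity values are made up.

```python
import math


def knn_predict(train_x, train_y, query, k):
    """Predict the activity of query as the mean activity of its k nearest
    training compounds in descriptor space (Euclidean distance)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def loo_q2(xs, ys, k):
    """Leave-one-out cross-validated Q^2 = 1 - PRESS / SS."""
    mean_y = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        press += (ys[i] - knn_predict(rest_x, rest_y, xs[i], k)) ** 2
    total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / total


def passes_cutoff(q2, r2, cutoff=0.6):
    """Keep a model only if both statistics exceed the cutoff."""
    return q2 > cutoff and r2 > cutoff


# Hypothetical one-descriptor training set with a linear activity trend.
descriptors = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
activities = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
q2 = loo_q2(descriptors, activities, k=2)
print(f"Q2 = {q2:.3f}")
```

In the workflow described above, a model with a Q² such as this one would be retained only if its R² on the corresponding test set also exceeded 0.6.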
Context-aware Optimized Sensing of Physiological Signals
Winston Wu, Maxim Batalin, William Kaiser, University of California, Los Angeles
Recent advancement in micro sensor technology permits miniaturization of conventional physiological sensors. Combined with
low-power, energy-aware embedded systems and low power wireless interfaces, these sensors now enable patient monitoring
in home and workplace environments in addition to the clinic. Low energy operation is critical for meeting typical long
operating lifetime requirements. Challenges arise because some of these important physiological sensors, such as
electrocardiographs (ECG), introduce large energy demand because of the need for high sampling rate and resolution, and also
introduce limitations because they are less convenient to wear. Energy usage of the distributed sensor node systems may
be reduced by activating and deactivating sensors according to real-time measurement demand. Indeed, as will be described,
not all the physiological sensors are required at all times in order to achieve high certainty diagnostics. Our results show that
with proper adaptive measurement scheduling, an ECG signal from a subject may be needed for analysis only at certain times,
such as during or after an exercise activity. This demonstrates that autonomous systems may rely on low energy cost sensors
combined with real time computation to determine patient context and apply this information to properly schedule use of high
cost sensors, for example, ECG sensor systems. We have implemented a wearable system based on standard widely-used
handheld computing hardware components. This system relies on a new software architecture and an embedded inference
engine developed for these standard platforms. The performance of the system is evaluated using experimental data sets
acquired for subjects wearing this system during an exercise sequence. This same approach can be used in context-aware
monitoring of diverse physiological signals in a patient's daily life.
Using Low-Cost A-GPS Cell Phones and Web Mapping Applications for Multi-Jurisdictional
Emergency Response Mobilizations
Uma Shama, Lawrence Harman, Juozas Baltikauskas, Daniel Fitch, Glen Kidwell; Bridgewater State College
We document the collaboration of the GeoGraphics Laboratory at Bridgewater State College and the Town of Brewster (MA)
Fire and Rescue Department to develop a low-cost automatic vehicle location (AVL) system using commercial-off-the-shelf (COTS)
military-specification cell phones and web mapping applications to provide situational awareness and post-action analysis for
emergency response command and control personnel in a mobilization involving multiple jurisdictions. Using open-source
software, a program was written to send assisted global positioning system (A-GPS) data at very high refresh rates (2-4
seconds) using inexpensive data-only cell phones and standard Internet communications. The web mapping application
provides a rich no-cost display of the AVL data on the public domain web service www.geolabvirtualmaps.com (Southeastern MA
Emergency Response) with the capacity to add custom features defined by the local emergency response and emergency
management personnel. It is hosted on Microsoft Virtual Earth but uses GeoRSS standards for creating points, lines and areas
for geographic objects added to the application. It also provides a dynamic reverse geo-coding feature that displays the nearest
street address on the vehicle location label of the web display for emergency response commanders. The system was tested as
a part of the Fourth of July Provincetown (MA) Fireworks Mobilization involving ambulances and emergency response
personnel from six towns. This presentation will provide the design features, a geo-spatial analysis of the mobilization and debriefing
of the mobilization commander. This assessment will critique the performance of the technology before, during and
after the mobilization.
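As an illustration of the GeoRSS encoding mentioned above, the sketch below builds a single Atom entry carrying a GeoRSS-Simple point, in which latitude and longitude are written as one space-separated pair. It is not the project's software: the vehicle name, coordinates, and timestamp are made up, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
GEORSS_NS = "http://www.georss.org/georss"

# Use readable prefixes instead of auto-generated ones when serialising.
ET.register_namespace("atom", ATOM_NS)
ET.register_namespace("georss", GEORSS_NS)


def vehicle_entry(vehicle_id, lat, lon, timestamp):
    """Return an Atom entry (as a string) with a GeoRSS-Simple point
    giving the vehicle's position as 'latitude longitude'."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = vehicle_id
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = timestamp
    ET.SubElement(entry, f"{{{GEORSS_NS}}}point").text = f"{lat:.6f} {lon:.6f}"
    return ET.tostring(entry, encoding="unicode")


# Hypothetical position report for one vehicle.
print(vehicle_entry("Ambulance-1", 41.76, -70.08, "2008-07-04T21:00:00Z"))
```

Lines and areas follow the same pattern in GeoRSS-Simple, using line and polygon elements whose text is a longer sequence of latitude/longitude pairs.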
The Virtual Space Interaction Test Bed (VISIT)
Thomas Finholt, Erik Hofer, David Lee, University of Michigan
The School of Information at the University of Michigan recently launched the Virtual Space Interaction Test bed (VISIT) project.
VISIT demonstrates a number of "ultra-resolution" collaboration capabilities. Using OptIPortals of varying sizes (e.g., arrays of
commodity LCD displays coupled with computing clusters and high performance networking), VISIT supports visualization of
images and data at very high resolution (currently 50 megapixels) alongside uncompressed HD video of distant collaborators.
Previous use of OptIPortals has emphasized collocated collaboration and visualization. A key feature of VISIT is distributed
installation of OptIPortals to enable distant collaboration. Requirements for distant collaboration are much different. For
example, with limited or reduced shared visual access, it is necessary to create or simulate many of the cues used in shared
spaces to coordinate conversation and to orient to common visual references. Therefore, VISIT explores the use of multi-modal
sensor data, artifacts (e.g., shared electronic posters), and visual cues to allow distributed collaborators to use OptIPortals both
to conduct their scientific work better and to improve awareness of the availability and presence of remote colleagues.
This model of OptIPortal use emphasizes socio-technical aspects of the technology, seeking to produce gains in scientific
understanding by improving the process of collaboration, as well as through the introduction of advanced visualization
capabilities. Therefore, a key goal of VISIT is evaluation of use in terms of the impact on creation and maintenance of social
network ties among scientists, research performance (e.g., time to produce publications), and usability.
Enabling Pivot Charts on Massive Multidimensional Datasets
Mehrdad Jahangiri, Cyrus Shahabi; University of Southern California
Spreadsheets allow us to perform complex data analysis on scientific datasets. However, they cannot operate efficiently on
large multidimensional datasets generated by the current data acquisition methods. Current science practice is to store the
original data in databases or ftp sites and then manually generate a smaller subset of the data (by sampling, aggregating, or
categorizing). Yet, this time-consuming process suffers from one major drawback. By losing the detailed information and
working with the second-hand dataset, we conduct a biased study of the data by verifying our known hypothesis rather than
being surprised by unknown facts. One of the most exercised functionalities of spreadsheets is to generate meaningful plots
over the data. However, to the best of our knowledge no other work has studied plots as "queries" on large datasets. A Plot
query summarizes how a fact changes over a set of attributes and is visually represented in various forms of charts. The
valuable insight provided by these queries comes from the illustrated relationship among the plot points. Thus it is essential to
preserve this relationship in approximate or progressive answering rather than conserving the accuracy of each individual plot
point. Here, we propose a wavelet-based technique that exploits I/O sharing across plot points to evaluate the query
progressively and efficiently. The intuition comes from the fact that we can decompose a plot query into two sets of aggregate
and slice-and-dice queries. Subsequently, we can effectively compute both as investigated in our earlier studies. Our technique
is not only efficient as an exact algorithm but also very effective as an approximation method in case of limited query time or
storage space. We believe this study can proactively lead us toward building an interactive pivot chart on massive
multidimensional datasets.
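The idea of preserving the relationship among plot points under approximation can be illustrated with a small sketch. The abstract does not name a specific wavelet or algorithm, so the code below assumes an unnormalized Haar transform and hypothetical function names; it only shows that a plot rebuilt from the overall average plus a few large detail coefficients keeps its overall shape.

```python
def haar_forward(values):
    """Unnormalized Haar transform; len(values) must be a power of two."""
    coeffs = list(values)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        averages = [(coeffs[2 * i] + coeffs[2 * i + 1]) / 2 for i in range(half)]
        details = [(coeffs[2 * i] - coeffs[2 * i + 1]) / 2 for i in range(half)]
        coeffs[:n] = averages + details
        n = half
    return coeffs


def haar_inverse(coeffs):
    """Invert haar_forward by repeatedly merging averages with details."""
    values = list(coeffs)
    n = 1
    while n < len(values):
        averages, details = values[:n], values[n:2 * n]
        merged = []
        for a, d in zip(averages, details):
            merged.extend([a + d, a - d])
        values[:2 * n] = merged
        n *= 2
    return values


def progressive_plot(values, k):
    """Approximate a plot from the overall average plus the k largest details.

    Ranking by raw coefficient magnitude is a simplification chosen for
    brevity; it is not claimed to be the authors' ordering.
    """
    coeffs = haar_forward(values)
    ranked = sorted(range(1, len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = {0, *ranked[:k]}
    truncated = [c if i in keep else 0.0 for i, c in enumerate(coeffs)]
    return haar_inverse(truncated)
```

With k=1, the plot [4, 2, 6, 8] is approximated as [3, 3, 7, 7]: no single point is exact, but the rising trend survives, and increasing k refines the answer progressively.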
An Infrastructure for Combining Geospatial Research with Computational Intensive Social
Sciences
Tiberiu Stef-Praun, Ian Foster, Computation Institute/University of Chicago; Robert Townsend, Economics Dept/
University of Chicago
We report on a project that seeks to scale up this approach to larger quantities of data, more computationally demanding
analytic methods, and a larger population of economist and student users. At the core of this project is an infrastructure that
integrates spatial data services for organizing, accessing, analyzing, and displaying spatial data, and computational services that
allow for the distributed processing of models on Grid-enabled resources. Integration via Web Services allows users to pose
questions that are answered by extracting data from GIS data sources, running substantial computations on that data and
depositing derived data back into the spatial data store.
The Data Playground: A Data-Driven Workflow Specification Environment
Carole Goble, Andrew Gibson, Matthew Gamble, Katy Wolstencroft; The University of Manchester; Tom Oinn, The
European Bioinformatics Institute
Workflow environments like Taverna (www.mygrid.org.uk) are great for scientists who have a clear understanding of their task
and goals. However, a significant amount of bioinformatics does not have such well defined goals. We present the Data
Playground, an environment designed to encourage the uptake of workflow systems in bioinformatics through more intuitive
interaction by focusing the user on their data rather than on the processes. A prototype plug-in for the Taverna workflow
environment shows how we can promote the creation of workflow fragments by automatically converting the users'
interactions with data and Web Services into a more conventional workflow specification. We claim that this exploratory mode
is more natural to users, and enables workflow development by example.
Combinatorial QSAR Analysis of Histone Deacetylase Inhibitors and QSAR-based Database
Mining
Hao Tang, Alexander Tropsha, Simon Wang, The University of North Carolina at Chapel Hill; Alan Kozikowski, University
of Illinois at Chicago; Bryan Roth, The University of North Carolina at Chapel Hill
Histone deacetylases (HDAC) play a critical role in transcription regulation. Small molecule HDAC inhibitors are an emerging
target for treating cancer and other cell proliferation diseases. Several previous reports have studied 3D Quantitative Structure-
Activity Relationship (QSAR) to assess the possibility of computer-based drug mining for HDAC inhibitors. We employed the variable
selection k Nearest Neighbor (kNN) approach and the Support Vector Machines (SVM) approach to generate QSAR models for 59
chemically diverse compounds with inhibition activity on class I histone deacetylase. MOE and MolConnZ based 2D descriptors
were combined with kNN and SVM approaches independently to improve the predictability of models. Rigorous model
validation approaches were employed including randomization of target activity (Y-randomization test) and assessment of
model predictability by consensus prediction on two external datasets. Highly predictive QSAR models were generated with
leave-one-out cross-validation R2 (q2) values for the training set and R2 values for the test set as high as 0.81 and 0.80,
respectively, with the MolConnZ/kNN approach and 0.94 and 0.81, respectively, with the MolConnZ/SVM approach. Validated QSAR
models were then used to mine four chemical databases which included a total of over 3 million compounds resulting in 48
consensus hits, including two reported HDAC inhibitors not included in the original data set.
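For readers unfamiliar with the leave-one-out statistic quoted above, the following is a minimal sketch of how q2 can be computed for a plain kNN regressor. It assumes Euclidean distance over descriptor vectors and an unweighted neighbor average, and it omits the variable selection step; the function names are illustrative and this is not the authors' implementation.

```python
import math


def knn_predict(train_x, train_y, query, k):
    """Predict activity as the mean activity of the k nearest training compounds."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def loo_q2(xs, ys, k):
    """Leave-one-out q2 = 1 - PRESS / total sum of squares."""
    mean_y = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(xs)):
        # Hold out compound i and predict it from all the others.
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        press += (ys[i] - knn_predict(rest_x, rest_y, xs[i], k)) ** 2
    total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / total
```

A Y-randomization test in this setting would shuffle `ys`, recompute `loo_q2`, and check that the value for the shuffled activities is much lower than for the real ones.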
Provenance in Kepler-based Scientific Workflow Systems
Meiyappan Nagappan, North Carolina State University; Ilkay Altintas, San Diego Supercomputing Center; George Chin,
Pacific Northwest National Lab; Daniel Crawl, San Diego Supercomputing Center; Terence Critchlow, Pacific Northwest
National Lab; David Koop, University of Utah; Jeffrey Ligon, North Carolina State University; Bertram Ludaescher,
University of California, Davis; Pierre Mouallem, North Carolina State University; Norbert Podhorszki, University of
California, Davis; Claudio Silva, University of Utah; Mladen Vouk, North Carolina State University
Scientific workflow management systems are used to automate scientific discovery. Increasing complexity of such workflows,
and sometimes legal reasons, is fueling a demand for more run-time and historical information about the workflow processes,
outputs, environments, etc. A properly constructed run-time and provenance information collection framework can help manage,
integrate and display the needed information. In this paper we present the provenance system developed by the Department
of Energy Scientific Data Management Enabling Technology Center's Scientific Process Automation group. The solution adds to
the successful Kepler scientific workflow support system by integrating Kepler with a standard LAMP (Linux, Apache, MySQL, PHP)
environment to provide a very flexible and readily deployable (K)LAMP scientific workflow support environment for e-science.
The solution is sufficiently modular to allow use of other workflow engines and other component solutions. This paper discusses
the architecture of the solution, its deployment and some of the principal challenges it is solving: how to collect provenance
information in a standardized and seamless way and with minimal overhead, how to store this information in a permanent way
so that the scientist can come back to it at any time, and how to present this information to the user in a logical manner. A further
part of the issue is the privacy and strict security policies that apply to Department of Energy (DoE) national laboratories.
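The three challenges named above (collecting, storing, and presenting provenance) can be sketched in a few lines. The example below uses Python's built-in sqlite3 module as a stand-in for the MySQL store, and the table layout, column names, and function names are assumptions made for illustration; the abstract does not describe the actual schema.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) a provenance store; the schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS provenance ("
        " run_id TEXT, actor TEXT, event TEXT, detail TEXT, recorded_at REAL)"
    )
    return conn


def record(conn, run_id, actor, event, detail=""):
    """Collect one provenance event for a workflow run."""
    conn.execute(
        "INSERT INTO provenance VALUES (?, ?, ?, ?, ?)",
        (run_id, actor, event, detail, time.time()),
    )
    conn.commit()


def history(conn, run_id):
    """Present the events of one run in the order they were recorded."""
    rows = conn.execute(
        "SELECT actor, event, detail FROM provenance"
        " WHERE run_id = ? ORDER BY rowid",
        (run_id,),
    )
    return rows.fetchall()
```

Passing a file path instead of ":memory:" keeps the records on disk, so a scientist can return to them later.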
Discovery of Novel Geranylgeranyltransferase Inhibitors through Virtual Database Mining
Yuri Peterson, Duke University; Simon Wang, The University of North Carolina at Chapel Hill; Patrick Casey, Duke
University; Alexander Tropsha, The University of North Carolina at Chapel Hill
Geranylgeranyltransferase inhibitors (GGTIs) are small molecule drugs that inhibit C20 lipid modification of CaaX motif proteins.
Attenuating function of these proteins will provide therapeutic benefit in cancer, inflammation, multiple sclerosis, viral infection
(HepC/HIV), apoptosis, angiogenesis, rheumatoid arthritis, atherosclerosis (vascular disease), psoriasis, glaucoma and diabetic
retinopathy. However, there are only two publicly known chemical scaffolds available for GGTIs at present. We have developed
the combinatorial quantitative structure-activity relationship (QSAR) models for 48 known GGTIs, using k-nearest neighbor
(kNN) method, automated lazy learning (ALL) method and partial least square (PLS) method. The models were rigorously
validated based on several statistical criteria, including the randomization of the target property (Y-randomization), the
verification of the training set models' predictive power using test sets, and the establishment of the models' applicability
domain. The validated QSAR models were used to mine major publicly available chemical databases, including the National
Cancer Institute database of ca. 250,000 compounds, the Maybridge database of ca. 54,000 compounds, the ChemDiv database
of ca. 630,000 compounds, the WDI database of ca. 59,000 compounds, and the ZINC 7.0 database of ca. 6,500,000 compounds.
These searches resulted in multiple consensus hits and revealed several new chemical scaffolds for GGTIs. These have been
validated by biological assays and were recently patented. This study illustrates that the combined application of predictive QSAR
modeling and database mining may provide an important avenue for rational computer-aided drug discovery.
A Disease Search Engine for Early Incidence Warning and Monitoring
Hanan Samet, University of Maryland; Jagan Sankaranarayanan, University of Maryland; Michael Lieberman, University of
Maryland; Adam Phillippy, University of Maryland
eScience techniques can be used to understand the source and spread of disease epidemics to contain future outbreaks,
thereby possibly reducing the potentially massive toll on human life in underdeveloped nations. Even though epidemiological
information is available for many pathogenic microbes, incidence reports are scattered and are difficult to summarize. We have
built a system to automatically extract, classify and organize incidence reports based on geographic location and type for
analysis by domain experts. Documents from the U.S. National Library of Medicine (www.pubmed.gov) and the World Health
Organization (www.who.int) have been tagged according to their spatial and temporal relationships to specific disease
occurrences, and presented graphically via a map interface. This work has leveraged our experience with the SAND Spatial
Browser and Spreadsheet to provide spatial and textual search capabilities on the web (e.g., documents on "influenza" near
"Hong Kong"). Users can also see the phrases in the documents that satisfy the query, thereby facilitating easy verification as
well as dismissal of false positives due to errors in identification of geographical references, which are difficult to avoid. The
user interface also provides the ability to restrict the search result to a particular time period. In addition, newspaper articles
have been tagged and indexed to bolster the surveillance of ongoing epidemics, while examining past epidemics using our
system leads to improved understanding of the sources and spreading mechanisms of infectious diseases. In our paper, we will
describe the design of our system which combines state of the art technologies from different areas of computer science and
demonstrate the working and the usefulness of our system.
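A query such as documents on "influenza" near "Hong Kong", optionally restricted to a time period, can be sketched as a keyword filter combined with a great-circle distance test. The record fields, the radius parameter, and the function names below are assumptions for illustration; the real system relies on spatial indexing from the SAND Spatial Browser rather than a linear scan.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def search(reports, keyword, center, radius_km, years=None):
    """Return ids of reports mentioning keyword near center, nearest first."""
    hits = []
    for report in reports:
        if keyword.lower() not in report["text"].lower():
            continue
        if years and not (years[0] <= report["year"] <= years[1]):
            continue
        distance = haversine_km(center[0], center[1], report["lat"], report["lon"])
        if distance <= radius_km:
            hits.append((distance, report["id"]))
    return [report_id for _, report_id in sorted(hits)]
```

Returning the matched ids with their distances makes it straightforward to show the matching phrases next to each hit, which is how users would dismiss false positives.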
Collection Processing and Comparative Studies in GPFlow
James Hogan, Paul Roe; Queensland University of Technology
Modern scientific enquiry, particularly in bioinformatics, is increasingly characterized by fine-grained comparative analyses over
large data sets. Such studies require the automation of software tools to operate across multiple data values, and sensible
strategies for managing the explosion of outputs which may result. Modern scientific workflow systems, therefore, must
provide support for these activities and for the active involvement of the user in selection, combination and filtering. In this talk
we present a new version of the GPFlow scientific workflow system which provides extensive support for collection processing,
but does so in a manner largely transparent to the user, and which avoids the need for the scientist to take direct control of
operational plumbing. GPFlow is a novel, web-accessible workflow system which makes large-scale comparative studies
accessible without programming, and eases the transition from small-scale experimentation to large scale serious analyses. In a
typical comparative study, several tools and services are used in concert, and all must be lifted to operate across sets of values
to implement the analysis, with some components drawing upon outputs from multiple precursors. Data must be combined
and filtered at each end of the process. The model and its implementation are presented in the context of a core Bioinformatics
problem - the search for regulatory motifs. The model is novel in allowing a workflow on a single data value to be automatically
lifted to operate on a set of values. Users may thus prototype on the small-scale and execute on the large, a process which
requires no changes to the underlying workflow. The model follows our previous work in supporting combined interactive and
batch operation.
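The notion of lifting a single-value workflow to a set of values can be sketched as follows. This is a generic illustration with hypothetical helper names, not GPFlow's implementation: each step is written for one value, and a wrapper applies it element-wise whenever a list is supplied.

```python
def lift(step):
    """Wrap a single-value step so that it also accepts a list of values."""
    def lifted(value):
        if isinstance(value, list):
            return [lifted(item) for item in value]
        return step(value)
    return lifted


def workflow(*steps):
    """Chain steps into one callable; every step is lifted automatically."""
    lifted_steps = [lift(step) for step in steps]

    def run(value):
        for step in lifted_steps:
            value = step(value)
        return value
    return run


# A toy motif-counting workflow written for a single sequence.
count_tata = workflow(str.upper, lambda seq: seq.count("TATA"))
```

Calling `count_tata("tataaa")` prototypes on one sequence, and `count_tata(["tataaa", "ggcc", "TATATATA"])` runs the unchanged workflow over a collection.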
The North Carolina State University Virtual Computing Laboratory “Providing an Efficient e-
Science Environment”
Eric Sills, Sam Averitt, Michael Bugaev, Aaron Peeler, Henry Schaffer, Josh Thompson, Mladen Vouk; North Carolina State
University
North Carolina State University has developed a computational and application resource brokering, differentiation, and delivery
system called Virtual Computing Laboratory (VCL). VCL allows sharing of a common hardware infrastructure by a range of
applications from CAD packages. Initially, VCL virtualized the STEM computing environments to deliver applications students
needed for their course work and research via their personal computing devices rather than at a physical computer lab on
campus. As development of VCL progressed, the hardware resources flowed back and forth between production Linux cluster
nodes serving typical HPC workloads and providing on-demand student-computing applications on various operating systems.
Demand curves for these two uses tend to be out of phase, with student computing demand building as the academic semester
progresses and HPC demand peaking following the end of exams. This allows much better utilization of the hardware
resources. VCL has been in production use at North Carolina State University for about three years. Flexibility of VCL has proven
to be essential in easily supporting specialized university research computing demands, and our experience is that VCL-based
hardware and application management provides a much greater service at a considerably lower cost per unit of service. In
addition, VCL provides the various standard and customized services with much less intervention of the central IT staff than
previously necessary. This paper discusses the details of the VCL architecture, economics, security and versatility.
Environmental Monitoring through Acoustics using a Network of Smartphones
Richard Mason, Binh Pham, Paul Roe, Queensland University of Technology
Sound is a rich medium carrying lots of information which is tractable for analysis. The natural environment is rich in sounds;
potentially fauna, weather, and machinery can be located and recognized. Environmentalists use sound to measure the health
of the environment by monitoring key species such as birds which are early indicators of environmental change. We have
designed a sensor network based on smart phones for monitoring environmental change. The platform comprises smart phones
running a custom application for recording bird song. Sensors are managed in an autonomic fashion to ensure that they operate
reliably and efficiently for long periods of time. Recorded birdsong is uploaded to a relational database through a 3G telephony
network. The nature of acoustic sensing means that large volumes of data are collected, so data communication and
optimization are important. Sensor recording can be remotely controlled through a web service interface. Sound data stored in a
database is analyzed to recognize different birds and bird calls using a neural network. A novel noise reduction technique is
employed prior to identification. The analyses potentially enable the location, type of bird and bird behavior (through bird call),
to be known. From this, temporal and spatial profiles of bird behavior can be studied and the effects of environmental change
can be known. A field study is being undertaken at Brisbane airport where a second runway is being constructed. Brisbane
airport is located in an environmentally valuable wetland area, which is the habitat for much wildlife including the rare Lewin's
Rail. This study aims to address a number of questions regarding this bird using acoustic sensor networks. The sensors will
provide valuable information on the birds' habits as well as a measure of the impact of the new runway's construction.
Accurate Differentiation of Docking Decoys using Quantitative Structure - Binding Affinity
Relationship (QSBAR) Classification Models with ENTess Chemical Geometrical Descriptors
Jui-Hua Hsieh, Simon Wang, Shuxing Zhang, Alexander Tropsha; The University of North Carolina at Chapel Hill
Molecular docking has become a common technique in structure-based drug design. Although state-of-the-art search
algorithms implemented in the docking software can generate native-like poses in the binding sites, the performance of the
scoring functions is still unsatisfactory. The failure to correlate the key interactions with binding affinities leads to
"geometric decoys": poses deviating more than 3.0 angstrom RMSD from the native pose but with better energy scores
(Shoichet BK. et al. J. Med. Chem. 2005, 48, 3714-3728). k-Nearest Neighbor (kNN) binary QSAR models generated from 264
protein-ligand complexes in the Protein Data Bank using ENTess descriptors were applied to four geometric decoy datasets:
Thrombin, Dihydrofolate Reductase (DHFR), Thymidylate Synthase (TS) and Acetylcholine Esterase (AchE). The ENTess
descriptors of protein-ligand interaction are based on computational geometry analysis of protein-ligand interfaces by Delaunay
Tessellation and Pauling atomic electronegativities (EN) (Zhang S. et al. J. Med. Chem. 2006, 49, 2713-2724). The complexes
were classified into experimentally Strong Binders (SB) (pKd > 6.50) and Weak Binders (WB) (pKd < 6.50). 4930 QSBAR models
were generated with Correct Classification Rate (CCR) for both training and test sets equal to or higher than 0.60; 185 of them had
CCRtrain above 0.80 and CCRtest above 0.85. The acceptable models were further validated by Y-randomization test and were
shown to have CCR value for external validation sets as high as 0.74. We conclude that models built with ENTess descriptors can
distinguish geometric decoys and prioritize high affinity poses, especially for binding sites with the strong presence of
electrostatic interactions.
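The abstract does not define the Correct Classification Rate. A common definition, assumed here, is the mean of the per-class accuracies, which keeps a model from scoring well by always predicting the larger class. The sketch below uses the pKd threshold of 6.50 from the text; the function names are illustrative.

```python
def label(pkd, threshold=6.50):
    """Strong Binder (SB) above the pKd threshold, Weak Binder (WB) otherwise."""
    return "SB" if pkd > threshold else "WB"


def ccr(actual, predicted):
    """Correct Classification Rate as the mean of per-class accuracies (assumed definition)."""
    rates = []
    for cls in sorted(set(actual)):
        # Predictions made for the complexes that truly belong to this class.
        members = [p for a, p in zip(actual, predicted) if a == cls]
        rates.append(sum(p == cls for p in members) / len(members))
    return sum(rates) / len(rates)
```

For example, if two of three strong binders and one of two weak binders are classified correctly, the CCR is (2/3 + 1/2) / 2, about 0.58, even though the plain accuracy is 0.60.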
Shared Genomics - Accessible HPC for Medical Genomic Research
David Hoyle, Iain Buchan, Peter Crowther; University of Manchester
Microarray technology for genome-wide Single Nucleotide Polymorphism (SNP) genotyping provides a unique opportunity to
study complex diseases. This opportunity also presents computational and knowledge management challenges, and the
statistical analysis presents a computational bottleneck in processing the raw data, motivating the need for a High Performance
Computing (HPC) based solution. Statistical analysis of the raw data produces an equally large volume of derived data. Making
sense of this derived data requires integrating the statistical analyses with information already known to the research
community, such as SNP location, gene regulation, relevant biochemical pathways etc. Leveraging this community knowledge
allows us to filter the statistical analysis and focus upon the most important genetic determinants of the diseases. The
community knowledge exists in the form of individual expertise of scientists and information deposited in distributed databases
and knowledge repositories. Easy access to both HPC infrastructure and community knowledge will be crucial for accelerating
new research findings from genome-wide SNP studies. At NIBHI we have begun to develop, in collaboration with Microsoft, the
necessary HPC infrastructure. The HPC facility will be accessed via a SharePoint portal site, providing a shared environment
through which collaborating scientists exchange results, analyses, comments and documents. Running the statistical analyses
on the HPC infrastructure is executed by initiating workflows from the portal site. Access to community knowledge will be done
through automatic retrieval of annotation data from distributed sources. This can be performed via integration with existing
bioinformatics workflow management systems, such as Taverna, that allow us to re-use workflows calling web services
accessing the knowledge repositories.
Simulating Air Quality and Other Wind Engineering Applications with an Urban Landscape
Alan Huber, The University of North Carolina at Chapel Hill
High-fidelity local-scale Computational Fluid Dynamics (CFD) simulation of pollutant concentrations within roadway and urban
landscapes is feasible using current high performance computing. Local-scale CFD simulations are able to account rigorously for
topographical details such as terrain variations and building structures in urban areas. Solar or anthropogenic heating may be
added to terrain and building surfaces. Real human environments may be directly simulated to support urban planning and
response to emergency situations. There is a wide range of potential applications where computational wind engineering will
become routine in coming years as computing hardware and software continue to grow and expand the frontiers for
application. This presentation will briefly review the history of developments of computational environmental fluid dynamics.
Modern day fluid dynamics has evolved much since Sir Isaac Newton's physical equations and the evolution of the Navier-
Stokes equation for fluid flow due to advancing computational hardware and software. The Navier-Stokes equation is the
general basis for all CFD applications, for example, from weather prediction to vehicular aerodynamics. Example applications
developed over the past few years while employed with the US Environmental Protection Agency are now being applied as an
adjunct research faculty of the University of North Carolina using the critically needed computing capacity of RENCI's Topsail
computing system. In particular, simulations of the air transport of pollutant emissions within the Madison Square Garden area
of New York City will be demonstrated. The virtual environment for midtown Manhattan has been developed to support
planning and response to potential accidental emissions or intended terror activities. The age of direct local-scale
environmental simulation has arrived.
QSAR Modeling of Blood and Brain Barrier Permeability of Diverse Organic Compounds
Liying Zhang, Hao Zhu, Alexander Tropsha; The University of North Carolina at Chapel Hill
We have developed robust QSAR models of Blood-Brain Barrier (BBB) permeability using k-Nearest Neighbors (kNN)
and Support Vector Machines (SVM) approaches and molecular topological
descriptors. The modeling set of 159 compounds was divided into external
evaluation set (15 compounds) and multiple training and test sets (the
remaining 144 compounds). The consensus QSAR model accuracies were q2=0.91
and R02=0.68 for self-validation and external evaluation sets, respectively.
These models were applied to additional external evaluation sets consisting
of 99 drugs (from the WOMBAT-PK dataset) and 267 organic compounds classified
as permeable (BBB+) or non-permeable (BBB-), and the best prediction
accuracies were 82.5% and 59.0%, respectively. Noticeable improvements in
prediction accuracy were achieved after applying an applicability domain
threshold for the prediction of evaluation sets: the accuracy for the first
external evaluation set increased to R02=0.75 and for both of the additional
external sets to 100%. The resulting models can be used to guide the design
of pharmaceutically relevant chemical libraries towards drug-like compounds
with optimal BBB permeability.
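The applicability domain threshold is not specified in the abstract. One common formulation, assumed here, accepts a query compound only if its distance to the nearest training compound does not exceed the mean plus Z standard deviations of the nearest-neighbor distances within the training set. The names and the default Z = 0.5 are illustrative.

```python
import math
import statistics


def nearest_distance(point, others):
    """Euclidean distance from point to its closest neighbor in others."""
    return min(math.dist(point, other) for other in others)


def domain_threshold(training, z=0.5):
    """Mean + z * standard deviation of nearest-neighbor distances in the training set."""
    distances = [
        nearest_distance(point, training[:i] + training[i + 1:])
        for i, point in enumerate(training)
    ]
    return statistics.mean(distances) + z * statistics.pstdev(distances)


def in_domain(query, training, z=0.5):
    """True if the query compound lies within the applicability domain."""
    return nearest_distance(query, training) <= domain_threshold(training, z)
```

Predictions for compounds outside the domain are withheld, which trades chemistry space coverage for accuracy on the compounds that remain.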
Combinational QSAR Modeling of Chemical Toxicants Tested against
Tetrahymena pyriformis
Hao Zhu, Alexander Tropsha, The University of North Carolina at Chapel Hill
Selecting suitable quantitative structure-activity relationships (QSAR) approaches for a specific toxicity endpoint is one of the
critical issues for the development of robust predictive computational toxicity models. To this end, we have compiled an
aqueous toxicity dataset containing 1,093 unique compounds tested in the same laboratory over several years against
Tetrahymena pyriformis. A modeling set consisting of 644 compounds randomly selected from the original set was distributed to
five cheminformatics groups to use their own QSAR approaches and descriptors for model development. The remaining 449
compounds in the original set were used as an evaluation set to test the predictive power of individual models. In total, our
virtual collaboratory generated 11 different validated QSAR toxicity models for the training set. The best models had the Leave
One Out (LOO) cross-validation correlation coefficient R2(q2) = 0.93 for the training set and the correlation coefficient R2 for
the external evaluation sets as high as 0.83. The results demonstrated that the evaluation of the models only based on the
statistical parameters obtained for the modeling set may mislead the selection of the externally predictive models. We have
developed a consensus model based on the average of the prediction results of all 11 models. The consensus model resulted in
the best prediction accuracy for the training and external evaluation sets as high as 0.95 (q2) and 0.86 (R2), respectively. The
utilization of the applicability domain could be included to balance the prediction accuracy with the chemistry space coverage
based on the requirement of the users with respect to the error tolerance level.
Ad Hoc Scientific Workflows through Data-Driven Service Composition
David Chiu, Gagan Agrawal; The Ohio State University
Scientific domains increasingly involve data that can be obtained from the deep Web, while having other datasets in low-level
formats. At the same time, an increasing number of Web or grid services are being made available. This leads to an interesting
question: "Can we query low-level and deep Web data by automatically composing services and creating workflows?" Our work
is driven by a collaboration with geodetic sciences, funded by an NSF grant for Cyberinfrastructure for Environmental
Observatories. Specifically, geospatial data is known to have:
- Large Volumes: data may be collected in a continuous manner.
- Low-level Format: data is normally stored in native low-level format, rather than in databases.
- High Dimensionality: high dimensionality inherently implies nontrivial complexity for processing certain types of data.
- Heterogeneous Data Sources: disparate data sources can collect and represent the same information with different
  accuracy, precision, and format.
- Spatio-Temporal Domain: since geospatial data is highly volatile, rigorous maintenance of descriptors such as location
  and date is imperative to providing accurate information.
We propose a system that automatically constructs ad hoc workflows for answering high-level queries
based on both service and data availability. A specific contribution of this work is the so-called "data-driven" capability in which
we provide a framework to capture and utilize information redundancy that is present in heterogeneous data sources. We will
use "machine-interpretable metadata" to be able to understand and parse low-level datasets and use them with the services.
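One way to picture workflow construction driven by both service and data availability is backward chaining over service signatures: start from the requested concept, find a service that produces it, and recurse on that service's inputs until everything is covered by available data. The sketch below is an assumption-laden illustration; the service names, concept names, and the `compose` function are hypothetical and do not come from the described system.

```python
def compose(goal, services, available, _pending=frozenset()):
    """Return an ordered list of service names that produces goal, or None.

    services maps a name to (list of input concepts, output concept);
    available is the set of concepts already present as data.
    """
    if goal in available:
        return []  # data already exists, so no service call is needed
    if goal in _pending:
        return None  # guard against cyclic service dependencies
    for name, (inputs, output) in services.items():
        if output != goal:
            continue
        plan = []
        for needed in inputs:
            sub_plan = compose(needed, services, available, _pending | {goal})
            if sub_plan is None:
                break
            plan.extend(step for step in sub_plan if step not in plan)
        else:
            return plan + [name]
    return None


# Hypothetical geospatial services: name -> (inputs, output).
SERVICES = {
    "parse_raw": (["raw_file"], "elevation_grid"),
    "query_gauge": (["station_id"], "water_level"),
    "extract_shoreline": (["elevation_grid", "water_level"], "shoreline"),
}
```

If a water level reading is already present in the data, the planner skips `query_gauge` entirely, which is a simple instance of exploiting redundancy between data sources and services.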
A Novel Approach to Structure-based Pharmacophore Search Using Computational
Geometry and Shape Matching Techniques
Jerry Ebalunode, North Carolina Central University; Zheng Ouyang, University of Illinois at Chicago; Jie Liang, Weifan
Zheng, North Carolina Central University
The structure-based drug design methods are typified by docking technologies that have been widely adopted by the
pharmaceutical industry for virtual screening and library design. They are often the computational tools of choice for both lead
generation and lead optimization. However, despite many reports of successful applications of off-the-shelf docking tools,
serious issues remain unsolved in terms of the accuracies of docking poses and affinity scores. Recently, more intuitive and
computationally more efficient structure-based methods have been reported that seek to find effective means to utilize
experimental structure information without employing detailed docking calculations. These tools can (should) be coupled with
efficient HTS technologies to improve the probability of success in the discovery process. For example, LigandScout has been
successfully applied in several virtual and experimental HTS projects. We report the development of a new method that
employs a rigorous computational geometry method and a deterministic geometric casting algorithm to derive the negative
image of a binding site. Once the negative image of the binding site is generated, a variety of computer vision methods can be
applied to compare and match small organic molecules with the shape of the negative image. We report the detailed
computational protocol and its validation using known biologically active compounds extracted from the WOMBAT database.
Models derived for selected targets are used to perform the virtual screening experiments to obtain the enrichment data for
various methods. It is found that our new approach (Shape4 for shape pharmacophore) affords significantly better enrichment
of hits than other methods studied in this work.
The Challenges for eScience with the Pan-STARRS Sky Surveys
Nick Kaiser, Jim Heasley, Eugene Magnier, Alex Szalay; University of Hawaii
The Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) will use giga-pixel CCD cameras on multi-aperture
telescopes to survey the sky in the visible and near infra-red bands. A single telescope system (PS1) has been deployed on Maui
and a four-telescope system (PS4) will be sited on Mauna Kea on the Big Island of Hawaii. These systems will survey the sky
repeatedly and will generate petabytes of image data and catalogs of billions of stars and galaxies. The images will be combined
to generate a very sensitive multi-color image of the static sky, and differences between images will provide a massive database
for "time domain astronomy"; the study of moving, transient or variable objects. In addition to the challenge of building the
telescopes and detectors, the project is faced with the formidable challenges of processing the image data in near real time and
making the catalog data accessible via relational databases in order to facilitate the eScience that this project promises. This
talk will describe the scale and content of the data products and will outline the designs of the image processing and database
and archiving systems.
The eScience program at the University of Copenhagen
Eric Jul, Brian Vinter; University of Copenhagen
The University of Copenhagen has started a graduate program in eScience and has established an eScience center to
further develop and enhance research in eScience. The university has recognized the importance of eScience and has therefore
established a graduate degree in eScience. While it is possible to take many eScience related courses in most degree
programs at the University, the University feels that by establishing a separate eScience degree, a much stronger emphasis can
be put on eScience. The new program has achieved solid backing from all departments of the Faculty of Natural Sciences. At the
workshop I, as Director of eScience Studies, would welcome the chance to present the approach that the University of
Copenhagen has taken to promote the new eScience graduate degree program - and the motivation for establishing an
eScience center that draws faculty members from many different areas of the Natural Sciences. As far as we know, our program
is one of the very first to offer students a cross-disciplinary program and, at the same time, the opportunity to interact with
researchers at a dedicated eScience research center. At the workshop, the motivation and rationale for the program will be
presented and the specific core courses will be described.
Analysis and Characterization of Reactive Cysteines in Protein Structures and Within Cellular
Signal Transduction Networks
Stan Thomas, Freddie Salsbury, Jr., Stacy Knutson, Leslie Poole, Jacquelyn Fetrow; Wake Forest University
Protein post-translational modifications play key biological roles by modifying the structure and function of proteins. A common
example is that of protein phosphorylation in signal transduction, metabolism and cellular differentiation. Analysis of
phosphorylation sites has led to a better understanding of kinase substrate specificity, methods for site prediction and a
combined experimental/computational approach resulting in a better understanding of the yeast phosphoproteome. Cysteine
sulfenic acid (Cys-SOH) is a catalytic intermediate at enzyme active sites, a sensor for cellular stress, a regulator of transcription
factors and an intermediate in redox signaling. The cysteine post-translational modification to sulfenic acid is not random;
features at or near the cysteine control its reactivity. To identify the features responsible for the propensity of certain cysteines
to be modified to sulfenic acid, a list of 47 proteins (containing 49 Cys-SOH sites) was compiled. Modifiable cysteines are found
in proteins from many structural and functional classes. The site itself is not located in any one type of secondary structure. To
further identify residues affecting cysteine reactivity, sites were analyzed using both functional profiling and electrostatic
analysis. The combined approach reveals mechanistic determinants not obvious from sequence comparison alone. The long-term
goals of this work are: 1) to combine structural and electrostatic feature analysis to predict Cys-SOH modification sites; 2)
to include other modifications and distinguish between types of reactive cysteines; 3) to create a publicly accessible database of
known and potential modification sites. The database would link sequence, structure, chemical and biological data to allow
researchers to assess the effects of mutations or the possibility of oxidative cysteine modifications in proteins.
Data Placement Services for eScience Workflows
Ann Chervenak, University of Southern California
Data management for eScience applications is a challenging problem. Data-intensive scientific applications may produce and
consume terabytes of data, which must be staged into and out of the high-performance computing resources on which the
application's computational analyses run. These analyses are often represented as scientific workflows that consist of millions
of interdependent tasks. Workflow management systems are increasingly used to manage the dependencies among these
computational tasks and the movement of data sets that are produced or consumed during task execution. The placement of
data sets on storage resources can have a significant impact on the performance of eScience workflows. For example, if data
sets are placed near high-performance computing resources, they can be staged efficiently into computations that execute on
those resources; moving data sets off computational resources quickly when task execution is complete can also improve
performance. In this talk, we consider the use of policy-driven Data Placement Services to improve the performance of eScience
workflows. We are studying a variety of placement policies that seek to place data sets in ways that are advantageous for
scientific workflow execution. Our research focuses on the relationship between data placement services and workflow
management systems, with the goal of making data placement largely asynchronous with respect to workflow execution, thus
reducing the need for on-demand data staging by the workflow system. The workflow system can also provide hints to the data
placement service about the order in which data are accessed. Using two existing services, the Data Replication Service
for staging data and the Pegasus workflow management system, we demonstrate that intelligent data placement has the
potential to significantly improve the performance of eScience workflows.
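The benefit of decoupling data placement from workflow execution can be illustrated with a toy model. The sketch below is not the Data Replication Service or Pegasus API; the function name, the single shared transfer link, and the timings are all hypothetical assumptions. It compares on-demand staging, where each task waits for its own input, with asynchronous pre-staging, where the transfer for a later task overlaps the computation of an earlier one in the order hinted by the workflow system.

```python
def makespan(tasks, prestage):
    """Return total run time for tasks given as (stage_seconds, compute_seconds).

    Assumptions (hypothetical, for illustration only): tasks run one at a
    time in list order, and all transfers share a single serialized link.
    """
    if not prestage:
        # On-demand staging: every task first waits for its own transfer.
        return sum(stage + compute for stage, compute in tasks)

    # Asynchronous placement: transfers proceed back to back in the hinted
    # access order while earlier tasks are still computing.
    stage_done = 0.0
    compute_done = 0.0
    for stage, compute in tasks:
        stage_done += stage
        # A task starts once its data has arrived and the previous task ended.
        compute_done = max(compute_done, stage_done) + compute
    return compute_done


workflow = [(10, 30), (10, 30), (10, 30)]
on_demand = makespan(workflow, prestage=False)
prestaged = makespan(workflow, prestage=True)
print(on_demand, prestaged)
```

In this toy case only the first transfer remains on the critical path; real workflows with storage limits and many concurrent tasks would need a richer model.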
Mapping the Early Universe with a Next Generation Radio Telescope of Silicon and
Software
Lincoln Greenhill, Harvard-Smithsonian Center for Astrophysics; Daniel Mitchell, Steven Ord, Randall Wayth; Smithsonian
Astrophysical Observatory
The universe began with the expansion and cooling of the Big Bang, with particles eventually combining to form a dark sea
of atomic hydrogen. Over time, gravity drew material together, giving rise to the earliest stars, black holes, and galaxies. Intense
ultraviolet radiation gradually heated and then destroyed the neutral hydrogen. Then the "dark sea" parted and the era of
reionization, which lasted a billion years, brought about the most important structures in the universe we know. Yet we have
only vague notions of how the universe evolved during this time. The best way to study reionization is to map the evolving
distribution of hydrogen. The Mileura Wide-field Array (MWA) will do this for the first time; it is a new-concept, digital, radio
“camera” in which the traditional telescope optics of lenses and reflectors are effectively replaced by software and high
performance computers. The MWA computer pipeline will absorb in real time 128 gigabits of data per second (24x7), execute
calibration and Fourier transform image construction on the fly, and accumulate reduced data to enable output at a
manageable few hundred TB per year, a 1000x reduction. This is one of the larger computing challenges in radio astronomy,
and would have been impractical to attempt without recent computing advances. I will describe known MWA computing
challenges, with emphasis on throughput and I/O, pipeline parallelization, possible application of GPUs, use of instrument
simulations in algorithm and software development, scaling to future instruments, and collaboration thus far with the IIC.
Building Next-generation CyberCollaboratory for Environmental Observatories
Yong Liu, James Myers, Barbara Minsker, Joe Futrelle, Steve Downey; National Center for Supercomputing Applications;
Il-hwan Kim, University of Michigan; Esa Rantanen, National Center for Supercomputing Applications
Providing community-scale infrastructure while enabling innovation by individual researchers is a central challenge for eScience
efforts. Since 2004, the CyberCollaboratory, which is built on top of the open source Liferay portal framework, has been part of
the efforts at the National Center for Supercomputing Applications to build national cyberinfrastructure to support
collaborative research in environmental engineering and sciences. The CyberCollaboratory was used by the project office of the
Collaborative Large-scale Engineering Analysis Network for Environmental Research (CLEANER), now the WATer and
Environmental Research Systems (WATERS) network, and by several CLEANER/WATERS test bed projects. Among over 400
registered users, over 100 were actively involved in the CyberCollaboratory. However, users have also reported usability
issues. For example, users working in multiple groups found it difficult to get an overview of all of their activities and found
differences in group layouts to be confusing. Users also found the standard account creation and group management processes
cumbersome and wanted a better sense of presence and social networks within the portal. Keeping the document repository
up-to-date as editing was performed on local files and as files were transmitted via email was another concern. As a result of
this feedback and discussions with representatives from the CUAHSI (Consortium of Universities for the Advancement of
Hydrologic Science) community, new design and development efforts were initiated in early 2007. This paper reviews the
usability feedback and potential design changes and provides a summary of the changes made to the CyberCollaboratory.
Leveraging OGC Sensor Web Enablement and Open Source Enterprise Service Bus for Real-
Time Urban Digital Watershed Data Integration and Dissemination
Yong Liu, National Center for Supercomputing Applications; David Fazio, US Geological Survey; Tarek Abdelzaher,
University of Illinois at Urbana-Champaign; Barbara Minsker, National Center for Supercomputing Applications
The value of real-time hydrologic data dissemination including river stage, stream flow, and precipitation for operational storm
water management efforts is particularly high for communities where flash flooding is common and costly. Ideally, such data
would be presented within a watershed-scale geospatial context to portray a holistic view of the watershed. Recent efforts on
providing unified access to hydrological data have concentrated on creating new SOAP-based web services and common data
formats (e.g. WaterML and Observation Data Model) for data access (e.g. HIS and HydroSeek). OGC Sensor Web Enablement
(SWE) proposes a revolutionary concept; however, these efforts do not facilitate dynamic data integration/fusion among
heterogeneous sources, data filtering, or support for workflows or domain-specific applications. We propose a lightweight
integration framework that extends SWE with an open source Enterprise Service Bus (e.g., Mule) as a backbone component to
dynamically transform, transport, and integrate both heterogeneous sensor data sources and simulation model outputs. We
will report our progress on building such a framework, in which sensor data and hydro-model outputs (with map
layers) from multiple agencies will be integrated and disseminated in a geospatial browser (e.g. Virtual Earth). Our project is the result of collaboration
between the National Center for Supercomputing Applications, the US Geological Survey, the Illinois Water Science Center, and
the Computer Science Department at the University of Illinois at Urbana-Champaign and is funded by the Adaptive
Environmental Infrastructure Sensing and Information Systems initiative.
Semantically Aware On-line Community for Biomedical Researchers
Sudeshna Das, Alister Lewis-Bowen, Lousi Weitzman, Tim Clark; Harvard University
We are developing a reusable framework for on-line communities of biomedical researchers. Although there is a growing
number of biological knowledge bases, the vast majority of biological information and various resources used by the community
(such as cell lines, antibodies etc.) reside in laboratory notebooks and heterogeneous databases. The context of the data is
rarely captured and information exchange among researchers is usually accomplished via emailing of documents or
conversations. Moreover, community websites publishing on-line materials rarely, if ever, link them to the biological
information or resources, whereby key knowledge is lost. We are developing the framework as a Drupal (www.drupal.org)
distribution integrated with an RDF triple store and some associated Java components. Drupal is a popular content
management system and is widely used by various communities to develop their websites. The framework will allow easy
publishing of online materials. In addition, the framework will have semantic underpinnings to capture the relationships
between research articles, biological entities, profiles of experts etc. We will use an extension of the SWAN ontology (Clark and
Kinoshita, 2007) as our knowledge schema. Our goal is to organize and repurpose on-line material in communities by defining
and capturing semantic relationships to existing knowledge repositories. Such a knowledgebase will enable richer and more
powerful interactions among many subdisciplines within the scientific community.
Clark T, Kinoshita J. Alzforum and SWAN: The Present and Future of Scientific Web Communities. Brief Bioinform. 2007 May;
8(3):163-71. Epub 2007 May 17.
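To make the idea of capturing relationships between articles, biological entities and expert profiles concrete, the sketch below stores subject-predicate-object triples in memory and answers simple pattern queries. It is only an illustration of the triple model: the class, the identifiers (such as "article:123") and the predicate names are hypothetical, are not taken from the SWAN ontology, and do not represent the framework's actual RDF store.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class TripleStore:
    """A tiny in-memory stand-in for an RDF triple store (illustrative only)."""

    def __init__(self) -> None:
        self._triples = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._triples.add(Triple(subject, predicate, obj))

    def match(self, subject: Optional[str] = None,
              predicate: Optional[str] = None,
              obj: Optional[str] = None) -> List[Triple]:
        """Return triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t.subject == subject)
            and (predicate is None or t.predicate == predicate)
            and (obj is None or t.obj == obj)
        )


store = TripleStore()
# Hypothetical identifiers and predicates, invented for this example.
store.add("person:alice", "authorOf", "article:123")
store.add("article:123", "discusses", "protein:APP")
store.add("antibody:ab42", "targets", "protein:APP")

# Everything linked to one biological entity, regardless of relationship type.
for triple in store.match(obj="protein:APP"):
    print(triple.subject, triple.predicate, triple.obj)
```

Because an article and a laboratory resource point at the same entity, a single query surfaces both, which is the kind of link that is lost when materials are published without semantic annotation.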
Sharing Digital Science
David De Roure, The University of Southampton; Carole Goble, University of Manchester
Most computer users are familiar with the practice of sharing individual files, such as text, photos, videos and music, using
social tools – Wikis, blogs and social networking sites like Flickr, YouTube and Facebook. Scientists are beginning to share
information this way too. However, scientists commonly work with collections of digital items which include experimental
plans, documentation, data, results, logs of runs, ‘housekeeping’ information, etc. myExperiment (http://myexperiment.org) is
a social space for sharing scientific workflows and associated information – a way for scientists to share reusable pieces of
scientific practice. In contrast to photo-sharing on Flickr or videos on YouTube, the basic unit of sharing in myExperiment is not
a single file but rather a package of components that make up an experiment – what we call an Encapsulated myExperiment
Object (EMO), and others have called Reproducible Research Objects. Notionally an EMO is a folder containing the various
assets associated with an experiment. In the scientific context there are stringent requirements with respect to versioning,
ownership, intellectual property and the maintenance of provenance information. We have looked at emerging practice in
sharing “pieces of science” in the scientific and scholarly lifecycle, from social sites to digital repositories. myExperiment
provides simple and extensible support to better understand requirements as new collaborative practice emerges. In this
presentation, we will describe the characteristics of EMOs and present our initial design solution which supports the
requirements of encapsulation and preserves our principles of simplicity and interoperability.
Simple Standards-based Grids
Andrew Grimshaw, University of Virginia
Providing such transparency and thus minimizing the effort required by users to integrate and use their code and data in the
grid is both practical and desirable. The lack of easy data integration and access within a grid is a major barrier to a large
number of potential grid users because they physically cannot change their code (the code is commercial or they do not have
the source code) or because they do not have the time to devote to performing the necessary integration. At a macro level it is
desirable to remove such burdens from end users because time they devote to grid integration activities is time taken away
from working in their area of expertise (their science and research), while lowering the integration effort will encourage more
users to take advantage of the benefits that data and compute grid systems offer. This talk will focus on the data grid
capabilities of the Genesis II grid system. Genesis II is an open implementation of grid standards emerging from the Open Grid
Forum. Specifically, Genesis II implements WS-Naming, the HPC-Profile, OGSA-BES, OGSA-ByteIO, RNS, and the draft OGSA
Express Authentication Profile suite.
Computationally-intensive Tasks in Medical Imaging Informatics
William Horsthemke, Daniela Raicu, Jacob Furst; DePaul University
Medical imaging informatics addresses initiatives to improve the performance of clinical radiology. These efforts range from
managing images for reading by radiologists to computer-aided diagnosis. Many projects require significant image processing to
extract image features for use in diagnosis or as reference queries for retrieving other images with similar characteristics. The
effectiveness of such projects often depends on having large image data sets. Given the computational complexity of many
image processing techniques and the number and size of medical images, medical imaging informatics tools are limited by
hardware resources. Many tasks can be parallelized or adapted to distributed processing as available on grid-based technology,
such as image processing feature extraction, dataset storage, content-based image retrieval (CBIR), and computer-aided
diagnosis (CAD). We propose using grid technologies for three specific medical imaging tasks: 1) automatic segmentation of liver
tissue in computed tomography (CT) of the abdomen, 2) CBIR for retrieving lung nodule cases in CT, and 3) classification of
tumors in mammography images. Each task has a significant requirement for image processing to extract low-level features; the
independence of these features, together with the presentation of data as a grid of pixels, offers excellent opportunities to use
grid technology. The high-level algorithms built on extracted image features (segmentation, similarity measures, and machine
learning, respectively) can be run in parallel in a number of different ways: across image slices, retrieved images, and
independent machine learning steps. A focus on grid-enabled techniques will permit the inclusion of computationally complex
algorithms and larger datasets than would otherwise be acceptable given the near-real-time performance requirement of
clinically usable medical imaging applications.
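Because low-level features of one image slice do not depend on any other slice, extraction can be farmed out slice by slice. The sketch below illustrates that pattern under simplifying assumptions: a local thread pool stands in for grid workers, the features are just mean and variance of pixel intensity, and the function names and the tiny synthetic volume are hypothetical rather than taken from the authors' system.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, pvariance


def slice_features(slice_pixels):
    """Compute simple intensity features for one 2D slice (list of rows)."""
    flat = [pixel for row in slice_pixels for pixel in row]
    return {"mean": mean(flat), "variance": pvariance(flat)}


def extract_features(volume, max_workers=4):
    """Process each slice independently; results keep the slice order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(slice_features, volume))


# Synthetic two-slice "volume" of 2x2 pixels, invented for this example.
volume = [
    [[0, 0], [0, 0]],
    [[1, 3], [5, 7]],
]
features = extract_features(volume)
print(features)
```

On a real grid the per-slice function would be dispatched to remote nodes and would compute richer texture or shape descriptors, but the independence between slices that makes the mapping possible is the same.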
Methods for Automated, Real-Time, Public Health Disease Surveillance in Metropolitan
Atlanta Using Computerized Integration, Knowledge Management, and Analysis of Multiple
Data Streams
Douglas Lowery-North, Eugene Agichtein, James Buehler, Walter |