2010 Microsoft Research eScience Workshop

Exploration of Real-Time Provenance-Aware Virtual Sensors Across Scales for Studying Complex Environmental Systems
Yong Liu, Alejandro Rodriguez, Joe Futrelle, Rob Kooper, and Jim Myers, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

In this position paper, we present our extended concept of, and preliminary work on, “Real-Time Provenance-Aware Virtual Sensors” across scales for studying complex environmental systems, especially for sensor-driven real-time environmental decision support and situational awareness. Real-time provenance-aware virtual sensors can re-publish transformed “data, information and knowledge” streams as virtual sensor streams, with associated provenance information that describes the causal relationships and derivation history in real time. We present an early implementation of Open Provenance Model-compliant provenance capture across heterogeneous layers of workflows, system daemons and user interactions, together with the re-publishing of the provenance-aware virtual sensors, to illustrate the value for environmental systems research and the improved interoperability with the Open Geospatial Consortium’s Sensor Web Enablement standards.

Development and Application of Network of Geosensors for Environmental Monitoring
Rafael Santos, INPE – Brazilian National Institute for Space Research

Some of the goals of the Brazilian National Institute for Space Research are related to research of space and the environment in general and development of tools and methods to support it. One of these research areas is the modeling and study of the interaction between the Earth’s atmosphere and the terrestrial biosphere, which plays a fundamental role in the climate system and in biogeochemical and hydrological cycles, through the exchange of energy and mass (for example, water and carbon), between the vegetation and the atmospheric boundary layer.

The main focus of many environmental studies is to quantify this exchange over several terrestrial biomes.

Over natural surfaces like the tropical forests, factors like spatial variations in topography or in the vegetation cover can significantly affect the air flow and pose big challenges for the monitoring of the regional carbon budget of terrestrial biomes.

With this motivation, a partnership involving INPE, FAPESP (the Research Council for the State of São Paulo), Microsoft Research, the Johns Hopkins University, and the University of São Paulo was created to research, develop and deploy prototypes of environmental sensors (geosensors) in the Atlantic coastal and in the Amazonian rain forests in Brazil, forming sensor networks with high spatial and temporal resolution; and to develop software tools for data quality control, integration with other sensor data, data mining, visualization and distribution.

This short talk presents some concepts, approaches, solutions and challenges on the computational aspects of our project.

BLAST Atlas: A Function-Based Multiple Genome Browser
Lawrence Buckingham, Alejandro Rodriguez, Joe Futrelle, Rob Kooper, and Jim Myers, Queensland University of Technology

BLAST Atlas is a visual analysis system for comparative genomics that supports genome-wide gene characterization, functional assignment and function-based browsing of one or more chromosomes. Inspired by applications such as the WorldWide Telescope, Bing Maps 3D and Google Earth, BLAST Atlas uses novel three-dimensional gene and function views that provide a highly interactive and intuitive way for scientists to navigate, query and compare gene annotations. The system can be used for gene identification and functional assignment or as a function-based multiple genome comparison tool which complements existing position based comparison and alignment viewers.

DIVE: A Data Intensive Visualization Engine
Dennis Bromley, Steven Rysavy, David Beck, and Valerie Daggett, University of Washington

Data-driven research is a rapidly emerging commonality throughout scientific disciplines. Recently, with the proliferation of inexpensive commodity computing clusters, synthetic data sources such as modeling and simulation are capable of producing a continuous stream of terascale data. Confronted with this data deluge, domain scientists are in need of data-intensive analytic environments. Dynameomics is a terascale simulation-driven research effort designed to enhance our understanding of protein folding and dynamics through molecular dynamics simulation and modeling. The project routinely involves exploratory analysis of 100+ terabyte datasets using an array of heterogeneous structural biology-specific tools. In order to accelerate the pace of discovery for the Dynameomics project, we have developed DIVE, a framework that allows for rapid prototyping and dissemination of domain independent (e.g., clustering) and domain specific analyses in an implicitly iterative workflow environment.

The information in the data warehouse is classified into three categories: raw data, derived data, and state data. Raw data are generated from simulations and models, derived data are produced through tools operating on the raw data, and state data constitute the record of the exploratory workflow, which has the added benefit of capturing the provenance of derived data.

DIVE empowers researchers by reducing the overhead associated with shared tool use and heterogeneous datasets. Furthermore, the workflow provides a simple, interactive, and iterative data-oriented investigation paradigm that tightens the hypothesis generation loop. The result is an expressive, flexible laboratory informatics framework that allows researchers to focus on analysis and discovery instead of tool development.
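
The three data categories described above (raw, derived, and state data) can be illustrated with a small sketch. This is not DIVE's implementation; the class name `Warehouse`, its methods, and the single-input tool model are assumptions made purely for illustration. It shows how a log of workflow steps (state data) can double as the provenance of derived data.

```python
from dataclasses import dataclass, field


@dataclass
class Warehouse:
    """Hypothetical container mirroring the raw / derived / state split."""
    raw: dict = field(default_factory=dict)      # output of simulations and models
    derived: dict = field(default_factory=dict)  # results of tools run on other data
    state: list = field(default_factory=list)    # ordered record of workflow steps

    def add_raw(self, name, values):
        self.raw[name] = values

    def derive(self, tool_name, tool, source, target):
        """Run a tool on a raw or derived item and log the step as state data."""
        data = self.raw[source] if source in self.raw else self.derived[source]
        self.derived[target] = tool(data)
        self.state.append({"tool": tool_name, "input": source, "output": target})

    def provenance(self, target):
        """Walk the state log backwards to list the steps that produced target."""
        steps = []
        current = target
        for step in reversed(self.state):
            if step["output"] == current:
                steps.append(step)
                current = step["input"]
        return list(reversed(steps))
```

In this sketch, calling `provenance` on a derived item returns the chain of logged steps leading back to the raw data, which is the sense in which the workflow record captures provenance as a side effect.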

Simplifying Oligonucleotide Primer Design Software to Keep Pace with an Ever Increasing Demand for Assay Formats
Kenneth "Kirby" Bloom, Illumina Corporation

As the pace of research and discovery in biotechnology continues to accelerate rapidly, oligonucleotide primer design software with plug-in algorithm architecture and scalable processing capabilities has become essential. With constantly changing algorithms leveraging a multitude of technologies for employing various chemistries and locus targeting techniques, the ability to manage, maintain, and extend the source code and data repositories became a hurdle for getting new products to market.

This challenge was met by creating a dynamic execution model that enables drag-and-drop component construction through the use of Microsoft Workflow, allowing for simplicity and scalability in the application. This architecture decreased the time needed to deliver new assays to market by 60 percent. Identifying a generic workflow pattern to support primer design also helped structure an architecture that yields a more than 700 percent speed improvement and can scale the solution across multiple servers to meet burst demand scenarios.

Integration of Sequence Analysis into Third Dimension Explorer Leveraging the Microsoft Biology Framework
Jeremy Kolpak, Michael Farnum, Victor Lobanov, and Dimitris Agrafiotis, Janssen Pharmaceutical Companies of Johnson & Johnson

Third Dimension Explorer (3DX) is a powerful, internally developed .NET platform designed to address a broad range of data analysis and visualization needs across Johnson & Johnson Pharmaceutical Research & Development. 3DX employs a plugin approach that allows the development of extensions for particular tasks while sharing a common set of core analytic and visualization functionality. This architecture has allowed us to extend the 3DX platform to many areas of pharmaceutical R&D, from early drug discovery (e.g., analysis of chemical structures and their associated biological properties) to mining of electronic medical records.

As 3DX became a foundational system for small molecule pharmaceutical R&D and its use became widespread throughout the company, the need to extend its capabilities to support biologics research quickly emerged. As with small molecules, we followed a two-pronged approach: 1) integrate large molecule discovery data into our existing global discovery data warehouse known as ABCD, and 2) develop a new set of advanced sequence-activity analysis and visualization tools under the 3DX framework to leverage and complement its existing capabilities. The end result is a unique offering: a single data warehouse integrating both small and large molecule data (ABCD), and a single end-user application for mining and visualizing that data (3DX).

However, expanding 3DX’s analysis capabilities to biologics was not a trivial task. Two options were available to us at that time: 1) re-implement the entire infrastructure ourselves from scratch, or 2) attempt to integrate existing tools built on disparate technology platforms. Neither option was appealing; the former because of resource constraints, and the latter because of the inherent maintenance and performance issues. Fortunately, it was at that time that MBF was released in beta, and it offered an excellent foundation for seamless integration into our native .NET platform, providing much of the core functionality needed to meet our researchers’ needs.

The functionality that we developed enables interactive visualization and editing of multiple sequence alignments (via a customized sequence viewer plugin) and integration of data mining and analytic capabilities (e.g., BLAST searching of sequence libraries, multiple sequence alignment, sequence editing and translation, segment extraction, and so forth). While the sequence viewer is most useful when integrated into a general data mining application like 3DX, it was designed as a 3DX-independent extension of the MBF, thus providing a generic platform for viewing sequences and their associated metadata. It is our intention to make it freely available for use under the Microsoft Public License.

Achieving an Ecosystem Based Approach to Planning in the Puget Sound
Stephen Stanley, Susan Shull, and Susan Grigsby, Washington Department of Ecology; Gino Luchetti, King County DNR; Margaret Macleod and Peter Rosen, City of Issaquah; Millie Judge, Lighthouse Natural Resources Consulting, Everett

Watershed research over the past 20 years has recognized that factors controlling the biological and physical functions at the site scale operate over multiple spatial and temporal scales (Naiman and Bilby, 1998; Beechie & Bolton, 1999; Hobbie, 2000; Benda, 2004; Simenstad et al., 2006; King County, 2007). This requires data at mid and broad scales from watersheds encompassing thousands of hectares. However, available data at mid and broad scales are often inaccurate and inconsistent in their coverage. This complicates the effort to understand the mechanistic relationship between the impacts of a land use activity upon a watershed process and site scale functions and environmental responses, such as low survival of salmonid eggs or flooding. Furthermore, watershed assessments require the integration of knowledge from multiple scientific disciplines. There is, however, no common language, and data sets are mismatched in their forms of knowledge and in their levels of precision and accuracy (Benda et al., 2002). As a result, the predictive ability and management utility of watershed assessment tools have been considered low (Beman, 2002). Because of these data/scale and integration issues, state and local governments have not developed a standard system for using watershed information to inform future development patterns in a manner that avoids significant long-range impacts to aquatic ecosystems. These issues have also prevented the public from adequately understanding the important role that broad scale data could play in protecting and restoring aquatic resources. To help incorporate watershed data and assessment into local planning efforts, the State of Washington is developing a watershed characterization and planning framework for Puget Sound. This includes methods to assess multiple watershed processes and integrate the results into “decision templates” (Stanley et al., 2009, in review). The templates help interpret and apply the characterization information appropriately.

Adapting Environmental Science Methods to Public Policy and Decision Support
Rob Fatland, Microsoft Research

Dozier and Gail posit a new Science of Environmental Applications, driven more by need than traditional scientific curiosity. I present here a brief elaboration on applying this idea to public policy and decision support based on an example of aquifer management on a small (22 km²) island in Puget Sound. I use the first person plural “we” to imply a community of environmental application problem solvers interested in sharing solutions in the way that scientists share research, from methods to results. In consequence these remarks concern the sociology of integrating science with decision making, a process with attendant difficulties (today) in both sharing and adopting solutions.

An Interactive Modeling Environment for Systems Biology of Aging
Pat Langley, Arizona State University

In this paper, we describe an interactive environment for the representation, interpretation, and revision of qualitative but explanatory biological models. We illustrate our approach on the systems biology of aging, a complex topic that involves many interacting components. We also report initial experiences with using this environment to codify an informal model of aging. We close by discussing related efforts and directions for future research.

Analyzing the Process of Knowledge Dynamics in Sustainability Innovation: Towards a Data-Intensive Approach to Sustainability Science
Masaru Yarime, University of Tokyo

Sustainability science is an academic field that analyzes the processes of production, diffusion, and utilization of various types of knowledge with long-term consequences for innovation. Three components can be identified in the knowledge dynamics system in society: knowledge, actors, and institutions. Knowledge has aspects of content, quantity, quality, and rate of circulation. Actors are characterized by their heterogeneity, their linkages and networks, and the interactions among them. Institutions cover a diverse set of entities, ranging from informal ones such as norms and practices to more formal ones including rules and laws. Sustainability science thus deals with dynamic, complex interactions among diverse actors creating, transmitting, and applying various types of knowledge under institutional conditions. Several phases are identified in the production, diffusion, and utilization of knowledge with different actors. Gaps and inconsistencies inevitably exist among different phases in terms of the quantity, quality, and rate of knowledge processed. This constitutes a major challenge in pursuing sustainability on a global scale. The phases of the process of knowledge dynamics include problem discovery, scientific investigation, technological development, diffusion in society, and reactions from stakeholders in society. These phases are analyzed by using a data-intensive approach, assembling and integrating a diverse set of data through bibliometric analysis of scientific articles published in academic journals, patent analysis of technologies, life cycle assessment of products, and discourse analysis of mass media. Case studies of innovation in photovoltaic and water treatment technologies are conducted by assembling and integrating various types of data on the different phases of the knowledge dynamics. They suggest that gaps and inconsistencies in the knowledge circulation system would pose serious challenges to the pursuit of sustainability innovation.

Data-Intensive Science for Safety, Trust, and Sustainability
Shuichi Iwata and Pierre Villars, The University of Tokyo

We report thoughts on a “Data Commons” for data-intensive science, based on our preliminary studies of data-driven materials design. The target is not only materials but also all time-dependent properties related to the aging of engineering products and human bodies and to the degradation of environments.

Our methods are not powerful enough to predict the time-dependent properties of complex systems, so we use causality and correlation in data to ensure that safety margins are adequate. In short, “safety” is confirmed by data, and “trust” is built by sufficient margins, again confirmed by data. These subjects are data-intensive from the outset because of their inherent complexity.

To deal with such complexity proactively and to obtain a set of creative, holistic views on each time-dependent complexity, we propose “Data Commons” as a platform for collective knowledge. It is to be constructed by going beyond the following two challenges:

(1) Horizontal comparative approaches to gain perspectives through a set of two-dimensional maps on deep semantics, as demonstrated by our former project LPF (Linus Pauling File)

(2) Vertical converging (= heuristic inverse/direct) approaches to a concrete target beyond “multi-scale modeling”, as tried by VEMD (Virtual Experiments for Materials Design), to bridge gaps between data and models, allowing rich diversities of scenarios

The third challenge is to drive abductive approaches so as to become free from “lock-in”, which can be attained by strategic organization of (1) and (2) through collective knowledge. A paradigm for data-centric science is discussed through a preliminary study along this approach. Commitments to the collective knowledge are the key to sustainability.

BL!P: A Tool to Automate NCBI BLAST Searches and Customize the Results for Exploration in Live Labs Pivot
Vince Forgetta and Ken Dewar, McGill University; Moussa S. Diarra, Pacific Agri-Food Research Centre, Agriculture and Agri-Food Canada; Simon Mercer, Microsoft Research

NCBI BLAST is a tool widely used to annotate protein coding sequences. Current limitations in the annotation process are in part dictated by the methodology used. The manual inspection of BLAST results is slow, tedious and limited to static analysis of textual output, while automated analyses typically discard useful information in favor of increased speed and simplicity of analysis. These limitations can be addressed using data exploration and visualization software, such as Live Labs Pivot by Microsoft, a software application that allows for the fluid exploration of large datasets in an intuitive manner. We have created a Microsoft Windows application, BL!P [blip] or BLAST in Pivot, that automates NCBI BLAST searches, fetches associated GenBank records, and converts this information into a Pivot collection. Also, BL!P provides an interface to create customized images for each BLAST match, allowing the user to perform further customizations to meet their data exploration objectives.

GenoZoom: Browsing the Genome with Microsoft Biology Foundation, Deep Zoom, and Silverlight
Xin-Yi Chua, Queensland University of Technology; Michael Zyskowski, Microsoft Research

Many current genome browsers face a number of limitations: they do not support smooth, rapid navigation of large-scale data from high to low resolutions; information is limited to a predefined set of genomic data; lengthy setup is required to display a user’s own genome sequences; and they do not support unformatted user annotations. GenoZoom was an investigation that attempted to address these limitations by utilizing the richness enabled by the Silverlight [1] and Deep Zoom [2] technologies.

Data, Data, Everywhere, nor Any Drop to Drink: New Approaches to Finding Events of Interest in High Bandwidth Data Streams
Mark Abbott, Ganesh Gopalan, and Charles E. Sears, Oregon State University

The amount of unstructured data gathered and managed annually by organizations within both the research and the business sectors is growing exponentially. Qualitatively, this shift is even more radical, as the conceptual framework for data moves from a historic, disaggregated, and static perspective to one based on assumptions about the potential of dynamic data management and collaboration. Knowledge extraction will require new tools to enable new levels of collaboration, visualization, and synthesis. This is not just scaling up traditional compute workflows to accommodate greater volumes; it is about scaling out to broadly dispersed data and teams that come together to work on specific business and science issues. We are using high-definition (HD) data arrays derived from a range of observing systems and models as streaming data sets. The problem space is defined as detecting, annotating, and classifying events or features in the HD stream, linking these with an XML-based database, and providing web services to a broad range of network-aware devices, not just deskside workstations. We are developing a content-based high-definition video search engine that integrates multiple Microsoft technologies, including a multi-touch interface to query and navigate through video clips, WPF for transitions in the interface, and a SQL Server back end with an HTTP endpoint to search through video using MPEG-7, with CLR stored procedure integration to support MPEG-7 tasks directly within the database. Finding the data “drop” of interest will require new approaches, not simply “scaling up” the hardware and approaches we have used for the past decades. Instead, we must accommodate the “scaling out” of data sources, repositories, and users. Our research explores these new avenues to capture, analyze, visualize, distribute and present large-scale digital e-science content.

Extreme Database-centric Computing in Science
Alex Szalay, Tamas Budavari, Laszlo Dobos, and Richard Wilton, Johns Hopkins University

Scientific computing is becoming increasingly about analyzing massive amounts of data. In a typical academic environment, managing data sets below 10 TB is easy; above 100 TB it is very difficult. Databases offer many advantages for the typical patterns required for managing scientific data sets, but lack a few important features. Here we present recent projects at JHU aimed at bridging the gap between databases and scientific computing. We have implemented a framework that enables us to execute SQL Server User Defined Functions on GPGPUs, implemented a new array datatype for SQL Server, and run several science analysis tasks using these features.

Model-Driven Cloud Services for Cancer Research
Marty Humphrey, University of Virginia

The cancer Bioinformatics Grid (caBIG) is a virtual network of interconnected data, individuals, and organizations. Overseen by the NIH National Cancer Institute (NCI), caBIG is redefining how research is conducted, care is provided, and patients/participants interact with the biomedical research enterprise. Given its ambitious goal and vision, caBIG faces a huge number of technical and economic challenges. The software underlying caBIG must be user-friendly, scalable, secure, evolvable and evolving, able to find and process the relevant information necessary to the computation at hand, interoperable with other platforms, cost-effective, and so forth. Delivering on these requirements has the potential to be truly transformative, revolutionizing cancer research and transforming patient health care into a highly-personalized model.

However, it has been observed that the current software of caBIG is very restrictive: there is a tremendous learning curve, whereby researchers must often become familiar with a whole new set of tools and methodologies (based on Java). caBIG is fundamentally model-driven; however, the current modeling capabilities in caBIG are rigid and ineffective, and many of the potential benefits of a model-driven architecture are not being realized. Infrastructure costs (both with respect to software design/deployment and with respect to running deployed services) are starting to overwhelm caBIG as it seeks to expand.

In our prior work (Microsoft eScience Workshop 2008), we demonstrated how to create a caBIG Data Service based on ADO.NET Data Services and WCF. In this talk, we demonstrate how we address these challenges through the use of Microsoft SQL Server modeling technologies, ADO.NET Entity Framework in .NET 4.0, OData, Microsoft Visual Studio 2010, and Windows Azure to deliver model-driven cloud services for cancer research.

Cloud-Based Map-Reduce Architecture for Nuclear Magnetic Resonance-Based Metabolomics
Paul Anderson, Satya Sahoo, Ashwin Manjunatha, Ajith Ranabahu, Nicholas Reo, Amit Sheth, and Michael Raymer, Wright State University; Nicholas DelRaso, Air Force Research Laboratory

The science of metabolomics is a relatively young field that requires intensive signal processing and multivariate data analysis for interpretation of experimental results. We present a scalable scientific workflow approach to data analysis, where the individual cloud-based services exploit the inherent parallel structure of the algorithms. Two significant capabilities include the adaptation of an open source workflow engine (Taverna) that provides flexibility in selecting the most appropriate data analysis technique, regardless of their implementation details, and the implementation of several common spectral processing techniques in the cloud using a parallel map-reduce framework, Hadoop. Due to its parallel processing architecture and its fault-tolerant file system, Hadoop is ideal for analyzing large spectroscopic data sets.
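
The map-reduce pattern mentioned above can be sketched in plain Python. This is not the authors' Hadoop code and uses no Hadoop API; spectral binning is chosen here only as an illustrative processing step, and the function names, the `(ppm, intensity)` input layout, and the default bin width are all assumptions.

```python
from collections import defaultdict


def map_spectrum(sample_id, points, bin_width=0.5):
    """Map step: emit ((sample_id, bin_index), intensity) for each (ppm, intensity) point."""
    for ppm, intensity in points:
        yield (sample_id, int(ppm // bin_width)), intensity


def reduce_bins(pairs):
    """Reduce step: sum the intensities that share the same (sample_id, bin_index) key."""
    totals = defaultdict(float)
    for key, intensity in pairs:
        totals[key] += intensity
    return dict(totals)


def bin_spectra(spectra, bin_width=0.5):
    """Run the map step over every spectrum, then reduce all emitted pairs."""
    pairs = (
        pair
        for sample_id, points in spectra.items()
        for pair in map_spectrum(sample_id, points, bin_width)
    )
    return reduce_bins(pairs)
```

Because each point is mapped independently and the reduce step only combines values that share a key, the work can in principle be split across many nodes, which is the property a framework such as Hadoop exploits.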

MyExperimentalScience, Extending the “Workflow”
Jeremy Frey, Andrew Milsted, Danius Michaelides, and David De Roure, University of Southampton

For the past few years there has been a lot of activity in the preservation and dissemination of "in silico" experiments through the sharing of "workflows". This term has been used to describe the processes performed by such experiments, but it can also apply to "real" in vitro experiments, by describing the experimental steps performed by the scientist. In the past these workflows would have been recorded in a paper labbook, so the only way to share a workflow would be to write a journal paper just around the procedure or to expose full pages of the labbook. With the introduction of Virtual Research Environments (VRE) and Electronic Laboratory Notebooks (ELN), there is now a possibility of sharing these processes.

The MyExperimentalScience project linked the myExperiment platform with the ELN LabBlog. myExperiment is a collaborative environment in which scientists can safely publish their workflows and experimental plans, share them with groups and find those of others. Workflows, other digital objects and bundles (called Packs) can now be swapped, sorted and searched like photos and videos on the web. Unlike Facebook or MySpace, myExperiment fully understands the needs of the researcher and makes it really easy for the next generation of scientists to contribute to a pool of scientific methods, build communities and form relationships, reducing time-to-experiment, sharing expertise and avoiding reinvention. myExperiment is now the largest public repository of scientific workflows.

The Conversion Software Registry
Michal Ondrejcek, Kenton McHenry, and Peter Bajcsy, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

We have designed a web-based Conversion Software Registry (CSR) for collecting information about software capable of file format conversions. The work is motivated by a community need to find file format conversions that are inaccessible via current search engines, and by the specific need to support systems that can actually perform conversions, such as NCSA Polyglot. In addition, the CSR complements existing file format registries such as the Unified Digital Formats Registry (UDFR, formerly GDFR) and PRONOM, and introduces software quality information obtained by content-based comparisons of files before and after conversion. The contribution of this work is the CSR data model design, which includes file format extension-based conversions as well as software scripts, software quality measures, and test-file-specific information for evaluating software quality. We have populated the CSR with the help of National Archives and Records Administration (NARA) staff. The Conversion Software Registry provides multiple search services. As of May 28, 2010, the CSR has been populated with 183,142 conversions, 544 software packages, 1,316 file format extensions associated with 273 MIME types, and 154 PRONOM identifications.
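
One way a registry of extension-based conversions could be searched is sketched below. This is an illustration only, not the CSR's data model or search service: the tuple layout `(software, input_ext, output_ext)`, the function name, and the tool names in the usage note are hypothetical. It uses breadth-first search to find the shortest chain of conversions between two file format extensions.

```python
from collections import deque


def find_conversion_path(conversions, source, target):
    """Breadth-first search for the shortest chain of conversions.

    conversions: iterable of (software, input_ext, output_ext) tuples.
    Returns a list of (software, input_ext, output_ext) steps, an empty list
    if source equals target, or None if target cannot be reached.
    """
    # Index the registry entries by input extension.
    edges = {}
    for software, src, dst in conversions:
        edges.setdefault(src, []).append((software, dst))

    queue = deque([(source, [])])
    seen = {source}
    while queue:
        ext, path = queue.popleft()
        if ext == target:
            return path
        for software, nxt in edges.get(ext, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(software, ext, nxt)]))
    return None
```

For example, given hypothetical entries `("toolA", "obj", "ply")` and `("toolB", "ply", "stl")`, a query from `obj` to `stl` would return both steps in order, even though no single package performs the conversion directly.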

oreChem: Planning and Enacting Chemistry on the Semantic Web
Mark Borkum, Simon Coles, and Jeremy Frey, University of Southampton

This paper presents the oreChem Core Ontology (CO), an extensible ontology for the description of the planning and enactment of scientific methods. Currently, a high level of domain-specific knowledge is required to identify and resolve the implicit links that exist between the digital artefacts that are realised during the enactment of a scientific experiment. This creates a significant barrier-to-entry for independent parties that wish to discover and reuse the published data. The CO radically simplifies and clarifies the problem of representing a scientific experiment, to facilitate the discovery and reuse of the raw, intermediate and derived results in the correct context. In this paper, we present an overview of the CO and discuss the integration of the CO with the eCrystals repository for crystal structures.

Accelerating Chemical Property Prediction with Cloud Computing
Hugo Hiden, Paul Watson, David Leahy, Jacek Cala, Dominic Searson, Vladimir Sykora, and Simon Woodman, Newcastle University

This paper describes the use of cloud computing to accelerate the building of models to predict chemical properties. The chemists in the project have unique software, the Discovery Bus, that automatically builds quantitative structure-activity relationship (QSAR) models from chemical activity datasets. These models can then be used to design better, safer drugs, as well as more environmentally benign products.

Recently, there has been a dramatic increase in the availability of activity data, creating the opportunity to generate new and improved models. Unfortunately, the competitive workflow algorithm used by the Discovery Bus requires large computational resources to process data; for example, the chemists recently acquired some new datasets which would take more than five years to process on their current, single-server infrastructure.

This is potentially an ideal cloud application as large computational resources are required, but only when new datasets become available. Therefore, in the “Junior” project, we have designed and built a scalable, Windows Azure cloud-based infrastructure in which the competitive model-building techniques are explored in parallel on up to 100 nodes. As a result, the rate at which the Discovery Bus can process data has been accelerated by a factor of more than 100, and the new datasets can be processed in weeks rather than years.
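
The "weeks rather than years" claim above can be checked with simple arithmetic. The sketch below treats the stated figures (more than five years, a factor of more than 100) as round numbers; the function name and the 52-week year are assumptions made for illustration.

```python
def accelerated_weeks(baseline_years, speedup, weeks_per_year=52):
    """Convert a baseline runtime in years to weeks after a given speedup factor."""
    return baseline_years * weeks_per_year / speedup


# Five years of single-server processing, accelerated 100-fold.
weeks = accelerated_weeks(5, 100)
```

With these round numbers the result is about 2.6 weeks, consistent with the abstract's statement.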

Remote Computed Tomography Reconstruction Service on GPU-Equipped Computer Clusters Running Microsoft HPC Server 2008
Timur Gureyev, Yakov Nesterets, Darren Thompson, Alex Khassapov, Andrew Stevenson, Sheridan Mayo, and John Taylor, Commonwealth Scientific and Industrial Research Organisation (CSIRO); Dimitri Ternovski, Trident Software Pty. Ltd.

We describe a complete integrated thick client-type system for remote computed tomography (CT) reconstruction, simulation and visualization services utilising computer clusters optionally equipped with multiple graphics processing units (GPUs). All computers in our system, including the user PCs, web servers, file servers, and the compute cluster nodes, are running flavours of the Windows OS, which greatly simplifies the development, installation, administration, and replication of the system. Our design is also aimed at streamlining and simplifying user interaction with the system, which differentiates it from most software available on today’s compute clusters, which typically requires the user to have some familiarity with parallel computing environments. We briefly describe the high-level architectural design of the system as well as the two-level parallelization of the most computationally intensive modules, utilising both the multiple CPU cores and the multiple GPUs available on the cluster. Finally, we present some results on the current system’s performance.

e-LICO: Delivering Data Mining to the Life Science Community
Simon Jupp, James Eales, Rishi Ramgolam, Alan Williams, Robert Stevens, and Carole Goble, University of Manchester; Simon Fischer, Rapid-I GmbH; Jorg-Uwe Kietz, University of Zurich

Life science research is generating a vast amount of data at many granularities, from information about molecular interactions to planetary meteorological information. One of the challenges in bioinformatics is how best to provide biologists with the necessary tools and infrastructure to process, analyse and explore these data.

e-LICO is a project that is seeking to develop a collaborative environment using Taverna and myExperiment for scientists to build and share scientific workflows, with a specific focus on support for text and data mining. Data mining is a complicated process, resulting in workflows consisting of several steps for each of: data-gathering, integration, preparation, modeling, evaluation and deployment. e-LICO utilizes existing e-science infrastructure (myExperiment, Taverna) along with integrated AI-planning techniques to build data-mining workflows (via case-based planning and hierarchical task-decomposition planning).

SQL is Dead; Long Live SQL: Lightweight Query Services for Ad Hoc Research Data
Bill Howe and Garret Cole, University of Washington

We find that relational databases remain underused in science, despite a natural correspondence between exploratory hypothesis testing and ad hoc “query answering.” The upfront costs to deploy a relational database prevent widespread use by small labs or individuals, while the development time for custom workflows or scripts is too high for interactive Q&A. We are exploring a new way to inject SQL into the scientific method, motivated by these observations:

  • We reject the conventional wisdom that “scientists won’t write SQL.” Rather, we implicate the process of data modeling, schema design, cleaning, and ingest in preventing the uptake of the technology by scientists.
  • We observe that cloud platforms, specifically the Windows Azure platform and Amazon’s EC2 service, drastically reduce the effort required to erect a production-quality database server.
  • We observe that simply sharing examples of SQL queries allows the scientists to self-train, bootstrapping the technological independence needed to allow our work to serve many labs simultaneously.

Guided by these premises, we have built a simple prototype that allows users to upload their data and immediately query it—no schema design, no reformatting, no DBAs, no obstacles. We provide a “starter kit” of SQL queries, translated from English questions provided by the researchers themselves, that demonstrate the basic idioms for retrieving and manipulating data. These queries are saved within the application, and can be copied, modified, and saved anew by the researchers. Beyond these core requirements, we seek novel features to facilitate authoring, sharing, and reuse of SQL statements, as well as analysis and visualization of results. A cloud-based deployment on Windows Azure allows us to establish a global, interdisciplinary corpus of example queries, which we mine to help users find relevant example queries, organize and integrate data, and construct new queries from scratch.
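The "upload and immediately query" idea can be illustrated with a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in for the authors' Azure-hosted service; the table name, column names, sample rows, and the starter query are all made up for illustration and are not taken from the prototype.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, text):
    """Create `table` from CSV text, taking column names from the header row.

    No schema design: every column is created untyped and values are stored as text.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    cols = ", ".join(f'"{name}"' for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', data)


# Hypothetical upload: a small spreadsheet exported as CSV.
SAMPLE = """station,depth_m,temp_c
A,5,12.1
A,10,11.4
B,5,13.0
"""

# Hypothetical "starter kit" entry for the English question
# "What is the mean temperature at each station?"
STARTER_QUERY = """
SELECT station, AVG(CAST(temp_c AS REAL)) AS mean_temp
FROM samples
GROUP BY station
ORDER BY station
"""

conn = sqlite3.connect(":memory:")
load_csv(conn, "samples", SAMPLE)
result = conn.execute(STARTER_QUERY).fetchall()
print(result)
```

A researcher could copy `STARTER_QUERY`, change the column or the aggregate, and save the variant, which is the self-training loop the abstract describes.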

SinBiota 2.0 – Planning a New Generation Environmental Information System
João Meidanis, Pedro Feijão, Cleber Mira, and Carlos Joly, University of Campinas

In March of 1999, the State of Sao Paulo Research Foundation (FAPESP) launched a research program on characterization, conservation, restoration, and sustainable use of the biodiversity of the state, known as the “BIOTA-FAPESP” Program. Over the years, this program accumulated about 100 thousand records on observations and gathering of biological material. A new journal was founded, and the program even had impact on state laws regarding land use. Along with the program, an information system, called SinBiota, was developed to hold the data generated by its participants.

After ten years, the system is in need of a major reorganization. In this paper we cover the steps that are being undertaken to achieve this goal, including consultations with IT specialists, listening to the user community, and establishing a multi-phase plan. We also present the current state of affairs, which involves research in areas such as multimedia search, cloud computing, and database scalability, as well as the implementation of a prototype of the new system, in a project jointly funded by FAPESP and Microsoft Research.

Enhancing the Quality and Trust of Citizen Science Data
Jane Hunter and Abdulmonem Alabri, The University of Queensland; Catharine van Ingen, Microsoft Research

The Internet, Web 2.0, and Social Networking technologies are enabling citizens to actively participate in “citizen science” projects by contributing data to scientific programs via the web. However, the limited training, knowledge, and expertise of contributors can lead to poor quality, misleading or even malicious data being submitted. Consequently, the scientific community often perceives citizen science data as low quality and not worthy of being used in serious scientific research. In this paper, we describe a technological framework that combines data quality improvements and trust metrics to enhance the reliability of citizen science data. We describe how trust models can provide a simple and effective mechanism for measuring the trustworthiness of community-generated data. We also describe filtering services that remove unreliable or untrusted data, and enable scientists to confidently re-use citizen science data. The resulting software services are evaluated in the context of the Coral Watch project, a citizen science project that uses volunteers to collect comprehensive data on coral reef health.
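As a rough illustration of how a trust metric and a filtering service could fit together, the sketch below scores each contributor by the smoothed fraction of past submissions that were validated and drops observations from contributors under a threshold. The scoring formula, the threshold, and all names are assumptions made for this example; the abstract does not specify the authors' trust model.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    contributor: str
    value: float


def trust_score(validated: int, submitted: int) -> float:
    """Smoothed fraction of a contributor's past submissions that were validated.

    Adding 1 to the numerator and 2 to the denominator (Laplace smoothing)
    gives a contributor with no history a neutral score of 0.5.
    """
    return (validated + 1) / (submitted + 2)


def filter_trusted(observations, history, threshold=0.6):
    """Keep observations whose contributor's trust score meets the threshold.

    `history` maps contributor -> (validated_count, submitted_count).
    """
    kept = []
    for obs in observations:
        validated, submitted = history.get(obs.contributor, (0, 0))
        if trust_score(validated, submitted) >= threshold:
            kept.append(obs)
    return kept


# Hypothetical contributors and readings.
history = {"alice": (18, 20), "bob": (2, 10)}
observations = [
    Observation("alice", 3.0),
    Observation("bob", 9.5),
    Observation("newcomer", 4.0),
]
trusted = filter_trusted(observations, history)
print([obs.contributor for obs in trusted])
```

With this threshold a newcomer's data is held back until some submissions have been validated; a lower threshold would trade reliability for coverage.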

Scientist-Computer Interfaces for Data-Intensive Science
Cecilia Aragon, Lawrence Berkeley National Laboratory

Many of today's important scientific breakthroughs are made by large, interdisciplinary collaborations of scientists working in geographically distributed locations, producing and collecting vast and complex datasets. Experimental astrophysics, in particular, has recently become a data-intensive science after many decades of relative data poverty. These large-scale science projects require software tools that support not only insight into complex data but also collaborative scientific discovery. Such projects do not easily lend themselves to fully automated solutions, requiring hybrid human-automation systems that facilitate scientist input at key points throughout the data analysis and scientific discovery process. This paper presents some of the issues to consider when developing such software tools, and describes Sunfall, a collaborative visual analytics system developed for the Nearby Supernova Factory, an international astrophysics experiment and the largest data volume supernova search currently in operation. Sunfall utilizes novel interactive visualization and analysis techniques to facilitate deeper scientific insight into complex, noisy, high-dimensional, high-volume, time-critical data. The system combines novel image processing algorithms, statistical analysis, and machine learning with highly interactive visual interfaces to enable collaborative, user-driven scientific exploration of supernova image and spectral data. Sunfall is currently in operation at the Nearby Supernova Factory; it is the first visual analytics system in production use at a major astrophysics project.

Enabling Scientific Discovery with Microsoft SharePoint
Kenji Takeda, Richard Boardman, Steven Johnston, Mark Scott, Leslie Carr, Simon Coles, Simon Cox, Graeme Earl, Jeremy Frey, Philippa Reed, Ian Sinclair, and Tim Austin, University of Southampton

Scientists, researchers and engineers facing increasing amounts of data must create, execute and navigate complex workflows, collaborate within and outside their organisations, and share their work with others. In this paper, we demonstrate how the Microsoft SharePoint platform provides an integrated feature set that can be leveraged in order to significantly improve the productivity of scientists and engineers. We investigate how SharePoint 2010 can be used, and extended, to manage data and workflow in a seamless way, and enable users to share their data with full access control. We describe, in detail, how we have used SharePoint 2010 as the IT infrastructure for a large, multi-user facility, the µ-Vis CT scanning centre. We also demonstrate how we are creating a user-centric data management system for archaeologists, and demonstrate how SharePoint 2010 can be integrated into the everyday lives of scientists and engineers for managing and publishing their data through our Materials Data Centre, which provides an easy-to-use data management system from lab bench to journal publication via EPrints.

Genome-Wide Association of ALS in Finland
Bryan Traynor, National Institute on Aging, National Institutes of Health

We performed a genome-wide association study of amyotrophic lateral sclerosis (ALS) in Finland to determine the genetic variants underlying disease in this population. Finland is an ideal location for performing genetics studies of ALS, because it has one of the highest incidences of the disease in the world, and because the population is known to be remarkably genetically homogeneous. We genotyped a cohort of 442 Finnish ALS patients and 521 Finnish control subjects using HumanHap370 arrays, which assay more than 300,000 SNPs across the human genome. This DNA was collected by our colleague Dr. Hannu Laaksovirta, who reviews nearly all patients diagnosed with this fatal neurodegenerative disease in the country. We were pleased to find two highly significant association peaks in our GWAS: one located on chromosome 21 near the SOD1 gene, which is known to have a particularly high prevalence in the Finnish population, and the other located on chromosome 9p21. Together, these two loci account for nearly the entire increased incidence of ALS in Finland.

A Framework for Large-Scale Modelling of Population Health
John Ainsworth, Iain Buchan, Nathan Green, Matthew Sperrin, Richard Williams, Philip Couch, Emma Carruthers, and Eleanora Fichera, University of Manchester; Martin O'Flaherty and Simon Capewell, University of Liverpool

Statistics and Informatics methods for synthesising disparate sources of public health evidence are under-developed. This is in part due to the amount of human resource required to synthesise complex evidence, and in part due to a research environment that rewards the study of the independent effects of specific factors on health more than discovering the complexity of health. In particular, it remains difficult to compare the potential impacts of community-based prevention strategies such as smoking cessation, vs. clinical treatments such as lipid lowering drugs. Thus there is a lack of usefully complex models that might underpin the full appraisal of health policy options by the policy-makers. We present a system that enables health care professionals to collaborate on the design of complex models of population health which can then be used to evaluate and compare the impact of interventions.

GREAT.stanford.edu: Generating Functional Hypotheses from Genome-Wide Measurements of Mammalian Cis-Regulation
Gill Bejerano and Cory Y. McLean, Stanford University

Recent technological advances in DNA sequencing provide an unprecedented view of the regulatory genome in action. We can now sequence all binding events of transcription factors and transcription-associated factors, examine the dynamics of different chromatin marks, assay for nucleosome positioning and open chromatin, and more. However, attempts to interpret these data using computational tools developed for microarray analysis often fall short, leaving researchers to manually scrutinize only handfuls of their copious data.

We developed the Genomic Regions Enrichment of Annotations Tool (GREAT) to provide the first computational tool that correctly analyzes whole genome cis-regulatory data. Whereas microarray-based methods are forced to consider only binding proximal to genes, GREAT is able to properly incorporate distal binding sites, which greatly enhances resulting interpretations. Applying GREAT to ChIP-seq data sets of multiple transcription-associated factors in different contexts, we recover many functions of these factors that are missed by existing gene-based tools, and we generate novel hypotheses that can be experimentally tested. GREAT can be similarly applied to any dataset of localized genomic markers enriched for known or putative cis-regulatory function.
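The difference between a proximal-only view and one that keeps distal binding sites can be made concrete with a small sketch. It assigns each binding site to the gene with the nearest transcription start site (TSS) and applies a distance cutoff. This is a deliberately simplified stand-in and not GREAT's actual association rule or statistics; the coordinates, gene names, and both cutoffs are hypothetical.

```python
from bisect import bisect_left


def nearest_gene(position, tss_sorted):
    """Return (gene, distance) for the TSS closest to `position`.

    `tss_sorted` is a non-empty list of (tss_position, gene) sorted by position.
    """
    i = bisect_left(tss_sorted, (position,))
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = tss_sorted[max(i - 1, 0): i + 1]
    tss, gene = min(candidates, key=lambda c: abs(c[0] - position))
    return gene, abs(tss - position)


def associate(sites, tss_sorted, max_distance):
    """Map each binding site to its nearest gene if it lies within `max_distance` bp."""
    hits = {}
    for site in sites:
        gene, dist = nearest_gene(site, tss_sorted)
        if dist <= max_distance:
            hits[site] = gene
    return hits


# Hypothetical coordinates on a single chromosome.
TSS = [(10_000, "geneA"), (500_000, "geneB")]
SITES = [10_500, 200_000, 498_000]

proximal = associate(SITES, TSS, max_distance=5_000)      # proximal-only view
distal = associate(SITES, TSS, max_distance=1_000_000)    # distal sites retained
print(proximal)
print(distal)
```

Under the proximal cutoff the site at 200,000 is silently discarded; under the wider cutoff it contributes to the annotation of its nearest gene, which is the kind of information the abstract says gene-proximal methods miss.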

GREAT incorporates biological annotations from 20 ontologies and has been made available to the scientific community as an intuitive web tool. Direct submission is also available from the UCSC Genome Browser via the Table Browser.

Medici: A Scalable Multimedia Environment for Research
Joe Futrelle, Luigi Marini, Rob Kooper, Joel Plutchak, Alan Craig, Terry McLaren, and Jim Myers, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

Large-scale community collections of images, videos, and other media are a critical resource in many areas of research and education including the physical sciences, biology, medicine, humanities, arts, and social sciences. Researchers face coupled problems in managing large amounts of data, analysis and visualization over such collections, and managing descriptive metadata and provenance information. NCSA is involved in a wide range of projects targeting collections that involve terabytes to petabytes of data, complex image processing pipelines, and rich provenance linking. Based on this experience, we have developed Medici—a general multimedia management environment based on Web 2.0 interfaces, semantic content management, and service/cloud-based workflow capabilities that can support a broad range of high-throughput research techniques and community data management. Medici provides scalable storage and media processing capabilities, simple desktop and web 2.0 user interfaces, social annotations, preprocessing and preview capabilities, dynamically extensible metadata, provenance support, and citable persistent data references. This talk will provide an overview of Medici’s capabilities and use cases in the humanities and microscopy as well as describe core research and development challenges in creating usable systems incorporating rich semantic context derived from distributed automated and manual sources.

BlogMyData: A Virtual Research Environment for Collaborative Visualization of Environmental Data
Andrew Milsted, Jeremy Frey, Jon Blower, and Adit Santokhee, University of Southampton

Understanding and predicting the Earth system requires the collaborative effort of scientists from many different disciplines and institutions. The National Centre for Earth Observation (NCEO) and the National Centre for Atmospheric Science Climate Group (NCAS-Climate) are both high-profile interdisciplinary research centres involving numerous universities and institutes around the UK and many international collaborators. Both groups make use of the latest numerical models of the climate and earth system, validated by observations, to simulate the environment and its response to forcings such as an increase in greenhouse gas emissions. Their scientists must work together closely to understand the various aspects of these models and assess their strengths and weaknesses.

At the present time, collaborations take place chiefly through face-to-face meetings, the scholarly literature and informal electronic exchanges of emails and documents. All of these methods suffer from serious deficiencies that hamper effective collaboration. For practical reasons, face-to-face meetings can be held only infrequently. The scholarly literature does not yet adequately link scientific results to the source data and thought processes that yielded them, and additionally suffers from a very slow turnaround time. Informal exchanges of electronic information commonly lose vital context; for example, scientists typically exchange static visualizations of data (as GIFs or PostScript plots for example), but the recipient cannot easily access the data behind the visualization, or customize the visualization in any way. Emails are rarely published or preserved adequately for future use. The recent adoption of “off the shelf” Wikis and basic blogs has addressed some of these issues, but does not usually address specific scientific needs or enable the interactive visualization of data.

RightField: Rich Annotation of Experimental Biology Through Stealth Using Spreadsheets
Matthew Horridge, Katy Wolstencroft, Stuart Owen, and Carole Goble, University of Manchester; Wolfgang Mueller and Olga Krebs, HITS gGmbH

RightField is an open source application that provides a mechanism for embedding ontology annotation support for scientific data in Excel spreadsheets. It was developed during the SysMO-DB project to support a community of scientists who typically store and analyse their data using spreadsheets. It helps keep annotation consistent and compliant with community standards whilst making the annotation process quicker and more efficient.

RightField is an open-source, cross-platform Java application that is available for download.

musicSpace: Improving Access to Musicological Data
mc schraefel, David Bretherton, Daniel Smith, and Joe Lambert, University of Southampton

Efforts over the past decade to digitize scholarly musicological materials have revolutionized the research process; however, online research in musicology is now held back by the segregation of data into a plethora of discrete and disparate databases, and the use of legacy or ad hoc metadata specifications that are unsuited to modern demands. Many real-world musicological research questions are rendered effectively intractable because there is insufficient metadata or metadata granularity, and a lack of data source integration. The "musicSpace" project has taken a dual approach to solving this problem: designing back-end services to integrate (and where necessary surface) available (meta)data for exploratory search from musicology's key online data providers; and providing a front-end interface, based on the "mSpace" faceted browser, to support rich exploratory search interaction.

We unify our partners' data using a multi-level metadata hierarchy and a common ontology. By using RDF for this, we make use of the many benefits of Semantic Web technologies, such as the facility to create multiple files of RDF at different times and using different tools, assert them into a single graph of a knowledge base, and query all of the asserted files as a whole. In many cases we were able to directly map a record field from a partner's dataset to our combined type hierarchy, but in other cases some light syntactic and/or semantic analysis needed to be performed. This small amount of work in the pre-processing stage adds granularity that significantly enriches the data, allowing for more refined filtering and browsing of records via the search UI. Significantly, although all the data we extract is present in the original records, much of it is neither exposed to nor exploitable by the end-user via our data providers' existing UIs. In musicSpace, however, all data surfaced can be used by the musicologist for the purposes of querying the dataset, and can thus aid the process of knowledge discovery and creation.
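The benefit of asserting separately produced RDF files into one graph can be sketched without any RDF library by modelling each file as a set of (subject, predicate, object) triples. The identifiers, predicate names, and the two catalogues below are hypothetical and do not reflect the musicSpace ontology; a real deployment would use an RDF store and SPARQL.

```python
def merge_graphs(*graphs):
    """Union several sets of (subject, predicate, object) triples into one graph."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def works_by_birth_year(graph, year):
    """Find works whose composer was born in `year` (a join across triples)."""
    people = {s for s, _, _ in match(graph, p="born", o=year)}
    return sorted(s for s, _, o in match(graph, p="composer") if o in people)


# Two hypothetical partner datasets, produced at different times by different tools.
catalogue_a = {
    ("work:1", "composer", "person:bach"),
    ("work:1", "title", "Mass in B minor"),
}
catalogue_b = {
    ("person:bach", "born", "1685"),
    ("work:2", "composer", "person:bach"),
}

combined = merge_graphs(catalogue_a, catalogue_b)
print(works_by_birth_year(catalogue_a, "1685"))
print(works_by_birth_year(combined, "1685"))
```

Neither catalogue alone finds both works, because the birth year and one of the works live in different sources; the query only becomes tractable over the merged graph.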

Our work offers an effective generalizable framework for data integration and exploration that is well suited for arts and humanities data. Our benchmarks have been (1) to make tractable previously intractable queries, and thereby (2) to accelerate knowledge discovery.

Quantifying Historical Geographic Knowledge from Digital Maps
Tenzing Shaw, Peter Bajcsy, Michael Simeone, and Robert Markley, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

An important question facing historians is how knowledge of different geographic regions varied between nations and over time. This type of question is often answered by examining historical maps created in different regions and at different times, and by evaluating the accuracy of these maps relative to modern geographic knowledge. Our research focuses on quantifying and automating the process of analyzing digitized historical maps in an effort to improve the precision and efficiency of this analysis.

In this paper, we describe an algorithmic workflow designed for this purpose. We discuss the application of this workflow to the problem of automatically segmenting Lake Ontario from French and British historical maps of the Great Lakes region created between the 16th and 19th centuries, and computing the surface area of the lake according to each map. Comparing these areas with the modern figure of 7,540 square miles provides a way of measuring the accuracy of French versus British knowledge of the geography of the Lake Ontario region at different points in time. Specifically, we present the results following the application of our algorithms to 40 historical maps. The procedure we describe can be extended to geographic objects other than Lake Ontario and to accuracy measures other than surface area.
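The accuracy measure described above reduces to a relative error against the modern figure. The sketch below assumes that segmentation yields a pixel count for the lake and that each map's scale is known; the pixel count, the scale, and the function names are hypothetical, and only the 7,540 square mile reference comes from the abstract.

```python
MODERN_AREA_SQ_MI = 7540  # modern figure quoted in the abstract


def area_from_pixels(pixel_count, sq_mi_per_pixel):
    """Convert a segmented lake's pixel count into an area using the map's scale."""
    return pixel_count * sq_mi_per_pixel


def relative_error(map_area_sq_mi, reference=MODERN_AREA_SQ_MI):
    """Signed relative error of a map's lake area against the reference area.

    Negative values mean the map under-estimates the lake.
    """
    return (map_area_sq_mi - reference) / reference


# Hypothetical segmentation result: 3,000 lake pixels at 2 square miles per pixel.
area = area_from_pixels(3000, 2.0)
print(f"area = {area:.0f} sq mi, relative error = {relative_error(area):+.1%}")
```

Comparing the distribution of these errors for French and British maps by decade would give the kind of time series the abstract describes.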

Data Intensive Research in Computational Musicology
David De Roure, Oxford e-Research Centre; J. Stephen Downie, University of Illinois at Urbana-Champaign; Ichiro Fujinaga, McGill University

The SALAMI (Structural Analysis of Large Amounts of Music Information) project applies computational approaches to the huge and growing volume of digital recorded music that is now available in large-scale resources such as the Internet Archive. It is set to produce a new and very substantive web-accessible corpus of music analyses in a common framework for use by music scholars, students and beyond, and to establish a methodology and tooling which will enable others to add to the resource in the future. The SALAMI infrastructure brings together workflow and Semantic Web technologies with a set of algorithms and tools for extracting features from recorded music which have been developed by the music information retrieval and computational musicology communities over the last decade, and the project uses “controlled crowd sourcing” to provide ground truth annotations of musical works.

Scaling Information on ‘Biosphere Breathing’ from Chloroplast to the Globe
Dennis Baldocchi, Youngryel Ryu, and Hideki Kobayashi, University of California-Berkeley; Catharine van Ingen, Microsoft Research

We describe the challenges of upscaling information on the ‘breathing of the biosphere’ from the scale of the chloroplasts of leaves to the globe. This task, the upscaling of carbon dioxide and water vapor fluxes, is especially challenging because the problem transcends fourteen orders of magnitude in time and space and involves a panoply of non-linear biophysical processes. This talk outlines the problem and describes the set of methods used. Our approach aims to produce information on the ‘breathing of the biosphere’ that is ‘everywhere, all of the time’.

The computational demands of this problem are daunting. At the stand scale, one must simulate the micro-habitat conditions of thousands of leaves, as they are displayed on groups of plants with a variety of angle orientations. Then one must apply the micro-habitat information (e.g., sunlight, temperature, humidity, CO2 concentration) to sets of coupled non-linear equations that simulate photosynthesis, respiration and the energy balance of the leaves. Finally, one must add up this information.

At the regional to global scales, there is a need to acquire and merge multiple layers of remote sensing datasets at high resolution (1 km) and frequent intervals (daily) to provide the drivers of models that predict carbon dioxide and water vapor exchange. The global data products of ecosystem photosynthesis and transpiration produced with this system have high fidelity when validated with direct flux measurements, and produce complex spatial and temporal patterns that will prove to be valuable for environmental modelers and scientists studying climate change and carbon and water cycles from local to global scales.

Agrodatamine: Integrating Analysis of Climate Time Series and Remote Sensing Images
Humberto Razente and Maria Camila N. Barioni, UFABC; Daniel Y. T. Chino, Elaine P. M. Sousa, Robson Cordeiro, Santiago A. Nunes, Caetano Traina Jr., José F. Rodrigues Jr., Willian D. Oliveira, and Agma J. M. Traina, University of São Paulo; Luciana A. S. Romani, University of São Paulo & EMBRAPA Informatics; Marcela X. Ribeiro, Federal University of São Carlos; Renata R. V. Gonçalves, Ana H. Ávila, and Jurandir Zullo, CEPAGRI-UNICAMP

Although the scientific community has no doubts about global warming, quantifying the average increase in global temperature and identifying its causes and its consequences for ecosystems remain urgent and of utmost importance. Mathematical and statistical models have been used to predict likely future scenarios and, as an outcome, a large amount of data has been generated. Technological progress has also led to improved sensors for several climate data measurements and for imaging the Earth's surface, contributing even more to the increasing volume and complexity of the data generated. In this context, we present new methods to filter, analyze and extract association patterns between climate data and those extracted from remote sensing, which aim at aiding agricultural research.

Correction for Hidden Confounders in Genetic Analyses
Jennifer Listgarten, Carl Kadie, and David Heckerman, Microsoft Research; Eric E. Schadt, Pacific Biosciences

Understanding the genetic underpinnings of disease is important for screening, treatment, drug development, and basic biological insight. One way of getting at such an understanding is to find out which parts of our DNA, such as single-nucleotide polymorphisms, affect particular intermediary processes such as gene expression (eQTL), or endpoints such as disease status (GWAS). Naively, such associations can be identified using a simple statistical test on each hypothesized association. However, a wide variety of confounders lie hidden in the data, leading to both spurious associations and missed associations if not properly addressed. Our work focuses on novel statistical models that correct for these confounders. In particular, we present a novel statistical model that jointly corrects for two particular kinds of hidden structure—population structure (e.g., race, family-relatedness), and microarray expression artifacts (e.g., batch effects)—when these confounders are unknown. We also are working on models that robustly correct for confounders but which are cheap enough to be applied to extremely large data sets.

BioPatML.NET and Its Pattern Editor: Moving into the Next Era of Biology Software
James Hogan, Yu Toh, Lawrence Buckingham, Michael Towsey, and Stefan Maetschke, Queensland University of Technology

Existing XML-based bioinformatics pattern description languages are best seen as subsets or minor extensions of regular expression based models. In general, regular expressions are sufficient to solve many pattern searching problems. However, their expressive power is insufficient to model complex structured patterns such as promoters, overlapping motifs or RNA stem–loops. In addition, these languages often provide only minimal support for techniques common in bioinformatics such as mismatch thresholds, weighted gaps, direct and inverted repeats, general similarity scoring and position weight matrices. In this paper we introduce BioPatML.NET, a comprehensive search library which supports a wide variety of pattern components, ranging from simple motif, regular expression or PROSITE patterns to their aggregation into more complex hierarchical structures. BioPatML.NET unifies the diversity of pattern description languages and fills a gap in the set of XML-based description languages for biological systems. As modern computational biology increasingly demands the sharing of sophisticated biological data and annotations, BioPatML.NET simplifies data sharing through the adoption of a standard XML-based format to represent pattern definitions and annotations. This approach not only facilitates data exchange, but also allows compiled patterns to be logically mapped easily onto database tables. The library is implemented in C# and builds upon the Microsoft Biology Foundation data model and file parsers. This paper also introduces an intuitive and interactive editor for the format, implemented in Silverlight 4 and allowing drag and drop creation and maintenance of biological patterns, and their preservation and re-use through an associated repository. (Refer to Appendix Fig 1.0 for a snapshot of the BioPatML Pattern Editor Tool).

Availability: A demonstration video and the tool are available at the following links (requires Silverlight 4 plug-in).
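To see why structured patterns such as RNA stem–loops fall outside plain regular expressions, consider that the second arm of the stem must be the reverse complement of the first, whatever the first happens to be. The sketch below searches for exact stem–loops directly. It is written in Python for brevity and is not the BioPatML.NET API (which is a C# library); the function names, the loop-length bounds, and the test sequence are assumptions, and real tools would also allow mismatches and G–U pairs.

```python
# Watson-Crick complement table for RNA.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the reverse complement of an RNA string over the alphabet ACGU."""
    return rna.translate(COMPLEMENT)[::-1]


def find_stem_loops(rna, stem_len, min_loop=3, max_loop=8):
    """Return (start, loop_len) pairs for every exact stem-loop in `rna`.

    A hit is a stem of `stem_len` bases starting at `start`, followed by a loop
    of `loop_len` bases, followed by the stem's exact reverse complement.
    """
    hits = []
    for start in range(len(rna) - 2 * stem_len - min_loop + 1):
        left = rna[start:start + stem_len]
        target = reverse_complement(left)
        for loop_len in range(min_loop, max_loop + 1):
            right_start = start + stem_len + loop_len
            right = rna[right_start:right_start + stem_len]
            if len(right) < stem_len:
                break  # ran off the end of the sequence
            if right == target:
                hits.append((start, loop_len))
    return hits


# Constructed example: stem GGCAC, loop UUUU, closing arm GUGCC, flanked by AA.
sequence = "AAGGCACUUUUGUGCCAA"
print(find_stem_loops(sequence, stem_len=5))
```

The dependency between the two arms is what a pattern language has to express explicitly, for example as an inverted-repeat component, rather than as a fixed character class sequence.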

GRAS Support Network, Its Implementation, Operation, and Use
Fritz Wollenweber, Francois Montagner, Christian Marquardt, and Yago Andres, EUMETSAT; Maria Lorenzo and Rene Zandbergen, ESOC

This paper will present the GRAS support network (GSN) that was put into place to support the processing of data from the GRAS radio occultation instrument on board the Metop spacecraft. GRAS uses GPS satellite signals received by the instrument to perform retrievals of vertical profiles of refractivity, from which temperature profiles can be computed. The presentation will describe in detail the GRAS processing, the requirements that have to be fulfilled by the GSN, and the design and implementation of the GSN. Examples will be given from the operational use of this system in the past 3 years. Particular emphasis will be given to the details of the global GSN, its communication links and the GSN processing center. We will also address future evolutions of this network to cover changing and more demanding user requirements.

Data Intensive Frameworks for Astronomy
Jeffrey Gardner, Andrew Connolly, Keith Wiley, YongChul Kwon, Simon Krughoff, Magdalena Balazinska, Bill Howe, and Sarah Loebman, University of Washington

Astrophysics is addressing many fundamental questions about the nature of the universe (e.g., the properties of dark matter and the nature of dark energy) through a series of ambitious wide-field optical and infrared imaging surveys as well as complementary petaflop-scale cosmological simulations. Our research focuses on exploring emerging data-intensive frameworks like Hadoop and Dryad for astrophysical datasets. For observational astronomers, we are delivering new scalable algorithms for indexing and analyzing astronomical images. For computationalists, we are implementing cluster-finding algorithms for identifying interesting objects in simulation particle datasets.

Experiences and Visions on Archaeo Informatics
Christiaan Hendrikus van der Meijden, Peer Kröger, and Hans-Peter Kriegel, Ludwig Maximilians University

The main problems in successfully establishing the new scientific branch of archaeo informatics concern standardization, understanding of advanced informatics (i.e., data mining) within the archaeo sciences, and setting up data communication infrastructures. Our experiences are based on the development of OSSOBOOK, an intermittently synchronized database system that allows any authorized user to record data offline at the site and later synchronize this new data with a central data collection. Powerful data mining and similarity search tools have been integrated. The future development steps are establishing a standardized minimal electronic finding description and implementing an enhanced database connection interface for data mining communication techniques to set up an archaeo data network. Another focus is set on modularization, visualization, and simplification of data mining tools.

Panel: Challenges of Data Standards and Tools
Deb Agarwal, LBNL/UCB; Bill Howe, University of Washington; Alex James, Microsoft; Yong Liu, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign; Maryann Martone, UCSD; Yan Xu, Microsoft Research

Environmental research involves multiple disciplines and players from academia, industry, and government agencies worldwide. By nature, environmental researchers are challenged with massive and heterogeneous data provided by various sources. If one “grand standard” will not work for dealing with all the required environmental data sources, how do we work together to define and adopt different data standards? What tools are essential for making the standards successful?

Scientific Data Sharing and Archiving at UC3/CDL: the Excel Add-in Project and More
John Kunze and Tricia Cruse, California Digital Library/California Curation Center

The University of California Curation Center (UC3), part of the California Digital Library (CDL), will be working with University of California researchers, the NSF DataONE community, and Microsoft (MS) Research to create open-source MS Excel extensions (“add-ins”) that will make it easier for scientists to record and export spreadsheet data in re-usable ways, fostering integration, new uses, and hence new science. We expect that creating such add-ins for as widely deployed a tool as Excel will help to transform the conduct of scientific research by enabling and promoting data publishing, sharing, and archiving. The Excel add-in project is the primary topic of this talk.

The talk will also address the larger context of this effort, which is one of four “fronts” on which UC3/CDL is working to establish data publishing, sharing, and archiving as common scientific practice. While this is a complex and ambitious undertaking, we hope that by chipping away at these tractable areas, we will reduce the size of the overall challenge. The most direct of these fronts is participation as an NSF DataONE member node, contributing University of California research data to NSF DataNet. We are also a founding member of the global DataCite consortium, which is working to create standards, tools, and incentives for data producers to publish citable datasets. Finally, with support from the Moore Foundation, we are writing up a comparative analysis of current practices across domains for publishing and preserving the methods, techniques, and credits involved in preparing data used to draw conclusions in the published literature, information that is otherwise lost for want of standard practices for capturing this “appendix” material. We will conclude with a description of the newly released EZID (easy-eye-dee) service for creating and resolving persistent identifiers for data.

Visualizing All of History with Chronozoom
David Shimabukuro, Roland Saekow, and Walter Alvarez, University of California-Berkeley

Our knowledge of human history comprises a truly vast data set, much of it in the form of chronological narratives written by humanist scholars and difficult to deal with in quantitative ways. The last 20 years have seen the emergence of a new discipline called Big History, invented by the Australian historian David Christian, which aims to unify all knowledge of the past into a single field of study. Big History invites humanistic scholars and historical scientists from fields like geology, paleontology, evolutionary biology, astronomy, and cosmology to work together in developing the broadest possible view of the past. Incorporating everything we know about the past into Big History greatly increases the amount of data to be dealt with.

Big History is proving to be an excellent framework for designing undergraduate synthesis courses that attract outstanding students. A serious problem in teaching such courses is conveying the vast stretches of time from the Big Bang, 13.7 billion years ago, to the present, and clarifying the wildly different time scales of cosmic history, Earth and life history, human prehistory, and human history. We present “ChronoZoom,” a computer-graphical approach to visualizing and understanding these time scales and to presenting vast quantities of historical information in a useful way. ChronoZoom is a collaborative effort of the Department of Earth and Planetary Science at UC Berkeley, Microsoft Research, and originally Microsoft Live Labs.

Our first conception of ChronoZoom was that it should dramatically convey the scales of history, and the first version does in fact do that. To display the scales of history from a single day to the age of the Universe requires the ability to zoom smoothly by a factor of ~10^13, and doing this with raster graphics was a remarkable achievement of the team at Live Labs. The immense zoom range also allows us to embed virtually limitless amounts of text and graphical information.
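
The order of magnitude of that zoom factor can be checked with a short calculation. The sketch below is illustrative only: it takes the 13.7 billion year age quoted earlier in this abstract, assumes a 365.25-day year, and uses a hypothetical helper named `zoom_factor`. The ratio comes out at about 5 × 10^12, which is on the order of ten to the thirteenth power.

```python
import math

AGE_OF_UNIVERSE_YEARS = 13.7e9  # figure quoted in the abstract
DAYS_PER_YEAR = 365.25          # assumption: Julian year


def zoom_factor(longest_span_years, shortest_span_days=1.0):
    """Ratio between the longest and shortest time spans to be displayed."""
    return longest_span_years * DAYS_PER_YEAR / shortest_span_days


factor = zoom_factor(AGE_OF_UNIVERSE_YEARS)
order_of_magnitude = math.log10(factor)
print(f"zoom factor = {factor:.2e}, i.e. about 10^{order_of_magnitude:.1f}")
```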

We are now in the phase of designing the next iteration of ChronoZoom in collaboration with Microsoft Research. One goal will be to have ChronoZoom be useful to students beginning or deepening their study of history. We therefore show a very preliminary version of a ChronoZoom presentation of the human history of Italy designed for students, featuring (1) a hierarchical periodization of Italian history, (2) embedded graphics, and (3) an example of an embedded technical article. This kind of presentation should make it possible for students to browse history, rather than digging it out, bit by bit.

At a different academic level, ChronoZoom should allow scholars and scientists to bring together graphically a wide range of data sets from many different disciplines, to search for connections and causal relationships. As an example of this kind of approach, from geology and paleontology, we are inspired by TimeScale Creator.

ChronoZoom, by letting us move effortlessly through this enormous wilderness of time, getting used to the differences in scale, should help to break down the time-scale barriers to communication between scholars.

Proteome-Scale Protein Isoform Characterization with High Performance Computing
Jake Chen and Fan Zhang, Indiana University

The study of proteomes represents significant discovery and application opportunities in post-genome biology and medicine. In this work, we explore the use of high performance computing to characterize novel protein isoforms in tandem mass spectrometry (MS/MS) spectra derived from biological samples. We perform computational proteomics analysis of peptides by searching a new large peptide database that we custom built from all possible protein isoforms of a target proteome. As a result, the proteome-scale study of these protein isoforms involves significantly higher complexity, at both the computational level and the biological level, than standard approaches that use only normal MS/MS protein search databases.

To discover novel protein isoforms in proteomics data, we developed a high performance computing and data analysis platform to support the following tasks: 1) conversion of raw data to open formats, 2) support for searching spectra and peptide identification, 3) conversion of search engine results to a unified format, 4) statistical validation of peptide and protein identifications, and 5) protein isoform marker annotations. Through human fetal liver and breast cancer case studies, we show that the platform can markedly increase computational efficiency to support identification of novel protein isoforms. Our results show promise for future diagnostic biomarker applications. They also point to new potential for real-time analysis of proteomics data with more powerful cloud computing.

Answering Biological Questions by Querying k-Mer Databases
Paul Greenfield, CSIRO Mathematics, Informatics and Statistics

Short DNA sequences ('k-mers') are effectively unique within and across bacterial species. Databases of such k-mers, derived from diverse sets of organisms, can be used to answer interesting biological questions. SQL queries can quickly show how organisms are related and find functions for hypothetical genes. Metagenomic applications include quickly partitioning reads by family, and mapping reads onto possibly-related reference genomes. Planned work includes functional improvements (searching over amino acid codons, querying over gene functions) and scaling the applications to work well on clusters, and possibly clouds.
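
To make the idea of querying a k-mer database concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, the function names (`kmers`, `build_kmer_db`, `shared_kmer_counts`), the toy sequences, and the choice of k = 4 are all assumptions made for illustration; the abstract does not describe the actual schema, database engine, or k-mer length, and a real system would use much longer k-mers and would likely also handle reverse complements.

```python
import sqlite3


def kmers(sequence, k):
    """Return the set of distinct k-mers (length-k substrings) in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def build_kmer_db(genomes, k):
    """Load every organism's distinct k-mers into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE kmers (organism TEXT, kmer TEXT)")
    conn.execute("CREATE INDEX idx_kmer ON kmers (kmer)")
    for organism, sequence in genomes.items():
        rows = [(organism, km) for km in sorted(kmers(sequence, k))]
        conn.executemany("INSERT INTO kmers VALUES (?, ?)", rows)
    conn.commit()
    return conn


def shared_kmer_counts(conn):
    """Count k-mers shared by each pair of organisms via a self-join."""
    query = """
        SELECT a.organism, b.organism, COUNT(*) AS shared
        FROM kmers AS a
        JOIN kmers AS b
          ON a.kmer = b.kmer AND a.organism < b.organism
        GROUP BY a.organism, b.organism
        ORDER BY shared DESC, a.organism, b.organism
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    # Toy sequences, invented for this example.
    toy_genomes = {
        "org_a": "ACGTACGTGA",
        "org_b": "ACGTACGTCC",
        "org_c": "TTTTGGGGCC",
    }
    db = build_kmer_db(toy_genomes, k=4)
    for org_1, org_2, shared in shared_kmer_counts(db):
        print(org_1, org_2, shared)
```

Pairs of organisms with many shared k-mers are candidates for being closely related; pairs with none do not appear in the result at all, since the join only produces rows for matching k-mers.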

Contact Us

For more information, contact esci@microsoft.com.