Microsoft eScience Workshop 2012
October 8–9, 2012 | Chicago, Illinois, United States
Watch videos of presentations from this year's eScience Workshop. Additional videos will be added soon, so please check back.
|Keynote: Defensible Modeling of the Biosphere
To manage the planet on which we all depend, we need to predict the future outcome of various options. How would biofuel subsidies affect crop prices and, in turn, deforestation? How would CO2 emissions affect climate change and, in turn, fire? At present, we cannot make such predictions with any confidence. But, as I’ll show in this talk, a computational approach to environmental science can change that. I’ll explain how we built the first fully data-constrained model of the terrestrial carbon cycle, using Big Data, cloud computing, and machine learning. And I’ll demo similar models for global food production, Amazon deforestation, and bird biodiversity. The prototype tools on which these models have been built (for example, FetchClimate, Filzbach, and WorldWide Telescope) are freely available, and will hopefully allow other scientists to adopt a rigorous approach to modeling the complexities of the biosphere.
|Keynote: Biology: A Move to Dry Labs
Since its beginning, the wet lab has been the key driver in biological discovery. Recently, however, more and more science is getting done in dry labs, those where only computational analysis is done. The presentation will include examples, ranging from genomics to vaccine design.
|2012 Jim Gray Award / The Possibilities and Pitfalls of Internet-Based Chemical Data
Antony John Williams and Tony Hey
2012 Jim Gray eScience Award Presentation
|Panel: Open Data for Open Science—Data Interoperability
Ilya Zaslavsky, Karen Stocks, Philip Murphy, Robert Gurney, and Yan Xu
The goal of cross-domain interoperability is to enable reuse of data and models outside the original context in which these data and models are collected and used and to facilitate analysis and modeling of physical processes that are not confined to disciplinary or jurisdictional boundaries. A new research initiative of the U.S. National Science Foundation, called EarthCube, is developing a roadmap to address challenges of interoperability in the earth sciences and create a blueprint for community-guided cyberinfrastructure accessible to a broad range of geoscience researchers and students.
|Panel: Enabling Multi-Scale Science
Claudia Bauzer Medeiros, James Hunt, and Roberto Cesar
eScience research increasingly involves multi-scale problem solving that spans wide ranges of space and time scales. It requires collaboration among researchers and practitioners from multiple disciplines, each with their own orientations towards problem identification, solution formulation, and implementation.
|The Internet of Databases—Generalizing the Archaeo Informatics Approach
Chris van der Meijden
One thing we have learned from our Archaeo-Data-Network is that the meta information of databases needs to be split into two levels. The first level contains a centralized unique ID and a small amount of standard information. The second level of meta information is defined by the archaeo scientist. This can be implemented for any kind of archaeo database, so the network's extensibility is virtually unlimited. The advantage of this dual meta approach is its flexible connectivity, which makes comprehensive data transparently available for general searching and mining. With this approach, huge, rigid archives can be connected to small, flexible databases for scientific analysis in any scientific domain. Combined with simple authorization management for unpublished data, we see in our system the potential to be a general blueprint for an eScience infrastructure, which we call the Internet of databases.
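As a minimal sketch of the two-level metadata split described above, the following Python models a first level holding a centralized unique ID plus a few standard fields, and a second level of free-form fields defined by the scientist. All names here (`CoreMetadata`, `DatabaseRecord`, `register`, `search`, and the field names) are hypothetical and chosen for illustration; the talk does not specify an implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List
import uuid


@dataclass(frozen=True)
class CoreMetadata:
    """Level 1: centrally assigned ID plus a few standard fields (assumed)."""
    uid: str
    title: str
    owner: str


@dataclass
class DatabaseRecord:
    """One registered database: fixed core metadata plus free-form level 2."""
    core: CoreMetadata
    # Level 2: keys and values are defined by the individual scientist.
    custom: Dict[str, Any] = field(default_factory=dict)


def register(title: str, owner: str, **custom: Any) -> DatabaseRecord:
    """Create a record with a freshly generated unique ID."""
    core = CoreMetadata(uid=str(uuid.uuid4()), title=title, owner=owner)
    return DatabaseRecord(core=core, custom=dict(custom))


def search(records: List[DatabaseRecord], key: str, value: Any) -> List[DatabaseRecord]:
    """Return records whose level-2 metadata maps key to value."""
    return [r for r in records if r.custom.get(key) == value]


# Hypothetical example data: two databases with different level-2 fields.
records = [
    register("Bone finds", "lab-a", period="Neolithic", material="bone"),
    register("Pottery archive", "lab-b", period="Bronze Age"),
]
neolithic = search(records, "period", "Neolithic")
print([r.core.title for r in neolithic])
```

Because the second level is an open dictionary, a large archive and a small project database can sit in the same index without agreeing on a shared schema beyond the core fields.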
|Combining Semantic Tagging and Support Vector Machines to Streamline the Analysis of Animal Accelerometry Data
Increasingly, animal biologists are taking advantage of low-cost micro-sensor technology by deploying accelerometers to monitor the behaviour and movement of a broad range of species. The result is an avalanche of complex tri-axial accelerometer data streams that capture observations and measurements of a wide range of animal body motion and posture parameters. We present a system that supports storing, visualizing, annotating, and automatically recognizing activities in accelerometer data streams by integrating semantic annotation and visualization services with Support Vector Machine techniques.
|Panel: Handling Big Data for the Environmental Informatics / Real-Time Environmental Observation, Modeling, and Decision Support
Barbara Minsker, Chaowei Yang, David Maidment, Jeff Dozier, Jong Lee, and Ting Ting Zhao
Earth observations and other environmental data collection methods help us accumulate terabytes to petabytes of data. This poses a grand challenge for environmental informatics. We propose this session to capture the latest developments in several aspects of Big Data collection, processing, and visualization.
With increasing near-real-time availability of embedded and mobile sensors, radar, satellite, and social media, the opportunities to improve understanding, modeling, and management of environmental systems, as well as the built and human systems that interact with them, are immense.
Ian Foster and Tanu Malik
The eScience domain brings together scientists, experts, and engineers to build comprehensive, large-scale data and computational cyberinfrastructures. The objective is to advance knowledge discovery in the sciences and establish effective channels of communication between the various disciplines. Software, data, workflows, technical reports, and publications are often the modes of this communication. However, all these modes of communication are currently disconnected from each other.
E-publishing is changing the nature of scientific communication through digital publication repositories and libraries. But the larger and more pertinent issue is connecting these still-static e-publication repositories to large amounts of computation, data, derived data, and extracted information.
|Machine Assisted Thought
I suggest that there are two distinct branches of eScience, both fundamentally enabled by the explosion of capabilities inherent in the information age. The first concerns the use of numbers, measurements from arrays of sensors, outputs from simulations, and so forth. The techniques of eScience increase our ability to perceive massive amounts of data by factors of billions or trillions. I call this Machine Assisted Perception.
The second branch of eScience concerns the use of words, the verbal abstractions used by humans to communicate ideas. The new technologies of digital libraries and search engines have already substantially changed the scholarly thought process, and growth in the capabilities of these technologies continues to be rapid. I call this machine/human collaboration Machine Assisted Thought.
|Panel: Cloud Computing—What Do Researchers Want?
Dennis Gannon, Fabrizio Gagliardi, Marty Humphrey, and Paul Watson
Cloud computing for science is seeing take-up in many disciplines, but many researchers are skeptical. In this panel session, we discuss:
Carly Strasser, Dong Xe, Eamonn Maguire, Ian Foster, Jim Pinkelman, Michael Witt, Rob Fatland, Steve Tuecke, Tanu Malik, and Yan Xu
At the 2012 eScience Workshop, DemoFest presenters briefly introduce their topics.
|The Utility of Human/Computer Learning Network for Improving Biodiversity Conservation and Research
We describe our work to improve the quality and utility of citizen science contributions to eBird, arguably the largest biodiversity data collection project in existence. Citizen science (the use of “human sensors”) is especially important in a number of observation-based fields, such as astronomy, ecology, and ornithology, where the scale and geographic distribution of phenomena to be observed far exceed the capabilities of the established research community. Our work is based on the notion of a Human/Computer Learning Network, in which the benefits of active learning (in both the machine learning sense and the human learning sense) are cyclically fed back among human and computational participants.
|Educating Scientists About the Data Life Cycle
The research life cycle is well known and consists of an initial idea or question that, if sound, leads to submission and funding of a proposal, implementation of a study and, ideally, to one or many publications that advance the state of knowledge. What is less well understood is how the research life cycle is related to the data life cycle.
|Teaching Scientific Data Management in Data Science Education and Workforce Development Programs for Science Communities
Robert R. Downs
Recent popularity of data science has led to increased recognition of the need for education and workforce development in data science. However, definitions of the term, data science, vary and often focus on techniques for data analytics and visualization, omitting scientific data management and related topics associated with data policy, stewardship, and preservation.
|Tools and Techniques for Outreach and Popular Engagement in eScience
Public participation in scientific research takes many forms: participation of volunteers in citizen science projects, monitoring of natural resources and phenomena, volunteering of computational resources for distributed data analysis tasks, and so forth.
In this presentation, we comment on some of the computational tools, techniques, and case studies of applications that enable active public participation in scientific research. Of particular interest are applications that showcase the benefits of letting the public use the professional resources (in other words, the same data and computational resources that the scientists have access to) and return something back to the research behind it, such as applications that go beyond simple publication of scientific data or applications that use novel methods for user engagement. Examples of applications for scientific outreach that use specialized computational tools or techniques, and/or educational approaches, are also discussed.
|Priorities for Data Curation Education: Data Center Partnerships and Long-Tail Science
For science to fully exploit digital data in new and innovative ways, research data will need to be collected, curated, and made accessible and usable across domains. The need for workforce development in data curation systems and services has been recognized for many years, and education programs are beginning to mature. But to continue to build strong programs in this emerging field, current data curation practice and research need to underpin goals for professional education.
|Big Data Processing on the Cheap
Getting started with big data? Generating more and more data without the hardware resources to process it? This session will help newcomers to 'big data' get started processing and visualizing their data, without the need for expensive computing resources. While these techniques may not produce lightning-fast results, you can at least get started with your analysis.
|Educating a New Breed of Data Scientists for Scientific Data Management
Data scientists play active roles in the design and implementation work of four related areas: data architecture, data acquisition, data analysis, and data archiving. While any data- and computing-related academic unit could offer a data science program or curriculum, each has its own flavor: statistics would weigh heavily toward data analytics, and computer science toward computational algorithms. The information schools are taking a more holistic approach to educating data scientists. This presentation reports on data science curriculum development and implementation at the Syracuse iSchool, which has been shaped by the quickly changing, data-intensive environment not only for science but also for business and research at large. Research projects that we conducted on scientific data management, with participation from the e-science student fellows, demonstrate the need for and significance of educating a new breed of data scientists who have the knowledge and skills to take on the work in the four related areas mentioned above.
|Publishing and eScience Panel
James Frew, Jeff Dozier, Mark Abbott, and Shuichi Iwata
Scientific Publishing in a Connected, Mobile World
Data Journal Challenge for the Fourth Paradigm: Trust through Data on Environmental Studies and Projects
|What Is a Data Scientist?
Kenji Takeda and Liz Lyon
The term "data scientist" is becoming prevalent in science, engineering, business, and industry. We explore how the term is used in different contexts, segments, and sectors; we examine the different variants, flavors, and interpretations; and we try to answer the following questions:
|Informatics, Information Science, Computer Science, and Data Science Curricula
We describe a possible data science curriculum based on discussions at Indiana University and experience with our Informatics, Computer Science, and Library and Information Science programs. This leads to an interesting breadth of courses and student interests, which could address the many job opportunities. We suggest a collaboration to build a MOOC (online) offering with one initial target: minority-serving institutions.
|Data Science Curricula at the University of Washington eScience Institute
The University of Washington eScience Institute is engaged in a number of educational efforts in data science, including certificate programs for professionals, workshops for students in domain science, a new data-oriented introductory programming course, and a data science MOOC to be offered through Coursera in the spring. We consider the tools, techniques, research topics, and skills to be well-aligned with the data-driven discovery emphasis of eScience itself—the only difference is the applications.
We see several benefits in aligning these two areas. For example, students in science majors who are not pursuing research careers become more marketable. In the other direction, working professionals see opportunities to apply their skills to solve science problems—we have recruited volunteers from industry in this way. In this talk, I'll discuss these activities, review our curriculum, and describe our next steps.
|Novel Approaches to Data Visualization
Darren Thompson, Dawn Wright, and George Djorgovski
Data Visualization in Virtual Spaces and High Dimensions
CT and Imaging Tools for Windows HPC Clusters and Azure Cloud
A key goal of our systems is to provide our “end users”—researchers—with easy access to the tools, computational resources, and data via familiar interfaces and client applications without the need for specialized HPC expertise. We have recently explored the adaptation of our CT-reconstruction code to the Windows Azure cloud platform, for which we have constructed a working “proof-of-concept” system. However, at this stage, several challenges remain to be met in order to make it a truly viable alternative to our HPC cluster solution.
Work in Progress Toward Enhancing Multidimensional Visualization with Analytical Workflows
|Panel: Scientific Data: the Current Landscape, Challenges, and Solutions
Carly Strasser, Chris Mentzel, Dave Vieglais, Jeff Dozier, Stephanie Wright, and William Michener
Funders, researchers, and public stakeholders increasingly see the need to better communicate and curate ever-expanding bodies of research data. This panel will bring together many of the stakeholders in the scientific data community, including researchers, librarians, and data repositories.
Before the panel commences, we will provide a brief introduction to scientific data to facilitate discussion. We will describe the current landscape of scientific data and its management, including publication, citation, archiving, and sharing of data. We will also describe existing tools for data management. The panel discussion will focus on identifying gaps and unmet needs in order to help chart a path for future policy, service, and infrastructure development.