New Book Expands on Jim Gray’s Vision
October 16, 2009 7:00 AM PT

During the 2009 Microsoft eScience Workshop, held Oct. 15-17, Jeff Dozier, professor in the Donald Bren School of Environmental Science & Management at the University of California, Santa Barbara, was named winner of the second annual Jim Gray eScience Award. As a leader in his field, Dozier is a contributor to The Fourth Paradigm, a new book published by Microsoft Research, announced Oct. 16. The External Research Division announced the availability of this book, in which a collection of academic visionaries and Microsoft researchers discuss the implications of Gray’s Fourth Paradigm in science. He postulated that data exploration, or, as he termed it, eScience, is the evolutionary next step in scientific exploration, following the original, empirical stage and its subsequent theoretical and computational phases. Gray, a Turing Award-winning computer scientist, was lost at sea in late January 2007 while sailing his 40-foot yacht Tenacious.

The Fourth Paradigm, the book, was edited by Tony Hey, corporate vice president of the External Research Division, along with Stewart Tansley, a senior research program manager in Hey’s group, and Kristin Tolle, a director in the same group. The book features a total of 70 authors, 43 of them from outside Microsoft, representing 20 separate institutions. Hey recently took a few moments to discuss Dozier’s award and the new book, dedicated simply as “For Jim.”

Q: What does the concept of the Fourth Paradigm mean to you personally, and what is its importance to Microsoft?

Hey: The Fourth Paradigm is something that Jim Gray realized after working with a variety of scientists—biologists, chemists, physicists, astronomers, and engineers. It became clear to Jim that their problems were as much about data as about computation and that they needed new skills to manipulate, visualize, and manage large amounts of scientific data.

It was Ken Wilson, Nobel Prize winner in physics, who coined the phrase Third Paradigm to refer to computational science and the need for computational researchers to know about algorithms, numerical methods, and parallel architectures. The skills needed for manipulating, visualizing, managing, and, finally, conserving and archiving scientific data are very different.

The Fourth Paradigm is also an opportunity for Microsoft, because we have technologies that can democratize the way we do science in the future. We can have usable, extensible, and interoperable technologies that can really make a difference to the lives of working scientists. I think Jim’s emphasis on it being a new paradigm is really right.

Q: Why are you releasing the book? What do you hope to gain from this effort?

Hey: Much of the focus of both funding agencies and the computer-science community is on the need for more computational power, going from petaflops to exaflops. While this focus on computation is clearly important, I think it is also important to show that there is a need for a second focus, on the technologies required for data-intensive science.

In this book, we have paired distinguished scientists with computer scientists to give their vision of how they see their fields being transformed in the next five years. In many cases, some research fields genuinely will go from being data-poor to data-rich during this time frame. This will present scientists with new challenges and the need to manipulate, visualize, and combine data sets. In some ways, the book complements the vision contained in Towards 2020 Science, an influential report from Microsoft Research Cambridge that was the brainchild of Stephen Emmott, head of Computational Science in our Cambridge lab.

In addition to highlighting the requirements of data-intensive science, I firmly believe that Microsoft can make a great contribution to helping scientists in their research by raising the level of abstraction—so that they do not need to write lots of low-level scripts to manipulate their data.

The other revolution the book talks about is in scholarly communication. At the moment, when you want to get data from a scientific paper, you often have to actually take a ruler to the published paper and directly measure the data points on a graph. In future electronic versions of a scientific paper, you should be able to click on a point on a graph and go directly to the data or click on the curve and go to the program that produced the curve.

Documents can and will be much more interactive in the future. In addition, besides links to the data, there will also be many types of contextual information associated with a peer-reviewed paper, such as wikis, blogs, and social networks. Just as there is a revolution happening in data-intensive science, there is also a revolution happening in scholarly communication.

Q: Why are you making the book available for free, and why is it being published under a Creative Commons license?

Hey: We wish this book to be maximally useful and widely cited, and what better way to achieve this than making all of the content available free, for reuse under a Creative Commons license?

We want to spread the debate to the largest possible audience, and we hope that this will be a way of broadcasting the content and generating the widest possible circulation of the book. Wherever you are, you should be able to get hold of a copy; you can download it from the Web, get a print-on-demand version, or maybe download a version for the Amazon Kindle or the Sony Reader.

Q: Is Microsoft Research working on any projects mentioned in the book?

Hey: Microsoft Research is working on quite a few of them, but not all of them. Many articles have paired a research scientist with a Microsoft researcher, but others have no specific Microsoft connection.

The book is not intended to be specifically about Microsoft projects, although some of our projects are used as illustrative exemplars. Our projects are only used to illustrate and are not meant to be definitive.

Q: Many of the authors represented in the book are from universities or other parts of the research community. Do you feel this book is typical of how Microsoft Research collaborates with that community?

Hey: To a large extent, yes. We wanted the contributors to be leaders of the field who were capable of looking a few years into the future and who could give a credible vision of how their fields would develop as a result of imminent advances in IT. The authors are pretty special individuals and are, indeed, typical of the sort of scientists with whom we want to engage.

Q: How does the finished product compare to your original vision?

Hey: It’s pretty close. I have to say that I am very pleased with the way the book has turned out, and I think it succeeds in being both interesting and informative. We asked people to write essays who don’t usually write essays, in order that their contributions would be readable, even by non-experts in their fields. I think all the articles support the prescience of Jim’s vision that, in the future of many research fields, the manipulation and management of scientific data will be the key bottleneck.

One of the particularly good things about the book, to my mind, was Gordon Bell’s insistence that the introductory article should be produced from the transcript of Jim Gray’s last talk, to the National Research Council’s Computer Science and Telecommunications Board, given two weeks before he disappeared. The talk was all about Jim’s vision for data-intensive science and the scholarly-communication revolution, and I do not think we could have had a better introduction.

Q: How does the winner of this year’s Jim Gray eScience Award reflect the contributions made by Jim’s work?

Hey: We want the winner of the Jim Gray eScience Award to epitomize Jim’s understanding of the importance of data-intensive science. Alex Szalay [Alumni Centennial Professor in the Department of Physics and Astronomy at Johns Hopkins University] received a Lifetime eScience Award from us for his contributions to data-intensive science a year before the award was renamed the Jim Gray eScience Award. Alex was a longtime collaborator of Jim’s, and his research on such things as the Sloan Digital Sky Survey typifies the sort of significant contribution we are looking for in a Jim Gray eScience Award winner.

Carole Goble [professor of Computer Science at the University of Manchester and winner of the first Jim Gray eScience Award] is an expert database researcher who has been applying her computer-science skills to problems in biology, in projects such as myGrid and myExperiment. I believe that Jim would thoroughly have approved of her role in developing powerful workflow and provenance-tracking technologies with the biologists.

Jeff Dozier, this year’s winner, is from the environmental-science community. His article in the book [The Emerging Science of Environmental Applications] is particularly interesting, because he talks about how environmental sciences in the ’80s were split into geophysics and other small disciplines. Then the community realized these subfields all overlapped and interacted with each other, so in the ’90s, the field evolved to become earth-systems science.

Jeff is now calling for a science of environmental applications. Scientists now have to use their knowledge to solve problems that the world cares about. Scientists need to use all their scientific-research data for a specific action, to try to help solve or alleviate the problems of climate change and global warming.

On a personal note, Jeff also was enormously helpful to me way back in 2001, when I was leading the U.K. eScience program. The U.K. Natural Environment Research Council set up an eScience committee and asked Jeff to be its chairman. In this way, Jeff was able to have enormous positive impact on the U.K. eScience program. He is definitely a worthy winner of the Jim Gray eScience Award!

Q: If Jim were still with us today, what would he think about the book?

Hey: I hope Jim would be extremely pleased. It is really a validation of his vision for data-intensive science.