On June 15, Microsoft’s Washington (D.C.) Innovation and Policy Center will host thought leaders, policymakers, analysts, and press for Microsoft Research’s D.C. TechFair 2011. The event showcases projects from Microsoft Research facilities around the world and gives researchers a forum to discuss with a broader community their work in advancing the state of the art in computing. Microsoft researchers and attendees alike will have an opportunity to exchange ideas on how technologies, and the policies that govern them, can improve our future.
Roger Barga, architect in the Cloud Research Engagement team within the eXtreme Computing Group (XCG), looks forward to the D.C. TechFair, where he will demonstrate Excel DataScope, a Windows Azure cloud service for researchers that simplifies exploration of large data sets.
The audience for the event—members of the Obama administration and staff, members of the U.S. Congress and staff, representatives of prominent think tanks, academics, and members of the media—also will get an opportunity to explore cutting-edge research projects in other areas, such as natural user interfaces, the environment, healthcare, and privacy and security.
“Big data” is a term that refers to data sets whose size makes them difficult to manipulate using conventional systems and methods of storage, search, analysis, and visualization.
“Scientists tend to talk about big data as a problem,” Barga says, “but it’s an ideal opportunity for cloud computing. How large data sets can be addressed in the cloud is one of the important technology shifts that will emerge over the next several years. Microsoft Research’s Cloud Research Engagement projects push the frontiers of client and cloud computing by making investments in projects such as Excel DataScope to support researchers in the field.”
As one of XCG’s cloud-research projects, Excel DataScope offers data analytics as a service on Windows Azure. Users can upload data, extract patterns from data stored in the cloud, identify hidden associations, discover similarities, and forecast time series. The benefits of Excel DataScope go beyond access to computing resources: Users become productive almost immediately because Microsoft Excel acts as an easy-to-use interface for the service.
“Excel is a leading tool for data analysis today,” Barga explains. “With 500,000,000 licensed users, there are incredible numbers of people already comfortable with Excel. In fact, the spreadsheet itself is a fine metaphor for manipulating data. It’s friendly, and it allows different data types, so it’s a good technology ramp to the cloud for data analysts.”
The project enables the use of Excel on the cloud through an add-in that displays as a research ribbon in the spreadsheet’s toolbar. The ribbon provides seamless access to computing and storage on Windows Azure, where users can share data with collaborators around the world, discover and download related data sets, or sample from extremely large—terabyte-sized—data sets in the cloud. The Excel research ribbon also provides new data-analytics and machine-learning algorithms that execute transparently on Windows Azure, leveraging dozens or even hundreds of CPU cores.
“All Excel DataScope does,” Barga explains, “is start up an analytics algorithm in the cloud. You get to visualize the results and never have to move the actual data out of the cloud. We don’t want a data analyst to learn much more than the names of the algorithms and what they do. Users should just think that Excel has a new capability which opens up great opportunities for extracting new insights out of massive data sets.”
While the cloud puts massively scalable computing resources into the hands of users, Barga notes that there are performance differences between cloud-based computers and supercomputers. Supercomputers and high-performance computing clusters are designed to share data at high frequencies and with low latency; in this respect, cloud computing is slower. Another key differentiator is storage services. High-performance clusters have storage arrays that provide high-speed, high-bandwidth pathways to storage. This is not the case in clouds, where storage often resides separately from computing nodes, with multiple routers or hops widening the distance between them.
Barga believes the highly available nature of cloud computing mitigates these differences and changes the game when it comes to opportunities for accessing computing resources.
“Our observation,” he muses, “is that while a cloud may be slower in some regards, you get the computing resources when you want and as much as you want. Many of the major data labs in the country, who have some of the biggest iron around, have wait times of weeks for jobs in the queue, so, in terms of elapsed time, your cloud job could have run ages ago, and your report could be written up by now.”
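Barga's elapsed-time argument comes down to simple arithmetic: what matters to the researcher is queue wait plus run time, not run time alone. The sketch below works through that sum. The figures are hypothetical, chosen only to illustrate the point, and the function name `turnaround_hours` is an assumption rather than anything from the project.

```python
def turnaround_hours(queue_wait_hours: float, run_hours: float) -> float:
    """Total elapsed time from submitting a job to getting its results."""
    return queue_wait_hours + run_hours


# Hypothetical numbers, for illustration only:
# a shared cluster with a two-week queue but a fast two-hour run,
# versus a cloud job that starts at once but runs three times slower.
cluster_total = turnaround_hours(queue_wait_hours=14 * 24, run_hours=2)
cloud_total = turnaround_hours(queue_wait_hours=0, run_hours=6)

print(f"cluster: {cluster_total} h, cloud: {cloud_total} h")
```

Under these assumed figures, the slower cloud run still finishes long before the cluster job leaves the queue.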
Successful university research groups and companies are interested in Microsoft’s Cloud Research Engagement initiatives. Their labs have plenty of processing power, but situations that require fast decision making, such as a pandemic or a new crop virus, make it hard to secure enough CPU cycles on short notice. The cloud, therefore, is a game-changer even for groups that already have many computing resources.
In the same way that Excel was the logical interface choice for the project, the research team selected algorithms to include on the research ribbon based on popularity.
“It turns out there’s a fairly consistent set of tasks,” Barga says, “to making sense out of data, whether in the social sciences, engineering, or oceanography. You need clustering, for example, to see how the data groups together. You want to look for outliers and run regression analysis to understand how the data trends. We felt if we implemented the top two dozen or so algorithms, we would have a good starter set. It’s extensible, so people can add their own analytics over time. That’s when things will get really exciting.”
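Two of the tasks Barga mentions, spotting outliers and fitting a trend line, can be illustrated in a few lines of code. This is a minimal local sketch using only Python's standard library. It is not the Excel DataScope implementation, which runs on Windows Azure and whose code the article does not describe. The function names and the z-score threshold are assumptions made for illustration.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return values whose z-score exceeds the threshold (assumed cutoff)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


readings = [10, 11, 9, 10, 12, 10, 11, 50]
print(find_outliers(readings))

slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```

Clustering, the third task named in the quote, is left out here to keep the sketch short.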
Barga and his colleagues want users to write their own algorithms for the cloud, then upload and register the code on the service. Once that happens, the next time the user logs into Excel DataScope, the algorithm will appear on the research ribbon. When users begin to publish algorithms into a shared workspace, things get even more interesting.
“That’s the vision: for users to publish high-value or specialized algorithms in a viewable library that others can access to try out, install, and make part of their working set of algorithms,” Barga says. “This is where it gets exciting, when experts in particular domains start contributing algorithms that unlock the value of data.”
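The register-then-appear flow described above resembles a simple plug-in registry. The sketch below shows one way such a registry could look. It is purely illustrative: the class `AlgorithmRegistry` and its methods are hypothetical names, not the service's actual API, which the article does not document.

```python
from typing import Callable, Dict, List, Sequence


class AlgorithmRegistry:
    """Hypothetical stand-in for the service's list of registered algorithms."""

    def __init__(self) -> None:
        self._algorithms: Dict[str, Callable[[Sequence[float]], object]] = {}

    def register(self, name: str, func: Callable[[Sequence[float]], object]) -> None:
        """Add a user-written algorithm under a unique name."""
        if name in self._algorithms:
            raise ValueError(f"algorithm {name!r} is already registered")
        self._algorithms[name] = func

    def ribbon_entries(self) -> List[str]:
        """Names that would be listed on the ribbon, in alphabetical order."""
        return sorted(self._algorithms)

    def run(self, name: str, data: Sequence[float]) -> object:
        """Look up an algorithm by name and apply it to the data."""
        return self._algorithms[name](data)


registry = AlgorithmRegistry()
registry.register("value_range", lambda data: max(data) - min(data))

print(registry.ribbon_entries())
print(registry.run("value_range", [3, 9, 4]))
```

A shared library of the kind Barga envisions would add publishing and access control on top of this, which the sketch does not attempt.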
The ability to share both data and algorithms has been one of the project’s design goals. Excel DataScope includes the notion of security-enhanced workspaces, where users can upload data sets to share with research colleagues anywhere in the world. This opens opportunities for cross-discipline collaboration that is nothing less than transformative.
For example, say that an expert in oceanography works within a particular discipline and with data specific to an area of study. Understanding and predicting the effects of an oil spill, however, requires knowledge from multiple disciplines, such as ocean chemistry, biology, and ecology. Simulations of complex oceanographic and atmospheric models require mining, searching, and analysis of huge data sets in near-real time, across disciplines, as never before. The ability to collaborate and extract insight from large data sets is part of a shift from traditional paradigms of theory, experimentation, and computation to data-driven discovery.
Barga and colleagues from the Cloud Research Engagement team are busy publicizing the resources they offer to researchers in the field. The response to their talks and introductory videos validates the research community’s interest in data-driven discovery.
“We hear a lot of excitement,” he says, “and there are always researchers in the back of the room who want to know whether we have algorithms or data sets for a particular domain. We’ll say, ‘Sorry, we don’t have those,’ but they’ll say: ‘No, that’s good. We have expertise in that area, and we’d love to build a library of algorithms, contribute data, and make it available to the rest of the world,’ which is very encouraging—not to mention very cool.”
During the D.C. TechFair, Barga will discuss the value of cloud computing in the context of the Open Government Directive, which makes data available to the public through websites such as data.gov.
“The government is contributing data sets,” he explains, “and we’d like to engage with scientists who are willing to add other data sets to data.gov. But data by itself is not enough. We’d like to see people proposing analytics associated with those data sets, so that when you go to data.gov, you also find useful algorithms to run against the data. We’d like to talk to scientists who want to extract insights or craft policies based on that data.”
At the moment, relatively few scientists or organizations can analyze big data, because they lack either the knowledge or the resources. This distances data from the people who need to make decisions. The team plans to expand the Excel DataScope initiative later this year to include more of the research community and to release a programming guide that will explain how to write algorithms that scale out on Windows Azure.
Barga sees the project as revolutionary: It democratizes access to data and, consequently, insights into data. He envisions a future data market in which users can find data sets and, with a few mouse clicks, select algorithms that the system identifies as relevant to the selected data, then start analyzing.
“We have architected a pathway from Excel to the cloud,” he says. “We have an infrastructure that sets the stage for a future where a decision maker can select from huge data sets and ask about trends in healthcare, poverty, or education. When we get there, it will have a huge economic impact, not just for scientific research, but also for businesses and for the country. This is a dialogue we’re trying to open and the space we are trying to move into with this project.”