Iterative MapReduce on Windows Azure
Microsoft has developed an iterative MapReduce runtime for Windows Azure, code-named "Daytona." Project Daytona is designed to support a wide class of data analytics and machine learning algorithms. It can scale out to hundreds of server cores for analysis of distributed data.
Project Daytona was developed as part of the eXtreme Computing Group’s Cloud Research Engagement Initiative, and made its debut at the Microsoft Research Faculty Summit. One of the most common requests we have received from the community of researchers in our program is for a data analysis and processing framework. Increasingly, researchers in a wide range of domains (such as healthcare, education, and environmental science) have large and growing data collections, and they need simple tools to help them find signals in their data and uncover insights. We are making the Project Daytona MapReduce Runtime for Windows Azure freely available for download, along with sample code and instructional materials that researchers can use to set up their own large-scale, cloud data-analysis service on Windows Azure. In addition, we will continue to improve and enhance Project Daytona (periodically making new versions available) and support our community of users.
The project code-named Daytona is built on Windows Azure and employs the available Windows Azure compute and data services to offer a scalable and high-performance system for data analytics. To deploy and use Project Daytona, you need to follow these simple steps:
- Develop your data analytics algorithm(s). Project Daytona enables a data analytics or machine learning algorithm to be authored as a set of Map and Reduce tasks, without in-depth knowledge of distributed computing or software development on Windows Azure. To get you up and running quickly, the release package includes sample data analysis algorithms to provide you with examples for building a data analysis library, as well as a developer guide with step-by-step instructions for authoring new algorithms, and source code for a sample client application for integrating your existing applications with Project Daytona on Windows Azure.
- Upload your data and library of data analytics routines into Windows Azure. Windows Azure blob storage provides reliable, scalable, easy-to-use storage for your data and library of analysis routines. Our documentation clearly outlines the steps for doing this.
- Deploy the Daytona runtime to Windows Azure. By following the steps in the deployment guide, deploy the Daytona runtime to your Windows Azure account. You can configure the number of virtual machines for the deployment, specify and configure the storage account on Windows Azure for the analysis results, and then start and verify that the service is operational. Project Daytona enables you to use as many or as few virtual machines as you wish. When you are finished with your data analysis, you can follow the steps in the deployment guide to shut down the running instances and tear down your deployment.
- Launch data analytics algorithms. The Daytona release package provides you with source code for a simple client application that you can use to select and launch a data analytics model against a data set on Project Daytona. This client application is merely one example of how to integrate Project Daytona with a client application, perhaps one already in use in your lab. Alternatively, you can author a Windows Azure service interface for job submission and monitoring.
Project Daytona will automatically deploy the iterative MapReduce runtime to all of the Windows Azure virtual machines (VMs) in the deployment, sub-dividing the data into smaller chunks so that they can be processed (the “map” function of the algorithm) in parallel. It then recombines the processed data into the final solution (the “reduce” function of the algorithm). Windows Azure storage serves as the source for the data that is being analyzed and as the output destination for the end results. Once the analytics algorithm has completed, you can retrieve the output from Windows Azure storage or continue processing the output by using other analytics models. Project Daytona demonstrates the power of taking advantage of Windows Azure cloud services for application design.
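The split, map, and reduce flow described above can be illustrated with a small, self-contained Python sketch. This is not Project Daytona code and does not use its API: the function names (`split_into_chunks`, `map_task`, `reduce_task`, `run_job`) are hypothetical, a local thread pool stands in for the pool of Windows Azure VMs, and an in-memory list stands in for data held in Windows Azure storage.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(records, num_chunks):
    """Partition the input into roughly equal slices (at most num_chunks)."""
    size = max(1, -(-len(records) // num_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_task(chunk):
    """Map: count the words in one chunk, independently of all other chunks."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_task(left, right):
    """Reduce: merge two partial word counts into one."""
    return left + right


def run_job(records, num_workers=4):
    """Split the data, run the map tasks in parallel, then combine the results."""
    chunks = split_into_chunks(records, num_workers)
    # Hypothetical stand-in: threads play the role of worker VMs here.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(map_task, chunks))
    return reduce(reduce_task, partials, Counter())


if __name__ == "__main__":
    lines = ["azure cloud data", "cloud data analytics", "data"]
    print(run_job(lines).most_common(2))
```

Because each map task touches only its own chunk, adding workers changes how the input is partitioned but not the final result, which is the property that lets a runtime of this kind scale out by adding machines.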
Project Daytona features the following key properties.
- Designed for the cloud, specifically for Windows Azure. Virtual machines, whether offered as infrastructure as a service (IaaS) or platform as a service (PaaS), introduce unique challenges and architectural tradeoffs for implementing a scale-out computation framework such as Project Daytona. Of these, the most crucial are network communications between virtual machines (VMs) and the non-persistent disks of VMs. We have tuned the scheduling, the network communications scheduling, and the fault tolerance logic of Project Daytona to suit this situation.
- Designed for cloud storage services. We have defined a streaming-based data-access layer for cloud data sources (currently Windows Azure blob storage, but we will extend it to others), which can partition data dynamically and support parallel reads. Intermediate data can reside in memory or on local non-persistent disks with backups in blobs, so that Project Daytona can consume data with minimal overhead and recover from failures. We use the automatic persistence and replication that is provided by the Windows Azure storage services and, therefore, do not require a distributed file system.
- Horizontally scalable and elastic. Computations in Project Daytona are performed in parallel, so to scale a large data-analytics computation, you can add more virtual machines to the deployment and Project Daytona will take care of the rest. By using Project Daytona on Windows Azure, you can instantly provision as much or as little capacity as you need to perform data-intensive tasks for applications such as data mining, machine learning, financial analysis, or data analytics. Project Daytona lets you focus on your data exploration without having to worry about acquiring compute capacity or time-consuming hardware setup and management.
- Optimized for data analytics. We designed Project Daytona with the performance of data analytics in mind. Algorithms in data analytics and machine learning are often iterative and produce a sequence of answers of improving quality until they converge. Project Daytona supports iterative computations in its core runtime: it caches data between iterations to reduce communication overhead, applies different scheduling and relaxed fault tolerance mechanisms, and offers a natural programming API for authoring iterative algorithms.
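To make the iterative pattern concrete, the following Python sketch runs a one-dimensional k-means clustering as repeated map and reduce rounds until the answer converges. It is an illustration under stated assumptions, not Project Daytona's programming API: all names (`map_task`, `reduce_task`, `iterative_kmeans`, `cached_chunks`) are hypothetical, and the "cache" is simply a list of chunks kept in memory so that only the small set of centroids changes between iterations.

```python
def map_task(chunk, centroids):
    """Map: assign each point in a cached chunk to its nearest centroid.

    Returns {centroid_index: (sum_of_points, point_count)} for this chunk.
    """
    partial = {}
    for x in chunk:
        nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        total, count = partial.get(nearest, (0.0, 0))
        partial[nearest] = (total + x, count + 1)
    return partial


def reduce_task(partials, centroids):
    """Reduce: combine the per-chunk sums into updated centroids."""
    totals = {}
    for partial in partials:
        for idx, (total, count) in partial.items():
            t, c = totals.get(idx, (0.0, 0))
            totals[idx] = (t + total, c + count)
    # A centroid that attracted no points keeps its previous position.
    return [
        totals[i][0] / totals[i][1] if i in totals else centroids[i]
        for i in range(len(centroids))
    ]


def iterative_kmeans(cached_chunks, centroids, tolerance=1e-6, max_iterations=100):
    """Driver: rerun map and reduce over the same chunks until centroids settle."""
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        # The input chunks are reused as-is; only the centroids are re-sent.
        partials = [map_task(chunk, centroids) for chunk in cached_chunks]
        updated = reduce_task(partials, centroids)
        shift = max(abs(a - b) for a, b in zip(updated, centroids))
        centroids = updated
        if shift < tolerance:
            break
    return centroids, iteration


if __name__ == "__main__":
    chunks = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
    print(iterative_kmeans(chunks, [0.0, 5.0]))
```

The design point the sketch tries to show is that the large input is loaded once and reused in every round, while each iteration exchanges only the centroids and the per-chunk partial sums.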
There are a number of use cases for Project Daytona, such as for data analysis, machine learning, financial analysis, text processing, indexing, and search. Almost any application that involves data manipulation and analysis can take advantage of Project Daytona to scale out processing on Windows Azure.
We are actively exploring a specific use case for Project Daytona, as outlined below.
Data analytics as a service on Windows Azure, accessible to a host of clients, is about turning utility cloud computing into a service model for data analytics. In our view, this service is not limited to a single data collection or set of analytics; rather, it offers the ability to upload data and select from an extensible library of models for data analysis. Powered by Project Daytona, the service will automatically scale out the data and analytics model across a pool of Windows Azure VMs without the overhead that is usually associated with typical business intelligence (BI) and data analysis projects. The analytic application possibilities are limited only by your imagination.
We have implemented one such application, which we call Excel DataScope. From the familiar interface of Microsoft Excel, Excel DataScope enables researchers to accelerate data-driven decision making. Our DataScope analytics service offers a library of data analytics and machine learning models, such as clustering, outlier detection, and classification, along with information visualization, all implemented on Project Daytona. Users can upload data in their Excel spreadsheet to the DataScope service or select a data set already in the cloud, and then select an analysis model from our Excel DataScope research ribbon to run against the selected data. Project Daytona will scale out the model processing by using possibly hundreds of CPU cores to perform the analysis. The results can be returned to the Excel client or remain in the cloud for further processing and/or visualization. The algorithms and analysis techniques are applicable to any type of data, ranging from web analytics to survey, environmental, or social data.
- See Overview for information about what is included in the release package.
While Daytona is a novel iterative MapReduce runtime, designed for Microsoft Windows Azure and optimized for data analytics, the concepts in Daytona have several related works in the literature. The following are some of the notable ones.
MapReduce and Dryad introduced simplified programming abstractions for large-scale data processing on commodity clusters. Daytona adopts this strategy by providing a simple programming model for large-scale data analytics based on MapReduce. Iterative MapReduce for distributed memory architectures was first introduced by the Twister project [3-4], and the Twister4Azure project [5-6] introduced iterative MapReduce on Windows Azure. HaLoop and iMapReduce are related research efforts that optimize iterative MapReduce computations. Like other iterative MapReduce runtimes, Daytona also provides additional optimizations to enhance the performance of iterative computations on Windows Azure.
 J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, pp. 107-113, 2008.
 M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," presented at the Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, Lisbon, Portugal, 2007.
 Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox, "Twister: a runtime for iterative MapReduce," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC '10). ACM, New York, NY, USA, 810-818. DOI=10.1145/1851476.1851593
 Thilina Gunarathne, Bingjing Zhang, Tak-Lon Wu, and Judy Qiu, "Portable Parallel Programming on Cloud and HPC: Scientific Applications of Twister4Azure," in Proceedings of the Fourth IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2011), Melbourne, Australia, Dec 2011.
 Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst. 2010. HaLoop: efficient iterative data processing on large clusters. Proc. VLDB Endow. 3, 1-2 (September 2010), 285-296.
 Yanfeng Zhang, Qinxin Gao, Lixin Gao, and Cuirong Wang, "iMapReduce: A Distributed Computing Framework for Iterative Computation," 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), pp. 1112-1121, 16-20 May 2011, doi: 10.1109/IPDPS.2011.260
What’s Next for Project Daytona?
Project Daytona is part of an active research and development project in the eXtreme Computing Group of Microsoft Research.
The current release of Project Daytona is a research technology preview (RTP). We are still tuning the performance of Project Daytona and adding new functionality, and we will fix any software defects that are identified (see our email link below).
Our research on Project Daytona and its use for cloud data analytics is far from complete. In the summer of 2011, three PhD candidates, Romulo Goncalves (CWI), Atilla Balkir (University of Chicago), and Chen Jin (Northwestern University), joined our group to work with our team on data streaming support in Project Daytona, optimizations to the core runtime, support for incremental processing, and data services to minimize latency and data movement. We look forward to sharing these results in our technical papers and in improved versions of Project Daytona that will be released in Spring 2012.