DataCenter Services for Data Intensive Computing
As data-intensive computing becomes widespread, it is increasingly valuable to present a simple, pervasive programming model for querying and manipulating complex data sets. A developer should be able to take a program written against a small collection of data on a local client machine and apply it to much larger data sets stored either in local, private, high-performance-computing clusters within an organization or public cloud services such as Azure. The language compiler and runtime should ensure that the program scales automatically to exploit the massively parallel storage and computing resources of these clusters. This collection of demos describes our infrastructure and language projects that are supporting this goal of painless distributed computing for data-intensive applications. Dryad is an execution engine that enables reliable, distributed computing across thousands of servers for large-scale-data parallel applications. DryadLINQ combines LINQ and Dryad to provide a simple, powerful, elegant programming environment for large-scale-data parallel computing. We also will demonstrate a simple storage system optimized for Dryad workloads, in addition to Job Browser, a suite of tools that simplify the profiling and debugging of complex distributed applications.
In this demo we showcase efforts in Microsoft Research to collaborate with external researchers to explore the application of new technologies, specifically Dryad and DryadLINQ, to big data research problems in science. We also highlight our efforts to provide software and services to academics across the world, through the release of Dryad and DryadLINQ free of charge to the research community, along with associated programming guides, user documentation, and code libraries. Dryad is a general-purpose distributed computing engine, more flexible than MapReduce or Hadoop, that was designed to simplify the task of implementing distributed applications on clusters of Windows computers. DryadLINQ is an abstraction layer that simplifies the process of implementing Dryad-based applications. Microsoft Research is acutely aware of the ubiquity of big data and the challenges it presents. We are offering researchers the tools, resources, and collaboration to explore this new area.
A Simple and Small Distributed File System
This research showcases an extremely simple distributed file system that provides all the necessary services for a cluster running data-parallel computations. The file system consists of two components: a centralized server that stores the metadata mapping data streams to a sequence of files stored in the NTFS file system, and a set of housekeeping tasks implemented as a Windows service on the storage nodes. For scalability and fault tolerance, the centralized portion of the service is replicated across multiple machines using Paxos.
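The metadata mapping described above can be pictured with a minimal sketch. It assumes an in-memory catalog that maps each stream name to an ordered list of file locations. The class names, node names, and paths are hypothetical, and Paxos replication and the housekeeping service are not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class FileRef:
    """One file holding part of a stream (both fields are hypothetical names)."""
    node: str  # storage node that holds the file
    path: str  # path of the file in that node's local file system


@dataclass
class StreamCatalog:
    """In-memory stand-in for the centralized metadata server."""
    streams: Dict[str, List[FileRef]] = field(default_factory=dict)

    def append(self, stream: str, ref: FileRef) -> None:
        """Record that `ref` is the next file in `stream`."""
        self.streams.setdefault(stream, []).append(ref)

    def lookup(self, stream: str) -> List[FileRef]:
        """Return the ordered files of `stream` (empty list if unknown)."""
        return list(self.streams.get(stream, []))


# Example: one stream spread over two storage nodes.
catalog = StreamCatalog()
catalog.append("logs/2010-03", FileRef("node01", r"D:\data\part-0000"))
catalog.append("logs/2010-03", FileRef("node07", r"D:\data\part-0001"))
```

A reader of the stream would ask the catalog for the file list and then fetch each file from the node that holds it, in order.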
The cluster browser is an application designed for analyzing and troubleshooting the performance of large clusters running data-center services. It consists of four modules: 1) distributed-log collection and extraction, 2) a database storing the extracted data, 3) an interactive visualization tool for exploring the data, and 4) a plug-in interface and a set of sample plug-ins that enable users to implement data-analysis tools. We have augmented the cluster browser with a job browser specialized for visualizing, monitoring, profiling and debugging DryadLINQ jobs.
Databases can serve many social goals, such as fair allocation of resources and identification of genetic markers for disease. But privacy concerns often discourage participation or hinder access to, or sharing of, potentially valuable information once it is collected. We have developed a definitional, analytical framework to facilitate the design of privacy-preserving algorithms with quantifiable and rigorous privacy guarantees. Our approach treats privacy as a non-renewable resource, charging every query into a database against a privacy budget. This encourages developers to optimize their algorithms along one additional dimension, privacy, alongside cost measures such as time, energy, or accuracy.
The problem of statistical disclosure control (revealing accurate statistics about a population while preserving the privacy of individuals) is a multidisciplinary effort spanning statistics, theoretical computer science, security, and databases. This project revisits private data analysis from the perspective of modern cryptography. We address many previous difficulties by obtaining a strong, yet realizable, definition of privacy. Intuitively, differential privacy ensures that the system behaves essentially the same way, independent of whether any individual, or small group of individuals, opts into or out of the database. In addition to the simple and intuitive semantics, the guarantee of differential privacy holds even with arbitrary existing or future knowledge available to a “privacy adversary,” completely solving the problem of database-linkage attacks.
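For reference, the intuition above corresponds to the standard formal statement of differential privacy. The symbols here (the mechanism M, the parameter ε, the databases D and D′) are the usual textbook notation and do not appear in the abstract itself.

```latex
% A randomized mechanism M is \varepsilon-differentially private if, for every
% pair of databases D and D' that differ in the records of one individual,
% and for every set S of possible outputs,
\[
  \Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,].
\]
% For databases that differ in the records of k individuals, applying the
% inequality k times gives the group guarantee
\[
  \Pr[\,M(D) \in S\,] \;\le\; e^{k\varepsilon} \cdot \Pr[\,M(D') \in S\,].
\]
```

A smaller ε means the two output distributions are closer, so any one person's participation has less influence on what an observer sees.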
Recent research on privacy-preserving data mining and analysis (done largely at Microsoft Research) has produced several exciting results, demonstrating the possibility of a broad class of data analyses that provide mathematical guarantees on the privacy of the underlying records. Specifically, these analyses are guaranteed to unfold essentially identically with and without any one user’s records; neither the user, nor anyone else, can even tell if the user participated in the data set, much less the contents of their records. This privacy guarantee is called “Differential Privacy”. We present a programming interface to data, much like a standard database, that automatically guarantees differential privacy. The analyst describes the computation they wish to perform (e.g., “count the number of patients admitted from a certain zip code, with the following symptoms, last month”) and our execution platform provides a response certain to respect the strong formal guarantees above. Importantly, the system itself provides the privacy guarantees, and does not require privacy expertise of the users, or the participation of a privacy expert. The users, who are likely expert in other areas (e.g., epidemiology, sociology, public policy), can focus on applying their expertise to the task at hand, without stumbling over privacy constraints, and without the risk of unintended disclosure of sensitive information.
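To make the idea of a privacy-aware query interface concrete, here is a minimal sketch of a counting query with a privacy budget. It is not the platform's actual API: the class and method names are hypothetical, and it uses the textbook Laplace mechanism (a count changes by at most 1 when one record is added or removed, so Laplace noise of scale 1/ε suffices). It also ignores floating-point subtleties that a hardened implementation would have to address.

```python
import random


class BudgetExceeded(Exception):
    """Raised when a query asks for more privacy budget than remains."""


class PrivateDataset:
    """Hypothetical wrapper that answers counting queries with Laplace noise."""

    def __init__(self, records, total_epsilon, seed=None):
        self._records = list(records)
        self.remaining_epsilon = total_epsilon  # the privacy budget
        self._rng = random.Random(seed)

    def _laplace(self, scale):
        # The difference of two independent exponentials with mean `scale`
        # follows a Laplace distribution centered at 0 with that scale.
        rate = 1.0 / scale
        return self._rng.expovariate(rate) - self._rng.expovariate(rate)

    def noisy_count(self, predicate, epsilon):
        """Return a noisy count of matching records, charging `epsilon`."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining_epsilon:
            raise BudgetExceeded("not enough privacy budget left")
        self.remaining_epsilon -= epsilon
        true_count = sum(1 for record in self._records if predicate(record))
        # Counting has sensitivity 1, so the noise scale is 1 / epsilon.
        return true_count + self._laplace(1.0 / epsilon)


# Example with made-up records: how many admissions came from one zip code?
patients = [{"zip": "94043"}] * 3 + [{"zip": "10001"}]
data = PrivateDataset(patients, total_epsilon=1.0, seed=0)
answer = data.noisy_count(lambda r: r["zip"] == "94043", epsilon=0.4)
```

The analyst only writes the predicate; the wrapper adds the noise and refuses further queries once the budget is spent, which mirrors the "privacy as a non-renewable resource" framing.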
Pan-Private Streaming Algorithms
Collectors of confidential data, such as governmental agencies, hospitals, or search-engine providers, can be pressured to permit data to be used for purposes other than those for which they were collected. To support data curators, we have initiated a study of pan-private algorithms. Roughly speaking, these algorithms retain their privacy properties even if their internal state becomes visible to an adversary. Our principal focus is on streaming algorithms.
Adding Privacy to Netflix Recommendations
When the first Netflix Prize was announced in October 2006, it generated a lot of excitement among machine-learning researchers, who gained access to a large, rich corpus of data with great data-mining potential. The prize also led to influential work attacking the privacy of individuals whose records were disclosed by Netflix by linking them with profiles available elsewhere, ultimately leading to cancellation of the prize's sequel. The lessons learned include better appreciation for the value contained in the data made available to researchers and increased awareness of its potential to cause privacy breaches. We have designed a recommender system that is provably privacy-preserving yet generates recommendations on par with the baseline score of the Netflix Prize.
Microsoft Research/Bing Technology Transfer Showcase
Microsoft Research has contributed numerous technologies to the Bing search engine. Years of research efforts on intensive, data-driven problems have resulted in cutting-edge technology transfers. And Microsoft Research’s ongoing collaborations with Bing product teams continue to evolve the search paradigm. The research presented during the Silicon Valley TechFair provides several examples of Microsoft Research projects that have moved the bar on search.
WISE: Large-Scale Web Image Search and Exploration
WISE is a Web-scale, content-based image-retrieval system. There are many applications: Imagine that you could submit to Bing a photo taken by your phone (a face, a product, or a historic landmark) and get back relevant information about the photo. We address two major challenges in such an image-search system: large-scale machine learning for image representation, and efficient image indexing and querying. WISE showcases general image search as well as specific applications, such as facial-image search, logo search, landmark search, and product search.
Worldwide Telescope in Bing Maps
The WorldWide Telescope Map App in Bing lets you move seamlessly from earth to sky, using WorldWide Telescope data and images within Bing Maps. It shows where celestial objects would appear in real time if you were to look up at the night sky, and it lets you navigate the universe the same way you navigate Bing Maps, by grabbing an area and dragging the map around.
Real Time Search
Bing merges real-time content with Web search technology to present compelling new search results for users. Find out how Microsoft Research contributions helped build this powerful feature in Bing, and try it yourself at www.bing.com/twitter.
We are changing the paradigm of search, from the system guessing what you want based on a few words you type, to a dialog where you can continue to clarify your intent. We will demonstrate Bing Active Answers, which allow the user to interact with Bing to refine what they meant, get results from Bing at keystroke speeds, share information with friends, and more. Microsoft Research built and shipped the first Active Answer to let users check their airline flight status, and then helped interactivity become more widespread across Bing’s properties, including local results, travel, reference, and even connections with Bing’s partners.
Sketch-Based Distance Estimates for Web-Scale Graphs
We study the fundamental problem of computing distances between nodes in large graphs such as the Web graph and social networks. Our objective is to be able to answer distance queries between a pair of nodes in real time. Because standard shortest-path algorithms are expensive, our approach moves the time-consuming shortest-path computation offline and, at query time, only looks up precomputed values and performs simple, fast computations on these precomputed values. During the offline phase, we compute and store a small sketch for each node in the graph, and at query time, we look up the sketches of the source and destination nodes and perform a simple computation using these two sketches to estimate the distance.
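The offline/online split can be illustrated with a small sketch. The abstract does not say how sketches are built, so this example assumes one common construction: each node's sketch stores its hop distance to a few fixed landmark nodes, and a query returns the shortest detour through a shared landmark. The function names are hypothetical, and the graph is assumed to be undirected and unweighted.

```python
from collections import deque


def bfs_distances(graph, source):
    """Hop distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def build_sketches(graph, landmarks):
    """Offline phase: for each node, store its distance to each landmark."""
    sketches = {node: {} for node in graph}
    for landmark in landmarks:
        for node, d in bfs_distances(graph, landmark).items():
            sketches.setdefault(node, {})[landmark] = d
    return sketches


def estimate_distance(sketches, u, v):
    """Query phase: best detour through a landmark known to both nodes."""
    shared = sketches[u].keys() & sketches[v].keys()
    if not shared:
        return None  # no common landmark, so no estimate is available
    return min(sketches[u][lm] + sketches[v][lm] for lm in shared)


# Example: a path graph a - b - c - d - e with a single landmark c.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
sketches = build_sketches(path, landmarks=["c"])
```

By the triangle inequality the estimate never falls below the true distance. It is exact when some landmark lies on a shortest path between the two nodes (here, `a` to `e`) and an overestimate otherwise (here, `a` to `b`).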
Stereo Display with Correct Focus Cues
Stereo displays create an exciting sense of depth, but extended viewing often results in discomfort, and some viewers are unable to perceive the stereo effect. One reason is that viewers must focus their eyes at the distance of the screen, even while directing their gaze to objects whose simulated distance is quite different. To address this defect, we and our colleagues have created prototype stereo displays that approximate correct focus cues, enabling focus and viewing distances to remain coupled. Research using these displays has confirmed the importance of focus cues for viewing comfort and correct stereo effect. We will demonstrate two stereo display technologies, one that was developed almost ten years ago using stacked image planes, and a new approach using a micro-lens array to create a light field with an angular resolution exceeding that of lenticular auto-stereoscopic displays by two orders of magnitude.
Tracking Internet Hosts Using Unreliable IDs
Today's Internet is open and anonymous. While it permits free traffic from any host, attackers that generate malicious traffic typically cannot be held accountable. We will present a system that tracks dynamic bindings between hosts and IP addresses by leveraging application-level data with unreliable IDs. Using a month-long Hotmail user-login trace, we show that this system can attribute most of the activities reliably to the responsible hosts, despite the existence of dynamic IP addresses, proxies, and NATs. With this information, we are able to analyze the host population, to conduct forensic analysis, and to blacklist malicious hosts dynamically.
Mobile-to-Mobile Networking in 3G Networks
Mobile devices increasingly are first-class computing devices, generating large amounts of data. Searching and sharing data securely across multiple devices can be a significant challenge. We have built Contrail, a communication abstraction for P2P communication on mobile phones. Communication in Contrail is purely asynchronous, which copes with the fact that phones are temporarily disconnected from the network. Each phone sets up filters on other phones that express the data it is interested in. We will demonstrate the usage of Contrail with three applications: P2P content distribution, P2P search, and location-based group communication.
New Technologies for Multi-Image Fusion
As video and still cameras have become almost ubiquitous, people are taking more and more photographs and videos of the world around them. Often, the photographer's intent is to capture more than what can be seen in a single photograph, and he or she instead takes a large set of images or a video clip to capture a large scene or a moment that extends over time. One can combine these images to produce an output that improves on the input images, such as an image with a large field of view (a panorama) or a composite that takes the best parts of each image (a photo montage). But creating these results is still non-trivial for many users. One challenge is in creating large-scale panoramas, for which the capture and stitching times can be long. In addition, when using consumer-level point-and-shoot cameras and camera phones, artifacts such as motion blur appear. Another challenge is combining large image sets from photos or videos to produce results that use the best parts of the images to create an enhanced photograph. We will present several new technologies that advance the state of the art in these areas and create improved user experiences. For panorama generation, we will demonstrate ICE 2.0, stitching of panoramas from video, and generation of sharp panoramas from blurry videos. For generating composites, we will demonstrate video to snapshots, and de-noising and sharpening using lucky imaging.
eScience in the Cloud
Scientific applications have diverse data and computational needs that scale from desktop to supercomputers. Besides the nature of the application and the domain, the resource needs for the applications also vary over time—as the collaboration and the data collections expand, or when seasonal campaigns are undertaken. Cloud computing offers a scalable, economic, on-demand model well-matched to evolving eScience needs. We will present a suite of science applications that leverage the capabilities of Microsoft's Windows Azure cloud-computing platform. We will show tools and patterns we have developed to use the cloud effectively for solving problems in environmental science.
The Translating! Telephone
We will demonstrate a system for live speech-to-text and speech-to-speech translation of telephone calls. Douglas Adams' Babelfish inspired dreams of unfettered universal communication. Though we are still far from achieving that goal, there are scenarios in which today's limited accuracy can create value. Our goal in the telephone-call scenario is to provide an aid for cross-language communication in the event that no other means of communication exists. The system we will show makes extensive use of speaker-adaptation technologies to achieve reasonable, real-time speech-to-text transcription accuracy. This is then translated live using machine translation to provide speech-to-text translation and further fed into a text-to-speech system to realize speech-to-speech translation. The speech-to-text transcript and the translated transcript are shown to the users to enable validation of their intentions. This system will be demonstrated by a live conversation between German and English speakers.
Mobile Assistance Using Infrastructure (MAUI)
Mobile devices have reached an impasse. Although the resources that can be integrated onto mobile handheld devices will continue to improve (faster CPUs, more RAM, faster wireless NICs), making substantial use of these resources will require a major breakthrough in battery technology. To bypass these limitations, MAUI (Mobile Assistance Using Infrastructure) is a system that enables fine-grained offloading of mobile applications to cloud-based infrastructure. By leveraging nearby infrastructure, MAUI enables a new class of resource-intensive applications, such as augmented reality, to run on mobile handheld devices. With MAUI, we enable resource-intensive .NET applications to run on Windows Mobile smartphones. We will demonstrate a resource-intensive face-recognition application that consumes an order of magnitude less energy, and a voice-based translation application that previously could not run using only the limited resources available on today's smartphones.
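The energy trade-off behind offloading can be shown with a deliberately simplified sketch. This is not MAUI's decision procedure, which the abstract does not describe. It assumes a single call is considered in isolation and that offloading pays off when the radio energy for shipping the call's state, plus the energy spent idling while waiting for the result, is below the energy of running the call locally. All parameter names and numbers are hypothetical.

```python
def should_offload(local_energy_j, bytes_to_transfer, radio_j_per_byte,
                   idle_wait_j=0.0):
    """Return True if remote execution is estimated to cost less energy.

    local_energy_j    -- estimated energy (joules) to run the call on the phone
    bytes_to_transfer -- size of the state sent to and from the infrastructure
    radio_j_per_byte  -- estimated radio energy per byte on the current link
    idle_wait_j       -- energy the phone spends waiting for the remote result
    """
    remote_energy_j = bytes_to_transfer * radio_j_per_byte + idle_wait_j
    return remote_energy_j < local_energy_j


# Example with made-up numbers: a heavy computation with a small input
# is worth offloading; a light computation with a large input is not.
heavy_small = should_offload(local_energy_j=5.0, bytes_to_transfer=20_000,
                             radio_j_per_byte=1e-5, idle_wait_j=0.5)
light_large = should_offload(local_energy_j=0.2, bytes_to_transfer=2_000_000,
                             radio_j_per_byte=1e-5)
```

The comparison makes the role of the network explicit: the same call may be worth offloading on a cheap, fast link and not worth it on an expensive one.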
Greening Corporate Networks with Sleep Proxy
In a corporate network, most desktop machines always are left on, even when they are not in use for extended periods, such as at night. This is wasteful, bad for the environment, and bad for the corporate treasury. While Windows 7 provides aggressive sleep functionality, most users override it because they occasionally might want to access their machine remotely. Ideally, a desktop would go to sleep when not in use and awaken seamlessly when the user tries to access it. We have built a system to enable this. Our system consists of a sleep server that maintains the network presence of the sleeping machine and seamlessly awakens it on remote access. We do not require special hardware or changes to existing software. Our system is operational in Microsoft building 99 and has resulted in substantial savings in terms of money, power consumption, and carbon-dioxide emissions.
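The abstract does not say how the sleep server awakens a machine. One widely used mechanism, assumed here purely for illustration, is Wake-on-LAN, in which the sleeping machine's network card listens for a "magic packet": six `0xFF` bytes followed by the machine's MAC address repeated 16 times. The sketch below only builds that payload; the function name and the MAC address are made up.

```python
def magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet payload for a MAC address.

    Accepts addresses written with ':' or '-' separators, or with none.
    """
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError("MAC address must contain exactly 12 hex digits")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    # Synchronization stream of six 0xFF bytes, then the MAC 16 times.
    return b"\xff" * 6 + mac_bytes * 16


# Example with a made-up address; the payload is 6 + 16 * 6 = 102 bytes.
packet = magic_packet("00:11:22:33:44:55")
```

A server using this mechanism would typically broadcast the payload on the sleeping machine's subnet when it sees traffic addressed to that machine.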
Manual Deskterity: An Exploration of Simultaneous Pen + Touch Direct Input
Our research showcases a conceptual model for bimanual input that spans a wide range of form factors, including smart phones, slates, and tabletop systems, using a combination of pen, touch, motion-sensing, and voice input modalities. The pen writes, the hands manipulate, and the combination of modalities yields new tools, as well as compelling over-the-shoulder playback experiences for human-human communication.
Mobile-Search and Advertisement-Cache Architecture
We will show how to improve the mobile-search user experience by caching popular search results on mobile devices. First, a community-based cache is created by mining the most popular queries in mobile-search logs. Over time, the cache is personalized by adding all the new user search queries. An analysis of four months of mobile-search logs shows that, on average, 66 percent of the search queries submitted by a user can be answered by caching 2,500 links on a 1MB cache. Our prototype implementation in Windows Mobile demonstrates responses 16 times faster and 23 times more energy-efficient compared with querying through a 3G link. Our prototype also demonstrates how our caching architecture can enable monetization of mobile local search without hurting the mobile user experience. A rich set of ads is first cached on the phone. Because all the ads are locally stored, finding and displaying a mobile local ad is extremely fast, enabling us to display ads instantly to a mobile user as a query is being typed.
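The two-stage cache described above (a community seed followed by personalization) can be sketched as follows. The class name, the log format of `(query, link)` pairs, and the `fetch_remote` callback are assumptions made for illustration. The sketch also omits eviction, so the cache can grow past its initial capacity as personal queries are added.

```python
from collections import Counter


class SearchCache:
    """Hypothetical on-device cache seeded from a community query log."""

    def __init__(self, community_log, capacity):
        # community_log is an iterable of (query, link) pairs.
        log = list(community_log)
        counts = Counter(query for query, _ in log)
        links = dict(log)  # keeps the last link seen for each query
        # Seed the cache with the `capacity` most popular queries.
        self.entries = {q: links[q] for q, _ in counts.most_common(capacity)}
        self.hits = 0
        self.misses = 0

    def search(self, query, fetch_remote):
        """Answer from the cache if possible; otherwise fetch and remember."""
        if query in self.entries:
            self.hits += 1
            return self.entries[query]
        self.misses += 1
        link = fetch_remote(query)   # slow path, e.g. over the cellular link
        self.entries[query] = link   # personalize: keep the user's new query
        return link


# Example with a made-up log and a stand-in for the network lookup.
log = [("weather", "w.example")] * 3 + [("news", "n.example")] * 2 \
      + [("rare", "r.example")]
cache = SearchCache(log, capacity=2)
first = cache.search("rare", fetch_remote=lambda q: q + ".example")
second = cache.search("rare", fetch_remote=lambda q: q + ".example")
```

The first lookup of an unpopular query goes to the network, and the repeat is answered locally, which is the effect personalization is meant to capture.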
Thread Ownership of Memory
This technology helps programmers find bugs in their multithreaded programs. This is important because multicore processors are now prevalent, so most programs will have to be multithreaded to run faster, and the resulting complexity (together with the current relative dearth of programmers with expertise in writing parallel code) can lead to bugs. This research is a collaboration between Microsoft Research Cambridge and the University of Maryland, College Park.