Big Data Analytics

Established: October 18, 2012

We conduct research in the area of algorithms and systems for processing massive amounts of data. Our work aims at pushing the boundary of computer science in the area of algorithms and systems for large-scale computations. Our mission is to achieve major technological breakthroughs in order to facilitate new systems and services relying on efficient processing of big data.

Research Areas

Database queries – How can we efficiently resolve database queries on massive amounts of input data? Here the input data may be presented in the form of a distributed data stream.

 

Machine learning – How can we efficiently solve large-scale machine learning problems? Here the input data may be massive, stored in a distributed cluster of machines.

 

Distributed computing – How can we efficiently solve large-scale optimization problems in distributed computing environments? For example, how can we efficiently solve large-scale combinatorial problems, e.g. processing of large scale graphs?

Talks

  • Small Summaries for Big Data

    Graham Cormode – AT&T Research

    July 2nd, 2012, 16:00-17:00, Microsoft Research Cambridge, Jasmine Room

    Abstract: In dealing with big data, we often need to look at a small summary to get the big picture. Over recent years, many new techniques have been developed which allow important properties of large distributions to be extracted from compact and easy-to-build summaries. In this talk, I’ll highlight some examples of different types of summarization: sampling, sketching, and special-purpose. Then I’ll outline the road ahead for further development of such summaries.

    About the speaker: Homepage (opens in new tab)

    Testing Properties of Discrete Distributions

    Tugkan Batu – London School of Economics

    May 15th, 2012, 16:00-17:00, Microsoft Research Cambridge, Lecture Theatre Small

    Abstract: In this talk, we will survey some algorithmic results for several fundamental statistical inference tasks. The algorithms are given access only to i.i.d. samples from the discrete input distributions and make inferences based on these samples. The main focus of this research is the sample complexity of each task as a function of the domain size for the underlying discrete probability distributions. The inference tasks studied include (i) similarity to a fixed distribution (i.e., goodness-of-fit); (ii) similarity between two distributions (i.e., homogeneity); (iii) independence of joint distributions; and (iv) entropy estimation. For each of these tasks, an algorithm with sublinear sample complexity is presented (e.g., a goodness-of-fit test on a discrete domain of size $n$ is shown to require $O(sqrt{n}polylog(n))$ samples). Given certain extra information on the distributions (such as the distribution is monotone or unimodal with respect to a fixed total order on the domain), the sample complexity of these tasks become polylogarithmic in the domain size.

    About the speaker: Homepage (opens in new tab)

    Streaming Pattern Matching

    Ely Porat – Bar-Ilan University, Israel

    May 1st, 2012, 15:00-16:00, Microsoft Research Cambridge, Primrose Room

    Abstract: We present a fully online randomized algorithm for the classical pattern matching problem that uses merely O(log m) space, breaking the O(m) barrier that held for this problem for a long time. Our method can be used as a tool in many practical applications, including monitoring Internet traffic and firewall applications. In our online model we first receive the pattern P of size m and preprocess it. After the preprocessing phase, the characters of the text T of size n arrive one at a time in an online fashion. For each index of the text input we indicate whether the pattern matches the text at that location index or not. Clearly, for index i, an indication can only be given once all characters from index i till index i+m-1 have arrived. Our goal is to provide such answers while using minimal space, and while spending as little time as possible on each character (time and space which are in O(poly(log n))).

    About the speaker: Homepage (opens in new tab)

    Basic Probabilistic Load Balancing Mechanisms

    Tom Friedetzky – Durham University

    April 24th, 2012, 15:00-16:00, Microsoft Research Cambridge, Lecture Theatre Large

    Abstract: In this talk we will discuss a number of basic yet useful load balancing mechanisms for parallel and distributed computations, based on random allocation (“balls into bins”) and randomised work stealing. The focus will be on approaches that lend themselves to thorough mathematical analysis but that, due to their simplicity and general easiness on assumptions, may be considered to be good, all-purpose performers. The talk will mention theoretical results and occasionally hint at proof strategies but most parts will be accessible to a general audience.

    About the speaker: Homepage (opens in new tab)

    Large Scale Semidefinite Programming

    Alexandre d’Aspremont – Ecole Polytechnique, France

    April 17th, 2012, 15:00-16:00, Microsoft Research Cambridge, Primrose Room

    Abstract: The talk will start by a brief introduction on semidefinite programming. It will discuss some recent advances in large scale semidefinite programming solvers, detailing in particular stochastic smoothing techniques for the maximum eigenvalue function.

    Joint work with Noureddine El Karoui at U.C. Berkeley.

    About the speaker: Homepage (opens in new tab)

    The Online Approach to Machine Learning

    Nicolo Cesa-Bianchi – Universita degli Studi di Milano

    February 28th, 2012, 14:00-15:00, Microsoft Research Cambridge, Large Lecture Theatre

    Abstract: Online learning has become a standard tool in machine learning and large-scale data analysis. Learning is viewed as a repeated game between an adaptive agent and an ever-changing environment. Within this simple paradigm, one can model a variety of sequential decision tasks simply by specifying the interaction protocol and the resource constraints on the agent. In the talk we will first highlight some of the main features of online learning algorithms, such as simplicity, locality, scalability, and robustness. Then, we will describe algorithmic applications to specific learning scenarios (partial feedback, attribute-efficient, multitask, semi-supervised, transductive, and more) showing how diverse settings can be effectively captured within the same conceptual framework.

    About the speaker: Nicolo Cesa-Bianchi (opens in new tab)

    Logic and Analysis Issues in Web-based Data Integration

    Michael Benedikt – Oxford University

    January 31st, 2012, 14:00-15:00, Microsoft Research Cambridge, Primrose Room

    Abstract: A prime driver for much database research over the past decade has been providing unified structured (relational) query interfaces on top of web-based datasources. There are a range of issues that come up in doing this I will talk try to give an idea of a few of them, focusing on several we have worked on at Oxford: analysis of dynamic access plans, answerability of queries on the Web, and analysis of web pages to support query answering.

    This is joint work with Pierre Bourhis, Tim Furche, Georg Gottlob, Andreas Savvides, and Pierre Senellart.

  • Visitors

    • Srikanta Tirthapura, Iowa State University, to join May 2013
    • Ely Porat, Bar-Ilan University, Israel, June 2012

    Interns

    • Nan Li, University of California Santa Barbara, 2012/13
    • Charalampos Tsourakakis, CMU, 2012
    • Zengfeng Huang, Hong Kong University of Science and Technology, 2012
    • Alan Roytman, UCLA, 2012
    • Zhenming Liu, Harward University, now a post-doc at Princeton University, 2011
    • Dinkar Vasudevan, EPFL, 2008

    Partners

    • Microsoft Online Services Division

 

People

Portrait of Bozidar Radunovic

Bozidar Radunovic

Senior Principal Researcher

Portrait of Christos Gkantsidis

Christos Gkantsidis

Principal Researcher

Portrait of Thomas Karagiannis

Thomas Karagiannis

Principal Researcher