Big Data Analytics Lecture Series

# Small Summaries for Big Data

Graham Cormode
AT&T Research

July 2nd, 2012, 16:00-17:00, Microsoft Research Cambridge, Jasmine Room

Abstract: In dealing with big data, we often need to look at a small summary to get the big picture. Over recent years, many new techniques have been developed which allow important properties of large distributions to be extracted from compact and easy-to-build summaries. In this talk, I'll highlight some examples of different types of summarization: sampling, sketching, and special-purpose. Then I'll outline the road ahead for further development of such summaries.

# Testing Properties of Discrete Distributions

Tugkan Batu
London School of Economics

May 15th, 2012, 16:00-17:00, Microsoft Research Cambridge, Lecture Theatre Small

Abstract: In this talk, we will survey some algorithmic results for several fundamental statistical inference tasks. The algorithms are given access only to i.i.d. samples from the discrete input distributions and make inferences based on these samples. The main focus of this research is the sample complexity of each task as a function of the domain size for the underlying discrete probability distributions. The inference tasks studied include (i) similarity to a fixed distribution (i.e., goodness-of-fit); (ii) similarity between two distributions (i.e., homogeneity); (iii) independence of joint distributions; and (iv) entropy estimation. For each of these tasks, an algorithm with sublinear sample complexity is presented (e.g., a goodness-of-fit test on a discrete domain of size $n$ is shown to require $O(\sqrt{n}polylog(n))$ samples). Given certain extra information on the distributions (such as the distribution is monotone or unimodal with respect to a fixed total order on the domain), the sample complexity of these tasks become polylogarithmic in the domain size.

# Streaming pattern matching

Ely Porat
Bar-Ilan University, Israel

May 1st, 2012, 15:00-16:00, Microsoft Research Cambridge, Primrose Room

Abstract: We present a fully online randomized algorithm for the classical pattern matching problem that uses merely O(log m) space, breaking the O(m) barrier that held for this problem for a long time. Our method can be used as a tool in many practical applications, including monitoring Internet traffic and firewall applications. In our online model we first receive the pattern P of size m and preprocess it. After the preprocessing phase, the characters of the text T of size n arrive one at a time in an online fashion. For each index of the text input we indicate whether the pattern matches the text at that location index or not. Clearly, for index i, an indication can only be given once all characters from index i till index i+m-1 have arrived. Our goal is to provide such answers while using minimal space, and while spending as little time as possible on each character (time and space which are in O(poly(log n))).

# Basic Probabilistic Load Balancing Mechanisms

Tom Friedetzky
Durham University

April 24th, 2012, 15:00-16:00, Microsoft Research Cambridge, Lecture Theatre Large

Abstract: In this talk we will discuss a number of basic yet useful load balancing mechanisms for parallel and distributed computations, based on random allocation ("balls into bins") and randomised work stealing. The focus will be on approaches that lend themselves to thorough mathematical analysis but that, due to their simplicity and general easiness on assumptions, may be considered to be good, all-purpose performers. The talk will mention theoretical results and occasionally hint at proof strategies but most parts will be accessible to a general audience.

# Large Scale Semidefinite Programming

Alexandre d'Aspremont
Ecole Polytechnique, France

April 17th, 2012, 15:00-16:00, Microsoft Research Cambridge, Primrose Room

Abstract: The talk will start by a brief introduction on semidefinite programming. It will discuss some recent advances in large scale semidefinite programming solvers, detailing in particular stochastic smoothing techniques for the maximum eigenvalue function.

Joint work with Noureddine El Karoui at U.C. Berkeley.

# The Online Approach to Machine Learning

Nicolo Cesa-Bianchi
Universita degli Studi di Milano

February 28th, 2012, 14:00-15:00, Microsoft Research Cambridge, Large Lecture Theatre

Abstract: Online learning has become a standard tool in machine learning and large-scale data analysis. Learning is viewed as a repeated game between an adaptive agent and an ever-changing environment. Within this simple paradigm, one can model a variety of sequential decision tasks simply by specifying the interaction protocol and the resource constraints on the agent. In the talk we will first highlight some of the main features of online learning algorithms, such as simplicity, locality, scalability, and robustness. Then, we will describe algorithmic applications to specific learning scenarios (partial feedback, attribute-efficient, multitask, semi-supervised, transductive, and more) showing how diverse settings can be effectively captured within the same conceptual framework.