# The Machine Learning and Optimization Group at Microsoft Research, Redmond

I'm a Principal Researcher in, and manager of, the Machine Learning and Optimization (MLOP) group at Microsoft Research. With the broad acceptance of machine learning throughout industry, it has become increasingly important to further the field through basic research validated by real world applications. We work on optimization theory and practice, with a focus typically on large datasets using distributed computation, and on the foundations of machine learning, but guided by real world needs. Our current work spans the development of machine learning solutions for small, cheap devices; of principled approaches to crowdsourcing for labeling, including optimal incentive mechanisms; of new, efficient methods for function optimization; and of new approaches to intelligent assistants.

# My Own Research

I'm interested in the theory and practice of machine learning, optimization methods, and the intersection of machine learning with natural language processing and semantic modeling. I'm currently working on crowd sourced solutions to learning language for applications such as intelligent assistants.

## Modeling Meaning in Text

M. Richardson, E. Renshaw and I created a freely available dataset called MCTest for work on the machine comprehension of text. We gathered 500 short stories using a vocabulary that a typical seven year old would know. In this sense, the dataset is open domain, yet its scope is still limited. For each story, we also gathered four multiple choice questions, and for each question, four answers, one of which is correct. Many of the questions are designed to require more than one sentence from the story to be answered correctly. A simple token-based baseline system gets about 60% of the questions correct (when guessing would achieve 25%, in expectation). It is our hope that the gap between this number and how well a human does (which is in the high nineties) will spur research in semantic modeling. The data can be downloaded, and publications and results found, here.

G. Zweig and I also constructed a sentence completion challenge dataset that we hope will prove useful for testing text-based semantic modeling methods - you can find the details here.

## The Predicate Depth: Finding Robust, Accurate Classifiers

Typically, one approaches a supervised machine learning problem by writing down an objective function and finding a hypothesis that minimizes it. This is equivalent to finding the Maximum A Posteriori (MAP) hypothesis for a Boltzmann distribution. However, MAP is not a robust statistic: one can construct examples where the MAP solution disagrees with the weighted majority vote on every instance. Instead of directly finding the MAP solution, one can divide the problem into a part that assigns a posterior distribution over the function class, based on training data, and a part that, given this posterior, finds a function (or ensemble of functions) that it is hoped will generalize well. We define the predicate depth, and its median, over binary classification functions, and we show that the predicate median is robust and has good generalization guarantees. It also has the advantage of being a single deterministic function, and so avoids problems with sampling models (non-deterministic solutions) and ensembles (speed, and interpretability). We present algorithms to approximate it and also show how they can be used to approximate the Tukey median in polynomial time. This is joint work with R. Gilad-Bachrach (JMLR, 2013).

## Ranking for Information Retrieval

Information retrieval measures such as Normalized Discounted Cumulative Gain (NDCG), Mean Average Precision (MAP), and Mean Reciprocal Rank (MRR), are difficult to optimize directly, since viewed as functions of the model parameters, they are either flat or discontinuous everywhere. LambdaRank was an attempt to solve this, and in fact, in this [paper, tech report], we verified that LambdaRank indeed directly optimizes NDCG, and we further demonstrated that LambdaRank can easily be adapted to directly optimize MAP and MRR. This work was done using neural nets as the underlying model. In this paper, we showed that boosted tree classifiers can make excellent rankers; this, together with the fact that on large, artificial data sets, boosted trees can do arbitrarily well, whereas neural nets cannot, suggests trying the LambdaRank idea (which applies to any model for which a gradient of the cost can be defined, not only neural nets) with boosted trees. The resulting algorithm is called LambdaMART [paper, tech report]. Our first ranking algorithm for Search, RankNet, is still an excellent method for training using pairwise preferences (for example, from clicks), and can also easily be adapted to work with boosted trees This paper won the 2015 ICML Test of Time award). For an overview of these algorithms see this [tech report].

We won the 2011 Yahoo! Learning to Rank Challenge (Track 1). We used an ensemble of 12 models, 8 of which were LambdaMART (plus 2 LambdaRank neural nets and 2 MART regression models). We were the only team we know of who directly optimized for the measure used in the competition (Expected Reciprocal Rank), demonstrating the flexibility of the LambdaMART approach. In Track 1, which was the standard Web search ranking problem, 312 teams submitted at least two models (and over a thousand submitted at least one). See here for more details (our team name was Ca_{3}Si_{2}O_{7}, the chemical formula for Rankinite).

## Audio Fingerprinting

You have an incoming stream of audio and you'd like to know what's playing. Our RARE (*Robust Audio Recognition Engine*) system can identify any one of about a quarter million songs in real time using about 10% CPU on an 833 MHz PC. On 36 hours of noisy test audio, it achieves 0.2% false positives at 4.10^{-6} false negative rate. Confirmation fingerprints can be used to significantly further improve these error rates, with almost no extra CPU cost. Our work is currently used in Windows Media Player and in the Zune media player. Audio fingerprinting has lots of applications: for example, to automatically construct audio thumbnails, and to automatically find duplicate audio clips on your PC. Our main innovations are a method to train for robustness to distortions, and a lookup method that is over an order of magnitude faster than competing methods for this problem. See here for details. Joint work with J. Platt, J. Goldstein, E. Renshaw, C. Herley.