Machine Learning Seminar Series

The Machine Learning Seminar series hosts a variety of talks pertaining to machine learning theory, algorithms, applications and related fields. It is aimed at a broad audience of researchers and students. Members of the local academic community are welcome to attend.

# Upcoming Speakers

## Universal low-rank matrix recovery from Pauli measurements

Yi-Kai Liu, NIST, Tuesday, April 10, 2012 4:00 PM–5:00 PM

Description
We study the problem of reconstructing an unknown matrix M, of rank r and dimension d, using O(rd poly log d) Pauli measurements. This has applications to compressed sensing methods for quantum state tomography. It is a non-commutative analogue of a well-studied problem: recovering a sparse vector from a few of its Fourier coefficients.

We give a solution to this problem based on the restricted isometry property (RIP), which improves on previous results using dual certificates. In particular, we show that almost all sets of O(rd log^6 d) Pauli measurements satisfy the rank-r RIP. This implies that M can be recovered from a fixed ("universal") set of Pauli measurements, using nuclear-norm minimization (e.g., the matrix Lasso), with nearly-optimal bounds on the error. Our proof uses Dudley's inequality for Gaussian processes, together with bounds on covering numbers obtained via entropy duality.

## Learning Models of Natural Language Semantics

Percy Liang, Google and Stanford, Thursday, April 5, 2012 4:00 PM–5:00 PM

Description
Statistical NLP has made a lot of progress on labeling, parsing, and in general, analyzing the structure of sentences. However, the "meaning" of a sentence goes beyond its internal linguistic structure, but importantly involves the connection between the sentence and the rest of the world. In this talk, I will present several projects that attempt to model this connection. First, I describe an unsupervised model which can induce the correspondences between sentences and a database of facts about the world. Next, we present a model to answer questions via latent logical forms, which can be learned from question-answer pairs alone. Finally, we model language use as a two-player game between a speaker and a listener, and show how a decision-theoretic speaker strategy can greatly improve accuracy on a natural language generation task.

# Past Speakers

## Using More Data to Speed-up Training Time

Shai Shalev-Shwartz, The Hebrew University, Monday, February 27, 2012 4:00 PM–5:00 PM

Description
Recently, there has been a growing interest in understanding how more data can be leveraged to reduce the required training runtime. I will describe a systematic study of the runtime of learning as a function of the number of available training examples, and underscore the main high-level techniques. In particular, a formal positive result will be presented, showing that even in the unrealizable case, the runtime can decrease exponentially while only requiring a polynomial growth of the number of examples. The construction corresponds to a synthetic learning problem and an interesting open question is if and how the tradeoff can be shown for more natural learning problems. I will spell out several interesting candidates of natural learning problems for which we conjecture that there is a tradeoff between computational and sample complexity.
Based on joint work with Ohad Shamir and Eran Tromer.

## Some statistical models for nonlinear feature learning

Tong Zhang, Rutgers, Monday, November 7, 2011 4:00 PM – 5:00 PM

Description
Feature learning is becoming an important topic in machine learning, but most studies have focused on developing heuristic methods without proper justifications. In this talk I'd discuss two application problems we investigated recently that led to simple and effective feature learning algorithms based on sound statistical models.

The first model is unsupervised learning of super-vector coding for image classification, and the second model is supervised learning of regularized decision forest for generic nonlinear prediction problems. Both methods can be regarded as learning nonlinear basis functions for statistical additive models. The first method is mathematically motivated by the computation of Hellinger distance between two empirical probability distributions in high dimension, and our solution achieves the state of the art performance in image classification. The second method is motivated by boosted decision trees where we are able to apply the modern concept of structured sparsity regularization to design algorithms that are superior.

## Sample Complexity Bounds for Differentially Private Classification

Kamalika Chaudhuri, UC San-Diego, Monday, October 31, 2011 4:00 PM – 5:00 PM

Description
We study the problem of differentially private classification -- namely, learning a classifier from sensitive training data, while still preserving the privacy of individuals in the data set. A natural question to ask is: what is the sample requirement of a learning algorithm that guarantees a certain level of privacy and accuracy?

Previous work studied this question in the context of discrete data distributions, and provided an upper bound, and lower bounds for some specific classes. In the first part of this talk, we show a lower bound that holds for general classification problems which obey certain fairly-loose conditions.

In the second part of this talk, we consider this question in the context of learning infinite hypothesis classes on continuous data distributions. We show that in this case, even for very simple hypothesis classes, any algorithm that uses a finite number of examples and guarantees differential privacy must fail to return an accurate classifier for at least some unlabeled data distributions.

This result is unlike the case with either finite hypothesis classes or discrete data domains, in which distribution-free private learning is possible (as was shown by previous work).

Joint work with Daniel Hsu.

## Correction for Hidden Confounders in Genetic Analyses

Jennifer Listgarten, MSR Los Angeles, Monday, October 17, 2011 4:00 PM – 5:00 PM

Description
Understanding the genetic underpinnings of disease is important for screening, treatment, drug development, and basic biological insight. Genome-wide associations, wherein genetic markers are systematically scanned for association with disease are one window into disease processes. Additionally, finding out which parts of our DNA affect particular intermediary processes such as gene expression (eQTL) can shed light on important pathways. Naively, these associations can be identified using a simple statistical test on each hypothesized association. However, a wide variety of confounders lie hidden in the data, leading to both spurious associations and missed associations if not properly addressed. My talk will focus on a class of statistical models, linear mixed models (LMMs), which correct for these confounders.

Although LMMs are among the richest class of models used today for genome-wide association studies, their use on contemporary data sets is limited because the required computations are prohibitive when many individuals are analyzed, with runtime increasing as the cube of the number of individuals and memory footprint increasing as the square of the number of individuals. I'll describe factored spectrally transformed linear mixed models (FaST-LMM), an algorithm for genome-wide association studies that scales linearly with cohort size in both run time and memory use. I'll also present a new model that jointly corrects for two particular kinds of hidden structure: (1) population structure (e.g., race, family relatedness), and (2) microarray expression artifacts (e.g., batch effects), when these confounders are unknown.

## Generalization Bounds and Consistency for Latent-Structural Probit and Ramp Loss

David McAllester, Toyota Technological Institute - Chicago, Monday, September 12, 2011 4:00 PM – 5:00 PM

Description
Linear predictors are scale-insensitive --- the prediction does not change when the weight vector defining the predictor is scaled up or down. This implies that direct regularization of the performance of a linear predictor with a scale sensitive regularizer (such as a norm of the weight vector) is meaningless. Linear predictors are typically learned by introducing a scale-sensitive surrogate loss function such as the hinge loss of an SVM. However, no convex surrogate loss function can be consistent in general --- in finite dimension SVMs are not consistent. Here we generalize probit loss and ramp loss to the latent-structural setting and show that both of these loss functions are consistent in arbitrary dimension for an arbitrary bounded task loss. Empirical experience with probit loss and ramp loss will be briefly discussed.

## Domain Adaptation Learning Utilizing Unlabeled Target Samples

Shai Ben-David, University of Waterloo, Monday, June 13, 2011 4:00 PM – 5:00 PM

Description
The domain adaptation problem in machine learning occurs when the test data generating distribution differs from the one that generates the training data. We introduce two properties of the training and test data generating distributions: (1) A weight ratio assumption which bounds the ratio of the probability weights between the train/test marginal distributions for a restricted class of subsets of the domain. (2) A novel probabilistic relaxation of the classic Lipschitzness condition applied on the underlying labeling function. By analyzing a nearest neighbor algorithm, we show that these two assumptions suffice for domain adaptation learning to succeed and provide finite sample guarantees. We complement our positive learnability results with lower bounds, showing that none of these two assumptions suffices on its own. Furthermore, we discuss implications of our results for \emph{proper} domain adaptation learning, where the learner is required to output a predictor from some pre-determined class. We propose a learning algorithm that utilizes an unlabeled training sample from the target distribution in a beneficial way---we prove that for proper learning, any algorithm that has no access to such unlabeled target samples, will produce a significantly larger error than ours. To the best of our knowledge, this is the first result proving that the use of target-generated unlabeled samples can be utilized to achieve a clear performance advantage.

This is joint work with Ruth Urner and Shai Shalev-Shwartz

## Robust High-Dimensional Principal Component Analysis

Shie Mannor, The Technion, Monday, June 6, 2011 4:00 PM – 5:00 PM

Description
The analysis of very high dimensional data - data sets where the dimensionality of each observation is comparable to or even larger than the number of observations - has drawn increasing attention in the last few decades due to a broad array of applications, from DNA microarrays to video processing, to consumer preference modeling and collaborative filtering, and beyond.

As we discuss, many of our tried-and-true statistical techniques fail in this regime. We revisit one of the perhaps most widely used statistical techniques for dimensionality reduction: Principal Component Analysis (PCA). In the standard setting, PCA is computationally efficient, and statistically consistent, i.e., as the number of samples goes to infinity, we are guaranteed to recover the optimal low-dimensional subspace. On the other hand, PCA is well-known to be exceptionally brittle – even a single corrupted point can lead to arbitrarily bad PCA output. We consider PCA in the high-dimensional regime, where a constant fraction of the observations in the data set are arbitrarily corrupted. We show that standard techniques fail in this setting, and discuss some of the unique challenges (and also opportunities) that the high-dimensional regime poses. For example, one of the (many) confounding features of the high-dimensional regime, is that the noise magnitude dwarfs the signal magnitude, i.e., SNR goes to zero. While in the classical regime, dimensionality recovery would fail under these conditions, sharp concentration-of-measure phenomena in high dimensions provide a way forward. Then, for the main part of the talk, we propose a High-dimensional Robust Principal Component Analysis (HR-PCA) algorithm that is computationally tractable, robust to contaminated points, and easily kernelizable. The resulting subspace has a bounded deviation from the desired one, for up to 50% corrupted points. No algorithm can possibly do better than that, and there is currently no known polynomial-time algorithm that can handle anything above 0%. Finally, unlike ordinary PCA algorithms, HR-PCA has perfect recovery in the limiting case where the proportion of corrupted points goes to zero.

## Learning Graphical Models: Trees, Latent trees and Random Graphs

Anima Anandkumar, UC Irvine, Monday, May 9, 2011 4:00 PM – 5:00 PM

Description
Capturing complex interactions among a large set of variables is a challenging task. Probabilistic graphical models or Markov random fields provide a graph-based framework for capturing such dependencies. Graph estimation is an important task, since it reveals
important relationships among the variables. I will describe graph estimation for two model classes.

In the first part of my talk, I will focus on the class of Erdos-Renyi random graphs. We propose simple local algorithms for graph estimation which use only low-order statistics of the data. These algorithms are provably consistent and achieve optimal sample complexity, when there is a decay of long-range correlations in the model.

The second part of my talk is motivated by the following question: can we discover hidden influences acting on the observed variables? We consider latent tree models for capturing hidden relationships. We develop novel algorithms for learning the unknown tree structure. Our algorithm is amenable to efficient implementation of the Bayesian Information Criterion (BIC) to tradeoff the number of hidden variables with the accuracy of the model fitting. Experiment on the S&P 100 financial data reveals sectorization of the companies and experiment on the newsgroups data automatically categorizes words into different topics.

## Computation Meets Statistics: Fast Global Convergence for High-Dimensional Statistical Recovery

Martin Wainwright, UC Berkeley, Monday, April 25, 2011, 4:00 PM – 5:00 PM

Description
Many statistical estimators are based on convex optimization problems formed by the weighted sum of a loss function with a norm-based regularizer. We analyze the convergence rates of first-order gradient methods for solving such problems within a high-dimensional framework that allows the data dimension $d$ to grow with (and possibly exceed) the sample size $n$. This high-dimensional structure precludes the usual global assumptions---namely, strong convexity and smoothness conditions---that underlie classical optimization analysis. We define appropriately restricted versions of these conditions, and show that they are satisfied with high probability for various statistical models. Under these conditions, our theory guarantees that a simple first-order method has a globally geometric rate of convergence up to the statistical precision of the model, meaning the typical distance between the true unknown parameter $\theta^*$ and the optimal solution $\widehat{\theta}$. This guarantee is substantially sharper than previous analyses of global convergence for specific methods that yielded only sublinear rates, or linear convergence only up to noise levels. Our analysis applies to a wide range of $M$-estimators and statistical models, including sparse linear regression using Lasso generalized linear models, low-rank matrix recovery, and matrix decomposition.

Joint work with Alekh Agarwal and Sahand Negahban, UC Berkeley.

Carlos Guestrin, CMU
Monday, March 14, 2011
4:00 PM – 5:00 PM

Description
The internet has allowed us to democratize information, but has also brought us a new challenge: Information Overload. With the huge amounts of information out there, it can be very difficult to keep track of what's most important to us. This problem is not limited to the web: The democratic process can potential benefit from open debate, but it is very difficult for an individual to understand all the facets of an issue. And, in science, with the proliferation of research publications, it is difficult for anyone to stay on top of the developments in their own fields, let alone make the types of connections between fields that can be transformative. In this talk, I will cover three conceptual efforts that together could help tame information overload:
* Richer query models - going beyond simple keyword search.
* Personalization - helping the user adaptively express their information needs.
* Structured outputs - expressing relationships between documents, rather than simply returning a list of documents, allowing the user, for example, to connect he dots between news articles or form a "metro map" representing the information landscape.

This work is lead by Khalid El-Arini, Dafna Shahaf and Gaurav Veda.

## Scalable Prediction Algorithms for Networked Data

Nicolò Cesa-Bianchi, University of Milan
Monday, February 28, 2011
4:00 PM – 5:00 PM

Description
The talk is about classification problems that arise in the analysis of networked data. In node classification, we want to categorize the nodes of a given graph. For this problem we describe a randomized algorithm that, in many interesting cases,
runs in time sublinear in the network size. This algorithm has optimal predictive performance (to within log factors) in the online learning model, and empirically performs close to less scalable competitors like Label Propagation. A related problem is that of link classification, where the task is to determine whether a given relationship between two nodes is positive or negative. In social networks, this may be used to infer the sentiment between two individuals. We describe and theoretically motivate the use of the correlation clustering index as a measure of regularity for this problem. We conclude by describing an active learning strategy for predicting the link signs that is both efficient and more accurate than standard heuristics for this problem.

Joint work with: C. Gentile, F. Vitale, and G. Zappella.

## Cluster trees, Near-neighbor Graphs, and Continuum Percolation

Sanjoy Dasgupta, UCSD
Monday, February 14, 2011
4:00 PM – 5:00 PM

Description
What information does the clustering of a finite data set reveal about the underlying distribution from which the data were sampled? This basic question has proved elusive even for the most widely-used clustering procedures. One natural criterion is to seek clusters that converge (as the data set grows) to regions of high density. When all possible density levels are considered, this is a hierarchical clustering problem where the sought limit is called the “cluster tree”. We give a simple algorithm for estimating this tree that implicitly constructs a multiscale hierarchy of near-neighbor graphs on the data points. We show that the procedure is consistent, answering an open problem of Hartigan. We also obtain rates of convergence, using a percolation argument that gives insight into how near-neighbor graphs should be constructed.

## Learning Models for Scalable Object Class Recognition and Retrieval

Lorenzo Torresani, Dartmouth College
Monday, January 24, 2011
4:00 PM – 5:00 PM

Description
In this talk I will discuss new methods enabling visual search of object classes in large-scale image data sets. This application poses challenging requirements on the system design: the object class must be learned efficiently at query-time from few examples; compact image descriptors must be used to support storage of large collections; finally, recognition must have highly sublinear cost with respect to the database size in order to provide results in real-time.

Traditional object categorization methods do not meet these requirements as they employ high-dimensional descriptors (typical storage size is several Kbytes per image) and they are computationally expensive to train and test. Several hashing methods have been proposed to speed up image retrieval. While such techniques are effective for search of near-duplicate images, they have been proven hard to extend to object class retrieval.

We propose to address this problem through a two-step procedure: for each query, a small subset of images is selected by a fast prefilter, after which a more accurate yet efficient classifier is applied to produce a ranked list of the images in the candidate set. Scalability to large collections is achieved by constraining the prefilter to be of a form directly executable on traditional text-search engines, which perform real-time search in databases of several billion documents. In addition, we introduce a new compact descriptor for images which allows the construction of highly efficient classifiers with good ranking and recognition performance: even when the representation is compressed to 200 bytes per image, classification accuracy is comparable with the state-of-the-art but at orders of magnitude lower computational cost.

This talk reports on joint work with Martin Szummer and Andrew Fitzgibbon

## Constraints Driven Learning for Natural Language Understanding

Dan Roth, University of Illinois at Urbana-Champaign
Monday, December 13, 2010
4:00 PM – 5:00 PM

Description
Intelligent Information Access and Extraction suggest significant challenges for Natural Language analysis. Tasks of interest include semantic role labeling (determining who did what to whom, when and where), information extraction (identifying events, entities and relations), transliteration of names, and textual entailment (determining whether one utterance is a likely consequence of another). The challenges addressed in these and other natural language understanding tasks often involve assigning values to sets of interdependent variables and thus frequently necessitate performing global inference that accounts for these interdependencies.

This talk presents research on Constrained Conditional Models (CCMs), a framework that augments probabilistic models with declarative constraints as a way to support such decisions. Specifically, we will focus on new algorithms for training these global models using indirect supervision signals. Learning models for structured tasks is difficult partly since generating supervision signals is costly. We show that it is often easy to obtain a related indirect supervision signal, and discuss several options for deriving this supervision signal, including inducing it from the world's response to the model's actions.

Our learning framework is “Constraints Driven” in the sense that it allows and even gains from global inference that combines statistical models with expressive declarative knowledge (encoded as constraints), modeled via Constrained Conditional Models. Experimental results show the significant contribution of easy-to-get indirect supervision on several NLP tasks including information extraction, Transliteration and Textual Entailment.

## Graph Estimation

John Lafferty, Carnegie Mellon
Monday, Nov 29, 2010
4:00 PM – 5:00 PM

Description

Graphical modeling has proven to be an extremely useful abstraction in
statistical machine learning. The starting point is the graph of a
distribution. While usually the graph is assumed given, we are
interested in estimating the graph from data. The general problem is
extremely challenging. We present progress under different types of
nonparametric assumptions. One approach is something we call "the
nonparanormal," which uses copula methods to transform the variables
by monotonic functions, relaxing the strong distributional assumptions
made by the Gaussian graphical model. Another approach is to restrict
the family of allowed graphs to spanning forests, enabling the use of
fully nonparametric density estimation. Finally, we introduce a
framework for "graph-valued regression" where the graph is estimated
conditioned on a recursive partition of a set of covariates. The
resulting methods are (mostly) easy to understand and use,
theoretically well supported, and effective for modeling high
dimensional data.

Joint work with Haijie Gu, Anupam Gupta, Han Liu,
Pradeep Ravikumar, Martin Wainwright, Larry Wasserman, Min Xu, and
Shuheng Zhou.

## More Data Less Work: Machine Learning as Stochastic Optimization

Nati Srebro, Toyota Technological Institute
Monday, Nov 8, 2010
4:00 PM – 5:00 PM

Description

A common modern approach to learning is to phrase training as an empirical optimization problem, and then invoke an optimization procedure in order to carry out the training. It is thus common to study machine learning problems as (deterministic) optimization problems, yielding optimization approaches with runtimes that grow with the size of the training set. But in this talk I will argue and demonstrate how, from a machine learning perspective, optimization runtime should only *decrease* as a function of training set size, and how this is often achieved by simple stochastic optimization approaches. In fact, I will show how machine learning is in itself a stochastic optimization problem, and should be studied as such.

## Machines That Learn To Segment Images: A Crucial Technology For Connectomics

Sebastian Seung, MIT & HHMI
Monday, Oct 18, 2010
4:00 PM – 5:00 PM

Description
A connectome is the totality of connections between a brain's neurons---the analog of a genome for the mind. In principle, a connectome can be found by analyzing images of a brain from 3d electron microscopy. In practice, this has only been done for the nervous system of the tiny nematode C. elegans, because the image analysis is too laborious. To find connectomes of brains more like our own, two image analysis tasks must be automated. The first is the identification of synapses. The second is the reconstruction of the branching shapes of neurons, which is a special case of image segmentation. In spite of decades of research, existing segmentation algorithms are all far too inaccurate to be useful for finding connectomes. I will describe our efforts to improve upon the state of the art. Our approach is based on the warping error and the Rand error, which quantify the performance of an algorithm by its disagreement with “true” segmentations of example images. These metrics penalize topological disagreements, but are less sensitive to minor disagreements over the location of boundaries. In particular, the metrics penalize splits and mergers, errors that matter for finding neural connectivity. We have developed methods (called BLOTC and MALIS respectively) of learning image segmentation by minimizing the warping and Rand errors. Both methods yield algorithms with significantly better segmentation performance than the naive learning method of minimizing pixel error, which in turn outperforms traditional hand-designed methods. While these results are encouraging, much higher accuracy will be needed for finding connectomes efficiently. I will conclude by speculating about the future directions that could achieve this.

## Mitigating Manhole Events in Manhattan

Cynthia Rudin, MIT
Monday, Oct 4
4:00 PM – 5:00 PM

Description
There are a few hundred manhole events (fires, explosions, smoking manholes) in New York City every year, often stemming from problems in the low voltage secondary electrical distribution network that provides power to residential and commercial customers. I will describe work on the Columbia/Con Edison Manhole Events project, the goal of which is to predict manhole events in order to assist Con Edison (NYC's power utility company) with its pre-emptive maintenance and repair programs. The success of this project relied heavily on an understanding of the current state of Manhattan's grid, which has been built incrementally over the last century. Several different sources of Con Edison data are used for the project, the most important of which is the ECS (Emergency Control Systems) database consisting of trouble tickets from past events that are mainly recorded in free text by Con Edison dispatchers.

In this talk, I will discuss the data mining process by which we transformed extremely raw historical Con Edison data into a ranking model that predicts manhole vulnerability. A key aspect in this process is a machine learning method for ranking, called the "P-Norm Push." Our ranked lists are currently assisting with the prioritization of future inspections and repairs in Manhattan, Brooklyn, and the Bronx.

This is joint work with Becky Passonneau, Axinia Radeva, and several others at the Center for Computational Learning Systems at Columbia University, and Delfina Isaac and Steve Ierome at Con Edison

## Dual Decomposition (and Belief Propagation) for Inference in Natural Language Processing

Michael Collins, MIT
Monday, Sept 20
4:00 PM – 5:00 PM

Description
There has been a long history in combinatorial optimization of methods that exploit structure in complex problems, using methods such as dual decomposition or Lagrangian relaxation. These methods leverage the observation that complex inference problems can often be decomposed into efficiently solvable sub-problems. Thus far, however, these methods are not widely used in NLP.

In this talk I'll describe recent work on inference algorithms for NLP based on dual decomposition. In the first part of the talk I'll describe algorithms that efficiently solve problems that would conventionally be solved through "intersections" of dynamic programs (e.g., the classic construction of Bar-Hillel et al). The algorithms are simple and efficient, building on standard dynamic programming algorithms as oracles; they provably solve a linear programming relaxation of the original problem; and empirically they very often lead to an exact solution to the original problem.

In the second part of the talk I'll describe work on non-projective parsing using dual decomposition. Although decoding with non-projective models is intractable in the general case, we leverage two observations: 1) Decoding for individual head-words can be accomplished using dynamic programming. 2) Decoding for arc-factored models can be accomplished using directed minimum-weight spanning tree

(MST) algorithms. The resulting algorithms give exact solutions on over 98% of sentences across a broad set of languages.

While the focus of this talk is on NLP problems, there are close connections to inference methods, in particular belief propagation, for graphical models. Our work was inspired by recent work that has used dual decomposition as an alternative to belief propagation in Markov random fields.

This is joint work with Tommi Jaakkola, Terry Koo, Sasha Rush, and David Sontag.

##### Where

Microsoft Research New England
First Floor Conference Center
One Memorial Drive, Cambridge, MA

##### Arrival Guidance

Upon arrival, be prepared to show a picture ID and sign the Building Visitor Log when approaching the Lobby Floor Security Desk. Alert them to the name of the event you are attending and ask them to direct you to the appropriate floor. Typically the talks are located in the First Floor Conference Center, however sometimes the location may change.

##### Mailing List

To subscribe to the mailing list, send an email to this email address.

To unsubscribe, send an email to this email address.