SCHEDULE
---------------------------------
Date
July 11 - 12, 2007
Venue
Peking University (see
campus map)
Sciences Building No.4 (Life Science Building
No.5 The Summer Palace Street, Haidian District, Beijing.
Abstract
Statistical Failure
Diagnosis in Software and Systems
Alice Zheng
Carnegie Mellon University
As
software and systems become increasingly complex, the task of
debugging also becomes increasingly difficult. Manual diagnosis
can require sifting through millions of lines of code and output
logs. In addition, large systems contain many components, each
complex on its own, and often interacting in unexpected ways.
I
present a case study illustrating how statistical machine
learning algorithms, along with appropriate system
instrumentation, can aid in failure diagnosis. I propose a
statistical software debugging framework that collects
information from past successes and failures via fine-grained
instrumentation of the program and then analyzes this
information to locate suspicious program predicates. I discuss
the algorithmic challenges of the approach, and demonstrate a
bi-clustering algorithm that is effective at simultaneously
clustering failed runs and selecting useful predicates. Using
this approach, it took a programmer 20 minutes to find a
long-standing bug in a real-world
software program which he had never seen before.
This work is done in collaboration with Ben Liblit (U.
Wisconsin, Madison), Michael Jordan (U.C. Berkeley), Alex Aiken
and Mayur Naik (Stanford).
Predictive Learning
via Rule Ensembles
Jerome H. Friedman
Stanford University
General regression and classification models are constructed as
linear combinations of simple rules derived from the data. Each
rule consists of a conjunction of a small number of simple
statements concerning the values of individual input variables.
These rule ensembles are shown to
produce predictive accuracy comparable to the best methods.
However their principal advantage lies in interpretation.
Because of its simple form, each rule is easy to understand, as
is its influence on individual predictions, selected subsets of
predictions, or globally over the entire space of joint input
variable values. Similarly, the degree of relevance of the
respective input variables can be assessed globally, locally in
different regions of the input space, or at individual
prediction points. Techniques are presented for automatically
identifying those variables that are involved in interactions
with other variables, the strength and degree of those
interactions, as well as the identities of the other variables
with which they interact. Graphical representations are used to
visualize both main and interaction effects.
Tournament
Algorithms for Reducing Multiclass Prediction to Binary Prediction
John Langford
Toyota
Technological Institute at Chicago
I'll describe a new family of algorithms for reducing from
multiclass classification to binary classification. The first
member of this family is a single elimination tournament, as is
often used in gaming competitions. Other members of this family
correspond to double elimination and higher order elimination
tournaments.
These reductions are provably consistent, implying that a zero
regret binary classifier induces a zero regret multiclass
classifier. They are also robust, implying that small errors in
binary classification imply small errors in multiclass
classification. The robustness is superior to all other known
algorithms.
Variational methods
in Markov random fields: Stability and the benefits of choosing
the "wrong" model
Martin Wainwright
Department of Statistics, UC Berkeley
Markov random fields (MRFs) are widely used in different areas
of statistics and computer science, including statistical
machine learning, spatial statistics, statistical signal
processing, and communication theory. Variational methods are a
class of algorithms, either of an exact or approximate nature,
for solving challenging computational problems that arise in
applications of MRF models, including marginalization and
computing likelihoods.
The
first part of this talk will be a tutorial introduction to
variational methods in Markov random fields, including mean
field theory, belief propagation and extensions, and various
higher-order relaxations. In the second part, we consider a
problem of joint model estimation and prediction, in which a
Markov random field is first estimated on the basis of one data
set, and then used to perform prediction on new samples.
Here we demonstrate the remarkable fact that estimating the
``wrong'' model can be beneficial in the computation-limited
setting. Indeed, we prove that an inconsistent estimate yields
better prediction performance than the true underlying model,
even in the infinite data limit. We discuss various other
contexts in which similar behavior is likely to occur when
approximate algorithms are used.
A Geometric
Perspective on Machine Learning and Related Algorithms
Partha
Niyogi
University of
Chicago
Increasingly, we face machine learning problems in very high
dimensional spaces. We proceed with the intuition that although
natural data lives in very high dimensions, they have relatively
few
degrees of freedom. One way to formalize this intuition is to
model the data as lying on or near a low dimensional manifold
embedded in the high dimensional space. This point of view leads
to a new class of algorithms that are "manifold motivated" and a
new set of theoretical questions that surround their analysis. A
central construction in these algorithms is a graph or
simplicial complex that is data-derived and we will relate the
geometry of these to the geometry of the underlying manifold.
Applications to embedding, clustering, classification, and
semi-supervised learning will be considered.
Boosting: Theory and
Applications
Robert Schapire
Princeton University
Boosting is a general method for producing a very accurate
classification rule by combining rough and moderately inaccurate
"rules of thumb." While rooted in a theoretical framework of
machine learning, boosting has been found to perform quite well
empirically. In this talk, I will focus on the boosting
algorithm AdaBoost, and explain the underlying theory of
boosting, including our explanation of why boosting often does
not suffer from overfitting, as well as recent developments in
the ongoing debate surrounding this explanation. Some of the
myriad other theoretical points of view that have been taken on
AdaBoost will also be presented. Finally, I will describe some
recent applications of boosting.
Some improved
boosting methods
Tong Zhang
Rutgers University &
Yahoo
We
consider the problem of fitting complex loss functions using an
arbitrary regression weak learner. Our approach, extending
gradient boosting, is based on greedy optimization of quadratic
upper bounds (or approximation) of the loss functions. This idea
allows us to present a rigorous convergence analysis of the
resulting algorithms based on various results for greedy least
squares methods. We compare a few methods, their convergence
rates, as well as the underlying mechanism for complexity
control.
Semi-Supervised
Learning and Interactive Image Segmentation
Changshui Zhang
Tsinghua Univeristy
The
recent years have witnessed a surge of interest in
semi-supervised learning in machine learning community. And
graph based semi-supervised learning has been becoming on of the
hot research topics in semi-supervised learning (SSL) fields.
Despite its empirical success, there are still some problems
that are not properly resolved. In this talk, I will introduce
our novel graph based SSL algorithm called Linear Neighborhood
Propagation (LNP). Based on the assumption that each data point
can be linearly reconstructed from its neighborhood, LNP can
automatically get the similarities between pairwise points, and
such similarities will be used to propagate the labels to the
unlabeled points. We prove theoretically the convergence of our
algorithm and also derive a simple way for using LNP to
induction. To demonstrate the effectiveness of LNP, we apply it
to interactive image segmentation, in which the user first give
some hints on the foreground and background, and the LNP
algorithm will be used to propagate these hints to the rest of
the pixels. We will see that in this application LNP can do a
perfect job and give us satisfactory results.
Multi-Instance
Learning Revisited
Zhi-Hua Zhou
Nanjing University
Multi-instance learning is a branch of machine learning which
has been studied by many researchers during the past decade.
Here the training set is composed of many bags each contains
many instances; a bag is positive if it contains at least one
positive instance and negative otherwise; although the labels of
the training bags are known, the labels of the instances are
unknown. In this talk, we will show that by assuming i.i.d.
instances, which is an assumption explicitly or implicitly taken
by most previous studies, multi-instance learning can be
regarded as a special case of semi-supervised learning, which is
another branch of machine learning, and we suggest that only
i.i.d. bags should be assumed in future research of
multi-instance learning.
Group and
Hierachical Sparsity through Composite Absolute Penalities
Bin Yu
University of
California at Berkeley
Extracting useful information from high-dimensional data is the
focus of today's statistical research and practice. Penalized
loss function minimization has been shown to be effective for
this task both theoretically and empirically. With the vitures
of both regularization and sparsity, the L1-penalized L2
minimization method Lasso has been popular.
In
this talk, we combine different norms including L1 to form an
intelligent penalty in order to add side information to the
fitting of a regression or classification model to obtain
reasonable estimates. Specifically, we introduce the Composite
Absolute Penalties (CAP) family which allows the grouping and
hierarchical relationships between the predictors to be
expressed. The CAP framework covers and goes beyond existing
works including grouped lasso and elastic nets.CAP penalties are
built by defining groups and combining the properties of norm
penalties at the across group and within group levels. Grouped
selection occurs for non-overlapping groups.
In
that case, we give a Bayesian interpretation for CAP
penalties.Hierarchical variable selection is reached by defining
groups with particular overlapping patterns.In the computation
aspect, we propose using the BLASSO and cross-validation to
obtain CAP estimates. For a subfamily of CAP estimates, we
introduce the iCAP ($\infty$-CAP) algorithm to trace the entire
regularization path for the grouped selection problem. Within
this subfamily, unbiased estimates of the degrees of freedom
(df) are derived allowing the regularization parameter to be
selected with cross-validation. CAP is shown to improve on the
predictive performance of the LASSO in a series of simulated
experiments including those when $p>>n$ and when the grouping is
wrong. When the complexity of a model is properly calculated,
iCAP is seen to be parsimonious as well in the experiments.
AdaRank: A Boosting
Algorithm for Information Retrieval
Hang Li
Microsoft Research
Asian
In
this talk, I will introduce a machine learning algorithm for
constructing ranking functions in information retrieval, which
we have developed recently. I will first give an overview on
information retrieval using machine learning. I will explain the
ranking models, evaluation criteria, and data collection methods
widely used in information retrieval. The models include BM25,
Language Model for IR, the general model of Learning to Rank.
The evaluation criteria include NDCG (Normalized Discounted
Cumulative Gain) and Kendall's tau. The data collection methods
are explicit data labeling and implicit data labeling. I will
discuss that Ideally a learning algorithm would train a ranking
model that could directly optimize the performance measures with
respect to the training data, in contrast to the fact that many
existing methods were only able to train ranking models by
minimizing loss functions loosely related to the performance
measures. I will next introduce our proposal on AdaRank, a
boosting algorithm, which can minimize a loss function directly
defined on the performance measures. I will also explain the
theoretical results and empirical results we have obtained with
regard to AdaRank. This is joint work with Jun Xu at MSRA.
Hilbert Space
representations and tests for distributions
Alex Smola
University of
Australia
Entropy and mutual information are popular methods for dealing
with distributions. In this talk I will discuss alternative
measures for computing distances between distributions based on
Reproducing Kernel Hilbert Spaces. This leads to a family of
methods ranging from computationally efficient two sample tests
over tests for independence of random variables, algorithms for
feature selection, clustering, covariate shift correction and
variants of maximum variance unfolding. The talk will discuss
several of those applications and algorithms arising from this
setting.
Bayesian Models of Social Networks and Text
Andrew McCallum
University of Massachusetts Amherst
In this talk I will describe recent research at the intersection
of information extraction, data mining and social network
analysis. In particular I will focus on how such a combination
can be made both robust and scalable---showing that the typical
brittle cascading of errors from text extraction to data mining
can be avoided with unified probabilistic inference in graphical
models, and showing that these models can be made efficient with
recent methods of approximate inference and learning. After
briefly introducing conditional random fields, I will
demonstrate their use in joint models of extraction, entity
resolution, and sequence alignment. I will then describe several
methods of integrating textual data into a particular type of
data mining---social network analysis. In one model, we
discover role-similarity between entities by examining not only
network connectivity, but also the words communicated on on
those edges; I'll demonstrate this method on a large corpus of
email data.
In another model, we discover groups of entities and the
"topical" conditions under which different groupings arise; I'll
demonstrate this on coalition discovery from many years worth of
voting records in the United Nations. I'll conclude with further
examples of graphical models successfully applied to relational
data, as well as discussion of their applicability to trend
analysis, expert-finding and bibliometrics.
Learning to Rank: From pariwise approach to Listwise approach
Tie-Yan Liu
Microsoft Research Asia
The talk is concerned with learning to rank, which is to
construct a model or a function for ranking objects. Learning to
rank is useful for document retrieval, collaborative filtering,
and many other applications. Several methods for learning to
rank have been proposed, which take object pairs as `instances¨
in learning. We refer to them as the pairwise approach. Although
the pairwise approach offers advantages, it ignores the fact
that ranking is a prediction task on list of objects. This talk
postulates that learning to rank should adopt the listwise
approach in which lists of objects are used as `instances¨ in
learning. We will mainly introduce an algorithm named ListNet,
in which we propose a listwise loss function based on
probability model. Experimental results on information retrieval
show that the proposed listwise approach performs better than
the pairwise approach.
Approximation of Haar Distributed Matrices and Limiting
Distributions of Eigenvalues of Jacobi Ensembles
Tiefeng
Jiang
University of Minnesota
I will first present tools to approximate the entries of a large
dimensional real and complex Jacobi ensembles with independent
complex Gaussian random variables. Based on this, we obtain the
Tracy-Widom law of the largest singular values of the Jacobi
emsemble. Moreover, the circular law, the Marchenko-Pastur law,
the central limit theorem, and the laws of large numbers for the
spectral norms are also obtained.