![]() |
Statistical Learning Theory |
![]() |
Thore Graepel and Ralf Herbrich
In the Bayesian framework learning is viewed as an update of prior belief in the
target concept in light of the data. The learning algorithms considered in the
PAC-Bayesian framework are the Gibbs classifier (or better classification
strategy) and the Bayes classifiers. Thus, once a learning algorithm is
expressed as an update of a probability distribution such that the Bayes
classifier is equivalent to the classifier at hand, the whole (and powerful)
machinery of PAC-Bayesian can be applied. We are particularly interested in the
study of linear classifiers. A geometrical picture reveals that the margin is
only an approximation to the real quantity controlling generalisation error: the
volume of consistent classifiers to the whole volume of parameter space. Hence
we are able to remove awkward constant as well as permanent complexity terms
from known margin bounds. The resulting bound can considered as tight and
practically useful for bound based model selection.
It is generally accepted that inferring a function given only a finite amount of
data is only possible if one restricts the model of the data (descriptive
approach) or the model of the dependencies (predictive approach) respectively.
Over the last years sparse models have become very popular in the field of
prediction. Sparse models are additive models f(x)=∑αi k(x,xi) - also referred
to as kernel models - where at the solution for a finite amount of data only a
few αi are unequal to zero. Surprisingly Bayesian schemes (like Gaussian
Processes, Ridge Regression) which do not enforce such a sparseness show good
generalization behaviour. We look for an explanation of this fact and finally
for the usefulness of sparseness in Machine Learning
Over the last few decades a few frameworks to study the generalisation
performance of learning algorithms have been emerged. Among the few, the most
remarkable are the VC framework (empirical risk minimisation algorithms),
compression framework (on-line algorithms and compression schemes) and the
luckiness framework (structural risk minimisation algorithms). However, apart
from the compression framework none of the frameworks has considered the
generalisation error of the single hypothesis learned by a given learning
algorithm but resorted to the more stringent requirement of uniform convergence.
The algorithmic luckiness framework is an extension of the powerful luckiness
framework which studies the generalisation error of particular learning
algorithms relative to some prior knowledge about the target concept encoded via
a luckiness function.
In many learning problems, the goal is not simply to classify objects into one
of a fixed number of classes; instead, a ranking of objects is desired. This is
the case, for example, in information retrieval problems, where one is
interested in retrieving documents from some database that are relevant to a
given query or topic. In such problems, one wants to return to the user a list
of documents that contains relevant documents at the top and irrelevant
documents at the bottom; in other words, one wants a ranking of the documents
such that relevant documents are ranked higher than irrelevant documents. We
study generalisation properties of the area under the ROC curve (AUC), a
quantity that has been advocated as an evaluation criterion for the bipartite
ranking problem. The AUC is a different term than the error rate used for
evaluation in classification problems; consequently, existing generalisation
bounds for the classification error rate cannot be used to draw conclusions
about the AUC. In this project we develop large deviation bounds for the AUC and
use combinatorial quantities such as the bipartite rank-shatter coefficient to
obtain uniform convergence bounds for the problem of bipartite ranking.
Ralf Herbrich and Robert C. Williamson. Algorithmic Luckiness. 2002. Journal of Machine Learning Research (Gziped Postscript)
Machine Learning and Perception—Machine Learning—Learning Theory