![]() |
Kernel Methods and Convex Optimisation |
![]() |
Thore Graepel and Ralf Herbrich
From a Bayesian perspective support vector machines choose the hypothesis
corresponding to the largest possible hyper-sphere that can be inscribed in
version space, i.e. in the space of all consistent hypotheses given a training
set. Those boundaries of version space which are tangent to the hyper-sphere
define the support vectors. An alternative and potentially better approach is to
construct the hypothesis using the whole of version space. This is achieved by
using a Bayes point machine which finds the midpoint of the region of
intersection of all hyper-planes bisecting version space into two halves of
equal volume (the Bayes point). It is known that the centre of mass of version
space approximates the Bayes point. We investigate estimating the centre of mass
by averaging over the trajectory of a billiard ball bouncing in version space.
Experimental results indicate that Bayes point machines consistently outperform
support vector machines.
Knowledge about local invariances with respect to given pattern transformations
can greatly improve the accuracy of classification. Previous approaches are
either based on regularisation or on the generation of virtual (transformed)
examples. We developed a new framework for learning linear classifiers under known
transformations based on semi-definite programming. We present a new learning
algorithm - the Semi-Definite Programming Machine (SDPM) - which is able to find
a maximum margin hyper-plane when the training examples are polynomial
trajectories instead of single points. The solution is found to be sparse in
dual variables and allows to identify those points on the trajectory with
minimal real-valued output as virtual support vectors. Extensions to segments of
trajectories, to more than one trans- formation parameter, and to learning with
kernels are discussed. In experiments we use a Taylor expansion to locally
approximate rotational invariance in pixel images from USPS and find
improvements over known methods.
In contrast to the standard machine learning tasks of classification and metric
regression we investigate the problem of predicting variables of ordinal scale,
a setting referred to as ordinal regression. This problem arises frequently in
the social sciences and in information retrieval where human preferences play a
major role. Whilst approaches proposed in Statistics rely on a probability model
of a latent (unobserved) variable we present a distribution independent risk
formulation of ordinal regression which allows us to derive uniform convergence
bounds. Applying these bounds we present a large margin algorithm that is based
on a mapping from objects to scalar utility values though classifying pairs of
objects. We give experimental results for an Information Retrieval task which
show that our algorithm outperforms more naive approaches to ordinal regression
such as support vector classification and support vector regression in the case
of more than two ranks.
We present a framework for sparse Gaussian process methods which uses forward
selection with criteria based on information-theoretical principles, previously
suggested for active learning. In contrast to most previous work on sparse GPs,
our goal is not only to learn sparse predictors (which can be evaluated in O(d)
rather than O(n), d<<n, n the number of training points), but also to perform
training under strong restrictions on time and memory requirements. The scaling
of our method is at most O(n d^2), and in large real-world classification
experiments we show that it can match prediction performance of the popular
support vector machine (SVM), yet it requires only a fraction of the training
time. In contrast to the SVM, our approximation produces estimates of predictive
probabilities (`error bars'), allows for Bayesian model selection and is less
complex in implementation.
Machine Learning and Perception—Machine Learning—Kernel Methods