Thore Graepel's Home Page

 

Publications

Publications (no abstract)

Publications

 

2007

  • Ralf Herbrich, Tom Minka, Thore Graepel. TrueSkill(TM): A Bayesian Skill Rating System 2007 Advances in Neural Information Processing Systems 20
    We present a new Bayesian skill rating system which can be viewed as a generalisation of the Elo system used in Chess. The new system tracks the uncertainty about player skills, explicitly models draws, can deal with any number of competing entities and can infer individual skills from team results. Inference is performed by approximate message passing on a factor graph representation of the model. We present experimental evidence on the increased accuracy and convergence speed of the system compared to Elo and report on our experience with the new rating system running in a large-scale commercial online gaming service under the name of TrueSkill.

2006

  • David Stern, Ralf Herbrich, Thore Graepel. Bayesian Pattern Ranking for Move Prediction in the Game of Go 2006 Proceedings of the International Conference of Machine Learning 873--880
    We investigate the problem of learning to predict moves in the board game of Go from game records of expert players. In particular, we obtain a probability distribution over legal moves for professional play in a given position. This distribution has numerous applications in computer Go, including serving as an efficient stand-alone Go player. It would also be effective as a move selector and move sorter for game tree search and as a training tool for Go players. Our method has two major components: a) a pattern extraction scheme for efficiently harvesting patterns of given size and shape from expert game records and b) a Bayesian learning algorithm (in two variants) that learns a distribution over the values of a move given a board position based on the local pattern context. The system is trained on 181,000 expert games and shows excellent prediction performance as indicated by its ability to perfectly predict the moves made by professional Go players in 34\% of test positions.

  • David Stern, Ralf Herbrich, Thore Graepel. Learning To Solve Game Trees 2006 Proceedings of the International Conference of Machine Learning 873--880

2005

  • Shivani Agarwal, Thore Graepel, Ralf Herbrich, Dan Roth. A Large Deviation Bound for the Area Under the ROC Curve 2005 Advances in Neural Information Processing Systems 17
    The area under the ROC curve (AUC) has been advocated as an evaluation criterion for the bipartite ranking problem. We study large deviation properties of the AUC; in particular, we derive a distribution-free large deviation bound for the AUC which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on an independent test sequence. A comparison of our result with a corresponding large deviation result for the classification error rate suggests that the test sample size required to obtain an epsilon-accurate estimate of the expected accuracy of a ranking function with delta-confidence is larger than that required to obtain an epsilon-accurate estimate of the expected error rate of a classification function with the same confidence. A simple application of the union bound allows the large deviation bound to be extended to learned ranking functions chosen from finite function classes.

  • Shivani Agarwal, Thore Graepel, Ralf Herbrich, Sariel Har-Peled, Dan Roth. Generalization Error Bounds for the Area Under the ROC curve 2005 Journal of Machine Learning Research
    We study generalization properties of the area under the ROC curve (AUC), a quantity that has been advocated as an evaluation criterion for the bipartite ranking problem. The AUC is a different term than the error rate used for evaluation in classification problems; consequently, existing generalization bounds for the classification error rate cannot be used to draw conclusions about the AUC. In this paper, we define the expected accuracy of a ranking function (analogous to the expected error rate of a classification function), and derive distribution-free probabilistic bounds on the deviation of the empirical AUC of a ranking function (observed on a finite data sequence) from its expected accuracy. We derive both a large deviation bound, which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on a test sequence, and a uniform convergence bound, which serves to bound the expected accuracy of a learned ranking function in terms of its empirical AUC on a training sequence. Our uniform convergence bound is expressed in terms of a new set of combinatorial parameters that we term the bipartite rank-shatter coefficients; these play the same role in our result as do the standard VC-dimension related shatter coefficients (also known as the growth function) in uniform convergence results for the classification error rate. A comparison of our result with a recent uniform convergence result derived by Freund et al. (2003) for a quantity closely related to the AUC shows that the bound provided by our result can be considerably tighter.

  • Thore Graepel, Ralf Herbrich, John Shawe-Taylor. PAC-Bayesian compression bounds on the prediction error of learning algorithms for classification 2005 Machine Learning
    We consider bounds on the prediction error of classification algorithms based on sample compression. We refine the notion of a compression scheme to distinguish permutation and repetition invariant and non-permutation and repetition invariant compression schemes leading to different prediction error bounds. Also, we extend known results on compression to the case of non-zero empirical risk. We provide bounds on the prediction error of classifiers returned by mistakedriven online learning algorithms by interpreting mistake bounds as bounds on the size of the respective compression scheme of the algorithm. This leads to a bound on the prediction error of perceptron solutions that depends on the margin a support vector machine would achieve on the same training sample. Furthermore, using the property of compression we derive bounds on the average prediction error of kernel classifiers in the PAC-Bayesian framework. These bounds assume a prior measure over the expansion coefficients in the data-dependent kernel expansion and bound the average prediction error uniformly over subsets of the space of expansion coefficients.

  • Shyamsundar Rajaram, Thore Graepel, Ralf Herbrich. Poisson-Networks: A Model for structured point processes 2005 Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics
    Modelling structured multivariate point process data has wide ranging applications like understanding neural activity, developing faster file access systems and learning dependencies among servers in large networks. In this paper, we develop the Poisson network model for representing multivariate structured Poisson processes. In our model each node of the network represents a Poisson process. The novelty of our work is that waiting times of a process are modelled by an exponential distribution with a piecewise constant rate function that depends on the event counts of its parents in the network in a generalised linear way. Our choice of model allows to perform exact sampling from arbitrary structures. We adopt a Bayesian approach for learning the network structure. Further, we discuss fixed point and sampling based approximations for performing inference of rate functions in Poisson networks.

2004

  • Thore Graepel, Ralf Herbrich. Invariant Pattern Recognition by Semidefinite Programming Machines 2004 Advances in Neural Information Processing Systems 16 33--40
    Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Previous approaches are either based on regularisation or on the generation of virtual (transformed) examples. We develop a new framework for learning linear classifiers under known transformations based on semidefinite programming. We present a new learning algorithm - the Semidefinite Programming Machine (SDPM) - which is able to find a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is found to be sparse in dual variables and allows to identify those points on the trajectory with minimal real-valued output as virtual support vectors. Extensions to segments of trajectories, to more than one transformation parameter, and to learning with kernels are discussed. In experiments we use a Taylor expansion to locally approximate rotational invariance in pixel images from USPS and find improvements over known methods.

  • Thore Graepel, Ralf Herbrich, Andriy Kharechko, John Shawe-Taylor. Semidefinite Programming by Perceptron Learning 2004 Advances in Neural Information Processing Systems 16 457--464
    We present a modified version of the perceptron learning algorithm (PLA) which solves semidefinite programs (SDPs) in polynomial time. The algorithm is based on the following three observations: (i) Semidefinite programs are linear programs with infinitely many (linear) constraints; (ii) every linear program can be solved by a sequence of constraint satisfaction problems with linear constraints; (iii) in general, the perceptron learning algorithm solves a constraint satisfaction problem with linear constraints in finitely many updates. Combining the PLA with a probabilistic rescaling algorithm (which, on average, increases the size of the feasable region) results in a probabilistic algorithm for solving SDPs that runs in polynomial time. We present preliminary results which demonstrate that the algorithm works, but is not competitive with state-of-the-art interior point methods.

  • Thore Graepel, Ralf Herbrich, Julian Gold. Learning to Fight 2004 Proceedings of the International Conference on Computer Games: Artificial Intelligence, Design and Education
    We apply reinforcement learning to the problem of finding good policies for a fighting agent in a commercial computer game. The learning agent is trained using the SARSA algorithm for on-policy learning of an action-value function represented by linear and neural network function approximators. We discuss the selection and construction of features, actions, and rewards as well as other design choices necessary to integrate the learning process into the game. The learning agent is trained against the built-in AI of the game with different rewards encouraging aggressive or defensive behaviour. We show that the learning agent finds interesting (and partly near optimal) policies in accordance with the reward functions provided. We also discuss the particular challenges arising in the application of reinforcement learning to the domain of computer games.

  • Shivani Agarwal, Thore Graepel, Ralf Herbrich, Dan Roth. A Large Deviation Bound for the Area Under the ROC Curve 2004 Advances in Neural Information Processing Systems 17
    The area under the ROC curve (AUC) has been advocated as an evaluation criterion for the bipartite ranking problem. We study large deviation properties of the AUC; in particular, we derive a distribution-free large deviation bound for the AUC which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on an independent test sequence. A comparison of our result with a corresponding large deviation result for the classification error rate suggests that the test sample size required to obtain an epsilon-accurate estimate of the expected accuracy of a ranking function with delta-confidence is larger than that required to obtain an epsilon-accurate estimate of the expected error rate of a classification function with the same confidence. A simple application of the union bound allows the large deviation bound to be extended to learned ranking functions chosen from finite function classes.

  • David Stern, Thore Graepel, David MacKay. Modelling Uncertainty in the Game of Go 2004 Advances in Neural Information Processing Systems 16 33--40
    Go is an ancient oriental game whose complexity has defeated attempts to automate it. We suggest using probability in a Bayesian sense to model the uncertainty arising from the vast complexity of the game tree. We present a simple conditional Markov random field model for predicting the pointwise territory outcome of a game. The topology of the model reflects the spatial structure of the Go board.We describe a version of the Swendsen-Wang process for sampling from the model during learning and apply loopy belief propagation for rapid inference and prediction. The model is trained on several hundred records of professional games. Our experimantal results indicate that the model successfully learns to predict territory despite its simplicity.

2003

  • Ralf Herbrich, Thore Graepel. Introduction to the Special Issue on Learning Theory 2003 Journal of Machine Learning Research 755--757 4

  • Nicol N. Schraudolph, Thore Graepel. Combining Conjugate Direction Methods with Stochastic Approximation of Gradients 2003 Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, AISTATS 2003 Christopher M. Bishop and Brendan Frey
    The method of conjugate directions provides a very effective way to optimize large, deterministic systems by gradient descent. In its standard form, however, it is not amenable to stochastic approximation of the gradient. Here we explore ideas from conjugate gradient in the stochastic (online) setting, using fast Hessian-gradient products to set up low-dimensional Krylov subspaces within individual mini-batches. In our benchmark experiments the resulting online learning algorithms converge orders of magnitude faster than ordinary stochastic gradient descent. Numerical experiments are carried out for both the linear, realisable as well as the non-linear, non-realisable case.

  • Hendrik Purwins, Thore Graepel, Benjamin Blankertz, Klaus Obermayer. Correspondence Analysis for Visulaizing Interplay of Pitch Class, Key, and Composer 2003 Perspectives in Mathematical Music Theory

  • Thore Graepel. Solving Noisy Linear Operator Equations by Gaussian Processes: Application to Ordinary and Partial Differential Equations 2003 Proceedings of the Twentieth International Conference on Machine Learning
    We formulate the problem of solving stochastic linear operator equations in a Bayesian Gaussian process (GP) framework. The solution is obtained in the spirit of a collocation method based on noisy evaluations of the target function at randomly drawn or deliberately chosen points. Prior knowledge about the solution is encoded in terms of the covariance kernel of the GP. As in GP regression, analytical expressions for the mean and variance of the estimated target function are obtained from which the solution to the operator equation follows by a manipulation of the kernel. Linear initial and boundary value constraints can be enforced by embedding the non-parametric model in a form that automatically satisfies the boundary conditions. The method is illustrated on a noisy linear first-order ordinary differential equation with initial condition and on a noisy second-order partial differential equation with Dirichlet boundary conditions.

  • Malte Kuss, Thore Graepel. The Geometry of Kernel Canonical Correlation Analysis 2003 Max Planck Institute for Biological Cybernetics, Tuebingen 108
    Canonical correlation analysis (CCA) is a classical multivariate method concerned with describing linear dependencies between sets of variables. After a short exposition of the linear sample CCA problem and its analytical solution, the article proceeds with a detailed characterization of its geometry. Projection operators are used to illustrate the relations between canonical vectors and variates. The article then addresses the problem of CCA between spaces spanned by objects mapped into kernel feature spaces. An exact solution for this kernel canonical correlation (KCCA) problem is derived from a geometric point of view. It shows that the expansion coefficients of the canonical vectors in their respective feature space can be found by linear CCA in the basis induced by kernel principal component analysis. The effect of mappings into higher dimensional feature spaces is considered critically since it simplifies the CCA problem in general. Then two regularized variants of KCCA are discussed. Relations to other methods are illustrated, e.g., multicategory kernel Fisher discriminant analysis, kernel principal component regression and possible applications thereof in blind source separation.

  • Hendrik Purwins, Thore Graepel, Benjamin Blankertz, Klaus Obermayer. Correspondence Analysis for Visualizing Interplay of Pitch Class, Key, and Composer 2003 Perspectives in Mathematical Music Theory
    We apply correspondence analysis for visualization of interdependence of pitch class & key and key & composer. A co-occurrence matrix of key & pitch class frequencies is extracted from score (Bach's WTC). Keys are represented as high-dimensional pitch class vectors. Correspondence analysis then projects keys on a planar 'keyscape'. Vice versa, on 'pitchscapes' pitch classes can also be embedded in the key space. In both scenarios a homogenous circle of fifths emerges in the scapes. We employ biplots to embed keys and pitch classes in the 'keyscape' to visualize their interdependence. After a change of co-ordinates the four-dimensional biplots can be interpreted as a configuration on a torus, closely resembling results from music theory and experiments in listener models. In conjunction with spectral analysis, correspondence analysis constitutes a cognitive auditory model. Correspondence analysis of the co-occurrence table of intensities of keys and pitch classes lets the circle of fifths evolve in the pitchscape. This model works on digitized recorded music, does not require averaging or normalization of the data, and does not implicitly use circularity inherent in the model. Statistics on key preference in composers yields a composer & key cooccurrence matrix. Then 'stylescapes' visualize relations between musical styles of particular composers and schools. The Biplotting technique links stylistic characteristics to favored keys. Interdependence of composers and schools is meaningfully visualized according to their key preferences.

2002

  • Ralf Herbrich, Thore Graepel. A PAC-Bayesian Margin Bound for Linear Classifiers 2002 IEEE Transactions on Information Theory 12 3140--3150 48
    We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training sample. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound, which was developed in the luckiness framework, and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and --- for maximum margins --- to a vanishing complexity term. In contrast to previous results, however, the new bound does depend on the dimensionality of feature space. The analysis shows that the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the fraction of hypothesis space consistent with the training sample. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal with respect to the new bound only if the feature vectors in the training sample are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only. Numerical simulations support this recommendation and demonstrate that the new error bound can be used for the purpose of model selection.

  • Nicol N. Schraudolph, Thore Graepel. Conjugate Directions for Stochastic Gradient Descent 2002 Proceedings of the International Conference on Neural Networks, ICANN 2002 J.R. Dorronsoro 2415 1351--1356 Lecture Notes in Computer Science
    The method of conjugate directions provides a very effective way to optimize large, deterministic systems by gradient descent. In its standard form, however, it is not amenable to stochastic approximation of the gradient. Here we explore ideas from conjugate gradient in the stochastic (online) setting, using fast Hessian-gradient products to set up low-dimensional Krylov subspaces within individual mini-batches. In our benchmark experiments the resulting online learning algorithms converge orders of magnitude faster than ordinary stochastic gradient descent. The experiments are restricted to the linear, realisable case.

  • Thore Graepel. Kernel matrix completion by semidefinite programming 2002 Proceedings of the International Conference on Neural Networks, ICANN 2002 J.R. Dorronsoro 2415 694--699 Lecture Notes in Computer Science
    We consider the problem of missing data in kernel-based learning algorithms. We explain how semidefinite programming can be used to perform an approximate weighted completion of the kernel matrix that ensures positive semidefiniteness and hence Mercer's condition. In numerical experiments we apply a support vector machine to the XOR classification task based on randomly sparsified kernel matrices from a polynomial kernel of degree 2. The approximate completion algorithm leads to better generalisation and to fewer support vectors as compared to a simple spectral truncation method at the cost of considerably longer runtime. We argue that semidefinite programming provides an interesting convex optimisation framework for machine learning in general and for kernel-machines in particular.

  • Thore Graepel. PAC-Bayesian Pattern Classification with Kernels: Theory, Algorithms, and an Application to the Game of Go 2002 Berlin, Germany Department of Computer Science, Technical Universi
    The thesis deals with problems of pattern classification in the framework of machine learning. The focus of the work is on kernel methods for the supervised classification of objects. The thesis gives a detailed introduction into the field of kernel algorithms and learning theory. New contributions include learning theoretical results in the PAC-Bayesian framework, efficient sampling algorithms for Bayesian classification in kernel space, and an application of kernel methods to pattern analysis in the game of Go. Learning Theory: In the PAC-Bayesian framework we derive new bounds on the prediction error of linear classifiers (in kernel space) in terms of the normalised margin achieved on the training sample, taking into account both the concentration of the training data and the margin distribution. Assuming sparseness of the dual variables we extend the PAC-Bayesian framework to data-dependent hypotheses. Finally, we prove egalitarian bounds on the probability of finding classifiers with high prediction error in subsets of hypothesis space with low empirical risk - results that emphasise the importance of model selection. Learning Algorithms: We discuss Bayesian classification in kernel space and identify Bayesian transduction and the Bayes point machine as optimal procedures for classification in a Bayesian sense. Assuming a uniform prior over normalised weights for a given kernel we devise sampling schemes for the resulting piecewise constant posterior: The kernel Gibbs sampler, a Markov chain Monte Carlo method able to deal with label noise, and the permutational perceptron sampler for large scale sampling. The classification techniques are tested on handwritten-digit recognition tasks and compare favourably to support vector machines when rejecting test data based on confidence measures. Application: We apply kernel classifiers to the domain of pattern classification in the Japanese game of Go, which serves as a test domain for classification problems involving

  • Thore Graepel, Nicol N. Schraudolph. Stable adaptive momentum for rapid online learning in nonlinear systems 2002 Proceedings of the International Conference on Neural Networks, ICANN 2002 J.R. Dorronsoro 2415 450--455 Lecture Notes in Computer Science
    We consider the problem of developing rapid, stable, and scalable stochastic gradient descent algorithms for optimisation of very large nonlinear systems. Based on earlier work by Orr et al. on adaptive momentum---an efficient yet extremely unstable stochastic gradient descent algorithm---we develop a stabilised adaptive momentum algorithm that is suitable for noisy nonlinear optimisation problems. The stability is improved by introducing a forgetting factor lambda between zero and one that smoothes the trajectory and enables adaptation in non-stationary environments. The scalability of the new algorithm follows from the fact that at each iteration the multiplication by the curvature matrix can be achieved in O(n) steps using automatic differentiation tools. We illustrate the behaviour of the new algorithm on two examples: a linear neuron with squared loss and highly correlated inputs, and a multilayer perceptron applied to the four regions benchmark task.

  • Nicol N. Schraudolph, Thore Graepel. Towards Stochastic Conjugate Gradient Methods 2002 Proceedings of the 9th International Conference on Neural Information Processing, ICONIP 2002
    The method of conjugate gradients provides a very effective way to optimize large, deterministic systems by gradient descent. In its standard form, however, it is not amenable to stochastic approximation of the gradient. Here we explore a number of ways to adopt ideas from conjugate gradient in the stochastic setting, using fast Hessian-vector products to obtain curvature information cheaply. In our benchmark experiments the resulting highly scalable algorithms converge about an order of magnitude faster than ordinary stochastic gradient descent.

2001

  • Ralf Herbrich, Thore Graepel. A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work 2001 Advances in Neural Information Processing Systems 13 224-230
    We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and --for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets.

  • Thore Graepel, Ralf Herbrich. A PAC-Bayesian Margin Distribution Bound for Kernel Classifiers (extended abstract) 2001

  • Ralf Herbrich, Thore Graepel, Colin Campbell. Bayes Point Machines 2001 Journal of Machine Learning Research 245-279 1
    Kernel-classifiers comprise a powerful class of non-linear decision functions for binary classification. The support vector machine is an example of a learning algorithm for kernel classifiers that singles out the consistent classifier with the largest margin, i.e. minimal real-valued output on the training sample, within the set of consistent hypotheses, the so-called version space. We suggest the Bayes point machine as a well-founded improvement which approximates the Bayes-optimal decision by the centre of mass of version space. We present two algorithms to stochastically approximate the centre of mass of version space: a billiard sampling algorithm and a sampling algorithm based on the well known perceptron algorithm. It is shown how both algorithms can be extended to allow for soft-boundaries in order to admit training errors. Experimentally, we find that --- for the zero training error case --- Bayes point machines consistently outperform support vector machines on both surrogate data and real-world benchmark data sets. In the soft-boundary/soft-margin case, the improvement over support vector machines is shown to be reduced. Finally, we demonstrate that the real-valued output of single Bayes points on novel test points is a valid confidence measure and leads to a steady decrease in generalisation error when used as a rejection criterion.

  • Thore Graepel, Ralf Herbrich, Robert C. Williamson. From Margin to Sparsity 2001 Advances in Neural Information Processing Systems 13 210--216
    We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the dual perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself.

  • Ralf Herbrich, Thore Graepel. Large Scale Bayes Point Machines 2001 Advances in Neural Information Processing Systems 13 528--534
    The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently been demonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set) - also known as the Bayes point - exhibits excellent generalisation abilities. However, the billiard algorithm as presented in [Herbrich et al., 2000] is restricted to small sample size because it requires O(m^2) of memory and O(N*m^2) computational steps where m is the number of training patterns and N is the number of random draws from the posterior distribution. In this paper we present a method based on the simple perceptron learning algorithm which allows to overcome this algorithmic drawback. The method is algorithmically simple and is easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digits which show that Bayes Point Machines are competitive with the current world champion, the support vector machine. In addition, the computational complexity of BPMs can be tuned by varying the number of samples from the posterior. Finally, rejecting test points on the basis of their (approximative) posterior probability leads to a rapid decrease in generalisation error, e.g. 0.1\% generalisation error for a given rejection rate of 10\%.

  • Thore Graepel, Mike Goutrie, Marco Krüger, Ralf Herbrich. Learning on Graphs in the Game of Go 2001 Proceedings of the Ninth International Conference on Artificial Neural Networks 347--352
    We consider the game of Go from the point of view of machine learning and as a well-defined domain for learning on graph representations. We discuss the representation of both board positions and candidate moves and introduce the common fate graph (CFG) as an adequate representation of board positions for learning. Single candidate moves are represented as feature vectors with features given by subgraphs relative to the given move in the CFG. Using this representation we train a support vector machine (SVM) and a kernel perceptron to discriminate good moves from bad moves on a collection of life-and-death problems and on 9x9 game records. We thus obtain kernel machines that solve Go problems and play 9x9 Go.

  • Thore Graepel, Ralf Herbrich. The Kernel Gibbs Sampler 2001 Advances in Neural Information Processing Systems 13 514--520
    We present an algorithm that samples the hypothesis space of kernel classifiers. Given a uniform prior over normalised weight vectors and a likelihood based on a model of label noise leads to a piecewise constant posterior that can be sampled by the kernel Gibbs sampler (KGS). The KGS is a Markov Chain Monte Carlo method that chooses a random direction in parameter space and samples from the resulting piecewise constant density along the line chosen. The KGS can be used as an analytical tool for the exploration of Bayesian transduction, Bayes point machines, active learning, and evidence-based model selection on small data sets that are contaminated with label noise. For a simple toy example we demonstrate experimentally how a Bayes point machine based on the KGS outperforms an SVM that is incapable of taking into account label noise.

2000

  • Thore Graepel, Ralf Herbrich, Klaus Obermayer. Bayesian Transduction 2000 Advances in Neural Information Processing Systems 12 456--462
    Transduction is an inference principle that takes a training sample and aims at estimating the values of a function at given points contained in the so-called working sample. Hence, transduction is a less ambitious task than induction which aims at inferring a functional dependency on the whole of input space. As a consequence, however, transduction provides a confidence measure on single predictions rather than classifiers, a feature particularly important for risk--sensitive applications. We consider the case of binary classification by linear discriminant functions (perceptrons) in kernel space. From the transductive point of view, the infinite number of perceptrons is boiled down to a finite number of equivalence classes on the working sample each of which corresponds to a polyhedron in parameter space. In the Bayesian spirit the posteriori probability of a labelling of the working sample is determined as the ratio between the volume of the corresponding polyhedron and the volume of version space. Then the maximum posteriori scheme recommends to choose the labelling of maximum volume. We suggest to sample version space by an ergodic billiard in kernel space. Experimental results on real world data indicate that Bayesian Transduction compares favourably to the well-known Support Vector Machine, in particular if the posteriori probability of labellings is used as a confidence measure to exclude test points of low confidence.

  • Thore Graepel, Ralf Herbrich, John Shawe-Taylor. Generalisation Error Bounds for Sparse Linear Classifiers 2000 Proceedings of the Thirteenth Annual Conference on Computational Learning Theory 298--303
    We provide small sample size bounds on the generalisation error of linear classifiers that are sparse in their dual representation given by the expansion coefficients of the weight vector in terms of the training data. These results theoretically justify algorithms like the Support Vector Machine, the Relevance Vector Machine and K-nearest-neighbour. The bounds are a-posteriori bounds to be evaluated after learning when the attained level of sparsity is known. In a PAC-Bayesian style prior knowledge about the expected sparsity is incorporated into the bounds. The proofs avoid the use of double sample arguments by taking into account the sparsity that leaves unused training points for the evaluation of classifiers. We furthermore give a PAC-Bayesian bound on the average generalisation error over subsets of parameter space that may pave the way combining sparsity in the expansion coefficients and margin in a single bound. Finally, reinterpreting a mistake bound for the classical perceptron algorithm due to Novikoff we demonstrate that our new results put classifiers found by this algorithm on a firm theoretical basis.

  • Ralf Herbrich, Thore Graepel, Klaus Obermayer. Large Margin Rank Boundaries for Ordinal Regression 2000 Advances in Large Margin Classifiers 115--132
    In contrast to the standard machine learning tasks of classification and metric regression we investigate the problem of predicting variables of ordinal scale, a setting referred to as ordinal regression. This problem arises frequently in the social sciences and in information retrieval where human preferences play a major role. Whilst approaches proposed in statistics rely on a probability model of a latent (unobserved) variable we present a distribution independent risk formulation of ordinal regression which allows us to derive a uniform convergence bound. Applying this bound we present a large margin algorithm that is based on a mapping from objects to scalar utility values thus classifying pairs of objects. We give experimental results for an information retrieval task which show that our algorithm outperforms more naive approaches to ordinal regression such as Support Vector Classification and Support Vector Regression in the case of more than two ranks.

  • Ralf Herbrich, Thore Graepel, Colin Campbell. Robust Bayes Point Machines 2000 Proceedings of ESANN 2000 49--54
    Support Vector Machines choose the hypothesis corresponding to the centre of the largest hypersphere that can be inscribed in version space. If version space is elongated or irregularly shaped a potentially superior approach is take into account the whole of version space. We propose to construct the Bayes point which is approximated by the centre of mass. Our implementation of a Bayes Point Machine (BPM) uses an ergodic billiard to estimate this point in the kernel space. We show that BPMs outperform hard margin Support Vector Machines (SVMs) on real world datasets. We introduce a technique that allows the BPM to construct hypotheses with non--zero training error similar to soft margin SVMs with quadratic penelisation of the margin slacks. An experimental study reveals that with decreasing penelisation of training error the improvement of BPMs over SVMs decays, a finding that is explained by geometrical considerations.

  • Ralf Herbrich, Thore Graepel, John Shawe-Taylor. Sparsity vs. Large Margins for Linear Classifiers 2000 Proceedings of the Thirteenth Annual Conference on Computational Learning Theory 304--308
    We provide small sample size bounds on the generalisation error of linear classifiers that take advantage of large observed margins on the training set and sparsity in the data dependent expansion coefficients. It is already known from results in the luckiness framework that both criteria independently have a large impact on the generalisation error. Our new results show that they can be combined which theoretically justifies learning algorithms like the Support Vector Machine or the Relevance Vector Machine. In contrast to previous studies we avoid using the classical technique of symmetrisation by a ghost sample but directly using the sparsity for the estimation of the generalisation error. We demonstrate that our result leads to practical useful results even in case of small sample size if the training set witnesses our prior belief in sparsity and large margins.

  • Sambu Seo, Marko Wallat, Thore Graepel, Klaus Obermayer. Gaussian Process Regression: Active Data Selection and Test Point Rejection 2000 Proceedings of the International Joint Conference on Neural Networks IJCNN'2000 241--246 III
    We consider active data selection and test point rejection strategies for Gaussian process regression based on the variance of the posterior over target values. Gaussian process regression is viewed as transductive regression that provides target distributions for given points rather than selecting an explicit regression function. Since not only the posterior mean but also the posterior variance are easily calculated we use this additional information to two ends: Active data selection is performed by either querying at points of high estimated posterior variance or at points that minimize the estimated posterior variance averaged over the input distribution of interest or --- in a transductive manner --- averaged over the test set. Test point rejection is performed using the estimated posterior variance as a confidence measure. We find for both a two-dimensional toy problem and for a real-world benchmark problem that the variance is a reasonable criterion for both active data selection and test point rejection.

1999

  • Ralf Herbrich, Thore Graepel, Colin Campbell. Bayes Point Machines: Estimating the Bayes Point in Kernel Space 1999 Proceedings of IJCAI Workshop Support Vector Machines 23--27
    From a Bayesian perspective Support Vector Machines choose the hypothesis corresponding to the largest possible hypersphere that can be inscribed in version space, i.e. in the space of all consistent hypotheses given a training set. Those boundaries of version space which are tangent to the hypersphere define the support vectors. An alternative and potentially better approach is to construct the hypothesis using the whole of version space. This is achieved by using a Bayes Point Machine which finds the midpoint of the region of intersection of all hyperplanes bisecting version space into two halves of equal volume (the Bayes point). It is known that the center of mass of version space approximates the Bayes point. We suggest estimating the center of mass by averaging over the trajectory of a billiard ball bouncing in version space. Experimental results are presented indicating that Bayes Point Machines consistently outperform Support Vector Machines.

  • Ralf Herbrich, Thore Graepel, Colin Campbell. Bayesian Learning in Reproducing Kernel Hilbert Spaces 1999 Technical University of Berlin
    Support Vector Machines find the hypothesis that corresponds to the centre of the largest hypersphere that can be placed inside version space, i.e. the space of all consistent hypotheses given a training set. The boundaries of version space touched by this hypersphere define the support vectors. An even more promising approach is to construct the hypothesis using the whole of version space. This is achieved by the Bayes point: the midpoint of the region of intersection of all hyperplanes bisecting version space into two volumes of equal magnitude. It is known that the centre of mass of version space approximates the Bayes point. The centre of mass is estimated by averaging over the trajectory of a billiard in version space. We derive bounds on the generalisation error of Bayesian classifiers in terms of the volume ratio of version space and parameter space. This ratio serves as an effective VC dimension and greatly influences generalisation. We present experimental results indicating that Bayes Point Machines consistently outperform Support Vector Machines. Moreover, we show theoretically and experimentally how Bayes Point Machines can easily be extended to admit training errors.

  • Thore Graepel, Ralf Herbrich, Klaus Obermayer. Bayesian transductive classification by maximizing volume in version space 1999 Proceedings of Learning 1999 Conference
    According to Vapnik, when solving a given problem one should avoid solving a more general problem as an intermediate step. A direct application of this principle reduces the more general problem of inferring a functional dependency on the whole of input space to the problem of estimating the values of a function at given points (working set), a paradigm referred to as "transductive inference". We consider the case of binary classification by linear discriminant functions. The simplification of the problem results from the fact that the infinite number of linear discriminants is boiled down to a finite number of equivalence classes on the working set. The number of equivalence classes is bounded from above by the growth function. Each equivalence class corresponds to a polyhedron in parameter space. In a PAC style setting we consider only the region of parameter space with zero training error, often referred to as the version space. From a Bayesian point of view, we suggest to measure the prior probability of a labeling of the working set as the volume of the corresponding polyhedron w.r.t. the apriori distribution in parameter space. Then the maximum aposteriori (MAP) scheme recommends to choose the labeling of maximum volume. In the case of general priors a Monte Carlo sampling of version space leads to estimates of the relevant volumes. Sampling, however, is computationally costly because it is necessary to check one constraint per data point in the training set and the rejection rate grows exponentially with the dimensionality of the input space. We thus suggest to sample the version space using an ergodic billiard, an idea first employed by Rujan for finding the center of mass of version space. Starting from the point corresponding to the Optimal Hyperplane into a randomly selected direction, one follows the trajectory of an ideal billiard ball and determines the largest subvolume in version space as the one, where the ball spends the longest time on its journey

  • Thore Graepel, Ralf Herbrich, Peter Bollmann--Sdorra, Klaus Obermayer. Classification on Pairwise Proximity Data 1999 Advances in Neural Information Processing Systems 11 438--444
    We investigate the problem of learning a classification task on data represented in terms of their pairwise proximities. This representation does not refer to an explicit feature representation of the data items and is thus more general than the standard approach of using Euclidean feature vectors, from which pairwise proximities can always be calculated. Our first approach is based on a combined linear embedding and classification procedure resulting in an extension of the Optimal Hyperplane algorithm to pseudo-Euclidean data. As an alternative we present another approach based on a linear threshold model in the proximity values themselves, which is optimized using Structural Risk Minimization. We show that prior knowledge about the problem can be incorporated by the choice of distance measures and examine different metrics w.r.t. their generalization. Finally, the algorithms are successfully applied to protein structure data and to data from the cat's cerebral cortex. They show better performance than K-nearest-neighbor classification.

  • Thore Graepel, Ralf Herbrich, Bernhard Schölkopf, Alex Smola, Peter Bartlett, Klaus Robert-Müller, Klaus Obermayer, Robert Williamson. Classification on Proximity Data with LP--Machines 1999 Proceedings of the Ninth International Conference on Artificial Neural Networks 304--309
    We provide a new linear program to deal with classification of data in the case of data given in terms of pairwise proximities. This allows to avoid the problems inherent in using feature spaces with indefinite metric in Support Vector Machines, since the notion of a margin is purely needed in input space where the classification actually occurs. Moreover in our approach we can enforce sparsity in the proximity representation by sacrificing training error. This turns out to be favorable for proximity data. Similar to nu--SV methods, the only parameter needed in the algorithm is the (asymptotical) number of data points being classified with a margin. Finally, the algorithm is successfully compared with nu--SV learning in proximity space and K--nearest-neighbors on real world data from Neuroscience and molecular biology.

  • Ralf Herbrich, Max Keilbach, Thore Graepel, Peter Bollmann--Sdorra, Klaus Obermayer. Neural Networks in Economics: Background, Applications, and New Developments 1999 Advances in Computational Economics 169--196 11
    Neural Networks were developed in the sixties as devices for classification and regression. The approach was originally inspired from Neuroscience. Its attractiveness lies in the ability to ``learn'', i.e. to generalize to as yet unseen observations. One aim of this paper is to give an introduction to the technique of Neural Networks and an overview of the most popular architectures. We start from statistical learning theory to introduce the basics of learning. Then, we give an overview of the general principles of neural networks and of their use in the field of Economics. A second purpose is to introduce a recently developed Neural Network Learning technique, so called Support Vector Network Learning, which is an application of ideas from statistical learning theory. This approach has shown very promising results on problems with a limited amount of training examples. Moreover, utilizing a technique that is known as the kernel trick, Support Vector Networks can easily be adapted to nonlinear models. Finally, we present an economic application of this approach from the field of preference learning.

  • Ralf Herbrich, Thore Graepel, Klaus Obermayer. Regression Models for Ordinal Data: A Machine Learning Approach 1999 Technical University of Berlin
    In contrast to the standard machine learning tasks of classification and metric regression we investigate the problem of predicting variables of ordinal scale, a setting referred to as ordinal regression. The task of ordinal regression arises frequently in the social sciences and in information retrieval where human preferences play a major role. Also many multi-class problems are really problems of ordinal regression due to an ordering of the classes. Although the problem is rather novel to the Machine Learning Community it has been widely considered in Statistics before. All the statistical methods rely on a probability model of a latent (unobserved) variable and on the condition of stochastic ordering. In this paper we develop a distribution independent formulation of the problem and give uniform bounds for our risk functional. The main difference to classification is the restriction that the mapping of objects to ranks must be transitive and asymmetric. Combining our theoretical framework with results from measurement theory we present an approach that is based on a mapping from objects to scalar utility values and thus guarantees transitivity and asymmetry. Applying the principle of Structural Risk Minimization as employed in Support Vector Machines we derive a new learning algorithm based on large margin rank boundaries for the task of ordinal regression. Our method is easily extended to nonlinear utility functions. We give experimental results for an Information Retrieval task of learning the order of documents with respect to an initial query. Moreover, we show that our algorithm outperforms more naive approaches to ordinal regression such as Support Vector Classification and Support Vector Regression in the case of more than two ranks

  • Ralf Herbrich, Thore Graepel, Klaus Obermayer. Support Vector Learning for Ordinal Regression 1999 Proceedings of the Ninth International Conference on Artificial Neural Networks 97--102
    We investigate the problem of predicting variables of ordinal scale. This task is referred to as ordinal regression and is complementary to the standard machine learning tasks of classification and metric regression. In contrast to statistical models we present a distribution independent formulation of the problem together with uniform bounds of the risk functional. The approach presented is based on a mapping from objects to scalar utility values. Similar to Support Vector methods we derive a new learning algorithm for the task of ordinal regression based on large margin rank boundaries. We give experimental results for an information retrieval task: learning the order of documents w.r.t. an initial query. Experimental results indicate that the presented algorithm outperforms more naive approaches to ordinal regression such as Support Vector classification and Support Vector regression in the case of more than two ranks.

  • Thore Graepel, Klaus Obermayer. A Self-Organizing Map for Proximity Data 1999 Neural Computation 139-155 11
    We derive an efficient algorithm for topographic mapping of proximity data (TMP), which can be seen as an extension of Kohonen's Self-Organizing Map to arbitrary distance measures. The TMP cost function is derived in a Baysian framework of Folded Markov Chains for the description of autoencoders. It incorporates the data via a dissimilarity matrix D and the topographic neighborhood via a matrix H of transition probabilities. From the principle of Maximum Entropy a non-factorizing Gibbs-distribution is obtained, which is approximated in a mean-field fashion. This allows for Maximum Likelihood estimation using an EM algorithm. In analogy to the transition from Topographic Vector Quantization (TVQ) to the Self-organizing Map (SOM) we suggest an approximation to TMP which is computationally more efficient. In order to prevent convergence to local minima, an annealing scheme in the temperature parameter is introduced, for which the critical temperature of the first phase-transition is calculated in terms of D and H. Numerical results demonstrate the working of the algorithm and confirm the analytical results. Finally, the algorithm is used to generate a connection map of areas of the cat's cerebral cortex.

1998

  • Ralf Herbrich, Thore Graepel, Peter Bollmann--Sdorra, Klaus Obermayer. Learning a Preference Relation in IR 1998 Proceedings Workshop Text Categorization and Machine Learning, International Conference on Machine Learning 1998 80--84
    In this paper we investigate the problem of learning a preference relation from a given set of ranked documents. We show that the Bayes's optimal decision function, when applied to learning a preference relation, may violate transitivity. This is undesirable for information retrieval, because it is in conflict with a document ranking based on the user's preferences. To overcome this problem we present a vector space based method that performs a linear mapping from documents to scalar utility values and thus guarantees transitivity. The learning of the relation between documents is formulated as a classification problem on pairs of documents and is solved using the principle of structural risk minimization for good generalization. The approach is extended to polynomial utility functions by using the potential function method (the so called ``kernel trick''), which allows to incorporate higher order correlations of features into the utility function at minimal computational costs. The resulting algorithm is tested on an example with artificial data. The algorithm successfully learns the utility function underlying the training examples and shows good classification performance.

  • Ralf Herbrich, Thore Graepel, Peter Bollmann--Sdorra, Klaus Obermayer. Supervised Learning of Preference Relations 1998 Proceedings Fachgruppentreffen Maschinelles Lernen 43--47
    In this paper we investigate the problem of learning a preference relation from a given set of ranked objects. The main difference between preference learning and standard classification learning is the demand, that the mapping from objects to ranks has to be transitive and antisymmetric. To model this problem we present an approach that performs a linear mapping from objects to scalar utility values and thus guarantees transitivity and antisymmetry. The learning of the preference between objects is formulated as a classification problem on pairs of objects and is solved using the principle of structural risk minimization. The approach is extended to nonlinear utility functions by using the potential function method (the so called kernel trick), which allows to incorporate higher order correlations of features into the utility function at minimal computational costs.

  • Matthias Burger, Thore Graepel, Klaus Obermayer. An Annealed Self-Organizing Map for Source Channel Coding 1998 Advances in Neural Information Processing Systems NIPS 10 430-436
    We derive and analyse robust optimization schemes for noisy vector quantization on the basis of deterministic annealing. Starting from a cost function for central clustering that incorporates distortions from channel noise we develop a soft topographic vector quantization algorithm (STVQ) which is based on the maximum entropy principle and which performs a maximum-likelihood estimate in an expectation-maximization (EM) fashion. Annealing in the temperature parameter beta leads to phase transitions in the existing code vector representation during the cooling process for which we calculate critical temperatures and modes as a function of eigenvectors and eigenvalues of the covariance matrix of the data and the transition matrix of the channel noise. A whole family of vector quantization algorithms is derived from STVQ, among them a deterministic annealing scheme for Kohonen's self-organizing map (SOM). This algorithm, which we call SSOM, is then applied to vector quantization of image data to be sent via a noisy binary symmetric channel. The algorithm's performance is compared to those of LBG and STVQ. While it is naturally superior to LBG, which does not take into account channel noise, its results compare very well to those of STVQ, which is computationally much more demanding.

  • Thore Graepel, Klaus Obermayer. Fuzzy Topographic Kernel Clustering 1998 Proceedings of the 5th GI Workshop Fuzzy Neuro Systems '98 W. Brauer 90--97
    A new topographic clustering algorithm is proposed, which -- by the use of integral operator kernel functions -- efficiently estimates the centers of clusters in a high-dimensional feature space, which is related to data space by some non linear map. Like in the Self-Organizing Map topography is imposed by assuming finite transition probabilities between cluster indices. The optimization of the associated cost function is achieved by estimating the parameters via an EM-scheme and determini stic annealing. The effect of different radial basis function kernels on topographic maps of handwritten digit data is examined in computer simulations.

  • Thore Graepel, Matthias Burger, Klaus Obermayer. Self-Organizing Maps: Generalizations and New Optimization Techniques 1998 Neurocomputing 173-190 20
    We offer three algorithms for the generation of topographic mappings to the practitioner of unsupervised data analysis. The algorithms are each based on the minimization of a cost function which is performed using an EM algorithm and deterministic annealing. The soft topographic vector quantization algorithm (STVQ) - like the original Self-Organizing Map (SOM) - provides a tool for the creation of self-organizing maps of Euclidean data. Its optimization scheme, however, offers an alternative to the heuristic stepwise shrinking of the neighborhood width in the SOM and makes it possible to use a fixed neighborhood function solely to encode desired neighborhood relations between nodes. The kernel-based soft topographic mapping (STMK) is a generalization of STVQ and introduces new distance measures in data space based on kernel functions. Using the new distance measures corresponds to performing the STVQ in a high-dimensional feature space, which is related to data space by a nonlinear mapping. This preprocessing can reveal structure of the data which may go unnoticed if the STVQ is performed in the standard Euclidean space. The soft topographic mapping for proximity data (STMP) is another generalization of STVQ that enables the user to generate topographic maps for data which are given in terms of pairwise proximities. It thus offers a flexible alternative to multidimensional scaling methods and opens up a new range of applications for Self-Organizing Maps. Both STMK and STMP share the robust optimization properties of STVQ due to the application of deterministic annealing. In our contribution we discuss the algorithms together with their implementation and provide detailed pseudo-code and explanations.

  • Thore Graepel. Statistical Physics of Clustering Algorithms 1998 Berlin, Germany Department of Computer Science, Technical Universi
    This thesis presents a principled approach to clustering by formulating the problem in a probabilistic autoencoder framework which is based on folded Markov chains. As a suitable optimization technique deterministic annealing is introduced, which performs robust optimization based on an analogy to the cooling of a system in statistical physics. Application of deterministic annealing to the derived clustering cost functions leads to three algorithms: soft topographic vector quantization (STVQ), which performs topographic clustering on Euclidean feature vectors, kernel-based soft topographic mapping (STMK), which allows to do the clustering in a high dimensional Euclidean feature space by the application of the kernel trick, and soft topographic mapping for proximity data (STMP), which generalizes STVQ to arbitrary pairwise dissimilarity data in a mean field fashion. All three algorithms are analysed w.r.t.\ the annealing process and their application is demonstrated on both artificial and real world data.