Conference Proceedings
Up
Book
Journal Articles
Conference Proceedings
Theses
Technical Reports
Miscellaneous
News Paper and Web Articles

 

2007

  • Ralf Herbrich, Tom Minka, Thore Graepel. TrueSkill(TM): A Bayesian Skill Rating System 2007 Advances in Neural Information Processing Systems 20
    We present a new Bayesian skill rating system which can be viewed as a generalisation of the Elo system used in Chess. The new system tracks the uncertainty about player skills, explicitly models draws, can deal with any number of competing entities and can infer individual skills from team results. Inference is performed by approximate message passing on a factor graph representation of the model. We present experimental evidence on the increased accuracy and convergence speed of the system compared to Elo and report on our experience with the new rating system running in a large-scale commercial online gaming service under the name of TrueSkill.

2006

  • David Stern, Ralf Herbrich, Thore Graepel. Bayesian Pattern Ranking for Move Prediction in the Game of Go 2006 Proceedings of the International Conference of Machine Learning 873--880
    We investigate the problem of learning to predict moves in the board game of Go from game records of expert players. In particular, we obtain a probability distribution over legal moves for professional play in a given position. This distribution has numerous applications in computer Go, including serving as an efficient stand-alone Go player. It would also be effective as a move selector and move sorter for game tree search and as a training tool for Go players. Our method has two major components: a) a pattern extraction scheme for efficiently harvesting patterns of given size and shape from expert game records and b) a Bayesian learning algorithm (in two variants) that learns a distribution over the values of a move given a board position based on the local pattern context. The system is trained on 181,000 expert games and shows excellent prediction performance as indicated by its ability to perfectly predict the moves made by professional Go players in 34\% of test positions.

  • David Stern, Ralf Herbrich, Thore Graepel. Learning To Solve Game Trees 2006 Proceedings of the International Conference of Machine Learning 873--880

2005

  • Shivani Agarwal, Thore Graepel, Ralf Herbrich, Dan Roth. A Large Deviation Bound for the Area Under the ROC Curve 2005 Advances in Neural Information Processing Systems 17
    The area under the ROC curve (AUC) has been advocated as an evaluation criterion for the bipartite ranking problem. We study large deviation properties of the AUC; in particular, we derive a distribution-free large deviation bound for the AUC which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on an independent test sequence. A comparison of our result with a corresponding large deviation result for the classification error rate suggests that the test sample size required to obtain an epsilon-accurate estimate of the expected accuracy of a ranking function with delta-confidence is larger than that required to obtain an epsilon-accurate estimate of the expected error rate of a classification function with the same confidence. A simple application of the union bound allows the large deviation bound to be extended to learned ranking functions chosen from finite function classes.

  • Arthur Gretton, Alexander Smola, Olivier Bousquet, Ralf Herbrich, Andrei Belitski, Mark Augath, Yusuke Murayama, Jon Pauls, Bernhard Schölkopf, Nikos Logothetis. Kernel Constrained Covariance for Dependence Measurement 2005 Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics
    We discuss reproducing kernel Hilbert space (RKHS)-based measures of statistical dependence, with emphasis on constrained covariance (COCO), a novel criterion to test dependence of random variables. We show that COCO is a test for independence if and only if the associated RKHSs are universal. That said, no independence test exists that can distinguish dependent and independent random variables in all circumstances. Dependent random variables can result in a COCO which is arbitrarily close to zero when the source densities are highly non-smooth. All current kernel-based independence tests share this behaviour. We demonstrate exponential convergence between the population and empirical COCO. Finally, we use COCO as a measure of joint neural activity between voxels in MRI recordings of the macaque monkey, and compare the results to the mutual information and the correlation. We also show the effect of removing breathing artefacts from the MRI recording.

  • Shyamsundar Rajaram, Thore Graepel, Ralf Herbrich. Poisson-Networks: A Model for structured point processes 2005 Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics
    Modelling structured multivariate point process data has wide ranging applications like understanding neural activity, developing faster file access systems and learning dependencies among servers in large networks. In this paper, we develop the Poisson network model for representing multivariate structured Poisson processes. In our model each node of the network represents a Poisson process. The novelty of our work is that waiting times of a process are modelled by an exponential distribution with a piecewise constant rate function that depends on the event counts of its parents in the network in a generalised linear way. Our choice of model allows to perform exact sampling from arbitrary structures. We adopt a Bayesian approach for learning the network structure. Further, we discuss fixed point and sampling based approximations for performing inference of rate functions in Poisson networks.

2004

  • Thore Graepel, Ralf Herbrich. Invariant Pattern Recognition by Semidefinite Programming Machines 2004 Advances in Neural Information Processing Systems 16 33--40
    Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Previous approaches are either based on regularisation or on the generation of virtual (transformed) examples. We develop a new framework for learning linear classifiers under known transformations based on semidefinite programming. We present a new learning algorithm - the Semidefinite Programming Machine (SDPM) - which is able to find a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is found to be sparse in dual variables and allows to identify those points on the trajectory with minimal real-valued output as virtual support vectors. Extensions to segments of trajectories, to more than one transformation parameter, and to learning with kernels are discussed. In experiments we use a Taylor expansion to locally approximate rotational invariance in pixel images from USPS and find improvements over known methods.

  • Thore Graepel, Ralf Herbrich, Andriy Kharechko, John Shawe-Taylor. Semidefinite Programming by Perceptron Learning 2004 Advances in Neural Information Processing Systems 16 457--464
    We present a modified version of the perceptron learning algorithm (PLA) which solves semidefinite programs (SDPs) in polynomial time. The algorithm is based on the following three observations: (i) Semidefinite programs are linear programs with infinitely many (linear) constraints; (ii) every linear program can be solved by a sequence of constraint satisfaction problems with linear constraints; (iii) in general, the perceptron learning algorithm solves a constraint satisfaction problem with linear constraints in finitely many updates. Combining the PLA with a probabilistic rescaling algorithm (which, on average, increases the size of the feasable region) results in a probabilistic algorithm for solving SDPs that runs in polynomial time. We present preliminary results which demonstrate that the algorithm works, but is not competitive with state-of-the-art interior point methods.

  • Thore Graepel, Ralf Herbrich, Julian Gold. Learning to Fight 2004 Proceedings of the International Conference on Computer Games: Artificial Intelligence, Design and Education
    We apply reinforcement learning to the problem of finding good policies for a fighting agent in a commercial computer game. The learning agent is trained using the SARSA algorithm for on-policy learning of an action-value function represented by linear and neural network function approximators. We discuss the selection and construction of features, actions, and rewards as well as other design choices necessary to integrate the learning process into the game. The learning agent is trained against the built-in AI of the game with different rewards encouraging aggressive or defensive behaviour. We show that the learning agent finds interesting (and partly near optimal) policies in accordance with the reward functions provided. We also discuss the particular challenges arising in the application of reinforcement learning to the domain of computer games.

  • Shivani Agarwal, Thore Graepel, Ralf Herbrich, Dan Roth. A Large Deviation Bound for the Area Under the ROC Curve 2004 Advances in Neural Information Processing Systems 17
    The area under the ROC curve (AUC) has been advocated as an evaluation criterion for the bipartite ranking problem. We study large deviation properties of the AUC; in particular, we derive a distribution-free large deviation bound for the AUC which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on an independent test sequence. A comparison of our result with a corresponding large deviation result for the classification error rate suggests that the test sample size required to obtain an epsilon-accurate estimate of the expected accuracy of a ranking function with delta-confidence is larger than that required to obtain an epsilon-accurate estimate of the expected error rate of a classification function with the same confidence. A simple application of the union bound allows the large deviation bound to be extended to learned ranking functions chosen from finite function classes.

2003

  • Neil Lawrence, Matthias Seeger, Ralf Herbrich. Fast Sparse Gaussian Process Methods: The Informative Vector Machine 2003 Advances in Neural Information Processing Systems 15 625--632
    We present a framework for sparse Gaussian process (GP) methods which uses forward selection with criteria based on information-theoretical principles, previously suggested for active learning. In contrast to most previous work on sparse GPs, our goal is not only to learn sparse predictors (which can be evaluated in O(d) rather than O(n), d<<n, n the number of training points), but also to perform training under strong restrictions on time and memory requirements. The scaling of our method is at most O(n d^2), and in large real-world classification experiments we show that it can match prediction performance of the popular support vector machine (SVM), yet it requires only a fraction of the training time. In contrast to the SVM, our approximation produces estimates of predictive probabilities (`error bars'), allows for Bayesian model selection and is less complex in implementation.

  • Edward Harrington, Ralf Herbrich, Jyrki Kivinen, John C. Platt, Robert C. Williamson. Online Bayes Point Machines 2003 Proceedings of the Seventh Pacific-Asia Conference on Knowledge Discovery and Data Mining 241--252
    We present a new and simple algorithm for learning large margin classifiers that works in a truly online manner. The algorithm generates a linear classifier by averaging the weights associated with several perceptron-like algorithms run in parallel in order to approximate the Bayes point. A random subsample of the incoming data stream is used to ensure diversity in the perceptron solutions. We experimentally study the algorithm's performance on online and batch learning settings. The online experiments showed that our algorithm produces a low prediction error on the training sequence and tracks the presence of concept drift. On the batch problems its performance is comparable to the maximum margin algorithm which explicitly maximises the margin.

  • Arthur Gretton, Ralf Herbrich, Alexander Smola. The Kernel Mutual Information 2003 Proceedings of IEEE Internaltional Conference on Acoustics, Speech and Signal Processing
    We introduce a new contrast function, the kernel mutual information (KMI), to measure the degree of independence of continuous random variables. This contrast function provides an approximate upper bound on the mutual information, as measured near independence, and is based on a kernel density estimate of the mutual information between a discretised approximation of the continuous random variables. We show that Bach and Jordan's kernel generalised variance (KGV) is also an upper bound on the same kernel density estimate, but is looser. Finally, we suggest that the addition of a regularising term in the KGV causes it to approach the KMI, which motivates the introduction of this regularisation.

  • Robertson, S. E., Walker, S., Zaragoza, H., Herbrich, R.. Microsoft Cambridge at TREC 2002: Filtering track 2003 The Eleventh Text REtrieval Conference, TREC 2002 Voorhees, E. M. and Harman, D. K. 439--446

2002

  • Ralf Herbrich, Robert C. Williamson. Algorithmic Luckiness 2002 Advances in Neural Information Processing Systems 14 391--397
    In contrast to standard statistical learning theory which studies uniform bounds on the expected error we present a framework that exploits the specific learning algorithm used. Motivated by the luckiness framework [Taylor et al., 1998] we are also able to exploit the serendipity of the training sample. The main difference to previous approaches lies in the complexity measure; rather than covering all hypotheses in a given hypothesis space it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. We show how the resulting framework relates to the VC, luckiness and compression frameworks. Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers which results in a bound that exploits both the margin and the distribution of the data in feature space.

  • Simon Hill, Hugo Zaragoza, Ralf Herbrich, Peter J. Rayner. Average Precision and the Problem of Generalisation 2002 Proceedings of the ACM SIGIR Workshop on Mathematical and Formal Methods in Information Retrieval
    In this paper we study the problem of generalisation in information retrieval. In particular we study precision-recall curves and the average precision value. We provide two types of bounds: large-deviation bounds of the average precision and maximum deviation bounds with respect to a given point of the precision recall curve. The first type of bounds are useful to answer the question: how far can true average precision be from the value observed on a test collection? The second is useful for obtaining bounds on average precision when tight bounds on a particular point of the curve can be established, as is the case when training SVMs or Perceptrons for document categorisation.

  • Yaoyong Li, Hugo Zaragoza, Ralf Herbrich, John Shawe-Taylor, Jaz Kandola. The Perceptron Algorithm with Uneven Margins 2002 Proceedings of the International Conference of Machine Learning 379--386
    The perceptron algorithm with margins is a simple, fast and effective learning algorithm for linear classifiers; it produces decision hyperplanes within some constant ratio of the maximal margin. In this paper we study this algorithm and a new variant: the perceptron algorithm with uneven margins, tailored for document categorisation problems (i.e. problems where classes are highly unbalanced and performance depends on the ranking of patterns). We discuss the interest of these algorithms from a theoretical point of view, provide a generalisation of Novikoff's theorem for uneven margins, give a geometrically description of these algorithms and show experimentally that both algorithms yield equal or better performances than support vector machines, while reducing training time and sparsity, in classification (USPS) and document categorisation (Reuters) problems.

  • Stephen E. Robertson, Stephen Walker, Hugo Zaragoza, Ralf Herbrich. Microsoft Cambridge at TREC 2002: Filtering Task 2002 Proceedings of the Ninth Text Retrieval Conference (TREC-9) 361--368
    Six runs were submitted for the Adaptive Filtering track, four on the adaptive filtering task (ok11af??), and two on the routing task (msPUM?). The adaptive filtering system has been somewhat modified from the one used for TREC-10, largely for efficiently and flexibility reasons; the basic filtering algorithms remain similar to those used in recent TRECs. For the routing task, a completely new system based on perceptrons with uneven margins was used.

2001

  • Bernhard Schölkopf, Ralf Herbrich, Alexander J. Smola. A Generalized Representer Theorem 2001 Proceedings of the Fourteenth Annual Conference on Computational Learning Theory 416--426
    Wahba's classical representer theorem states that the solutions of certain risk minimization problems involving an empirical risk term and a quadratic regularizer can be written as expansions in terms of the training examples. We generalize the theorem to a larger class of regularizers and empirical risk terms, and give a self-contained proof utilizing the feature space associated with a kernel. The result shows that a wide range of problems have optimal solutions that live in the finite dimensional span of the training examples mapped into feature space, thus enabling us to carry out kernel algorithms independent of the (potentially infinite) dimensionality of the feature space.

  • Ralf Herbrich, Thore Graepel. A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work 2001 Advances in Neural Information Processing Systems 13 224-230
    We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and --for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets.

  • Thore Graepel, Ralf Herbrich, Robert C. Williamson. From Margin to Sparsity 2001 Advances in Neural Information Processing Systems 13 210--216
    We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the dual perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself.

  • Ralf Herbrich, Thore Graepel. Large Scale Bayes Point Machines 2001 Advances in Neural Information Processing Systems 13 528--534
    The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently been demonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set) - also known as the Bayes point - exhibits excellent generalisation abilities. However, the billiard algorithm as presented in [Herbrich et al., 2000] is restricted to small sample size because it requires O(m^2) of memory and O(N*m^2) computational steps where m is the number of training patterns and N is the number of random draws from the posterior distribution. In this paper we present a method based on the simple perceptron learning algorithm which allows to overcome this algorithmic drawback. The method is algorithmically simple and is easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digits which show that Bayes Point Machines are competitive with the current world champion, the support vector machine. In addition, the computational complexity of BPMs can be tuned by varying the number of samples from the posterior. Finally, rejecting test points on the basis of their (approximative) posterior probability leads to a rapid decrease in generalisation error, e.g. 0.1\% generalisation error for a given rejection rate of 10\%.

  • Thore Graepel, Mike Goutrie, Marco Krüger, Ralf Herbrich. Learning on Graphs in the Game of Go 2001 Proceedings of the Ninth International Conference on Artificial Neural Networks 347--352
    We consider the game of Go from the point of view of machine learning and as a well-defined domain for learning on graph representations. We discuss the representation of both board positions and candidate moves and introduce the common fate graph (CFG) as an adequate representation of board positions for learning. Single candidate moves are represented as feature vectors with features given by subgraphs relative to the given move in the CFG. Using this representation we train a support vector machine (SVM) and a kernel perceptron to discriminate good moves from bad moves on a collection of life-and-death problems and on 9x9 game records. We thus obtain kernel machines that solve Go problems and play 9x9 Go.

  • Arthur Gretton, Arnaud Doucet, Ralf Herbrich, Peter J. W. Rayner, Bernhard Schölkopf. Support Vector Regression for Black-Box System Identification 2001 11th IEEE Workshop on Statistical Signal Processing
    In this paper, we demonstrate the use of support vector regression (SVR) techniques for black-box system identification. There methods derive from statistical learning theory, and are of great theoretical and practical interest. We briefly describe the theory underpinning SVR, and compare support vector methods with other approaches using radial basis networks. Finally, we apply SVR to modeling the behaviour of a hydralic robot arm, and show that SVR improves on previously published results.

  • Thore Graepel, Ralf Herbrich. The Kernel Gibbs Sampler 2001 Advances in Neural Information Processing Systems 13 514--520
    We present an algorithm that samples the hypothesis space of kernel classifiers. Given a uniform prior over normalised weight vectors and a likelihood based on a model of label noise leads to a piecewise constant posterior that can be sampled by the kernel Gibbs sampler (KGS). The KGS is a Markov Chain Monte Carlo method that chooses a random direction in parameter space and samples from the resulting piecewise constant density along the line chosen. The KGS can be used as an analytical tool for the exploration of Bayesian transduction, Bayes point machines, active learning, and evidence-based model selection on small data sets that are contaminated with label noise. For a simple toy example we demonstrate experimentally how a Bayes point machine based on the KGS outperforms an SVM that is incapable of taking into account label noise.

2000

  • Thore Graepel, Ralf Herbrich, Klaus Obermayer. Bayesian Transduction 2000 Advances in Neural Information Processing Systems 12 456--462
    Transduction is an inference principle that takes a training sample and aims at estimating the values of a function at given points contained in the so-called working sample. Hence, transduction is a less ambitious task than induction which aims at inferring a functional dependency on the whole of input space. As a consequence, however, transduction provides a confidence measure on single predictions rather than classifiers, a feature particularly important for risk--sensitive applications. We consider the case of binary classification by linear discriminant functions (perceptrons) in kernel space. From the transductive point of view, the infinite number of perceptrons is boiled down to a finite number of equivalence classes on the working sample each of which corresponds to a polyhedron in parameter space. In the Bayesian spirit the posteriori probability of a labelling of the working sample is determined as the ratio between the volume of the corresponding polyhedron and the volume of version space. Then the maximum posteriori scheme recommends to choose the labelling of maximum volume. We suggest to sample version space by an ergodic billiard in kernel space. Experimental results on real world data indicate that Bayesian Transduction compares favourably to the well-known Support Vector Machine, in particular if the posteriori probability of labellings is used as a confidence measure to exclude test points of low confidence.

  • Thore Graepel, Ralf Herbrich, John Shawe-Taylor. Generalisation Error Bounds for Sparse Linear Classifiers 2000 Proceedings of the Thirteenth Annual Conference on Computational Learning Theory 298--303
    We provide small sample size bounds on the generalisation error of linear classifiers that are sparse in their dual representation given by the expansion coefficients of the weight vector in terms of the training data. These results theoretically justify algorithms like the Support Vector Machine, the Relevance Vector Machine and K-nearest-neighbour. The bounds are a-posteriori bounds to be evaluated after learning when the attained level of sparsity is known. In a PAC-Bayesian style prior knowledge about the expected sparsity is incorporated into the bounds. The proofs avoid the use of double sample arguments by taking into account the sparsity that leaves unused training points for the evaluation of classifiers. We furthermore give a PAC-Bayesian bound on the average generalisation error over subsets of parameter space that may pave the way combining sparsity in the expansion coefficients and margin in a single bound. Finally, reinterpreting a mistake bound for the classical perceptron algorithm due to Novikoff we demonstrate that our new results put classifiers found by this algorithm on a firm theoretical basis.

  • Ralf Herbrich, Thore Graepel, Colin Campbell. Robust Bayes Point Machines 2000 Proceedings of ESANN 2000 49--54
    Support Vector Machines choose the hypothesis corresponding to the centre of the largest hypersphere that can be inscribed in version space. If version space is elongated or irregularly shaped a potentially superior approach is take into account the whole of version space. We propose to construct the Bayes point which is approximated by the centre of mass. Our implementation of a Bayes Point Machine (BPM) uses an ergodic billiard to estimate this point in the kernel space. We show that BPMs outperform hard margin Support Vector Machines (SVMs) on real world datasets. We introduce a technique that allows the BPM to construct hypotheses with non--zero training error similar to soft margin SVMs with quadratic penelisation of the margin slacks. An experimental study reveals that with decreasing penelisation of training error the improvement of BPMs over SVMs decays, a finding that is explained by geometrical considerations.

  • Ralf Herbrich, Thore Graepel, John Shawe-Taylor. Sparsity vs. Large Margins for Linear Classifiers 2000 Proceedings of the Thirteenth Annual Conference on Computational Learning Theory 304--308
    We provide small sample size bounds on the generalisation error of linear classifiers that take advantage of large observed margins on the training set and sparsity in the data dependent expansion coefficients. It is already known from results in the luckiness framework that both criteria independently have a large impact on the generalisation error. Our new results show that they can be combined which theoretically justifies learning algorithms like the Support Vector Machine or the Relevance Vector Machine. In contrast to previous studies we avoid using the classical technique of symmetrisation by a ghost sample but directly using the sparsity for the estimation of the generalisation error. We demonstrate that our result leads to practical useful results even in case of small sample size if the training set witnesses our prior belief in sparsity and large margins.

1999

  • Ralf Herbrich, Jason Weston. Adaptive Margin Support Vector Machines for Classification Learning 1999 Proceedings of the Ninth International Conference on Artificial Neural Networks 880--885
    In this paper we propose a new learning algorithm for classification learning based on the Support Vector Machine (SVM) approach. Existing approaches for constructing SVMs are based on minimization of a regularized margin loss where the margin is treated equivalently for each training pattern. We propose a reformulation of the minimization problem such that adaptive margins for each training pattern are utilized, which we call the Adaptive Margin (AM-) SVM. We give bounds on the generalization error of AM-SVMs which justify their robustness against outliers, and show experimentally that the generalization error of AM-SVMs is comparable to classical SVMs on benchmark datasets from the UCI repository.

  • Ralf Herbrich, Thore Graepel, Colin Campbell. Bayes Point Machines: Estimating the Bayes Point in Kernel Space 1999 Proceedings of IJCAI Workshop Support Vector Machines 23--27
    From a Bayesian perspective Support Vector Machines choose the hypothesis corresponding to the largest possible hypersphere that can be inscribed in version space, i.e. in the space of all consistent hypotheses given a training set. Those boundaries of version space which are tangent to the hypersphere define the support vectors. An alternative and potentially better approach is to construct the hypothesis using the whole of version space. This is achieved by using a Bayes Point Machine which finds the midpoint of the region of intersection of all hyperplanes bisecting version space into two halves of equal volume (the Bayes point). It is known that the center of mass of version space approximates the Bayes point. We suggest estimating the center of mass by averaging over the trajectory of a billiard ball bouncing in version space. Experimental results are presented indicating that Bayes Point Machines consistently outperform Support Vector Machines.

  • Thore Graepel, Ralf Herbrich, Klaus Obermayer. Bayesian transductive classification by maximizing volume in version space 1999 Proceedings of Learning 1999 Conference
    According to Vapnik, when solving a given problem one should avoid solving a more general problem as an intermediate step. A direct application of this principle reduces the more general problem of inferring a functional dependency on the whole of input space to the problem of estimating the values of a function at given points (working set), a paradigm referred to as "transductive inference". We consider the case of binary classification by linear discriminant functions. The simplification of the problem results from the fact that the infinite number of linear discriminants is boiled down to a finite number of equivalence classes on the working set. The number of equivalence classes is bounded from above by the growth function. Each equivalence class corresponds to a polyhedron in parameter space. In a PAC style setting we consider only the region of parameter space with zero training error, often referred to as the version space. From a Bayesian point of view, we suggest to measure the prior probability of a labeling of the working set as the volume of the corresponding polyhedron w.r.t. the apriori distribution in parameter space. Then the maximum aposteriori (MAP) scheme recommends to choose the labeling of maximum volume. In the case of general priors a Monte Carlo sampling of version space leads to estimates of the relevant volumes. Sampling, however, is computationally costly because it is necessary to check one constraint per data point in the training set and the rejection rate grows exponentially with the dimensionality of the input space. We thus suggest to sample the version space using an ergodic billiard, an idea first employed by Rujan for finding the center of mass of version space. Starting from the point corresponding to the Optimal Hyperplane into a randomly selected direction, one follows the trajectory of an ideal billiard ball and determines the largest subvolume in version space as the one, where the ball spends the longest time on its journey

  • Thore Graepel, Ralf Herbrich, Peter Bollmann--Sdorra, Klaus Obermayer. Classification on Pairwise Proximity Data 1999 Advances in Neural Information Processing Systems 11 438--444
    We investigate the problem of learning a classification task on data represented in terms of their pairwise proximities. This representation does not refer to an explicit feature representation of the data items and is thus more general than the standard approach of using Euclidean feature vectors, from which pairwise proximities can always be calculated. Our first approach is based on a combined linear embedding and classification procedure resulting in an extension of the Optimal Hyperplane algorithm to pseudo-Euclidean data. As an alternative we present another approach based on a linear threshold model in the proximity values themselves, which is optimized using Structural Risk Minimization. We show that prior knowledge about the problem can be incorporated by the choice of distance measures and examine different metrics w.r.t. their generalization. Finally, the algorithms are successfully applied to protein structure data and to data from the cat's cerebral cortex. They show better performance than K-nearest-neighbor classification.

  • Thore Graepel, Ralf Herbrich, Bernhard Schölkopf, Alex Smola, Peter Bartlett, Klaus Robert-Müller, Klaus Obermayer, Robert Williamson. Classification on Proximity Data with LP--Machines 1999 Proceedings of the Ninth International Conference on Artificial Neural Networks 304--309
    We provide a new linear program to deal with classification of data in the case of data given in terms of pairwise proximities. This allows to avoid the problems inherent in using feature spaces with indefinite metric in Support Vector Machines, since the notion of a margin is purely needed in input space where the classification actually occurs. Moreover in our approach we can enforce sparsity in the proximity representation by sacrificing training error. This turns out to be favorable for proximity data. Similar to nu--SV methods, the only parameter needed in the algorithm is the (asymptotical) number of data points being classified with a margin. Finally, the algorithm is successfully compared with nu--SV learning in proximity space and K--nearest-neighbors on real world data from Neuroscience and molecular biology.

  • Ralf Herbrich, Thore Graepel, Klaus Obermayer. Support Vector Learning for Ordinal Regression 1999 Proceedings of the Ninth International Conference on Artificial Neural Networks 97--102
    We investigate the problem of predicting variables of ordinal scale. This task is referred to as ordinal regression and is complementary to the standard machine learning tasks of classification and metric regression. In contrast to statistical models we present a distribution independent formulation of the problem together with uniform bounds of the risk functional. The approach presented is based on a mapping from objects to scalar utility values. Similar to Support Vector methods we derive a new learning algorithm for the task of ordinal regression based on large margin rank boundaries. We give experimental results for an information retrieval task: learning the order of documents w.r.t. an initial query. Experimental results indicate that the presented algorithm outperforms more naive approaches to ordinal regression such as Support Vector classification and Support Vector regression in the case of more than two ranks.

Before 1999

  • Ralf Herbrich, Thore Graepel, Peter Bollmann--Sdorra, Klaus Obermayer. Learning a Preference Relation in IR 1998 Proceedings Workshop Text Categorization and Machine Learning, International Conference on Machine Learning 1998 80--84
    In this paper we investigate the problem of learning a preference relation from a given set of ranked documents. We show that the Bayes's optimal decision function, when applied to learning a preference relation, may violate transitivity. This is undesirable for information retrieval, because it is in conflict with a document ranking based on the user's preferences. To overcome this problem we present a vector space based method that performs a linear mapping from documents to scalar utility values and thus guarantees transitivity. The learning of the relation between documents is formulated as a classification problem on pairs of documents and is solved using the principle of structural risk minimization for good generalization. The approach is extended to polynomial utility functions by using the potential function method (the so called ``kernel trick''), which allows to incorporate higher order correlations of features into the utility function at minimal computational costs. The resulting algorithm is tested on an example with artificial data. The algorithm successfully learns the utility function underlying the training examples and shows good classification performance.

  • Ralf Herbrich, Thore Graepel, Peter Bollmann--Sdorra, Klaus Obermayer. Supervised Learning of Preference Relations 1998 Proceedings Fachgruppentreffen Maschinelles Lernen 43--47
    In this paper we investigate the problem of learning a preference relation from a given set of ranked objects. The main difference between preference learning and standard classification learning is the demand, that the mapping from objects to ranks has to be transitive and antisymmetric. To model this problem we present an approach that performs a linear mapping from objects to scalar utility values and thus guarantees transitivity and antisymmetry. The learning of the preference between objects is formulated as a classification problem on pairs of objects and is solved using the principle of structural risk minimization. The approach is extended to nonlinear utility functions by using the potential function method (the so called kernel trick), which allows to incorporate higher order correlations of features into the utility function at minimal computational costs.

  • Ralf Herbrich, Tobias Scheffer. Generation of task specific segmentation procedures as a model selection task 1997 Proceedings of the Visual Information Processing Workshop 11-24
    The segmentation of an image, i.e. the partitioning of the image into regions with equal properties, is an essential task in the image processing chain in pattern recognition tasks. This paper presents an approach to supervised image segmentation that use a model selection algorithm before the induction to select the best region model. Afterwards, a segmentation algorithm will be induced in the selected model. This approach is non--parametric; the model selection will be performed automatically. The method is evaluated on two datasets. At first we used the well known dataset of brodatz textures. Here we assume that the regions can be distinguished by local peaks in their frequency spectrum. Therefore we use gabor filters to expand the gray value information. The second task is to learn the segmentation of xray--images of spines. We will show that this approach is capable to select the best model automatically and can learn the better segmentation algorithm than without model selection.

  • Tobias Scheffer, Ralf Herbrich. Unbiased Assessment of Learning Algorithms 1997 Proceedings of the International Joint Conference on Artificial Intelligence 798--803
    In order to rank the performance of machine learning algorithms, many researchs conduct experiments on benchmark datasets. Since most learning algorithms have domain-specific parameters, it is a popular custom to adapt these parameters to obtain a minimal error rate on the test set. The same rate is used to rank the algorithm which causes an optimistic bias. We quantify this bias, showing in particular that an algorithm with more parameters will probably be ranked higher than an equally good algorithm with fewer parameters. We demonstrate this result, showing the number of parameters and trials required in order to pretend to outperform C4.5 or FOIL, respectively, for various benchmark problems. We then describe how unbiased ranking experiments should be conducted.

  • Tobias Scheffer, Ralf Herbrich, Fritz Wysotzki. Efficient theta-subsumption based on graph algorithms 1996 Proceedings International Workshop on Inductive Logic Programming 312--329

  • Tobias Scheffer, Ralf Herbrich, Fritz Wysotzki. Graph based subsumption algorithms for machine learning 1996 Proceedings zum Fachgruppentreffen Maschinelles Lernen 101--105

This site was last updated 25-02-2005