John Guiver and Edward Snelson
In this paper we address the issue of learning to rank for document retrieval using Thurstonian models based on sparse Gaussian processes. Thurstonian models represent each document for a given query as a probability distribution in a score space; these distributions over scores naturally give rise to distributions over document rankings. However, in general we do not have observed rankings with which to train the model; instead, each document in the training set is judged to have a particular relevance level: for example "Bad", "Fair", "Good", or "Excellent". The performance of the model is then evaluated using information retrieval (IR) metrics such as Normalised Discounted Cumulative Gain (NDCG). Recently Taylor et al. presented a method called SoftRank which allows the direct gradient optimisation of a smoothed version of NDCG using a Thurstonian model. In this approach, document scores are represented by the outputs of a neural network, and score distributions are created artificially by adding random noise to the scores. The SoftRank mechanism is a general one; it can be applied to different IR metrics, and make use of different underlying models. In this paper we extend the SoftRank framework to make use of the score uncertainties which are naturally provided by a Gaussian process (GP), which is a probabilistic non-linear regression model. We further develop the model by using sparse Gaussian process techniques, which give improved performance and efficiency, and show competitive results against baseline methods when tested on the publicly available LETOR OHSUMED data set. We also explore how the available uncertainty information can be used in prediction and how it affects model performance.
Publisher Association for Computing Machinery, Inc.
Copyright © 2007 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or email@example.com. The definitive version of this paper can be found at ACM’s Digital Library --http://www.acm.org/dl/.