Krysta M. Svore and Christopher J.C. Burges
The Web search ranking task has become increasingly important due to the rapid growth of the internet. With the growth of the Web and the number of Web search users, the amount of available training data for learning Web ranking models has also increased. We investigate the problem of learning to rank on a cluster using Web search data composed of 140,000 queries and approximately fourteen million URLs. For datasets much larger than this, distributed computing will become essential, due to both speed and memory constraints. We compare to a baseline algorithm that has been carefully engineered to allow training on the full dataset using a single machine, in order to evaluate the loss or gain incurred by the distributed algorithms we consider. The underlying algorithm we use is a boosted tree ranking algorithm called LambdaMART, where a split at a given vertex in each decision tree is determined by the split criterion for a particular feature. Our contributions are two-fold. First, we implement a method for improving the speed of training when the training data fits in main memory on a single machine by distributing the vertex split computations of the decision trees. The model produced is equivalent to the model produced from centralized training, but achieves faster training times. Second, we develop a training method for the case where the training data size exceeds the main memory of a single machine. Our second approach easily scales to far larger datasets, i.e., billions of examples, and is based on data distribution. Results of our methods on a real-world Web dataset indicate significant improvements in training speed.
Publisher Cambridge University Press
This chapter has been accepted for publication and will appear in a revised form in "Scaling Up Machine Learning: Parallel and Distributed Approaches" to be published by Cambridge University Press. (http://www.cambridge.org/us/knowledge/isbn/item6542017/?site_locale=en_US) Material on these pages is copyright Cambridge University Press or reproduced with permission from other copyright owners. It may be downloaded and printed for personal reference, but not otherwise copied, altered in any way or transmitted to others (unless explicitly stated otherwise) without the written permission of Cambridge University Press. Hypertext links to other Web locations are for the convenience of users and do not constitute any endorsement or authorisation by Cambridge University Press.