Yahoo! Learning to Rank Challenge Datasets

Yahoo! Labs organizes a learning to rank challenge in March 2010 and releases two large-scale datasets for it. The challenge consists of two tracks: a standard learning to rank track and a transfer learning track. It is open to all research groups in academia and industry.

The datasets come from web search ranking and are a subset of what Yahoo! uses to train its ranking function. They consist of feature vectors extracted from query-url pairs, along with relevance judgments. A relevance judgment takes one of 5 values, from 0 (irrelevant) to 4 (perfectly relevant). The queries, urls, and feature descriptions are not disclosed; only the feature values are. There are two datasets for this challenge, each corresponding to a different country: a large one (labeled set1) and a small one (labeled set2). The two datasets are related but differ to some extent. Each dataset is divided into 3 sets: training, validation, and test.

The statistics for the various sets are as follows:

                       Set 1                          Set 2
              Train     Val      Test       Train    Val      Test
# queries     19,944    2,994    6,983      1,266    1,266    3,798
# urls        473,134   71,083   165,660    34,815   34,881   103,174
# features              519                          596
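
As a rough illustration of how the counts above relate to each other, the following Python sketch derives the average number of urls per query for each split. The dictionary layout and the function name urls_per_query are illustrative choices, not part of the dataset distribution; only the counts themselves come from the table.

```python
# Counts copied from the statistics table: (number of queries, number of urls).
# The key layout (dataset, split) is an illustrative choice, not an official format.
STATS = {
    ("set1", "train"): (19_944, 473_134),
    ("set1", "val"): (2_994, 71_083),
    ("set1", "test"): (6_983, 165_660),
    ("set2", "train"): (1_266, 34_815),
    ("set2", "val"): (1_266, 34_881),
    ("set2", "test"): (3_798, 103_174),
}


def urls_per_query(stats):
    """Return the mean number of urls per query for each (dataset, split)."""
    return {key: urls / queries for key, (queries, urls) in stats.items()}


averages = urls_per_query(STATS)
for (dataset, split), avg in averages.items():
    print(f"{dataset} {split}: {avg:.1f} urls per query")
```

Under these counts, every split averages somewhere between 20 and 30 urls per query, so the two datasets differ mainly in the number of queries rather than in the length of the per-query lists.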

There are 700 features in total. Some of them are defined in set1 or set2 only, while some others are defined in both sets. When a feature is undefined for a set, its value is 0. All the features have been normalized to be in the [0,1] range.
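
The feature conventions described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the official loading code: it assumes that the 700 features are the union of the 519 features defined in set1 and the 596 defined in set2, that feature ids run from 1 to 700, and that an example arrives as a mapping from feature id to value. The names overlap_counts and to_dense are hypothetical helpers introduced for illustration.

```python
N_FEATURES = 700  # total number of features stated in the text


def overlap_counts(n_set1=519, n_set2=596, n_total=N_FEATURES):
    """Inclusion-exclusion, ASSUMING n_total is the union of both feature sets."""
    shared = n_set1 + n_set2 - n_total
    return {
        "shared": shared,
        "set1_only": n_set1 - shared,
        "set2_only": n_set2 - shared,
    }


def to_dense(sparse, n_features=N_FEATURES):
    """Expand {feature_id: value} into a dense list of length n_features.

    Feature ids are ASSUMED to be 1-based. Ids absent from the mapping are
    treated as undefined and left at 0, matching the convention in the text.
    """
    dense = [0.0] * n_features
    for fid, value in sparse.items():
        if not 1 <= fid <= n_features:
            raise ValueError(f"feature id {fid} outside 1..{n_features}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"feature {fid} has value {value}, outside [0, 1]")
        dense[fid - 1] = value
    return dense


counts = overlap_counts()
example = to_dense({1: 0.5, 700: 1.0})  # a made-up example, not real data
```

If the union assumption holds, overlap_counts() gives 415 features defined in both sets, 104 defined in set1 only, and 181 defined in set2 only. Note that a value of 0 is ambiguous in the dense form: it may mean either that a feature is undefined for the set or that its normalized value is genuinely 0.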

More details can be found on the Yahoo! Learning to Rank Challenge website.
