Yahoo! Learning to Rank Challenge Datasets
Yahoo! Labs organized a learning to rank challenge in March 2010, for which two large-scale datasets were released. The challenge consisted of two tracks: a standard learning to rank track and a transfer learning track. It was open to all research groups in academia and industry.
The datasets come from web search ranking and are a subset of what Yahoo! uses to train its ranking function. They consist of feature vectors extracted from query-url pairs, along with relevance judgments. The relevance judgments take 5 different values, from 0 (irrelevant) to 4 (perfectly relevant). The queries, urls, and feature descriptions are not disclosed; only the feature values are. There are two datasets for this challenge, each corresponding to a different country: a large one (labeled set1) and a small one (labeled set2). The two datasets are related but differ to some extent. Each dataset is divided into 3 sets: training, validation, and test.
The statistics for the various sets are as follows:
| | Set 1 Train | Set 1 Val | Set 1 Test | Set 2 Train | Set 2 Val | Set 2 Test |
| # queries | 19,944 | 2,994 | 6,983 | 1,266 | 1,266 | 3,798 |
| # urls | 473,134 | 71,083 | 165,660 | 34,815 | 34,881 | 103,174 |
| # features | 519 | 519 | 519 | 596 | 596 | 596 |
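As a quick sanity check on these statistics, the average number of urls per query can be derived from the counts in the table. The sketch below only restates the table's numbers; the dictionary layout and the helper name `urls_per_query` are illustrative choices, not part of the dataset.

```python
# (queries, urls) per split, copied from the statistics table.
stats = {
    ("set1", "train"): (19_944, 473_134),
    ("set1", "val"): (2_994, 71_083),
    ("set1", "test"): (6_983, 165_660),
    ("set2", "train"): (1_266, 34_815),
    ("set2", "val"): (1_266, 34_881),
    ("set2", "test"): (3_798, 103_174),
}


def urls_per_query(queries, urls):
    """Average number of judged urls per query in one split."""
    return urls / queries


averages = {key: urls_per_query(q, u) for key, (q, u) in stats.items()}

for (dataset, split), avg in averages.items():
    print(f"{dataset} {split}: {avg:.1f} urls per query")
```

With these counts, set1 averages roughly 24 urls per query and set2 roughly 27 to 28, so the two datasets differ in list length as well as in size.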
There are 700 features in total. Some of them are defined in set1 or set2 only, while some others are defined in both sets. When a feature is undefined for a set, its value is 0. All the features have been normalized to be in the [0,1] range.
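The zero-fill convention can be illustrated with a small parser that expands a sparse record into a dense 700-dimensional vector. This is a minimal sketch under stated assumptions: the text above does not specify the file format, so the SVMlight-style line layout (`<label> qid:<id> <index>:<value> ...`), the 1-based feature indices, and the function name `parse_line` are all assumptions made for illustration.

```python
NUM_FEATURES = 700  # total number of features stated in the text


def parse_line(line, num_features=NUM_FEATURES):
    """Parse one '<label> qid:<id> <index>:<value> ...' line.

    The line layout and 1-based indices are assumptions, not
    something the dataset description confirms.
    """
    tokens = line.split()
    label = int(tokens[0])
    if not 0 <= label <= 4:
        raise ValueError(f"relevance label out of range: {label}")
    qid = tokens[1].split(":", 1)[1]

    # Features that are absent (undefined for this set) stay at 0.
    features = [0.0] * num_features
    for token in tokens[2:]:
        index, value = token.split(":", 1)
        value = float(value)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"feature {index} outside [0,1]: {value}")
        features[int(index) - 1] = value
    return label, qid, features


# Hypothetical record: relevance 2, query 1, two defined features.
label, qid, features = parse_line("2 qid:1 3:0.25 700:1.0")
```

Using a fixed-length vector for both sets keeps feature positions aligned between set1 and set2, which matters when a model trained on one set is applied to the other.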
More details can be found at Yahoo! Learning to Rank Challenge.

