LETOR is a package of benchmark data sets for research on LEarning TO Rank, which contains standard features, relevance judgments, data partitioning, evaluation tools, and several baselines. Version 1.0 was released in April 2007. Version 2.0 was released in Dec. 2007. Version 3.0 was released in Dec. 2008. This version, 4.0, was released in July 2009. Very different from previous versions (V3.0 is an update based on V2.0 and V2.0 is an update based on V1.0), LETOR4.0 is a totally new release. It uses the Gov2 web page collection (~25M pages) and two query sets from Million Query track of TREC 2007 and TREC 2008. We call the two query sets MQ2007 and MQ2008 for short. There are about 1700 queries in MQ2007 with labeled documents and about 800 queries in MQ2008 with labeled documents.
LETOR4.0 contains 8 datasets for four ranking settings derived from the two query
sets and the Gov2 web page collection. The 5-fold cross validation strategy is adopted
and the 5-fold partitions are included in the package. In each fold, there are three
subsets for learning: training set, validation set and testing set. The datasets
can be downloaded here.
- Supervised ranking
There are three versions for each dataset in this setting: NULL, MIN, QueryLevelNorm.
- NULL verion: Since some document may do not contain query terms, we use ``NULL'' to indicate language model features, for which would be minus infinity values. This version of the data cannot be directly be used for learning; the ``NULL'' should be processed first.
- MIN version: Replace the ``NULL'' value in NULL version with the minimal vale of this feature under a same query. This data can be directly used for learning.
- QueryLevelNorm version: Conduct query level normalization based on data in MIN version. This data can be directly used for learning. We further provide 5 fold partitions of this version for cross fold validation.
2 qid:10032 1:0.056537 2:0.000000 3:0.666667 4:1.000000 5:0.067138 ... 45:0.000000 46:0.076923 #docid = GX029-35-5894638 inc = 0.0119881192468859 prob = 0.139842
0 qid:10032 1:0.279152 2:0.000000 3:0.000000 4:0.000000 5:0.279152 ... 45:0.250000 46:1.000000 #docid = GX030-77-6315042 inc = 1 prob = 0.341364
0 qid:10032 1:0.130742 2:0.000000 3:0.333333 4:0.000000 5:0.134276 ... 45:0.750000 46:1.000000 #docid = GX140-98-13566007 inc = 1 prob = 0.0701303
1 qid:10032 1:0.593640 2:1.000000 3:0.000000 4:0.000000 5:0.600707 ... 45:0.500000 46:0.000000 #docid = GX256-43-0740276 inc = 0.0136292023050293 prob = 0.400738
- Semi-supervised ranking
The data format in this setting is the same as that in supervised ranking setting. The only difference is that the datasets in this setting contains both judged and undged query-document pair (in training set but not in validation and testing set) while the datasets in supervised ranking contain only judged query-document pair. The relevance label ``-1'' indicates the query-document pair is not judged. An example is shown as follow.
-1 qid:18219 1:0.022594 2:0.000000 3:0.250000 4:0.166667 ... 45:0.004237 46:0.081600 #docid = GX004-66-12099765 inc = -1 prob = 0.223732
0 qid:18219 1:0.027615 2:0.500000 3:0.750000 4:0.333333 ... 45:0.010291 46:0.046400 #docid = GX004-93-7097963 inc = 0.0428115405134536 prob = 0.860366
-1 qid:18219 1:0.018410 2:0.000000 3:0.250000 4:0.166667 ... 45:0.003632 46:0.033600 #docid = GX005-04-11520874 inc = -1 prob = 0.0980801
- Rank aggregation
In the setting, a query is associated with a set of input ranked lists. The task of rank aggregation is to output a better final ranked list by aggregating the multiple input lists. A row in the data indicate a query-document pair. Several rows are shown as below.
0 qid:10002 1:1 2:30 3:48 4:133 5:NULL ... 25:NULL #docid = GX008-86-4444840 inc = 1 prob = 0.086622
0 qid:10002 1:NULL 2:NULL 3:NULL 4:NULL 5:NULL ... 25:NULL #docid = GX037-06-11625428 inc = 0.0031586555555558 prob = 0.0897452
2 qid:10032 1:6 2:96 3:88 4:NULL 5:NULL ... 25:NULL #docid = GX029-35-5894638 inc = 0.0119881192468859 prob = 0.139842
The first column is relevance label of this pair, the second column is query id, the following columns are ranks of the document in the input ranked lists, and the end of the row is comment about the pair, including id of the document.In the above example, 2:30 means that the ranks of the document is 30 in the second input list. Note that large ranks mean top positions in the input ranked list, and ``NULL'' means the document does not appear in a ranked list. The larger the relevance label, the more relevant the query-document pair. There are 21 input lists in MQ2007-agg dataset and 25 input lists in MQ2008-agg dataset.
- Listwise ranking
The data format in the setting is very similar to that in supervised ranking. The difference is that the ground truth of this setting is a permutation for a query instead of multiple level relevance judgements. As shown in the following examples, the first column is the relevance degree of a document in ground truth permutation. Large value of the relevance degree means top position of the document in the permutation. The other columns are the same as that in the setting of supervised ranking.
1008 qid:10 1:0.004356 2:0.080000 3:0.036364 4:0.000000 ... 46:0.000000 #docid = GX057-59-4044939 inc = 1 prob = 0.698286
1007 qid:10 1:0.004901 2:0.000000 3:0.036364 4:0.333333 ... 46:0.000000 #docid = GX235-84-0891544 inc = 1 prob = 0.567746
1006 qid:10 1:0.019058 2:0.240000 3:0.072727 4:0.500000 ... 46:0.000000 #docid = GX016-48-5543459 inc = 1 prob = 0.775913
1005 qid:10 1:0.004901 2:0.160000 3:0.018182 4:0.666667 ... 46:0.000000 #docid = GX068-48-12934837 inc = 1 prob = 0.659932
More low level information
- Meta data
Meta data for all queries in the two query sets. The information can be used to reproduce some features like BM25 and LMIR, and can also be used to construct some new features. Meta data for MQ2007 query set (~ 60M). Meta data for MQ2008 query set (~ 50M). Collection info (~1 k).
- Link graph (~ 480M) of Gov2 collection
Each line contains the inlinks of a web page. The first column is the MSRA doc id of the web page, the second column is the number of inlinks of this page, and the following columns list the MSRA doc ids of all the inlinks of this page. The mapping from MSRA doc id to TREC doc id can be found here (~ 140M).
- Sitemap (~ 65M) of Gov2 collection
Each line is a web page. The first column is the MSRA doc id of the page, the second column is the depth of the url (number of slashes), the third column is the length of url (without ``http://"), the fourth column is the number of its child pages in the sitemap, the fifth column is the MSRA doc id of its parent page (-1 indicates no parent page).
- Similarity relation of Gov2 collection
The data is organized by queries. Similarity for MQ2007 query set (~ 4.3G), similarity for MQ2008 query set(part1 and part2, ~ 4.9G).
The order of queries in the two files is the same as that in Large_null.txt in the MQ2007-semi dataset and MQ2008-semi dataset.
The order of documents of a query in the two files is also the same as that in Large_null.txt in the MQ2007-semi dataset and MQ2008-semi dataset.
Each row in the similarity files describes the similarity between a page and all the other pages under a same query. Note that i-th row in the similiar files is exactly corresponding to the i-th row in Large_null.txt in MQ2007-semi dataset or MQ2008-semi dataset. Here is the an example line:
qid:10002 qdid:1 406:0.785623 178:0.785519 481:0.784446 63:0.741556 882:0.512454 ...
The first column shows the query id, and the second column shows the page index under the query. For example, for a query with 1000 web pages, the page index ranges from 1 to 1000. The following columns show the similarity between this page and the other pages. For example, 406:0.785623 indicates that the similarity between this page (with index 1 under the query) and the page (with index 406 under the query) is 0.785623. We sort the pages according to the descending order of similarity. The similarity between two pages is consine similarity between the contents of the two pages.
- The following people contributed to the the construction of the LETOR4.0 dataset: Tao Qin, Tie-Yan Liu, Wenkui Ding, Jun Xu, Hang Li, Ben Carterette, Javed Aslam, James Allan, Stephen Robertson, Virgil Pavlu, and Emine Yilmaz .
- We would like to thank the following teams to kindly and generiously share their runs submitted to TREC2007/2008: NEU team, U. Massachusetts team, I3S_Group_of_ICT team, ARSC team, IBM Haifa team, MPI-d5 team, Sabir.buckley team, HIT team, RMIT team, U. Amsterdam team, U. Melbourne team,
- If you have any questions or suggestions with this version, please kindly let us know. Our goal is to make the dataset reliable and useful for the community.