LETOR is a package of benchmark data sets for research on LEarning TO Rank. LETOR3.0 contains standard features, relevance judgments, data partitioning, evaluation tools, and several baselines, for the OHSUMED data collection and the '.gov' data collection. Version 1.0 was released in April 2007. Version 2.0 was released in Dec. 2007. Version 3.0 was released in Dec. 2008.
- Similarity relation of OHSUMED collection is released.
- Sitemap of Gov collection is released.
- Link graph of Gov collection is released.
What's new in LETOR3.0?
LETOR3.0 contains several significant updates compared with version 2.0:
- Four new datasets are added: homepage finding 2003, homepage finding 2004, named page finding 2003 and named page finding 2004. Together with the three datasets from LETOR2.0 (OHSUMED, topic distillation 2003 and topic distillation 2004), LETOR3.0 contains seven datasets.
- A new document sampling strategy is used for each query, so the three datasets carried over from LETOR2.0 differ from their LETOR2.0 versions;
- New low level features for learning;
- Meta data is provided for better investigation of ranking features;
- More baselines;
Introduction to LETOR3.0 datasets
Please access this page for download.
A brief description about the directory tree is as follows:
|Folder or file||Description|
|Letor.pdf||An incomplete document about the whole dataset.|
|EvaluationTool||The evaluation tools|
|Gov||Contains the 6 datasets in .Gov|
|Gov\Meta||Meta data for all queries in the 6 datasets in .Gov. The information can be used to extract new features.|
|Gov\Feature_null||Original feature files of the 6 datasets in .Gov. Since some documents may not contain any query terms, we use "NULL" to mark language model features whose values would otherwise be minus infinity.|
|Gov\Feature_min||The "NULL" values in Gov\Feature_null are replaced with the minimal value of the same feature under the same query. This data can be directly used for learning.|
|Gov\QueryLevelNorm||Query-level normalization applied to the data files in Gov\Feature_min. This data can be directly used for learning.|
|OHSUMED||Contains the OHSUMED dataset|
|OHSUMED\Meta||Meta data for all queries in OHSUMED. The information can be used to extract new features.|
|OHSUMED\Feature_null||Original feature files of OHSUMED. Since some documents may not contain any query terms, we use "NULL" to mark language model features whose values would otherwise be minus infinity.|
|OHSUMED\Feature_min||The "NULL" values in OHSUMED\Feature_null are replaced with the minimal value of the same feature under the same query. This data can be directly used for learning.|
|OHSUMED\QueryLevelNorm||Query-level normalization applied to the data files in OHSUMED\Feature_min. This data can be directly used for learning.|
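The relationship between the Feature_null, Feature_min and QueryLevelNorm folders can be illustrated with a minimal Python sketch. It is not the tool used to build the release. The in-memory row layout (a query id plus a list of feature values, with `None` standing in for "NULL"), the function names, the fallback of 0.0 when a feature is NULL for every document of a query, and the choice of min-max scaling for the query-level normalization are all assumptions made for illustration.

```python
from collections import defaultdict


def fill_null_with_query_min(rows):
    """rows: list of (qid, features), where a feature is a float or None (NULL).

    Returns a new list in which each None is replaced by the minimum of that
    feature over the non-NULL values of the same query.
    """
    mins = defaultdict(dict)  # qid -> {feature index -> minimum seen so far}
    for qid, feats in rows:
        for j, v in enumerate(feats):
            if v is not None:
                cur = mins[qid].get(j)
                mins[qid][j] = v if cur is None else min(cur, v)
    filled = []
    for qid, feats in rows:
        # Assumption: fall back to 0.0 if the feature is NULL for the whole query.
        filled.append(
            (qid, [mins[qid].get(j, 0.0) if v is None else v
                   for j, v in enumerate(feats)])
        )
    return filled


def query_level_normalize(rows):
    """Min-max scale every feature to [0, 1] within each query (assumed scheme)."""
    by_query = defaultdict(list)
    for qid, feats in rows:
        by_query[qid].append(feats)
    bounds = {
        qid: [(min(col), max(col)) for col in zip(*group)]
        for qid, group in by_query.items()
    }
    normalized = []
    for qid, feats in rows:
        scaled = [
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant feature -> 0.0
            for v, (lo, hi) in zip(feats, bounds[qid])
        ]
        normalized.append((qid, scaled))
    return normalized
```

Both functions work per query, so statistics from one query never leak into another, which matches the "under the same query" wording in the table.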
More Information
After the release of LETOR3.0, we received many valuable suggestions and much feedback. In response, we are releasing more information about the datasets.
- Similarity relation of OHSUMED collection
Similarity relation. The data is organized by query. The order of queries in the file is the same as that in OHSUMED\Feature_null\ALL\OHSUMED.txt. The documents of a query in the similarity file are also in the same order as in the OHSUMED\Feature_null\ALL\OHSUMED.txt file. The similarity graph among the documents under a specific query is encoded as an upper triangular matrix. Here is an example for one query:
S(1,2) S(1,3) S(1,4) ... S(1,N)
S(2,3) S(2,4) ... S(2,N)
in which N is the number of documents under this query, and S(i,j) is the similarity between the i-th and j-th documents of the query. We simply use the cosine similarity between the contents of the two documents.
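The upper triangular layout above can be expanded into a full symmetric matrix with a short Python sketch. It assumes that the values on each row are separated by whitespace and that the N-1 rows of one query have already been isolated from the file; how consecutive queries are delimited is not described here, so that step is left out. The function name is hypothetical, and setting the diagonal to 1.0 is an assumption (the cosine similarity of a non-empty document with itself).

```python
def read_upper_triangle(lines):
    """Expand the rows S(1,2)..S(1,N), S(2,3)..S(2,N), ... into an N x N matrix.

    lines: the N-1 text rows belonging to a single query.
    Returns a list of lists with sim[i][j] == sim[j][i] (0-based indices).
    """
    rows = [[float(x) for x in line.split()] for line in lines if line.strip()]
    n = len(rows) + 1  # row i holds N-1-i values, so N-1 rows imply N documents
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sim[i][i] = 1.0  # assumption: self-similarity is 1.0
    for i, row in enumerate(rows):
        if len(row) != n - 1 - i:
            raise ValueError(f"row {i + 1} has {len(row)} values, expected {n - 1 - i}")
        for k, value in enumerate(row):
            j = i + 1 + k
            sim[i][j] = value
            sim[j][i] = value  # the similarity is symmetric
    return sim
```

Storing only the upper triangle halves the file size; the length check catches rows that were truncated or merged while splitting the file by query.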
- Sitemap of Gov collection
Sitemap. Each line is a web page. The first column is the MSRA doc id of the page, the second column is the depth of the URL (number of slashes), the third column is the length of the URL (without "http://"), the fourth column is the number of its child pages in the sitemap, and the fifth column is the MSRA doc id of its parent page (-1 indicates no parent page).
Mapping from MSRA doc id to TREC doc id
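As a rough illustration of the five sitemap columns, the following Python sketch parses lines into records and rebuilds the parent-to-children relation from the parent column. It assumes whitespace-separated columns and integer MSRA doc ids, neither of which is stated above; `SitemapEntry`, `parse_sitemap_line` and `build_children` are hypothetical names.

```python
from collections import defaultdict
from typing import NamedTuple


class SitemapEntry(NamedTuple):
    doc_id: int        # MSRA doc id of the page
    depth: int         # depth of the URL (number of slashes)
    url_length: int    # length of the URL without "http://"
    num_children: int  # number of child pages in the sitemap
    parent_id: int     # MSRA doc id of the parent page, -1 if none


def parse_sitemap_line(line):
    """Parse one sitemap line (assumed: five whitespace-separated integers)."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}")
    return SitemapEntry(*(int(f) for f in fields))


def build_children(entries):
    """Group page ids by parent id; pages whose parent is -1 are roots."""
    children = defaultdict(list)
    roots = []
    for entry in entries:
        if entry.parent_id == -1:
            roots.append(entry.doc_id)
        else:
            children[entry.parent_id].append(entry.doc_id)
    return children, roots
```

Comparing `len(children[e.doc_id])` with `e.num_children` for each entry is a cheap consistency check on a parsed file.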
- Link graph of Gov collection
Link graph. Each line is a hyperlink. The first column is the MSRA doc id of the source of the hyperlink, and the second column is the MSRA doc id of the destination of the hyperlink.
Mapping from MSRA doc id to TREC doc id
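The two-column link graph can be loaded into an adjacency structure with a few lines of Python. This is a sketch under assumptions: columns are taken to be whitespace-separated, doc ids are taken to be integers, and repeated source-destination pairs are counted once, which is a design choice of this example rather than something stated about the data. The function name is hypothetical.

```python
from collections import defaultdict


def read_link_graph(lines):
    """Build out-link sets and in-degree counts from 'source destination' lines."""
    out_links = defaultdict(set)  # source doc id -> set of destination doc ids
    in_degree = defaultdict(int)  # destination doc id -> number of distinct sources
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        src, dst = int(fields[0]), int(fields[1])
        if dst not in out_links[src]:
            out_links[src].add(dst)
            in_degree[dst] += 1
    return out_links, in_degree
```

In-degree and out-degree computed this way are simple examples of link-based features that can be derived from the graph.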
- The old version of LETOR can be found here.
- The following people contributed to the construction of the LETOR dataset: Tao Qin, Tie-Yan Liu, Jun Xu, Chaoliang Zhong, Kang Ji, and Hang Li.
- If you have any questions or suggestions about this version, please let us know. Our goal is to make the dataset reliable and useful for the community.