LETOR is a package of benchmark data sets for LEarning TO Rank, released by Microsoft Research Asia.
Ranking is the central problem in many applications, and using machine learning to learn the ranking function has become a promising research direction. However, the lack of public benchmark datasets (with standard features, relevance judgments, data partitioning, and evaluation metrics) makes existing work difficult to compare.
To solve this problem, in LETOR we extracted features for each query-document pair in the OHSUMED and TREC collections, which are widely used in the information retrieval (IR) literature. The extracted features cover most of the 'standard' features in IR, including classical features (such as term frequency, inverse document frequency, BM25, and language models for IR) and features proposed in recent SIGIR papers (such as HostRank, feature propagation, and Topical PageRank). Note that the original documents in the OHSUMED and TREC collections cannot be reconstructed from these features. We benchmarked several state-of-the-art ranking models with these features and provide baseline results for future studies. We also released an evaluation tool so that, by using this single tool, the experimental results of different methods can be compared easily and impartially.
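As an illustration of the classical features mentioned above, the following is a minimal sketch of how term frequency, inverse document frequency, and BM25 might be computed for one query-document pair. It is not the extraction code used to build LETOR: the function names, the smoothed IDF variant, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for the example, and documents are assumed to be pre-tokenized lists of terms.

```python
import math
from collections import Counter


def idf(term, docs):
    """Smoothed inverse document frequency (one common BM25 variant; an assumption)."""
    n = len(docs)
    df = sum(1 for d in docs if term in d)  # number of documents containing the term
    return math.log((n - df + 0.5) / (df + 0.5) + 1.0)


def bm25(query, doc, docs, k1=1.2, b=0.75):
    """BM25 score of `doc` for `query`; k1 and b are illustrative defaults."""
    tf = Counter(doc)
    avgdl = sum(len(d) for d in docs) / len(docs)  # average document length
    score = 0.0
    for term in query:
        f = tf[term]
        if f == 0:
            continue  # terms absent from the document contribute nothing
        norm = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf(term, docs) * f * (k1 + 1) / norm
    return score


def features(query, doc, docs):
    """Hypothetical feature dictionary for one query-document pair."""
    tf = Counter(doc)
    return {
        "tf": sum(tf[t] for t in query),         # summed query-term frequency
        "idf": sum(idf(t, docs) for t in query), # summed IDF of query terms
        "bm25": bm25(query, doc, docs),
    }


# Toy corpus of tokenized documents (made up for the example).
corpus = [["rank", "learn"], ["rank", "model", "model"], ["query"]]
print(features(["model"], corpus[1], corpus))
```

A vector of such values per query-document pair is the kind of representation a ranking model can be trained on without access to the original document text.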
Download the TREC dataset, the OHSUMED dataset, and the evaluation tools.
1) The following people contributed to the construction of the LETOR dataset: Tie-Yan Liu, Jun Xu, Tao Qin, Wenying Xiong, Min Lu, Zhen Liao, Mingfeng Tsai, Taifeng Wang, and Hang Li.
2) Please cite the following paper when you use the LETOR dataset in your research:
Tie-Yan Liu, Jun Xu, Tao Qin, Wenying Xiong, and Hang Li. LETOR: Benchmark dataset for research on learning to rank for information retrieval. LR4IR 2007, in conjunction with SIGIR 2007.
3) If you find any problems in this dataset, please let us know. We will update it to fix them. Our goal is to make the dataset reliable and useful for the community.