Di He, Liwei Wang, Wei Chen, and Tie-Yan Liu
1 November 2010
This paper theoretically verifies whether a ranking measure is appropriate for Web search, from the convergence point of view. Many ranking measures have been proposed in the literature, such as Precision@k, WTA and NDCG, most of which contain a position discount function to reflect users’ attention on top-ranked documents. A common practice to use these measures to evaluate ranking models is to check the values of these measures on a benchmark dataset, which usually contains a relatively small number of labeled documents per query. However, as we know, in real Web search, one usually needs to deal with an extremely large number of documents for many queries. Therefore, it is unclear whether the evaluation result in terms of a given measure on the benchmark dataset can consistently reflect the performance of the model in real Web search. If not, we think the ranking measure and the corresponding evaluation results cannot be used to select ranking models in a reliable manner. In this regard, we argue that the convergence of a ranking measure with the increasing number of documents is a dispensable property from both theory and application points of view. We then perform formal study on the convergence of ranking measures. Our theoretical analysis indicates that (i) when the discount function in a ranking measure decreases sharply with respect to positions (e.g., truncated at top k positions), the ranking measure will not converge; (ii) when the decrease is slow (e.g., with a logarithm discount), the ranking measure will converge to a model-independent constant; (ii) only when the decrease rate is in a certain range, the ranking measure can converge to a model-dependent value and be feasible for model selection. These findings can not only help us judge whether a ranking measure is good, but also provide a way of improving it. We have conducted experiments on both toy and real data. The corresponding experimental results well verified the above theoretical findings.