Comparing the Sensitivity of Information Retrieval Metrics

Information retrieval effectiveness is usually evaluated using measures such as Normalized Discounted Cumulative Gain (NDCG), Mean Average Precision (MAP) and Precision at some cutoff (Precision@k) on a set of judged queries. Recent research has suggested an alternative, evaluating information retrieval systems based on user behavior. Particularly promising are experiments that interleave two rankings and track user clicks. According to a recent study, interleaving experiments can identify large differences in retrieval effectiveness with much better reliability than other click-based methods.

We study interleaving in more detail, comparing it with traditional measures in terms of reliability, sensitivity and agreement. To detect very small differences in retrieval effectiveness, a reliable outcome with standard metrics requires about 5,000 judged queries, and this is about as reliable as interleaving with 50,000 user impressions. Amongst the traditional measures, NDCG has the strongest correlation with interleaving. Finally, we present some new forms of analysis, including an approach to enhance interleaving sensitivity.

PDF file

In  Proceedings of SIGIR

Publisher  Association for Computing Machinery, Inc.
Copyright © 2010 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or The definitive version of this paper can be found at ACM’s Digital Library --


> Publications > Comparing the Sensitivity of Information Retrieval Metrics