Jinyoung Kim, Gabriella Kazai, and Imed Zitouni
Evaluation of information retrieval (IR) systems has recently begun to explore preference judgments over two search result lists. Unlike the traditional method of collecting relevance labels for each result individually, this method allows assessors to consider the interaction between search results as part of the judging criteria. For example, one result set may be preferred over another if its relevant results are more diverse and cover more user intents. In this paper, we investigate how assessors determine their preference for one set of results over another, with the aim of understanding the role of relevance dimensions in preference-based evaluation. We run a series of experiments collecting overall and per-dimension preferences in side-by-side comparisons of two search result lists, as well as relevance judgments for the individual documents. Our analysis of the collected judgments reveals that preference judgments combine multiple dimensions of relevance that go beyond the traditional notion of relevance centered on topicality (aboutness). Measuring performance based on traditional single-document judgments and NDCG aligns well with our topicality-based relevance dimension preferences, but shows misalignment with the overall preferences, largely due to the diversity dimension. As a judging method, our dimensional preference judging leads to improved judgment quality.
Published in: The 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2013, Dublin, Ireland