Collecting High Quality Overlapping Labels at Low Cost

Proceedings of SIGIR

Published by Association for Computing Machinery, Inc.

This paper studies the quality of human labels used to train search engine rankers. Our specific focus is the performance improvement obtained by using overlapping labels, that is, by collecting multiple human judgments for each training sample. We present a new method of effectively and efficiently producing and using overlapping labels to improve data quality and search engine accuracy. We explore whether, when, and for which data points one should obtain multiple expert training labels, as well as what to do with them once they have been obtained. The proposed labeling scheme collects multiple overlapping labels only for a subset of training samples, specifically those labeled relevant by a single judge. Our experiments show that this labeling scheme improves the NDCG of both LambdaRank and LambdaMART rankers on several real-world Web test sets, with a low labeling overhead of around 1.4 labels per sample. Moreover, these NDCG improvements are at least as good as those obtained by collecting multiple overlapping labels on the entire data set. The scheme also outperforms several other methods of using overlapping labels, such as simple k-overlap, majority vote, and taking the highest label. Finally, the paper presents a study of how many overlapping labels are needed to get the best improvement in search engine retrieval accuracy.
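The selective scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `collect_selective_overlap`, `judge`, `is_relevant`, and `extra_labels` are hypothetical, and both the relevance threshold and the number of extra labels per relevant sample are assumptions made only for illustration.

```python
from typing import Callable, Dict, Hashable, Iterable, List


def collect_selective_overlap(
    samples: Iterable[Hashable],
    judge: Callable[[Hashable], int],
    is_relevant: Callable[[int], bool],
    extra_labels: int = 2,  # assumed value, not taken from the paper
) -> Dict[Hashable, List[int]]:
    """Collect one label per sample, plus extra labels only when the
    first judge marks the sample as relevant."""
    labels: Dict[Hashable, List[int]] = {}
    for sample in samples:
        first = judge(sample)          # every sample gets one judgment
        collected = [first]
        if is_relevant(first):         # overlap only for "relevant" samples
            collected.extend(judge(sample) for _ in range(extra_labels))
        labels[sample] = collected
    return labels


def labels_per_sample(labels: Dict[Hashable, List[int]]) -> float:
    """Average number of labels collected per sample (labeling overhead)."""
    return sum(len(v) for v in labels.values()) / len(labels)
```

In a toy setting where one in five samples is judged relevant and each relevant sample receives two extra labels, this sketch yields an overhead of 1.4 labels per sample. That figure is chosen only to show how such an overhead can arise; the abstract does not state the relevant fraction or the number of extra labels used in the experiments.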