A Machine Learning Approach for Improved BM25 Retrieval

MSR-TR-2009-92

BM25 is one of the most widely used information retrieval functions because of its consistently high retrieval accuracy. Despite its widespread use, few studies have examined its effectiveness when the document description spans a single field or a combination of multiple fields. We evaluate the effectiveness of BM25 on various document fields. We find that BM25 models relevance on popularity fields, such as anchor text and query click information, no better than a linear function of the field attributes. We also find query click information to be the single most important field for retrieval. In response, we develop a machine learning approach to BM25-style retrieval that uses LambdaRank to learn from the input attributes of BM25. Our model significantly improves retrieval effectiveness whether the document description covers a single field or multiple fields. Our data-driven approach is fast, effective, avoids the problem of parameter tuning, and can directly optimize for several common information retrieval measures.
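For orientation, the following is a minimal sketch of the classic single-field BM25 scoring function that the abstract refers to. It is not taken from the report. It assumes the common Robertson/Sparck Jones IDF form and the frequently used defaults k1 = 1.2 and b = 0.75. The function name `bm25_score` and all parameter names are hypothetical, chosen for illustration. Its inputs (term frequency, document frequency, and field length) correspond to the kind of BM25 input attributes from which a learned model could be trained.

```python
import math


def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len,
               doc_freq, num_docs, k1=1.2, b=0.75):
    """Classic BM25 score of one document field for a query (illustrative).

    query_terms : list of query term strings
    doc_tf      : dict mapping term -> term frequency in this field
    doc_len     : length of this field in tokens
    avg_doc_len : average field length over the collection
    doc_freq    : dict mapping term -> number of documents containing it
    num_docs    : total number of documents in the collection
    k1, b       : assumed default tuning parameters (not from the report)
    """
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # terms absent from the field contribute nothing
        df = doc_freq.get(term, 0)
        # Robertson/Sparck Jones inverse document frequency (assumed variant);
        # it can be negative for terms found in more than half the documents.
        idf = math.log((num_docs - df + 0.5) / (df + 0.5))
        # Length normalization: longer-than-average fields are penalized.
        norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
        # Saturating term-frequency component.
        score += idf * tf * (k1 + 1.0) / (tf + norm)
    return score
```

The parameters k1 and b are the values that normally have to be tuned for each field. A learned ranking model trained directly on the raw attributes sidesteps that tuning step, which is the motivation stated in the abstract.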