Although modern retrieval systems typically rank documents using a multitude of features, the backbone of search ranking is usually a standard tf.idf retrieval model.
We address a limitation of the fundamental retrieval models: their modeling of the vocabulary mismatch between query terms and relevant documents. Vocabulary mismatch occurs when query terms fail to appear in documents that are relevant to the query, which degrades retrieval. Mismatch is a well-known and long-standing problem in retrieval. However, it is not well understood how often query terms mismatch relevant documents, nor how mismatch affects retrieval performance. We formally define term mismatch. We show that mismatch is very common in search, that it explains common failures of current retrieval models and the behavior of many retrieval techniques, and that a large potential gain is possible simply by making retrieval models mismatch-aware. We also demonstrate several initial successes in addressing term mismatch in retrieval using novel mismatch prediction methods and theoretically motivated retrieval techniques, which could lead to even larger gains in retrieval.
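To make the notion of mismatch concrete, the following minimal sketch estimates, for a single query term, the fraction of relevant documents in which the term does not appear. It illustrates the informal description above and is not the formal definition developed in this work. The function names `tokenize` and `term_mismatch_rate`, the toy documents, and the whitespace-and-punctuation tokenizer (no stemming) are all assumptions made for this example.

```python
import re


def tokenize(text):
    # Hypothetical tokenizer: lowercase alphanumeric runs, no stemming.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def term_mismatch_rate(term, relevant_docs):
    """Fraction of the given relevant documents that do not contain `term`."""
    if not relevant_docs:
        raise ValueError("need at least one relevant document")
    term = term.lower()
    missing = sum(1 for doc in relevant_docs if term not in tokenize(doc))
    return missing / len(relevant_docs)


# Toy relevance judgments for an imagined query about influenza vaccines.
relevant_docs = [
    "Vaccines reduce the spread of influenza.",
    "Flu shots are recommended every autumn.",
    "Influenza immunization lowers hospital admissions.",
    "Annual flu vaccination protects older adults.",
]

rates = {t: term_mismatch_rate(t, relevant_docs) for t in ["influenza", "vaccine"]}
for term, rate in rates.items():
    print(f"{term}: {rate:.2f}")
```

In this toy collection the term "vaccine" never matches exactly, because the relevant documents use "vaccines", "vaccination", "shots", and "immunization" instead. A real estimate would depend on the tokenization and stemming choices and on the available relevance judgments.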