A Comparative Study of Bing Web N-gram Language Models for Web Search and Natural Language Processing

  • Patrick Nguyen,
  • Xiaolong (Shiao-Long) Li,
  • Chris Thrasher,
  • Mu Li,
  • Kuansan Wang

Proceedings of the 33rd Annual ACM SIGIR Conference

Published by the Association for Computing Machinery, Inc.

This paper presents a comparative study of the recently released Bing Web N-gram Language Models (BWNLM) on three web search and natural language processing tasks: search query spelling correction, query reformulation, and statistical machine translation. BWNLM and the corresponding web services are far more accessible and easier to use than previously released text corpora for large language model training, such as the LDC English Gigaword corpus and the Google Web 1T N-gram corpus, because the BWNLM web services provide access to smoothed n-gram probabilities from a set of language models trained on different text fields of web documents as well as on search queries. Our results show that BWNLM outperform the n-gram models trained on the Gigaword corpus and the Google Web 1T N-gram corpus on all three tasks. In particular, the significant improvements in search query spelling correction and query reformulation achieved with BWNLM demonstrate the benefit of training multiple language models, in a principled way and with zero count cutoff, on different portions of web data and on search queries.
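As a rough illustration of how a client might consume such a service, the sketch below queries a hypothetical REST endpoint for the smoothed log probability of a phrase under an n-gram model trained on a chosen text field. The URL, parameter names, and catalog labels are placeholders for illustration only and do not reflect the actual BWNLM API.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint of a web n-gram language-model service
# (placeholder; not the real BWNLM service URL).
SERVICE_URL = "https://example.org/web-ngram/lookup"


def joint_log_prob(phrase: str, catalog: str = "query", order: int = 3) -> float:
    """Return the smoothed log probability of `phrase` under the n-gram
    model trained on the given text field (e.g., a web-document field or
    search queries), as reported by the (assumed) service."""
    params = urllib.parse.urlencode({
        "catalog": catalog,  # which language model to query
        "order": order,      # n-gram order
        "p": phrase,         # phrase to score
    })
    with urllib.request.urlopen(f"{SERVICE_URL}?{params}") as resp:
        # Assume the service returns the log probability as plain text.
        return float(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Example use in spelling correction: score a misspelled query and a
    # candidate correction under the query language model; the candidate
    # with the higher probability is preferred.
    for q in ("britny spears", "britney spears"):
        print(q, joint_log_prob(q, catalog="query"))
```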