Jianfeng Gao, Patrick Nguyen, Xiaolong Li, Chris Thrasher, Mu Li, and Kuansan Wang
23 July 2010
This paper presents a comparative study of the recently released Bing Web N-gram Language Models (BWNLM) on three web search and natural language processing tasks: search query spelling correction, query reformulation, and statistical machine translation. BWNLM and the corresponding web services are much more accessible and easier to use than previously released text corpora for large language model training, including the LDC English Gigaword corpus and the Google Web 1T N-gram corpus, because the BWNLM web services provide access to smoothed n-gram probabilities from a set of language models trained on different text fields of web documents as well as on search queries. Our results show that BWNLM outperform the n-gram models trained on the Gigaword corpus and the Google Web 1T N-gram corpus on all three tasks. In particular, the significant improvements on search query spelling correction and search query reformulation resulting from BWNLM demonstrate the benefit of training multiple language models on different portions of web data and search queries in a principled way with zero count cutoff.
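As a rough illustration of what a "smoothed n-gram probability" is and how it could rank spelling-correction candidates, the following is a minimal sketch under stated assumptions. It does not call the BWNLM web services and does not reproduce the smoothing used by BWNLM: the toy query list, the function names, and the add-k smoothing scheme are all hypothetical stand-ins chosen for brevity. Every observed bigram is kept, which mirrors the idea of a zero count cutoff on a very small scale.

```python
import math
from collections import Counter

def train_bigram_counts(sentences):
    """Count unigrams and bigrams, keeping every observed n-gram (no count cutoff)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_logprob(prev, word, unigrams, bigrams, k=0.1):
    """Add-k smoothed log P(word | prev); unseen bigrams still get a nonzero score.

    Add-k smoothing is an assumption made for this sketch, not the BWNLM method.
    """
    vocab_size = len(unigrams)
    return math.log((bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size))

def score(text, unigrams, bigrams, k=0.1):
    """Sum of smoothed bigram log-probabilities over the whole query."""
    tokens = ["<s>"] + text.lower().split() + ["</s>"]
    return sum(bigram_logprob(p, w, unigrams, bigrams, k)
               for p, w in zip(tokens, tokens[1:]))

# Hypothetical toy "query log"; a real model would be trained on web-scale data.
toy_queries = [
    "new york weather",
    "new york hotels",
    "york university",
    "new jersey weather",
]
unigrams, bigrams = train_bigram_counts(toy_queries)

# Rank two candidate spellings of the same query by language model score.
candidates = ["new york weather", "new yolk weather"]
best = max(candidates, key=lambda c: score(c, unigrams, bigrams))
print(best)
```

In this sketch the candidate containing the unseen word receives a low but finite score, so the language model can still compare it against the better-attested spelling instead of rejecting it outright.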
In Proceedings of the 33rd Annual ACM SIGIR Conference
Publisher: Association for Computing Machinery, Inc.
Copyright © 2007 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or email@example.com. The definitive version of this paper can be found at ACM’s Digital Library: http://www.acm.org/dl/.