Ye-Yi Wang, Raphael Hoffmann, Xiao Li, and Jakub Szymanski
Understanding intents from search queries can improve a user's search experience and boost a site's advertising profits. Query tagging via statistical sequential labeling models has been shown to perform well, but annotating the training set for supervised learning requires substantial human effort. Domain-specific knowledge, such as semantic class lexicons, reduces the amount of manual annotation needed, but much human effort is still required to maintain these lexicons as search topics evolve over time.
This paper investigates semi-supervised learning algorithms that leverage structured data (HTML lists) from the Web to automatically generate semantic-class lexicons, which are used to improve query tagging performance, even with far less training data. We focus our study on understanding the correct objectives for the semi-supervised lexicon learning algorithms that are crucial for the success of query tagging. Prior work on lexicon acquisition has largely focused on the precision of the lexicons, but we show that precision is not important if the lexicons are used for query tagging. A more adequate criterion should emphasize a trade-off between maximizing the recall of semantic class instances in the data and minimizing confusability. This ensures that similar levels of precision and recall are observed on both the training and test sets, and hence prevents over-fitting of the lexicon features in a sequential labeling model. Experimental results on retail product queries from a commercial search engine show that enhancing a query tagger with lexicons learned under this objective reduces word-level tagging errors by up to 25% compared to a baseline tagger that does not use any lexicon features. In contrast, lexicons obtained through a precision-centric learning algorithm even degrade the performance of the tagger compared to the baseline. Furthermore, the proposed method outperforms one in which semantic class lexicons have been extracted from a structured database.
In The 18th ACM Conference on Information and Knowledge Management
Publisher Association for Computing Machinery, Inc.
Copyright © 2007 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or firstname.lastname@example.org. The definitive version of this paper can be found at ACM’s Digital Library: http://www.acm.org/dl/.