Semi-Supervised Learning of Semantic Classes for Query Understanding – from the Web and for the Web

  • Ye-Yi Wang ,
  • Raphael Hoffmann ,
  • Xiao Li ,
  • Jakub Szymanski

The 18th ACM Conference on Information and Knowledge Management |

Published by Association for Computing Machinery, Inc.

Understanding intents from search queries can improve a user’s search experience and boost a site’s advertising profits. Query tagging via statistical sequential labeling models has been shown to perform well, but annotating the training set for supervised learning requires substantial human effort. Domain-specific knowledge, such as semantic class lexicons, reduces the amount of needed manual annotations, but much human effort is still required to maintain these as search topics evolve over time.

This paper investigates semi-supervised learning algorithms that leverage structured data (HTML lists) from the Web to automatically generate semantic-class lexicons, which are used to improve query tagging performance – even with far less training data. We focus our study on understanding the correct objectives for the semi-supervised lexicon learning algorithms that are crucial for the success of query tagging. Prior work on lexicon acquisition has largely focused on the precision of the lexicons, but we show that precision is not important if the lexicons are used for query tagging. A more adequate criterion should emphasize a trade-off between maximizing the recall of semantic class instances in the data, and minimizing the confusability. This ensures that the similar levels of precision and recall are observed on both training and test set, hence prevents over-fitting the lexicon features in a sequential labeling model. Experimental results on retail product queries from a commercial search engine show that enhancing a query tagger with lexicons learned based on this objective reduces word level tagging errors by up to 25% compared to the baseline tagger that does not use any lexicon features. In contrast, lexicons obtained through a precision-centric learning algorithm even degrade the performance of a tagger compared to the baseline. Furthermore, the proposed method outperforms one in which semantic class lexicons have been extracted from a structured database.