Asli Celikyilmaz, Gokhan Tur, and Dilek Hakkani-Tur
While data-driven methods for spoken language understanding (SLU) provide state of the art performances and reduce maintenance and model adaptation costs compared to handcrafted parsers, the collection and annotation of domain-specific natural language utterances for training remains a time-consuming task. A recent line of research has focused on enriching the training data with in-domain utterances by mining search engine query logs to improve the SLU tasks. However genre mismatch is a big obstacle as search queries are typically keywords. In this paper, we present an efficient discriminative binary classification method that filters large collection of online web search queries only to select the natural language like queries. The training data used to build this classifier is mined from search query click logs, represented as a bipartite graph. Starting from queries which contain natural language salient phrases, random graph walk algorithms are employed to mine corresponding keyword queries. Then an active learning method is employed for quickly improving on top of this automatically mined data. The results show that our method is robust to noise in search queries by improving over a baseline model previously used for SLU data collection. We also show the effectiveness of detected natural language like queries in extrinsic evaluations on domain detection and slot filling tasks.
Publisher Annual Conference of the International Speech Communication Association (Interspeech)