- Grégoire Mesnil, Yann Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, Dilek Hakkani-Tur, Xiaodong He, Larry Heck, Gokhan Tur, Dong Yu, and Geoffrey Zweig, Using Recurrent Neural Networks for Slot Filling in Spoken Language Understanding, in IEEE/ACM Transactions on Audio, Speech, and Language Processing, IEEE – Institute of Electrical and Electronics Engineers, March 2015.
Semantic slot filling is one of the most challenging problems in spoken language understanding (SLU). In this paper, we propose to use recurrent neural networks (RNNs) for this task, and present several novel architectures designed to efficiently model past and future temporal dependencies. Specifically, we implemented and compared several important RNN architectures, including Elman, Jordan, and hybrid variants. To facilitate reproducibility, we implemented these networks with the publicly available Theano neural network toolkit and completed experiments on the well-known airline travel information system (ATIS) benchmark. In addition, we compared the approaches on two custom SLU data sets from the entertainment and movies domains. Our results show that the RNN-based models outperform the conditional random field (CRF) baseline by 2% in absolute error reduction on the ATIS benchmark. We improve the state-of-the-art by 0.5% in the Entertainment domain, and 6.7% for the movies domain.
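The Elman and Jordan variants named in this abstract differ in what is fed back at each time step: an Elman network feeds back the previous hidden state, while a Jordan network feeds back the previous output. The following is a minimal pure-Python sketch of that difference only. It is not the paper's Theano implementation, it models past context only, and the function names, weight matrices, and toy inputs are all illustrative assumptions.

```python
import math

def matvec(matrix, vector):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sigmoid_vec(vector):
    return [1.0 / (1.0 + math.exp(-x)) for x in vector]

def softmax(vector):
    # Subtract the maximum for numerical stability.
    top = max(vector)
    exps = [math.exp(x - top) for x in vector]
    total = sum(exps)
    return [e / total for e in exps]

def run_rnn(inputs, w_in, w_rec, w_out, kind):
    """Run a simple recurrent tagger over a sequence of input vectors.

    kind="elman": the recurrent input is the previous hidden state.
    kind="jordan": the recurrent input is the previous output distribution.
    """
    n_hidden, n_out = len(w_in), len(w_out)
    feedback = [0.0] * (n_hidden if kind == "elman" else n_out)
    outputs = []
    for x in inputs:
        pre = [a + b for a, b in zip(matvec(w_in, x), matvec(w_rec, feedback))]
        hidden = sigmoid_vec(pre)
        y = softmax(matvec(w_out, hidden))
        outputs.append(y)
        feedback = hidden if kind == "elman" else y
    return outputs

# Hypothetical toy weights: 2 input features, 2 hidden units, 3 slot labels.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_in = [[0.5, -0.3], [0.1, 0.8]]
w_out = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.6]]
w_rec_elman = [[0.1, 0.2], [-0.2, 0.3]]               # hidden x hidden
w_rec_jordan = [[0.3, -0.1, 0.2], [0.0, 0.5, -0.4]]   # hidden x output

elman_out = run_rnn(inputs, w_in, w_rec_elman, w_out, "elman")
jordan_out = run_rnn(inputs, w_in, w_rec_jordan, w_out, "jordan")
```

With zero initial feedback the two variants agree on the first step and diverge afterwards, because the recurrent matrix then multiplies different quantities.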
- Daniel Guo, Gokhan Tur, Wen-tau Yih, and Geoffrey Zweig, Joint Semantic Utterance Classification and Slot Filling with Recursive Neural Networks, in 2014 IEEE Spoken Language Technology Workshop (SLT 2014), IEEE – Institute of Electrical and Electronics Engineers, 10 December 2014.
In recent years, continuous space models have proven to be highly effective at language processing tasks ranging from paraphrase detection to language modeling. These models are distinctive in their ability to achieve generalization through continuous space representations, and compositionality through arithmetic operations on those representations. Examples of such models include feed-forward and recurrent neural network language models. Recursive neural networks (RecNNs) extend this framework by providing an elegant mechanism for incorporating both discrete syntactic structure and continuous-space word and phrase representations into a powerful compositional model. In this paper, we show that RecNNs can be used to perform the core spoken language understanding (SLU) tasks in a spoken dialog system, more specifically domain and intent determination, concurrently with slot filling, in one jointly trained model. We find that a very simple RecNN model achieves competitive performance on the benchmark ATIS task, as well as on a Microsoft Cortana conversational understanding task.
- Yun-Nung Vivian Chen, Dilek Hakkani-Tur, and Gokhan Tur, Deriving Local Relational Surface Forms from Dependency-Based Entity Embeddings for Unsupervised Spoken Language Understanding, IEEE – Institute of Electrical and Electronics Engineers, December 2014.
Recent works showed the trend of leveraging web-scaled structured semantic knowledge resources such as Freebase for open domain spoken language understanding (SLU). Knowledge graphs provide sufficient but ambiguous relations for the same entity, which can be used as statistical background knowledge to infer possible relations for interpretation of user utterances. This paper proposes an approach to capture the relational surface forms by mapping dependency-based contexts of entities from the text domain to the spoken domain. Relational surface forms are learned from dependency-based entity embeddings, which encode the contexts of entities from dependency trees in a deep learning model. The derived surface forms carry functional dependency to the entities and convey the explicit expression of relations. The experiments demonstrate the efficiency of leveraging derived relational surface forms as local cues together with prior background knowledge.
- Qi Li, Gokhan Tur, Dilek Hakkani-Tur, Xiang Li, Tim Paek, Asela Gunawardana, and Chris Quirk, Distributed open-domain conversational understanding framework with domain independent extractors, in IEEE Spoken Language Technology Workshop, IEEE – Institute of Electrical and Electronics Engineers, December 2014.
Traditional spoken dialog systems are usually based on a centralized architecture, in which the number of domains is predefined, and the provider is fixed for a given domain and intent. The spoken language understanding (SLU) component is responsible for detecting domain and intents, and filling domain-specific slots. It is expensive and time-consuming in this architecture to add new and/or competing domains, intents, or providers. The rapid growth of service providers in the mobile computing market calls for an extensible dialog system framework. This paper presents a distributed dialog infrastructure where each domain or provider is agnostic of others, and processes the user utterances independently using its own knowledge or models, so that a new domain and new provider can be easily incorporated. In addition, to facilitate each service provider building their own SLU models or algorithms, we introduce a new component, extractors, to provide intermediate semantic annotations such as entity mention tags, which can be plugged in arbitrarily as well. Each service provider can then rapidly develop their SLU parser with minimal effort by providing some example sentences with intents and slots if needed. Our preliminary experimental results demonstrate the power of this new framework compared to a centralized architecture.
- Xiang Li, Gokhan Tur, Dilek Hakkani-Tur, and Qi Li, Personal Knowledge Graph Population from User Utterances in Conversational Understanding, IEEE – Institute of Electrical and Electronics Engineers, December 2014.
Knowledge graphs provide a powerful representation of entities and the relationships between them, but automatically constructing such graphs from spoken language utterances is a novel problem that presents numerous challenges. In this paper, we introduce a statistical language understanding approach to automatically construct personal (user-centric) knowledge graphs in conversational dialogs. Such information has the potential to help in better understanding users’ requests, fulfilling them, and enabling other technologies such as developing better inferences or proactive interactions. Knowledge encoded in semantic graphs such as Freebase has been shown to benefit semantic parsing and interpretation of natural language utterances. Hence, as a first step, we exploit the personal factual relation triples from Freebase to mine natural language snippets with a search engine, and use the resulting snippets containing pairs of related entities to create the training data. This data is then used to build three key language understanding components: (1) Personal Assertion Classification identifies the user utterances that are relevant to personal facts, e.g., “my mother’s name is Rosa”; (2) Relation Detection classifies the personal assertion utterance into one of the predefined relation classes, e.g., “parents”; and (3) Slot Filling labels the attributes or arguments of relations, e.g., “name(parents):Rosa”. Our experiments using the Microsoft conversational understanding system demonstrate the performance of this proposed approach on the population of personal knowledge graphs.
- Murat Akbacak, Dilek Hakkani-Tur, and Gokhan Tur, Rapidly building domain-specific entity-centric language models using semantic web knowledge resources, in Proceedings of Interspeech, ISCA - International Speech Communication Association, September 2014.
For domain-specific speech recognition tasks, it is best if the statistical language model component is trained with text data that is content-wise and style-wise similar to the targeted domain for which the application is built. For state-of-the-art language modeling techniques that can be used in real-time within speech recognition engines during first-pass decoding (e.g., N-gram models), the above constraints have to be fulfilled in the training data. However, collecting such data, even through crowdsourcing, is expensive and time-consuming, and may still not be representative of how a much larger user population would interact with the recognition system. In this paper, we address this problem by employing several semantic web sources that already contain the domain-specific knowledge, such as query click logs and knowledge graphs. We build statistical language models that meet the requirements listed above for domain-specific recognition tasks where natural language is used and the user queries are about named entities in a specific domain. As a case study, in the movies domain where users’ voice queries are movie related, compared to a generic web language model, a language model trained with the above resources not only yields significant perplexity and word-error-rate improvements, but also presents an approach where such language models can be rapidly developed for other domains.
- Dilek Hakkani-Tur, Asli Celikyilmaz, Larry Heck, Gokhan Tur, and Geoff Zweig, Probabilistic Enrichment of Knowledge Graph Entities for Relation Detection in Conversational Understanding, in Proceedings of Interspeech, ISCA - International Speech Communication Association, September 2014.
Knowledge encoded in semantic graphs such as Freebase has been shown to benefit semantic parsing and interpretation of natural language user utterances. In this paper, we propose new methods to assign weights to semantic graphs that reflect common usage types of the entities and their relations. Such statistical information can improve the disambiguation of entities in natural language utterances. Weights for entity types can be derived from the populated knowledge in the semantic graph, based on the frequency of occurrence of each type. They can also be learned from the usage frequencies in real world natural language text, such as related Wikipedia documents or user queries posed to a search engine. We compare the proposed methods with the unweighted version of the semantic knowledge graph for the relation detection task and show that all weighting methods result in better performance in comparison to using the unweighted version.
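One of the weighting schemes described above derives entity-type weights from how often each type occurs in the populated graph. The sketch below illustrates that idea as a plain relative-frequency estimate. The data layout, the function name `type_weights`, and the toy entity are assumptions for illustration, and the paper's actual estimators may differ.

```python
from collections import Counter

def type_weights(entity_type_mentions):
    """Turn per-entity type occurrences into relative-frequency weights.

    entity_type_mentions maps an entity name to a list of type labels,
    one label per populated fact in which the entity appears with that type.
    """
    weights = {}
    for entity, types in entity_type_mentions.items():
        counts = Counter(types)
        total = sum(counts.values())
        weights[entity] = {t: c / total for t, c in counts.items()}
    return weights

# Hypothetical toy graph: "Titanic" appears three times as a film, once as a ship.
toy_graph = {"Titanic": ["film", "film", "film", "ship"]}
weights = type_weights(toy_graph)
```

An unweighted graph would treat both readings of the entity as equally likely; the weights make the more frequent type the preferred interpretation.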
- Hany Hassan, Lee Schwartz, Dilek Hakkani-Tur, and Gokhan Tur, Segmentation and Disfluency Removal for Conversational Speech Translation, no. MSR-TR-2014-80, September 2014.
In this paper we focus on the effect of on-line speech segmentation and disfluency removal methods on conversational speech translation. In a real-time conversational speech-to-speech translation system, on-line segmentation of speech is required to avoid latency beyond a few seconds. While sentential unit segmentation and disfluency removal have been heavily studied mainly for off-line speech processing, to the best of our knowledge, the combined effect of these tasks on conversational speech translation has not been investigated. Furthermore, optimization of performance given maximum allowable system latency to enable a conversation is a newer problem for these tasks. We show that the conventional assumption of doing segmentation followed by disfluency removal is not the best practice. We propose a new approach to do simple-disfluency removal followed by segmentation and then by complex-disfluency removal. The proposed approach shows a significant gain in translation performance of up to 3 BLEU points with only a 6-second look-ahead latency, using state-of-the-art machine translation and speech recognition systems.
- Gokhan Tur, Anoop Deoras, and Dilek Hakkani-Tur, Detecting Out-Of-Domain Utterances Addressed to a Virtual Personal Assistant, in Proceedings of Interspeech, ISCA - International Speech Communication Association, September 2014.
Conversational understanding systems, especially virtual personal assistants (VPAs), perform “targeted” natural language understanding, assuming their users stay within the walled gardens of covered domains, and back-off to generic web search otherwise. However, users usually do not know the concept of domains and sometimes simply do not distinguish the system from simple voice search. Hence it becomes an important problem to identify these rejected out-of-domain utterances which are actually intended for the VPA. This paper presents a study tackling this new task, showing that how one utters a request is more important for this task than what is uttered, resembling addressee detection or dialog act tagging. To this end, syntactic and semantic parse “structure” features are extracted in addition to lexical features to train a binary SVM classifier using a large number of random web search queries and VPA utterances from multiple domains. We present controlled experiments leaving one domain out and check the precision of the model when combined with unseen queries. Our results indicate that such structured features result in higher precision especially when the test domain bears little resemblance to the existing domains.
- Ali El-Kahky, Derek Liu, Ruhi Sarikaya, Gokhan Tur, Dilek Hakkani-Tur, and Larry Heck, Extending Domain Coverage of Language Understanding Systems via Intent Transfer Between Domains Using Knowledge Graphs and Search Query Click Logs, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2014.
This paper proposes a new technique to enable Natural Language Understanding (NLU) systems to handle user queries beyond their original semantic schemas defined by their intents and slots. Knowledge graph and search query logs are used to extend NLU system’s coverage by transferring intents from other domains to a given domain. The transferred intents as well as existing intents are then applied to a set of new slots that they are not trained with. The knowledge graph and search click logs are used to determine whether the new slots (i.e. entities) or their attributes in the graph can be used together with the new intents without re-training the underlying NLU models with the expanded (i.e. with new intents and slots) schema. Experimental results show that the proposed technique can in fact be used in extending NLU system’s domain coverage in fulfilling the user’s request.
- Yangfeng Ji, Dilek Hakkani-Tur, Asli Celikyilmaz, Larry Heck, and Gokhan Tur, A Variational Bayesian Model for User Intent Detection, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2014.
State-of-the-art spoken language understanding models that automatically capture user intents in human-to-machine dialogs are often trained with a small number of manually annotated examples collected from the application domain. Search query logs provide a large number of unlabeled queries that would be beneficial to improve such supervised classification. Furthermore, the contents of user queries as well as the URLs they click provide information about the user’s intent. In this paper, we propose a variational Bayesian approach for modeling latent intents of user queries and URLs that they clicked on when available. We use this model to enhance supervised intent classification of user queries from conversational interactions. Our experimental results demonstrate the effectiveness of this approach, showing further improvements when a large number of search queries are used.
- Gokhan Tur, Ye-Yi Wang, and Dilek Hakkani-Tur, Understanding Spoken Language, CRC Press, May 2014.
- Yann Dauphin, Gokhan Tur, Dilek Hakkani-Tur, and Larry Heck, Zero-Shot Learning and Clustering for Semantic Utterance Classification, International Conference on Learning Representations (ICLR), April 2014.
We propose a novel zero-shot learning method for semantic utterance classification (SUC). It learns a classifier f : X -> Y for problems where none of the semantic categories Y are present in the training set. The framework uncovers the link between categories and utterances through a semantic space. We show that this semantic space can be learned by deep neural networks trained on large amounts of search engine query log data. What’s more, we propose a novel method that can learn discriminative semantic features without supervision. It uses the zero-shot learning framework to guide the learning of the semantic features. We demonstrate the effectiveness of the zero-shot semantic learning algorithm on the SUC dataset collected by . Furthermore, we achieve state-of-the-art results by combining the semantic features with a supervised method.
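The core idea of linking utterances and categories through a shared semantic space can be sketched as nearest-category classification: embed the utterance and each category name in the same space and pick the closest category. In the sketch below, the hand-written two-dimensional word vectors, the averaging embedder, and the cosine scoring are all stand-in assumptions. The paper learns its semantic space with deep networks trained on query logs, which this toy example does not attempt to reproduce.

```python
import math

# Hypothetical 2-D word vectors standing in for a learned semantic space.
WORD_VECS = {
    "restaurant": [1.0, 0.0], "eat": [0.9, 0.1], "pizza": [0.8, 0.2],
    "weather": [0.0, 1.0], "rain": [0.1, 0.9], "tomorrow": [0.3, 0.7],
}

def embed(text):
    # Average the vectors of known words; unknown words are ignored.
    vecs = [WORD_VECS[w] for w in text.lower().split() if w in WORD_VECS]
    if not vecs:
        return [0.0, 0.0]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def zero_shot_classify(utterance, category_names):
    # No labeled utterances are used: categories are represented
    # only by the embedding of their names.
    u = embed(utterance)
    return max(category_names, key=lambda c: cosine(u, embed(c)))

categories = ["restaurant", "weather"]
label_a = zero_shot_classify("will it rain tomorrow", categories)
label_b = zero_shot_classify("where can i eat pizza", categories)
```

Because categories enter only through their names, a new category can be added at test time without retraining, which is the property the zero-shot setting relies on.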
- Gokhan Tur, Anoop Deoras, and Dilek Hakkani-Tur, Semantic Parsing Using Word Confusion Networks With Conditional Random Fields, Annual Conference of the International Speech Communication Association (Interspeech), September 2013.
A challenge in large vocabulary spoken language understanding (SLU) is robustness to automatic speech recognition (ASR) errors. The state of the art approaches for semantic parsing rely on using discriminative sequence classification methods, such as conditional random fields (CRFs). Most dialog systems employ a cascaded approach where the best hypotheses from the ASR system are fed into the following SLU system. In our previous work, we have proposed the use of lattices towards joint recognition and parsing. In this paper, extending this idea, we propose to exploit word confusion networks (WCNs), compiled from ASR lattices for both CRF modeling and decoding. WCNs provide a compact representation of multiple aligned ASR hypotheses, without compromising recognition accuracy. For slot filling, we show significant semantic parsing performance improvements using WCNs compared to ASR 1-best output, approximating the oracle path performance.
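To make the contrast between ASR 1-best output and the oracle path concrete, a word confusion network can be pictured as a sequence of bins, each holding competing words with posterior probabilities. The sketch below uses invented words and posteriors and assumes the reference transcript aligns one-to-one with the bins (no epsilon or deletion arcs). It illustrates the data structure only, not the CRF modeling or decoding described in the paper.

```python
# Each bin maps competing word hypotheses to posterior probabilities
# (values are hypothetical).
wcn = [
    {"flights": 0.6, "flight": 0.4},
    {"to": 0.9, "two": 0.1},
    {"austin": 0.45, "boston": 0.55},
]

def one_best(network):
    # Take the highest-posterior word from every bin.
    return [max(bin_, key=bin_.get) for bin_ in network]

def oracle_path(network, reference):
    # Take the reference word when the bin contains it,
    # otherwise fall back to the top-scoring word.
    return [
        ref if ref in bin_ else max(bin_, key=bin_.get)
        for bin_, ref in zip(network, reference)
    ]

reference = ["flights", "to", "austin"]
best = one_best(wcn)
oracle = oracle_path(wcn, reference)
```

Here the 1-best path picks the wrong city while the correct word is still present in the network, which is the information a WCN-aware tagger can exploit.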
- Larry Heck, Dilek Hakkani-Tur, Madhu Chinthakunta, Gokhan Tur, Rukmini Iyer, Partha Parthasarathy, Lisa Stifelman, Elizabeth Shriberg, and Ashley Fidler, Multimodal Conversational Search and Browse, IEEE Workshop on Speech, Language and Audio in Multimedia, August 2013.
In this paper, we create an open-domain conversational system by combining the power of internet browser interfaces with multi-modal inputs and data mined from web search and browser logs. The work focuses on two novel components: (1) dynamic contextual adaptation of speech recognition and understanding models using visual context, and (2) fusion of users’ speech and gesture inputs to understand their intents and associated arguments. The system was evaluated in a living room setup with live test subjects on a real-time implementation of the multimodal dialog system. Users interacted with a television browser using gestures and speech. Gestures were captured by Microsoft Kinect skeleton tracking and speech was recorded by a Kinect microphone array. Results show a 16% error rate reduction (ERR) for contextual ASR adaptation to clickable web page content, and 7-10% ERR when using gestures with speech. Analysis of the results suggests a strategy for selection of multimodal intent when users clearly and persistently indicate pointing intent (e.g., eye gaze), giving a 54.7% ERR over lexical features.
- Larry Heck, Dilek Hakkani-Tur, and Gokhan Tur, Leveraging Knowledge Graphs for Web-Scale Unsupervised Semantic Parsing, in Proceedings of Interspeech, International Speech Communication Association, August 2013.
The past decade has seen the emergence of web-scale structured and linked semantic knowledge resources (e.g., Freebase, DBPedia). These semantic knowledge graphs provide a scalable “schema for the web”, representing a significant opportunity for the spoken language understanding (SLU) research community. This paper leverages these resources to bootstrap a web-scale semantic parser with no requirement for semantic schema design, no data collection, and no manual annotations. Our approach is based on an iterative graph crawl algorithm. From an initial seed node (entity-type), the method learns the related entity-types from the graph structure, and automatically annotates documents that can be linked to the node (e.g., Wikipedia articles, web search documents). Following the branches, the graph is crawled and the procedure is repeated. The resulting collection of annotated documents is used to bootstrap web-scale conditional random field (CRF) semantic parsers. Finally, we use a maximum-a-posteriori (MAP) unsupervised adaptation technique on sample data from a specific domain to refine the parsers. The scale of the unsupervised parsers is on the order of thousands of domains and entity-types, millions of entities, and hundreds of millions of relations. The precision-recall of the semantic parsers trained with our unsupervised method approaches those trained with supervised annotations.
- Dilek Hakkani-Tur, Asli Celikyilmaz, Larry Heck, and Gokhan Tur, A Weakly-Supervised Approach for Discovering New User Intents from Search Query Logs, Annual Conference of the International Speech Communication Association (Interspeech), August 2013.
State-of-the-art spoken language understanding models that automatically capture user intents in human-to-machine dialogs are trained with manually annotated data, which is cumbersome and time-consuming to prepare. For bootstrapping the learning algorithm that detects relations in natural language queries to a conversational system, one can rely on publicly available knowledge graphs, such as Freebase, and mine corresponding data from the web. In this paper, we present an unsupervised approach to discover new user intents using a novel Bayesian hierarchical graphical model. Our model employs search query click logs to enrich the information extracted from bootstrapped models. We use the clicked URLs as implicit supervision and extend the knowledge graph based on the relational information discovered from this model. The posteriors from the graphical model relate the newly discovered intents with the search queries. These queries are then used as additional training examples to complement the bootstrapped relation detection models. The experimental results demonstrate the effectiveness of this approach, showing extended coverage to new intents without impacting the known intents.
- Asli Celikyilmaz, Gokhan Tur, and Dilek Hakkani-Tur, IsNL? A Discriminative Approach to Detect Natural Language Like Queries for Conversational Understanding, Annual Conference of the International Speech Communication Association (Interspeech), August 2013.
While data-driven methods for spoken language understanding (SLU) provide state-of-the-art performance and reduce maintenance and model adaptation costs compared to handcrafted parsers, the collection and annotation of domain-specific natural language utterances for training remains a time-consuming task. A recent line of research has focused on enriching the training data with in-domain utterances by mining search engine query logs to improve the SLU tasks. However, genre mismatch is a big obstacle, as search queries are typically keywords. In this paper, we present an efficient discriminative binary classification method that filters a large collection of online web search queries to select only the natural language like queries. The training data used to build this classifier is mined from search query click logs, represented as a bipartite graph. Starting from queries which contain natural language salient phrases, random graph walk algorithms are employed to mine corresponding keyword queries. Then an active learning method is employed for quickly improving on top of this automatically mined data. The results show that our method is robust to noise in search queries by improving over a baseline model previously used for SLU data collection. We also show the effectiveness of detected natural language like queries in extrinsic evaluations on domain detection and slot filling tasks.
- Dong Wang, Dilek Hakkani-Tur, and Gokhan Tur, Understanding Computer-Directed Utterances in Multi-User Dialog Systems, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2013.
This work aims to understand user requests when multiple users are interacting with each other and a spoken dialog system. More specifically, we explore the use of multi-human conversational context to improve domain detection in a human-computer interaction system. We investigate the different effects of human-directed context and computer-directed context, and compare the impact of using different context window sizes. Furthermore, we employ topic segmentation to chunk conversations for determining context boundaries. The experimental results show that the use of conversational context helps reduce domain detection error rate, especially in some specific domains. And though computer directed context is more reliable, the results show that the combination of both computer and human addressed utterances within a reasonable window size performs the best.
- Gokhan Tur, Ye-Yi Wang, and Dilek Hakkani-Tur, TechWare: Spoken Language Understanding (SLU) Resources, in IEEE Signal Processing Magazine, May 2013.
This column first presents a very high level review of the SLU technology, starting from its place in a spoken dialog system, then focusing on well established SLU tasks such as domain detection, intent determination, and slot filling, along with corresponding benchmark data sets and methods.
- Xiaodong He, Li Deng, Dilek Hakkani-Tur, and Gokhan Tur, Multi-Style Adaptive Training for Robust Cross-Lingual Spoken Language Understanding, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2013.
Given the increasing availability of machine translation (MT) services, one efficient strategy for cross-lingual spoken language understanding (SLU) is to first translate the input utterance from the second language into the primary language, and then call the primary language SLU system to decode the semantic knowledge. However, errors introduced in the MT process create a condition similar to the “mismatch” condition encountered in robust speech recognition. Such mismatch makes the performance of cross-lingual SLU far from acceptable. Motivated by successful solutions developed in robust speech recognition, we propose in this paper a multi-style adaptive training method to improve the robustness of the SLU system for cross-lingual SLU tasks. For evaluation, we created an English-Chinese bilingual ATIS database, and then carried out a series of experiments on that database to assess the proposed methods. Experimental results show that, without relying on any data in the second language, the proposed method significantly improves the performance on a cross-lingual SLU task while producing no degradation for input in the primary language. This greatly facilitates porting SLU to as many languages as there are MT systems without any human effort. We further study the robustness of this approach to another type of mismatch condition, caused by speech recognition errors, and demonstrate its success there as well.
- Gokhan Tur, Asli Celikyilmaz, and Dilek Hakkani-Tur, Latent Semantic Modeling for Slot Filling in Conversational Understanding, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2013.
In this paper, we propose a new framework for semantic template filling in a conversational understanding (CU) system. Our method decomposes the task into two steps: latent n-gram clustering using a semi-supervised latent Dirichlet allocation (LDA) and sequence tagging for learning semantic structures in a CU system. Latent semantic modeling has been investigated to improve many natural language processing tasks such as syntactic parsing or topic tracking. However, due to several complexity problems caused by issues involving utterance length or dialog corpus size, it has not been analyzed directly for semantic parsing tasks. In this paper, we tackle these complexities by first extending the LDA by introducing prior knowledge we obtain from semantic knowledge bases. Later, we use the topic posteriors obtained from the new LDA model as additional constraints to the sequence learning model for the semantic template filling task. Our experimental results show significant performance gains on semantic slot filling models when features from latent semantic models are used in a conditional random field (CRF).
- Dilek Hakkani-Tur, Larry Heck, and Gokhan Tur, Using a Knowledge Graph and Query Click Logs for Unsupervised Learning of Relation Detection, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2013.
We present unsupervised methods for training relation detection models from the semantic knowledge graphs of the semantic web. The detected relations are used to synthetically generate natural language spoken queries against a back-end knowledge base. For each relation, we leverage the complete set of entities that are connected to each other in the graph with the specific relation, and search these pairs on the web. We use the snippets that the search engine returns to create examples that can be used as the training data for each relation. We further refine the annotations of these examples using the knowledge graph itself and a bootstrap approach. Furthermore, we use the URLs returned for the pair by the search engine to mine additional examples from the search engine query click logs. In our experiments, we show that, we can achieve relation detection models that perform 59.9% macro F-measure on the relations that are in the knowledge graph without any manual labeling, resulting in a comparable performance with supervised training.
- Asli Celikyilmaz, Dilek Hakkani-Tür, Gokhan Tur, and Ruhi Sarikaya, Semi-Supervised Semantic Tagging for Conversational Understanding Using Markov Topic Regression, Association for Computational Linguistics, 2013.
Finding concepts in natural language utterances is a challenging task, especially given the scarcity of labeled data for learning semantic ambiguity. Furthermore, data mismatch issues, which arise when the expected test (target) data does not exactly match the training data, aggravate this scarcity problem. To deal with these issues, we describe an efficient semi-supervised learning (SSL) approach which has two components: (i) Markov Topic Regression is a new probabilistic model to cluster words into semantic tags (concepts). It can efficiently handle semantic ambiguity by extending standard topic models with two new features. First, it encodes word n-gram features from labeled source and unlabeled target data. Second, by going beyond a bag-of-words approach, it takes into account the inherent sequential nature of utterances to learn semantic classes based on context. (ii) Retrospective Learner is a new learning technique that adapts to the unlabeled target data. Our new SSL approach improves semantic tagging performance by 4% absolute over the baseline models, and also compares favorably on semi-supervised syntactic tagging.
- Anoop Deoras, Gokhan Tur, Ruhi Sarikaya, and Dilek Hakkani-Tur, Joint Discriminative Decoding of Word and Semantic Tags for Spoken Language Understanding, in IEEE Transactions on Audio, Speech, and Language Processing, IEEE, 2013.
Most Spoken Language Understanding (SLU) systems today employ a cascade approach, where the best hypothesis from the Automatic Speech Recognizer (ASR) is fed into understanding modules such as slot sequence classifiers and intent detectors. The output of these modules is then further fed into downstream components such as an interpreter and/or knowledge broker. These statistical models are usually trained individually to optimize the error rate of their respective output. In such approaches, errors from one module irreversibly propagate into other modules, causing a serious degradation in the overall performance of the SLU system. Thus it is desirable to jointly optimize all the statistical models together. As a first step towards this, in this paper, we propose a joint decoding framework in which we predict the optimal word as well as slot sequence (semantic tag sequence) jointly given the input acoustic stream. Furthermore, the improved recognition output is then used for an utterance classification task; specifically, we focus on the intent detection task. On an SLU task, we show 1.5% absolute reduction (7.6% relative reduction) in word error rate (WER) and 1.2% absolute improvement in F-measure for slot prediction when compared to a very strong cascade baseline comprising a state-of-the-art large vocabulary ASR followed by a conditional random field (CRF) based slot sequence tagger. Similarly, for intent detection, we show 1.2% absolute reduction (12% relative reduction) in classification error rate.
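The abstract reports each error reduction both as an absolute and as a relative figure. Since a relative reduction is the absolute reduction divided by the baseline error rate, the two figures together imply an approximate baseline. The short calculation below works this out; the resulting baselines are inferred from rounded numbers and are not stated in the abstract, and the function name is illustrative.

```python
def implied_baseline(absolute_drop, relative_drop):
    """Baseline error rate implied by absolute and relative reductions.

    relative_drop = absolute_drop / baseline, so
    baseline = absolute_drop / relative_drop.
    absolute_drop is in percentage points; relative_drop is a fraction.
    """
    return absolute_drop / relative_drop

# WER: 1.5 points absolute, 7.6% relative.
wer_baseline = implied_baseline(1.5, 0.076)
# Intent classification error: 1.2 points absolute, 12% relative.
intent_baseline = implied_baseline(1.2, 0.12)

# Error rates after the reported reductions.
wer_after = wer_baseline - 1.5
intent_after = intent_baseline - 1.2
```

Under these assumptions the cascade baseline would sit at roughly 19.7% WER and about 10% intent classification error, dropping to roughly 18.2% and 8.8% respectively.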
- Li Deng, Gokhan Tur, Xiaodong He, and Dilek Hakkani-Tur, Use of Kernel Deep Convex Networks and End-To-End Learning for Spoken Language Understanding, IEEE Workshop on Spoken Language Technologies, December 2012.
We present our recent and ongoing work on applying deep learning techniques to spoken language understanding (SLU) problems. The previously developed deep convex network (DCN) is extended to its kernel version (K-DCN) where the number of hidden units in each DCN layer approaches infinity using the kernel trick. We report experimental results demonstrating dramatic error reduction achieved by the K-DCN over both the Boosting-based baseline and the DCN on a domain classification task of SLU, especially when a highly correlated set of features extracted from search query click logs is used. Not only can DCN and K-DCN be used as a domain or intent classifier for SLU, they can also be used as local, discriminative feature extractors for the slot filling task of SLU. The interface of K-DCN to slot filling systems via the softmax function is presented. Finally, we outline an end-to-end learning strategy for training the softmax parameters (and potentially all DCN and K-DCN parameters) where the learning objective can take any performance measure (e.g. the F-measure) for the full SLU system.
- Asli Celikyilmaz, Dilek Hakkani-Tur, and Gokhan Tur, Statistical Semantic Interpretation Modeling for Spoken Language Understanding with Enriched Semantic Features, IEEE Workshop on Spoken Language Technologies, December 2012.
In natural language human-machine statistical dialog systems, semantic interpretation is a key task typically performed following semantic parsing, and aims to extract canonical meaning representations of semantic components. In the literature, manually built rules are usually used for this task, even for implicitly mentioned non-named semantic components (like genre of a movie or price range of a restaurant). In this study, we present statistical methods for modeling interpretation, which can also benefit from semantic features extracted from large in-domain knowledge sources. We extract features from user utterances using a semantic parser and additional semantic features from textual sources (online reviews, synopses, etc.) using a novel tree clustering approach, to represent unstructured information that corresponds to implicit semantic components related to targeted slots in the user’s utterances. We evaluate our models on a virtual personal assistance system and demonstrate that our interpreter is effective in that it not only improves the utterance interpretation in spoken dialog systems (reducing the interpretation error rate by 36% relative compared to a language model baseline), but also unveils hidden semantic units that are otherwise nearly impossible to extract from purely manual lexical features that are typically used in utterance interpretation.
- Dilek Hakkani-Tur, Gokhan Tur, Larry Heck, Ashley Fidler, and Asli Celikyilmaz, A Discriminative Classification-Based Approach to Information State Updates for a Multi-Domain Dialog System, Annual Conference of the International Speech Communication Association (Interspeech), September 2012.
We propose a discriminative classification approach for updating the current information state of a multi-domain dialog system based on user responses. Our method uses a set of lexical and domain independent features to compare the spoken language understanding (SLU) output for the current user turn with the previous information state. We then update the information state accordingly, employing a discriminative machine learning approach. Using a data set collected from our conversational interaction system, we investigate the impact of features based on context dependent and context independent SLU tagging schemas. We show that the proposed approach outperforms two non-trivial baselines, one based on manually crafted rules and the other on classification with lexical features alone. Furthermore, such an approach allows the addition of new domains to the dialog manager in a seamless way.
- Gokhan Tur, Minwoo Jeong, Ye-Yi Wang, Dilek Hakkani-Tur, and Larry Heck, Exploiting the Semantic Web for Unsupervised Natural Language Semantic Parsing, in Proceedings of Interspeech, International Speech Communication Association, September 2012.
In this paper, we propose to bring together the semantic web experience and statistical natural language semantic parsing modeling. The idea is that the process for populating knowledge bases by semantically parsing structured web pages may provide very valuable implicit annotation for language understanding tasks. We mine search queries hitting these web pages in order to semantically annotate them for building statistical unsupervised slot filling models, without even a need for a semantic annotation guideline. We present promising results demonstrating this idea for building an unsupervised slot filling model for the movies domain with some representative slots. Furthermore, we also employ unsupervised model adaptation for cases when there are some in-domain unannotated sentences available. Another key contribution of this work is using implicitly annotated natural-language-like queries for testing the performance of the models, in a totally unsupervised fashion. We believe such an approach also ensures consistent semantic representation between the semantic parser and the backend knowledge base.
- Anoop Deoras, Ruhi Sarikaya, Gokhan Tur, and Dilek Hakkani-Tur, Joint Decoding for Speech Recognition and Semantic Tagging, Annual Conference of the International Speech Communication Association (Interspeech), September 2012.
Most conversational understanding (CU) systems today employ a cascade approach, where the best hypothesis from the automatic speech recognizer (ASR) is fed into the spoken language understanding (SLU) module, whose best hypothesis is then fed into other systems such as an interpreter or dialog manager. In such approaches, errors from one statistical module irreversibly propagate into another module, causing a serious degradation in the overall performance of the conversational understanding system. Thus it is desirable to jointly optimize all the statistical modules together. As a first step towards this, in this paper, we propose a joint decoding framework in which we predict the optimal word as well as slot (semantic tag) sequence jointly given the input acoustic stream. On Microsoft's CU system, we show 1.3% absolute reduction in word error rate (WER) and 1.2% absolute improvement in F measure for slot prediction when compared to a very strong cascade baseline comprising the state-of-the-art recognizer followed by a slot sequence tagger.
- Dilek Hakkani-Tur, Gokhan Tur, and Asli Celikyilmaz, Mining Search Query Logs for Spoken Language Understanding, in North American Association for Computational Linguistics NAACL-2012: Workshop on Future Directions and Needs in the Spoken Dialog Community: Tools and Data, June 2012.
In a spoken dialog system that can handle natural conversation between a human and a machine, spoken language understanding (SLU) is a crucial component aiming at capturing the key semantic components of utterances. Building a robust SLU system is a challenging task due to variability in the usage of language, the need for labeled data, and requirements to expand to new domains (movies, travel, finance, etc.). In this paper, we survey recent research on bootstrapping or improving SLU systems by using information mined or extracted from web search query logs, which include (natural language) queries entered by users as well as the links (web sites) they click on. We focus on learning methods that help unveil hidden information in search query logs via implicit crowd-sourcing.
- Gokhan Tur, Li Deng, Dilek Hakkani-Tur, and Xiaodong He, Towards Deeper Understanding: Deep Convex Networks for Semantic Utterance Classification, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 2012.
Following the recent advances in deep learning techniques, in this paper, we present the application of a special type of deep architecture, deep convex networks (DCNs), for semantic utterance classification (SUC). DCNs are shown to have several advantages over deep belief networks (DBNs) including classification accuracy and training scalability. However, adoption of DCNs for SUC comes with non-trivial issues. Specifically, SUC has an extremely sparse input feature space encompassing a very large number of lexical and semantic features. This is about a few thousand times larger than the feature space for acoustic modeling, yet with a much smaller number of training samples. Experimental results we obtained on a domain classification task for spoken language understanding demonstrate the effectiveness of DCNs. The DCN-based method produces higher SUC accuracy than the Boosting-based discriminative classifier with word trigrams.
- Dilek Hakkani-Tur, Gokhan Tur, Rukmini Iyer, and Larry Heck, Translating Natural Language Utterances to Search Queries for SLU Domain Detection Using Query Click Logs, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 2012.
Logs of user queries from a search engine (such as Bing or Google) together with the links clicked provide valuable implicit feedback to improve statistical spoken language understanding (SLU) models. However, the form of natural language utterances occurring in spoken interactions with a computer differs stylistically from that of keyword search queries. In this paper, we propose a machine translation approach to learn a mapping from natural language utterances to search queries. We train statistical translation models, using task and domain independent semantically equivalent natural language and keyword search query pairs mined from the search query click logs. We then extend our previous work on enriching the existing classification feature sets for input utterance domain detection with features computed using the click distribution over a set of clicked URLs from search engine query click logs of user utterances with automatically translated queries. This approach results in significant improvements for domain detection, especially when detecting the domains of user utterances that are formulated as natural language queries, and effectively complements the earlier work using syntactic transformations.
- Dilek Hakkani-Tür, Gokhan Tur, Larry Heck, Asli Celikyilmaz, Ashley Fidler, Dustin Hillard, Rukmini Iyer, and S. Parthasarathy, Employing Web Search Query Click Logs for Multi-Domain Spoken Language Understanding, IEEE Automatic Speech Recognition and Understanding Workshop, December 2011.
In this paper, we describe methods to exploit search queries mined from search engine query logs to improve domain detection in spoken language understanding. We propose extending the label propagation algorithm, a graph-based semi-supervised learning approach, to incorporate noisy domain information estimated from search engine links the users click following their queries. The main contributions of our work are the use of search query logs for domain classification, integration of noisy supervision into the semi-supervised label propagation algorithm, and sampling of high-quality query click data by mining query logs and using classification confidence scores. We show that most semi-supervised learning methods we experimented with improve the performance of the supervised training, and the biggest improvement is achieved by label propagation that uses noisy supervision. We reduce the error rate of domain detection by 20% relative, from 6.2% to 5.0%.
- Asli Celikyilmaz, Dilek Hakkani-Tur, Gokhan Tur, Ashley Fidler, and Dustin Hillard, Exploiting Distance Based Similarity in Topic Models for User Intent Detection, IEEE Automatic Speech Recognition and Understanding Workshop, December 2011.
One of the main components of spoken language understanding is intent detection, which allows user goals to be identified. A challenging sub-task of intent detection is the identification of intent bearing phrases from a limited amount of training data, while maintaining the ability to generalize well. We present a new probabilistic topic model for jointly identifying semantic intents and common phrases in spoken language utterances. Our model jointly learns a set of intent dependent phrases and captures semantic intent clusters as distributions over these phrases based on a distance dependent sampling method. This sampling method uses the proximity of words in utterances when assigning words to latent topics. We evaluate our method on labeled utterances and present several examples of discovered semantic units. We demonstrate that our model outperforms standard topic models based on the bag-of-words assumption.
- Dilek Hakkani-Tur, Gokhan Tur, Larry Heck, and Elizabeth Shriberg, Bootstrapping Domain Detection Using Query Click Logs for New Domains, in Proceedings of Interspeech, International Speech Communication Association, August 2011.
Domain detection in spoken dialog systems is usually treated as a multi-class, multi-label classification problem, and training of domain classifiers requires collection and manual annotation of example utterances. In order to extend a dialog system to new domains in a way that is seamless for users, domain detection should be able to handle utterances from the new domain as soon as it is introduced. In this work, we propose using web search query logs, which include queries entered by users and the links they subsequently click on, to bootstrap domain detection for new domains. While sampling user queries from the query click logs to train new domain classifiers, we introduce two types of measures based on the behavior of the users who entered a query and the form of the query. We show that both types of measures result in reductions in the error rate as compared to randomly sampling training queries. In controlled experiments over five domains, we achieve the best gain from the combination of the two types of sampling criteria.
- Asli Celikyilmaz, Dilek Hakkani-Tur, and Gokhan Tur, Multi-Domain Spoken Language Understanding with Approximate Inference, Annual Conference of the International Speech Communication Association (Interspeech), August 2011.
This paper presents a semi-latent topic model for semantic domain detection in spoken language understanding systems. We use labeled utterance information to capture latent topics, which directly correspond to semantic domains. Additionally, we introduce an ’informative prior’ for Bayesian inference that can simultaneously segment utterances of known domains into classes and divide them from out-of-domain utterances. We show that our model generalizes well on the task of classifying spoken language utterances and compare its results to those of an unsupervised topic model, which does not use labeled information.
- Dustin Hillard, Asli Celikyilmaz, Gokhan Tur, and Dilek Hakkani-Tur, Learning Weighted Entity Lists from Web Click Logs for Spoken Language Understanding, Annual Conference of the International Speech Communication Association (Interspeech), August 2011.
Named entity lists provide important features for language understanding, but typical lists can contain many ambiguous or incorrect phrases. We present an approach for automatically learning weighted entity lists by mining user clicks from web search logs. The approach significantly outperforms multiple baseline approaches and the weighted lists improve spoken language understanding tasks such as domain detection and slot filling. Our methods are general and can be easily applied to large quantities of entities, across any number of lists.
- Dilek Hakkani-Tur, Gokhan Tur, and Larry Heck, Research Challenges and Opportunities in Mobile Applications, in IEEE Signal Processing Magazine, August 2011.
We have attempted to distill the research challenges and opportunities for mobile applications, ranging from personalization to connectivity. We believe that interactive multimodal mobile applications are still in their infancy and this is an exciting emerging field for research and development.
- Gokhan Tur, Dilek Hakkani-Tur, Dustin Hillard, and Asli Celikyilmaz, Towards Unsupervised Spoken Language Understanding: Exploiting Query Click Logs for Slot Filling, Annual Conference of the International Speech Communication Association (Interspeech), August 2011.
In this paper, we present a novel approach to exploit user queries mined from search engine query click logs to bootstrap or improve slot filling models for spoken language understanding. We propose extending the earlier gazetteer population techniques to mine unannotated training data for semantic parsing. The automatically annotated mined data can then be used to train slot specific parsing models. We show that this method can be used to bootstrap slot filling models and can be combined with any available annotated data to improve performance. Furthermore, this approach may eliminate the need for populating and maintaining in-domain gazetteers, in addition to providing complementary information if they are already available.
- Xiao Li, Ye-Yi Wang, and Gokhan Tur, Multi-Task Learning for Spoken Language Understanding with Shared Slots, Annual Conference of the International Speech Communication Association (Interspeech), August 2011.
This paper addresses the problem of learning multiple spoken language understanding (SLU) tasks that have overlapping sets of slots. In such a scenario, it is possible to achieve better slot filling performance by learning multiple tasks simultaneously, as opposed to learning them independently. We focus on presenting a number of simple multi-task learning algorithms for slot filling systems based on semi-Markov CRFs, assuming the knowledge of shared slots. Furthermore, we discuss an intradomain clustering method that automatically discovers shared slots from training data. The effectiveness of our proposed approaches is demonstrated in an SLU application that involves three different yet related tasks.
- Asli Celikyilmaz, Gokhan Tur, and Dilek Hakkani-Tur, Leveraging Web Query Logs to Learn User Intent Via Bayesian Latent Variable Model, in ICML Workshop on Combining Learning Strategies to Reduce Label Cost, July 2011.
A key task in Spoken Language Understanding (SLU) is interpreting user intentions from speech utterances. This task is considered to be a classification problem with the goal of categorizing a given speech utterance into one of many semantic intent classes. Due to substantial utterance variability, a significant quantity of labeled utterances is needed to build robust intent detection systems. In this paper, we approach intent detection as a two-stage semi-supervised learning problem, which utilizes a large number of unlabeled queries collected from internet search engine click logs. We first capture the underlying structure of the user queries using a Bayesian latent feature model. We then propagate this structure onto the unlabeled queries to obtain quality training data via a graph summarization algorithm. Our approach improves intent detection compared to our baseline, which uses a standard classification model with actual features.
- Dilek Hakkani-Tur, Larry Heck, and Gokhan Tur, Exploiting Query Click Logs for Utterance Domain Detection in Spoken Language Understanding, in Proceedings of the ICASSP, Prague, Czech Republic, May 2011.
In this paper, we describe methods to exploit search queries mined from search engine query logs to improve domain detection in spoken language understanding. We propose extending the label propagation algorithm, a graph-based semi-supervised learning approach, to incorporate noisy domain information estimated from search engine links the users click following their queries. The main contributions of our work are the use of search query logs for domain classification, integration of noisy supervision into the semi-supervised label propagation algorithm, and sampling of high-quality query click data by mining query logs and using classification confidence scores. We show that most semi-supervised learning methods we experimented with improve the performance of the supervised training, and the biggest improvement is achieved by label propagation that uses noisy supervision. We reduce the error rate of domain detection by 20% relative, from 6.2% to 5.0%.
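To make the idea of label propagation with noisy supervision concrete, here is a minimal, generic sketch in plain Python. It is not the formulation used in the paper: the `trust` parameter, the toy four-node graph, and the update rule (a blend of each node's prior with the weighted average of its neighbours) are illustrative assumptions only.

```python
def propagate_labels(weights, priors, trust, n_iter=50):
    """Spread class distributions over a similarity graph.

    weights[i][j]: non-negative similarity between nodes i and j
    priors[i]:     initial class distribution of node i
    trust[i]:      1.0 clamps node i to its prior (manual label),
                   0.0 ignores the prior (unlabeled node),
                   values in between treat the prior as noisy supervision
    """
    n = len(weights)
    n_classes = len(priors[0])
    scores = [list(p) for p in priors]
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            total = sum(weights[i])
            if total > 0:
                # Weighted average of the neighbours' current distributions.
                neigh = [
                    sum(weights[i][j] * scores[j][c] for j in range(n)) / total
                    for c in range(n_classes)
                ]
            else:
                neigh = list(scores[i])
            # Blend the neighbourhood estimate with the node's own prior.
            updated.append([
                trust[i] * priors[i][c] + (1.0 - trust[i]) * neigh[c]
                for c in range(n_classes)
            ])
        scores = updated
    return scores


# Toy chain graph 0 - 1 - 2 - 3 (hypothetical data).
weights = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
priors = [
    [1.0, 0.0],  # node 0: manually labeled, class 0
    [1.0, 0.0],  # node 1: noisy click-derived label, class 0
    [0.5, 0.5],  # node 2: unlabeled
    [0.0, 1.0],  # node 3: manually labeled, class 1
]
trust = [1.0, 0.3, 0.0, 1.0]

result = propagate_labels(weights, priors, trust)
for node, dist in enumerate(result):
    print(node, [round(p, 3) for p in dist])
```

In this toy setup the manually labeled nodes stay fixed, while the noisy label on node 1 pulls its estimate further towards class 0 than the graph structure alone would.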
- Gokhan Tur, Dilek Hakkani-Tür, Larry Heck, and S. Parthasarathy, Sentence Simplification for Spoken Language Understanding, in IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE SPS, May 2011.
In this paper, we present a sentence simplification method and demonstrate its use to improve intent determination and slot filling tasks in spoken language understanding (SLU) systems. This research is motivated by the observation that, while current statistical SLU models usually perform accurately for simple, well-formed sentences, error rates increase for more complex, longer, more natural or spontaneous utterances. Furthermore, users familiar with web search usually formulate their information requests as a keyword search query, suggesting that frameworks which can handle both forms of inputs are required. We propose a dependency parsing-based sentence simplification approach that extracts a set of keywords from natural language sentences and uses those in addition to entire utterances for completing SLU tasks. We evaluated this approach using the well-studied ATIS corpus with manual and automatic transcriptions and observed significant error reductions for both intent determination (30% relative) and slot filling (15% relative) tasks over the state-of-the-art performances.
- Gokhan Tur and Renato DeMori, Spoken Language Understanding: Systems for Extracting Semantic Information from Speech, John Wiley and Sons, New York, NY, 2011.
- Gokhan Tur, Dilek Hakkani-Tur, and Larry Heck, What's Left to Be Understood in ATIS?, IEEE Workshop on Spoken Language Technologies, December 2010.
One of the main data resources used in many studies over the past two decades for spoken language understanding (SLU) research in spoken dialog systems is the airline travel information system (ATIS) corpus. Two primary tasks in SLU are intent determination (ID) and slot filling (SF). Recent studies reported error rates below 5% for both of these tasks employing discriminative machine learning techniques with the ATIS test set. While these low error rates may suggest that this task is close to being solved, further analysis reveals the continued utility of ATIS as a research corpus. In this paper, our goal is not experimenting with domain specific techniques or features which can help with the remaining SLU errors, but instead exploring methods to realize this utility via extensive error analysis. We conclude that even with such low error rates, the ATIS test set still includes many unseen example categories and sequences, and hence requires more data. Better yet, new annotated larger data sets from more complex tasks with realistic utterances can avoid over-tuning in terms of modeling and feature design. We believe that advancements in SLU can be achieved by having more naturally spoken data sets and employing more linguistically motivated features while preserving robustness due to speech recognition noise and variance due to natural language.