Asli Celikyilmaz, Dilek Hakkani-Tür, Gokhan Tur, and Ruhi Sarikaya
Finding concepts in natural language utterances is a challenging task, especially given the scarcity of labeled data for learning semantic ambiguity. Furthermore, data mismatch issues, which arise when the expected test (target) data does not exactly match the training data, aggravate this scarcity problem. To deal with these issues, we describe an efficient semi-supervised learning (SSL) approach with two components. (i) Markov Topic Regression is a new probabilistic model that clusters words into semantic tags (concepts). It efficiently handles semantic ambiguity by extending standard topic models with two new features. First, it encodes word n-gram features from labeled source and unlabeled target data. Second, by going beyond a bag-of-words approach, it takes into account the inherent sequential nature of utterances to learn semantic classes based on context. (ii) Retrospective Learner is a new learning technique that adapts to the unlabeled target data. Our new SSL approach improves semantic tagging performance by 4% absolute over the baseline models, and also compares favorably on semi-supervised syntactic tagging.
Publisher: Association for Computational Linguistics