Speaker Hany Hassan Awadalla and Fei Xia
Affiliation MSR Machine Translation, Linguistics Department at UW
Host Michael Gamon
Date recorded 16 May 2014
Domain adaptation via effective feature engineering across domains
Yan Song and Fei Xia
Domain adaptation aims at bridging the performance gap when training and test data come from different domains. In the field of natural language processing, domain adaptation techniques have been applied to tasks such as POS tagging, parsing, machine translation, and sentiment analysis, and there have been extensive studies in this area in the past decade.
In this talk, we present two novel approaches to domain adaptation. The first addresses the limitations of two existing methods, training data selection and feature augmentation, by combining them: we use training data selection to divide the source-domain training data into two parts, pseudo-target data (the selected part) and source data (the unselected part), and then apply feature augmentation to the two parts. The second improves system performance by introducing new features that represent properties shared between the two domains. Our experiments show that these approaches boost system performance not only when the training and test data come from different domains, but also when they come from two closely related languages.
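The first approach can be illustrated with a minimal sketch. The abstract does not say which selection criterion or augmentation scheme the speakers use, so everything below is an assumption made for illustration: the selection score is a simple unigram overlap with the target-domain vocabulary, the threshold is arbitrary, and the augmentation copies each feature into a shared version and a part-specific version. The function names (unigram_overlap, split_source, augment) are hypothetical.

```python
def unigram_overlap(sentence, target_vocab):
    """Fraction of tokens in `sentence` found in the target-domain vocabulary.

    Assumed selection score; the talk may use a different criterion.
    """
    tokens = sentence.split()
    if not tokens:
        return 0.0
    return sum(t in target_vocab for t in tokens) / len(tokens)


def split_source(source_sentences, target_sentences, threshold=0.5):
    """Divide source-domain data into pseudo-target (selected) and
    source (unselected) parts, using the assumed overlap score."""
    target_vocab = {t for s in target_sentences for t in s.split()}
    pseudo_target, remaining = [], []
    for s in source_sentences:
        if unigram_overlap(s, target_vocab) >= threshold:
            pseudo_target.append(s)
        else:
            remaining.append(s)
    return pseudo_target, remaining


def augment(features, part):
    """Copy every feature into a shared version and a part-specific version.

    `part` is a label such as "pseudo_target" or "source" (hypothetical names).
    """
    augmented = {}
    for name, value in features.items():
        augmented["general:" + name] = value
        augmented[part + ":" + name] = value
    return augmented
```

In this sketch, a model trained on the augmented features can learn shared weights through the "general:" copies while still fitting each part separately. Sentences that resemble the target domain contribute to the "pseudo_target:" copies, and the rest contribute to the "source:" copies.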
Yan Song was a visiting student at UW in 2011-2012. After receiving his PhD from City University of Hong Kong in 2014, he joined Microsoft Search Technology Center Asia in Beijing, China as a speech scientist. Fei Xia is an Associate Professor at the Linguistics Department at UW. She is one of the organizers of the UW/MS NLP Symposium.
Graph-based Semi-Supervised Learning of Translation Models from Monolingual Data
Hany Hassan Awadalla, Senior Research Scientist, MSR Machine Translation
Statistical phrase-based translation learns translation rules from bilingual corpora, and has traditionally only used monolingual evidence to construct features that rescore existing translation candidates.
In this work, we present a semi-supervised graph-based approach for generating new translation rules that leverages bilingual and monolingual data. We report results on a large Arabic-English system and a medium-sized Urdu-English system. Our proposed approach significantly improves the performance of competitive phrase-based systems, leading to consistent improvements between 1 and 4 BLEU points on standard evaluation sets.