Applications of Data Selection via Cross-Entropy Difference for Real-World Statistical Machine Translation

Amittai Axelrod, QingJun Li, and Will Lewis

Abstract

We broaden the application of data selection methods for domain adaptation to a larger number of languages, data sets, and decoders than shown in previous work, and explore comparable applications for both monolingual and bilingual cross-entropy difference methods. We compare domain-adapted systems against very large general-purpose systems for the same languages, and do so without a bias toward a particular translation direction. We present results against real-world general-purpose systems tuned on domain-specific data, which are substantially harder to beat than standard research baseline systems. We show better performance for nearly all domain-adapted systems, even though the domain-adapted systems are trained on a fraction of the content of their general-domain counterparts. The high performance of these methods suggests applicability to a wide variety of contexts, particularly in scenarios where only small supplies of unambiguously domain-specific data are available, yet it is believed that additional similar data is included in larger heterogeneous-content general-domain corpora.
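As a rough illustration of the monolingual cross-entropy difference idea named in the abstract, the sketch below scores each general-domain sentence by its per-token cross-entropy under an in-domain language model minus its cross-entropy under a general-domain model, and ranks the lowest scores first. It is a toy under stated assumptions, not the authors' implementation: the add-one-smoothed unigram models stand in for real n-gram language models, the general model is trained on the whole general corpus for brevity, and the function names (`train_unigram`, `cross_entropy`, `rank_by_cross_entropy_difference`) are hypothetical. A bilingual variant would, under the same assumptions, add the corresponding difference computed on the other language side of each sentence pair.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count tokens in a list of whitespace-tokenized sentences."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-token cross-entropy in bits under an add-one-smoothed unigram model."""
    counts, total = model
    tokens = sentence.split()
    if not tokens:
        return 0.0
    log_prob = sum(
        math.log2((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return -log_prob / len(tokens)


def rank_by_cross_entropy_difference(in_domain, general):
    """Return (score, sentence) pairs for the general corpus, best first.

    score = H_in(s) - H_gen(s); a lower score means the sentence looks more
    like the in-domain data and less like the general data.
    """
    vocab = {tok for s in in_domain + general for tok in s.split()}
    vocab_size = len(vocab) + 1  # one extra slot for unseen tokens (toy assumption)
    lm_in = train_unigram(in_domain)
    lm_gen = train_unigram(general)
    scored = [
        (cross_entropy(s, lm_in, vocab_size) - cross_entropy(s, lm_gen, vocab_size), s)
        for s in general
    ]
    return sorted(scored)
```

In use, one would keep only the top-ranked portion of the returned list (for example, a fixed number of sentences or everything below a score threshold) as training data for the domain-adapted system; the cutoff is a tuning choice that this sketch does not address.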

Details

Publication type: Inproceedings
Published in: Proceedings of the International Workshop on Spoken Language Translation (IWSLT 2012)
Publisher: International Workshop on Spoken Language Translation (IWSLT)