Generative Models of Noisy Translations with Applications to Parallel Fragment Extraction

Chris Quirk, Raghavendra Udupa, and Arul Menezes

Abstract

The development of broad domain statistical machine translation systems is gated by the availability of parallel data. A promising strategy for mitigating data scarcity is to mine parallel data from comparable corpora. Although comparable corpora seldom contain parallel sentences, they often contain parallel words or phrases. Recent fragment extraction approaches have shown that including parallel fragments in SMT training data can significantly improve translation quality. We describe efficient and effective generative models for extracting fragments, and demonstrate that these algorithms produce competitive improvements on cross-domain test data without suffering in-domain degradation even at very large scale.

Details

Publication type: Inproceedings
Published in: Proceedings of MT Summit XI
URL: http://www.eamt.org
Publisher: European Association for Machine Translation