Generative Models of Noisy Translations with Applications to Parallel Fragment Extraction

The development of broad-domain statistical machine translation (SMT) systems is limited by the availability of parallel data. A promising strategy for mitigating data scarcity is to mine parallel data from comparable corpora. Although comparable corpora seldom contain parallel sentences, they often contain parallel words or phrases. Recent fragment extraction approaches have shown that including parallel fragments in SMT training data can significantly improve translation quality. We describe efficient and effective generative models for extracting fragments, and demonstrate that these models produce competitive improvements on cross-domain test data without suffering in-domain degradation, even at very large scale.

In Proceedings of MT Summit XI

Publisher: European Association for Machine Translation
Copyright 2007 by the European Association for Machine Translation.

Details

Type: Inproceedings
URL: http://www.eamt.org