MT Detection in Web-Scraped Parallel Corpora

The Web is an invaluable source of parallel data, but in recent years it has become polluted with increasing amounts of machine-translated content. Using such data to train an MT system can introduce error and decrease the resulting quality of the system. In this paper, we present an algorithm for filtering machine-translated content from Web-scraped parallel corpora, and discuss its application in cleaning such corpora for use in training statistical machine translation systems. We demonstrate that our algorithm is capable of identifying machine-translated content in parallel corpora for a variety of language pairs, and that in some cases it can be very effective in improving the quality of an MT system. Trained on our filtered corpus, our most successful MT system outperformed one trained on the full, unfiltered corpus, thus challenging the conventional wisdom in natural language processing that “more data is better data”.

In Proceedings of MT Summit XIII

Publisher Asia-Pacific Association for Machine Translation

Details

Type Inproceedings