MT Detection in Web-Scraped Parallel Corpora

Spencer Rarrick, Chris Quirk, and William Lewis

Abstract

The Web is an invaluable source of parallel data, but in recent years it has become polluted with increasing amounts of machine-translated content. Using such data to train an MT system can introduce error and decrease the resulting quality of the system. In this paper, we present an algorithm for filtering machine-translated content from Web-scraped parallel corpora, and discuss its application in cleaning such corpora for use in training statistical machine translation systems. We demonstrate that our algorithm is capable of identifying machine-translated content in parallel corpora for a variety of language pairs, and that in some cases it can be very effective in improving the quality of an MT system. Trained on our filtered corpus, our most successful MT system outperformed one trained on the full, unfiltered corpus, thus challenging the conventional wisdom in natural language processing that “more data is better data”.

Details

Publication type: Inproceedings
Published in: Proceedings of MT Summit XIII
Publisher: Asia-Pacific Association for Machine Translation