Spencer Rarrick, Chris Quirk, and William Lewis
The Web is an invaluable source of parallel data, but in recent years it has become polluted with increasing amounts of machine-translated content. Using such data to train an MT system can introduce errors and degrade the quality of the resulting system. In this paper, we present an algorithm for filtering machine-translated content from Web-scraped parallel corpora, and discuss its application in cleaning such corpora for use in training statistical machine translation systems. We demonstrate that our algorithm is capable of identifying machine-translated content in parallel corpora for a variety of language pairs, and that in some cases it can be very effective in improving the quality of an MT system. Trained on our filtered corpus, our most successful MT system outperformed one trained on the full, unfiltered corpus, thus challenging the conventional wisdom in natural language processing that “more data is better data”.
In Proceedings of MT Summit XIII
Publisher: Asia-Pacific Association for Machine Translation