Collective Corpus Weighting and Phrase Scoring for SMT using Graph-based Random Walk

NLPCC 2013 |

Data quality is one of the key factors in Statistical Machine Translation (SMT). Previous research addressed the data quality problem in SMT by corpus weighting or phrase scoring, but these two types of methods were often investigated independently. To leverage the dependencies between them, we propose an intuitive approach to improve translation modeling by collective corpus weighting and phrase scoring. The method uses the mutual reinforcement between the sentence pairs and the extracted phrase pairs, based on the observation that better sentence pairs often lead to better phrase extraction and vice versa. An effective graph-based random walk is designed to estimate the quality of sentence pairs and phrase pairs simultaneously. Extensive experimental results show that our method improves performance significantly and consistently in several Chinese-to-English translation tasks.