Amittai Axelrod, Xiaodong He, Li Deng, Alex Acero, and Mei-Yuh Hwang
The IWSLT benchmark task is an annual evaluation campaign on spoken language translation held by the International Workshop on Spoken Language Translation (IWSLT). The task is to translate TED talks (www.ted.com). It presents two unique challenges. First, the underlying topic switches sharply from talk to talk, and each talk contains only tens to hundreds of utterances. The translation system therefore needs to adapt to the current topic quickly and dynamically. Second, unlike other machine translation benchmark tasks, only a very small relevant parallel corpus (transcripts of TED talks) is available. It is therefore necessary to estimate the translation models accurately from limited data. In this paper, we present our recent progress and two new methods on the IWSLT TED talk translation task from Chinese into English. To address the first problem, we use unsupervised topic modeling to select additional topic-dependent parallel data from a globally irrelevant corpus. These additional data slices can then be used to build an unsupervised topic-adapted machine translation system. For the second problem, we develop a discriminative training method to estimate the translation models more accurately. Our experimental results show that both methods improve translation quality over a state-of-the-art baseline.
Publisher IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)