Large-Scale Thai Statistical Machine Translation

Glenn Slayden; Mei-Yuh Hwang; Lee Schwartz

MSR-TR-2010-41 | February 2010

Thai-language text presents unique challenges for integration into large-scale multi-language statistical machine translation (SMT) systems, largely stemming from the nominal lack of punctuation and inter-word spacing. We review our independent solutions for Thai character sequence normalization, tokenization, typed-entity identification, sentence breaking, and text re-spacing. We describe a general maximum-entropy classifier for sentence breaking, whose algorithm can be easily extended to other languages such as Arabic. After integration of all components, we obtain a final translation BLEU score of 0.19 for English-to-Thai and 0.21 for Thai-to-English.
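As a rough illustration of the sentence-breaking idea, a binary maximum-entropy model is equivalent to logistic regression over context features, applied here to each space in already-tokenized Thai text. The abstract does not give the report's feature set or training procedure, so everything below is an assumption made for illustration: the feature templates, the toy training data, and the names `features`, `prob_break`, and `train` are hypothetical, and plain stochastic gradient ascent stands in for whatever optimizer the authors used.

```python
import math
from collections import defaultdict

def features(tokens, i):
    """Hypothetical context features for the space token at tokens[i]."""
    prev_tok = tokens[i - 1] if i > 0 else "<s>"
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return ["bias", "prev=" + prev_tok, "next=" + next_tok]

def prob_break(weights, feats):
    """Binary maximum-entropy model: P(break | context) = sigmoid(w . f)."""
    score = sum(weights[f] for f in feats)
    return 1.0 / (1.0 + math.exp(-score))

def train(examples, epochs=200, lr=0.5):
    """Fit feature weights by stochastic gradient ascent on the log-likelihood."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, label in examples:
            error = label - prob_break(weights, feats)
            for f in feats:
                weights[f] += lr * error
    return weights

# Toy data (assumed, not from the report): (tokens, index of space, label),
# where label 1 means the space ends a sentence and 0 means it does not.
train_data = [
    (["ไป", "ครับ", " ", "ผม", "มา"], 2, 1),   # space after a final particle
    (["ดี", "ครับ", " ", "คุณ", "ไป"], 2, 1),
    (["ผม", " ", "และ", "คุณ"], 1, 0),         # space before a conjunction
    (["แมว", " ", "และ", "หมา"], 1, 0),
]
examples = [(features(toks, i), label) for toks, i, label in train_data]
weights = train(examples)

# Score two unseen space candidates.
p_after_particle = prob_break(weights, features(["มา", "ครับ", " ", "แมว"], 2))
p_before_conjunction = prob_break(weights, features(["หมา", " ", "และ", "แมว"], 1))
print(p_after_particle > 0.5, p_before_conjunction < 0.5)
```

Because the model only sees string-valued context features, moving such a sketch to another language would mainly mean changing the feature templates and training data, which is in the spirit of the extensibility the abstract claims.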