Glenn Slayden, Mei-Yuh Hwang, and Lee Schwartz
1 February 2010
Thai language text presents unique challenges for integration into large-scale multi-language statistical machine translation (SMT) systems, largely stemming from the nominal lack of punctuation and inter-word space. We review our independent solutions for Thai character sequence normalization, tokenization, typed-entity identification, sentence-breaking, and text re-spacing. We describe a general maximum entropy-based classifier for sentence breaking, whose algorithm can be easily extended to other languages such as Arabic. After integration of all components, we obtain a final translation BLEU score of 0.19 for English to Thai and 0.21 for Thai to English.
| Type | TechReport |
| Number | MSR-TR-2010-41 |