Large-Scale Thai Statistical Machine Translation

Thai language text presents unique challenges for integration into large-scale multi-language statistical machine translation (SMT) systems, largely stemming from the nominal lack of punctuation and inter-word space. We review our independent solutions for Thai character sequence normalization, tokenization, typed-entity identification, sentence-breaking, and text re-spacing. We describe a general maximum entropy-based classifier for sentence breaking, whose algorithm can be easily extended to other languages such as Arabic. After integration of all components, we obtain a final translation BLEU score of 0.19 for English to Thai and 0.21 for Thai to English.

Details

Type: TechReport
Number: MSR-TR-2010-41