A Kumaran, Mitesh Khapra, and Pushpak Bhattacharyya
Machine transliteration is an important problem in an increasingly multilingual world, as it plays a critical role in many downstream applications, such as machine translation and crosslingual information retrieval (CLIR) systems. In this paper, we propose compositional machine transliteration systems, in which multiple transliteration components may be composed either to improve existing transliteration quality or to enable transliteration between languages even when no direct parallel names corpora exist between them. Specifically, we propose two distinct forms of composition: serial and parallel. A serial compositional system chains individual transliteration components, say, X-to-Y and Y-to-Z systems, to provide X-to-Z transliteration functionality. In parallel composition, evidence from multiple transliteration paths from X to Z is aggregated to improve the quality of a direct system. We demonstrate the functionality and performance benefits of the compositional methodology using a state-of-the-art machine transliteration framework in English and a set of Indian languages, namely Hindi, Marathi, and Kannada. Finally, we underscore the utility and practicality of our compositional approach by showing that a CLIR system integrated with compositional transliteration systems performs consistently on par with, and sometimes better than, one integrated with a direct transliteration system.
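The two forms of composition described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how candidates are scored or how evidence is aggregated, so the sketch assumes each component returns candidates with probability-like scores, that serial composition sums score products over the top intermediate candidates, and that parallel composition is a simple linear interpolation. All names (`from_table`, `serial_compose`, `parallel_compose`, `top_k`, `weight`) and the toy strings `x1`, `y1`, `z1`, and so on are hypothetical placeholders.

```python
from collections import defaultdict
from typing import Callable, Dict

# A transliterator maps a source string to scored candidate strings.
Transliterator = Callable[[str], Dict[str, float]]


def from_table(table: Dict[str, Dict[str, float]]) -> Transliterator:
    """Wrap a lookup table as a toy transliteration component."""
    return lambda name: table.get(name, {})


def serial_compose(x_to_y: Transliterator, y_to_z: Transliterator,
                   top_k: int = 5) -> Transliterator:
    """Chain X-to-Y and Y-to-Z components into an X-to-Z component.

    Assumption: a candidate's score is the sum, over the top_k
    intermediate strings, of score(y | x) * score(z | y).
    """
    def x_to_z(name: str) -> Dict[str, float]:
        ranked = sorted(x_to_y(name).items(), key=lambda kv: kv[1], reverse=True)
        scores: Dict[str, float] = defaultdict(float)
        for y, score_y in ranked[:top_k]:
            for z, score_z in y_to_z(y).items():
                scores[z] += score_y * score_z
        return dict(scores)
    return x_to_z


def parallel_compose(direct: Transliterator, bridged: Transliterator,
                     weight: float = 0.5) -> Transliterator:
    """Aggregate evidence from a direct path and a bridged path.

    Assumption: linear interpolation with a fixed weight on the direct path.
    """
    def combined(name: str) -> Dict[str, float]:
        d, b = direct(name), bridged(name)
        return {z: weight * d.get(z, 0.0) + (1.0 - weight) * b.get(z, 0.0)
                for z in set(d) | set(b)}
    return combined


# Toy components with made-up scores, for illustration only.
x_to_y = from_table({"x1": {"y1": 0.6, "y2": 0.4}})
y_to_z = from_table({"y1": {"z1": 0.5, "z2": 0.5}, "y2": {"z2": 1.0}})
direct_x_to_z = from_table({"x1": {"z1": 0.8, "z2": 0.2}})

serial_x_to_z = serial_compose(x_to_y, y_to_z)
parallel_x_to_z = parallel_compose(direct_x_to_z, serial_x_to_z, weight=0.5)
```

Calling `serial_x_to_z("x1")` yields X-to-Z candidates without any direct X-to-Z data, while `parallel_x_to_z("x1")` re-scores the direct system's candidates using the bridged path as additional evidence.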
|Published in|ACM Transactions on Asian Language Information Processing (TALIP) Journal|
|Publisher|Association for Computing Machinery, Inc.|
Copyright © 2007 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or firstname.lastname@example.org. The definitive version of this paper can be found at ACM’s Digital Library: http://www.acm.org/dl/.