Improving Tamil-English Cross-Language Information Retrieval by Transliteration Generation and Mining

While state-of-the-art Cross-Language Information Retrieval (CLIR) systems are reasonably accurate and largely robust, they typically make mistakes in handling proper or common nouns. Such terms suffer from a compounding of errors across the query translation phase and the document retrieval phase. In this paper, we propose two techniques, transliteration generation and transliteration mining, to handle query terms that may occur in their transliterated form in the target corpus. The transliteration generation approach generates possible transliteration equivalents for out-of-vocabulary (OOV) terms during the query translation phase. The mining approach mines potential transliteration equivalents for the OOV terms from the documents returned by a first-pass retrieval on the target corpus, and uses them in a final retrieval. An implementation of such an integrated system achieved a peak mean average precision (MAP) of 0.5133 in the monolingual English-English task and 0.4145 in the Tamil-English task. The Tamil-English cross-language retrieval performance improved from 75% to 81% of the English-English monolingual retrieval performance, underscoring the effectiveness of integrating the two techniques into the CLIR system.
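
The mining step described in the abstract can be illustrated with a minimal sketch. It is not the method used in the paper: it assumes the OOV term has already been romanized by some upstream step, and it swaps in a plain string-similarity score from Python's `difflib` in place of whatever transliteration similarity model the authors used. The function name `mine_transliterations`, the threshold of 0.75, and the sample documents are all hypothetical.

```python
from difflib import SequenceMatcher


def mine_transliterations(romanized_oov, first_pass_docs, threshold=0.75):
    """Return terms from first-pass documents that resemble the OOV term.

    Assumption: `romanized_oov` is a Latin-script rendering of the Tamil
    OOV query term; similarity is a generic character-level ratio, used
    here only as a stand-in for a real transliteration model.
    """
    best_score = {}
    for doc in first_pass_docs:
        for token in doc.lower().split():
            token = token.strip(".,;:!?\"'()")
            if not token:
                continue
            score = SequenceMatcher(None, romanized_oov.lower(), token).ratio()
            if score >= threshold:
                best_score[token] = max(score, best_score.get(token, 0.0))
    # Highest-scoring candidates first; these would be added to the query
    # before the final retrieval pass.
    return sorted(best_score, key=best_score.get, reverse=True)


# Hypothetical first-pass results for a query containing an OOV name.
docs = [
    "Sachin Tendulkar scored a century in Chennai.",
    "The match was held in Mumbai.",
]
mined = mine_transliterations("tentulkar", docs)
print(mined)
```

In this toy setting the mined terms would simply be appended to the translated query for the second retrieval pass; a real system would need a similarity measure that works across the Tamil and Latin scripts.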

In Proceedings of the Tamil Internet Conference 2011, Philadelphia, PA

Publisher: INFITT

Details

Type: Inproceedings