Mining Named Entity Transliteration Equivalents from Comparable Corpora

Raghavendra Udupa, K Saravanan, A Kumaran, and Jagadeesh Jagarlamudi

Abstract

Named Entities (NEs) form a significant fraction of query terms in Information Retrieval (IR) systems, and their retrieval has been shown to correlate highly with IR system performance. NEs are even more important in Cross Language Information Retrieval (CLIR), where, in addition to being a significant component of query terms, they must be translated or transliterated across languages. In recent times, the large quantity and perpetual availability of news corpora in many of the world’s languages simultaneously has spurred interest in a promising alternative to NE translation or transliteration, namely the mining of Named Entity Transliteration Equivalents (NETEs) from such news corpora (Klementiev and Roth, 2006; Tao et al., 2006). Formally, comparable news corpora are time-aligned news stories in a pair of languages, collected over a reasonably long duration. NETEs mined from comparable news corpora could be valuable in many tasks, such as CLIR and MT, where they effectively complement bilingual dictionaries and machine transliteration systems. This opportunity is precisely what we address in our work. We introduce a novel method, called MINT (MIning Named-entity Transliteration equivalents), for effective mining of NETEs from comparable corpora. Among its innovations, MINT relies on few linguistic resources, requiring a Named Entity Recognizer (NER) in only one language; hence NETEs may be mined even from a resource-poor language when it is paired with a language for which an NER is available.

Details

Publication type: Inproceedings
Published in: The 17th ACM Conference on Information and Knowledge Management (CIKM 2008), Napa Valley, USA
Publisher: Association for Computing Machinery, Inc.