A large number of languages, including Arabic, Russian, and most of the South and South East Asian languages, are written using indigenous scripts. However, often the websites and the user generated content (such as tweets and blogs) in these languages are written using Roman script due to various socio-cultural and technological reasons. This process of phonetically representing the words of a language in a non-native script is called transliteration. Transliteration, especially into Roman script, is used abundantly on the Web not only for documents, but also for user queries that intend to search for these documents.
A challenge that search engines face while processing transliterated queries and documents is that of extensive spelling variation. For instance, the word dhanyavad ("thank you" in Hindi and many other Indian languages) can be written in Roman script as dhanyavaad, dhanyvad, danyavad, danyavaad, dhanyavada, dhanyabad and so on. The aim of this shared task is to systematically formalize several research problems that one must solve to tackle this unique situation prevalent in Web search for users of many languages around the world, develop related data sets, test benches and most importantly, build a research community around this important problem that has received very little attention till date.
This being the first year, we plan to host a query labeling task, which is one of the first steps before one can tackle the bigger problem, and an ad hoc retrieval task for Hindi film lyrics, which is one of the most searched items in India and a perfect and practical example of transliterated search. In the coming years, we plan to expand these tasks to more languages and more domains; we also plan to host other sub-tasks related to Transliterated Search.
Subtask 1: Query Word Labeling
Suppose that q: w1 w2 w3 … wn is a query written in Roman script. The words w1, w2, etc. could be standard English words or words transliterated from another language L. The task is to label each word as E or L, depending on whether it is an English word or a transliterated L-language word, and then, for each transliterated word, to provide the correct transliteration in the native script (i.e., the script used for writing L).
Names of people and places in L should be treated as transliterated entries whenever they are native names. Thus, Arundhati Roy is a transliterated name, but Ruskin Bond is not. However, transliterations of such names will not be evaluated and can therefore be skipped during labeling.
The datasets will be separated by languages, such that in each set queries will contain English words mixed with transliterated words from at most one other language (say Hindi), which will be made known a priori.
Languages: English-Hindi, English-Bangla, English-Kannada*, English-Gujarati* (*Subject to availability of data)
Data and Resources:
For each E-L pair, we will release:
- Queries with labels and transliterations (i.e., intended outputs) for training or development.
- At least 5000 L words in native script and their Roman transliterations.
- English wordlist with frequency
- Language-L wordlist with frequency
- Optional: List of common names of people and places in L.
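To illustrate how the word lists and transliteration pairs might be combined, here is a minimal baseline sketch. It is not the organizers' method: the function name, the toy dictionaries, and the relative-frequency rule are all assumptions made for illustration. Exact lookup like this cannot cope with spelling variation, and any word missing from the transliteration pairs falls back to E.

```python
from typing import Dict


def label_query(query: str,
                english_freq: Dict[str, int],
                native_freq: Dict[str, int],
                roman_to_native: Dict[str, str],
                lang_label: str = "H") -> str:
    """Label each word as E or L by comparing relative word-list frequencies.

    All three dictionaries are hypothetical in-memory stand-ins for the
    released resources (English word list, L word list, transliteration pairs).
    """
    e_total = sum(english_freq.values()) or 1
    n_total = sum(native_freq.values()) or 1
    labeled = []
    for word in query.lower().split():
        native = roman_to_native.get(word)  # exact match only (assumption)
        e_score = english_freq.get(word, 0) / e_total
        n_score = native_freq.get(native, 0) / n_total if native else 0.0
        if native and n_score > e_score:
            labeled.append(f"{word}\\{lang_label}={native}")
        else:
            labeled.append(f"{word}\\E")
    return " ".join(labeled)


# Toy data, invented for this example only.
english_freq = {"recipe": 50, "paneer": 1}
native_freq = {"पालक": 30, "पनीर": 40}
roman_to_native = {"palak": "पालक", "paneer": "पनीर"}
print(label_query("palak paneer recipe", english_freq, native_freq, roman_to_native))
```

A realistic system would replace the exact dictionary lookup with a transliteration model that tolerates spelling variants.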
Examples, where L is Hindi and is labeled as H (each query is followed by its labeled output):
sachin tendulkar number of centuries
sachin\H tendulkar\H number\E of\E centuries\E
palak paneer recipe
palak\H=पालक paneer\H=पनीर recipe\E
mungeri lal ke haseen sapney
mungeri\H lal\H ke\H=के haseen\H=हसीन sapney\H=सपने
iguazu water fall argentina
iguazu\E water\E fall\E argentina\E
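The labeled lines above can be read back into structured form with a few lines of code. The sketch below is one possible parser; the names LabeledWord and parse_labeled_query are hypothetical, and it assumes tokens are separated by whitespace and that the last backslash in a token separates the word from its label.

```python
from typing import List, NamedTuple, Optional


class LabeledWord(NamedTuple):
    word: str              # Roman-script word as typed in the query
    label: str             # "E" or the language label, e.g. "H"
    native: Optional[str]  # native-script transliteration, if given


def parse_labeled_query(line: str) -> List[LabeledWord]:
    """Split a labeled query such as 'palak\\H=पालक recipe\\E' into its parts."""
    parsed = []
    for token in line.split():
        word, sep, rest = token.rpartition("\\")
        if not sep:
            raise ValueError(f"missing label in token: {token!r}")
        label, eq, native = rest.partition("=")
        parsed.append(LabeledWord(word, label, native if eq else None))
    return parsed


example = r"palak\H=पालक paneer\H=पनीर recipe\E"
for item in parse_labeled_query(example):
    print(item)
```

Names such as sachin\H carry a label but no transliteration, so the native field is left as None in that case.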
Subtask 2: Multi-script Ad hoc retrieval for Hindi Song Lyrics
The input is a query, written in Devanagari script or in its Roman transliterated form, consisting of a (possibly partial or incorrect) Hindi song title or some part of the lyrics. The output is a ranked list of songs, in both Devanagari and Roman scripts, retrieved from a corpus of Hindi film lyrics in which some of the documents are in Devanagari and some are in Roman transliterated form.
Data and Resources:
- Hindi song lyrics corpus of ~60000 documents
- Around 20 queries, along with relevance judgments for around 50 documents (i.e., songs) per query; this set of documents is expected to contain all the relevant ones. These queries can be used for training/development.
- All the other resources for subtask 1
Important dates:
- Training/Dev data release: 6th Sep 2013
- Test Set release: 30th Sep 2013
- Submit Run: 15th Oct 2013
- Results distributed: 30th Oct 2013
- Working Note due: 15th Nov 2013
- FIRE Workshop: 4-6th Dec 2013
The following generic datasets that could be useful for both subtask 1 and 2 are available from http://cse.iitkgp.ac.in/resgrp/cnerg/qa/fire13translit/index.html
- Word lists with corpus frequencies for English, Hindi, Bangla and Gujarati
- Links to monolingual corpora of English, Hindi and Gujarati
- Word transliteration pairs for Hindi-English, Bangla-English and Gujarati-English which could be useful for training or testing transliteration systems.
Dataset for Subtask 1:
You can access 500, 100 and 150 labeled queries for Hindi, Bangla and Gujarati respectively, as per the description of Subtask 1. Due to the small size of the data, we do not recommend that you use these for training your algorithms. Rather, you could use them as a development set for tuning model parameters and gaining insight into the problem. This will also give you an idea of the precise input-output format required for the task.
To obtain these datasets, please send an email with subject "Request for FIRE 2013 Transliteration Track Datasets (Subtask 1)" to monojitc [AT] microsoft [DOT] com with cc to rishiraj [DOT] saharoy [AT] gmail [DOT] com. The email should contain the full name(s), affiliation(s), address(es), and email(s) of all the team members, and mobile no. of at least one member who can be contacted if need be. For further details visit the data page: http://cse.iitkgp.ac.in/resgrp/cnerg/qa/fire13translit/index.html
Dataset for Subtask 2:
This dataset can also be obtained by emailing us. Send an email to monojitc [AT] microsoft [DOT] com with cc to rishiraj [DOT] saharoy [AT] gmail [DOT] com, with the subject "Request for FIRE 2013 Transliteration Track Datasets (Subtask 2)". Please mention the full name(s), affiliation(s), address(es), and email(s) of all the team members, and mobile no. of at least one member who can be contacted if need be. For further details visit the data page.
Test data release and submission dates:
For logistical reasons, we would like each team to submit its output within two days of the release of the test data. However, each team will be allowed to choose its preferred date (within a stipulated time period) for test data release. The data will be sent to you by email on your chosen date, and we will expect your runs to be submitted within 48 hours from the time the data was sent. If we do not receive your run within 48 hours, we cannot guarantee the evaluation of your submission.
Your preferred date for test data release has to be between 1st and 15th October 2013. No test data will be released before 1st or after 15th October 2013. Also note that we will not release the development-cum-training datasets (which are currently being distributed) beyond 13th October 2013.
Evaluation report will be sent to the teams by 30th October 2013.
The output format for Subtask 1 is exactly the same as that of the annotated dev data. Words have to be marked as \E or as \L=<word in native script>, where L is H, B or G depending on whether the language concerned is Hindi, Bangla or Gujarati. For example,
beetein\H=बीतें lamhein\H=लम्हें video\E download\E
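Since malformed output may not be evaluated, it can help to check each line before submitting. The following is a rough checker, not an official validator: the function name is hypothetical, and it assumes that words contain no whitespace, backslash or equals sign, that English words never carry a transliteration, and that an L-labeled word may omit the transliteration (as the dev examples do for names).

```python
import re


def is_valid_output_line(line: str, lang_label: str) -> bool:
    """Return True if every token looks like word\\E or word\\L[=native]."""
    token_pattern = re.compile(
        r"[^\s\\=]+"                                           # the Roman-script word
        r"\\"                                                  # literal backslash
        r"(?:E|" + re.escape(lang_label) + r"(?:=[^\s\\=]+)?)"  # E, or L with optional =native
    )
    tokens = line.split()
    return bool(tokens) and all(token_pattern.fullmatch(t) for t in tokens)


print(is_valid_output_line(r"beetein\H=बीतें lamhein\H=लम्हें video\E download\E", "H"))
```

Pass "B" or "G" as lang_label when checking Bangla or Gujarati runs.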
For subtask 2, the output format follows that of the qrels dev file, without, of course, the relevance judgments. You will be provided with query-ids, and the output should contain only the top ten documents for each query.
There should be a newline after each query-id and doc-id, and the last doc-id for each query must be followed by two newlines.
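One reading of this layout is sketched below: each query-id on its own line, followed by its doc-ids one per line, with an extra blank line closing each query. The function name and the ids used are hypothetical, and the qrels dev file remains the authoritative reference for the exact format.

```python
from typing import Iterable, List, Tuple


def format_run(results: Iterable[Tuple[str, List[str]]], top_k: int = 10) -> str:
    """Serialize ranked results as query-id, then doc-ids, one per line.

    The last doc-id of each query is followed by two newlines, and only
    the top_k highest-ranked documents are kept per query.
    """
    blocks = []
    for query_id, ranked_doc_ids in results:
        lines = [query_id] + list(ranked_doc_ids[:top_k])
        blocks.append("\n".join(lines) + "\n\n")
    return "".join(blocks)


# Invented ids, for illustration only.
print(format_run([("q1", ["d3", "d1"]), ("q2", ["d7"])]), end="")
```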
Number of Runs allowed:
A "run" is the output of your system on our test data in the prescribed format. If you want to test different approaches to solve the same subtask, you can do so by submitting multiple runs for the subtask. A team is allowed to submit at most 3 runs per subtask.
Step 1: Send an email addressed to monojitc [AT] microsoft [DOT] com with cc to rishiraj [DOT] saharoy [AT] gmail [DOT] com by 4th October 2013 with subject "Test data release date". Clearly identify yourself so that we know which team you represent, and specify your preferred date for test data release, and the tracks (subtask I, subtask II or both) for which you would like to receive the test data. Also specify the email address(es) where you would like to receive the test data.
Step 2: Read and familiarize yourself with the input/output format of the test data. Make sure that the output file(s) generated by your system conform to the specified format exactly, because otherwise we may not be able to evaluate your submission or evaluation results might be incorrect.
Step 3: On your chosen date, you will receive the test data by 10am IST.
Step 4: Download the test set(s). Run your system(s) on these files. If you wish to submit multiple runs for a subtask, name your output text files as subtask<#>run<#>.txt (e.g., subtask1run2.txt). For subtask1, append the language name with an underscore (_). For example, subtask1run3_gujarati.txt. Put all your output files in a zipped file. If you want, you can add a Readme file in the zipped archive with some notes that you want us to know before evaluating your submission.
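The naming convention in Step 4 can be automated with a small helper. This is only a convenience sketch: the function names and the archive name used in the comment are hypothetical, and the check on run numbers reflects the limit of 3 runs per subtask stated above.

```python
import zipfile
from typing import Dict, Optional


def run_filename(subtask: int, run: int, language: Optional[str] = None) -> str:
    """Build a name such as subtask1run3_gujarati.txt or subtask2run1.txt."""
    if not 1 <= run <= 3:
        raise ValueError("at most 3 runs are allowed per subtask")
    name = f"subtask{subtask}run{run}"
    if language:
        name += f"_{language.lower()}"  # language suffix applies to subtask 1
    return name + ".txt"


def build_archive(outputs: Dict[str, str], archive_path: str) -> None:
    """Write each output text under its file name into a single zip archive."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for filename, text in outputs.items():
            archive.writestr(filename, text)  # str data is stored as UTF-8


print(run_filename(1, 3, "Gujarati"))
# build_archive({run_filename(2, 1): "..."}, "myteam_runs.zip")  # hypothetical archive name
```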
Step 5: Email us this zipped file before 10am IST on the third day from your chosen date for test data release (e.g., if you had chosen 7th October 2013 as the test data release, you will receive the test data on 7th Oct by 10am IST, and you have to send us back the output files by 10am IST on 9th Oct).
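The deadline arithmetic can be double-checked with a short calculation. The sketch below assumes the release happens at exactly 10am IST (UTC+5:30) on the chosen date and that the deadline is 48 hours later; the function and constant names are made up for this example.

```python
from datetime import date, datetime, time, timedelta, timezone

IST = timezone(timedelta(hours=5, minutes=30), "IST")
WINDOW_START = date(2013, 10, 1)   # earliest allowed release date
WINDOW_END = date(2013, 10, 15)    # latest allowed release date


def submission_deadline(release_day: date) -> datetime:
    """Return the latest submission time for a chosen test data release date."""
    if not WINDOW_START <= release_day <= WINDOW_END:
        raise ValueError("release date must be between 1st and 15th October 2013")
    released_at = datetime.combine(release_day, time(10, 0), tzinfo=IST)
    return released_at + timedelta(hours=48)


print(submission_deadline(date(2013, 10, 7)).isoformat())
```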
Resources and References
Here is a list of papers that you might find useful while solving these tasks:
- Umair Z Ahmed, Kalika Bali, Monojit Choudhury, and Sowmya V. B., Challenges in Designing Input Method Editors for Indian Languages: The Role of Word-Origin and Context, in Proceedings of IJCNLP Workshop on Advances in Text Input Methods, Association for Computational Linguistics, November 2011 [This paper can help you understand the issues in Roman transliteration of Indian languages and the existing open problems and research challenges in the area.]
- Sarvnaz Karimi, Falk Scholer and Andrew Turpin, Machine Transliteration Survey. In ACM Computing Surveys (CSUR), Volume 43 Issue 3, April 2011 [A general survey on transliteration techniques]
- P. J. Antony and K. P. Soman, Machine Transliteration for Indian Languages: A Literature Survey. In International Journal of Scientific & Engineering Research, Volume 2, Issue 12, December 2011 [Provides a useful list of references on machine transliteration for Indian languages.]
- Kanika Gupta, Monojit Choudhury and Kalika Bali, Mining Hindi-English Transliteration Pairs from Online Hindi Lyrics. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2012. [This paper will give you an idea about the type and extent of spelling variations in Bollywood song lyrics over the Web. The mined dataset is available and can be used for training a Hindi-English transliteration system.]
- B King, S Abney, Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods. In Proceedings of NAACL-HLT, 2013 [This work proposes a weakly supervised algorithm for language identification in mixed language texts and is directly applicable to Subtask 1.]
This is a growing list, so keep visiting ...
- Monojit Choudhury, Microsoft Research India
- Prof. Prasenjit Majumder, DAIICT Gandhinagar
- Rishiraj Saha Roy, IIT Kharagpur
- Komal Agarwal, DAIICT
- Dastagiri Reddy, IIT Kharagpur
- Komal Agarwal, DAIICT
- Ranita, IIT Kharagpur
- Rishiraj Saha Roy, IIT Kharagpur
- Rohan Ramanath, CMU
- Swadhin Pradhan, IIT Kharagpur
- Yogarshi Vyas, IIT Kharagpur