I have been a researcher at Microsoft Research Lab India since 2007. My research interests cut across the areas of Linguistics, Cognition and Computation. Currently, I am working on script- and code-mixing, especially in social media and web search. We have introduced the notion of Mixed-Script Information Retrieval, where the query and the documents are in the same language but can be in different scripts, and possibly in more than one script each; the task is to retrieve the relevant documents across scripts. Such situations arise quite commonly for Indian languages, where the documents (say song lyrics or posts on discussion forums) can be written either in the native script or in Romanized form. In fact, a large amount of Indian language (and also Greek, Arabic, etc.) content on the Web is available in Romanized form. Mixed-script IR entails challenges such as cross-script indexing, handling transliteration-induced spelling variations in queries and documents, code-mixed query understanding and query completion.
Code-mixing, or the use of more than one language in a single conversation or utterance, is a phenomenon that is observed in all multilingual societies. Due to social media and online forums, code-mixing is now rampant on the Internet. I am interested in developing core NLP techniques for identifying and processing code-mixed text. I am also interested in studying the extent and distribution of code-mixing and the socio-linguistic factors influencing it.
I also work on computational musicology. I would like to understand how the (computationally defined) structure of music correlates with and causes certain emotional responses and preferences in individuals and cultures. In particular, I am studying the usage of musical scales and their evolution across the musical cultures of the world, and the cognitive models of scale perception.
I also work on various NLP and Information Retrieval techniques for Indian languages. In the past, I have worked on language evolution, the evolution of the structure of Web search queries, and complex networks.
Spandana Gella, Kalika Bali, and Monojit Choudhury, "ye word kis lang ka hai bhai?" Testing the Limits of Word level Language Identification, NLPAI, December 2014.
Gokul Chittaranjan, Yogarshi Vyas, Kalika Bali, and Monojit Choudhury, Word-level Language Identification using CRF: Code-switching Shared Task Report of MSR India System, in Proceedings of the First Workshop on Computational Approaches to Code Switching, Association for Computational Linguistics, Doha, Qatar, October 2014.
Yogarshi Vyas, Spandana Gella, Jatin Sharma, Kalika Bali, and Monojit Choudhury, POS Tagging of English-Hindi Code-Mixed Social Media Content, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, October 2014.
Kalika Bali, Jatin Sharma, Monojit Choudhury, and Yogarshi Vyas, "I am borrowing ya mixing ?" An Analysis of English-Hindi Code Mixing in Facebook, in Proceedings of the First Workshop on Computational Approaches to Code Switching, Association for Computational Linguistics, Doha, Qatar, October 2014.
Rishiraj Saha Roy, Rahul Katare, Niloy Ganguly, Srivatsan Laxman, and Monojit Choudhury, Discovering and understanding word level user intent in Web search queries, in Web Semantics: Science, Services and Agents on the World Wide Web, Elsevier, August 2014.
Rishiraj Saha Roy, Rahul Katare, Niloy Ganguly, and Monojit Choudhury, Automatic Discovery of Adposition Typology, in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, Coling 2014, August 2014.
Parth Gupta, Kalika Bali, Rafael E. Banchs, Monojit Choudhury, and Paolo Rosso, Query Expansion for Mixed-Script Information Retrieval, ACM – Association for Computing Machinery, July 2014.
Rishiraj Saha Roy, Yogarshi Vyas, Niloy Ganguly, and Monojit Choudhury, Improving Unsupervised Query Segmentation using Parts-of-Speech Sequence Information, in Proceedings of the 37th Annual ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR '14), ACM – Association for Computing Machinery, July 2014.
Rishiraj Saha Roy, M. Dastagiri Reddy, Niloy Ganguly, and Monojit Choudhury, Understanding the Linguistic Structure and Evolution of Web Search Queries, EVOLANG, April 2014.
Sai Sumanth Miryala, Ranjita Bhagwan, Monojit Choudhury, and Kalika Bali, Automatically Identifying Vocal Expressions for Music Transcription, in 2013 International Society of Music Information Retrieval, November 2013.
Please consider submitting your work to the First Workshop on Language Technologies for Indian Social Media Text, to be held in conjunction with ICON 2014 (deadline: 7th Nov 2014). As a part of this workshop, I am offering a tutorial on Code-mixing in Social Media.
We are organizing the FIRE Shared Task on Transliterated Search. The deadline for task registration has passed.