Word-Level Language Identification Using CRF: Code-Switching Shared Task Report of MSR India System

  • Gokul Chittaranjan,
  • Yogarshi Vyas,
  • ,
  • Monojit Choudhury

Proceedings of the First Workshop on Computational Approaches to Code Switching

Published by Association for Computational Linguistics

We describe a conditional random field (CRF) based system for word-level language identification of code-mixed text. Our method uses lexical, contextual, character n-gram, and special character features, and can therefore be replicated easily across languages. Its performance is benchmarked against the test sets provided by the shared task on code-mixing (Solorio et al., 2014) for four language pairs, namely English-Spanish (En-Es), English-Nepali (En-Ne), English-Mandarin (En-Cn), and Standard Arabic-Arabic Dialects (Ar-Ar). The experimental results show consistent performance across the language pairs.
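
To make the four feature families named above concrete, the following is a minimal Python sketch of per-token feature extraction for a sequence labeler. It is an illustration under stated assumptions, not the system's actual feature set: the abstract does not specify the exact features, so the function names (`char_ngrams`, `token_features`, `sentence_features`), the feature keys, the n-gram range of 1 to 3, the one-word context window, and the `lexicon` argument are all hypothetical choices.

```python
import re


def char_ngrams(word, n_min=1, n_max=3):
    """Return all character n-grams of `word` for n in [n_min, n_max]."""
    return [word[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(word) - n + 1)]


def token_features(tokens, i, lexicon=frozenset()):
    """Build a feature dict for tokens[i] (hypothetical feature set).

    Covers the four families named in the abstract:
    lexical, contextual, character n-gram, and special character features.
    """
    word = tokens[i]
    lower = word.lower()
    feats = {
        # Lexical: membership in an (assumed) word list for one language.
        "in_lexicon": lower in lexicon,
        # Special character features.
        "has_digit": any(c.isdigit() for c in word),
        "has_special": bool(re.search(r"[^\w\s]", word)),
        "is_capitalized": word[:1].isupper(),
        # Contextual: neighbouring words, with boundary markers.
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }
    # Character n-grams as binary indicator features.
    for gram in char_ngrams(lower):
        feats["ng=" + gram] = True
    return feats


def sentence_features(tokens, lexicon=frozenset()):
    """Return one feature dict per token in the sentence."""
    return [token_features(tokens, i, lexicon) for i in range(len(tokens))]


if __name__ == "__main__":
    example = ["I", "love", "tacos", "@amigo"]
    for tok, f in zip(example, sentence_features(example, lexicon={"love"})):
        print(tok, f["in_lexicon"], f["has_special"], f["prev_word"])
```

Feature dictionaries of this shape, one per token and paired with a per-token language label, are the kind of input a CRF toolkit that accepts per-token feature mappings would train on. Because none of the features depend on a particular script or language, the same extractor could in principle be reused across language pairs, which is the replicability argument the abstract makes.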