Automatic Language Identification

There is tremendous interest in mining the abundant user-generated content on the web. Many analysis techniques are language dependent and rely on accurate language identification as a building block. Although language identification has been studied before, prior work focused on very 'clean', editorially managed corpora, on a limited number of languages, and on relatively large documents. These are not the characteristics of the content found in, say, Twitter or Facebook postings, which are short and riddled with vernacular.

With this work, we implement an automated, unsupervised, scalable solution based on publicly available data. We make extensive use of Wikipedia to build language identifiers for a large number of languages (52) over a large corpus, and we conduct a large-scale study of the best-known algorithms for automated language identification, quantifying how accuracy varies with document size, language (model) profile size, and the number of languages tested. We then show the value of using Wikipedia to train a language identifier that is directly applicable to Twitter. Finally, we augment the language models and customize them to Twitter by combining our Wikipedia models with location information from tweets. This method provides a massive amount of automatically labeled data that acts as a bootstrapping mechanism, which we empirically show boosts the accuracy of the models.
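To make the general setup concrete, here is a minimal sketch of character n-gram language identification using rank-ordered profiles, in the spirit of the well-known Cavnar and Trenkle out-of-place measure. The function names, parameters, and toy profiles below are purely illustrative assumptions; the algorithms and data formats of the released tool may differ.

    from collections import Counter

    def ngram_profile(text, n_max=5, top_k=1000, case_fold=True):
        # Build a rank-ordered profile of character 1..n_max grams.
        if case_fold:
            text = text.lower()
        counts = Counter()
        for n in range(1, n_max + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
        # Keep the top_k most frequent n-grams, ranked by frequency.
        return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top_k))}

    def out_of_place(doc_profile, lang_profile, max_penalty=1000):
        # Sum of rank differences; n-grams unseen in the language profile
        # receive the maximum penalty.
        return sum(abs(rank - lang_profile.get(gram, max_penalty))
                   for gram, rank in doc_profile.items())

    def identify(text, lang_profiles):
        # Return the language whose profile is closest to the document's.
        doc = ngram_profile(text)
        return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))

    # Toy profiles standing in for ones trained on Wikipedia text.
    profiles = {
        "en": ngram_profile("the quick brown fox jumps over the lazy dog"),
        "de": ngram_profile("der schnelle braune fuchs springt ueber den faulen hund"),
    }
    print(identify("the dog jumps over", profiles))  # likely "en"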

With this work we provide the mining community with a guide and a publicly available tool for language identification on web and social data.

Below are links to our published paper, as well as the code, binaries, and documentation for our tool. For consistency, please cite our publication rather than the project URL.

Publications

Data

We offer a pre-compiled set of language profiles for download. There are three groups of datasets:

  1. 52 languages trained on Wikipedia
  2. 26 languages trained on tweets
  3. 49 languages trained on tweets

Each group consists of six files: character {1..5}-grams, character 3-grams, and word 1-grams, each with (tlc) or without (ncf) case folding. The file name encodes these dataset properties.
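For reference, a small sketch of how these three feature types could be extracted from raw text, with or without case folding. The tokenization details and the on-disk format of the released profiles are documented with the tool; this code is only an illustrative assumption.

    from collections import Counter

    def char_ngrams(text, n_values, case_fold):
        # Character n-gram counts for the given n values (e.g. 1..5, or just 3).
        if case_fold:
            text = text.lower()
        counts = Counter()
        for n in n_values:
            counts.update(text[i:i + n] for i in range(len(text) - n + 1))
        return counts

    def word_unigrams(text, case_fold):
        # Word 1-gram counts over whitespace-tokenized text.
        if case_fold:
            text = text.lower()
        return Counter(text.split())

    sample = "Language Identification on Twitter"
    char_1_5 = char_ngrams(sample, range(1, 6), case_fold=True)   # character {1..5}-grams, case folded
    char_3   = char_ngrams(sample, [3], case_fold=False)          # character 3-grams, no case folding
    word_1   = word_unigrams(sample, case_fold=True)              # word 1-grams, case folded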

The data can be downloaded from the MSR ftp site: ftp://ftp.research.microsoft.com/Users/LanguageIdentifier/