September 12, 2013 4:16 PM PT
Quick: What do these tweets have in common?
For one thing, you’d need some help translating them if you’re not bilingual or up on social-media lingo. More notably, these tweets illustrate an intriguing new twist on a problem that researchers thought they had solved some time ago: how to train computers to identify the language in which documents are written.
Why does the language in which a tweet is written matter? Twitter and other social-media platforms hold a treasure trove of information about public opinion and social trends, and that data, when mined, can provide insights into consumer preferences, political views, and all sorts of other behaviors and demographic patterns.
On a more practical level, knowing a tweet's language can also help with machine translation and with the development of more adaptive devices and services.
“I am very interested in how we humans make decisions, with the aim of creating better and more intelligent devices and services that continuously adapt,” Goldszmidt says. “Social media offer a window into the social, cognitive, and emotional aspects of this decision-making process.”
But many text-mining techniques are language-specific, so robust language identification is a prerequisite for that type of processing.
Unlike most of the web, including most blogs, social-media postings are full of slang, hashtags, misspellings, emoticons, and offbeat punctuation and capitalization. Some people mix languages. Length restrictions, such as the 140-character limit in Twitter, encourage compressed, ungrammatical writing. These days, even human readers can have difficulty decoding what tweets and other social-media posts are saying, so imagine how tricky it can be for computers.
“When we started, I took automated language identification to be a solved problem,” Goldszmidt says. “It surprised me that we had to change a lot of our assumptions because of the particular nature of how people express themselves in Twitter. That aspect turned out to be a research project in itself.”
When Goldszmidt, Paparizos, and Najork began their project, existing language-identification technologies worked only for longer, cleaner documents. Some of the tools could recognize just a handful of languages with a high level of accuracy. But on Twitter, most of the more than 100 million postings each day are just a few words long, and they come in hundreds of languages. Accuracy can degrade quickly in this domain.
By his own admission, Goldszmidt, a native Spanish speaker, is a contributor to the language-identification problem. He sometimes mixes spelling and semantics from English and Spanish—for example, texting “aim jom” to his family to mean “I’m home.” Paparizos sometimes uses both Greek and English in the same message.
With access to the Twitter stream, the researchers set about finding an automated way to classify the kinds of cryptic postings that they themselves were creating.
“We started out by first training a set of language identifiers on very clean content—Wikipedia pages,” Najork says. “Wikipedia pages are, of course, labeled with the language they’re written in.”
For their training set, the researchers identified the languages with at least 50,000 articles in Wikipedia. Those languages—52 in all—included the major European and Asian languages, as well as some unexpected ones: two synthetic languages, Esperanto and Volapük, along with several languages that have a relatively small number of speakers, such as Latin, Occitan, Galician, and Basque.
“We trained those language identifiers,” Najork says, “and then threw them at our collection of tweets.”
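That training step can be pictured roughly as follows. The two-sentence "corpus" below is a toy stand-in for the Wikipedia dumps, and the character-trigram profiles are one common way to represent a language statistically, not necessarily the exact representation the researchers used:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Overlapping character n-grams, keeping case and punctuation."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train_profiles(labeled_docs, n=3):
    """Build a relative-frequency character n-gram profile per language."""
    profiles = {}
    for lang, docs in labeled_docs.items():
        counts = Counter(g for doc in docs for g in char_ngrams(doc, n))
        total = sum(counts.values())
        profiles[lang] = {g: c / total for g, c in counts.items()}
    return profiles

# Toy stand-in corpus; the real training data was Wikipedia articles,
# each already labeled by the language edition it came from.
profiles = train_profiles({
    "en": ["the quick brown fox jumps over the lazy dog"],
    "de": ["Der schnelle braune Fuchs springt über den faulen Hund"],
})
```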
You might assume that Twitter profile information could help in the process of language identification—after all, users can indicate a language preference when they set up their account. But many people never bother to change the default setting, which is English. Some people even set a fake location—for example, they might claim to be at the North Pole when in fact they’re in South Africa.
But it turns out that for about 3 percent of tweets, the user has GPS location tracking turned on. In that case, “you know where the person was who wrote this tweet at that particular point in time,” Najork says. In other words, GPS metadata becomes supporting evidence in predicting the primary language of a tweet.
“It looks like an English-language tweet, and it was sent by someone in the U.S.,” Najork says. “Now you have reasonable evidence that this is actually an English-language tweet.”
By extracting tweets for which the geographic data matches the predicted language, you get a new set of reliable training data—Twitter training data. This is the key to automatic machine learning of Twitter vernacular in each language—the OMGs, LOLs, and so on.
Arriving at the best-performing solution was painstaking.
“We tried more than 60 different techniques and parameters,” Paparizos says, “and studied what works well and when.”
Some of their findings were unexpected. For example, it’s common practice in language identification to convert all text to lowercase before analyzing it, because many people don’t bother to properly capitalize words. But it turns out that keeping the existing capitalization helps differentiate between languages. German, for example, requires capitalization of all nouns, so tweets in German still tend to follow that rule.
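A quick illustration of why keeping case matters, using character trigrams as an assumed representation:

```python
def char_ngrams(text, n=3):
    """Set of overlapping character n-grams in the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

sentence = "Der Hund und die Katze"   # German capitalizes every noun
cased = char_ngrams(sentence)
folded = char_ngrams(sentence.lower())

# Trigrams such as " Hu" and " Ka" mark capitalized nouns mid-sentence;
# lowercasing collapses them into " hu" and " ka", and that German-specific
# signal is lost.
```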
The biggest surprise, though, is the simplicity of the solution itself. It requires no knowledge of the grammar or word roots of any language. It simply compares character patterns and word patterns in a document to the patterns in trained language profiles. It then tells you the most probable primary language.
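A minimal sketch of that pattern-matching idea, assuming character trigrams and a simple log-frequency score; the article does not specify the team's exact scoring, so this is only one plausible instance of "compare patterns to trained profiles":

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(docs_by_lang, n=3):
    """Relative-frequency n-gram profile per language."""
    profiles = {}
    for lang, docs in docs_by_lang.items():
        counts = Counter(g for doc in docs for g in char_ngrams(doc, n))
        total = sum(counts.values())
        profiles[lang] = {g: c / total for g, c in counts.items()}
    return profiles

def identify(text, profiles, n=3, floor=1e-6):
    """Sum log-frequencies of the text's n-grams under each language
    profile (unseen n-grams get a small floor); return the best match."""
    scores = {lang: sum(math.log(p.get(g, floor)) for g in char_ngrams(text, n))
              for lang, p in profiles.items()}
    return max(scores, key=scores.get)

profiles = train({
    "en": ["the quick brown fox jumps over the lazy dog"],
    "de": ["Der schnelle braune Fuchs springt über den faulen Hund"],
})
identify("the lazy fox", profiles)  # scored as most probably English
```

Note that nothing here knows any grammar or word roots; the profiles alone carry all the language-specific information.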
“The tool works for any kind of language identification,” Paparizos says. “Essentially, we give you a trained data set to work with. And then if you want to train the tool to recognize Klingon or some other language, it’s very easy to add Klingon training data, and the tool will tell you whether something is in Klingon or English or Greek or any other language it has been trained to identify.
“The method that produces the best result is surprising to people, because it’s simple and it’s unexpected, but it performs well. Whenever I talk about it to other people, they say, ‘I can’t believe that method works the best.’ This is a little bit unexpected, because it is not a standard classification method.”
The new tool also works for any social-media data.
“This works for Twitter, but you can do the same thing for Facebook,” Paparizos says. “Anyone can add their own training data and train the language identifier again with their own data.”
The tool already is opening the door to a vast array of research possibilities.
“People who do work in social mining and web mining are excited that there’s a tool they can use that gives them the language,” Paparizos says. “Before, there was no comparable tool for social postings.”
Najork is particularly interested in doing “sentiment analysis” on people who tweet—studying how they feel about their lives or about certain issues, or tracking the level of happiness in various countries over time. Another area he thinks would be fascinating to study is the correlation between sentiments in social media and financial markets.
“Social search is a very interesting, new phenomenon—the fact that we have a lot of content generated by a really wide fraction of the population,” he says. “We can now do all types of analyses on a really representative set of the population. You really can get to understand social phenomena much, much better, and I think that’s an absolutely fascinating new area.”