Cairo Lab Makes Arabic Contributions to Office
Microsoft Research
December 18, 2013 9:00 AM PT

From a computational viewpoint, Arabic is a complex language. Although it consists of a number of distinct yet related variants, they all share a common written language, which is why Arabic commonly is treated as a single linguistic grouping. Using that interpretation, Arabic, used by more than 300 million people, ranks among the world’s six most popular languages.

Arabic has a set of unique characteristics that have resulted in IT tools that underperform compared with those for other languages. That, says Achraf Chalabi of the Natural Language Processing Group at Advanced Technology Labs Cairo (ATL Cairo), is simply unacceptable.

The Cairo lab, founded in 2006 and now one of 13 Microsoft Research labs across the globe, pursues applied research with a focus on explorations and incubations in the areas of languages and content services. Back in 2008, Chalabi and his small team began to address a big challenge.

The ATL Cairo team that worked on providing Arabic-language-processing techniques for Office 2013 (from left): Ahmed Shaaban, Achraf Chalabi, Eslam Kamal, Mohamed El-Sharqwi, Sayed Hassan, Omar Abou El-kheir, and Eman Hisham.

“The charter of our team is to improve Arabic-specific features across Microsoft products,” Chalabi says, “in terms of proofreading, translation, search, speech, and document management. In most of the core features, our goal is to improve Arabic quality across different products.”

To do so, though, they needed to identify the tools necessary for the job. That meant taking the time to construct a natural-language-processing (NLP) infrastructure, in particular for the Arabic language.

“To address those peculiarities of the Arabic language in an efficient way,” he continues, “we had to build that infrastructure. We needed to deal with the characteristics of the language genuinely. We had to tackle them from the roots.”

You can probably guess where this is going. The NLP tools in place were built for English, then adapted to cope with the problems of other languages. In the case of Arabic, that approach didn’t turn out to be efficient. The resultant features were lacking in quality—and, therefore, users.

Thus, the team faced a formidable challenge, and it did so at a time when its home nation of Egypt, along with much of the Middle East and North Africa, was experiencing the turbulence associated with the Arab Spring. Nevertheless, the team addressed its obstacles, cleared each hurdle, and, in spring 2012, contributed an Arabic NLP tool kit that was incorporated into Office 2013.

Now, with that accomplishment secure and with a successful collaborative relationship with the Office group in place, the Cairo NLP team has been asked to scale the data-driven approach that worked with Arabic across 24 other languages. What worked for Arabic is now being applied to English, French, Spanish, and others.

Success, though, was not assured, not with the five key complexities the Arabic language poses:

No Vowels

“People write Arabic without vowels,” says Mohamed El-Sharqwi, a research software-development engineer (RSDE) at the Cairo lab. “The Arabic alphabet does have consonants and vowels, but Arabic text is dominantly written without vowels.”

In written Arabic, vowels are denoted by the use of diacritical marks. It is convenient to avoid using such marks, but it forces the reader to concentrate to identify the missing vowels by analyzing the context.

Adults find this relatively easy, but youths struggle—and so do others. The team tells the tale of television news anchors who start reading a report, realize partway through that an assumption about vowel prediction was incorrect, and have to go back to the beginning and try again.

It’s even more difficult for a computer, which doesn’t possess the knowledge and contextual cues that humans have. For computers, interpreting vowel-less written Arabic becomes a serious challenge. The ambiguity involved presents a challenge not posed by the processing of languages such as English or French.
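
To make the ambiguity concrete, here is a minimal Python sketch. It is illustrative only and not part of the team's tooling: the three example words are our own choice, and the code relies only on the standard Unicode code points for Arabic letters and diacritical marks. It shows three fully vowelled words collapsing to one identical string once the marks are removed, which is the form in which a reader or a program usually meets them.

```python
# Illustrative only: the code points are standard Unicode; the example words are ours.
KAF, TA, BA = "\u0643", "\u062A", "\u0628"          # the consonants k, t, b
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"  # short vowels a, u, i

# Arabic diacritical marks U+064B..U+0652 (tanwin, short vowels, shadda, sukun).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(text: str) -> str:
    """Drop the diacritical marks, leaving only the base letters."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


vowelled = {
    "kataba (he wrote)": KAF + FATHA + TA + FATHA + BA + FATHA,
    "kutiba (it was written)": KAF + DAMMA + TA + KASRA + BA + FATHA,
    "kutub (books)": KAF + DAMMA + TA + DAMMA + BA,
}

# All three readings share one unvowelled spelling.
bare_forms = {strip_diacritics(word) for word in vowelled.values()}
print(bare_forms)  # a set holding a single three-letter string
```

Going the other way, from the bare string back to the intended reading, is the hard direction, because only context can decide among the candidates.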

Free Word Order

The Arabic language offers certain freedoms for writers and speakers, such as a relatively free word-order structure. People can swap the order of words and still produce a sentence that is valid both syntactically and semantically.

“In Arabic, one can say, ‘The man ate the meal,’ or ‘The meal ate it the man,’ or ‘Ate the man the meal,’ or ‘Ate the meal the man,’” El-Sharqwi observes. “You can just permutate the different sentence constituents in whatever order and still have a linguistically correct sentence.

“This free word-order nature of the language also adds computational complexities. For instance, the formal grammar required to parse Arabic sentences is much larger than if we’re processing a language with fixed word-order syntax, such as English.”

Spelling Errors

Spelling mistakes in Arabic can be as much as three times as prevalent as in English. An explanation for this is elusive, but the truth is readily observable.

The NLP team analyzed a thousand news articles from websites and found an average rate of misspelled words of 6 percent. And this, presumably, from text edited by professionals. The rate could easily double among average people.

“When we analyzed these errors,” says Eslam Kamal, also an RSDE at ATL Cairo, “we found that most of them fall under what we call ‘common Arabic mistakes.’ Arabic has different shapes for the same character, and usually, people don’t know the correct spelling rules.

“The good news here is that if we have a high-quality morphological analyzer, we can easily identify these errors automatically and correct them.”

Long Sentences

The written Arabic language has punctuation marks, but the average Arabic writer rarely uses them. The result is very long sentences, sometimes an entire paragraph long.

“A paragraph can be one sentence without having any full stop between sentences,” Kamal says. “People usually resort to coordinating conjunctions to connect sentences together.”

It’s a quantifiable problem. The team found that while an average sentence length in English is 15 words, in Arabic that increases to 21—40 percent longer.

Rich Morphology

“Arabic is a morphology-rich language,” El-Sharqwi says. “It’s highly inflectional. A single stem in Arabic can generate hundreds, if not thousands, of possible inflected final-form words.”

This is a problem for search. When searching for an inflected word, it must be reduced to its original stem, which then must be expanded into other inflected forms to achieve acceptable search results.
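
The reduce-then-expand step can be sketched in a few lines of Python. This is a toy under stated assumptions, not the team's analyzer: the lexicon `FORM_TO_STEM` is hypothetical, holds a handful of transliterated forms, and stands in for a real morphological analyzer that would compute the stem rather than look it up.

```python
from collections import defaultdict

# Hypothetical toy lexicon: inflected surface form -> stem (Latin transliteration).
FORM_TO_STEM = {
    "kitab": "kitab",       # book
    "alkitab": "kitab",     # the book
    "kitabuhu": "kitab",    # his book
    "wakitab": "kitab",     # and a book
    "qalam": "qalam",       # pen
    "alqalam": "qalam",     # the pen
}

# Invert the lexicon so each stem lists every form that reduces to it.
STEM_TO_FORMS = defaultdict(set)
for form, stem in FORM_TO_STEM.items():
    STEM_TO_FORMS[stem].add(form)


def expand_query(term: str) -> set[str]:
    """Reduce a query term to its stem, then expand to all known forms."""
    stem = FORM_TO_STEM.get(term)
    if stem is None:
        return {term}  # unknown word: search for it as typed
    return set(STEM_TO_FORMS[stem])


print(sorted(expand_query("alkitab")))
```

A search for one inflected form then matches documents containing any of its siblings, which is the behavior the paragraph above describes.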

This combination of factors makes Arabic a challenging language to process computationally. And there aren’t nearly as many linguistic references or training data from which to learn as there are for English and other widely used languages. That’s why solutions to conquer such issues often have been adopted from those for other languages.

“The most efficient way to achieve high-quality services and functionality for Arabic is to address these problems,” Chalabi explains, “and make sure that we have the right infrastructure that is able to handle these complexities.”

That is precisely what the NLP team has done.

“We had to build an NLP infrastructure for Arabic,” he continues. “This is a suite of basic components for processing written Arabic text, including a morphological analyzer, a part-of-speech tagger, a named-entity recognizer, an automatic corrector, a parser, a diacritizer, and a tagged corpus.”

Most of these components are data-driven. Building them requires models, and models are generated from labeled data. The team built a corpus of 5 million words and labeled it for morphology, named entities, spelling errors, and syntactic features.

The result—tools delivering solutions—is beautiful to behold:

Automatic diacritizer
“This one relies on practically all of the other components,” Chalabi notes. “It relies on the morphological analyzer to provide the possible diacritization alternatives. It relies on the part-of-speech tagger, which selects the correct alternative presented by the morphological analyzer based on context. It relies on the named-entity recognizer, to identify non-derived named entities and avoid auto-correcting them, and it uses the auto-corrector to identify common Arabic mistakes and correct them automatically.

“And it relies on the parser, which analyzes a sentence syntactically and associates the syntactic functions with each word or group of words in the sentence. Once we know the syntactic function, we can easily infer the correct case ending.”

  • Free word order: The morphological analyzer, enabling the Arabic grammar checker in Office, addresses the complexity created by Arabic’s free ordering of syntactical elements.
  • Spelling errors: The automatic corrector is the component that identifies the common Arabic mistakes and corrects them automatically. A spelling checker was built atop the automatic corrector and the morphological analyzer.

Arabic auto-corrector

“In Arabic,” El-Sharqwi explains, “we have well-defined morphology rules similar to rules in English for when we need to double the consonant in a word. By applying these rules, we can easily correct these errors automatically. This couldn’t have been done without having a high-quality morphological analyzer. The automatic corrector relies on the morphological analyzer both to detect the spelling errors and then to correct them.”

  • Long sentences: The auto-correction tools and parser help in this regard, as well.
  • Rich morphology: Work on the morphological analyzer and the part-of-speech tagger led to the construction of an Arabic “word breaker,” a technique that originated for segmentation of Chinese sentences. But here, instead of segmenting sentences into words, it breaks words into “morphemes,” the basic constituents of words.
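
As a rough illustration of what breaking a word into morphemes means, the following Python sketch peels common one- and two-letter attachments off a word until a known stem remains. Everything in it is an assumption made for illustration: the tiny `PREFIXES`, `SUFFIXES`, and `STEMS` tables are hypothetical, and the article does not describe the algorithm the ATL Cairo word breaker actually uses.

```python
# Hypothetical toy tables; a real word breaker would rely on a full analyzer.
WA = "\u0648"                      # "and"
BI = "\u0628"                      # "with" / "by"
AL = "\u0627\u0644"                # "the"
HU = "\u0647"                      # "his"
KITAB = "\u0643\u062A\u0627\u0628" # "book"
QALAM = "\u0642\u0644\u0645"       # "pen"

PREFIXES = [WA, BI, AL]
SUFFIXES = [HU]
STEMS = {KITAB, QALAM}


def break_word(word: str) -> list[str]:
    """Split a word into prefix, stem, and suffix morphemes, or return [word]."""

    def peel(rest: str, prefixes: list[str]):
        # First see whether what is left is a stem, with or without a suffix.
        if rest in STEMS:
            return prefixes + [rest]
        for suffix in SUFFIXES:
            if rest.endswith(suffix) and rest[: -len(suffix)] in STEMS:
                return prefixes + [rest[: -len(suffix)], suffix]
        # Otherwise strip one prefix and try again on the remainder.
        for prefix in PREFIXES:
            if rest.startswith(prefix) and len(rest) > len(prefix):
                found = peel(rest[len(prefix):], prefixes + [prefix])
                if found:
                    return found
        return None

    return peel(word, []) or [word]


# "and with his book" is written as one word in Arabic.
print(break_word(WA + BI + KITAB + HU))
```

The single written word in the example comes back as four morphemes: a conjunction, a preposition, the stem, and a possessive.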

This work has been integrated into SharePoint 2013 and shipped as part of Office 2013.

In the midst of these labors, the Cairo research-and-development team also found time to deliver another solution, addressing the growing use of Romanized Arabic on the web, particularly in social media. Younger generations increasingly are using English keyboards to write Arabic in Roman letters, even though the reading experience is significantly diminished. Writing becomes easier, but reading becomes more difficult. And this trend reduces the size of digitized Arabic content, which hurts search efforts.

The team addressed this issue by building a transliterator that automatically converts Romanized Arabic into native Arabic script. This component is now an input-method editor that helps people to write in Romanized Arabic and have the text converted automatically into Arabic script.
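
For readers unfamiliar with Romanized Arabic, the Python sketch below gives a feel for the conversion. It is a deliberately naive, rule-based toy and makes no claim about how the team's transliterator works: the mapping table is a small hypothetical subset of common chat conventions (digits such as 3 and 7 standing in for Arabic letters with no Latin equivalent), and the vowel handling is a crude heuristic.

```python
# Hypothetical, partial mapping based on common Romanized-Arabic conventions.
CONSONANTS = {
    "sh": "\u0634", "kh": "\u062E", "gh": "\u063A",
    "2": "\u0621", "3": "\u0639", "5": "\u062E", "7": "\u062D",
    "b": "\u0628", "t": "\u062A", "r": "\u0631", "m": "\u0645",
    "l": "\u0644", "k": "\u0643", "s": "\u0633", "n": "\u0646", "d": "\u062F",
}
# Crude heuristic: a word-final vowel is written as a long-vowel letter.
FINAL_VOWELS = {"a": "\u0627", "i": "\u064A", "u": "\u0648"}
SHORT_VOWELS = set("aeiou")


def transliterate(word: str) -> str:
    """Convert one Romanized-Arabic word to Arabic script, greedily."""
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        ch = word[i]
        if len(pair) == 2 and pair in CONSONANTS:   # two-letter digraph first
            out.append(CONSONANTS[pair])
            i += 2
        elif ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            i += 1
        elif ch in SHORT_VOWELS:
            # Short vowels are normally unwritten; keep only a final one.
            if i == len(word) - 1 and ch in FINAL_VOWELS:
                out.append(FINAL_VOWELS[ch])
            i += 1
        else:
            out.append(ch)                           # pass through unknowns
            i += 1
    return "".join(out)


print(transliterate("mar7aba"))  # "hello"
print(transliterate("3arabi"))   # "Arabic"
```

Fixed rules like these break down quickly, because the same Latin spelling can stand for several Arabic words, so a practical tool has to choose among candidates rather than map characters one by one.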

This tool, called Maren, was released in 2009 on the Microsoft Download Center and has been a huge success in the region, with its user base in the millions.

“We are helping the reader read Arabic in the native script,” says Sayed Hassan, a research SDE, “and keeping the Arabic content growing, although the original text was edited in Romanized Arabic.”

Such advances have been greeted with much enthusiasm from colleagues within Office, Chalabi reports. He and his team have worked with various groups within Office, including the FAST, User and Content Intents, International Project Engineering, and Linguistic Technologies teams.

Comments from Petra Maier-Meyer, former principal program-manager lead for the Information Experience Group’s Search Foundation, were typical of those who had a chance to collaborate with the Cairo lab on the Arabic-language work.

“Arabic word breaking has been improved significantly by integrating the Arabic word breaker from ATL Cairo,” she says. “By this, both precision and recall for Arabic word breaking goes up from around 50 percent to above 90 percent, with even better processing throughput: great improvement for Arabic search.”
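
For context, precision and recall for a word breaker are typically computed by comparing the segments it produces with a hand-labeled reference. The Python sketch below shows one common way to do that over character spans. The span representation and the sample data are our assumptions, not the evaluation setup used by the Office or ATL Cairo teams.

```python
def precision_recall(predicted, gold):
    """Score predicted (start, end) segments against reference segments."""
    predicted_set, gold_set = set(predicted), set(gold)
    true_positives = len(predicted_set & gold_set)
    # Precision: share of predicted segments that are correct.
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    # Recall: share of reference segments that were found.
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Hypothetical seven-character word with four reference morphemes.
gold = [(0, 1), (1, 2), (2, 6), (6, 7)]
# A breaker that fails to split off the second morpheme.
predicted = [(0, 1), (1, 6), (6, 7)]

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case two of the three predicted segments are right and two of the four reference segments are found, so a single missed split lowers both numbers.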

Maura Molloy, one-time principal group manager for the Linguistic Technologies Team, echoed those sentiments.

“The collaboration with ATL has been very profitable,” Molloy recalls. “We collectively will deliver more value for Arabic language users than could otherwise have been achieved.”

For Hussein Salama, director of ATL Cairo, hearing such testimonials is gratifying.

“I’m very proud of all the components that we’ve been building and of the features that we’ve enhanced,” he says, “starting with Maren transliteration, enhancing Arabic search in the enterprise domain, and enhancing the proofing tools in Word. Basically, this is something that touches every Arabic-speaking person working in Word.”