Finding the Right Words
August 6, 2009 3:00 PM PT

During Microsoft Research’s TechFest 2009, customers bombarded Chris Brockett with requests to customize his team’s contextual thesaurus prototype to work with their corporate documents.

For Brockett, a computational linguist with the Natural Language Processing group (NLP) at Microsoft Research Redmond, this level of interest was gratifying proof that the group’s work had captured the imagination of real-world users. An application that can rewrite prose is important not only because users deserve better tools in this space, but also because the underlying work advances a broader research agenda of developing software that appears to “understand” natural language.

The journey that has taken Brockett and his colleagues to this point in their ongoing research is as much of a story as the results they are attempting to achieve. The twists and turns of the past several years, involving different paths of inquiry, are typical of the research world.

It Began with MindNet

Bill Dolan, principal researcher and NLP group manager, recalls being intrigued by the task of paraphrasing as far back as graduate school, when he took on the job of editing papers about a technical field totally unfamiliar to him.

“But to my surprise, I found I could do the task just fine,” Dolan says. “I was startled by how mechanical that kind of editing could be and how confidently I could manipulate the original text while having no real idea of what it meant.”

One of Dolan’s earlier research efforts involving paraphrase was the construction of a natural-language query interface for Encarta as part of an ambitious research project called MindNet, which aimed to build richly structured knowledge bases automatically from free-text information. The project also involved constructing a semantic search engine that would parse queries and attempt to answer them. Sometimes the process worked, but more often, even when the answer was present in Encarta and the system had structured both the query and the answer perfectly, there would be no match.

Paraphrase was the reason: Questions and answers were phrased in fundamentally different ways, using different words and different word orders. A question like “Who is John Lennon’s widow?” might be answered with a sentence like “Yoko Ono’s late husband, John Lennon…”

“We didn’t have any principled way to bridge that gulf,” Dolan says. “Today’s search engines only work because of the Web’s massive redundancy; no matter how you phrase your question, odds are that someone out there has used the same words to answer it.”

In fact, anytime researchers try to build an application that appears to “understand” language—document summarization, dialog systems, or authoring assistance—they always encounter the paraphrase problem: The same information can be expressed in very different ways.

“So it was clear that the ability to paraphrase is important for all kinds of analyses,” Brockett says, “whether it has to do with database searches, summarization of articles, or automated question-and-answer systems.”

Accordingly, the NLP group committed to developing the ability to determine whether two sentences are roughly equivalent in meaning even when they don’t contain the same words.

The Monolingual Machine-Translation Approach

The NLP group set out to treat paraphrase acquisition and generation as a machine-learning problem, repurposing machinery built as part of the team’s ongoing machine-translation effort. This involved generating paraphrases from statistical models trained on pairs of paraphrased sentences.

The team’s method of phrasal replacement proved successful: The test system generated paraphrases that were rated as plausible more often than those generated by other techniques used for comparison.
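
In rough outline, phrasal replacement swaps whole multi-word phrases for alternatives learned from data. The Python sketch below is purely illustrative: the phrase table, scores, and greedy replacement step are invented for the example and are not the team’s actual system.

# Illustrative sketch only: a toy phrase table and a greedy phrasal-replacement
# step, loosely in the spirit of phrase-based techniques applied to paraphrase.
# The table entries and scores are invented for this example.
TOY_PHRASE_TABLE = {
    "passed away": [("died", 0.9), ("lost his life", 0.4)],
    "a large number of": [("many", 0.8), ("numerous", 0.6)],
}

def paraphrase_once(sentence: str) -> str:
    """Replace the first known phrase with its highest-scoring alternative."""
    for phrase, options in TOY_PHRASE_TABLE.items():
        if phrase in sentence:
            best, _score = max(options, key=lambda pair: pair[1])
            return sentence.replace(phrase, best, 1)
    return sentence  # nothing to rewrite

print(paraphrase_once("The senator passed away on Tuesday."))
# -> "The senator died on Tuesday."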

The team was also the first to exploit Internet news content as a large-scale data set for paraphrase learning and generation using conventional machine-translation techniques. In the process, the researchers developed a method for extracting and building a large data set from news data on the Web. Over the course of eight months, the team collected more than 11,000 clusters of similar topics from more than 177,000 articles. Before this, data sources for paraphrase work had relied mostly on translations of classic literary works, which provided only a narrow domain for testing systems. The team’s results were published in a 2004 paper, Monolingual Machine Translation for Paraphrase Generation.
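
To give a flavor of how such a data set might be assembled, the sketch below pairs likely-paraphrase sentences from two stories about the same event using simple word overlap. The similarity measure, thresholds, and example sentences are assumptions made for illustration, not the method described in the paper.

import string

# Illustrative sketch only: pair up likely-paraphrase sentences from two news
# stories about the same event, using simple word overlap (Jaccard similarity).
# The thresholds and helper names are invented, not the team's actual method.
def jaccard(a: str, b: str) -> float:
    strip = str.maketrans("", "", string.punctuation)
    wa = set(a.lower().translate(strip).split())
    wb = set(b.lower().translate(strip).split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def extract_pairs(story_a, story_b, lo=0.3, hi=0.9):
    """Keep pairs that overlap enough to be related but are not near-identical."""
    return [(s1, s2) for s1 in story_a for s2 in story_b
            if lo <= jaccard(s1, s2) <= hi]

story_a = ["The quake struck the coast early Monday.",
           "Officials reported minor damage."]
story_b = ["A strong quake hit the coast on Monday morning.",
           "Schools were closed as a precaution."]
print(extract_pairs(story_a, story_b))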

Despite this initial success, the team stepped back from the project.

“We were not satisfied that this approach could eventually produce more complex rewrites that involved substantially different sentence construction and vocabulary,” Brockett says. “In addition, although we had built quite a large collection of training data, it still was not enough in terms of absolute quantity or richness of content.”

Crossing Over to ESL Correction

Nonetheless, this work inspired another direction of research: using statistical methods to correct errors made by users for whom English is a second language (ESL). Brockett, Dolan, and fellow computational linguist Michael Gamon presented a pilot study on this topic in a 2006 paper, Correcting ESL Errors Using Phrasal SMT Techniques.

“Standard proofing tools for English, such as grammar check and spell check, are really designed for native English speakers,” Brockett says. “But could we have more success for ESL errors by treating them as a machine-translation problem? We set out to show that SMT [statistical machine translation] could be applied to capture ESL errors that are not detected by current proofing tools. We crossed over from monolingual translation to the ESL space.”

A typical example of a sentence for which standard proofing tools are inadequate would be:

  • I knew many informations about Christmas while I was preparing this article.

Grammar and spell checkers might suggest that “many” should be “much,” and “informations” should be “information.” But word substitution alone does not achieve the natural-sounding idiomatic English of rewrites such as:

  • I learned a lot of information about Christmas while I was preparing this article.
  • I learned a lot about Christmas while I was preparing this article.

It takes a wholesale phrasal replacement to produce the desired result.

It turns out that translating from one version of English to another is essentially the same problem as translating from English to Chinese. The words may be different, the order in which they occur may be different, and yet, at a fundamental level, the input and output “mean the same thing.”

The analogy is that English written by a Chinese native speaker could be regarded as Language1, a language that happens to be only slightly different from English; thus, the task is one of translation from Language1 to colloquial English.
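
As a purely illustrative sketch of that idea, the snippet below maps a few ESL-flavored phrasings, treated here as “Language1,” onto idiomatic English. The phrase pairs are hand-written examples, not the output of a trained translation model.

# Illustrative sketch only: treat common ESL phrasings as a "source language"
# (Language1) and map whole phrases, not single words, to idiomatic English.
# The mappings below are invented examples, not a learned model.
ESL_TO_ENGLISH = {
    "knew many informations about": "learned a lot about",
    "in the other hand": "on the other hand",
}

def rewrite(sentence: str) -> str:
    for source, target in ESL_TO_ENGLISH.items():
        sentence = sentence.replace(source, target)
    return sentence

print(rewrite("I knew many informations about Christmas "
              "while I was preparing this article."))
# -> "I learned a lot about Christmas while I was preparing this article."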

The results of this research showed that, with sufficient data, SMT techniques could be effective in correcting ESL errors. At the same time, the experiment made clear that “sufficient data” would mean a colossal amount of parallel data: sentences containing errors paired with their corrected versions.

To the team, this did not detract from the fact that SMT provided an extremely successful paradigm within the field of natural-language processing.

“The approach we took benefits from any progress made in SMT itself,” Brockett says. “The architecture does not depend on manual maintenance of rules or regular expressions. Developing and maintaining applications based on this model would require minimal linguistic expertise.”

Reaching Out to an ESL Audience

The results encouraged the NLP team to continue with statistical methods and Web search for another research project: a modular error-detection and -correction system for ESL writing. Knowing they could apply their algorithms across large data sets, they attempted to create a general proofing application that would target specific, common ESL error types such as choice of determiner (“I am teacher,” “I am a teacher”), choice of preposition (“in the other hand,” “on the other hand”), and adjective/noun confusion (“this is a China book,” “this is a Chinese book”).

“We treated the difference between ESL English and idiomatic, natural English,” Brockett says, “as being analogous to the differences between, for example, the language used in The New England Journal of Medicine and the language used in the health section of a lifestyle magazine.”

The resulting paper, Using Contextual Speller Techniques and Language Modeling for ESL Error Correction, was jointly written by Gamon; Jianfeng Gao; Brockett; intern Alexandre Klementiev from the University of Illinois at Urbana-Champaign; Dolan; Dmitriy Belenko, formerly of Microsoft Research; and Lucy Vanderwende. It describes the use of a two-stage statistical model consisting of modules to handle each error type and a large language model to determine suggestions for correct usage. The modules and the language model are so large that they must be offered as a service.
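
The sketch below shows the two-stage idea in miniature, with a hypothetical preposition module and a hard-coded table standing in for the language model. The function names, scores, and scoring window are assumptions for illustration, not details from the paper.

# Illustrative sketch only: stage 1 proposes candidate rewrites for one error
# type (preposition choice); stage 2 ranks the candidates with a language
# model. The toy phrase-score table stands in for a large n-gram model.
PREPOSITIONS = ["on", "in", "at", "by"]

def propose_candidates(tokens, i):
    """Stage 1: if token i is a preposition, emit every alternative sentence."""
    if tokens[i] not in PREPOSITIONS:
        return []
    return [tokens[:i] + [p] + tokens[i + 1:] for p in PREPOSITIONS]

TOY_LM_SCORES = {  # invented log-probabilities; a real system queries a huge LM
    "on the other hand": -1.2,
    "in the other hand": -6.5,
    "at the other hand": -8.0,
    "by the other hand": -8.3,
}

def lm_score(tokens):
    """Stage 2: score the four-word window around the edit (toy lookup)."""
    return TOY_LM_SCORES.get(" ".join(tokens[:4]), -10.0)

tokens = "in the other hand , I agree".split()
best = max(propose_candidates(tokens, 0), key=lm_score)
print(" ".join(best))  # -> "on the other hand , I agree"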

This research supplied the fundamentals behind Microsoft’s ESL Assistant, a Web-based service that suggests corrected alternatives to phrases typed in by users. The site also shows a graphical representation of Bing hit counts for different versions of a phrase, automating a strategy that many non-native writers report using to make their English sound more natural. Launched in July 2008, the beta site currently counts approximately 45,000 unique visitors per month, and as word gets out, the numbers are climbing. Also important to the research is that the site gathers feedback from users about the helpfulness of suggested alternatives, which provides crucial training data for the statistical models.
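
The hit-count strategy can be pictured with the toy comparison below; web_hit_count is a hypothetical placeholder with invented numbers, not a real search-engine API call.

# Illustrative sketch only: compare web hit counts for two phrasings, the way
# a non-native writer might search both to see which "sounds right".
# web_hit_count is a hypothetical placeholder; a real service would call a
# search-engine API and read the reported result count.
def web_hit_count(phrase: str) -> int:
    fake_counts = {"on the other hand": 95_000_000,
                   "in the other hand": 350_000}  # invented numbers
    return fake_counts.get(phrase, 0)

def more_common(a: str, b: str) -> str:
    """Return whichever phrasing the (fake) counts say is more widely used."""
    return a if web_hit_count(a) >= web_hit_count(b) else b

print(more_common("in the other hand", "on the other hand"))
# -> "on the other hand"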

Getting Back to Paraphrasing

During the work on ESL correction, members of the team noticed that many of the errors were soft errors, more about style than syntax, and often involved a choice of words.

“For example, a Japanese speaker might use the word ‘demerit’ in a sentence,” Brockett says. “But a native English speaker would use ‘disadvantage’ or ‘drawback’ in that context. We began looking at those types of mismatches and thought, ‘Yes, we can model this.’ ”

This line of investigation led to the work that Brockett and colleagues Dolan, Gamon, and Xiaodong He demonstrated during TechFest 2009: a new, context-sensitive thesaurus intended for use by both native and non-native speakers of English. While there has been work in a similar vein, it has been small in scale, because achieving accuracy at scale requires large data sets and models trained on billions of words.

“The data is there, out on the Web,” Brockett says. “However, collecting the data and deciding what kinds of data to use and what kind of model to build is an art in its own right. At the moment, I’m working with a data set of about 2.8 billion words, and actually, even that’s not large enough. It’s mostly news data, with some Encarta and Microsoft manuals thrown in to balance things out; this produces data tuned to a newsy style. If we were to use a Web-based language model, however, we would get a more colloquial style.”
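
One way to picture what a contextual thesaurus does is the small ranking sketch below, in which invented context counts stand in for a model trained on billions of words. It reuses the “demerit” versus “disadvantage” example mentioned earlier.

# Illustrative sketch only: rank synonym candidates by how often each appears
# with the surrounding words. The counts are invented; a real contextual
# thesaurus would consult a model trained on billions of words.
TOY_CONTEXT_COUNTS = {
    ("main", "disadvantage", "of"): 5200,
    ("main", "drawback", "of"): 4800,
    ("main", "demerit", "of"): 12,
}

def rank_synonyms(left, right, candidates):
    """Order candidates by their count in this left/right context."""
    return sorted(candidates,
                  key=lambda w: TOY_CONTEXT_COUNTS.get((left, w, right), 0),
                  reverse=True)

print(rank_synonyms("main", "of", ["demerit", "disadvantage", "drawback"]))
# -> ['disadvantage', 'drawback', 'demerit']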

The usefulness of such a thesaurus goes well beyond ESL environments. Writers sometimes want more variety of expression or need help using the vernacular of a particular industry. Hence the level of interest from corporate customers, who wanted to know whether Microsoft could customize a contextual thesaurus using documents from their own company.

“You can imagine how this could be packaged in the future,” Brockett says. “With the right learning engine and a large enough data set, you could have an insurance-industry model or an aeronautical-engineering model. Someone new to the company would be guided in their writing to capture the idiom of that company or industry by learning the mappings.”

Fulfilling the Long-Term Vision

The NLP group uses the term “next-generation writing assistance” to distinguish between tools that offer corrections and the tools they hope to enable, which would be more assistive and give writers alternative ways of expressing themselves. The group’s focus on automated translation over the last several years has given the NLP researchers a great deal of experience with the issues surrounding paraphrase, and now, the team’s efforts are aimed at modeling paraphrase alternations as a translation problem.

“Chris and I are convinced,” Dolan says, “that our technology has reached a point that will allow us to make some fundamental changes in the way users interact with word processors. Over time, our models will be trained on vast numbers of user clicks from those ‘did you mean?’ paraphrase suggestions. In the same way that search engines learn from clicks which documents are good hits, this type of feedback loop will allow our models to learn which alternations are natural in specific contexts.”

Research in a field as complex as natural-language processing is an ongoing journey. In this case, the group’s earlier work on paraphrasing with monolingual machine translation, the ESL Assistant project, and work on search and machine translation represent a confluence of ideas that led to the contextual thesaurus, which, in turn, could play an important part in fulfilling the longer-term NLP vision of developing systems that paraphrase.

“In a decade or two,” Dolan says, “when you’re talking to your house about temperature and security settings during your vacation, that natural interaction will be the result of paraphrase systems. But don’t think about that right now. All we need you to do is rewrite that e-mail so that it sounds a little better, and we’ll take it from there.”