This project is developing tools that help writers by showing them alternative ways to express their ideas.
From Proofing Tools to Writing Assistance
For all the huge leaps in progress in computing technology over the last half century, computers continue to be used extensively for one very old-fashioned purpose: creating text. Yet the range of tools that help writers with the authoring process has remained fairly static, limited largely to spelling and grammar checkers that help users avoid small or potentially embarrassing errors. Much less effort has been devoted to building tools or applications that assist writers in constructing better prose or finding alternative ways of expressing what they wish to communicate, in part because such tools have been seen as requiring deep natural language understanding, and therefore as an almost intractable problem.
Helping writers find the right words
Our present goals are more modest, and our research prototype offers a new spin on an old technology: the thesaurus. Writers often have trouble coming up with just the right word to use in a particular context, or they may seek a little variety of expression, or they may need to follow the terminological conventions of a field or industry. For some of these purposes, a thesaurus can be of help, but the results are often not especially relevant in the context intended. For common words, the list of suggestions can be very long and esoteric, and yet somehow the right word never seems to be in the list.
Our solution crucially involves:
- An enormous thesaurus containing 1 million keywords and key phrases. This resource dwarfs the typical desktop thesaurus, which might contain 300K headwords, and its size makes it much more likely that we’ll find an interesting rewrite for any given word or phrase.
- English synonyms and phrasal paraphrases, e.g. express permission/explicit authority, learned as a byproduct of our group’s data-driven Machine Translation effort. (See more at http://microsofttranslator.com and http://blogs.msdn.com/translation/). When we learn that two English words or phrases translate identically into another language, we can also infer that they might be similar in meaning in the right context.
- Very large language models that use sentence context to rank and filter thesaurus candidates in the same way that Word 2007’s “Contextual Speller” uses context to decide which spelling variant (e.g. “you’re” vs. “your”) is most appropriate in a given sentence.
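The second ingredient above, inferring that two English phrases may be paraphrases because they share a translation, can be illustrated with a minimal sketch. It is not the project's actual pipeline: the phrase table below is invented toy data (the Spanish strings are placeholders), and the names `PHRASE_TABLE` and `pivot_paraphrases` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy phrase table: (English phrase, foreign phrase) pairs,
# standing in for alignments learned from parallel translation data.
PHRASE_TABLE = [
    ("express permission", "permiso expreso"),
    ("explicit authority", "permiso expreso"),
    ("huge", "enorme"),
    ("enormous", "enorme"),
    ("purpose", "objetivo"),
    ("goal", "objetivo"),
    ("text", "texto"),
]


def pivot_paraphrases(phrase_table):
    """Pair up English phrases that share the same foreign translation."""
    by_pivot = defaultdict(set)
    for english, foreign in phrase_table:
        by_pivot[foreign].add(english)

    candidates = defaultdict(set)
    for english_phrases in by_pivot.values():
        # Every pair of phrases under one pivot is a paraphrase candidate.
        for a, b in combinations(sorted(english_phrases), 2):
            candidates[a].add(b)
            candidates[b].add(a)
    return dict(candidates)


if __name__ == "__main__":
    for phrase, alternatives in sorted(pivot_paraphrases(PHRASE_TABLE).items()):
        print(phrase, "->", sorted(alternatives))
```

Candidates found this way are only hypotheses: a shared translation suggests similar meaning "in the right context", which is why a separate context-sensitive ranking step is still needed.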
The result of all this is a new kind of thesaurus: one that does not simply point the user to a list of synonyms for a word in their document – most of them not quite right for one reason or another – but that instead suggests a smaller set of synonyms that are most likely to make sense in that particular context. The tool can even attempt to rewrite an entire sentence, selecting among different combinations of word and phrase replacements to choose the contextually most plausible set of all substitutions proposed by the models. For example, the first sentence of this web page,
For all the huge leaps in progress in computing technology over the last half century, computers continue to be used extensively for one very old-fashioned purpose: creating text.
is rewritten as
For all the huge increases in advance in computer equipment over the past fifty years, notebooks continue to be employed widely for one very ancient goal: generating content.
Some of the suggestions in this example are things that a writer might actually want to consider. The result is certainly far more usable than, say, random substitution of synonyms without reference to context, which produces such delights as:
For all the whacking paces in stride in robotics equipment over the valedictory half span, robots continue to be tapped substantively for one very whimsical thrust: wreaking transcription.
It is also obvious that our process is still far from perfect - we would certainly advise against blindly adopting all suggestions that are offered. Errors do creep in when we’ve learned a bad English-English “translation” from our parallel translation data, or when the statistical models lack rich enough information to make the right decision about which alternative is most contextually appropriate.
In suggesting that writers replace content words, we thus take a great deal of risk: a poor choice can dramatically alter the meaning of a sentence - or provoke unintended hilarity.
We must take that risk, however, in order to push the frontiers of editing tool technology and the broader ability to identify and generate paraphrases. From a technical standpoint, the task of filtering potentially huge sets of synonymous words and phrases is itself immensely challenging. Our prototype is necessarily implemented as a web service, since the contextual language models required to make subtle judgments are so large.
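To make the idea of context-based filtering concrete, here is a minimal sketch that ranks thesaurus candidates for a slot by how probable the resulting sentence is under a language model. Everything in it is an assumption for illustration: the four-sentence `CORPUS` stands in for the very large models mentioned above, the add-one smoothed bigram model is a deliberately simple substitute, and the function names are hypothetical.

```python
import math
from collections import Counter

# Tiny hypothetical corpus standing in for a very large language model.
CORPUS = [
    "computers are used for creating text",
    "computers are used widely for creating documents",
    "writers use computers for creating text",
    "the storm was wreaking havoc",
]


def train_bigrams(sentences):
    """Count unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a whole sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    vocab_size = len(unigrams)
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


def rank_candidates(template, candidates, unigrams, bigrams):
    """Fill the '{}' slot with each candidate; return candidates best first."""
    scored = [
        (log_prob(template.format(c), unigrams, bigrams), c) for c in candidates
    ]
    return [c for _, c in sorted(scored, reverse=True)]


if __name__ == "__main__":
    unigrams, bigrams = train_bigrams(CORPUS)
    template = "computers are used for {} text"
    print(rank_candidates(template, ["wreaking", "creating", "generating"],
                          unigrams, bigrams))
```

With so little training data the ordering of unseen candidates is dominated by smoothing, which hints at why models large enough to make subtle judgments end up too big to ship on the desktop.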
The project goal is not to improve the work of poets, professional novelists, or anyone else who considers their writing art. We’re focused primarily on helping users who are writing to achieve a more pragmatic goal – say, a project report, a term paper, or an email – and who would like a little assistance in finding the right words.
Related technology involving web-based authoring assistance can be seen at http://www.eslassistant.com.
The long-term vision: learning to paraphrase
A common complaint about thesauri is that even when one of the suggestions is on-topic, it’s only useful if the entire sentence is rephrased; the synonym cannot simply be plugged into the same slot as the original word. Currently, our tool suffers from this same limitation: we can only replace words or phrases in situ. Our longer-term goals are loftier, and in particular, we hope eventually to provide “Rewrite This” functionality that goes beyond simple word and phrase replacements and will offer more dramatic rewrites along the lines encountered in translating from one language to another, with wholesale rearrangements of words and phrases.
As we progress with this editing work, we anticipate borrowing more and more technology from our group’s extensive work on machine translation (http://translator.live.com). Paraphrasing one English sentence as another is essentially the monolingual version of translating from one language to another. Consider the following two sentences. The words and their order are quite different, yet at some level they “mean the same thing”:
On its way to an extended mission at Saturn, the Cassini probe on Friday makes its closest rendezvous with Saturn's dark moon Phoebe.
The Cassini spacecraft, which is en route to Saturn, is about to make a close pass of the ringed planet's mysterious moon Phoebe.
Recognizing and generating such paraphrase relationships is key to developing software applications that appear to “understand” natural language, since the same command, question, or fact can be expressed in myriad different ways. Rewriting prose in the context of a word processor is an application that interests us not only because users deserve better tools in this space, but also because it pushes this broader research agenda.
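One rough way to see why recognizing such paraphrases is hard is to measure how few words the two Cassini sentences share. The sketch below computes Jaccard overlap of their word sets; this measure is an illustrative choice rather than anything the project is described as using, and the helper names are hypothetical.

```python
import re

SENTENCE_A = (
    "On its way to an extended mission at Saturn, the Cassini probe on Friday "
    "makes its closest rendezvous with Saturn's dark moon Phoebe."
)
SENTENCE_B = (
    "The Cassini spacecraft, which is en route to Saturn, is about to make a "
    "close pass of the ringed planet's mysterious moon Phoebe."
)


def word_set(sentence):
    """Lowercased word types; apostrophes are kept inside a token."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def jaccard(a, b):
    """Shared word types divided by all word types in either sentence."""
    words_a, words_b = word_set(a), word_set(b)
    return len(words_a & words_b) / len(words_a | words_b)


if __name__ == "__main__":
    print("shared words:", sorted(word_set(SENTENCE_A) & word_set(SENTENCE_B)))
    print("Jaccard overlap: %.2f" % jaccard(SENTENCE_A, SENTENCE_B))
```

The two sentences share only a minority of their word types, so a system that relied on surface overlap alone would have little evidence that they report the same event.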
- Michael Gamon, Claudia Leacock, Chris Brockett, William B. Dolan, Jianfeng Gao, Dmitriy Belenko, and Alexandre Klementiev, Using Statistical Techniques and Web Search to Correct ESL Errors, in CALICO Journal, Vol. 26, No. 3, June 2009.
- Chris Brockett and William B. Dolan, Support Vector Machines for Paraphrase Identification and Corpus Construction, in Third International Workshop on Paraphrasing (IWP2005), Asia Federation of Natural Language Processing, 2005.
- William B. Dolan and Chris Brockett, Automatically Constructing a Corpus of Sentential Paraphrases, in Third International Workshop on Paraphrasing (IWP2005), Asia Federation of Natural Language Processing, 2005.
- William Dolan, Chris Quirk, and Chris Brockett, Unsupervised Construction of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources, in International Conference on Computational Linguistics, August 2004.
- Chris Quirk, Chris Brockett, and William B. Dolan, Monolingual Machine Translation for Paraphrase Generation, in Association for Computational Linguistics, July 2004.