Natural Language Processing

The Redmond-based Natural Language Processing group is focused on developing efficient algorithms to process texts and to make their information accessible to computer applications. Since text can contain information at many different granularities, from simple word or token-based representations, to rich hierarchical syntactic representations, to high-level logical representations across document collections, the group seeks to work at the right level of analysis for the application concerned.


The goal of the Natural Language Processing (NLP) group is to design and build software that will analyze, understand, and generate languages that humans use naturally, so that eventually you will be able to address your computer as though you were addressing another person.

This goal is not easy to reach. "Understanding" language means, among other things, knowing what concepts a word or phrase stands for and knowing how to link those concepts together in a meaningful way. It's ironic that natural language, the symbol system that is easiest for humans to learn and use, is hardest for a computer to master. Long after machines have proven capable of inverting large matrices with speed and grace, they still fail to master the basics of our spoken and written languages.

The challenges we face stem from the highly ambiguous nature of natural language. As an English speaker you effortlessly understand a sentence like "Flying planes can be dangerous". Yet this sentence presents difficulties to a software program that lacks both your knowledge of the world and your experience with linguistic structures. Is the more plausible interpretation that the pilot is at risk, or that the danger is to people on the ground? Should "can" be analyzed as a verb or as a noun? Which of the many possible meanings of "plane" is relevant? Depending on context, "plane" could refer to, among other things, an airplane, a geometric object, or a woodworking tool. How much and what sort of context needs to be brought to bear on these questions in order to adequately disambiguate the sentence?

We address these problems using a mix of knowledge-engineered and statistical/machine-learning techniques to disambiguate and respond to natural language input. Our work has implications for applications like text critiquing, information retrieval, question answering, summarization, gaming, and translation. The grammar checkers in Office for English, French, German, and Spanish are outgrowths of our research; Encarta uses our technology to retrieve answers to user questions; Intellishrink uses natural language technology to compress cellphone messages; Microsoft Product Support uses our machine translation software to translate the Microsoft Knowledge Base into other languages. As our work evolves, we expect it to enable any area where human users can benefit by communicating with their computers in a natural way.

Selected current projects

Machine Translation is currently a major focus of the group. In contrast to most existing commercial MT systems, we are pursuing a data-driven approach which all translation knowledge is learned from existing bilingual text.

The ESL Assistant presents a new paradigm of grammar correction in which large-scale statistical models and web services offer writing assistance for learners of English as a second or foreign language. The service is now available online. Additional information can be found on the team website. Updates on the project will also be available from time to time on the ESL Assistant team blog on MSDN.

Recognizing Textual Entailment has been proposed as a generic task that captures major semantic inference needs across many natural language processing applications. In conjunction with our work in this area, we have made available to the research community Manually Word Aligned RTE 2006 Data Sets (described in Brockett, 2007).

Paraphrase recognition and generation are crucial to creating applications that approximate our understanding of language. We have released a corpus of approximately 5000 sentence pairs that have been annotated by humans to indicate whether or not they can be considered paraphrases. Alignment phrase tables created using the data described in Quirk et al. (2004) and Dolan et al. (2004) are now also available for download.

MindNet aims to formalize the representation of word meanings by developing methods for automatically building semantic networks from text and then exploring their structure. MindNets constructed from Japanese and English dictionary data are available for online browsing.

The Japanese NLP project page summarizes areas of research we are working on in processing Japanese.

Older projects

Amalgam is a novel system developed in the Natural Language Processing group at Microsoft Research for sentence realization during natural language generation that employs machine learning techniques. Sentence realization is the process of generating (realizing) a fluent sentence from a semantic representation.

IntelliShrink is a product that uses linguistic analysis to abbreviate an email message so that it can be displayed on a cell phone. IntelliShrink analyses messages in English, French, German or Spanish.

Recent Publications