The Redmond-based Natural Language Processing group is focused on developing efficient algorithms to process texts and to make their information accessible to computer applications. Since text can contain information at many different granularities, from simple word or token-based representations, to rich hierarchical syntactic representations, to high-level logical representations across document collections, the group seeks to work at the right level of analysis for the application concerned.
The goal of the Natural Language Processing (NLP) group is to design and build software that will analyze, understand, and generate languages that humans use naturally, so that eventually you will be able to address your computer as though you were addressing another person.
This goal is not easy to reach. "Understanding" language means, among other things, knowing what concepts a word or phrase stands for and knowing how to link those concepts together in a meaningful way. It's ironic that natural language, the symbol system that is easiest for humans to learn and use, is hardest for a computer to master. Long after machines have proven capable of inverting large matrices with speed and grace, they still fail to master the basics of our spoken and written languages.
The challenges we face stem from the highly ambiguous nature of natural language. As an English speaker you effortlessly understand a sentence like "Flying planes can be dangerous". Yet this sentence presents difficulties to a software program that lacks both your knowledge of the world and your experience with linguistic structures. Is the more plausible interpretation that the pilot is at risk, or that the danger is to people on the ground? Should "can" be analyzed as a verb or as a noun? Which of the many possible meanings of "plane" is relevant? Depending on context, "plane" could refer to, among other things, an airplane, a geometric object, or a woodworking tool. How much and what sort of context needs to be brought to bear on these questions in order to adequately disambiguate the sentence?
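To give a rough sense of how quickly these choices multiply, the following minimal sketch enumerates every combination of word-level readings for the example sentence. The lexicon is a hypothetical toy built only from the ambiguities mentioned above, not a resource the group uses, and the `candidate_readings` helper is an assumption introduced purely for illustration.

```python
from itertools import product

# Hypothetical toy lexicon: each word maps to the readings it could carry
# in isolation. Real systems draw on far larger lexical resources.
LEXICON = {
    "flying": ["verb (gerund)", "adjective (participle)"],
    "planes": ["noun: airplane", "noun: geometric object", "noun: woodworking tool"],
    "can": ["modal verb", "noun: container"],
    "be": ["verb"],
    "dangerous": ["adjective"],
}

def candidate_readings(sentence):
    """Return every combination of per-word readings for the sentence."""
    words = sentence.lower().rstrip(".").split()
    options = [LEXICON[word] for word in words]
    # The Cartesian product pairs each word with one of its readings.
    return [list(zip(words, combo)) for combo in product(*options)]

readings = candidate_readings("Flying planes can be dangerous.")
print(len(readings))  # 2 * 3 * 2 * 1 * 1 = 12 candidate readings
```

Even this five-word sentence yields twelve candidates before any syntactic or world knowledge is applied, which is why disambiguation needs context rather than exhaustive enumeration.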
We address these problems using a mix of knowledge-engineered and statistical/machine-learning techniques to disambiguate and respond to natural language input. Our work has implications for applications like text critiquing, information retrieval, question answering, summarization, gaming, and translation. The grammar checkers in Office for English, French, German, and Spanish are outgrowths of our research; Encarta uses our technology to retrieve answers to user questions; IntelliShrink uses natural language technology to compress cell phone messages; and Microsoft Product Support uses our machine translation software to translate the Microsoft Knowledge Base into other languages. As our work evolves, we expect it to benefit any area where human users can gain from communicating with their computers in a natural way.
Selected current projects
Machine Translation is currently a major focus of the group. In contrast to most existing commercial MT systems, we are pursuing a data-driven approach in which all translation knowledge is learned from existing bilingual text.
The ESL Assistant presents a new paradigm of grammar correction in which large-scale statistical models and web services offer writing assistance for learners of English as a second or foreign language. The service is now available online. Additional information can be found on the team website. Updates on the project will also be available from time to time on the ESL Assistant team blog on MSDN.
Recognizing Textual Entailment has been proposed as a generic task that captures major semantic inference needs across many natural language processing applications. In conjunction with our work in this area, we have made available to the research community Manually Word Aligned RTE 2006 Data Sets (described in Brockett, 2007).
Paraphrase recognition and generation are crucial to creating applications that approximate our understanding of language. We have released a corpus of approximately 5000 sentence pairs that have been annotated by humans to indicate whether or not they can be considered paraphrases. Alignment phrase tables created using the data described in Quirk et al. (2004) and Dolan et al. (2004) are now also available for download.
MindNet aims to formalize the representation of word meanings by developing methods for automatically building semantic networks from text and then exploring their structure. MindNets constructed from Japanese and English dictionary data are available for online browsing.
The Japanese NLP project page summarizes our areas of research in processing Japanese.
Amalgam is a novel system, developed in the Natural Language Processing group at Microsoft Research, that employs machine learning techniques for sentence realization during natural language generation. Sentence realization is the process of generating (realizing) a fluent sentence from a semantic representation.
IntelliShrink is a product that uses linguistic analysis to abbreviate an email message so that it can be displayed on a cell phone. IntelliShrink analyses messages in English, French, German or Spanish.
- Microsoft Research Question-Answering Corpus, 13 November 2008
- Multi-System, Machine-Translated, Word-Order Collection, 28 March 2008
- NLP Data Sets for Comparative Study of Parameter-Estimation Methods, 2 June 2007
- Microsoft Research Paraphrase Phrase Tables, 10 October 2006
- ESL 123 Mass Noun Examples, 18 July 2006
- Microsoft Research Paraphrase Corpus, 3 March 2005
- Microsoft Research IME Corpus, 21 December 2005
- Bilingual Sentence Aligner, 14 May 2003
- Unification Grammar Sentence Realization Algorithms, 6 May 2003
- Sauleh Eetemadi and Kristina Toutanova, Asymmetric Features of Human Generated Translation, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, November 2014.
- Patrick Pantel, Michael Gamon, and Ariel Fuxman, Smart Selection, ACL – Association for Computational Linguistics, 6 June 2014.
- Emre Kıcıman, Scott Counts, Michael Gamon, Munmun De Choudhury, and Bo Thiesson, Discussion Graphs: Putting Social Media Analysis in Context, in Intl. Conf. on Weblogs and Social Media (ICWSM-14), AAAI, 2 June 2014.
- Michael Brooks, Sumit Basu, Charles Jacobs, and Lucy Vanderwende, Divide and Correct: Using Clusters to Grade Short Answers at Scale, ACL – Association for Computational Linguistics, March 2014.
- Michael Gamon, Tae Yano, Xinying Song, Johnson Apacible, and Patrick Pantel, Identifying Salient Entities in Web Pages, ACM International Conference on Information and Knowledge Management (CIKM), 1 November 2013.
- Michael Gamon, Tae Yano, Xinying Song, Johnson Apacible, and Patrick Pantel, Understanding Document Aboutness Step One: Identifying Salient Entities, no. MSR-TR-2013-73, 27 October 2013.
- Sumit Basu, Chuck Jacobs, and Lucy Vanderwende, Powergrading: a Clustering Approach to Amplify Human Effort for Short Answer Grading, in Transactions of the ACL, ACL – Association for Computational Linguistics, October 2013.
- Sumit Basu and Janara Christensen, Teaching Classification Boundaries to Humans, AAAI - Association for the Advancement of Artificial Intelligence, July 2013.
- Munmun De Choudhury, Scott Counts, Eric Horvitz, and Michael Gamon, Predicting Depression via Social Media, AAAI, July 2013.
- Michael Gamon, Martin Chodorow, Claudia Leacock, and Joel Tetreault, Grammatical Error Detection in Automatic Essay Scoring and Feedback, in Handbook of Automated Essay Evaluation, Routledge, May 2013.
- Munmun De Choudhury, Michael Gamon, Aaron Hoff, and Asta Roseway, "Moon Phrases": A Social Media Facilitated Tool for Emotional Reflection and Wellness, European Alliance for Innovation, May 2013.
- Michael Gamon, Martin Chodorow, Claudia Leacock, and Joel Tetreault, Using Learner Corpora for Automatic Error Detection and Correction, in Automatic Treatment and Analysis of Learner Corpus Data, John Benjamins Publishing Company, 2013.
- Hassan Sajjad, Patrick Pantel, and Michael Gamon, Underspecified Query Refinement via Natural, ACL/SIGPARSE, December 2012.
- Patrick Pantel, Thomas Lin, and Michael Gamon, Mining Entity Types from Query Logs via User Intent Modeling, Association for Computational Linguistics, July 2012.
- Munmun De Choudhury, Scott Counts, and Michael Gamon, Not All Moods are Created Equal! Exploring Human Emotional States in Social Media, Association for the Advancement of Artificial Intelligence, June 2012.
- Munmun De Choudhury, Michael Gamon, and Scott Counts, Happy, Nervous or Surprised? Classification of Human Affective States in Social Media, Association for the Advancement of Artificial Intelligence, June 2012.
- Thomas Lin, Patrick Pantel, Michael Gamon, Anitha Kannan, and Ariel Fuxman, Active Objects: Actions for Entity-Centric Search, in World Wide Web, ACM, April 2012.
- kumarana, narend, Ashwani Sharma, and Vikram Dendi, WikiBhasha: Our Experiences with Multilingual Content Creation Tool for Wikipedia, in Proceedings of Wikipedia Conference India, Wikimedia Foundation, November 2011.
- Michel Galley and Chris Quirk, Optimal Search for Minimum Error Rate Training, in Proc. of Empirical Methods in Natural Language Processing, July 2011.
- Patrick Pantel and Ariel Fuxman, Jigs and Lures: Associating Web Queries with Strongly-Typed Entities, in Proceedings of Association for Computational Linguistics - Human Language Technology (ACL-HLT-11), June 2011.
- Kristina Toutanova and Michel Galley, Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity, in Proc. of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, June 2011.
- Michael Gamon, High-Order Sequence Modeling for Language Learner Error Detection, Association for Computational Linguistics, June 2011.
- Cristian Danescu-Niculescu-Mizil, Michael Gamon, and Susan Dumais, Mark My Words! Linguistic Style Accommodation in Social Media, in Proceedings of WWW 2011, Hyderabad, India, ACM, 1 April 2011.
- Eric Crestan and Patrick Pantel, Web-Scale Table Census and Classification, in Proceedings of Web Search and Data Mining (WSDM-11), 2011.
- A Kumaran, Naren Datha, B Ashok, K Saravanan, Anil Ande, Ashwani Sharma, Sridhar Vedantham, Vidya Natampally, Vikram Dendi, and Sandor Maurice, WikiBABEL: A System for Multilingual Wikipedia Content, in Proceedings of the 'Collaborative Translation: technology, crowdsourcing, and the translator perspective' Workshop (co-located with AMTA 2010 Conference), Denver, Colorado, Association for Machine Translation in the Americas, 31 October 2010.