Natural Language Processing

The information era has brought us vast amounts of digitized text that is generated, propagated, exchanged, stored, and accessed through the Internet every day all over the world. Users demand useful and reliable information from the web in the shortest time possible, but there exist many obstacles to fulfilling this demand.

It is becoming increasingly difficult for users to identify useful information from the thousands of results returned by search engines. The field of natural language processing (NLP) is essential to helping improve the accuracy of search engine results. Almost all information on the web is provided in the form of natural language text. In order to provide better search results, we need to develop practical NLP technologies to extract the key information from the web text.

Also, if web content is written in a language the user doesn’t speak, it isn’t accessible to the user no matter how good the search results are. Developing high-quality machine translation (MT) systems that support query and webpage translation is important for improving information acquisition via the web.

Collaborative Research and Projects

Microsoft Research announced an annual invitation for proposals (IFP) related to NLP and MT in 2008 and 2009. The purpose of the IFPs was to encourage researchers and practitioners to discuss our most pressing needs with respect to accessing information on the web and new ideas in NLP technologies that might offer viable solutions. Microsoft Research requested research proposals on the following topics:

Web-Scale Natural Language Processing (2008)

  • Information extraction
  • Information gisting of search results
  • Machine translation and cross-lingual information retrieval
  • Monolingual and multilingual online conversational agent or Chatbot

Machine Translation for Multiple Language Information Access (2009)

  • Applied research for the translation of documents
  • New approaches to statistical machine translation
  • Applying machine translation for search engines
  • Relevance ranking of search results in multiple languages
  • Parallel data mining, translation knowledge acquisition
  • Parallel data mining using search engines and various web resources

Listed below are just a few of the funded projects.

Web-Scale NLP: Retrieval Models for Collaborative Question and Answer Archives with Video Presentation

Investigator: Professor Hae-Chang Rim, Korea University

Goal: Explore various methods for addressing the lexical gap problem in community question retrieval models.

The most representative application of our research results would be Community Question and Answer Search. A user submits a question to a Community Question Answering service, which stores large numbers of previously asked questions and their corresponding answers, and the service returns the most closely related previous questions together with their answers. With the methods we explored in this project, the search application would be able to retrieve questions and answers that are not only lexically similar but also semantically related to the user's question. For example, if a user asks a question such as, “Where can I get cheap airplane tickets?”, the application would retrieve not only results that contain the terms “cheap,” “airplane,” and “ticket,” but also results with the related terms “low” and “airfare,” giving the user a more diverse choice of information. We believe that our approach can also contribute to many other search applications that retrieve short texts and suffer greatly from lexical gap problems, such as short text advertisements and the recently popularized Twitter posts.
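The idea of bridging the lexical gap can be illustrated with a small translation-based retrieval sketch. This is not the project's actual model: the translation table, its probabilities, the smoothing weight `lam`, and the function names `tokenize`, `score`, and `rank` are all assumptions made up for illustration. Each query word is allowed to "translate" into related archive words, so an archived question that mentions “low airfare” can still match a query about “cheap airplane tickets.”

```python
import math
from collections import Counter

# Hypothetical word-to-word translation probabilities P(query_word | archive_word).
# A real system would learn these from question-answer pairs; these values are made up.
TRANSLATION = {
    "cheap": {"cheap": 0.7, "low": 0.3},
    "airplane": {"airplane": 0.6, "airfare": 0.2, "flight": 0.2},
    "tickets": {"tickets": 0.6, "airfare": 0.4},
}

PUNCT = "?.,!\"'"


def tokenize(text):
    """Lowercase, split on whitespace, and trim surrounding punctuation."""
    tokens = (w.strip(PUNCT).lower() for w in text.split())
    return [t for t in tokens if t]


def score(query, question, collection_counts, collection_size, lam=0.2):
    """Log-probability of the query under a translation-smoothed model of the question."""
    q_tokens = tokenize(question)
    if not q_tokens:
        return float("-inf")
    counts = Counter(q_tokens)
    total = len(q_tokens)
    vocab = len(collection_counts)
    log_score = 0.0
    for w in tokenize(query):
        # Words without a table entry only match themselves.
        table = TRANSLATION.get(w, {w: 1.0})
        p_doc = sum(p * counts[t] / total for t, p in table.items())
        # Add-one smoothed background probability keeps the log finite.
        p_bg = (collection_counts[w] + 1) / (collection_size + vocab + 1)
        log_score += math.log((1 - lam) * p_doc + lam * p_bg)
    return log_score


def rank(query, archive):
    """Return archived questions sorted from most to least related to the query."""
    collection_counts = Counter(t for q in archive for t in tokenize(q))
    collection_size = sum(collection_counts.values())
    return sorted(
        archive,
        key=lambda q: score(query, q, collection_counts, collection_size),
        reverse=True,
    )
```

With an archive containing “How do I find low airfare to Seoul?” and “How do I bake bread at home?”, the query “Where can I get cheap airplane tickets?” shares no content words with either question, yet the translation table lets the airfare question rank first.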

Papers published: “Computing Word Semantic Relatedness for Question Retrieval in Community Question Answering” in IEICE Transactions on Information and Systems 2009; “Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models” in EMNLP 2008

Web-Scale NLP: Aspect-based Summarization for Web Search Results

Investigator: Professor Naoaki Okazaki, University of Tokyo

Goal: Generate summaries for the webpages retrieved by a search engine.

This project developed a web-based application that summarizes webpages retrieved by the Bing search API. Given a query from a user, this application immediately shows the search result obtained by the Bing search API, and invokes the summarization service in the background. The summarization service receives URLs that correspond to the source webpages to be summarized. The service downloads the content of the webpages, strips HTML tags to obtain text, splits the text into sentences, lemmatizes the words in each sentence, computes TF*IDF scores of words, assigns bonus weights to occurrences of summarization patterns in the source sentences, and calls the MACCORI solver to choose summary sentences. The summarization service can finish this process in roughly two to five seconds.
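The pipeline described above can be approximated in a few lines. The sketch below is a simplified stand-in, not the project's service: it skips downloading and lemmatization, computes IDF over the sentences of a single page, uses two invented regular expressions as "summarization patterns," and replaces the MACCORI solver with a plain top-k selection by score. All function names and the `bonus` parameter are assumptions.

```python
import math
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")
SENT_RE = re.compile(r"(?<=[.!?])\s+")
WORD_RE = re.compile(r"[a-z0-9]+")

# Hypothetical summarization patterns; the project's real patterns are not given.
BONUS_PATTERNS = [re.compile(r"\bis an?\b"), re.compile(r"\bin summary\b")]


def split_sentences(html):
    """Strip HTML tags, normalize whitespace, and split into sentences."""
    text = TAG_RE.sub(" ", html)
    text = re.sub(r"\s+", " ", text).strip()
    return [s for s in SENT_RE.split(text) if s]


def tokenize(sentence):
    """Lowercased alphanumeric tokens (no lemmatization in this sketch)."""
    return WORD_RE.findall(sentence.lower())


def summarize(html, max_sentences=2, bonus=1.0):
    """Pick the highest-scoring sentences and return them in document order."""
    sentences = split_sentences(html)
    token_lists = [tokenize(s) for s in sentences]
    n = len(sentences)

    # Document frequency, treating each sentence as a "document".
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))

    scored = []
    for idx, (sentence, tokens) in enumerate(zip(sentences, token_lists)):
        if not tokens:
            continue
        tf = Counter(tokens)
        # Length-normalized TF*IDF score of the sentence.
        score = sum(c * math.log(n / df[w]) for w, c in tf.items()) / len(tokens)
        # Bonus weight for each matching pattern.
        score += bonus * sum(1 for p in BONUS_PATTERNS if p.search(sentence.lower()))
        scored.append((score, idx))

    # Top-k selection stands in for the MACCORI solver step.
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]
    return [sentences[i] for _, i in sorted(top, key=lambda item: item[1])]
```

Repeated boilerplate sentences such as “Click here.” receive low IDF and are pushed out of the summary, while sentences with rare words or a matching pattern rise to the top.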

Papers published: “A Discriminative Alignment Model for Abbreviation Recognition” in Coling 2008; “A Discriminative Candidate Generator for String Transformations” in EMNLP 2008; “Semi-Supervised Lexicon Mining from Parenthetical Expressions in Monolingual Web Pages” in NAACL/HLT 2009; “Robust Approach to Abbreviating Terms: A Discriminative Latent Variable Model with Global Information” in ACL-IJCNLP 2009

Machine Translation for Multiple Language Information Access: Bridging Morpho-Syntactic Gap Between Source and Target Sentences for English-Korean Statistical Machine Translation

Investigator: Professor Hae-Chang Rim, Korea University

Goal: Explore various methods for mitigating the morpho-syntactic gap in English-Korean statistical machine translation.

This project developed two different methods to reduce the morpho-syntactic gap between English and Korean in statistical machine translation. The first is a preprocessing method for machine translation, which transforms a source language sentence so that it is much closer to the target language sentence in terms of sentence length and word order. The second is a post-processing method for word alignment, which uses the alignment tendency between parts of speech (POS) to improve traditional word alignment models.

Papers published: “Bridging Morpho-Syntactic Gap between Source and Target Sentences for English-Korean Statistical Machine Translation” in ACL-IJCNLP 2009; “A Post-processing Approach to Statistical Word Alignment Reflecting Alignment Tendency between Part-of-speeches” in COLING 2010; “Discovering More Links: Using Character Alignment to Improve Chinese-Korean Machine Translation” in COLING 2010

Machine Translation for Multiple Language Information Access: Experimental Study on Structure-Matching Extension for Hierarchical Phrase-Based Translation Model

Investigator: Professor Tiejun Zhao, Harbin Institute of Technology

Goal: Investigate how much the hierarchical phrase-based translation (HPBT) model improves when fine-grained syntactic knowledge is integrated into the model or into related processes such as training, tuning, and decoding.

This project successfully developed the Bracket Structure Analyzer (BSA), which provides syntactic information for hierarchical phrases. Furthermore, we are investigating and conducting experiments with syntax-based SMT models.

Papers published: “Improve the Statistical Machine Translation Performance by Refining the Word Alignments” in INFORMATION 2010; “A deterministic method to predict phrase boundaries of a syntactic tree” in ICIC 2010; “Chinese Named Entity Recognition with a Sequence Labeling Approach: Base on Characters, or Base on Words?” in ICIC 2010.