Japanese NLP

Research project on aspects of Japanese natural language processing, including word segmentation, parsing, and machine translation.

Overview

The Japanese NLP system at MSR is part of the larger natural language understanding system called NLPWin, which includes both analysis and generation components. The system is currently being used in the MSR-MT machine translation system. Below are some characteristics of the Japanese version of NLPWin; for details of each of these components, please refer to the list of publications below.

Word-breaking/morphological analysis

The purpose of this component is to find possible words given a string of characters, rather than to produce the best word segmentation analysis of a given string. In more technical terms, our approach tries to maximize recall. This approach allows us to focus on finding words that are not in the dictionary or spelled differently from the form in the dictionary. For example, in the sentence

硫黄、リン、炭素、ケイ素と結合することが知られている。

リン and ケイ素 are not found in our dictionary as such, because the base forms of our dictionary are either in hiragana (りん or けいそ) or in kanji (燐 or 珪素).
Our word-breaking component recognizes these orthographic variants and returns their base dictionary forms.
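
The recall-oriented lookup described above can be sketched roughly as follows. This is only an illustration, not the NLPWin implementation: the toy lexicon, the kanji reading table, and the function names (kata_to_hira, candidate_keys, find_candidates) are all assumptions made for the example. The sketch returns every span that matches the lexicon, including overlapping ones, and maps katakana spellings back to a hiragana base form.

```python
# Toy lexicon of base forms (hypothetical; the real dictionary is far larger).
LEXICON = {"硫黄", "りん", "炭素", "けいそ", "結合", "と", "する", "こと", "が"}

# Hypothetical reading table: only the kanji needed for this example.
KANJI_READING = {"素": "そ"}


def kata_to_hira(text):
    """Map katakana (U+30A1..U+30F6) to hiragana by shifting code points."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


def candidate_keys(surface):
    """Return lookup keys for a surface string: as-is, hiragana, full reading."""
    hira = kata_to_hira(surface)
    read = "".join(KANJI_READING.get(ch, ch) for ch in hira)
    return [surface, hira, read]


def find_candidates(sentence, max_len=4):
    """Return every (start, end, surface, base_form) span found in the lexicon.

    All matching spans are kept, even when they overlap, because the goal
    is recall rather than a single best segmentation.
    """
    found = []
    for start in range(len(sentence)):
        last = min(len(sentence), start + max_len)
        for end in range(start + 1, last + 1):
            surface = sentence[start:end]
            for key in candidate_keys(surface):
                if key in LEXICON:
                    found.append((start, end, surface, key))
                    break
    return found


sentence = "硫黄、リン、炭素、ケイ素と結合することが知られている。"
for start, end, surface, base in find_candidates(sentence):
    print(start, end, surface, base)
```

Note that both こと and the と inside it are returned as candidates; choosing between them is left to later components.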

The component also performs:
・ Full morphological analysis of inflected forms and their okurigana variants, e.g., 組(み)合(わ)せる
・ Identification of inserted yomi (ruby), e.g., 有珠(うす)山
Dictionary forms are retrieved from these strings, along with reading information.
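
The two cases above can be illustrated with a small sketch. It is a simplification under stated assumptions: expand_okurigana treats the parenthesized notation used above as "optional okurigana" and lists every spelling, and strip_ruby uses a regular-expression heuristic (a kanji run followed by parenthesized hiragana) to separate an inserted reading from its base string. Both function names are hypothetical, and the actual component performs full morphological analysis rather than pattern matching.

```python
import itertools
import re

# Parenthesized segment in the okurigana notation, e.g. "(み)".
OPTIONAL = re.compile(r"\(([^()]*)\)")

# Heuristic for inserted ruby: kanji run + parenthesized hiragana reading.
# Accepts both ASCII and full-width parentheses.
RUBY = re.compile(r"([\u4e00-\u9fff]+)[(（]([\u3041-\u3096]+)[)）]")


def expand_okurigana(pattern):
    """Expand 'X(a)Y(b)Z' into all spellings with each (part) kept or dropped."""
    parts = OPTIONAL.split(pattern)      # fixed, optional, fixed, optional, ...
    fixed, optional = parts[0::2], parts[1::2]
    variants = []
    for choice in itertools.product([True, False], repeat=len(optional)):
        out = fixed[0]
        for keep, opt, nxt in zip(choice, optional, fixed[1:]):
            out += (opt if keep else "") + nxt
        variants.append(out)
    return variants


def strip_ruby(text):
    """Return (text without inserted readings, [(base, reading), ...])."""
    readings = [(m.group(1), m.group(2)) for m in RUBY.finditer(text)]
    base = RUBY.sub(r"\1", text)
    return base, readings


print(expand_okurigana("組(み)合(わ)せる"))
print(strip_ruby("有珠(うす)山"))
```

The ruby heuristic would also fire on a string such as 組(み), so a real system needs dictionary evidence to tell an inserted reading from other parenthesized material.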

Syntactic analysis

The syntactic component consists of two levels of analysis: phrase-structure analysis and LNS (language-neutral syntax).
(1) The phrase-structure component applies syntactic rules to the word candidates returned by the word-breaking component and produces one or more phrase-structure trees. For the example sentence above, we produce the following tree:

The parser is a bottom-up chart parser; there are about 150 phrase-structure rules for Japanese.
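
As a rough illustration of bottom-up chart parsing, here is a minimal CKY-style recognizer. CKY is one well-known bottom-up chart algorithm and is used here only as a stand-in; the text does not say which algorithm NLPWin uses. The three binary rules, the category labels, and the lexical entries are hypothetical and unrelated to the actual rule set of about 150 rules.

```python
from collections import defaultdict

# Hypothetical toy grammar: (left category, right category) -> parent categories.
RULES = {
    ("NP", "P"): {"PP"},    # noun phrase + postposition
    ("VN", "V"): {"VP"},    # verbal noun + light verb
    ("PP", "VP"): {"VP"},   # postpositional phrase modifying a verb phrase
}

# Hypothetical lexical categories for a fragment of the example sentence.
LEXICAL = {
    "炭素": {"NP"},
    "と": {"P"},
    "結合": {"VN"},
    "する": {"V"},
}


def parse(tokens):
    """Fill a chart mapping (start, end) spans to the categories covering them."""
    n = len(tokens)
    chart = defaultdict(set)
    # Width-1 spans come straight from the lexicon.
    for i, tok in enumerate(tokens):
        chart[(i, i + 1)] |= LEXICAL.get(tok, set())
    # Wider spans are built from pairs of adjacent, already completed spans.
    for width in range(2, n + 1):
        for start in range(0, n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for left in chart[(start, mid)]:
                    for right in chart[(mid, end)]:
                        chart[(start, end)] |= RULES.get((left, right), set())
    return chart


chart = parse(["炭素", "と", "結合", "する"])
print(chart[(0, 4)])
```

Because smaller spans are completed before larger ones, every constituent is built once and reused, which is what makes the chart useful when the word-breaker returns many overlapping candidates.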

(2) LNS (language-neutral syntax) is the level of representation at which we try to express language-neutral syntactic properties of sentence structure using a language-neutral formal vocabulary. For example, Japanese uses morphology (i.e., postpositions) to indicate the subject, while English uses word order. Such differences in how grammatical properties are encoded are neutralized at the level of LNS: the logical subject is indicated using the language-neutral grammatical relation L_Sub. Below is the LNS representation of the example sentence:

Unlike in the phrase-structure tree, the leaf nodes of the LNS tree are the base forms (called Lemmas) of content words. Function words, such as case markers (と and が in the above example), formal nouns (こと), voice/aspect markers (れて and いる) and the light verb (する for the verbal noun 結合), are not present in the LNS tree as nodes, but are represented functionally in terms of grammatical relations and operators. The representation also normalizes grammatical paraphrases such as passivization, as seen in the above example, where _X indicates an unspecified (unmentioned) entity.
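
The normalization of passives can be sketched with a small data structure. This is an illustrative assumption, not the actual LNS format: the LnsNode class and normalize_clause function are hypothetical, and the relation name L_Obj is assumed by analogy with L_Sub, which is the only relation name given above. The sketch shows an active clause and its passive paraphrase mapping to the same structure, with _X filling the logical subject when no agent is mentioned.

```python
from dataclasses import dataclass, field


@dataclass
class LnsNode:
    """A content-word node: a lemma plus named grammatical relations."""
    lemma: str
    relations: dict = field(default_factory=dict)  # relation name -> LnsNode


def normalize_clause(verb, subject, obj=None, agent=None, passive=False):
    """Build a node whose relations are logical rather than surface roles.

    In a passive clause the surface subject becomes the logical object and
    the agent, if mentioned, becomes the logical subject; otherwise the
    logical subject is the unspecified entity _X.
    """
    node = LnsNode(verb)
    if passive:
        node.relations["L_Sub"] = LnsNode(agent) if agent else LnsNode("_X")
        node.relations["L_Obj"] = LnsNode(subject)  # L_Obj: assumed name
    else:
        node.relations["L_Sub"] = LnsNode(subject)
        if obj:
            node.relations["L_Obj"] = LnsNode(obj)
    return node


# 太郎が窓を開けた (active) and 窓が太郎に開けられた (passive)
active = normalize_clause("開ける", "太郎", obj="窓")
passive = normalize_clause("開ける", "窓", agent="太郎", passive=True)
print(active == passive)

# 窓が開けられた (agentless passive)
agentless = normalize_clause("開ける", "窓", passive=True)
print(agentless.relations["L_Sub"].lemma)
```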

The LNS representation serves as the basis for other application-specific representations. For example, our machine translation system at one point used predicate-argument structure (aka Logical Form (LF)) derived from LNS as input to its alignment and transfer components.

Generation component

Under construction

English-to-Japanese transliteration

Under construction

Publications