Japanese NLP
Overview
The Japanese NLP system at MSR is part of NLPWin, a larger natural language understanding system that includes both analysis and generation components. The system is currently used in the MSR-MT machine translation system. Below are some characteristics of the Japanese version of NLPWin; for details of each component, please refer to the list of publications below.
Word-breaking/morphological analysis
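As a rough illustration of the recall-oriented word finding this component performs, the sketch below lists every dictionary word that occurs anywhere in a string, overlaps included, instead of committing to one segmentation. It is not the NLPWin implementation: the function name `find_candidate_words`, the `max_len` cutoff, and the toy lexicon are assumptions made for this example, and the handling of out-of-dictionary words and variant spellings is left out.

```python
def find_candidate_words(text, lexicon, max_len=8):
    """Return every (start, end, word) span of `text` found in `lexicon`.

    All matches are kept, including overlapping ones, so the result is a
    set of candidates rather than a single segmentation.
    """
    candidates = []
    for start in range(len(text)):
        # Try every substring beginning at `start`, up to `max_len` characters.
        longest_end = min(len(text), start + max_len)
        for end in range(start + 1, longest_end + 1):
            piece = text[start:end]
            if piece in lexicon:
                candidates.append((start, end, piece))
    return candidates


# Hypothetical toy lexicon, for illustration only.
TOY_LEXICON = {"東京", "京都", "東京都", "都", "に", "住む"}

spans = find_candidate_words("東京都に住む", TOY_LEXICON)
for start, end, word in spans:
    print(start, end, word)
```

Both 東京 and 京都 are kept as candidates even though they overlap; the sketch makes no attempt to choose among them.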
The purpose of this component is to find possible words in a given string of characters, rather than to produce the single best segmentation of that string. In more technical terms, our approach tries to maximize recall. This allows us to focus on finding words that are not in the dictionary or that are spelled differently from their dictionary form. For example, in the sentence
Syntactic analysis
The syntactic component consists of two levels of analysis: phrase-structure analysis and LNS (language-neutral syntax).
[Figure: phrase-structure tree]
(1) The parser is a bottom-up chart parser; there are about 150 phrase-structure rules for Japanese.
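To make the term "bottom-up chart parser" concrete, here is a minimal CKY-style sketch. It is only an illustration of the general technique, not the NLPWin parser: the three binary rules, the category labels, the tiny lexicon, and the name `parse_chart` are all hypothetical, and the input is assumed to be already split into words.

```python
from collections import defaultdict

# Hypothetical toy grammar: (left child, right child) -> possible parents.
BINARY_RULES = {
    ("NP", "P"): {"PP"},  # noun phrase + postposition
    ("PP", "V"): {"S"},   # postpositional phrase + verb
    ("PP", "S"): {"S"},   # further postpositional phrase + clause
}

# Hypothetical toy lexicon: word -> possible categories.
LEXICAL_RULES = {
    "猫": {"NP"},
    "魚": {"NP"},
    "が": {"P"},
    "を": {"P"},
    "食べる": {"V"},
}


def parse_chart(tokens):
    """Fill a chart mapping (start, end) spans to the categories found there."""
    n = len(tokens)
    chart = defaultdict(set)
    # Seed the chart with the categories of the individual words.
    for i, token in enumerate(tokens):
        chart[(i, i + 1)] |= LEXICAL_RULES.get(token, set())
    # Build wider spans from pairs of adjacent narrower spans.
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for left in chart[(start, mid)]:
                    for right in chart[(mid, end)]:
                        chart[(start, end)] |= BINARY_RULES.get((left, right), set())
    return chart


tokens = ["猫", "が", "魚", "を", "食べる"]
chart = parse_chart(tokens)
print("S" in chart[(0, len(tokens))])
```

The chart records every constituent that can be built over every span, so smaller analyses are computed once and reused when wider spans are assembled.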
(2) LNS (language-neutral syntax) is the level of representation where we try to express language-neutral syntactic properties of sentence structure using a language-neutral formal vocabulary. For example, Japanese uses morphology (i.e., postpositions) to indicate the subject, while English uses word order. Such a difference in how grammatical properties are encoded is neutralized at the level of LNS: the logical subject is indicated using the language-neutral grammatical relation L_Sub. Below is the LNS representation of the example sentence:
[Figure: LNS representation of the example sentence]
Unlike in the phrase-structure tree, the leaf nodes of the LNS tree are the base forms (called Lemmas) of content words. Function words, such as case markers (と and が in the above example), formal nouns (こと), voice/aspect markers (れて and いる) and the light verb (する for the verbal noun 結合), are not present as nodes in the LNS tree, but are represented functionally in terms of grammatical relations and operators. The representation also normalizes grammatical paraphrases such as passivization, as is seen in the above example, where _X indicates an unspecified (unmentioned) entity.
The LNS representation serves as the basis for other application-specific representations. For example, our machine translation system uses a predicate-argument structure (also known as Logical Form (LF)) derived from LNS as input to its alignment and transfer components.
Generation component
Under construction
English-to-Japanese transliteration
Under construction
Project Members
Publications
Associated Groups