Research project on aspects of Japanese natural language processing, including word segmentation, parsing, and machine translation.
Overview
The Japanese NLP system at MSR is part of the larger natural language understanding system called NLPWin, which includes both analysis and generation components. The system is currently being used in the MSR-MT machine translation system. Below are some characteristics of the Japanese version of NLPWin; for details of each of these components, please refer to the list of publications below.
Word-breaking/morphological analysis
The purpose of this component is to find possible words given a string of characters, rather than to produce the single best word segmentation of a given string. In more technical terms, our approach tries to maximize recall. This approach allows us to focus on finding words that are not in the dictionary or that are spelled differently from the form in the dictionary. For example, consider the sentence:
硫黄、リン、炭素、ケイ素と結合することが知られている。
Here, リン and ケイ素 are not found in our dictionary as such, because the base forms in our dictionary are written either in hiragana (りん or けいそ) or in kanji (燐 or 珪素).
Our word-breaking component recognizes these orthographic variants and returns their base dictionary forms.
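As a rough illustration of this kind of variant lookup (not the actual NLPWin implementation), the sketch below folds katakana to hiragana and tries the possible readings of any remaining kanji against a toy lexicon. The names LEXICON, KANJI_READINGS, to_hiragana and base_forms, and all table entries, are assumptions made for this example.

```python
from itertools import product

# Toy lexicon (hypothetical): hiragana base form -> kanji base form.
LEXICON = {"りん": "燐", "けいそ": "珪素"}

# Toy per-kanji reading table (hypothetical, far from complete).
KANJI_READINGS = {"素": ["そ", "す"]}


def to_hiragana(text):
    """Fold katakana (U+30A1..U+30F6) to the corresponding hiragana."""
    return "".join(
        chr(ord(ch) - 0x60) if 0x30A1 <= ord(ch) <= 0x30F6 else ch
        for ch in text
    )


def base_forms(surface):
    """Return (hiragana form, kanji form) pairs that the surface string may spell."""
    folded = to_hiragana(surface)
    # Each character is either kept as-is or replaced by one of its readings.
    options = [KANJI_READINGS.get(ch, [ch]) for ch in folded]
    matches = []
    for combo in product(*options):
        reading = "".join(combo)
        if reading in LEXICON:
            matches.append((reading, LEXICON[reading]))
    return matches


print(base_forms("リン"))    # katakana spelling of りん / 燐
print(base_forms("ケイ素"))  # mixed katakana and kanji spelling of けいそ / 珪素
```

Enumerating every reading combination is only workable for short strings; a real system would presumably constrain the search with the dictionary itself.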
The component also performs:
・ Full morphological analysis of inflected forms and their okurigana variants, e.g., 組(み)合(わ)せる
・ Identification of inserted yomi (ruby), e.g., 有珠(うす)山
Dictionary forms are retrieved from these strings, along with reading information.
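The two cases above can be sketched with regular expressions. This is a minimal illustration under simplifying assumptions, not the component's actual logic: expand_okurigana treats the parenthesized kana in the notation 組(み)合(わ)せる as optional and lists every spelling a word-breaker would need to map to one base form, while strip_ruby assumes a reading is a run of hiragana in parentheses directly after a run of kanji. Both function names are hypothetical.

```python
import re
from itertools import product

# The examples in the text use full-width parentheses; ASCII ones are accepted too.
OPEN, CLOSE = "[((]", "[))]"


def expand_okurigana(pattern):
    """Expand optional okurigana, e.g. 組(み)合(わ)せる, into all spellings."""
    parts = re.split(OPEN + r"([^()()]*)" + CLOSE, pattern)
    # With a capture group, re.split alternates literal and optional pieces.
    choices = [[p] if i % 2 == 0 else [p, ""] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in product(*choices)]


# A run of kanji followed by a parenthesized run of hiragana.
RUBY = re.compile(r"([\u4e00-\u9fff]+)" + OPEN + r"([\u3041-\u3096ー]+)" + CLOSE)


def strip_ruby(text):
    """Remove inline readings; return (plain text, [(kanji, reading), ...])."""
    readings = [(m.group(1), m.group(2)) for m in RUBY.finditer(text)]
    return RUBY.sub(r"\1", text), readings


print(expand_okurigana("組(み)合(わ)せる"))
print(strip_ruby("有珠(うす)山"))
```

In running text the same parentheses can also enclose ordinary parenthetical remarks, so a real system would need more evidence than this pattern alone before treating the contents as a reading.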
Syntactic analysis
The syntactic component consists of two levels of analysis: phrase-structure analysis and LNS (language-neutral syntax).
(1) The phrase-structure component applies syntactic rules to the word candidates returned by the word-breaking component and produces one or more phrase-structure trees. For the example sentence above, we produce the following tree: 
The parser is a bottom-up chart parser; there are about 150 phrase-structure rules for Japanese.
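To make the idea of bottom-up chart parsing concrete, here is a minimal CKY-style recognizer, which is one simple form of bottom-up chart parsing and not necessarily the algorithm NLPWin uses. The four lexical entries and three binary rules are invented for this example and bear no relation to the roughly 150 rules of the real grammar; the input is also a flat token list rather than the set of word candidates that the word-breaker returns.

```python
from collections import defaultdict

# Hypothetical toy grammar, for illustration only.
LEXICAL = {"炭素": {"NP"}, "と": {"P"}, "結合": {"VN"}, "する": {"V"}}
BINARY = {
    ("NP", "P"): {"PP"},   # noun phrase + postposition
    ("VN", "V"): {"VP"},   # verbal noun + light verb
    ("PP", "VP"): {"VP"},  # postpositional phrase modifying a verb phrase
}


def cky(tokens):
    """Fill a chart mapping (start, end) spans to the labels that cover them."""
    n = len(tokens)
    chart = defaultdict(set)
    # Width-1 spans come straight from the lexicon.
    for i, token in enumerate(tokens):
        chart[i, i + 1] |= LEXICAL.get(token, set())
    # Wider spans are built from pairs of adjacent, already completed spans.
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for left in chart[start, mid]:
                    for right in chart[mid, end]:
                        chart[start, end] |= BINARY.get((left, right), set())
    return chart


tokens = ["炭素", "と", "結合", "する"]
chart = cky(tokens)
print(chart[0, len(tokens)])  # labels spanning the whole input
```

Because every span is stored in the chart, partial analyses remain available even when no rule covers the whole input.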
(2) LNS (language-neutral syntax) is the level of representation where we try to express language-neutral syntactic properties of sentence structure using a language-neutral formal vocabulary. For example, Japanese uses morphology (i.e., postpositions) to indicate the subject, while English uses word order. Such a difference in the form of encoding grammatical properties is neutralized at the level of LNS: the logical subject is indicated using the language-neutral grammatical relation L_Sub. Below is the LNS representation of the example sentence: 
Unlike in the phrase-structure tree, the leaf nodes of the LNS tree are the base forms (called Lemmas) of content words. Function words, such as case markers (と and が in the above example), formal nouns (こと), voice/aspect markers (れて and いる) and the light verb (する for the verbal noun 結合), are not present in the LNS tree as nodes, but are represented functionally in terms of grammatical relations and operators. The representation also normalizes grammatical paraphrases such as passivization, as is seen in the above example, where _X indicates an unspecified (unmentioned) entity.
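The passive normalization described here can be pictured with a small data structure. The sketch below is only an illustration of the idea, not the actual LNS format: the class LNSNode and the function normalize_passive are hypothetical, and the relation label L_Obj is assumed for this example (only L_Sub and _X appear in the description above).

```python
from dataclasses import dataclass, field

UNSPECIFIED = "_X"  # placeholder for an unmentioned entity


@dataclass
class LNSNode:
    """A node holding a content-word lemma and its labeled dependents."""
    lemma: str
    relations: dict = field(default_factory=dict)


def normalize_passive(verb_lemma, surface_subject, agent=None):
    """Represent a passive clause as if it were active.

    The surface subject becomes the logical object (label assumed), and
    the logical subject is the agent if one is mentioned, otherwise _X.
    """
    logical_subject = agent if agent is not None else LNSNode(UNSPECIFIED)
    return LNSNode(
        verb_lemma,
        {"L_Sub": logical_subject, "L_Obj": surface_subject},
    )


# 結合することが知られている: the clause headed by 結合 is the surface subject of
# the passive of 知る, and the one who knows is left unmentioned.
node = normalize_passive("知る", LNSNode("結合"))
print(node.relations["L_Sub"].lemma)
```

Function words such as が, こと and れて do not appear as nodes here, in keeping with the description above; only lemmas of content words do.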
The LNS representation serves as the basis for other application-specific representations. For example, our machine translation system at one point used predicate-argument structure (aka Logical Form (LF)) derived from LNS as input to its alignment and transfer components.
Generation component
Under construction
English-to-Japanese transliteration
Under construction
Publications
- For the overview of the NLPWin system, see the following:
- Heidorn, G. 2000. Intelligent writing assistance. In R. Dale, H. Moisl and H. Somers (eds.), A Handbook of Natural Language Processing: Techniques and Applications for the Processing of Language as Text. New York: Marcel Dekker.
- On the word-breaking component, see the following papers:
- Kacmarcik, Gary, Chris Brockett and Hisami Suzuki. 2000. Robust Segmentation of Japanese Text into a Lattice for Parsing. In Proceedings of COLING 2000, Saarbrücken, Germany, pp. 390-396.
- Suzuki, Hisami, Chris Brockett and Gary Kacmarcik. 2000. Using a Broad-Coverage Parser for Word-Breaking in Japanese. In Proceedings of COLING 2000, Saarbrücken, Germany, pp. 822-827. In Proceedings of COLING 2002, pp. 301-307.
- Kacmarcik, Gary. 2004. Making Use of Furigana. In The 1st International Joint Conference on Natural Language Processing (IJCNLP 04), Sanya, Hainan Island, China, pp. 159-164.
- The following paper describes an evaluation of the phrase-structure component:
- Suzuki, Hisami. 2004. Phrase-Based Dependency Evaluation of a Japanese Parser. In Proceedings of LREC 2004, Lisbon, Portugal, pp. 863-866.
- For LNS, refer to the following:
- Campbell, Richard and Hisami Suzuki. 2002a. Language-Neutral Representation of Syntactic Structure. In Proceedings of the First International Workshop on Scalable Natural Language Understanding (SCANALU 2002), Heidelberg, Germany.
- Campbell, Richard and Hisami Suzuki. 2002b. Language-Neutral Syntax: An Overview. Microsoft Research Technical Report, MSR-TR-2002-76.
- On the use of predicate-argument structure in machine translation:
- Gamon, Michael, Hisami Suzuki and Simon Corston-Oliver. 2001. Using Machine Learning for System-Internal Evaluation of Transferred Linguistic Representations. European Association for Machine Translation, January 2001.
- On English-Japanese transliteration:
- Brill, Eric, Gary Kacmarcik and Chris Brockett. 2001. Automatically Harvesting Katakana-English Term Pairs from Search Engine Query Logs. In Proceedings of the Sixth Natural Language Processing Pacific Rim Symposium (NLPRS 2001), Tokyo, Japan, pp. 393-399.
