Chinese Word Segmentation
Overview
The Chinese word segmenter developed in the Natural Language Processing group at Microsoft Research (MSR-NLP) is an integral part of a Chinese sentence analyzer. The system performs a full syntactic analysis of every sentence, and the final segmentation is read off the leaves of the parse trees. In the First International Chinese Word Segmentation Bakeoff (Sproat and Emerson, 2003), the segmenter participated in four tracks (PKU-open, PKU-closed, CTB-open and CTB-closed) and ranked #1, #2, #2 and #3 respectively in those tracks.

Like many other word segmentation systems, the MSR-NLP Chinese word segmenter has a word recognition component and a disambiguation component. The recognition component includes a morphological analyzer, a named entity recognizer and a new word recognizer. Disambiguation is achieved through word lattice pruning and parsing.

The main characteristic that distinguishes this system from others is that its output can be customized for different segmentation standards and different NLP applications. This is possible because (1) word-internal structures are preserved for all morphologically complex words, and (2) each class of nodes in the word tree (i.e. each type of word construction) is associated with an independent segmentation parameter, whose value the user can set to determine whether the children of a given node are displayed as one word or as separate words.

For example, the following sentence contains several morphologically complex words, each of which has its own sub-tree in the derivational tree of the sentence:

国务院一月二十五日举行春节团拜会, 胡锦涛主席走进会场代表中央政治局致词。
("The State Council held a Spring Festival gathering on January 25; President Hu Jintao entered the hall and delivered a speech on behalf of the Political Bureau of the Central Committee.")
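The parameter-driven customization described above can be illustrated with a minimal Python sketch. It is not the MSR-NLP implementation: the `WordNode` class, the `segment` function, the node type names (`Date`, `Month`, `Day`, `NameTitle`) and the internal structure assumed for 一月二十五日 and 胡锦涛主席 are all hypothetical, chosen only to show how one stored word tree could yield different segmentations under different parameter settings.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class WordNode:
    """A node in a word tree (hypothetical structure for illustration)."""
    text: str
    node_type: str = "Leaf"          # assumed label for the word construction
    children: List["WordNode"] = field(default_factory=list)


def segment(node: WordNode, as_one_word: Dict[str, bool]) -> List[str]:
    """Return the words under `node` given per-node-type parameters.

    as_one_word[node_type] == True  -> show the node as a single word
    as_one_word[node_type] == False -> descend and show the children
    Node types missing from the mapping default to a single word.
    """
    if not node.children or as_one_word.get(node.node_type, True):
        return [node.text]
    words: List[str] = []
    for child in node.children:
        words.extend(segment(child, as_one_word))
    return words


# Assumed internal structure for two words from the example sentence.
date = WordNode("一月二十五日", "Date", [
    WordNode("一月", "Month", [WordNode("一"), WordNode("月")]),
    WordNode("二十五日", "Day", [WordNode("二十五"), WordNode("日")]),
])
person = WordNode("胡锦涛主席", "NameTitle", [
    WordNode("胡锦涛"), WordNode("主席"),
])

# Three hypothetical segmentation standards expressed as parameter settings.
coarse = {"Date": True, "NameTitle": True}
medium = {"Date": False, "NameTitle": False}
fine = {"Date": False, "Month": False, "Day": False, "NameTitle": False}

for name, params in [("coarse", coarse), ("medium", medium), ("fine", fine)]:
    print(name, segment(date, params) + segment(person, params))
```

Because the tree keeps the full internal structure, switching standards only changes the parameter mapping; the analysis itself does not need to be rerun.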