Dynamic Promotion of Morphemes to Words

The line between bound morphemes and free words is a fuzzy one in Chinese. Many characters that are generally considered bound morphemes are occasionally used as free words. Here are just a few examples.

(1) 民兵既是民又是兵。 ("A militiaman is both a civilian and a soldier.")
(2) 这些话可真够损的。 ("Those remarks are really nasty.")
(3) 他在我家侃了三个钟头。 ("He chatted at my place for three hours.")

The characters 民, 损 and 侃 are defined as bound morphemes in most dictionaries, but they are used as independent words in the sentences above. Their “wordhood” is not only controversial in theory but problematic in computation as well: we cannot parse those sentences if these characters are not words, but if we make them full words in the dictionary, they create noise and confuse the parser in cases where they are not used as words.

Our Chinese system solves this problem by dynamically promoting those morphemes to words at run time. Instead of giving them wordhood in the static dictionary, we convert a morpheme into a word on the fly in sentences where there is strong evidence that it is used as a free word. Take 侃 for example. We are able to make it a verb in (3) on the basis of conditions such as:

• The sentence cannot be parsed if 侃 is not a word;
• 侃 is not subsumed by a longer word such as 侃侃而谈;
• 侃 is immediately followed by the aspectual marker 了.

These conditions also prevent 侃 from being treated as a free word in sentences like (4), given that 侃侃而谈 is a word in our dictionary.

(4) 他在我家侃侃而谈。 ("He talked with ease and assurance at my place.")

Every time a bound morpheme is used as a free word in a successful parse, we store this word in a separate lexicon, or increment its frequency count if the word is already in that dynamic lexicon. When the frequency count reaches a given threshold, we let the word temporarily merge into the main dictionary, so that it can be used in sentences where there is not enough contextual evidence for this word to be detected.
In (5), for example, both 闲 and 侃 are being used as free words, but we are not able to promote them to words because they are subsumed by 闲侃, which is a word in our dictionary.

(5) 你怎么有闲侃这些？ ("How come you have the free time to chat about these things?")

However, if we have seen 侃 being used as a word a number of times in other sentences, we can simply look it up as a verb from the dynamic lexicon when processing this sentence. If a morpheme is frequently promoted to wordhood across texts of different domains, we may consider making it a permanent part of the main dictionary.

In short, this dynamic promotion of morphemes to words not only avoids the linguistic controversy over wordhood but also enables us to increase the coverage of our parser without an increase in computational complexity. A demo of this system is available.
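
The promotion conditions and the frequency-based dynamic lexicon described above can be illustrated with a minimal Python sketch. It is not the system's actual implementation. The names `may_promote_to_verb`, `subsumed_by_longer_word` and `DynamicLexicon`, the threshold value, and the four-character window are all hypothetical. The parser is also not modeled: whether the sentence parses without promotion is passed in as a flag, and the caller is assumed to record a promotion only after a successful parse.

```python
from collections import Counter
from dataclasses import dataclass, field

# Only the aspectual marker named in the text; a real system may use more.
ASPECT_MARKERS = {"了"}


def subsumed_by_longer_word(sentence, index, main_dict, max_len=4):
    """True if some dictionary word of 2..max_len characters covers sentence[index].

    max_len is a hypothetical window size, not a value from the text.
    """
    for start in range(max(0, index - max_len + 1), index + 1):
        last_end = min(len(sentence), start + max_len)
        for end in range(index + 1, last_end + 1):
            if end - start >= 2 and sentence[start:end] in main_dict:
                return True
    return False


def may_promote_to_verb(sentence, index, main_dict, parses_without_promotion):
    """Check the three example conditions for the character at sentence[index].

    parses_without_promotion stands in for a real parser call (assumption).
    """
    if parses_without_promotion:
        return False  # condition 1: promotion is only a rescue for failed parses
    if subsumed_by_longer_word(sentence, index, main_dict):
        return False  # condition 2: the character is part of a longer known word
    # condition 3: the next character is an aspectual marker
    return index + 1 < len(sentence) and sentence[index + 1] in ASPECT_MARKERS


@dataclass
class DynamicLexicon:
    """Counts promotions; entries at or above the threshold become visible."""

    threshold: int = 3  # hypothetical value; the text only says "a given threshold"
    counts: Counter = field(default_factory=Counter)

    def record(self, morpheme, pos):
        # Call once per successful parse in which the morpheme was promoted.
        self.counts[(morpheme, pos)] += 1

    def lookup(self, morpheme):
        # Parts of speech under which the morpheme may now be used as a word.
        return [pos for (m, pos), n in self.counts.items()
                if m == morpheme and n >= self.threshold]


if __name__ == "__main__":
    main_dict = {"侃侃而谈", "闲侃"}
    lexicon = DynamicLexicon(threshold=2)
    s3 = "他在我家侃了三个钟头。"
    for _ in range(2):  # pretend sentence (3) was seen and parsed twice
        if may_promote_to_verb(s3, s3.index("侃"), main_dict,
                               parses_without_promotion=False):
            lexicon.record("侃", "verb")
    print(lexicon.lookup("侃"))
```

In this sketch, sentence (5) fails the subsumption check because 闲侃 is in the main dictionary, so the verb reading of 侃 becomes available there only through `lookup`, once enough promotions have been recorded from other sentences.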

Details

Type: Inproceedings