Toward a unified approach to statistical language modeling (SLM) for Chinese -- Chinese SLM in speech recognition and pinyin input system

Applying SLM techniques like trigram language models to Chinese is challenging because (1) there is no standard definition of words in Chinese, (2) word boundaries are not marked by spaces, and (3) there is a dearth of training data.

In this project, we focus on a unified approach to Chinese SLM. The approach automatically and consistently gathers a high-quality training data set from the web, creates a high-quality lexicon, segments the training data using this lexicon, and compresses the language model. All of these steps use the maximum likelihood principle, which is consistent with trigram model training. We show that each of the methods leads to improvements over standard SLM, and that the combined method yields the best pinyin conversion result reported.
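As a rough illustration of segmenting text with a lexicon under the maximum likelihood principle, the sketch below picks the segmentation with the highest probability by dynamic programming. It is a simplified sketch, not the project's implementation: it assumes a unigram model rather than a trigram model, and the function name `segment`, the `max_word_len` parameter, and the toy lexicon probabilities are all hypothetical.

```python
import math


def segment(text, word_probs, max_word_len=4):
    """Return the most probable segmentation of `text` as a list of words.

    Assumption: a unigram model, so the probability of a segmentation is
    the product of its word probabilities. Returns None when no sequence
    of lexicon words covers the text.
    """
    n = len(text)
    # best[i] = (best log-probability of text[:i], start index of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word not in word_probs or best[start][0] == -math.inf:
                continue
            score = best[start][0] + math.log(word_probs[word])
            if score > best[end][0]:
                best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None
    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Hypothetical toy lexicon; the probabilities are made up for illustration.
toy_lexicon = {"研究": 0.2, "研究生": 0.1, "生命": 0.2, "命": 0.05, "起源": 0.2}
result = segment("研究生命起源", toy_lexicon)
```

With this toy lexicon, the segmentation 研究 / 生命 / 起源 (probability 0.008) is preferred over 研究生 / 命 / 起源 (probability 0.001).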

The techniques were also applied to Chinese speech recognition.

SLM for Asian language input systems

The goal of this project is to extend our Chinese SLM techniques to other Asian languages, such as Japanese and Korean. As for Chinese users, an input system is indispensable for users of these languages, and it serves as an ideal test platform for our SLM research.

TREC-9 CLIR Experiments at MSRCN

In TREC-9, we participated in the English-Chinese Cross-Language Information Retrieval (CLIR) track. Our work involved two aspects: finding good methods for Chinese IR, and finding effective means of translation between English and Chinese. On Chinese monolingual retrieval, we investigated the use of different entities as indexes, pseudo-relevance feedback, and length normalization, and examined their impact on Chinese IR. On English-Chinese CLIR, we focused on finding effective ways to translate queries. Our method incorporates three improvements over simple lexicon-based translation: (1) word/term disambiguation using co-occurrence, (2) phrase detection and translation using a statistical language model, and (3) translation coverage enhancement using a statistical translation model. This method is shown to be as effective as a good MT system.
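To make the idea of disambiguation by co-occurrence concrete, the sketch below chooses, for each query term, the candidate translation that co-occurs most strongly with the candidate translations of the other query terms. This is only one simple way to use co-occurrence and is not claimed to be the method used in the experiments; the function name `disambiguate`, the data layout, and the scores are hypothetical.

```python
def disambiguate(candidates, cooc):
    """Pick one translation per query term using co-occurrence scores.

    candidates: one non-empty list of candidate translations per query term.
    cooc: dict mapping frozenset({a, b}) to a co-occurrence score
          (assumed to come from a target-language corpus; missing pairs
          count as 0).
    """
    chosen = []
    for i, options in enumerate(candidates):

        def cohesion(option):
            # Sum, over the other query terms, of the best co-occurrence
            # score between this option and any of that term's candidates.
            total = 0.0
            for j, others in enumerate(candidates):
                if j == i:
                    continue
                total += max(
                    (cooc.get(frozenset((option, other)), 0.0) for other in others),
                    default=0.0,
                )
            return total

        chosen.append(max(options, key=cohesion))
    return chosen


# Hypothetical example for the English query "bank interest".
query_candidates = [["银行", "河岸"], ["利息", "兴趣"]]
toy_cooc = {
    frozenset(("银行", "利息")): 0.9,
    frozenset(("银行", "兴趣")): 0.1,
    frozenset(("河岸", "兴趣")): 0.05,
}
translation = disambiguate(query_candidates, toy_cooc)
```

In this toy example, "bank" resolves to 银行 (financial institution) and "interest" to 利息 (interest on money), because that pair has the highest co-occurrence score.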

TREC-10 Web track experiments at MSRCN

In TREC-10, we participated in the Web track. Our work involved (1) using link information for effective web retrieval and (2) applying blind feedback.

Chinese spelling checking (CSC)

The goal of CSC is to automatically correct Chinese spelling errors in text. Although many techniques (e.g., n-gram models) have been applied successfully to English spelling checking, they are very difficult to extend to Chinese, which poses special challenges. First, there are no word boundaries. Second, non-word errors do not exist. Third, local language models, such as n-gram models, are already used by many Chinese input systems, so global context has to be used to detect and correct errors.

To address these problems, research has focused on three aspects: (1) Chinese proper noun identification and word segmentation; (2) fuzzy matching of long words; and (3) context-sensitive word disambiguation.
