Toward a unified approach to statistical language modeling (SLM) for Chinese -- Chinese SLM in speech recognition and pinyin input system
Applying SLM techniques such as trigram language models to Chinese is challenging because (1) there is no standard definition of a word in Chinese, (2) word boundaries are not marked by spaces, and (3) there is a dearth of training data.
In this project, we focus on a unified approach to Chinese SLM. The approach automatically and consistently gathers a high-quality training data set from the web, creates a high-quality lexicon, segments the training data using this lexicon, and compresses the language model, all using the maximum likelihood principle, which is consistent with trigram model training. We show that each of these methods leads to improvements over standard SLM, and that the combined method yields the best pinyin conversion result reported.
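As an illustration of lexicon-based segmentation under the maximum likelihood principle, the following minimal sketch picks the word sequence with the highest probability by dynamic programming. It is a simplification and not the project's implementation: it scores words with a unigram model rather than a trigram model, and the function name `segment`, the parameters `max_word_len` and `unk_logprob`, and the toy lexicon with its probabilities are all assumptions made up for this example.

```python
import math

def segment(text, lexicon, max_word_len=4, unk_logprob=-20.0):
    """Return the segmentation of `text` with the highest unigram log-probability."""
    n = len(text)
    # best[i] = (log-probability, start index of the last word) for the prefix text[:i]
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in lexicon:
                logp = math.log(lexicon[word])
            elif end - start == 1:
                # Assumed fallback: an unknown single character gets a fixed penalty,
                # so every position stays reachable.
                logp = unk_logprob
            else:
                continue
            score = best[start][0] + logp
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the text to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]

# Hypothetical toy lexicon; the probabilities are invented for illustration.
toy_lexicon = {"北京": 0.02, "大学": 0.03, "北京大学": 0.01, "大学生": 0.008, "生": 0.005}
print(segment("北京大学生", toy_lexicon))
```

With these invented probabilities, the split 北京 / 大学生 scores higher than 北京大学 / 生, which shows how the choice of lexicon and its probabilities determines the segmentation of the training data.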
SLM for Asian language input system
The goal of this project is to extend our Chinese SLM techniques to other Asian languages, such as Japanese and Korean. As for Chinese users, an input system is indispensable for users of these languages, and it serves as an ideal test platform for our SLM research.
TREC-9 CLIR Experiments at MSRCN
In TREC-9, we participated in the English-Chinese Cross-Language Information Retrieval (CLIR) track. Our work involved two aspects: finding good methods for Chinese IR, and finding effective means of translation between English and Chinese. For Chinese monolingual retrieval, we investigated the use of different entities as indexes, pseudo-relevance feedback, and length normalization, and examined their impact on Chinese IR. For English-Chinese CLIR, we focused on finding effective ways to translate queries. Our method incorporates three improvements over simple lexicon-based translation: (1) word/term disambiguation using co-occurrence, (2) phrase detection and translation using a statistical language model, and (3) translation coverage enhancement using a statistical translation model. This method is shown to be as effective as a good MT system.
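The idea behind co-occurrence-based disambiguation, improvement (1) above, can be sketched as follows: among the candidate translations of each query word, choose the combination whose members co-occur most often in target-language text. This is only a hedged illustration, not the method used in the experiments. The function `disambiguate`, the candidate lists, and the co-occurrence counts are all hypothetical, and the exhaustive search over combinations is practical only for short queries.

```python
from itertools import product

def disambiguate(query_words, candidates, cooc):
    """Pick one translation per query word so that total pairwise co-occurrence is maximal."""
    options = [candidates[w] for w in query_words]

    def cohesion(choice):
        # Sum the co-occurrence counts over every unordered pair of chosen translations.
        return sum(cooc.get(frozenset((a, b)), 0)
                   for i, a in enumerate(choice)
                   for b in choice[i + 1:])

    # Exhaustive search: the number of combinations grows exponentially with query length.
    return list(max(product(*options), key=cohesion))

# Hypothetical bilingual lexicon entries and invented co-occurrence counts.
candidates = {"bank": ["银行", "河岸"], "interest": ["利息", "兴趣"]}
cooc = {
    frozenset(("银行", "利息")): 120,
    frozenset(("银行", "兴趣")): 8,
    frozenset(("河岸", "兴趣")): 1,
}
print(disambiguate(["bank", "interest"], candidates, cooc))
```

Here the financial senses of both words reinforce each other, so the pair 银行 / 利息 is selected over the other three combinations.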
TREC-10 Web track experiments at MSRCN
In TREC-10, we participated in the Web track. Our work involves (1) using link information for effective web retrieval and (2) blind feedback.
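Blind feedback (also called pseudo-relevance feedback) treats the top-ranked documents of an initial retrieval as if they were relevant and uses them to expand the query. The sketch below shows one simple variant under stated assumptions: it adds the most frequent terms from the top documents that are not already in the query. The function `expand_query`, its parameters, and the toy documents are hypothetical, and real systems usually weight the candidate terms rather than counting raw frequency.

```python
from collections import Counter

def expand_query(query_terms, ranked_docs, top_docs=2, new_terms=2):
    """Append the most frequent unseen terms from the top-ranked documents."""
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        # Only terms that are not already in the query are candidates for expansion.
        counts.update(t for t in doc if t not in query_terms)
    return list(query_terms) + [t for t, _ in counts.most_common(new_terms)]

# Hypothetical initial ranking: each document is a list of tokens, best match first.
ranked_docs = [
    ["jaguar", "car", "engine", "car"],
    ["jaguar", "car", "dealer", "engine"],
    ["jaguar", "cat"],
]
print(expand_query(["jaguar"], ranked_docs))
```

The expanded query would then be run again; the third document is ignored because only the top two are assumed relevant.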
Chinese spelling checking (CSC)
The goal of CSC is to automatically correct Chinese spelling errors in text.
Although many techniques (e.g., n-gram models) have been applied successfully to English spelling checking, they are very difficult to extend to Chinese, which poses some special challenges. First, there are no word boundaries. Second, non-word errors do not exist. Third, local language models, such as n-gram models, are already used by many Chinese input systems, so global context has to be used to detect and correct errors.