A Conditional Random Field Word Segmenter for Sighan Bakeoff 2005

We present a Chinese word segmentation system submitted to the closed track of the Sighan Bakeoff 2005. Our segmenter is built on a conditional random field (CRF) sequence model, which provides a framework for using a large number of linguistic features, such as character identity, morphological features, and character reduplication features. Because our morphological features were extracted automatically from the training corpora, our system is not biased toward any particular variety of Mandarin, and in particular it does not overfit the variety most familiar to the system's designers. Our final system achieved an F-score of 0.947 (AS), 0.943 (HK), 0.950 (PK), and 0.964 (MSR).
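
As a rough illustration of how segmentation can be cast as character-level sequence labeling, the Python sketch below converts segmented words into per-character labels and builds a small feature dictionary for each character (identity in a window, plus reduplication checks). It is a minimal sketch under stated assumptions, not the paper's feature set: the function names, the "B"/"I" label scheme, the `<PAD>` marker, and the feature keys are all hypothetical, and the morphological features mentioned in the abstract are not modeled here.

```python
from typing import Dict, List

PAD = "<PAD>"  # hypothetical marker for positions outside the sentence


def words_to_labels(words: List[str]) -> List[str]:
    """Label each character 'B' (begins a word) or 'I' (continues one)."""
    labels: List[str] = []
    for word in words:
        if word:  # skip empty strings so labels stay aligned with characters
            labels.extend(["B"] + ["I"] * (len(word) - 1))
    return labels


def char_features(chars: List[str], i: int) -> Dict[str, object]:
    """Illustrative features for the character at position i."""

    def at(j: int) -> str:
        return chars[j] if 0 <= j < len(chars) else PAD

    prev, cur, nxt = at(i - 1), at(i), at(i + 1)
    return {
        # character identity features in a one-character window
        "c0": cur,
        "c-1": prev,
        "c+1": nxt,
        "c-1c0": prev + cur,
        "c0c+1": cur + nxt,
        # reduplication features: is a neighboring character repeated?
        "redup_prev": cur == prev,
        "redup_skip": prev == nxt and prev != PAD,
    }


def labels_to_words(chars: List[str], labels: List[str]) -> List[str]:
    """Rebuild words from characters and their 'B'/'I' labels."""
    words: List[str] = []
    for ch, lab in zip(chars, labels):
        if lab == "B" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words
```

The per-character feature dictionaries and label sequences produced this way are the kind of input a CRF toolkit typically expects for training; decoding then predicts a label per character, and `labels_to_words` turns those labels back into a segmentation.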

tseng05crf.pdf
PDF file

In: SIGHAN Workshop on Chinese Language Processing

Publisher: Association for Computational Linguistics
All copyrights reserved by ACL 2007

Details

Type: Inproceedings