A Conditional Random Field Word Segmenter for Sighan Bakeoff 2005

  • Huihsin Tseng
  • Pichuan Chang
  • Galen Andrew
  • Daniel Jurafsky
  • Christopher Manning

SIGHAN Workshop on Chinese Language Processing

Published by Association for Computational Linguistics

We present a Chinese word segmentation system submitted to the closed track of Sighan Bakeoff 2005. Our segmenter was built using a conditional random field sequence model, which provides a framework for combining a large number of linguistic features, such as character identity, morphological features, and character reduplication features. Because our morphological features were extracted automatically from the training corpora, our system is not biased toward any particular variety of Mandarin and, in particular, does not overfit the variety most familiar to the system's designers. Our final system achieved an F-score of 0.947 (AS), 0.943 (HK), 0.950 (PK), and 0.964 (MSR).
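To make the sequence-labeling view concrete, the following is a minimal sketch of how segmentation can be cast as per-character labeling with character identity and reduplication features. It is not the authors' feature set: the function names, the feature keys, and the binary "starts a word" label scheme are all assumptions chosen for illustration, and the automatically extracted morphological features are omitted. The resulting feature dictionaries could be fed to any CRF toolkit.

```python
from typing import Dict, List


def char_features(sentence: str, i: int) -> Dict[str, object]:
    """Hypothetical feature templates for the character at position i."""
    cur = sentence[i]
    prev_c = sentence[i - 1] if i > 0 else "<BOS>"
    next_c = sentence[i + 1] if i + 1 < len(sentence) else "<EOS>"
    return {
        # Character identity features (unigrams and one bigram).
        "c0": cur,
        "c-1": prev_c,
        "c+1": next_c,
        "c-1c0": prev_c + cur,
        # Character reduplication features: is the neighbour the same character?
        "redup_prev": cur == prev_c,
        "redup_next": cur == next_c,
    }


def labels_from_words(words: List[str]) -> List[int]:
    """Assumed label scheme: 1 if a character starts a word, else 0."""
    labels: List[int] = []
    for w in words:
        labels.extend([1] + [0] * (len(w) - 1))
    return labels


def words_from_labels(sentence: str, labels: List[int]) -> List[str]:
    """Rebuild the segmentation from per-character labels."""
    words: List[str] = []
    for ch, lab in zip(sentence, labels):
        if lab == 1 or not words:
            words.append(ch)      # start a new word
        else:
            words[-1] += ch       # continue the current word
    return words


if __name__ == "__main__":
    gold = ["天天", "開心"]
    sent = "".join(gold)
    labs = labels_from_words(gold)
    feats = [char_features(sent, i) for i in range(len(sent))]
    print(labs)
    print(feats[1]["redup_prev"])
    print(words_from_labels(sent, labs))
```

In this sketch a trained sequence model would predict the label list from the feature dictionaries, and `words_from_labels` would turn its predictions back into a segmented sentence.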