Drop-out Conditional Random Fields for Twitter with Huge Mined Gazetteer

  • Eunsuk Yang
  • Young-Bum Kim
  • Ruhi Sarikaya
  • Yu-Seop Kim

Published by ACL - Association for Computational Linguistics

In named entity recognition, especially on massive data such as Twitter, a large collection of high-quality gazetteers can alleviate the scarcity of training data. Large gazetteers with high coverage can be collected from knowledge graphs and phrase embeddings. However, large gazetteers cause a side effect called “feature under-training”, in which the gazetteer features overwhelm the context features. To resolve this problem, we propose dropout conditional random fields, which decrease the influence of gazetteer features that carry a high weight. Our experiments on named entity recognition with Twitter data yield an F1 score of 69.38%, about 4% better than the strong baseline presented in Smith and Osborne (2006).
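
The abstract does not spell out how the dropout is applied, so the following is only a rough sketch of the general idea, not the authors' method. It assumes that gazetteer features can be recognized by a name prefix and that each one is randomly removed from a token's feature set during training, so that the context features still receive weight updates. The function name `drop_gazetteer_features`, the `gaz:` prefix, and the feature names are all hypothetical.

```python
import random


def drop_gazetteer_features(features, drop_prob, rng, prefix="gaz:"):
    """Return a copy of `features` without some gazetteer features.

    Assumption (not from the paper): gazetteer features are the ones
    whose name starts with `prefix`. Each of them is dropped
    independently with probability `drop_prob`; context features are
    always kept.
    """
    kept = {}
    for name, value in features.items():
        if name.startswith(prefix) and rng.random() < drop_prob:
            continue  # simulate a missing gazetteer match for this token
        kept[name] = value
    return kept


# Hypothetical feature set for the token "london" in "in london".
token_features = {"word=london": 1.0, "prev=in": 1.0, "gaz:LOCATION": 1.0}

rng = random.Random(0)
# Training view: gazetteer features are dropped at random.
train_view = drop_gazetteer_features(token_features, drop_prob=0.5, rng=rng)
# Test view: no dropout, all features are used.
test_view = drop_gazetteer_features(token_features, drop_prob=0.0, rng=rng)
```

In this sketch the dropout is applied only when building training instances; at prediction time `drop_prob` is set to 0 so that every gazetteer match remains available to the model.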