Drop-out Conditional Random Fields for Twitter with Huge Mined Gazetteer

Eunsuk Yang; Young-Bum Kim; Ruhi Sarikaya; Yu-Seop Kim

June 2016

Published by ACL - Association for Computational Linguistics

In named entity recognition, especially on massive data such as Twitter, a large collection of high-quality gazetteers can alleviate the scarcity of training data. Large, high-coverage gazetteers can be collected from knowledge graphs and phrase embeddings. However, large gazetteers cause a side effect called “feature under-training”, in which the gazetteer features overwhelm the context features. To resolve this problem, we propose dropout conditional random fields, which decrease the influence of gazetteer features with a high weight. Our experiments on named entity recognition with Twitter data yield an F1 score of 69.38%, about 4% better than the strong baseline presented in Smith and Osborne (2006).
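
The abstract does not say how the dropout is applied. As a rough illustration only, the Python sketch below assumes the generic feature-dropout idea: gazetteer features are randomly hidden during training, so the context features have to carry part of the prediction. It is not the authors' formulation. The function name `drop_gazetteer_features`, the `gaz:` prefix convention, and the example features are all hypothetical.

```python
import random


def drop_gazetteer_features(features, drop_rate, rng, prefix="gaz:"):
    """Return a copy of `features` in which each gazetteer feature is
    dropped independently with probability `drop_rate`.

    Assumption: gazetteer features are marked by a name prefix, and
    context features are never dropped.
    """
    kept = {}
    for name, value in features.items():
        is_gazetteer = name.startswith(prefix)
        # rng.random() is in [0.0, 1.0), so drop_rate=1.0 drops every
        # gazetteer feature and drop_rate=0.0 drops none.
        if is_gazetteer and rng.random() < drop_rate:
            continue
        kept[name] = value
    return kept


# Hypothetical features for one token: two context features and two
# gazetteer-membership features.
token_features = {
    "word=paris": 1.0,
    "prev=in": 1.0,
    "gaz:location": 1.0,
    "gaz:person": 1.0,
}

rng = random.Random(0)  # seeded so a training run is repeatable
training_view = drop_gazetteer_features(token_features, drop_rate=0.5, rng=rng)
```

In this sketch the corrupted view would be used only while fitting the model. At prediction time the full feature set would be passed unchanged.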