Text classification improved through multigram models

Dou Shen, Jian-Tao Sun, Qiang Yang, and Zheng Chen

Abstract

Classification algorithms and document representation approaches are the two key elements of a successful document classification system. Much previous work has sought better ways to represent documents. However, most of these attempts either rely on extra resources such as WordNet or suffer from extremely high dimensionality. In this paper, we propose a new document representation approach based on n-multigram language models. This approach can automatically discover the hidden semantic sequences in the documents under each category. Based on n-multigram language models and n-gram language models, we put forward two text classification algorithms. Experiments on RCV1 show that the proposed algorithm based on n-multigram models alone achieves similar or even better classification performance than the classifier based on n-gram models, with a much smaller model size. The second proposed algorithm, which combines n-multigram models and n-gram models, improves micro-F1 from 89.5% to 92.6% and macro-F1 from 87.2% to 91.1%. These observations support the validity of the proposed document representation approach.
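To make the n-gram baseline mentioned in the abstract concrete, the following is a minimal sketch of a per-category language-model classifier. It is not the authors' algorithm: the bigram order, add-one smoothing, uniform class prior, and all names (`BigramCategoryModel`, `train`, `classify`) are assumptions chosen for illustration, and the n-multigram model itself is not reproduced here.

```python
import math
from collections import Counter


def bigrams(tokens):
    # Pad with start/end markers so every word has a left context.
    padded = ["<s>"] + tokens + ["</s>"]
    return list(zip(padded, padded[1:]))


class BigramCategoryModel:
    """Add-one smoothed bigram model for one category (illustrative only)."""

    def __init__(self, documents):
        self.bigram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = {"</s>"}  # the end marker is a predictable symbol
        for tokens in documents:
            self.vocab.update(tokens)
            for prev, word in bigrams(tokens):
                self.bigram_counts[(prev, word)] += 1
                self.context_counts[prev] += 1

    def log_prob(self, tokens, vocab_size):
        # Sum of log P(word | prev) with add-one (Laplace) smoothing.
        total = 0.0
        for prev, word in bigrams(tokens):
            numerator = self.bigram_counts[(prev, word)] + 1
            denominator = self.context_counts[prev] + vocab_size
            total += math.log(numerator / denominator)
        return total


def train(labeled_docs):
    # labeled_docs: iterable of (category, token list) pairs.
    by_category = {}
    for label, tokens in labeled_docs:
        by_category.setdefault(label, []).append(tokens)
    models = {c: BigramCategoryModel(d) for c, d in by_category.items()}
    # Share one vocabulary size so scores are comparable across categories.
    vocab = set().union(*(m.vocab for m in models.values()))
    return models, len(vocab)


def classify(models, vocab_size, tokens):
    # Assumes a uniform prior over categories: pick the highest likelihood.
    return max(models, key=lambda c: models[c].log_prob(tokens, vocab_size))
```

A document is assigned to the category whose model gives it the highest log-likelihood. The number of distinct n-grams stored in such a model grows quickly with corpus size, which is the model-size issue the abstract contrasts with the n-multigram approach.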

Details

Publication type: Inproceedings
Published in: CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management
URL: http://doi.acm.org/10.1145/1183614.1183710
Pages: 672–681
ISBN: 1-59593-433-2
Address: New York, NY, USA
Publisher: ACM