Text classification improved through multigram models

Classification algorithms and document representation approaches are two key elements of a successful document classification system. Much past work has sought better ways to represent documents. However, most of these attempts either rely on extra resources such as WordNet or suffer from extremely high dimensionality. In this paper, we propose a new document representation approach based on n-multigram language models. This approach can automatically discover the hidden semantic sequences in the documents under each category. Based on n-multigram and n-gram language models, we put forward two text classification algorithms. Experiments on RCV1 show that the proposed algorithm based on n-multigram models alone achieves similar or even better classification performance than the classifier based on n-gram models, while its model size is much smaller. The second proposed algorithm, which combines n-multigram and n-gram models, improves micro-F1 from 89.5% to 92.6% and macro-F1 from 87.2% to 91.1%. These observations support the validity of the proposed document representation approach.
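To make the idea of classifying with per-category language models concrete, here is a minimal sketch of the n-gram baseline style of classifier the abstract refers to. It is not the paper's algorithm: it assumes bigrams, add-one smoothing, uniform category priors, and pre-tokenized input, and the function names (`train_bigram_models`, `log_likelihood`, `classify`) are hypothetical. The n-multigram model, which segments text into variable-length sequences, is not reproduced here.

```python
import math
from collections import Counter, defaultdict


def train_bigram_models(labeled_docs):
    """Count bigrams per category. labeled_docs: iterable of (category, tokens)."""
    bigrams = defaultdict(Counter)   # category -> counts of (prev, word)
    contexts = defaultdict(Counter)  # category -> counts of prev
    vocab = set()
    for category, tokens in labeled_docs:
        padded = ["<s>"] + list(tokens)  # "<s>" marks the document start
        vocab.update(tokens)
        for prev, word in zip(padded, padded[1:]):
            bigrams[category][(prev, word)] += 1
            contexts[category][prev] += 1
    return bigrams, contexts, vocab


def log_likelihood(tokens, category, model):
    """Log-probability of tokens under one category's bigram model."""
    bigrams, contexts, vocab = model
    v = len(vocab) + 1  # assumption: one extra slot for unseen words
    padded = ["<s>"] + list(tokens)
    total = 0.0
    for prev, word in zip(padded, padded[1:]):
        # Add-one smoothing (an assumption, chosen for brevity).
        numerator = bigrams[category][(prev, word)] + 1
        denominator = contexts[category][prev] + v
        total += math.log(numerator / denominator)
    return total


def classify(tokens, model):
    """Pick the category whose model scores the document highest (uniform priors)."""
    categories = list(model[0].keys())
    return max(categories, key=lambda c: log_likelihood(tokens, c, model))


# Toy usage with made-up training documents.
docs = [
    ("sports", "the team won the match".split()),
    ("finance", "the bank raised the rate".split()),
]
model = train_bigram_models(docs)
label = classify("team won match".split(), model)
```

The table of bigram counts grows with every distinct word pair seen in training, which is the model-size cost the abstract contrasts with the smaller n-multigram models.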


In  CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management

Publisher  ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM’06, November 5–11, 2006, Arlington, Virginia, USA. Copyright 2006 ACM 1-59593-433-2/06/0011 ...$5.00.


Address  New York, NY, USA