Dou Shen, Jian-Tao Sun, Qiang Yang, Hui Zhao, and Zheng Chen
We propose to use the n-multigram model to help the automatic text classification task. This model could automatically discover the latent semantic sequences contained in the document set of each category. Based on the n-multigram model and the n-gram language model, we put forward two text classification algorithms. The experiments on RCV1 show that our proposed algorithm based on n-multigram model can achieve the similar classification performance compared with the one based on n-gram model. However, the model size of our algorithm is only 4.21% of the latter one. Another proposed algorithm based on the combination of nmultigram model and n-gram model improves the micro- F1 and macro-F1 values by 3.5% and 4.5% respectively which support the validity of our approach.
|Published in||ICDE '06: Proceedings of the 22nd International Conference on Data Engineering|
|Address||Washington, DC, USA|
|Publisher||IEEE Computer Society|
Copyright © 2007 IEEE. Reprinted from IEEE Computer Society. This material is posted here with permission of the IEEE. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to firstname.lastname@example.org. By choosing to view this document, you agree to all provisions of the copyright laws protecting it.