Shen Huang, Jian-Tao Sun, Xuanhui Wang, Hua-Jun Zeng, and Zheng Chen
Experts from different domains mine users' comments on weblogs for different purposes, such as politics or commerce. All of these needs require automatically distinguishing subjective weblog content from objective content, a task known as subjectivity categorization. Since weblogs cover various topics from different domains, limited training data can hardly cover all the topics, and "unseen words" become a serious problem for categorization tasks. In this paper, Part-Of-Speech (POS) based smoothing is proposed to alleviate the "unseen words" problem. In conjunction with a naive Bayes model constructed from limited training data, the probability of an unseen word in a new domain can be smoothed by the probability of its POS tag. Empirical studies on five datasets show that our approach consistently outperforms basic naive Bayes with Laplace smoothing. In a cross-domain experiment, our approach achieves a 22.0% improvement in Macro F1 and a 24.4% improvement in Micro F1 over basic naive Bayes. These results confirm that POS based smoothing can indeed benefit subjectivity categorization, especially in cases with a large number of unseen words.
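The general idea of backing off from an unseen word to its POS tag can be illustrated with a minimal sketch. This is not the paper's exact formulation, which the abstract does not give. It assumes documents arrive already tagged as (word, tag) pairs, that any word absent from the training vocabulary is scored by a Laplace-smoothed estimate of P(tag | class) instead of P(word | class), and that seen words use ordinary Laplace smoothing. The class name `PosSmoothedNB`, its parameters, and the toy training data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class PosSmoothedNB:
    """Toy naive Bayes that backs off to POS-tag probabilities for unseen words.

    Illustrative assumption only: the back-off rule below is one simple way to
    realise "smoothing by POS" and is not taken from the paper.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant
        self.word_counts = defaultdict(Counter)  # class -> word -> count
        self.pos_counts = defaultdict(Counter)   # class -> POS tag -> count
        self.doc_counts = Counter()              # class -> number of documents
        self.vocab = set()
        self.tagset = set()

    def fit(self, docs, labels):
        # Each document is a list of (word, pos_tag) pairs.
        for doc, label in zip(docs, labels):
            self.doc_counts[label] += 1
            for word, tag in doc:
                self.word_counts[label][word] += 1
                self.pos_counts[label][tag] += 1
                self.vocab.add(word)
                self.tagset.add(tag)
        return self

    def _log_token_prob(self, word, tag, label):
        if word in self.vocab:
            # Seen word: Laplace-smoothed P(word | class).
            counts, key, size = self.word_counts[label], word, len(self.vocab)
        else:
            # Unseen word: back off to Laplace-smoothed P(tag | class).
            counts, key, size = self.pos_counts[label], tag, len(self.tagset)
        total = sum(counts.values())
        return math.log((counts[key] + self.alpha) / (total + self.alpha * size))

    def predict(self, doc):
        n_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.doc_counts:
            score = math.log(self.doc_counts[label] / n_docs)  # class prior
            score += sum(self._log_token_prob(w, t, label) for w, t in doc)
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical toy data with hand-assigned tags.
train_docs = [
    [("i", "PRP"), ("love", "VB"), ("great", "JJ"), ("movie", "NN")],
    [("awful", "JJ"), ("boring", "JJ"), ("plot", "NN")],
    [("film", "NN"), ("opens", "VB"), ("friday", "NN")],
    [("company", "NN"), ("reported", "VB"), ("earnings", "NN")],
]
train_labels = ["subjective", "subjective", "objective", "objective"]

model = PosSmoothedNB().fit(train_docs, train_labels)

# Both words are unseen, so only their tags (adjectives) drive the decision.
unseen_label = model.predict([("superb", "JJ"), ("terrific", "JJ")])
# One unseen noun plus one seen noun.
mixed_label = model.predict([("quarterly", "NN"), ("earnings", "NN")])
```

In this toy setup, adjectives occur only in the subjective training documents, so a document made entirely of unseen adjectives is still scored differently across the two classes. Under plain Laplace smoothing, every unseen word would receive the same small probability regardless of its tag.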
|Published in||ICDM '06: Proceedings of the Sixth International Conference on Data Mining|
|Address||Washington, DC, USA|
|Publisher||IEEE Computer Society|
Copyright © 2007 IEEE. Reprinted from IEEE Computer Society. This material is posted here with permission of the IEEE. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to firstname.lastname@example.org. By choosing to view this document, you agree to all provisions of the copyright laws protecting it.