Jie Tang, Hang Li, Yunbo Cao, and Zhaohui Tang
This paper addresses the issue of 'email data cleaning' for text mining. Many text mining applications need to take emails as input. Email data is usually noisy, so it must be cleaned before mining is conducted. Although several products offer email cleaning features, the types of noise they can process are limited. Despite the importance of the problem, email cleaning has received little attention in the research community, and a thorough and systematic investigation of the issue is therefore needed. In this paper, email cleaning is formalized as a problem of non-text filtering and text normalization, which makes it independent of any specific mining process. A cascaded approach is proposed that cleans up an email in four passes: non-text filtering, paragraph normalization, sentence normalization, and word normalization. To the best of our knowledge, non-text filtering and paragraph normalization have not been investigated previously. Methods for performing these tasks on the basis of Support Vector Machines (SVMs) are proposed, and the features used in the models are defined. Experimental results indicate that the proposed SVM-based methods for email cleaning significantly outperform the baseline methods. The proposed method has also been applied to term extraction, a typical text mining task. Experimental results show that the accuracy of term extraction improves significantly after the proposed email data cleaning method is applied.
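To make the cascaded structure concrete, the four passes can be sketched as a chain of functions, each consuming the output of the previous one. The sketch below is illustrative only: the function names, the header and signature rules, and the `WORD_FIXES` table are hypothetical, and each pass uses a simple hand-written rule as a stand-in for the SVM models proposed in the paper.

```python
import re
from typing import Callable, List

# Hypothetical lookup table standing in for a learned word normalizer.
WORD_FIXES = {"teh": "the", "recieve": "receive"}


def filter_non_text(email: str) -> str:
    """Pass 1: drop header-like lines and everything after a '--' signature line.

    Toy rule only; the paper uses SVM classifiers for this step.
    """
    kept: List[str] = []
    for line in email.splitlines():
        if re.match(r"^(From|To|Subject|Sent|Date):", line):
            continue
        if line.strip() == "--":
            break
        kept.append(line)
    return "\n".join(kept)


def normalize_paragraphs(text: str) -> str:
    """Pass 2: treat blank lines as paragraph ends and join hard-wrapped lines."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())


def normalize_sentences(text: str) -> str:
    """Pass 3: collapse repeated end punctuation and capitalize sentence starts."""
    text = re.sub(r"([!?.])\1+", r"\1", text)
    return re.sub(
        r"(^|[.!?]\s+|\n\n)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


def normalize_words(text: str) -> str:
    """Pass 4: replace known misspellings, keeping an initial capital if present."""
    def fix(match: "re.Match[str]") -> str:
        word = match.group(0)
        replacement = WORD_FIXES.get(word.lower())
        if replacement is None:
            return word
        return replacement.capitalize() if word[0].isupper() else replacement

    return re.sub(r"\b\w+\b", fix, text)


# The cascade: the order of the passes follows the abstract.
PASSES: List[Callable[[str], str]] = [
    filter_non_text,
    normalize_paragraphs,
    normalize_sentences,
    normalize_words,
]


def clean_email(email: str) -> str:
    """Apply the four passes in sequence to one raw email body."""
    for cleaning_pass in PASSES:
        email = cleaning_pass(email)
    return email
```

Because every pass maps text to text, a single pass can be replaced by a trained classifier without changing the others, which is the property that keeps the cleaning step independent of the downstream mining task.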