Boosting the Feature Space: Text Categorization for Unstructured Data on the Web

The issue of seeking efficient and effective methods for classifying unstructured text in large document corpora has received much attention in recent years. Traditional document representations such as bag-of-words encode documents as feature vectors, which usually leads to sparse feature spaces of high dimensionality, making it hard to achieve high classification accuracy. This paper addresses the problem of classifying unstructured documents on the Web. A classification approach is proposed that combines traditional feature reduction techniques with a collaborative filtering method for augmenting document feature spaces. The method produces feature spaces with an order of magnitude fewer features than a baseline bag-of-words feature selection method. Experiments on both real-world data and a benchmark corpus indicate that our approach improves classification accuracy over traditional methods for both Support Vector Machines and AdaBoost classifiers.
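To make the sparsity problem mentioned in the abstract concrete, the following is a minimal sketch of a bag-of-words encoding in plain Python. It is an illustration only, not the paper's method: the toy corpus and the helper names `build_vocabulary`, `bag_of_words`, and `sparsity` are hypothetical, and the feature reduction and collaborative filtering steps are not shown.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(doc, vocab):
    """Return a dense term-count vector for one document."""
    counts = Counter(doc.lower().split())
    # The dict preserves insertion order, so this follows the column indices.
    return [counts.get(tok, 0) for tok in vocab]


def sparsity(matrix):
    """Return the fraction of entries in the matrix that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


# Hypothetical toy corpus (assumption: not data from the paper).
docs = [
    "cheap flights to paris",
    "support vector machines for text",
    "boosting weak classifiers for text",
]

vocab = build_vocabulary(docs)
matrix = [bag_of_words(doc, vocab) for doc in docs]

print(len(vocab), "features;", f"{sparsity(matrix):.0%} of entries are zero")
```

Even with three short documents, most entries of the document-term matrix are zero, because each document uses only a few of the vocabulary terms. On a Web-scale corpus the vocabulary grows much faster than the length of any single document, which is why reducing and augmenting the feature space matters.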


In the Sixth IEEE International Conference on Data Mining (ICDM 2006)

Publisher: IEEE
© 2008 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

