Janez Brank, Marko Grobelnik, Nataša Milic-Frayling, and Dunja Mladenic
Text categorization is the task of classifying natural language documents into a set of predefined categories. Documents are typically represented as sparse vectors under the vector space model: each word in the vocabulary is mapped to one coordinate axis, and its occurrence in a document gives rise to one nonzero component in the vector representing that document. When classifiers are trained on large document collections, both the time and the memory requirements associated with these vectors may be prohibitive. This calls for a feature selection method that not only reduces the number of features but also increases the sparsity of the vectors. We propose a feature selection method based on linear Support Vector Machines (SVMs). A linear SVM is trained on a subset of the training data, yielding a linear classifier characterized by the normal to the hyperplane that separates positive and negative instances. Components of the normal with higher absolute values have a larger impact on classification. Instead of fixing in advance the number of highest-scoring features to include in a classifier, we select features so as to reach a predefined average sparsity level across documents and classifiers for a given training set. Once the feature set is determined, the model is trained on the full training set, represented using only the selected features. We compare this approach with more traditional feature selection methods, such as Mutual Information and Odds Ratio, in terms of vector sparsity and classification performance. We also examine how the size of the training data subset affects the quality of feature selection and, ultimately, classification performance. Preliminary results indicate that, at the same level of vector sparsity, feature selection based on SVM normals performs better than feature selection based on Odds Ratio or Mutual Information.
In the reported experiments, we use a linear SVM as the classification model.
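The selection step described above can be illustrated with a minimal sketch. It assumes that the normal of an already trained linear SVM is available as a mapping from feature to weight, and that sparsity is measured as the average number of nonzero components per document vector; both are assumptions made for illustration. The names `rank_features`, `select_for_sparsity`, and `project` are hypothetical, and the sketch handles a single classifier rather than an average over several.

```python
from typing import Dict, List

# A document is a sparse vector: feature name -> nonzero value.
SparseDoc = Dict[str, float]


def rank_features(normal: Dict[str, float]) -> List[str]:
    """Order features by decreasing absolute weight in the normal.

    Ties are broken alphabetically so the ranking is deterministic.
    """
    return sorted(normal, key=lambda f: (-abs(normal[f]), f))


def select_for_sparsity(
    docs: List[SparseDoc],
    normal: Dict[str, float],
    target_avg_nonzeros: float,
) -> List[str]:
    """Keep top-ranked features while the average number of nonzero
    components per document stays at or below the target.

    Assumption: sparsity is the average count of nonzeros per document.
    """
    if not docs:
        return []

    # Number of documents in which each feature is nonzero.
    doc_freq: Dict[str, int] = {}
    for doc in docs:
        for feature, value in doc.items():
            if value != 0:
                doc_freq[feature] = doc_freq.get(feature, 0) + 1

    selected: List[str] = []
    total_nonzeros = 0
    for feature in rank_features(normal):
        added = doc_freq.get(feature, 0)
        if (total_nonzeros + added) / len(docs) > target_avg_nonzeros:
            break  # adding this feature would exceed the sparsity budget
        selected.append(feature)
        total_nonzeros += added
    return selected


def project(doc: SparseDoc, selected: List[str]) -> SparseDoc:
    """Represent a document using only the selected features."""
    keep = set(selected)
    return {f: v for f, v in doc.items() if f in keep}
```

In this sketch the weights would come from a linear SVM trained on a subset of the training data; the full training set would then be passed through `project` before the final model is trained. The stopping rule (halt at the first feature that would exceed the budget) is one simple choice among several possible ones.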