Nataša Milic-Frayling, Dunja Mladenic, Janez Brank, and Marko Grobelnik
In this paper we revisit the practice of using feature selection for dimensionality and noise reduction. Commonly, we score features according to some weighting scheme and then specify that the top N ranked features, or the top N percent of scored features, are to be used for further processing. In text classification, such selection criteria lead to significantly different sizes of (unique) feature sets across weighting schemes if a particular level of performance is to be achieved for a given learning method. At the same time, the number and type of features determine the sparsity characteristics of the training and test documents, i.e., the average number of features per document vector. We show that specifying a sparsity level as the selection criterion, instead of a pre-defined number of features per category, produces comparable average performance over the set of categories. It also has the obvious advantage of providing a means to control the consumption of memory resources. Furthermore, we show that observing the sparsity characteristics of selected feature sets, in the form of sparsity curves, can be useful in understanding the nature of the feature weighting scheme itself. In particular, we begin to understand the extent to which feature specificity, commonly called ‘rarity’, is incorporated into the term weighting scheme and accounted for in the learning algorithm.
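As an illustration of sparsity as a selection criterion, the following is a minimal sketch, not the implementation used in this work. It assumes documents are represented as sets of terms and that feature scores are already available. The function names (`sparsity`, `select_by_sparsity`), the toy documents, and the score values are all hypothetical.

```python
from typing import Dict, List, Set


def sparsity(docs: List[Set[str]], selected: Set[str]) -> float:
    """Average number of selected features per document vector."""
    if not docs:
        return 0.0
    return sum(len(doc & selected) for doc in docs) / len(docs)


def select_by_sparsity(
    docs: List[Set[str]], scores: Dict[str, float], target: float
) -> List[str]:
    """Add features in decreasing score order until the average number
    of retained features per document reaches `target`."""
    if not docs:
        return []
    # Rank by score (descending); break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda f: (-scores[f], f))
    selected: List[str] = []
    total = 0  # total number of (document, feature) occurrences retained
    for feature in ranked:
        if total / len(docs) >= target:
            break
        selected.append(feature)
        # Each document containing the feature gains one non-zero component.
        total += sum(1 for doc in docs if feature in doc)
    return selected


# Toy corpus and two made-up scoring schemes (assumptions for illustration).
docs = [
    {"oil", "price", "the"},
    {"oil", "the"},
    {"wheat", "the"},
    {"the", "price", "wheat"},
]
favors_rare = {"oil": 0.9, "wheat": 0.8, "price": 0.5, "the": 0.1}
favors_common = {"the": 0.9, "oil": 0.5, "wheat": 0.4, "price": 0.3}

rare_set = select_by_sparsity(docs, favors_rare, target=1.0)
common_set = select_by_sparsity(docs, favors_common, target=1.0)
```

In this toy setting both schemes reach the same sparsity of one feature per document, but the scheme that favors rarer terms needs two features to get there while the scheme that favors common terms needs only one. This mirrors the point that a fixed feature count and a fixed sparsity level are different criteria.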