Janez Brank and Natasa Milic-Frayling
Optimizing the performance of classification models often involves feature selection, either to eliminate noise from the feature set or to reduce computational complexity by controlling the dimensionality of the feature space. Refinement of the feature set is typically performed in two steps: scoring and ranking the features, then applying a selection criterion. Empirical studies of the effectiveness of feature selection methods are typically limited to identifying the number or percentage of features to retain in order to maximize classification performance. Since no characterizations of the feature set are considered beyond its size, we currently have a limited understanding of the relationship between classifier performance and the properties of the selected feature set. This paper presents a framework for characterizing feature weighting methods and selected feature sets, and for exploring how these characteristics account for the performance of a given classifier. We illustrate the use of two feature set statistics: the cumulative information gain of the ranked features and the sparsity of the data representation that results from the selected feature set. We apply a novel approach of synthesizing ranked lists of features that satisfy given cumulative information gain and sparsity constraints. We show how the use of synthesized rankings enables us to investigate the degree to which feature set properties explain the behaviour of a classifier, e.g., the Naïve Bayes classifier, when used in conjunction with different feature weighting schemes.
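As an illustration of the two feature set statistics named above, the following sketch (not the paper's implementation; all function names are ours) scores binary features by information gain, ranks them, and reports the cumulative information gain and the density of the reduced representation for a chosen cutoff k:

```python
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def information_gain(column, labels):
    """IG of one binary feature t: H(C) - P(t) H(C|t) - P(~t) H(C|~t)."""
    n = len(labels)
    with_t = [y for x, y in zip(column, labels) if x]
    without_t = [y for x, y in zip(column, labels) if not x]
    p_t = len(with_t) / n
    return (entropy(labels)
            - p_t * entropy(with_t)
            - (1 - p_t) * entropy(without_t))

def select_top_k(X, y, k):
    """Rank features by IG and keep the top k.

    X: list of documents, each a 0/1 feature list; y: class labels.
    Returns (kept feature indices, cumulative IG, density), where density
    is the fraction of nonzero entries left in the reduced representation
    (the complement of the sparsity statistic).
    """
    n_features = len(X[0])
    scores = [information_gain([row[j] for row in X], y)
              for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: scores[j],
                   reverse=True)[:k]
    cum_ig = sum(scores[j] for j in order)
    density = sum(1 for row in X for j in order if row[j]) / (len(X) * k)
    return order, cum_ig, density
```

For example, with four documents, a feature that perfectly predicts the class receives IG 1.0 and is ranked first, while an uninformative feature receives IG 0.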