Janez Brank, Marko Grobelnik, Nataša Milic-Frayling, and Dunja Mladenic
Text categorization is the problem of automatically assigning text documents to one or more categories. Typically, a set of labelled data, consisting of positive and negative examples for a category, is available for training automatic classifiers. We are particularly concerned with text classification when the training data is highly imbalanced, i.e., the number of positive examples is very small. We show that the linear support vector machine (SVM) learning algorithm is adversely affected by imbalance in the training data. While the resulting hyperplane has a reasonable orientation, the proposed score threshold (parameter b) is too conservative. In our experiments we demonstrate that the SVM-specific cost-learning approach is not effective in dealing with imbalanced classes. We obtained better results with methods that directly modify the score threshold. We propose a method based on the conditional class distributions for SVM scores that works well when very few training examples are available to the learner.
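
To make the idea of directly modifying the score threshold concrete, the following is a minimal sketch in Python. It is not the method based on conditional class distributions proposed here; it is a generic search that keeps the learned scores fixed and picks the threshold with the highest F1 on labelled examples. The function names `f1_at` and `best_threshold`, the choice of F1 as the criterion, and the toy scores are all assumptions made for illustration.

```python
def f1_at(scores, labels, threshold):
    """F1 of the rule 'predict positive when score >= threshold'."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y != 1)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(scores, labels):
    """Return the candidate threshold with the highest F1.

    Candidates are one value below all scores plus the midpoints
    between consecutive distinct scores. Ties go to the lowest candidate.
    """
    distinct = sorted(set(scores))
    candidates = [distinct[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda t: f1_at(scores, labels, t))


# Hypothetical SVM scores (w.x + b) for seven documents, three of them positive.
# Two positives score below zero, mimicking a threshold that is too conservative.
scores = [-2.0, -1.5, -1.2, -0.9, -0.6, -0.4, 0.3]
labels = [-1, -1, -1, -1, 1, 1, 1]

default_f1 = f1_at(scores, labels, 0.0)      # threshold implied by the learned b
shifted = best_threshold(scores, labels)     # threshold chosen from the scores
shifted_f1 = f1_at(scores, labels, shifted)
print(default_f1, shifted, shifted_f1)
```

In this toy data the ordering of the scores is already correct, so only the cut-off needs to move, which mirrors the observation that the hyperplane orientation is reasonable while b is too conservative. In practice the threshold would be selected on held-out data or by cross-validation rather than on the examples used for evaluation.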