Training text classifiers with SVM on very few positive examples

Janez Brank, Marko Grobelnik, Nataša Milic-Frayling, and Dunja Mladenic

Abstract

Text categorization is the problem of automatically assigning text documents to one or more categories. Typically, a set of labelled data, consisting of positive and negative examples for a category, is available for training automatic classifiers. We are particularly concerned with text classification when the training data is highly imbalanced, i.e., the number of positive examples is very small. We show that the linear support vector machine (SVM) learning algorithm is adversely affected by imbalance in the training data. While the resulting hyperplane has a reasonable orientation, the proposed score threshold (parameter b) is too conservative. In our experiments we demonstrate that the SVM-specific cost-learning approach is not effective in dealing with imbalanced classes. We obtained better results with methods that directly modify the score threshold. We propose a method based on the conditional class distributions for SVM scores that works well when very few training examples are available to the learner.
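To make the idea of directly modifying the score threshold concrete, here is a minimal Python sketch. It is not the method proposed in the report (which is based on conditional class distributions of SVM scores); it simply picks, on held-out data, the cut-off that maximizes F1. The scores, the labels, and the function names `f1_at_threshold` and `best_threshold` are hypothetical and chosen only for illustration.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 of the rule 'predict positive when score >= threshold'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y != 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Try each observed score as a cut-off and return the one with the highest F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


# Hypothetical held-out SVM scores (w.x + b); label 1 = positive, -1 = negative.
# Two of the three positives fall below the default cut-off of 0.
scores = [-1.8, -1.5, -1.2, -0.9, -0.7, -0.4, -0.3, 0.2]
labels = [-1, -1, -1, -1, 1, -1, 1, 1]

default_f1 = f1_at_threshold(scores, labels, 0.0)   # threshold implied by b
shifted = best_threshold(scores, labels)            # data-driven cut-off
shifted_f1 = f1_at_threshold(scores, labels, shifted)

print(f"default threshold 0.0: F1 = {default_f1:.3f}")
print(f"shifted threshold {shifted}: F1 = {shifted_f1:.3f}")
```

In this toy data the hyperplane orientation ranks most positives above most negatives, but the default threshold rejects two of them; lowering the cut-off recovers them at the cost of one false positive. With very few positive examples a held-out set may not be available, which is one reason a sketch like this should not be read as a substitute for the approach described in the report.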

Details

Publication type: TechReport
Number: MSR-TR-2003-34
Pages: 27
Institution: Microsoft Research