Learning at Low False Positive Rates

Most spam filters are configured for use at a very low false-positive rate. Typically, the filters are trained with techniques that optimize accuracy or entropy, rather than performance in this configuration. We describe two different techniques for optimizing for the low false-positive region. One method weights good data more than spam. The other method uses a two-stage technique of first finding data in the low false-positive region, and then learning using this subset. We show that with two different learning algorithms, logistic regression and Naive Bayes, we achieve substantial improvements, reducing missed spam by as much as 20% relative for logistic regression and 40% for Naive Bayes at the same low false-positive rate.
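
The abstract gives no implementation details, so the following Python sketch is only one plausible reading of the two techniques, shown for logistic regression. Every function name, the per-example weighting scheme, the gradient-descent settings, and the rule that messages outside the selected region are treated as good mail are assumptions made for illustration, not the authors' implementation.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(model, x):
    # Estimated probability that message x is spam.
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_logreg(X, y, weights, epochs=200, lr=0.1):
    """Weighted logistic regression via batch gradient descent.
    y[i] is 1 for spam and 0 for good mail; weights[i] scales the loss
    of example i (hypothetical scheme: good mail gets a weight > 1)."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    total = sum(weights)
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for xi, yi, ci in zip(X, y, weights):
            err = ci * (score((w, b), xi) - yi)
            for j in range(dim):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * g / total for wj, g in zip(w, gw)]
        b -= lr * gb / total
    return w, b

def threshold_at_fpr(scores, y, max_fpr):
    """Threshold t such that flagging messages with score > t marks at most
    a max_fpr fraction of good mail as spam (assuming no tied scores)."""
    good = sorted((s for s, yi in zip(scores, y) if yi == 0), reverse=True)
    allowed = int(max_fpr * len(good))  # good messages we may misclassify
    return good[allowed] if allowed < len(good) else 0.0

def train_two_stage(X, y, cutoff, epochs=200, lr=0.1):
    """Stage 1 trains on all data. Stage 2 trains only on the examples
    that stage 1 scores at or above `cutoff` (assumed to approximate the
    low false-positive region)."""
    stage1 = train_logreg(X, y, [1.0] * len(X), epochs, lr)
    keep = [i for i, x in enumerate(X) if score(stage1, x) >= cutoff]
    Xs, ys = [X[i] for i in keep], [y[i] for i in keep]
    if len(set(ys)) < 2:
        # Subset contains a single class, so fall back to the stage-1 model.
        return stage1, stage1
    return stage1, train_logreg(Xs, ys, [1.0] * len(Xs), epochs, lr)

def two_stage_score(models, x, cutoff):
    # Messages outside the selected region are treated as good mail (score 0).
    stage1, stage2 = models
    return score(stage2, x) if score(stage1, x) >= cutoff else 0.0
```

For the first technique, one would call `train_logreg` with a weight greater than 1 on every good message, for example `[10.0 if yi == 0 else 1.0 for yi in y]`, and then pick the operating point with `threshold_at_fpr` on held-out scores. The weight of 10 is an arbitrary illustrative value; in practice it would be tuned on validation data.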

In: Proceedings of the 3rd Conference on Email and Anti-Spam

Publisher: CEAS
Copyright (c) 2006

Details

Type: Inproceedings