Learning at Low False Positive Rates

  • Scott Wen-tau Yih
  • Joshua Goodman
  • Geoff Hulten

Proceedings of the 3rd Conference on Email and Anti-Spam

Published by CEAS

Most spam filters are configured for use at a very low false-positive rate. Typically, the filters are trained with techniques that optimize accuracy or entropy, rather than performance in this configuration. We describe two different techniques for optimizing for the low false-positive region. One method weights good data more than spam. The other method uses a two-stage technique of first finding data in the low false-positive region, and then learning using this subset. We show that with two different learning algorithms, logistic regression and Naive Bayes, we achieve substantial improvements, reducing missed spam by as much as 20% relative for logistic regression and 40% for Naive Bayes at the same low false-positive rate.
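
The two ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the gradient-descent trainer, the `good_weight` parameter, and the reading of "data in the low false-positive region" as "messages a first-stage model scores at or above a cutoff" are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(w, x):
    # Spam probability under a logistic model; w[0] is the bias term.
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

def train_logreg(X, y, good_weight=1.0, epochs=200, lr=0.1):
    """Batch gradient descent on a weighted log-loss (1 = spam, 0 = good).

    good_weight > 1 makes errors on good mail cost more than errors on spam
    (hypothetical parameter illustrating the first technique).
    """
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        total = 0.0
        for x, label in zip(X, y):
            c = good_weight if label == 0 else 1.0
            err = c * (score(w, x) - label)
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
            total += c
        # Normalize by the total example weight so lr keeps a similar scale.
        w = [wi - lr * g / total for wi, g in zip(w, grad)]
    return w

def threshold_at_fpr(scores_good, max_fpr):
    """Return a threshold t such that at most max_fpr of the given good-mail
    scores are strictly greater than t (those would be false positives)."""
    ranked = sorted(scores_good, reverse=True)
    allowed = min(int(max_fpr * len(ranked)), len(ranked) - 1)
    return ranked[allowed]

def two_stage_subset(X, y, first_stage_w, cutoff):
    """Keep only the examples the first-stage model scores at or above cutoff.

    Assumed reading of the second technique: a second model would then be
    trained on this subset alone.
    """
    keep = [i for i, x in enumerate(X) if score(first_stage_w, x) >= cutoff]
    return [X[i] for i in keep], [y[i] for i in keep]
```

Under these assumptions, one might train a first model with `train_logreg`, choose `cutoff` from held-out good mail with `threshold_at_fpr`, and then call `train_logreg` again on the output of `two_stage_subset`. How the cutoff and the weight on good mail should be chosen is not stated in the abstract, so both are left as free parameters here.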