Tutorial on Junk E-mail Filtering

Geoff Hulten and Joshua Goodman

Slides in PDF

Slides in PPT

Overview

The tutorial will cover junk e-mail (spam) filtering, with an emphasis on machine learning techniques. We will start by describing the problem, including the tricks spammers use. Next, we will describe solutions to spam. We will spend most of the time on machine learning approaches and problems, including text categorization, probabilistic methods, cost-sensitive learning, and user adaptation. We will also describe non-learning approaches such as challenge-response systems and policy-based solutions. Finally, we will describe the special challenges that spam presents, and the long-term prospects for solving spam through a combination of approaches.

Intended Audience

This talk is for anyone interested in learning about spam. We expect a basic knowledge of machine learning, typical of an ICML attendee. We do not assume any prior knowledge about spam.

Outline

·        Introduction to Spam

o       Define the email spam problem and its scope

o       ‘Isn’t this just text categorization?’  Isn’t it solved already?

§         No!

·        Techniques Spammers Use

o       Obscuring Spam

o       Getting addresses to spam

o       Delivering spam

§         Open proxies/relays, viruses/Trojans

·        Kinds of spam

o       Email, chat rooms, instant messaging, web popups, search engines, and more!

·        Solutions to Spam

o       Filtering

§         Machine learning, matching, blackhole lists

o       ‘Postage’ systems

§         Turing tests, money, computation

o       SmartProof Approach (combination of learning with postage)

o       Stopping Outbound Spam

o       Disposable Email Addresses

o       Deployment Issues

·        Machine Learning solutions

o       Evaluating spam filters

o       Building models of spam

§         Bayesian Techniques

§         Support Vector Machines

§         Rules/Trees

o       Miscellaneous (e.g., beating machine learning solutions)

o       Personalization

·        Analyzing all that data

o       Getting millions of hand-labeled messages for free!

o       Analysis of the data

§         Where is spam from?

§         Can legal approaches work?

·        Conclusion

o       Email and Spam as a new field of study

§         Email and spam are not just machine learning

§         Cross-disciplinary solutions to similar problems

o       Spam is a huge problem

§         Will live beyond Email

o       The Spam end-game

§         Will require a variety of solutions

o       Email and Spam present tons of new problems!

Presenters

Joshua Goodman’s first job was at Dragon Systems, working on speech recognition. (Some of his code survives to this day.) He then went to Harvard University, where he received a Ph.D. for research in statistical natural language processing. In 1998, he joined Microsoft Research, focusing on language modeling. In 2002, he began work on spam, helping start Microsoft’s Anti-Spam Technology and Strategy product group. He presented at the 2003 MIT Spam Conference and was an invited speaker about spam at the 2003 KDD Workshop on Operational Text Classification. He is a Program Co-Chair for the Conference on Email and Anti-Spam (http://www.ceas.cc). He is officially a member of the Machine Learning and Applied Statistics Group at Microsoft Research, but has been “on loan” to Microsoft’s anti-spam product team since its inception.

 

Geoff Hulten is a researcher in Microsoft’s Anti-Spam Technology and Strategy group. His research focuses on learning from massive data streams and modeling time-changing concepts. He presented at the 2004 MIT Spam Conference.

 
