Geoff Hulten and Joshua Goodman
The tutorial will cover junk email (spam) filtering, with an emphasis on machine learning
techniques. We will start by describing the problem, including tricks
spammers use. Next, we will describe solutions to spam. We will
spend most of the time on machine learning approaches and problems, including
text categorization, probabilistic methods, cost-sensitive learning, and user
adaptation. We will also describe non-learning approaches such as
challenge-response systems and policy-based solutions. Finally, we will
describe the special challenges that spam presents, and the long-term prospects
for solving spam through a combination of approaches.
This talk is for anyone interested in learning about spam. We expect a basic
knowledge of machine learning, typical of an ICML attendee. We do not
assume any prior knowledge about spam.
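As a rough illustration of how the probabilistic and cost-sensitive ideas named in the abstract fit together, the following is a minimal sketch, not the presenters' method. It assumes a naive Bayes model over whitespace-separated words with add-one smoothing, and a hypothetical `false_positive_cost` parameter expressing how much worse it is to lose a good message than to deliver a spam message. All function names are illustrative.

```python
import math
from collections import Counter

def train(messages):
    """Count words and messages per class from (text, is_spam) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    msg_counts = Counter()
    for text, label in messages:
        msg_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, msg_counts

def spam_log_odds(text, word_counts, msg_counts):
    """Naive Bayes log-odds of spam vs. good, with add-one smoothing.

    Assumes the training data held at least one message of each class.
    """
    vocab = set(word_counts[True]) | set(word_counts[False])
    totals = {c: sum(word_counts[c].values()) for c in (True, False)}
    log_odds = math.log(msg_counts[True] / msg_counts[False])
    for word in text.lower().split():
        if word not in vocab:
            continue  # words never seen in training carry no evidence
        p_spam = (word_counts[True][word] + 1) / (totals[True] + len(vocab))
        p_good = (word_counts[False][word] + 1) / (totals[False] + len(vocab))
        log_odds += math.log(p_spam / p_good)
    return log_odds

def is_spam(text, word_counts, msg_counts, false_positive_cost=10.0):
    """Flag a message only when the expected cost of delivering it is higher.

    With a missed spam costing 1 and a lost good message costing
    false_positive_cost, the rule is: odds of spam > false_positive_cost.
    """
    return spam_log_odds(text, word_counts, msg_counts) > math.log(false_positive_cost)
```

Raising `false_positive_cost` makes the filter more conservative: the same message needs stronger evidence before it is treated as spam.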
· Introduction to Spam
    o Define the email spam problem and its scope
    o ‘Isn’t this just text categorization?’ Isn’t it solved already?
        § No!
· Techniques Spammers Use
    o Obscuring Spam
    o Getting addresses to spam
    o Delivering spam
        § Open proxies/relays, virus/Trojan
· Kinds of spam
    o Email, chat rooms, instant messenger, web popups, search engines, and more!
· Solutions to Spam
    o Filtering
        § Machine learning, matching, blackhole lists
    o ‘Postage’ systems
        § Turing tests, money, computation
    o SmartProof Approach (combination of learning with postage)
    o Stopping Outbound Spam
    o Disposable Email Addresses
    o Deployment Issues
· Machine Learning solutions
    o Evaluating spam filters
    o Building models of spam
        § Bayesian Techniques
        § Support Vector Machines
        § Rules/Trees
    o Miscellaneous (e.g. Beating machine learning solutions)
    o Personalization
· Analyzing all that data
    o Getting millions of hand labeled messages for free!
    o Analysis of the data
        § Where is spam from?
        § Can legal approaches work?
· Conclusion
    o Email and Spam as a new field of study
        § Email and spam are not just machine learning
        § Cross disciplinary solutions to similar problems
    o Spam is a huge problem
        § Will live beyond Email
    o The Spam end-game
        § Will require a variety of solutions
    o Email and Spam present tons of new problems!
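The outline lists computation as one form of 'postage'. As a minimal sketch of what a computational stamp could look like, the code below assumes a hashcash-style hash puzzle: the sender searches for a counter that gives a SHA-256 digest with a fixed number of leading zero bits, and the recipient verifies it with a single hash. The stamp layout, the function names, and the `bits` difficulty are assumptions made for illustration, not details taken from the tutorial.

```python
import hashlib
from itertools import count

def check_stamp(stamp, recipient, message_id, bits=12):
    """Verify a stamp with one hash: cheap for the recipient."""
    # The stamp must be bound to this recipient and message,
    # so it cannot be reused for other mail.
    if not stamp.startswith(f"{recipient}:{message_id}:"):
        return False
    digest = hashlib.sha256(stamp.encode("utf-8")).digest()
    # Valid when the top `bits` bits of the 256-bit digest are all zero.
    return int.from_bytes(digest, "big") >> (256 - bits) == 0

def mint_stamp(recipient, message_id, bits=12):
    """Search for a valid stamp: about 2**bits hashes on average for the sender."""
    for counter in count():
        stamp = f"{recipient}:{message_id}:{counter}"
        if check_stamp(stamp, recipient, message_id, bits):
            return stamp
```

The asymmetry is the point of the design: minting costs the sender many hash evaluations per message, while checking costs the recipient one, so the burden grows with the volume of mail sent.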
Joshua Goodman’s first job was at Dragon Systems, working on speech recognition.
(Some of his code survives to this day.)
He then went to
Geoff Hulten is a researcher in Microsoft’s Anti-Spam Technology and Strategy group.
His research focuses on learning from massive data streams and modeling
time-changing concepts. He presented at the 2004 MIT Spam Conference.