Using Compression Models to Filter Spam and Exploiting Structural Information for Document Categorization

In the first part of this talk, I will present a spam filtering method based on statistical data compression models. The nature of these models allows them to be employed as Bayesian text classifiers that operate on character sequences rather than on words. The models are fast to construct and can be updated incrementally. I will present experimental results indicating that this method performs well in comparison to established spam filters, and that it is extremely robust to noise, which should make it difficult for spammers to defeat. I will also give some examples showing that the method is capable of picking up interesting, non-trivial patterns that are indicative of spam or of legitimate mail (ham).
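
As a rough illustration of the idea, the sketch below trains one character-level model per class and assigns a message to the class whose model would encode it in the fewest bits. With equal class priors, this is the same as picking the class with the highest posterior probability. It is not the method from the talk: a fixed-order context model with add-one smoothing stands in for a real adaptive compression model, and the names CharModel, code_length, and classify, the 256-symbol alphabet, and the toy messages are all assumptions made for illustration.

```python
import math
from collections import defaultdict


class CharModel:
    """Order-k character model with add-one smoothing.

    A simplified stand-in for a statistical compression model (assumption:
    fixed context length and a 256-symbol alphabet).
    """

    def __init__(self, order=2, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size
        self.counts = defaultdict(lambda: defaultdict(int))  # context -> char -> count
        self.totals = defaultdict(int)                       # context -> total count

    def _events(self, text):
        # Pad with spaces so the first characters also have a full context.
        padded = " " * self.order + text
        for i in range(self.order, len(padded)):
            yield padded[i - self.order:i], padded[i]

    def update(self, text):
        """Incrementally add one message to the model."""
        for context, char in self._events(text):
            self.counts[context][char] += 1
            self.totals[context] += 1

    def code_length(self, text):
        """Bits an ideal coder driven by this model would spend on text."""
        bits = 0.0
        for context, char in self._events(text):
            seen = self.counts.get(context, {}).get(char, 0)
            total = self.totals.get(context, 0)
            bits -= math.log2((seen + 1) / (total + self.alphabet_size))
        return bits


def classify(text, models):
    """Return the label whose model compresses text best (equal priors assumed)."""
    return min(models, key=lambda label: models[label].code_length(text))


# Toy usage with made-up messages.
models = {"spam": CharModel(), "ham": CharModel()}
models["spam"].update("buy cheap v1agra now!!! cheap pills online")
models["ham"].update("meeting agenda attached, see you on monday")
label = classify("cheap pills now", models)
```

Because the model works on raw characters, obfuscated tokens such as "v1agra" still contribute useful context statistics, and calling update on a new message is all that is needed to keep a model current.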

The second part of this talk describes how to exploit structural information for document categorization. Classifier stacking can be used to exploit the structure of semi-structured documents for improved text categorization performance. In this approach, a meta-classifier combines the predictions of base classifiers, each trained on a different structural element of the document. I will show that this approach consistently outperforms a flat-text linear SVM on a number of standard text categorization datasets, often by a wide margin. I will also present selected nomograms that visualize the resulting meta-classifier and give interesting insight into the characteristics of the datasets.
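
The stacking scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the setup from the talk: the base learners here are small naive Bayes models and the meta-classifier is a logistic regression, both chosen only to keep the example self-contained. The class names, the field names "title" and "body", and the toy documents are hypothetical. The meta-classifier is trained on leave-one-out scores so that it does not see base predictions made on the base learners' own training data.

```python
import math
from collections import Counter


class NaiveBayesField:
    """Binary multinomial naive Bayes over the tokens of one structural field."""

    def __init__(self):
        self.word_counts = {0: Counter(), 1: Counter()}
        self.doc_counts = Counter()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
            self.doc_counts[label] += 1
        return self

    def log_odds(self, text):
        """Smoothed log-odds in favor of class 1."""
        vocab = len(set(self.word_counts[0]) | set(self.word_counts[1])) + 1
        totals = {c: sum(self.word_counts[c].values()) for c in (0, 1)}
        score = math.log((self.doc_counts[1] + 1) / (self.doc_counts[0] + 1))
        for word in text.lower().split():
            p1 = (self.word_counts[1][word] + 1) / (totals[1] + vocab)
            p0 = (self.word_counts[0][word] + 1) / (totals[0] + vocab)
            score += math.log(p1 / p0)
        return score


def out_of_fold_scores(texts, labels):
    """Leave-one-out base scores, used as training inputs for the meta-classifier."""
    scores = []
    for i in range(len(texts)):
        rest_x = texts[:i] + texts[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        scores.append(NaiveBayesField().fit(rest_x, rest_y).log_odds(texts[i]))
    return scores


def fit_meta(features, labels, epochs=500, lr=0.1):
    """Logistic regression by batch gradient descent; returns (weights, bias)."""
    weights, bias = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(features, labels):
            z = bias + sum(w * v for w, v in zip(weights, x))
            z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
            err = 1.0 / (1.0 + math.exp(-z)) - y
            grad_w = [g + err * v for g, v in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / len(labels) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(labels)
    return weights, bias


class StackedClassifier:
    """One base classifier per structural field, combined by a meta-classifier."""

    def __init__(self, fields):
        self.fields = list(fields)

    def fit(self, docs, labels):
        columns = [out_of_fold_scores([d[f] for d in docs], labels) for f in self.fields]
        meta_features = [list(row) for row in zip(*columns)]
        self.weights, self.bias = fit_meta(meta_features, labels)
        # Refit each base classifier on all of the training data for prediction.
        self.base = {f: NaiveBayesField().fit([d[f] for d in docs], labels)
                     for f in self.fields}
        return self

    def predict(self, doc):
        z = self.bias + sum(w * self.base[f].log_odds(doc[f])
                            for w, f in zip(self.weights, self.fields))
        return 1 if z > 0 else 0


# Toy usage with made-up documents: 1 = sports, 0 = software.
docs = [
    {"title": "football match", "body": "goal team"},
    {"title": "football cup", "body": "goal score"},
    {"title": "match cup", "body": "team score"},
    {"title": "python code", "body": "bug patch"},
    {"title": "python release", "body": "bug compiler"},
    {"title": "code release", "body": "patch compiler"},
]
labels = [1, 1, 1, 0, 0, 0]
model = StackedClassifier(["title", "body"]).fit(docs, labels)
prediction = model.predict({"title": "football cup", "body": "goal team"})
```

The learned meta-level weights indicate how much each structural element contributes to the final decision, which is the kind of information a nomogram of the meta-classifier makes visible.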

Speaker Details

Andrej Bratko received a B.S. in Computer Science from the University of Ljubljana in 2003. During his undergraduate studies, he worked as a developer on storage management software at Hermes SoftLab and StorScape Inc. In 2003, he left Hermes to cofound Klika Ltd, a software company specializing in mobile location-based services, which continues to be his primary occupation. He spends part of his time working as a researcher at the Department of Intelligent Systems, Jozef Stefan Institute, and is currently a Ph.D. candidate at the University of Ljubljana. His research interests include natural language processing and text mining, particularly text categorization.

Date:
Speakers:
Andrej Bratko
Affiliation:
University of Ljubljana