Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages

Dennis Fetterly, Mark Manasse, and Marc Najork

Abstract

The increasing importance of search engines to commercial web sites has given rise to a phenomenon we call "web spam", that is, web pages that exist only to mislead search engines into (mis)leading users to certain web sites. Web spam is a nuisance to users as well as search engines: users have a harder time finding the information they need, and search engines have to cope with an inflated corpus, which in turn causes their cost per query to increase. Therefore, search engines have a strong incentive to weed out spam web pages from their index. We propose that some spam web pages can be identified through statistical analysis: certain classes of spam pages, in particular those that are machine-generated, diverge in some of their properties from the properties of web pages at large. We have examined a variety of such properties, including linkage structure, page content, and page evolution, and have found that outliers in the statistical distribution of these properties are highly likely to be caused by web spam.
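
As a rough illustration of the outlier idea in the abstract, the sketch below flags values of a single per-page property that lie far from the bulk of the distribution. It is not the authors' method: the abstract does not name a specific test, so the modified z-score (median and median absolute deviation), the function name `flag_outliers`, the threshold of 3.5, and the sample out-degree figures are all assumptions chosen for illustration.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds threshold.

    Assumption: the modified z-score is used here only as a simple,
    robust stand-in for "outlier in the statistical distribution".
    """
    med = median(values)
    # Median absolute deviation: a spread estimate that extreme values
    # cannot easily inflate.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values equal the median; treat any other value
        # as an outlier.
        return [i for i, v in enumerate(values) if v != med]
    # 0.6745 rescales the MAD so scores are comparable to standard z-scores
    # for normally distributed data.
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical out-degree counts for eight pages; the last one is extreme.
out_degrees = [10, 12, 11, 9, 13, 10, 11, 500]
suspects = flag_outliers(out_degrees)
```

Here `suspects` holds the positions of pages worth a closer look. A flagged page is only a candidate, so a real pipeline would still need a separate check before treating it as spam.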

Details

Publication type: Inproceedings
Published in: 7th International Workshop on the Web and Databases (WebDB)
Publisher: Association for Computing Machinery, Inc.