Strider Search Defender: Automatic and Systematic Discovery of Search Spammers through Non-Content Analysis
Microsoft Research Strider Team, in collaboration with UCD
Technical Report: MSR-TR-2006-97
Created on May 9, 2006; Posted on July 13, 2006; Last Updated: December 12, 2006
Updated on December 12, 2006: “Strider Search Defender” has been renamed to “Strider Search Ranger”
Updated on August 5, 2006: "One company's typo is another country's natural resource" and "What do fish cams, brain injuries, and NASA's Regional Technology Transfer Centers have in common?"
Updated on July 19, 2006: www.stanford.edu/~jakef/ appears among top-3 results for Google “mastercraft furniture” search
Search engine spamming (or search spamming or web spamming [1]) refers to the practice of using questionable Search Engine Optimization (SEO) techniques to improve the ranking of a website in search engine listings. Comment spamming (or blog spamming) is a form of search spamming in which random comments, promoting links to commercial services, are automatically posted to publicly accessible forums, guest books, blogs, message boards, etc. See sample screenshots of URLs hosted on auto-review.net, webspawner.com, ripway.com, etc. being spammed at many open forums. There are now several commercial programs that automate such spamming tasks.
To make their URLs look more legitimate so that search users are more likely to click the links, many spammers create doorway pages on reputable domains and use their URLs in comment spamming. When a user clicks on a doorway-page link in search listings, her browser is instructed either to redirect to the actual target page or to fetch ad listings from it; the target page is potentially operated by the spammer. See sample screenshots of doorway-page URLs hosted on mywebpage.netscape.com, tripod.com, geocities.com, angelfire.com, hometown.aol.com, groups.yahoo.com, etc. being spammed at many open forums. Also see an example of Google spammed by “freett.com” URLs.
Many search spammers set up doorway pages on blog websites such as blogspot.com, blogstudio.com, blogdrive.com, ebloggy.com, blog4ever.com, blogspirit.com, etc. Such doorway pages are a form of spam blogs (splogs). (See screenshots of sample splogs hosted on several blog websites.) Our preliminary investigation shows that splogs hosted on blogspot.com appear to be particularly widely spammed and effective against search engines. See:
· A picture illustrating how splog doorway pages work;
· A long (partial) list of forums and guest books spammed by splog URLs hosted on blogspot.com;
· Screenshots of Google search results spammed by blogspot.com splog URLs;
· Screenshots of Yahoo! search results spammed by blogspot.com splog URLs;
· Screenshots of MSN search results spammed by blogspot.com splog URLs.
A common approach to detecting spam
web pages is through content analysis based on
classification heuristics [2,3]. In this report, we propose an orthogonal context-based approach that
uses URL-redirection analysis. Our work
was primarily motivated by two key observations:
1) Many spam pages use cloaking and redirection techniques [1,4] to serve up a different page to search-engine crawlers than will be seen by human users. A common technique is to present to the crawler some page content that will be dynamically rewritten by the browser before the page is displayed to the users. Some spammers even use obfuscated scripts to make it impossible for crawlers to figure out how the pages will be rewritten. (See examples and analysis of actual cloaking techniques used by major spammers.) Our approach is to treat each spam page as a dynamic program rather than a static page, and utilize a “monkey program” [6] to analyze the traffic resulting from visiting each page with an actual browser so that the program can be executed in full fidelity.
2) Many successful, large-scale spammers have created a huge number of doorway pages that either redirect to or fetch ads from a single domain that is responsible for serving all target pages. By identifying those domains that serve target pages for a large number of doorway pages, we can catch major spammers’ domains together with all their doorway pages and doorway domains.
We call our approach the Search Defender
approach. It consists of two steps:
1. Starting with a seed list of confirmed spam URLs, the Spam Hunter supplies them as search terms (or “link:” query terms) to search engines to locate the forums and guest books at which they were spammed, gathers additional URLs from each of these pages to grow the list, and repeats this until the list “converges”, i.e., the list no longer grows significantly after a query iteration.
The list automatically generated from the above step is only a list of “potential” spam URLs because there can be false positives. For example, some spammed forum pages may contain earlier comments from actual users that include non-spam URLs; spammers may also intentionally intersperse non-spam URLs with spam ones.
2. To filter out false positives, we feed the list of potential spam URLs to the Strider URL Tracer (which we have previously released to help trademark owners find typo-squatting domains of their websites [5]). The tracer provides a key functionality called the Top Domain view: given a list of (primary) URLs, the tracer launches an actual browser to visit each URL and records all secondary URLs visited as a result. At the end of the batched scan, the Top Domain view provides the list of third-party domains that received secondary-URL traffic and ranks them by the number of primary URLs that generated traffic to them. If the input is a list of potential spam URLs, the Top Domain view essentially highlights those target-page domains that are associated with a large number of doorway-page URLs. To further reduce false positives, we use the whitelist of legitimate ad syndicators and web-analytics servers that were heavy redirection-traffic receivers in our Strider HoneyMonkey scan of the top one million click-through URLs [6,7]. The ranked Top Domain list is then used to prioritize manual investigation. Once a third-party domain is determined to be a spammer’s domain, all doorway-page URLs associated with that domain are labeled as high-potential spam URLs.
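The ranking performed by the Top Domain view in step 2 can be illustrated with a minimal Python sketch. This is not the Strider URL Tracer's code: the function names, the trace format (a mapping from each primary URL to the secondary URLs observed during its visit), the crude two-label domain rule, and all `.example` URLs are assumptions made for illustration only.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_of(url):
    """Crude stand-in for a registered-domain lookup: keep the last two host labels.

    (Assumption: a real tool would need a public-suffix list, e.g. for .co.uk.)
    """
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def top_domain_view(traces, whitelist=frozenset()):
    """Rank third-party domains by the number of primary URLs that hit them.

    traces maps each primary URL to the secondary URLs its visit triggered.
    Returns (domain, set_of_primary_urls) pairs, most doorway pages first.
    """
    doorways = defaultdict(set)
    for primary, secondaries in traces.items():
        own = site_of(primary)
        for secondary in secondaries:
            domain = site_of(secondary)
            # Skip same-site traffic and whitelisted ad/analytics servers.
            if domain and domain != own and domain not in whitelist:
                doorways[domain].add(primary)
    return sorted(doorways.items(), key=lambda kv: (-len(kv[1]), kv[0]))


# Hypothetical traces; none of these URLs come from the report.
traces = {
    "http://a.hosting.example/": [
        "http://a.hosting.example/style.css",
        "http://target.example/ads?id=1",
        "http://stats.analytics.example/hit",
    ],
    "http://b.hosting.example/": ["http://target.example/ads?id=2"],
    "http://c.hosting.example/": ["http://other.example/landing"],
}
ranked = top_domain_view(traces, whitelist={"analytics.example"})
for domain, primaries in ranked:
    print(domain, len(primaries))
```

Once an investigator confirms that a ranked domain belongs to a spammer, the set of primary URLs stored next to it is exactly the group of doorway pages to label as high-potential spam.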
Our Search Defender approach has two
desirable properties that naturally turn the spammers’ spamming activities
against themselves:
1. The more widely spammed a URL is, the easier it is for the spam hunter to find it. Once a spammed forum is identified, it becomes a “HoneyForum” that can be used to capture new spam URLs in new comment postings. Ideally, since there is a delay between spamming and its effect on search engine results, our spam hunter should be able to identify new spam URLs and notify the search engine before the URLs enter top search results.
2. The more doorway pages a spammer creates, the higher its target-page domain is placed on the Top Domain list for investigation.
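The Spam Hunter's iterative list growth, which underlies the first property, might look roughly like the following Python sketch. The two callbacks stand in for a search-engine query and a page scraper; they, the 1% growth threshold, the round limit, and the toy forum data are all assumptions, since the report does not specify how “converges” is measured.

```python
def hunt_spam_urls(seed_urls, find_pages_mentioning, extract_urls,
                   min_growth=0.01, max_rounds=10):
    """Grow a list of potential spam URLs until it stops growing significantly.

    find_pages_mentioning(url) -> pages (forums, guest books) that list the URL
    extract_urls(page)         -> URLs posted on that page
    Both callbacks are hypothetical placeholders supplied by the caller.
    """
    known = set(seed_urls)
    frontier = set(seed_urls)
    honey_forums = set()
    for _ in range(max_rounds):
        discovered = set()
        for url in frontier:
            for page in find_pages_mentioning(url):
                if page in honey_forums:
                    continue  # already harvested in an earlier step
                honey_forums.add(page)
                discovered.update(extract_urls(page))
        new_urls = discovered - known
        growth = len(new_urls) / len(known) if known else 0.0
        known |= new_urls
        frontier = new_urls
        if growth < min_growth:  # the list has "converged"
            break
    return known, honey_forums


# Toy in-memory "web": page name -> URLs posted there (all invented).
forum_posts = {
    "forum1": ["http://spam-a.example/", "http://spam-b.example/"],
    "forum2": ["http://spam-b.example/", "http://spam-c.example/",
               "http://legit.example/"],
}


def find_pages(url):
    return [page for page, urls in forum_posts.items() if url in urls]


def urls_on(page):
    return forum_posts[page]


known, forums = hunt_spam_urls(["http://spam-a.example/"], find_pages, urls_on)
print(sorted(known))
print(sorted(forums))
```

In this toy run the non-spam URL posted on the second forum is swept into the result as well, which is the kind of false positive the URL Tracer step is meant to filter out. The returned set of pages corresponds to the “HoneyForums” that can be revisited for new postings.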
Case Study #1: Analysis of Blogspot Spammers
Given over 17,000 blogspot URLs collected by the spam hunter, the URL Tracer identified these top-25 target-page domains that are behind a large number of blogspot splogs. The top six are particularly active: s-e-arch.com, speedsearcher.net, abcsearcher.com, eash.info, paysefeed.net, and veryfastsearch.com, which collectively were responsible for approximately 45% of the blogspot URLs. Screenshots of what the target pages look like and where their doorway URLs are spammed are shown here. In addition, we found that hundreds of these splogs generated traffic to googlesyndication.com (see an example). The “Fighting Splog” blog at http://fightsplog.blogspot.com provides a more comprehensive analysis of splogs that serve AdSense ads.
Case Study #2: Analysis of Blog4ever Spammers
Given 5,505 blog4ever
URLs collected by the spam hunter, the URL Tracer identified 5,363 of them that
fetched Google AdSense ads from googlesyndication.com. All of them
included the client ID “ca-pub-6785940031399100”
in the ads-fetching URLs and are most likely owned by the same spammer. See
full report here.
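Grouping doorway pages by the client ID found in their ads-fetching URLs can be sketched in a few lines of Python. The sketch assumes that the ID travels in a `client` query parameter and that traces are available as a mapping from primary URL to secondary URLs; the report states neither, and the URLs and client IDs below are placeholders rather than data from the study.

```python
from collections import defaultdict
from urllib.parse import parse_qs, urlsplit


def group_by_ad_client(traces, ad_domain="googlesyndication.com"):
    """Map each ad client ID to the primary URLs whose visits fetched ads with it.

    Assumption: the ID appears as the "client" query parameter of requests
    sent to ad_domain or one of its subdomains.
    """
    clients = defaultdict(set)
    for primary, secondaries in traces.items():
        for secondary in secondaries:
            parts = urlsplit(secondary)
            host = parts.hostname or ""
            if host == ad_domain or host.endswith("." + ad_domain):
                for client_id in parse_qs(parts.query).get("client", []):
                    clients[client_id].add(primary)
    return dict(clients)


# Placeholder traces and client IDs, invented for illustration.
ads = "http://pagead2.googlesyndication.com/pagead/ads"
traces = {
    "http://one.blog.example/": [ads + "?client=ca-pub-0000000000000001&format=728x90"],
    "http://two.blog.example/": [ads + "?client=ca-pub-0000000000000001"],
    "http://three.blog.example/": [ads + "?client=ca-pub-0000000000000002"],
    "http://four.blog.example/": ["http://four.blog.example/img.png"],
}
groups = group_by_ad_client(traces)
for client_id, pages in sorted(groups.items(), key=lambda kv: -len(kv[1])):
    print(client_id, len(pages))
```

A single ID shared by thousands of otherwise unrelated blogs, as in this case study, is what suggests common ownership.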
Case Study #3: Analysis of Blogstudio Spammers
Given over 2,400 blogstudio URLs collected by the spam hunter, the URL Tracer identified two redirection target domains that are behind all these splogs: casino-web-search.com and finance-web-search.com. See full report here.
Case Study #4: Analysis of Proboards Spammers
Given over 1,300 proboards
URLs collected by the spam hunter, the URL Tracer identified these top-8 target-page domains. The #1 paysefeed.net and #7 s-e-arch.com are also #5 and #1
on the blogspot list, respectively.
Case Study #5: Analysis of the “Money Spammers”
Given the hundreds of money-related, non-splog URLs collected by the spam hunter that contain keywords like “credit”, “loan”, “mortgage”, “insurance”, “finance”, “cash”, etc., the URL Tracer identified five redirection target domains that are behind a large number of doorway domains: finance-4u.com, finance-portal-4u.com, bankersnationalfinancial.com, finance-portal-online.com, and 1placeloan.com. See full report here.
Case Study #6: Analysis of the “.be Spammers”
Search Defender has found 3,854 doorway pages hosted on 109 .be doorway domains, all of which fetch ads from the target domain rills.be. See screenshots and the full list here.
Discussions
We are in the process of fully
automating Strider Search Defender. The main purpose of releasing this
preliminary study is to raise awareness of this growing problem by providing a
systematic analysis and proposing a solution so that the web community can
start working together to combat this problem. First, we urge owners of blog sites and
free hosting sites to actively monitor their websites to detect abuse.
Similarly, advertisement syndicators can detect potential spammers by
monitoring those customers who serve ads on a huge number of different URLs
through a single account because it is highly unlikely that anyone can generate
quality content at that scale. Second, although the content on some spam pages
may actually have decent relevance, we urge search engines to consider removing
such pages so as not to encourage web spamming. Third, we urge owners of
publicly accessible forums (and guest books, etc.) to run a local search for “blogspot.com” and other spam-related
domain names reported on this page to see if their forums have been abused and
should be protected. For example, searching for “blogspot.com” at http://www.stat.ucla.edu/forums/search.php?f=325,
or searching for “funpic.org”, or “yoll.net”,
or “freett.com”, or “fc2.com” at http://coolplayer.sourceforge.net/phorum/search.php?f=2
would generate a large number of hits.
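For a forum owner with direct access to stored posts, the same local search could be scripted. The following Python sketch assumes posts are available as a mapping from post ID to text; the function name, the URL pattern, and the sample posts are invented for illustration, while the domain list is taken from the names mentioned above.

```python
import re
from urllib.parse import urlsplit

# Simple URL matcher; brackets and quotes are excluded to keep parsing robust.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>\[\]]+", re.IGNORECASE)


def find_suspect_links(posts, suspect_domains):
    """Return (post_id, url) pairs whose host falls under a suspect domain."""
    suspects = {d.lower() for d in suspect_domains}
    hits = []
    for post_id, text in posts.items():
        for url in URL_PATTERN.findall(text):
            host = (urlsplit(url).hostname or "").lower().rstrip(".")
            if any(host == d or host.endswith("." + d) for d in suspects):
                hits.append((post_id, url))
    return hits


# Invented sample posts.
posts = {
    1: "Great site! http://spam-example.blogspot.com/ and http://x.freett.com/page",
    2: "See http://www.example.org/docs for details",
}
domains = ["blogspot.com", "funpic.org", "yoll.net", "freett.com", "fc2.com"]
hits = find_suspect_links(posts, domains)
for post_id, url in hits:
    print(post_id, url)
```

A hit is only a lead for manual review: these hosting domains also serve legitimate pages, so matching posts should be inspected rather than deleted automatically.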
Finally, in some cases, the owners of
the target-page domains may not be directly involved in the spamming activities
of the doorway pages that redirect to them; their “affiliates” may be the ones
who are actually performing the spamming. We urge the owners of such
target-page domains to adopt stricter rules that prohibit their affiliates
from using spamming techniques to draw traffic.
References
[1] Z. Gyongyi and H. Garcia-Molina, “Web Spam Taxonomy,” in the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005.
[2] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly, “Detecting Spam Web Pages through Content Analysis,” in Proc. International World Wide Web Conference (WWW), 2006.
[3] “SVMs for the Blogosphere: Blog Identification and Splog Detection,” AAAI Spring Symposium on Computational Approaches to Analysing Weblogs, March 2006.
[4] B. Wu and B. D. Davison, “Cloaking and Redirection: A Preliminary Study,” in the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), May 2005.
[5]
[6]
[7] B. Edelman and H. Rosenbaum of SiteAdvisor, “The Safety of Internet Search Engines,” May 12, 2006.
Other Related Links and Papers
· Fighting Splog: http://fightsplog.blogspot.com.
· SplogSpot, http://splogspot.com/.
· Fight Splog!, http://www.fightsplog.com/.
· Spamhuntress, http://spamhuntress.com/.
· Splog Reporter, http://www.splogreporter.com/.
· “Spamdexing,” http://en.wikipedia.org/wiki/Spamdexing.
· WebLogs.com, http://weblogs.com/.
· Ping-o-matic, http://pingomatic.com/.
· Automattic Kismet (Akismet for short), http://akismet.com/.
· Spam Karma anti-spam plugin for WordPress.
· Spam ping (Sping) and TrackBack.
· “PR0 - Google's PageRank 0 Penalty,” http://en.pr10.info/pagerank0-badrank/.
· Web Directories, Reverse Google Lookups, Link Farms, Splogs, and Scraper Sites, http://www.nowpublic.com/web_directories_reverse_google_lookups_link_farms_splogs_and_scraper_sites.
· Welcome to the Splogosphere: 75% of new pings are spings (splogs), Ebiquity Group, UMBC.
· Automated spam classifying algorithms keep spam blogs out of NextBlog.
· Ryan Naraine, “Blog Spammers Take Aim at Google,” October 18, 2005.
· L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank Citation Ranking: Bringing Order to the Web,” Technical Report,
· Z. Gyongyi, H. Garcia-Molina, and J. Pedersen, “Combating Web Spam with TrustRank,” in Proc. of the 30th VLDB Conference, 2004.
· N. Eiron, K. S. McCurley, and J. A. Tomlin, “Ranking the Web Frontier,” in Proc. International World Wide Web Conference (WWW),
· A. Benczur, K. Csalogany, T. Sarlos, and M. Uher, “SpamRank – Fully Automatic Link Spam Detection,” in the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), May 2005.
· B. Wu and B. D. Davison, “Identifying Link Farm Pages,” in Proc. International World Wide Web Conference (WWW), 2005.
· B. Wu and B. D. Davison, “Detecting Semantic Cloaking on the Web,” in Proc. International World Wide Web Conference (WWW), 2006.
