Strider Search Defender: Automatic and Systematic Discovery of Search Spammers through Non-Content Analysis

 

Microsoft Research Strider Team, in collaboration with UCD

 

Technical Report: MSR-TR-2006-97

Created on May 9, 2006; Posted on July 13, 2006; Last Updated: December 12, 2006

 

Updated on December 12, 2006: “Strider Search Defender” has been renamed to “Strider Search Ranger”.

 

Updated on August 5, 2006: "One company's typo is another country's natural resource" and "What do fish cams, brain injuries, and NASA's Regional Technology Transfer Centers have in common?"

 

Updated on July 19, 2006: www.stanford.edu/~jakef/ appears among top-3 results for Google “mastercraft furniture” search

 

 

Search engine spamming (or search spamming or web spamming [1]) refers to the practice of using questionable Search Engine Optimization (SEO) techniques to improve the ranking of a website in search engine listings. Comment spamming (or blog spamming) is a form of search spamming in which random comments, promoting links to commercial services, are automatically posted to publicly-accessible forums, guest books, blogs, message boards, etc. See sample screenshots of URLs hosted on auto-review.net, webspawner.com, ripway.com, etc. being spammed at many open forums. There are now several commercial programs that automate such spamming tasks.

To make their URLs look more legitimate so that search users are more likely to click the links, many spammers create doorway pages on reputable domains and use their URLs in comment spamming. When a user clicks on a doorway-page link in search listings, her browser is instructed to either redirect to, or fetch ad listings from, the actual target page, potentially operated by the spammer. See sample screenshots of doorway-page URLs hosted on mywebpage.netscape.com, tripod.com, geocities.com, angelfire.com, hometown.aol.com, groups.yahoo.com, etc. being spammed at many open forums. Also see an example of Google spammed by “freett.com” URLs.

Many search spammers set up doorway pages on blog websites such as blogspot.com, blogstudio.com, blogdrive.com, ebloggy.com, blog4ever.com, blogspirit.com, etc. Such doorway pages are a form of spam blogs (splogs). (See screenshots of sample splogs hosted on several blog websites.) Our preliminary investigation shows that splogs hosted on blogspot.com appear to be particularly widely spammed and effective against search engines: see

*      A picture illustrating how splog doorway pages work;

*      A long (partial) list of forums and guest books spammed by splog URLs hosted on blogspot.com;

*      Screenshots of Google search results spammed by blogspot.com splog URLs;

*      Screenshots of Yahoo! search results spammed by blogspot.com splog URLs;

*      Screenshots of MSN search results spammed by blogspot.com splog URLs.

A common approach to detecting spam web pages is through content analysis based on classification heuristics [2,3]. In this report, we propose an orthogonal context-based approach that uses URL-redirection analysis. Our work was primarily motivated by two key observations:

1)     Many spam pages use cloaking and redirection techniques [1,4] to serve up a different page to search-engine crawlers than will be seen by human users. A common technique is to present to the crawler some page content that will be dynamically rewritten by the browser before the page is displayed to the users. Some spammers even use obfuscated scripts to make it impossible for crawlers to figure out how the pages will be rewritten. (See examples and analysis of actual cloaking techniques used by major spammers.) Our approach is to treat each spam page as a dynamic program rather than a static page, and utilize a “monkey program” [6] to analyze the traffic resulting from visiting each page with an actual browser so that the program can be executed in full fidelity.

2)     Many successful, large-scale spammers have created a huge number of doorway pages that either redirect to or fetch ads from a single domain that is responsible for serving all target pages. By identifying those domains that serve target pages for a large number of doorway pages, we can catch major spammers’ domains together with all their doorway pages and doorway domains.

We call our approach the Search Defender approach. It consists of two steps:

1.      Starting with a seed list of confirmed spam URLs, the Spam Hunter supplies them as search terms (or “link:” query terms) to search engines to locate the forums and guest books at which they were spammed, gathers additional URLs from each of these pages to grow the list, and does this iteratively until the list “converges”, i.e., the list no longer grows significantly after a query iteration.

The list automatically generated from the above step is only a list of “potential” spam URLs because there can be false positives. For example, some spammed forum pages may contain earlier comments from actual users that include non-spam URLs; spammers may intentionally intersperse non-spam URLs with spam ones.

2.      To filter out false positives, we feed the list of potential spam URLs to the Strider URL Tracer (which we have previously released to help trademark owners find typo-squatting domains of their websites [5]). The tracer provides a key functionality called the Top Domain view: given a list of (primary) URLs, the tracer launches an actual browser to visit each URL and records all secondary URLs visited as a result. At the end of the batched scan, the Top Domain view lists the third-party domains that received secondary-URL traffic and ranks them by the number of primary URLs that generated traffic to them. If the input is a list of potential spam URLs, the Top Domain view essentially highlights those target-page domains that are associated with a large number of doorway-page URLs. To further reduce false positives, we use a whitelist of legitimate ad syndicators and web-analytics servers that were heavy redirection-traffic receivers in our Strider HoneyMonkey scan of the top one million click-through URLs [6,7]. The ranked Top Domain list is then used to prioritize manual investigation. Once a third-party domain is determined to be a spammer’s domain, all doorway-page URLs associated with that domain are labeled as high-potential spam URLs.
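The ranking performed by the Top Domain view can be illustrated with a small Python sketch. It is not the Strider URL Tracer itself: it assumes the browser traces have already been collected into a dictionary from each primary URL to the secondary URLs it caused, and the names top_domain_view, host_of, and site_of are hypothetical. Treating the last two host labels as the "domain" is a simplification that is wrong for suffixes such as .co.uk.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercase host name of a URL ('' if it has none)."""
    return (urlsplit(url).hostname or "").lower()


def site_of(host):
    """Crude stand-in for a registered domain: the last two host labels."""
    return ".".join(host.split(".")[-2:])


def top_domain_view(traces, whitelist=frozenset()):
    """Rank third-party domains by how many primary URLs sent traffic to them.

    traces: dict mapping each primary URL to the list of secondary URLs
            recorded while visiting it (assumed input format).
    whitelist: domains of known legitimate ad syndicators and
            web-analytics servers, which are left out of the ranking.
    Returns a list of (domain, sorted primary URLs), busiest domain first.
    """
    primaries_by_domain = defaultdict(set)
    for primary, secondaries in traces.items():
        primary_site = site_of(host_of(primary))
        for secondary in secondaries:
            domain = site_of(host_of(secondary))
            # Skip unparsable URLs, first-party traffic, and whitelisted receivers.
            if not domain or domain == primary_site or domain in whitelist:
                continue
            primaries_by_domain[domain].add(primary)
    # Sort by descending number of primary URLs, then by name for stable output.
    ranked = sorted(primaries_by_domain.items(),
                    key=lambda item: (-len(item[1]), item[0]))
    return [(domain, sorted(urls)) for domain, urls in ranked]
```

Each primary URL is counted at most once per domain, so a doorway page that issues many requests to the same target does not inflate that target's rank. The list of primary URLs kept for each domain is what would later be labeled as high-potential spam once the domain is confirmed.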

Our Search Defender approach has two desirable properties that naturally turn the spammers’ spamming activities against themselves:

1.      The more widely spammed a URL is, the easier it is for the spam hunter to find it. Once a spammed forum is identified, it becomes a “HoneyForum” that can be used to capture new spam URLs in new comment postings. Ideally, since there is a delay between spamming and its effect on search engine results, our spam hunter should be able to identify new spam URLs and notify the search engine before the URLs enter top search results.

2.      The more doorway pages a spammer creates, the higher priority its target-page domain is placed on the Top Domain list for investigation.
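The iterative list-growing step of the Spam Hunter, on which the first property relies, might look roughly like the Python sketch below. The search-engine query and the page scraping are passed in as functions (find_spammed_pages and extract_urls, both hypothetical), and "converges" is assumed here to mean that a round adds less than a fixed fraction of the current list; the report does not specify the actual threshold.

```python
def grow_spam_list(seed_urls, find_spammed_pages, extract_urls,
                   min_growth=0.01, max_rounds=10):
    """Iteratively expand a seed list of spam URLs until it stops growing.

    find_spammed_pages(url): returns the forum or guest-book pages on which
        `url` appears (a stand-in for a search-engine query).
    extract_urls(page): returns the URLs posted on that page.
    min_growth: stop once a round adds less than this fraction of the list
        (an assumed definition of convergence).
    Returns (potential spam URLs, spammed pages visited).
    """
    potential = set(seed_urls)
    frontier = set(seed_urls)   # URLs not yet used as query terms
    seen_pages = set()          # candidate "HoneyForums"
    for _ in range(max_rounds):
        if not frontier:
            break
        size_before = len(potential)
        discovered = set()
        for url in frontier:
            for page in find_spammed_pages(url):
                if page in seen_pages:
                    continue
                seen_pages.add(page)
                discovered.update(extract_urls(page))
        # Only URLs that are new to the list become next round's queries.
        frontier = discovered - potential
        potential |= frontier
        if len(potential) - size_before < min_growth * size_before:
            break
    return potential, seen_pages
```

The returned set of pages is the natural pool of HoneyForums to revisit for new comment postings, and the returned URLs are only potential spam URLs until the redirection analysis filters out false positives.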

 

Case Study #1: Analysis of Blogspot Spammers

Given over 17,000 blogspot URLs collected by the spam hunter, the URL Tracer identified these top-25 target-page domains that are behind a large number of blogspot splogs. The top six are particularly active: s-e-arch.com, speedsearcher.net, abcsearcher.com, eash.info, paysefeed.net, and veryfastsearch.com, which collectively were responsible for approximately 45% of the blogspot URLs. Screenshots of what the target pages look like and where their doorway URLs are spammed are shown here. In addition, we found that hundreds of these splogs generated traffic to googlesyndication.com (see an example). The “Fighting Splog” blog at http://fightsplog.blogspot.com provides a more comprehensive analysis of splogs that serve AdSense ads.

 

Case Study #2: Analysis of Blog4ever Spammers

Given 5,505 blog4ever URLs collected by the spam hunter, the URL Tracer identified 5,363 of them that fetched Google AdSense ads from googlesyndication.com. All of them included the client ID “ca-pub-6785940031399100” in the ads-fetching URLs and are most likely owned by the same spammer. See full report here.
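Grouping doorway pages by the client ID found in their ads-fetching URLs can be sketched in Python as follows. This is an illustration only: it assumes the traces are available as a dictionary from each primary URL to its secondary URLs, and that the client ID is carried in a query parameter named "client"; the function name group_by_ad_client is hypothetical.

```python
from collections import defaultdict
from urllib.parse import parse_qs, urlsplit


def group_by_ad_client(traces, ad_domain="googlesyndication.com",
                       param="client"):
    """Map each ad client ID to the primary URLs that fetched ads with it.

    traces: dict mapping each primary URL to the secondary URLs recorded
            while visiting it (assumed input format).
    ad_domain: domain of the ad syndicator whose requests are inspected.
    param: query parameter assumed to carry the client ID.
    """
    pages_by_client = defaultdict(set)
    for primary, secondaries in traces.items():
        for secondary in secondaries:
            parts = urlsplit(secondary)
            host = (parts.hostname or "").lower()
            # Accept the ad domain itself and any of its subdomains.
            if host != ad_domain and not host.endswith("." + ad_domain):
                continue
            for client_id in parse_qs(parts.query).get(param, []):
                pages_by_client[client_id].add(primary)
    return dict(pages_by_client)
```

A client ID that maps to thousands of primary URLs on the same hosting site is a strong hint that those pages share one owner, which is the pattern reported in this case study.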

 

Case Study #3: Analysis of Blogstudio Spammers

Given over 2,400 blogstudio URLs collected by the spam hunter, the URL Tracer identified two redirection target domains that are behind all these splogs: casino-web-search.com and finance-web-search.com. See full report here.

 

Case Study #4: Analysis of Proboards Spammers

Given over 1,300 proboards URLs collected by the spam hunter, the URL Tracer identified these top-8 target-page domains. The #1 paysefeed.net and #7 s-e-arch.com are also #5 and #1 on the blogspot list, respectively.

 

Case Study #5: Analysis of the “Money Spammers”

Given the hundreds of money-related, non-splog URLs collected by the spam hunter that contain keywords like “credit”, “loan”, “mortgage”, “insurance”, “finance”, “cash”, etc., the URL Tracer identified five redirection target domains that are behind a large number of doorway domains: finance-4u.com, finance-portal-4u.com, bankersnationalfinancial.com, finance-portal-online.com, and 1placeloan.com. See full report here. 

 

Case Study #6: Analysis of the “.be Spammers”

Search Defender has found 3,854 doorway pages hosted on 109 .be doorway domains, all of which fetch ads from the target domain rills.be. See screenshots and the full list here.

 

Discussions

We are in the process of fully automating Strider Search Defender. The main purpose of releasing this preliminary study is to raise awareness of this growing problem by providing a systematic analysis and proposing a solution, so that the web community can start working together to combat it. First, we urge owners of blog sites and free hosting sites to actively monitor their websites to detect abuse. Similarly, advertisement syndicators can detect potential spammers by monitoring customers who serve ads on a huge number of different URLs through a single account, because it is highly unlikely that anyone can generate quality content at that scale. Second, although the content on some spam pages may actually have decent relevance, we urge search engines to consider removing such pages so as not to encourage web spamming. Third, we urge owners of publicly accessible forums (and guest books, etc.) to do a local search for “blogspot.com” and the other spam-related domain names reported on this page to see whether their forums have been abused and should be protected. For example, searching for “blogspot.com” at http://www.stat.ucla.edu/forums/search.php?f=325, or searching for “funpic.org”, “yoll.net”, “freett.com”, or “fc2.com” at http://coolplayer.sourceforge.net/phorum/search.php?f=2 would generate a large number of hits.

Finally, in some cases, the owners of the target-page domains may not be directly involved in the spamming activities of the doorway pages that redirect to them; their “affiliates” may be the ones who are actually performing the spamming. We urge the owners of such target-page domains to adopt stronger rules that prohibit their affiliates from using spamming techniques to draw traffic.

 

References

[1] Z. Gyongyi and H. Garcia-Molina,  “Web Spam Taxonomy,”  in the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005.

[2] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly, “Detecting Spam Web Pages through Content Analysis,” in Proc. International World Wide Web Conference (WWW), 2006.

[3] P. Kolari, T. Finin, and A. Joshi, “SVMs for the Blogosphere: Blog Identification and Splog Detection,” AAAI Spring Symposium on Computational Approaches to Analysing Weblogs, March 2006.

[4] Baoning Wu and Brian D. Davison, “Cloaking and Redirection: A Preliminary Study,” in the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), May 2005.

[5] Yi-Min Wang, Doug Beck, Jeffrey Wang, Chad Verbowski, and Brad Daniels, “Strider Typo-Patrol: Discovery and Analysis of Systematic Typo-Squatting,” in Proc. 2nd Workshop on Steps to Reducing Unwanted Traffic on the Internet (SRUTI), July 2006 (also see project home page at http://research.microsoft.com/URLTracer).

[6] Yi-Min Wang, Doug Beck, Xuxian Jiang, Roussi Roussev, Chad Verbowski, Shuo Chen, and Sam King, “Automated Web Patrol with Strider HoneyMonkeys: Finding Web Sites That Exploit Browser Vulnerabilities,” in Proc. Network and Distributed System Security (NDSS) Symposium, February 2006 (also see project home page at http://research.microsoft.com/HoneyMonkey).

[7] Ben Edelman and Hannah Rosenbaum of SiteAdvisor, “The Safety of Internet Search Engines,” May 12, 2006.

 

Other Related Links and Papers

·         Fighting Splog: http://fightsplog.blogspot.com.

·         SplogSpot, http://splogspot.com/.

·         Fight Splog!, http://www.fightsplog.com/.

·         Spamhuntress, http://spamhuntress.com/.

·         Splog Reporter, http://www.splogreporter.com/.

·         “Spamdexing,” http://en.wikipedia.org/wiki/Spamdexing. 

·         WebLogs.com, http://weblogs.com/.

·         Ping-o-matic, http://pingomatic.com/.

·         Automattic Kismet (Akismet for short), http://akismet.com/.

·         Spam Karma anti-spam plugin for WordPress.      

·         Spam ping (Sping) and TrackBack.

·          “PR0 - Google's PageRank 0 Penalty,” http://en.pr10.info/pagerank0-badrank/.

·         Web Directories, Reverse Google Lookups, Link Farms, Splogs, and Scraper Sites, http://www.nowpublic.com/web_directories_reverse_google_lookups_link_farms_splogs_and_scraper_sites.

·         Welcome to the Splogosphere: 75% of new pings are spings (splogs), Ebiquity Group, UMBC.

·         Automated spam classifying algorithms keep spam blogs out of NextBlog.

·         Ryan Naraine, “Blog Spammers Take Aim at Google,” October 18, 2005.

·         L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank Citation Ranking: Bringing Order to the Web,” Technical Report, Stanford University, 1998.

·         Z. Gyongyi, H. Garcia-Molina, and J. Pedersen, “Combating Web Spam with TrustRank”, in Proc. of the 30th VLDB Conference, 2004.

·         N. Eiron, K. S. McCurley, and J. A. Tomlin, “Ranking the Web Frontier,” in Proc. International World Wide Web Conference (WWW), New York, 2004.

·         A. Benczur, K. Csalogany, T. Sarlos, and M. Uher, “SpamRank – Fully Automatic Link Spam Detection,” in the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), May 2005.

·         B. Wu and B. D. Davison, “Identifying Link Farm Pages,” in Proc. International World Wide Web Conference (WWW), 2005.

·         B. Wu and B. D. Davison, “Detecting Semantic Cloaking on the Web,” in Proc. International World Wide Web Conference (WWW), 2006.

 

How Splog Doorway Pages Work