MSRBot

A web crawler, also known as a spider, wanderer, bot, or robot, is a program that downloads a seed page from the World Wide Web, extracts the links contained in that page, then downloads the pages those links refer to, extracts the links in those pages, and so on.

Overview

We are using the MSRBot web crawler to collect data from the web for further study.

Contact Information

You can contact us at msrbot@microsoft.com.

Frequently Asked Questions


  • What is a web crawler? A web crawler, also known as a spider, wanderer, bot, or robot, is a program that downloads a seed page from the World Wide Web, extracts the links contained in that page, then downloads the pages those links refer to, extracts the links in those pages, and so on.
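    As an illustrative sketch only (this is not MSRBot's actual implementation, which also honours robots.txt, politeness delays, and re-crawl policies), the basic crawl loop can be written in a few lines of Python:

      from collections import deque
      import re
      from urllib.parse import urljoin
      from urllib.request import urlopen

      def crawl(seed_url, max_pages=100):
          # Start from the seed page, then repeatedly fetch pages and extract links.
          frontier = deque([seed_url])
          seen = {seed_url}
          while frontier and len(seen) < max_pages:
              url = frontier.popleft()
              try:
                  html = urlopen(url).read().decode("utf-8", errors="replace")
              except OSError:
                  continue  # skip pages that cannot be downloaded
              # Crude link extraction; a real crawler would use an HTML parser.
              for href in re.findall(r'href=["\']([^"\']+)["\']', html):
                  link = urljoin(url, href)
                  if link.startswith("http") and link not in seen:
                      seen.add(link)
                      frontier.append(link)
          return seen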
  • Why is MSRBot trying to download incorrect links from my server? Or from a server that doesn't exist? Because MSRBot obtains the list of links to crawl by extracting them from documents on the web, an incorrect link must exist somewhere on the web. To determine the location of these links, look at the referrer field in your web server log.
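    For example, in a web server log in the widely used Apache combined format (an illustrative sketch; the addresses, paths, and user-agent string are placeholders), the next-to-last quoted field is the referring page that contained the bad link:

      10.0.0.1 - - [12/Mar/2004:13:55:36 -0700] "GET /no-such-page.html HTTP/1.0" 404 209 "http://www.example.com/page-with-bad-link.html" "msrbot"

    Here http://www.example.com/page-with-bad-link.html is the page whose author should be asked to correct the link.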
  • Why is MSRBot trying to access a file called robots.txt? robots.txt is a standard document that can tell MSRBot not to download some or all information from your web server. Please see the answer to the next question for more details on the robots.txt file.
  • How do I stop MSRBot from crawling my site? Place a robots.txt file in the top-level directory of your web server; its format is specified in the Robot Exclusion Standard. When deciding which pages to crawl on a particular host, MSRBot will obey the first record in the robots.txt file with a User-Agent starting with "msrbot". If no such record exists, it will obey the first record with a User-Agent of "*".
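    For example (an illustrative sketch; the path is a placeholder), the following robots.txt keeps MSRBot out of a /private/ directory while leaving the rest of the site, and all other robots, unrestricted:

      User-agent: msrbot
      Disallow: /private/

      User-agent: *
      Disallow:

    To exclude MSRBot from the entire site, use "Disallow: /" in the msrbot record instead.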
  • How can I prevent crawling if I can't create a robots.txt file? There is another standard for telling robots not to index a web page or follow links on it, which may be more helpful in some cases, since it can be applied on a page-by-page basis. It involves placing a "META" element in the head of an HTML page; the robots META tag is documented at http://www.robotstxt.org, and the HTML standard also describes these tags.
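    For example (an illustrative sketch), a page that should not be indexed would carry a META element such as the following inside its HTML head:

      <html>
        <head>
          <title>Example page</title>
          <meta name="robots" content="noindex">
        </head>
        ...
      </html>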
  • I have edited my robots.txt file. Why are you still crawling my pages? MSRBot downloads the robots.txt file once a day, so it may take a while for MSRBot to learn about changes that you have made to the robots.txt file on your server. You may also want to ensure that the syntax of your robots.txt file is correct; please check the standard at http://www.robotstxt.org/wc/exclusion.html#robotstxt. Also ensure that the robots.txt file is in the top-level directory of your server. Placing the file in a subdirectory will not have any effect on how MSRBot crawls your site.
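    For example, MSRBot looks for the file at http://www.example.com/robots.txt (a placeholder host name); a copy placed at http://www.example.com/mysite/robots.txt will simply be ignored.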
  • How can I prevent MSRBot from following links from a particular page? MSRBot obeys the noindex and nofollow meta-tags. If you place these tags in the head of your HTML document, you can cause MSRBot not to index or not to follow specific documents on your site. The tags to include and their effects are:
    • <meta name="robots" content="noindex">: MSRBot will retrieve the document, but it will not index the document.
    • <meta name="robots" content="nofollow">: MSRBot will not follow any links that are present on the page to other documents.
    The "robots" tag is obeyed by many different web robots. If you'd like to specify some of these restrictions only for MSRBot, you may use "msrbot" in place of "robots". You can also combine these tags into a single meta tag. For example:
  • How often will MSRBot access a page from my web server? In general, MSRBot should not try to access your site more than once every few seconds. MSRBot also accounts for the time it takes to download a page from a site, so that if your site has a slower connection we will not access it as frequently. If you find that we are placing too high a load on your site, please let us know by sending e-mail to msrbot@microsoft.com.
  • How can I report problems to you?

    Please email us at msrbot@microsoft.com. Please include:
    • The URLs of the pages MSRBot has erroneously fetched.
    • If possible, the relevant lines of your web server log.
    • The method of robots exclusion you are using.
    • A valid return address, so we can get back to you.