MSRBot
We are using the MSRBot web crawler to collect data from the web for further study.
You can contact us at msrbot@microsoft.com.
- What is a web crawler?
A web crawler, also known as a
spider, wanderer, bot, or robot, is a
program that downloads a seed page
from the World Wide Web, extracts the
links contained in that page, then
downloads the pages those links refer
to, extracts the links in those pages,
and so on.
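The loop described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not MSRBot's actual code: the `crawl` function, the `fetch` callable, and the in-memory `PAGES` dictionary that stands in for real HTTP downloads are all hypothetical names chosen for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl starting at the seed URL.

    fetch is a hypothetical callable that returns the HTML for a URL,
    or None if the page cannot be downloaded.
    """
    seen = {seed}
    queue = deque([seed])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # broken link: nothing to extract
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# A tiny in-memory "web" (assumed data) instead of real downloads.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/missing">broken link</a>',
}

print(crawl("http://example.com/", PAGES.get))
```

Note that the crawler still requests `/missing`, because a page on the web links to it, even though no such page exists.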
- Why is MSRBot trying to download
incorrect links from my server? Or
from a server that doesn't exist?
Because MSRBot obtains the list of
links to crawl by extracting them from
documents on the web, there must be an
incorrect link available on the web.
To determine the location of this
link, look at the referrer field in
your web server log.
- Why is MSRBot trying to access a
file called robots.txt?
robots.txt is a standard document
that can tell MSRBot not to download
some or all information from your web
server. Please see the answer to the
next question for more details on the
robots.txt file.
- How do I stop MSRBot from
crawling my site?
The format of the robots.txt file
is specified in the Robot
Exclusion Standard. When deciding
which pages to crawl on a particular
host, MSRBot will obey the first
record in the robots.txt file with a
User-Agent starting with "msrbot". If
no such entry exists, it will obey the
first entry with a User-Agent of
"*".
- How can I prevent crawling if I
can't create a robots.txt file?
There is another standard for
telling robots not to index a web page
or follow links on it, which may be
more helpful in some cases, since it
can be used more conveniently on a
page-by-page basis. It involves
placing a "META" element into a page
of HTML, and is described here;
you can also read what the
HTML standard has to say about these
tags.
- I have edited my robots.txt
file. Why are you still crawling my
pages?
MSRBot downloads the robots.txt
file once a day. Therefore it may
take a while for MSRBot to learn about
changes that you may have made to the
robots.txt file on your server. You
may also want to ensure that the
syntax of your robots.txt file is
correct. Please check the standard at
http://www.robotstxt.org/wc/exclusion.html#robotstxt.
Also ensure that the robots.txt file
is in the top-level directory of your
server. Placing the file in any
subdirectory will not have any effect
on how MSRBot crawls your
site.
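To make the two points above concrete, the following Python sketch shows where a crawler would look for robots.txt for any page on a server, and a simple once-a-day refresh check. The helper names and the exact 24-hour threshold are assumptions made for this illustration; MSRBot's real caching logic is not described here.

```python
from urllib.parse import urlsplit, urlunsplit

REFRESH_SECONDS = 24 * 60 * 60  # "once a day"; the exact interval is assumed


def robots_url(page_url):
    """Return the top-level robots.txt URL for the server hosting page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def needs_refresh(fetched_at, now):
    """True once the cached copy is at least a day old (times in seconds)."""
    return now - fetched_at >= REFRESH_SECONDS


print(robots_url("http://www.example.com/docs/guide/index.html?x=1"))
print(needs_refresh(fetched_at=0, now=3600))
```

Whatever the page's own directory, the robots.txt URL is always at the root of the server, which is why a copy placed in a subdirectory is never consulted.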
- How can I prevent MSRBot from
following links from a particular
page?
MSRBot obeys the noindex and
nofollow meta-tags. If you place these
tags in the head of your HTML
document, you can cause MSRBot to not
index or not follow specific documents
on your site. The tags to include and
their effects are:
- <META NAME="robots" CONTENT="noindex">
- MSRBot will retrieve
the document, but it will not index
the document.
- <META NAME="robots" CONTENT="nofollow">
- MSRBot will not follow
any links that are present on the page
to other documents.
The "robots" tag is obeyed by many
different web robots. If you'd like
to specify some of these restrictions
only for MSRBot, you may use "msrbot"
in place of "robots". You can also
combine these tags into a single meta
tag. For example:
<META NAME="robots" CONTENT="noindex,nofollow">
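For readers who want to check what a robot would see in their pages, here is a small Python sketch that extracts these directives using the standard library's `html.parser`. The class and function names are hypothetical, and treating both "robots" and "msrbot" as relevant names is simply a reading of the text above, not a description of MSRBot's internals.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> or <meta name="msrbot">."""

    def __init__(self, names=("robots", "msrbot")):
        super().__init__()
        self.names = names
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <META NAME=...>
        # arrives here as tag "meta" with attribute "name".
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() in self.names:
            for token in attr.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def robots_directives(html):
    """Return the set of robots directives found in the given HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = (
    '<html><head><META NAME="robots" CONTENT="noindex,nofollow"></head>'
    "<body></body></html>"
)
found = robots_directives(page)
print("noindex" in found, "nofollow" in found)
```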
- How often will MSRBot access a
page from my web server?
In general MSRBot should not try to
access your site more than once every
few seconds. MSRBot will also account
for the time it takes to download a
page from a site so that if your site
has a slower connection we will not
access it as frequently. If you find
that we are placing too high a load on
your site please let us know by
sending us e-mail at
msrbot@microsoft.com.
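One common way to implement this kind of politeness policy is to wait for a minimum interval, and longer when the last download was slow. The sketch below shows that idea only; the function name, the 2-second minimum, and the multiplier of 10 are assumed values for illustration and are not MSRBot's actual settings.

```python
def next_delay(download_seconds, min_delay=2.0, factor=10.0):
    """Seconds to wait before contacting the same host again.

    min_delay and factor are assumed values for this sketch: the crawler
    waits at least min_delay, and proportionally longer for slow hosts.
    """
    return max(min_delay, factor * download_seconds)


print(next_delay(0.05))  # fast host: the minimum delay applies
print(next_delay(1.5))   # slow host: the delay grows with download time
```

Under these assumptions, a host that takes 1.5 seconds to serve a page would be contacted far less often than one that responds in 50 milliseconds.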
- How can I report problems to
you?
Please email us at msrbot@microsoft.com. Please
include:
- The URLs of the pages MSRBot has erroneously
fetched.
- If possible, the relevant lines of your web server
log.
- The method of robots exclusion you are using.
- A valid return address, so we can get back to you.