Using robots.txt To Control Search Engine Spiders
What are robots and spiders?
Search engines such as Google and Yahoo! use programs called 'robots' or 'spiders'
to visit pages on the internet and automatically add them to their search
databases. Many people even submit their sites manually rather than wait for a robot
or spider to visit their web site. When you put a web page on your web server,
it can take some time for your site to show up in a search engine. Once the
page is entered into a search engine's database, however, it can also take a long
time for the page to be removed should the page move or be taken off the server.
However, there may be times when you have information that you do not want to
share with everyone or have the search engines put in their databases. You may
even have a whole directory you wish to keep private.
One way to keep a search engine from adding your pages to its database is to
put a file called robots.txt on your server that lists the pages and directories
you wish to protect. While this is not a fool-proof way to protect your pages, it
may help keep them from showing up in most search engine databases, at the very
least.
How do I create a robots.txt file?
You can create a robots.txt file in any Linux text editor, or in any text
editor that saves in Unix format. This is important, as the file must have Unix-style
line breaks. Please see
Text Editors You Can Use To Create CGI Scripts
for more information. Note that the robots.txt file must be in the root
directory (and not in a subdirectory) of your CGI or web server. You can have
one on each if you want: put a robots.txt file in the root directory of your
CGI server to control spidering of files on only that server, and put a robots.txt
file in the root directory of your web server to control spidering of files on
your web server only. Your robots.txt file will only affect your own server(s),
not anyone else's.
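For example, assuming your web server is reached at yourdomain.com (a placeholder
for your actual domain), the uploaded file would need to be reachable at:
http://yourdomain.com/robots.txt
Most robots that cannot find the file at that address will simply assume they may
visit everything on the server.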
The robots.txt file usually needs only two fields: User-agent and
Disallow. Here are a couple of examples you can put in your robots.txt
file. You can add more than one User-agent or Disallow field to
your robots.txt file.
Allow all robots:
User-agent: *
Disallow:
This will allow all robots to visit all pages on your site. Note that nothing was
entered for Disallow even though the field was included in the robots.txt file.
Specify rules for a certain search engine:
User-agent: googlebot
Disallow:
This specifies Disallow rules to be followed only by Google's robot (Googlebot) when
it visits your site. Note that nothing was entered for Disallow even though the field
was included in the robots.txt file. This means all files can be added to
Google's search database.
Keep all robots out of your entire site:
User-agent: *
Disallow: /
This keeps all robots from adding any of the pages on the site where the
robots.txt file is placed. Note that the slash (/) in the Disallow field means
all files and directories.
Ban a certain search engine from all directories:
User-agent: googlebot
Disallow: /
This would keep Google from adding any pages on your site to its search
engine database.
Protecting only certain files:
User-agent: *
Disallow: /images/
Disallow: /email.html
This keeps all robots from adding any of the files in the images directory, as
well as the email.html file, to their search databases. Note that
using Disallow: /images/ will cover the subdirectories as well, so
there is no need to add another line for each subdirectory in the images
directory. Spiders will not go into the images directory at all, nor visit
any of the directories or files inside it.
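You can also combine several groups of rules in a single robots.txt file. As a rough
sketch (the /private/ directory name is just an example), the following would let
Google's robot index everything while keeping all other robots out of the /private/
directory:
User-agent: googlebot
Disallow:

User-agent: *
Disallow: /private/
Each robot follows the group that matches it most specifically, so Googlebot would
use the first group here and ignore the second.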
We recommend you take a look at
an example robots.txt file from PimpSoft.
You may want to copy this file and adjust it to your site's needs. This file helps
to keep certain harmful robots (spiders) off your site and control how these
robots spider your site. In this way, your pages can be indexed most efficiently.
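For instance, a group like the following (BadBot is a made-up name standing in for
whatever robot you want to ban) tells that robot to stay off your site entirely:
User-agent: BadBot
Disallow: /
Keep in mind that robots.txt is only a request; poorly behaved robots may ignore it,
which is why it is not a fool-proof way to protect your pages.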
Once you have constructed and saved your robots.txt file, upload it to the root
directory of your web server using your FTP program.
Checking robots.txt Validity
Once you've uploaded the robots.txt file, it's usually a good idea to check the
validity of the file and be sure there are no problems. You can do this using a
robots.txt validator. Please be sure your robots.txt file
is uploaded to your web site and provide the proper URL to the file, such as
http://yourdomain.com/robots.txt.
Specifying Robot Rules in HTML Meta Tags
Alternatively (or even additionally), you can specify the rules in the HTML file
itself, within a meta tag. This tag goes inside the head section of the page. Here
is an example:
<head>
<meta name="robots" content="noindex,nofollow">
<title>My Page</title>
</head>
In the content= area, within the quotes, you have a few choices. For the first
word, before the comma, you can use either index, meaning
the robot will add the page to the search engine database, or noindex,
meaning the robot will not add the page to the search engine database.
For the second word, after the comma, you have two choices. You
can use follow, meaning the robot will also visit all the other links on
that page and catalog them (provided there is no robots meta tag on those pages
preventing it, in which case it will skip them), or nofollow, meaning the robot
will act on only that page and not follow the links on it.
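For example, to have a page added to the search engine database while keeping robots
from following the links on it, the meta tag would look like this:
<meta name="robots" content="index,nofollow">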
Which do I use, robots.txt or the meta tag?
The robots.txt method is best if you want to keep robots from indexing a whole
directory or even certain individual files. It also lets you change things in one
file rather than in each .html file you have, which makes it good for keeping your
pages from being added to search engines.
The robots meta tag is best if you do want search engines to add your pages, since
it lets you control indexing and link-following for each individual page.
Do remember, though, that spiders will only find content on your pages and on pages
that are linked to. If any of your pages aren't linked to, spiders may never
find and index those pages.