
Cre8asiteforums Internet Marketing
and Conversion Web Design



Working with the robots.txt file



#1 Pannu

    Ready To Fly Member

  • Members
  • 18 posts

Posted 17 April 2004 - 03:12 AM

Working with the robots.txt File
By Jagdeep S. Pannu


What is the robots.txt file?
Working with the robots.txt file
Advantages of the robots.txt file
Disadvantages of the robots.txt file
Optimization of the robots.txt file
Using the robots.txt file
Related reading



What is the robots.txt file?

The robots.txt file is a plain ASCII text file that gives search engine robots specific instructions about content they are not allowed to crawl and index. These instructions play a deciding role in how a search engine indexes your website’s pages. The standard address of the robots.txt file is www.example.com/robots.txt. This is the first file a robot visits: it picks up the instructions for indexing the site content and follows them. The file contains two text fields. Let’s study this robots.txt example:

User-agent: *
Disallow:

The User-agent field specifies the name of the robot to which the access policy in the following Disallow field applies. The Disallow field specifies URLs that the named robots may not access. An example:

User-agent: *
Disallow: /

Here “*” means all robots and “/” means all URLs. This is read as, “No access for any search engine to any URL.” Since every URL path begins with “/”, a bare “/” with nothing after it bans access to all URLs. If only partial access is to be restricted, list just the banned URLs in the Disallow field. Let’s consider this example:

# Full access for Googlebot.
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /concepts/new/

Here we see that both fields have been repeated: separate records can be written for different user agents. The commands above mean that all user agents are barred from /concepts/new/ except Googlebot, which has full access. Characters following a # up to the end of the line are treated as comments and ignored.


Working with the robots.txt file

1. The robots.txt file is always named in all lowercase (e.g. Robots.txt
or robots.Txt is incorrect)

2. Wildcards are not supported in either field. The only exception is *, which may appear in the User-agent field as a special character denoting “all robots”. Googlebot is the only robot that now supports some wildcard patterns for file extensions.
Ref: http://www.google.co...ers/faq.html#12
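As a hedged illustration of that Googlebot-specific extension (the .pdf pattern below is only an example; it is not part of the standard and other robots may ignore it):

User-agent: Googlebot
Disallow: /*.pdf$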

3. The robots.txt file is an exclusion file meant for search engine robot
reference and not obligatory for a website to function. An empty or absent
file simply means that all robots are welcome to index any part of the
website.

4. Only one file can be maintained per domain.

5. Website owners who do not have administrative rights sometimes cannot create a robots.txt file. In such situations, the Robots Meta Tag can be configured instead, which serves the same purpose. Keep in mind, however, that questions have lately been raised about robot behaviour regarding the Robots Meta Tag, and some robots might skip it altogether. The protocol, on the other hand, makes it obligatory for all robots to start with robots.txt, thereby making it the default starting point for all robots.
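For reference, a typical Robots Meta Tag placed in a page’s <head> section looks like this (a minimal sketch; adjust the noindex/nofollow values to your needs):

<head>
<meta name="robots" content="noindex, nofollow">
</head>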

6. Separate lines are required for specifying access for different user agents, and a Disallow field should not carry more than one path per line. There is no limit to the number of lines, though: both the User-agent and Disallow fields can be repeated with different values any number of times. Blank lines must not appear within a single record set, because a blank line marks the end of a record.
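A minimal sketch of a well-formed file with two record sets (the directory names are only placeholders); each Disallow line holds one path, and a blank line separates the records:

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /temp/

User-agent: *
Disallow: /private/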

7. Use lower case for robots.txt content wherever possible. Please also note that file and directory names on Unix systems are case sensitive, so match the exact case of your paths when writing Disallow lines for Unix-hosted domains.
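For example, on a Unix-hosted domain the following two entries refer to two different directories, so copy the case exactly as it appears on your server:

User-agent: *
Disallow: /Images/
Disallow: /images/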

You can use the robots.txt Validator from SearchEngineWorld to check your robots.txt file.

Advantages of the robots.txt file

The protocol expects all search engine robots to start with the robots.txt file, making it the default entry point for robots when the file is present. Specific instructions can be placed in this file to help search engines index your site. The major search engines honour the Standard for Robots Exclusion.

1. The robots.txt file can be used to keep out unwanted robots such as e-mail harvesters and image strippers.

2. The robots.txt file can be used to specify the directories on your
server that you don’t want robots to access and/or index e.g. temporary,
cgi, and private/back-end directories.
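A minimal sketch of such a file (the directory names here are only examples):

User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/
Disallow: /admin/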

3. An absent robots.txt file generates a 404 error, which may send the robot to your default or customized 404 error page. It has been observed that sites without a robots.txt file but with a customized 404 page serve that page to robots requesting robots.txt; if this happens seamlessly, the robot may treat the error page as the robots.txt file, which can confuse its indexing.

4. The robots.txt file can be used to direct select robots to the relevant pages to be indexed. This comes in especially handy where the site has multilingual content or where a robot is searching only for specific content.

5. The robots.txt file also answers the need to stop robots from deluging servers with rapid-fire requests and from re-indexing the same files repeatedly. If you have duplicate content on your site for any reason, you can keep it from being indexed, which helps you avoid duplicate-content penalties.


Disadvantages of the robots.txt file

Careless handling of directory and file names can invite snooping: the robots.txt file is publicly readable, and you may have listed file names and directories that hold classified content. This is not a serious issue if you deploy effective security checks on the content in question. For example, if your traffic log lives at a URL such as www.example.com/stats/index.htm and you do not want robots to index it, you would add a command like this to your robots.txt file:

User-agent: *
Disallow: /stats/

However, it is easy for a snooper to guess what you are trying to hide: simply typing www.example.com/stats into a browser would expose the same content. This calls for one of the following remedies:

1. Change file names:

Change the stats filename from index.htm to something different, such
as stats-new.htm so that your stats URL now becomes
www.example.com/stats/stats-new.htm

Place a simple text file containing the text, “Sorry, you are not authorized to view this page”, and save it as index.htm in your /stats/ directory.

This way the snooper cannot guess your actual filename and get to your
banned content.

2. Use login passwords:

Password-protect the sensitive content listed in your robots.txt
file.


Optimization of the robots.txt file

The right commands: Use correct commands. The most common errors include putting the value meant for the “User-agent” field in the “Disallow” field and vice versa. Please also note that there is no “Allow” command in the standard robots.txt protocol; content not blocked in a “Disallow” field is considered allowed. Currently, only two fields are recognized: the “User-agent” field and the “Disallow” field. Experts are considering the addition of more robot-recognizable commands to make the robots.txt file more webmaster- and robot-friendly.

Note: Google is the only search engine experimenting with certain new robots.txt commands. There are indications that Google now recognizes the “Allow” command. Please refer to:
http://www.google.co...ers/faq.html#12
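If you choose to rely on it, a hedged sketch of how that Google-specific “Allow” line might be combined with a Disallow (the paths are hypothetical, and robots other than Googlebot may not understand the Allow line):

User-agent: Googlebot
Disallow: /concepts/
Allow: /concepts/overview.html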

Bad Syntax: Do not put multiple URLs in one Disallow line in the robots.txt file. Use a new Disallow line for every directory that you want to block access to. Incorrect example:

User-agent: *
Disallow: /concepts/ /links/ /images/

Correct example:

User-agent: *
Disallow: /concepts/
Disallow: /links/
Disallow: /images/

Files and directories: If a specific file has to be disallowed, end the entry with the file extension and no trailing forward slash. Study the following examples:

For file:

User-agent: *
Disallow: /hilltop.html

For Directory:

User-agent: *
Disallow: /concepts/

Remember, if you have to block access to all files in a directory, you don’t have to specify each and every file in robots.txt; you can simply block the directory as shown above. Another common error is leaving out the slashes altogether, which conveys a very different instruction than intended.

The right location: No robot will find a badly placed robots.txt file. Make sure the location is www.example.com/robots.txt.

Capitalization: Never capitalize your syntax commands. Directory and file names are case sensitive on Unix platforms. The only capitals used per the standard are in the field names “User-agent” and “Disallow”.

Correct order: If you want to block access for all but one or more specific robots, the specific ones should be mentioned first. Let’s study this robots.txt example:

User-agent: *
Disallow: /

User-agent: MSNBot
Disallow:

In the above case, MSNBot might simply leave the site without indexing after reading the first record. The correct order is:

User-agent: MSNBot
Disallow:

User-agent: *
Disallow: /

The robots.txt file: Presence - Not having a robots.txt file at all generates a 404 error for search engine robots, which could send the robot to the default or your customized 404 error page. If this happens seamlessly, it is up to the robot to decide whether the file it received is a robots.txt file or an HTML page. Typically this does not cause many problems, but you may not want to risk it. It is always better to put a standard robots.txt file in the root directory than to have none at all.

The standard robots.txt file for allowing all robots to index all pages is:

User-agent: *
Disallow:

Using # carefully in the robots.txt file: Appending “#” comments to the end of a command line is not a good idea. Some robots might misinterpret such a line, although it is acceptable per the robots exclusion standard. Placing comments on their own lines is always preferred.
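In other words, prefer the first form below to the second (the directory name is just an example):

# Keep all robots out of the archive section
User-agent: *
Disallow: /archive/

rather than:

User-agent: *
Disallow: /archive/ # keep robots out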


Using the robots.txt file

1. Robots are configured to read text. Too much graphic content could
render your pages invisible to the search engine. Use the robots.txt file
to block irrelevant and graphic-only content.

2. It is believed that indiscriminate robot access to all files can dilute the relevance of your site content once it is indexed. This could seriously affect your site’s ranking with search engines. Use the robots.txt file to direct robots to content relevant to your site’s theme by blocking the irrelevant files or directories.

3. On multilingual websites, the file can be used to direct robots to the relevant content for each language. This ultimately helps search engines present relevant results for specific languages, and it supports their advanced search options where language is a variable.

4. Some robots can cause severe server-load problems by rapid-firing too many requests at peak hours, which could affect your business. Excluding robots that are irrelevant to your site in the robots.txt file can help take care of this problem. It is really not a good idea to let malevolent robots use up precious bandwidth harvesting your e-mail addresses, images, etc.

5. Use the robots.txt file to block out folders with sensitive information, text content, demo areas, or content yet to be approved by your editors before it goes live.
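A short sketch of such a file (the folder names are only placeholders):

User-agent: *
Disallow: /demo/
Disallow: /drafts/
Disallow: /internal/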

The robots.txt file is an effective tool to address certain issues
regarding website ranking. Used in conjunction with other SEO strategies,
it can significantly enhance a website’s presence on the net.

Article last updated: 11th March 2004

Related Reading

A Standard for Robots Exclusion.
http://www.robotstxt...c/norobots.html

Guide to The Robots Exclusion Protocol
http://www.robotstxt...sion-admin.html

W3C Recommendations

Meta Tags Optimization for Search Engines

Jagdeep Singh Pannu

Edited by Pannu, 16 January 2007 - 06:34 AM.


#2 James

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 2104 posts

Posted 17 April 2004 - 04:28 AM

Hi Jagdeep,

A very useful introduction to robots.txt files and their use. However, a link to http://www.ebooksnby...317022910.shtml or your original article at http://www.seorank.c...ts-tutorial.htm would have been just as useful. Alternatively, you could submit the article to the Cre8asite Resource Library.

Kind Regards,
James

#3 Pannu

    Ready To Fly Member

  • Members
  • 18 posts

Posted 20 April 2004 - 06:55 AM

Thanks James!

I thought that this article fits in best in the tutorial section and links are sometimes taken as self/site promotion.

Edit: I have moved the SEO Articles link to the main tutorial page, so that it is more visible and accessible. Readers not interested in the robots.txt might never have found this link.

#4 Grumpus

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 6296 posts

Posted 20 April 2004 - 07:20 AM

Yeah, deciding whether to just post a link to an article or to repost it completely is a tough one. We're pretty relaxed here about that sort of thing. True, we frown upon a new visitor who comes in and posts a link to their site (article or not) and then never comes back again. But if someone is participating in some other threads, and hanging out, we don't mind some links to good stuff from your site. Usually a brief explanation of the article along with the link, rather than just a naked link, is a good idea and encourages discussion of it as well.

This is an excellent tutorial, though - so regardless of how you posted it - Thank You! :D

G.

P.S. I managed to find a minute to review your submissions to the directory last night. Sorry it took so long - I had a busy weekend. I did alter one of the descriptions and you can see those changes and any comments I made during the review process by clicking on your "Control Panel" link on the directory site and then looking at your "Sites" and "Pages" submission links in the "Activity Rating" section of that page. All those articles were really good. Cheers!

#5 Pannu

    Ready To Fly Member

  • Members
  • 18 posts

Posted 21 April 2004 - 12:15 AM

Grumpus, thanks for appreciating my article and also for the warm welcome to this excellent forum.

I think it's fun interacting in the forum as there is so much to learn. Every post is enlightening and interaction is definitely the best tutorial.

I look forward to learning a lot and sharing the very few drops that I have picked from the vast ocean of knowledge.

#6 RoadTrucker

    New To Community

  • Members
  • 1 posts

Posted 08 May 2004 - 02:26 PM

The post was perfect. A quick guide at the top, followed by a more detailed explanation. It was easy to find within the Tutorial forum. I have read a lot on the robots file, but you made it concise and to the point. I might have added a link to a syntax checker to round it out, but all in all, I'll be looking for other tutorials and posts by you. Thanks :)

#7 Pannu

    Ready To Fly Member

  • Members
  • 18 posts

Posted 08 May 2004 - 03:15 PM

Thanks RoadTrucker.

Lots of new stuff is in the pipeline. The latest published article is Atul Gupta's (CEO SEORank):

Where does your Site Rank on Google?

Atul writes about the possibilities if Google were to offer a website ranking check facility in which you could type in 100 keyword phrases along with your site URL. Links to sample pages have also been included as suggestions.
You can even cast your vote for the suggestions. Your recommendation could well become a reality in the future.

#8 cubano_en_ny

    New To Community

  • Members
  • 1 posts

Posted 13 July 2005 - 12:44 PM

:roll:

I'm thinking about the right format for writing exclusions. I placed the robots.txt file under our /public directory (which is where our index.htm resides). As far as robots are concerned, is the /public directory the root directory? Should I just worry about excluding files and directories under /public?

Do I have to worry about robots navigating upwards in the web directory structure? They shouldn't, but just asking the x-perts.

Thanks in advance

#9 Pannu

    Ready To Fly Member

  • Members
  • 18 posts

Posted 13 July 2005 - 03:06 PM

The default address that the search engines use to locate a robots.txt file is www.example.com/robots.txt

In your case, Cubano, the "/public" directory is not in itself the root; nothing fits the definition of "root" other than "example.com/", i.e. the URL root that your web server exposes. Since /public is what the server serves as www.example.com/, that is where robots.txt belongs, and the exclusions you write apply to the URLs reachable under it.

Robots that honour this protocol can navigate anywhere in your site, if exclusion is not specified in either the robots.txt file or in the robots meta tag on individual pages.
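For instance (assuming /public is the directory your server maps to www.example.com/), the file saved as /public/robots.txt is served as www.example.com/robots.txt, and the paths inside it are written relative to the URL root, not the filesystem:

User-agent: *
# refers to www.example.com/private/, i.e. /public/private/ on disk
Disallow: /private/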

#10 Pannu

    Ready To Fly Member

  • Members
  • 18 posts

Posted 20 April 2007 - 04:33 PM

There have been some recent developments regarding the robots.txt file.

Christine Churchill reports on the robots.txt summit, which was held at Search Engine Strategies 2007 and was moderated by Danny Sullivan. Representatives of the prominent search engines attended and discussed how robots.txt, which started off as an exclusion file, should evolve and be used for more than just listing the sections of your site that should not be indexed.

One of the unanimously approved additions to the syntax, which the search engines will now follow, is a line that points to the location of your website's XML sitemap. Here is the syntax:

Sitemap: http://www.example.com/sitemap.xml

You can check out specific instructions on the sitemap protocol page. Be sure to make your XML sitemap compliant with the sitemap protocol.
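In practice the Sitemap line simply sits in the file alongside the usual records, for example (the URL and path are placeholders):

Sitemap: http://www.example.com/sitemap.xml

User-agent: *
Disallow: /cgi-bin/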

Vanessa Fox posted news about the sitemap protocol, and about how to add a link to your sitemap in the robots.txt file.

Here's a more detailed discussion in another thread at Cre8asiteforums: robots.txt and xml sitemaps

#11 JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 20 April 2007 - 05:11 PM

Something else I did not see mentioned:
Yahoo! Search Crawler (Yahoo! Slurp) - Supporting wildcards in robots.txt

Yahoo: How can I reduce the number of requests you make on my web site?

John

#12 Pannu

    Ready To Fly Member

  • Members
  • 18 posts

Posted 21 April 2007 - 02:26 PM

Thanks for adding those John.

Yahoo! did extend the functions of robots.txt to their advantage by configuring Slurp to recognize specific wildcards ('*' and '$'). '*' matches any sequence of characters in a URL, and '$' marks where the URL string must end. This makes it quite easy for webmasters to exclude folders, files or file types.

The 'Crawl-delay' instruction is a very thoughtful addition, which lets webmasters set a delay between successive Slurp visits. I recommend visiting the links provided by John to fully understand how the wildcards and the Crawl-delay instruction can be used.
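A hedged sketch of what a Slurp-specific record using these extensions might look like (the patterns and delay value are only examples):

User-agent: Slurp
Disallow: /*.gif$
Disallow: /*sessionid
Crawl-delay: 5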

To remove all ambiguity, it is worth asserting again that the above additions to robots.txt are specific to directing Yahoo!'s spider, Slurp.

Also notable is the fact that Slurp recognizes the 'Allow' command as well, which can be used to specify what should be included in the index. As stated earlier in the main article, Google supported (indeed pioneered) this command for Googlebot.

None of these additions to robots.txt is standardized or included in the Robots Exclusion Protocol authored by Martijn Koster, but they are welcome because, IMHO, robots.txt needs to evolve with time (so much has changed since 1994).

Edited by Pannu, 21 April 2007 - 02:28 PM.


#13 Pannu

    Ready To Fly Member

  • Members
  • 18 posts

Posted 03 May 2007 - 06:05 AM

Yahoo! will now obey the "class=robots-nocontent" attribute for specific content within a page that you do not want Slurp to index: Yahoo! on the "robots-nocontent" attribute

This means that you can guide Slurp to partially exclude content from a page and index the rest of the page.

A simple example here:

<p class="robots-nocontent">
“So-called rationalists shall revel in proven facts, which they once ridiculed”
</p>

See Danny Sullivan's post for more examples and comments.


