> How to write an anti-robot txt file?

Founder & Administrator

Group: Admin - Top Level
Joined: 29-August 02
Posts: 11,920
From: Bucks County, PA
post Nov 12 2002, 02:10 PM
This was presented to me and I thought it would be a great question for here too, in case more people want to learn. That includes me too, because I need an update on this procedure.

Question:

QUOTE
I recently uploaded a robots.txt file to the site.  Is this the right
approach?  This file recommends that robots look at 16 files on the site
(found at (URL Hidden to protect Web Owner) websitename/anti_robots.txt), and skip all of
the 'include' files which contain 1) VBScript executed on the server, 2)
navigation, contact, and logo presentations, 3) and other pages and
directories on the site which would generate errors or be incomprehensible
to a user if he/she entered the site there.


Anyone want to give a tutorial on how to write anti-robot txt files, explain uses for them and/or offer any other input?

Thanks!

Kim

Moderator Alumni

Group: Hall Of Fame
Joined: 7-November 02
Posts: 6,179
From: New England, USA
post Nov 12 2002, 02:23 PM
I don't think there's any such thing as an "anti_robots.txt" file - and if there is, I've never had a robot ask for it. Are we talking about just the regular old robots.txt file? Or am I missing something?

G.

Founder & Administrator

Group: Admin - Top Level
Joined: 29-August 02
Posts: 11,920
From: Bucks County, PA
post Nov 12 2002, 02:29 PM
Yep!

Except how do you tell it what NOT to index? I've been greedy all these years and let them have everything, but what about those who want to protect files in their directories, etc. from spiders?

And thanks for clarifying...sometimes only *I* know what I'm talking about. Gets scary. :shock:

Kim

Moderator Alumni

Group: Hall Of Fame
Joined: 7-November 02
Posts: 6,179
From: New England, USA
post Nov 12 2002, 02:43 PM
With a robots text file, anything in it is an "exclusion". You can ban specific bots from the entire site, directories, and/or specific files.

http://www.robotstxt.org/wc/robots.html

Google, for those of you who know what you're doing, also allows some extra fun things, like filename wildcards such as "Disallow: /Folder/*.htm", which bans files with the "htm" extension but not any others. Most robots won't listen to this.

Unfortunately, many of the bad robots (leechers, e-mail harvesters, etc.) don't read the robots.txt file anyway, so you need to have other solutions. (In ASP, I do a check along the lines of: If useragent = "badbot" Then Response.Redirect("http://www.fbi.gov"). Even that isn't infallible.)
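The same kind of user-agent check can be sketched in Python. This is only an illustration: the blocklist entries, the redirect URL, and the function name are all made up, and since a robot can send any user-agent string it likes, this is no more infallible than the ASP version.

```python
from typing import Optional

# Hypothetical blocklist; real entries would come from your own server logs.
BAD_AGENT_SUBSTRINGS = ("badbot", "emailharvester")


def redirect_for(user_agent: str,
                 redirect_url: str = "http://www.example.com/blocked") -> Optional[str]:
    """Return a redirect URL for blocklisted agents, or None to serve the page."""
    agent = user_agent.lower()
    if any(bad in agent for bad in BAD_AGENT_SUBSTRINGS):
        return redirect_url
    return None


# A request handler would call this with the User-Agent header it received.
print(redirect_for("BadBot/1.0"))
print(redirect_for("Mozilla/4.0 (compatible)"))
```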

That link above should get you going, though. The homepage has even more advanced info, IDs of all the robots, and more.

Hope that gets ya started!

G.

Centenarian Poster

Group: Members
Joined: 6-October 02
Posts: 210
From: Redding California
post Nov 12 2002, 03:50 PM
You might try here: http://www.freedom2support.co.uk/tutorials.../spidermeta.php

And there are several very good references at http://www.searchenginewatch.com :)

Moderator Alumni

Group: Hall Of Fame
Joined: 1-September 02
Posts: 9,213
From: UK
post Nov 12 2002, 09:15 PM
The links above have it all, but here in a nutshell are a couple of specific examples that help illustrate the robots.txt file.

First of all, the robots.txt file must be in the root of the domain, e.g. http://www.example.com/robots.txt

It has to begin with a user-agent line
e.g. User-agent: *

Note that User-agent: * applies to all robots.

The user-agent declaration must be followed on the next line by an exclusion, e.g.:
Disallow: /
or
Disallow: /scripts/

Examples:

User-agent: *
Disallow: /


That one disallows all robots from everything within the domain. The only thing a spider should grab here is the robots.txt, and it should not even look at anything else and certainly not index anything. This example can be used to get a domain dropped from the index.

User-agent: Googlebot
Disallow: /


This excludes Google's spider from the entire domain, and can be used to hopefully clear a PR penalty that hasn't otherwise gone after corrections have been made. (Only as a last resort).

User-agent: *
Disallow: /cgi-bin/


The above example keeps all spiders out of the cgi-bin directory. You could use this and additionally keep external javascripts (.js files) in the cgi-bin to prevent them getting grabbed, or at least to prevent them being indexed in any way.

User-agent: *
Disallow: /cgi


The above example keeps spiders out of the cgi-bin too, but because rules match by prefix, it would also prevent indexing of cgi-explained.html or cgis-company.html.
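The difference between the last two examples can be checked with Python's standard urllib.robotparser module. This is only a sketch: the example.com URLs, the ExampleBot name, and the allowed() helper are made up for illustration, and Python's parser is just one implementation of the prefix rule, not a statement of how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


NARROW = "User-agent: *\nDisallow: /cgi-bin/\n"  # only the directory
BROAD = "User-agent: *\nDisallow: /cgi\n"        # anything starting with /cgi

script = "http://www.example.com/cgi-bin/form.pl"
page = "http://www.example.com/cgi-explained.html"

for name, rules in (("narrow", NARROW), ("broad", BROAD)):
    print(name, allowed(rules, "ExampleBot", script), allowed(rules, "ExampleBot", page))
```

With the narrow rule only the script is blocked; with the broad rule the ordinary page is blocked as well.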

Previous Moderator/Hall of Fame

Group: Hall Of Fame
Joined: 4-September 02
Posts: 6,888
From: Melbourne, Australia
post Nov 12 2002, 10:58 PM
I'm also a bit shaky on this stuff and wondered what others thought of the meta tags that do this?

Eg <meta name="robots" content="noindex, nofollow">

Previous Moderator/Hall of Fame

Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
post Nov 12 2002, 11:27 PM
Sophie,

The ideal time to use the meta tags is when you don't have access to the root directory, but you want to exclude spiders from certain pages.
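From the robot's side, honoring that tag might look roughly like the sketch below, which uses Python's standard html.parser module. The class and function names are invented for illustration; a real spider would do a good deal more (and, as with robots.txt, nothing forces it to obey).

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, nofollow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in an HTML page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
found = robots_directives(page)
print("index this page?", "noindex" not in found)
print("follow its links?", "nofollow" not in found)
```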

Previous Moderator/Hall of Fame

Group: Hall Of Fame
Joined: 4-September 02
Posts: 6,888
From: Melbourne, Australia
post Nov 12 2002, 11:49 PM
Thanks, Bill. :)

Previous Moderator/Hall of Fame

Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
post Nov 12 2002, 11:56 PM
Some more on spiders and robots.txt files:

1. Kind of old, but interesting -- Avi Rappoport's white paper Robots & Spiders & Crawlers: How Web and intranet search engines follow links to build indexes

2. A robots.txt syntax checker -- http://www.sxw.org.uk/computing/robots/check.html

3. A little about Googlebot and Slurp, two of the robots you'll see. These pages from Google and Inktomi discuss the robot "standards" and the meta tags, and how each engine uses them.

4. You'll see this referred to as a robots exclusion "standard" or alternatively as a "protocol." It's a protocol in that it tells a visiting program how to view a page or set of pages. It's a de facto standard in that many spiders follow it, but its status as a standard is a little in question:

QUOTE
It is not an official standard backed by a standards body, or owned by any commercial organisation. It is not enforced by anybody, and there no guarantee that all current and future robots will use it. Consider it a common facility the majority of robot authors offer the WWW community to protect WWW server against unwanted accesses by their robots.


The same is true of the meta tags for robots. While they should be followed, there's no guarantee that they will be.

5. Google does appear to index some robots.txt files. If you do an allinurl:robots.txt search, you'll see a few thousand.

6. Here are some robots.txt files that you might find interesting:

altavista

Google

alltheweb (fast)

(You're welcome, Sophie. :) )

Moderator Alumni

Group: Hall Of Fame
Joined: 7-November 02
Posts: 6,179
From: New England, USA
post Nov 13 2002, 07:38 AM
Another very important tip:

DO NOT list your "secret directories" in your robots.txt file. (i.e. if you have an administration page or something like that). Once you put a folder like that in there, you're just begging for the hackers to try to get in now that you've told them where it is.

G.

Member

Group: Members
Joined: 6-September 02
Posts: 28
post Nov 13 2002, 07:59 PM
Hi,

My impression was that the original post asked about spiders that do not obey the robots.txt file anyway, so any tips on how to construct a proper file will not do much good ;(

In my research, I found some good sources that provide solutions to that problem. They are ranked from the lightweight to the heavyweight...

http://manatee.mojam.com/~skip/reject.html
This seems like a very simple-minded approach...

http://www.leekillough.com/robots.html
An excellent article with a LOT of useful information

http://www.robotcop.org/
The ultimate solution?

Hope that helps - but it may be way too techie for some...

MC

Previous Moderator/Hall of Fame

Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
post Nov 13 2002, 10:22 PM
Good point MC.

Thanks! (Unfortunately, the leekillough.com page is coming up with an "under construction" message. I hope that the information that you were pointing towards will return.)

There's a section of the FAQ on the robotstxt.org page called "Surely listing sensitive files is asking for trouble?" that reinforces what you're saying. This quote especially:

QUOTE
The real answer is that /robots.txt is not intended for access control, so don't try to use it as such. Think of it as a "No Entry" sign, not a locked door. If you have files on your web site that you don't want unauthorized people to access, then configure your server to do authentication, and configure appropriate authorization.


Even if done correctly, a robots.txt file won't stop a robot from going where it isn't supposed to go, and may help direct it to a place where it shouldn't go. The robotcop links page has some other good resources too. Nice find!

Moderator Alumni

Group: Hall Of Fame
Joined: 1-September 02
Posts: 9,213
From: UK
post Nov 13 2002, 10:31 PM
QUOTE(mrdch)
My impression was that the original post asked about spiders that do not obey the robots.txt file anyway, so any tips on how to construct a proper file will not do much good ;(


Actually, the original request was laced with references to an incorrect robots.txt: one that was incorrectly named, hoped to specify inclusions rather than exclusions, etc., etc.
QUOTE
I recently uploaded a robots.txt file to the site. Is this the right 
approach?
This file recommends that robots look at 16 files on the site (found at (URL Hidden to protect Web Owner) websitename/anti_robots.txt), and skip all of the 'include' files which contain 1) VBScript executed on the server, 2) navigation, contact, and logo presentations, 3) and other pages and directories on the site which would generate errors or be incomprehensible to a user if he/she entered the site there.


I think it is clear that the original quote came from someone needing guidance on using the robots.txt correctly, and needing to know that there is no such thing as a robots inclusion protocol.

Previous Moderator/Hall of Fame

Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
post Nov 13 2002, 11:02 PM
I believe that you may be right Ammon, on reading the first post, and the quote over again. It is a little ambiguous, though.

In that case, I may be going on about something that makes things more confusing with this post, but I noticed something I hadn't seen before. Inktomi's section on Slurp intimates that the draft 1996 protocol might be followed by the spider. I could be wrong. If anyone interprets these lines differently than I do, I'd appreciate it if you let me know:

QUOTE
robots.txt: Slurp obeys the Robot Exclusion Standard. Specifically, Slurp adheres to the 1994 Robots Exclusion Standard (RES). Where the 1996 proposed standard disambiguates the 1994 standard, the proposed standard is followed.


disambiguates?

Most of the discussion in this thread deals with the earlier 1994 standard. But, the 1996 proposed standard changes things a little. I had always understood the robots.txt file as "excluding" spidering of certain sections. The 1996 proposal added "allowing" of certain areas to the 1994's "disallowing."

QUOTE
Previous of this specification didn't provide the Allow line. The introduction of the Allow line causes robots to behave slightly differently under either specification: If a /robots.txt contains an Allow which overrides a later occurring Disallow, a robot ignoring Allow lines will not retrieve those parts. This is considered acceptable because there is no requirement for a robot to access URLs it is allowed to retrieve, and it is safe, in that no URLs a Web site administrator wants to Disallow are be allowed. It is expected this may in fact encourage robots to upgrade compliance to the specification in this memo.


Is Inktomi using a newer standard that no one is following? Is anyone else? Do the allow lines as described in the 1996 protocol make putting a robots.txt file together more confusing or less?
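To make the Allow question concrete, here is a small sketch using Python's standard urllib.robotparser, which understands Allow lines and, as far as I can tell, applies the first matching line in a record. The paths, the ExampleBot name, and the may_fetch() helper are invented for illustration, and this says nothing certain about how Slurp or any other real spider treats the same file.

```python
from urllib.robotparser import RobotFileParser

# Allow comes first so it can override the broader Disallow that follows.
ROBOTS_TXT = """\
User-agent: *
Allow: /docs/public/
Disallow: /docs/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(path: str) -> bool:
    """Check a path on a hypothetical site against the rules above."""
    return parser.can_fetch("ExampleBot", "http://www.example.com" + path)


for path in ("/docs/public/faq.html", "/docs/private.html", "/index.html"):
    print(path, may_fetch(path))
```

A robot that ignores Allow lines would simply skip everything under /docs/, which is the "safe" outcome the quoted passage describes.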

Moderator Alumni

Group: Hall Of Fame
Joined: 7-November 02
Posts: 6,179
From: New England, USA
post Nov 14 2002, 06:26 AM
Google and INK both use the 1996 standard. I don't know of any others that do. I've never used it, as there are so many robots that don't listen to it. Google has the added benefit (as I mentioned above) of obeying wildcards, which no one else seems to do. This is really handy with dynamic sites, but is probably also useful for people who use SSI - you can allow the bot to crawl SHTML files, but disallow *.htm (the included files) in the same directory with one simple line.

G.

Honorary Member

Group: Members
Joined: 30-August 02
Posts: 341
From: Fairfield, Iowa, USA
post Nov 20 2002, 11:01 AM
QUOTE(Black_Knight)

User-agent: Googlebot
Disallow: /


This excludes Google's spider from the entire domain, and can be used to hopefully clear a PR penalty that hasn't otherwise gone after corrections have been made. (Only as a last resort).

Hi guys,

I've been out of town for a week visiting an Indian saint. How would this work? Would Google figure the site no longer existed and thus drop it from their s**t list, then after a while you could remove or revise the robots.txt and Google would regard it as a new site?

Moderator Alumni

Group: Hall Of Fame
Joined: 7-November 02
Posts: 6,179
From: New England, USA
post Nov 21 2002, 06:30 AM
Nope. Once a site is on the list, it's on the list for however long Google decides it should be there. I've known folks who have purchased a domain without checking the history, built a site on that domain, and found out a month later that that domain still has a penalty on it. It was another month of writing to Google and getting them to check out the site and lift the penalty.

Many Google penalties, though, seem to be like a hockey penalty. If you get the penalty, fix the problem, and wait, you'll be out of the penalty box in a few months. If Google can't see that the problem is fixed (the site is abandoned or the bot is banned via robots.txt) then the penalty is likely to never get lifted.

G.

Centenarian Poster

Group: Members
Joined: 5-December 02
Posts: 140
From: World Citizen
post Dec 9 2002, 05:00 PM
QUOTE(bragadocchio)

6.  Here are some robots.txt files that you might find interesting:

altavista

Google

alltheweb (fast)


In the Google file listed above I see the following:
QUOTE(http://www.google.com/robots.txt)

Disallow: /bsd?
Disallow: /linux?
Disallow: /mac?
Disallow: /microsoft?
Disallow: /unclesam?


Obviously, the disallow is self-explanatory, but this is the first time I have seen the question mark after a directory name.

Curious if anyone knows the theory behind this one.

Thanks.
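One plausible reading, assuming the plain prefix matching described earlier in the thread: a rule like /bsd? only matches URLs that continue with a query string, so /bsd itself stays crawlable while /bsd?q=... does not. Whether that is Google's actual intent is a guess. The helper below is a hand-rolled prefix check written for this illustration, not any robot's real code.

```python
def is_disallowed(path_and_query: str, disallow_prefixes) -> bool:
    """Plain prefix match: a URL is blocked if it starts with any rule."""
    return any(
        path_and_query.startswith(prefix)
        for prefix in disallow_prefixes
        if prefix  # an empty Disallow value blocks nothing
    )


RULES = ["/bsd?", "/linux?"]

for url in ("/bsd", "/bsd?q=kernel", "/linux?q=drivers", "/mac?q=os"):
    print(url, is_disallowed(url, RULES))
```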

Centenarian Poster

Group: Members
Joined: 5-December 02
Posts: 140
From: World Citizen
post Dec 10 2002, 02:57 PM
Also, does anyone know whether or not you can exclude one SE from a particular file while allowing another SE to see/spider that file?

In all the examples of robots.txt I have seen, I haven't seen an example of this, so I guess the answer is no. But you darned people keep feeding me information that makes me have to think and come up with new questions (hey, is that why this forum is here? ;) )

Anyways, I would think this might be valuable if you optimized one page for AV and another for Google with basically the same content. You might not want both pages to be spidered by each of the SEs, for fear of spamming. If you could tell each user agent's robot which pages it could and could not see, that would be a good thing, right?
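For what it's worth, the per-robot records shown earlier in the thread (User-agent: Googlebot followed by its own Disallow lines) suggest this can be done with one record per robot. The sketch below tests that idea with Python's standard urllib.robotparser. The page names are invented, Scooter is used here as the name of AltaVista's robot, and a compliant robot is assumed; nothing in it speaks to the spamming concern itself.

```python
from urllib.robotparser import RobotFileParser

# One record per robot, separated by a blank line.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /page-for-av.html

User-agent: Scooter
Disallow: /page-for-google.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

SITE = "http://www.example.com"  # hypothetical site

for agent in ("Googlebot", "Scooter", "OtherBot"):
    for page in ("/page-for-google.html", "/page-for-av.html"):
        print(agent, page, parser.can_fetch(agent, SITE + page))
```

Robots not named in any record (OtherBot here) see both pages, since there is no User-agent: * record to restrict them.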