Moderator Alumni, Group: Hall Of Fame
Joined: 7-November 02
Posts: 6,179
From: New England, USA
|
Nov 12 2002, 02:43 PM |
|
|
With a robots text file, anything in it is an "exclusion". You can ban specific bots from the entire site, directories, and/or specific files.
http://www.robotstxt.org/wc/robots.html
Google, for those of you who know what you're doing, also allows some extra fun things like filename wildcards, e.g. "Disallow: /Folder/*.htm", which bans files in that folder with the "htm" extension, but not any others. Most other robots won't listen to this.
Unfortunately, many of the bad robots (leechers, e-mail harvesters, etc.) don't read the robots.txt file anyway, so you need to have other solutions. (In ASP, I do a check along the lines of "If useragent = "badbot" then response.redirect(www.fbi.gov)", but even that isn't infallible.)
That link above should get you going, though. The homepage has even more advanced info, IDs of all the robots and more.
Hope that gets ya started!
G.
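The user-agent check described above could look roughly like this in Python. This is only a sketch under assumptions: the blocklist entries and the function names `is_blocked_agent` and `response_status` are made up for illustration, and it answers with a 403 status instead of the redirect used in the ASP version. As noted in the post, a robot can send any User-Agent string it likes, so this is not infallible either.

```python
# Hypothetical blocklist: these substrings are placeholders, not real bot names.
BLOCKED_AGENT_SUBSTRINGS = ("badbot", "emailharvester")


def is_blocked_agent(user_agent, blocked=BLOCKED_AGENT_SUBSTRINGS):
    """Return True if the User-Agent header contains a blocked substring."""
    if not user_agent:
        # Missing header: this sketch lets the request through.
        return False
    lowered = user_agent.lower()
    return any(token in lowered for token in blocked)


def response_status(user_agent):
    """Pick an HTTP status: 403 for blocked agents, 200 otherwise."""
    return 403 if is_blocked_agent(user_agent) else 200
```

The matching is a case-insensitive substring test, so "BadBot/1.0" and "badbot" are treated the same way.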
Centenarian Poster, Group: Members
Joined: 6-October 02
Posts: 210
From: Redding California
|
Nov 12 2002, 03:50 PM |
|
|
You might try here: http://www.freedom2support.co.uk/tutorials.../spidermeta.php
And there are several very good references at http://www.searchenginewatch.com
Moderator Alumni, Group: Hall Of Fame
Joined: 1-September 02
Posts: 9,213
From: UK
|
Nov 12 2002, 09:15 PM |
|
|
The links above have it all, but here in a nutshell are a couple of specific examples that help illustrate the robots.txt file.
First of all, the robots.txt file must be in the root of the domain, e.g. http://www.example.com/robots.txt
It has to begin with a user-agent line, e.g.
User-agent: *
Note that User-agent: * addresses all robots. The user-agent declaration must be followed on the next line by an exclusion, e.g.
Disallow: /
or
Disallow: /scripts/
Examples:
User-agent: *
Disallow: /
That one disallows all robots from everything within the domain. The only thing a spider should grab here is the robots.txt, and it should not even look at anything else and certainly not index anything. This example can be used to get a domain dropped from the index.
User-agent: Googlebot
Disallow: /
This excludes Google's spider from the entire domain, and can be used to hopefully clear a PR penalty that hasn't otherwise gone after corrections have been made. (Only as a last resort.)
User-agent: *
Disallow: /cgi-bin/
The above example keeps all spiders out of the cgi-bin directory. You could use this and additionally keep external javascripts (.js files) in the cgi-bin to prevent them getting grabbed, or at least to prevent them being indexed in any way.
User-agent: *
Disallow: /cgi
The above example keeps spiders out of the cgi-bin too, but would also prevent indexing of cgi-explained.html or cgis-company.html, because the rule matches any path that starts with /cgi.
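The difference between the last two examples can be checked with Python's standard `urllib.robotparser` module. This is a minimal sketch: the helper name `allowed`, the agent name "ExampleBot", and the file name form.pl are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url):
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The two rule sets from the post above.
CGI_BIN_ONLY = "User-agent: *\nDisallow: /cgi-bin/\n"
CGI_PREFIX = "User-agent: *\nDisallow: /cgi\n"

BASE = "http://www.example.com"
AGENT = "ExampleBot"  # hypothetical robot name

# Disallow: /cgi-bin/ only covers paths under that directory.
cgi_bin_script_ok = allowed(CGI_BIN_ONLY, AGENT, BASE + "/cgi-bin/form.pl")
explained_ok_narrow = allowed(CGI_BIN_ONLY, AGENT, BASE + "/cgi-explained.html")

# Disallow: /cgi is a plain prefix, so it also catches cgi-explained.html.
explained_ok_broad = allowed(CGI_PREFIX, AGENT, BASE + "/cgi-explained.html")
```

With the narrow rule only the script under /cgi-bin/ is refused; with the broad /cgi prefix, cgi-explained.html is refused as well.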
Previous Moderator/Hall of Fame, Group: Hall Of Fame
Joined: 4-September 02
Posts: 6,888
From: Melbourne, Australia
|
Nov 12 2002, 11:49 PM |
|
|
Thanks Bill.
|
Previous Moderator/Hall of Fame, Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
|
Nov 12 2002, 11:56 PM |
|
|
Some more on spiders and robots.txt files:
1. Kind of old, but interesting -- Avi Rappoport's white paper Robots & Spiders & Crawlers: How Web and intranet search engines follow links to build indexes
2. A robots.txt syntax checker -- http://www.sxw.org.uk/computing/robots/check.html
3. A little about Googlebot and Slurp, two of the robots you'll see. These pages from Google and Inktomi discuss the robot "standards" and the meta tags, and how each engine uses them.
4. You'll see this referred to as a robots exclusion "standard" or alternatively as a "protocol." It's a protocol in that it tells a visiting program how to view a page or set of pages. It's a de facto standard in that many spiders follow it, but its status as a standard is a little in question:
QUOTE
It is not an official standard backed by a standards body, or owned by any commercial organisation. It is not enforced by anybody, and there no guarantee that all current and future robots will use it. Consider it a common facility the majority of robot authors offer the WWW community to protect WWW server against unwanted accesses by their robots.
The same is true of the meta tags for robots. While they should be followed, there's no guarantee that they will be.
5. Google does appear to index some robots.txt files. If you do an allinurl:robots.txt search, you'll see a few thousand.
6. Here are some robots.txt files that you might find interesting: altavista alltheweb (fast)
(You're welcome, Sophie)
Member, Group: Members
Joined: 6-September 02
Posts: 28
|
Nov 13 2002, 07:59 PM |
|
|
Hi,
My impression was that the original post asked about spiders that do not obey the robots.txt file anyway, so any tips on how to construct a proper file will not do much good ;(
In my research, I found some good sources that provide solutions to that problem. They are ranked from the lightweight to the heavyweight...
http://manatee.mojam.com/~skip/reject.html
This seems like a very simple-minded approach...
http://www.leekillough.com/robots.html
An excellent article with a LOT of useful information
http://www.robotcop.org/
The ultimate solution?
Hope that helps - but it may be way too techie for some...
MC
Previous Moderator/Hall of Fame, Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
|
Nov 13 2002, 10:22 PM |
|
|
Good point MC.
Thanks! (Unfortunately, the leekillough.com page is coming up with an "under construction" message. I hope that the information that you were pointing towards will return.)
There's a section of the FAQ on the robotstxt.org page called "Surely listing sensitive files is asking for trouble?" that reinforces what you're saying. This quote especially:
QUOTE
The real answer is that /robots.txt is not intended for access control, so don't try to use it as such. Think of it as a "No Entry" sign, not a locked door. If you have files on your web site that you don't want unauthorized people to access, then configure your server to do authentication, and configure appropriate authorization.
Even if done correctly, a robots.txt file won't stop a robot from going where it isn't supposed to go, and may help direct it to a place where it shouldn't go.
The robotcop links page has some other good resources too. Nice find!
Previous Moderator/Hall of Fame, Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
|
Nov 13 2002, 11:02 PM |
|
|
I believe that you may be right Ammon, on reading the first post, and the quote over again. It is a little ambiguous, though.
In that case, I may be going on about something that makes things more confusing with this post, but I noticed something I hadn't seen before. Inktomi's section on Slurp intimates that the draft 1996 protocol might be followed by the spider. I could be wrong. If anyone interprets these lines differently than I do, I'd appreciate it if you let me know:
QUOTE
robots.txt: Slurp obeys the Robot Exclusion Standard. Specifically, Slurp adheres to the 1994 Robots Exclusion Standard (RES). Where the 1996 proposed standard disambiguates the 1994 standard, the proposed standard is followed.
disambiguates?
Most of the discussion in this thread deals with the earlier 1994 standard. But, the 1996 proposed standard changes things a little. I had always understood the robots.txt file as "excluding" spidering of certain sections. The 1996 proposal added "allowing" of certain areas to the 1994's "disallowing."
QUOTE
Previous of this specification didn't provide the Allow line. The introduction of the Allow line causes robots to behave slightly differently under either specification: If a /robots.txt contains an Allow which overrides a later occurring Disallow, a robot ignoring Allow lines will not retrieve those parts. This is considered acceptable because there is no requirement for a robot to access URLs it is allowed to retrieve, and it is safe, in that no URLs a Web site administrator wants to Disallow are be allowed. It is expected this may in fact encourage robots to upgrade compliance to the specification in this memo.
Is Inktomi using a newer standard that no one is following? Is anyone else? Do the allow lines as described in the 1996 protocol make putting a robots.txt file together more confusing or less?
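The behavior the quoted draft describes can be sketched in a few lines of Python. This assumes the draft's rule that lines are checked in order and the first matching prefix decides; the function name `may_fetch`, the `honors_allow` flag, and the /docs/ paths are made up for illustration and do not come from Inktomi's or anyone else's actual implementation.

```python
def may_fetch(rules, path, honors_allow=True):
    """Check (directive, prefix) rules in order; the first matching prefix decides.

    With honors_allow=False the Allow lines are skipped entirely, which
    models a robot written against the 1994 standard.
    Note: this sketch does not handle an empty "Disallow:" value.
    """
    for directive, prefix in rules:
        if directive == "Allow" and not honors_allow:
            continue
        if path.startswith(prefix):
            return directive == "Allow"
    return True  # no rule matched, so the path is not excluded


# Hypothetical record: an Allow that overrides a later, broader Disallow.
RULES = [
    ("Allow", "/docs/public/"),
    ("Disallow", "/docs/"),
]

newer_robot = may_fetch(RULES, "/docs/public/faq.html")
older_robot = may_fetch(RULES, "/docs/public/faq.html", honors_allow=False)
```

The robot that understands Allow fetches /docs/public/faq.html, while the one that ignores Allow stays out of everything under /docs/. That matches the draft's point: the older robot retrieves less, never more, than the site administrator intended.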
Centenarian Poster, Group: Members
Joined: 5-December 02
Posts: 140
From: World Citizen
|
Dec 9 2002, 05:00 PM |
|
|
QUOTE(bragadocchio)
6. Here are some robots.txt files that you might find interesting: altavista alltheweb (fast)
In the Google file listed above I see the following:
QUOTE(http://www.google.com/robots.txt)
Disallow: /bsd?
Disallow: /linux?
Disallow: /mac?
Disallow: /microsoft?
Disallow: /unclesam?
Obviously, the disallow is self-explanatory, but this is the first time I have seen the question mark after a directory name. Curious if anyone knows the theory behind this one.
Thanks.
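One possible reading, assuming the plain prefix matching of the 1994 standard (a URL is excluded when it starts with the Disallow value): a trailing "?" would exclude only URLs that carry a query string, and leave the bare page alone. The Python sketch below illustrates that reading only. The function name `is_disallowed` and the sample query `q=kernel` are made up, and nothing in the quoted file confirms this is Google's intent.

```python
# Disallow values copied from the quoted robots.txt excerpt.
QUOTED_RULES = ["/bsd?", "/linux?", "/mac?", "/microsoft?", "/unclesam?"]


def is_disallowed(path_and_query, disallow_prefixes):
    """Plain prefix match: excluded if the URL starts with any Disallow value."""
    # Empty values are skipped; an empty Disallow excludes nothing.
    return any(path_and_query.startswith(p) for p in disallow_prefixes if p)


search_results = is_disallowed("/bsd?q=kernel", QUOTED_RULES)  # has a query string
landing_page = is_disallowed("/bsd", QUOTED_RULES)             # bare page
```

Under this reading, /bsd?q=kernel is excluded because it starts with "/bsd?", while /bsd itself is not.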