Cre8asiteforums Internet Marketing and Conversion Web Design


Spell-checking Robots.txt


#1 eKstreme

Posted 09 June 2007 - 06:17 PM

Here is a bit of background: One of my tools was starting to be hit by the SE bots and so I blocked access to it using robots.txt. The exact text I used was:

User-agent: *
Disallwow: /deeplinkratio/?


Notice the typo in "Disallow". The tool works by passing the domain name as a query string, and so this rule should block such requests.
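
In hindsight, a few lines of script that compare each field name against the directives crawlers actually recognise would have caught this before it went live. A rough sketch (the directive list and file path here are just illustrative, not an official set):

# Flag any robots.txt line whose field name isn't a directive
# crawlers generally recognise. The directive list is illustrative.
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def check_robots(path="robots.txt"):
    with open(path) as fh:
        for lineno, line in enumerate(fh, 1):
            line = line.split("#", 1)[0].strip()  # strip comments
            if not line or ":" not in line:
                continue
            field = line.split(":", 1)[0].strip().lower()
            if field not in KNOWN_DIRECTIVES:
                print("line %d: unknown directive %r" % (lineno, field))

check_robots()

Run against the file above, it flags line 2 straight away.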

Weeks passed and MSNBot and Googlebot stopped hitting the pages, but Slurp kept at it. I checked the robots.txt file, and thought everything was fine and that Slurp was having a problem figuring out the rule.

Just now, I was mere seconds away from submitting a bug report to the Y! search team. I went to copy/paste the robots.txt file text and discovered the typo (the fact I didn't spot this earlier is another story).

Now this is interesting: clearly Googlebot and MSNBot did not request the tool's pages, but Slurp kept requesting them. Looking through the ~300,000 blocked URLs in Google's Webmaster Central, I found not a single entry for the tool.

So, the hypotheses are as follows:

1. Google and MSN spellcheck the robots.txt file and obey it. Slurp doesn't.
2. None of the bots spellcheck, but Slurp found a lot of links to the tool and proceeded to index them. Google and MSN don't know of any such links.
3. Something else is going on.

Anyone else know more about such a situation? Could be an interesting hidden "feature".
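
For what it's worth, here is what one strict, by-the-book parser does with the typo'd line. This is Python's robotparser purely as a reference point (I have no idea what the engines actually run), and the example.com URL is just a placeholder:

from urllib import robotparser

def allowed(lines, url, agent="*"):
    # Parse the given robots.txt lines and test one URL against them.
    rp = robotparser.RobotFileParser()
    rp.parse(lines)
    return rp.can_fetch(agent, url)

url = "http://example.com/deeplinkratio/?domain=example.org"  # placeholder

# The unknown "Disallwow" line is silently dropped, so nothing is blocked:
print(allowed(["User-agent: *", "Disallwow: /deeplinkratio/?"], url))  # True

# Spelled correctly, the rule matches and the URL is blocked:
print(allowed(["User-agent: *", "Disallow: /deeplinkratio/?"], url))   # False

A bot that behaves like that treats the typo'd file as if the rule weren't there at all, which is exactly what Slurp appeared to be doing.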

Incidentally, I just updated the robots.txt file a few minutes ago, so I will be able to accurately measure how long Y! Slurp takes to start obeying it :lol:

Pierre

#2 JohnMu

Posted 10 June 2007 - 02:19 PM

I seem to remember that Google is somewhat "generous" when processing a robots.txt file (tolerating misspellings, wrong formatting, blank lines after the user-agent line, etc.), but I can't remember where I got that from ...

I can't make up my mind whether or not it's good to ignore the standard... what do you think? Block more or block less when in doubt?
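
Purely as a thought experiment, "generous" processing could be as simple as fuzzy-matching unknown field names against the known directives before giving up on a line. Nothing here reflects what any engine actually does; the directive list and cutoff are made up:

import difflib

KNOWN = ["user-agent", "disallow", "allow", "sitemap"]

def normalize_field(field):
    # Accept a known field name outright, otherwise fall back to the
    # closest known directive (cutoff chosen arbitrarily).
    field = field.strip().lower()
    if field in KNOWN:
        return field
    close = difflib.get_close_matches(field, KNOWN, n=1, cutoff=0.8)
    return close[0] if close else None

print(normalize_field("Disallwow"))  # 'disallow'
print(normalize_field("disallows"))  # 'disallow'
print(normalize_field("foobar"))     # None

The flip side is that a parser like this will sometimes block things the webmaster never meant to block, which is the block-more-or-block-less question again.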

John

#3 bragadocchio

Posted 10 June 2007 - 02:35 PM

I had a problem with one of my robots.txt files not too long ago, and had used "disallows" instead of "disallow" on one line.

Pages along that path weren't being disallowed by any of the robots.

Would be interested in seeing where you may have gotten that impression, John, if you can find it.

I would have to say that I hope the search engines block less when in doubt. I'll also add that the robots.txt exclusion protocol is in serious need of a translation into plain English.

This recent paper is probably worth a look:

A Large-Scale Study of Robots.txt

Their conclusion is that "a better-specified, official standard is needed."

They do discuss some of the issues and problems they have seen in robots.txt files, such as incorrect user-agent names, ambiguous rules, and conflicting rules, and they note that the exclusion protocol provides no guidance on how a robot might react to such problems.
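
A conflicting pair is a good illustration of that gap. Nothing in the protocol says which of these two rules should win (the path and URL are placeholders); Python's robotparser, as one reference implementation, happens to take the first rule that matches, and another crawler could just as reasonably decide the opposite:

from urllib import robotparser

rules = [
    "User-agent: *",
    "Allow: /tools/",
    "Disallow: /tools/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# First matching rule wins in this implementation, so the URL is allowed.
print(rp.can_fetch("*", "http://example.com/tools/page"))  # True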

#4 JohnMu

Posted 10 June 2007 - 02:48 PM

I'll have to search for that, Bill. I hate mentioning things that I don't have the direct source for anymore.... Off to dig in the bookmarks....

I'm kind of unhappy that they say "large-scale" and then explain that they checked fewer than 8,000 sites ;). Surely there are larger data sets somewhere. It would be interesting to get some smaller sites in there as well, though you'd think the Fortune 1000 sites would at least get it right :). It would be fun to get a view into the collection from a major search engine ...

John

#5 eKstreme

Posted 10 June 2007 - 03:06 PM

Oh, we definitely need a better standard, as the current one is very broken for modern times.

As I mentioned in my second hypothesis, what I'm seeing could be a simple side-effect of Yahoo! finding a whole list of links to the Deep Link Ratio Calculator that the other SEs haven't found (which is interesting in its own right). However, I *know* there are a ton of links like these out there, especially on forums. It would be very interesting if G & M ignored these links but Yahoo didn't.

And I agree with you, John: 8000 sites is not large scale at all. 80000 would be better :)

Pierre


