
Cre8asiteforums Internet Marketing
and Conversion Web Design



New kind of spider is in town


4 replies to this topic

#1 ignat

ignat

    Whirl Wind Member

  • Members
  • 67 posts

Posted 15 January 2006 - 02:32 AM

Have any of you seen this article from Wired? It discusses a new type of web crawler that imitates the behavior of a regular web surfer. Here's a brief description of the spider:

1. It comes from a different web browser, simulating a website visitor

2. Downloads everything that comes with a page (Flash, JavaScript, ActiveX)

3. Keeps a cache, requesting only new material

The creator claims that the spider is completely AI driven and can be used to accurately determine link popularity. I wonder if this thing is going to live up to expectations?

#2 bwelford

bwelford

    Peacekeeper Administrator

  • Site Administrators
  • 9005 posts

Posted 15 January 2006 - 06:56 AM

Yes, ignat, I read that too. The article suggested this was done for research purposes, to test those websites that say they can detect whether it's a bot or a human doing the downloading. Even if websites can prevent automatic downloading, it was suggested they wouldn't think to ban this 'human-type' activity. It's a sort of reverse-Turing test, as the article mentioned.

To me it sounded somewhat scary. This is supposedly done for ethical research purposes, but could it be used for evil purposes instead? :(

#3 Ron Carnell

Ron Carnell

    Honored One Who Served Moderator Alumni

  • Invited Users For Labs
  • 2062 posts

Posted 15 January 2006 - 10:04 AM

In my opinion, any robot that doesn't ask for and then follow robots.txt is, by definition, unethical. They can argue intent all day long, but actions still speak louder than words.

Fortunately, a reverse-Turing test isn't any more likely to be passed than a Turing test. Here's a very simple way to beat their spider, or indeed, any unwanted spider.

1. Create a new directory on your site and exclude it from being crawled in robots.txt. Then, take a short holiday or spend a few days working on another site. You need to wait until you are absolutely certain that all legitimate spiders have this copy of robots.txt before proceeding. Remember, Googlebot and his brothers don't necessarily read your robots.txt on every single visit. Check your logs!

2. Put an invisible link to this directory at the top of your major entry pages or, better yet, on all of your web pages. This is the bait for your spider trap. You can use CSS or a 1x1 pixel, whatever floats your boat. Personally, I would also include a title or alt attribute specifically telling human beings who might see it, either through a screen reader or view source, to NOT visit the link.

3. You now have a prominent link that no legitimate spider will ever visit and very few legitimate users will ever see. Any IP that hits that link can be assumed to be a bad 'bot with a pretty high degree of accuracy. The index page in this spider trap directory should include a human-readable explanation of what just happened in case any real visitor stumbles into the trap, and perhaps a resolution, depending on what you plan to do with the collected IP addresses.

4. Want to carry it to the next level? Make the index page for your spider trap a PHP script that captures the IP address and automatically writes a couple of new lines to your .htaccess file, immediately banning it from further access to the site. Personally, I would suggest a self-cleaning script that also removes IP bans periodically, say anything older than an hour. Such a system should require almost no maintenance once implemented.
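
As a rough illustration of the self-cleaning ban list in step 4, here is a minimal sketch. It is written in Python rather than PHP purely for illustration, and it is not anyone's actual script. The function names, the one-hour TTL constant, and the in-memory dictionary are all assumptions; a real version would persist the ban list and write the result into .htaccess. The deny lines use the older Apache `Order`/`Allow`/`Deny` directives, which newer Apache versions only accept through a compatibility module.

```python
import ipaddress
import time

BAN_TTL_SECONDS = 3600  # assumption: drop bans older than an hour


def record_hit(bans, ip, now=None):
    """Remember when `ip` requested the trap directory.

    Rejects anything that is not a valid IP address, so arbitrary
    text can never end up in the generated config lines.
    """
    ipaddress.ip_address(ip)  # raises ValueError on bad input
    bans[ip] = time.time() if now is None else now


def prune(bans, now=None, ttl=BAN_TTL_SECONDS):
    """Self-cleaning step: forget bans older than `ttl` seconds."""
    now = time.time() if now is None else now
    expired = [ip for ip, banned_at in bans.items() if now - banned_at > ttl]
    for ip in expired:
        del bans[ip]


def render_htaccess(bans):
    """Build the access-control lines for the currently banned IPs."""
    lines = ["Order Allow,Deny", "Allow from all"]
    lines += [f"Deny from {ip}" for ip in sorted(bans)]
    return "\n".join(lines) + "\n"
```

The trap's index page would call `record_hit` with the visitor's address, then `prune` and `render_htaccess` to regenerate the ban lines; running `prune` on every hit, or from a scheduled job, is what keeps the list from growing without maintenance.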

Could a spider be written that was smart enough to avoid a spider trap? Probably, but I've very carefully avoided telling you what to name the directory or how to word the links or warnings, because if everyone does their own thing when they implement it, even a smart spider is going to have a hard time avoiding the trap. I would be willing to bet dollars to donuts that Hoffman's little arachnid would quickly fail his reverse-Turing test. :(

#4 bwelford

bwelford

    Peacekeeper Administrator

  • Site Administrators
  • 9005 posts

Posted 15 January 2006 - 10:41 AM

Great post, Ron. :applause:

The man (or woman) can always beat the machine. :)

#5 Ruud

Ruud

    Hall of Fame

  • Hall Of Fame
  • 4887 posts

Posted 15 January 2006 - 11:56 AM

Bot traps are very interesting to play with. I can't say you learn astounding amounts from them, but it is very interesting to see what some spiders from Asian IPs are doing, for instance.

The bot traps I've used with an automated ban have an unban option as well, with most of the on-page text for the human user in a graphic :)

I like the idea (Wired) though. Not novel but nice to see it implemented.


