New kind of spider is in town
Posted 15 January 2006 - 03:02 AM
1. It comes from a different web browser, simulating a website visitor
3. Keeps a cache, requesting only new material
The creator claims that the spider is completely AI driven and can be used to accurately determine link popularity. I wonder if this thing is going to live up to expectations?
Posted 15 January 2006 - 07:26 AM
To me it sounded somewhat scary. This is supposedly done for ethical research purposes, but could it be used for evil purposes instead?
Posted 15 January 2006 - 10:34 AM
Fortunately, a reverse-Turing test isn't any more likely to be passed than a Turing test. Here's a very simple way to beat their spider, or indeed, any unwanted spider.
1. Create a new directory on your site and exclude it from being crawled in robots.txt. Then, take a short holiday or spend a few days working on another site. You need to wait until you are absolutely certain that all legitimate spiders have this copy of robots.txt before proceeding. Remember, Googlebot and his brothers don't necessarily read your robots.txt on every single visit. Check your logs!
2. Put an invisible link to this directory at the top of your major entry pages or, better yet, on all of your web pages. This is the bait for your spider trap. You can use CSS or a 1x1 pixel, whatever floats your boat. Personally, I would also include a title or alt attribute specifically telling human beings who might see it, either through a screen reader or view source, to NOT visit the link.
3. You now have a prominent link that no legitimate spider will ever visit and very few legitimate users will ever see. Any IP that hits that link can be assumed to be a bad 'bot with a pretty high degree of accuracy. The index page in this spider trap directory should include a human-readable explanation of what just happened in case any real visitor stumbles into the trap, and perhaps a resolution, depending on what you plan to do with the collected IP addresses.
4. Want to carry it to the next level? Make the index page for your spider trap a PHP script that captures the IP address and automatically writes a couple of new lines to your .htaccess file, immediately banning it from further access to the site. Personally, I would suggest a self-cleaning script that also removes IP bans periodically, say anything older than an hour. Such a system should require almost no maintenance once implemented.
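For anyone who wants to see the bookkeeping behind step 4, here is a minimal sketch. It is in Python rather than the PHP mentioned above, and it only models the ban list; actually writing to .htaccess is left out. The function names, the one-hour constant, and the example IPs are all made up for illustration, and the output uses the older Apache "Deny from" syntax, which may not match your server's configuration.

```python
import time

BAN_TTL_SECONDS = 3600  # assumption: the "older than an hour" cleanup window


def record_trap_hit(bans, ip, now=None):
    """Ban an IP that requested the trap page; bans maps ip -> time of ban."""
    bans[ip] = time.time() if now is None else now


def purge_expired(bans, now=None, ttl=BAN_TTL_SECONDS):
    """Remove bans older than ttl seconds and return the remaining ones."""
    now = time.time() if now is None else now
    expired = [ip for ip, banned_at in bans.items() if now - banned_at > ttl]
    for ip in expired:
        del bans[ip]
    return bans


def render_deny_lines(bans):
    """Render one 'Deny from' line per banned IP, sorted for stable output."""
    return "\n".join(f"Deny from {ip}" for ip in sorted(bans))


# Usage: the trap page would call record_trap_hit() with the visitor's IP,
# and a periodic job would call purge_expired() before regenerating the lines.
bans = {}
record_trap_hit(bans, "203.0.113.7")
purge_expired(bans)
print(render_deny_lines(bans))
```

Keeping the timestamp next to each IP is what makes the self-cleaning part cheap: the cleanup is just a filter on age, so an innocent visitor who stumbles in is let back in automatically.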
Could a spider be written that was smart enough to avoid a spider trap? Probably, but I've very carefully avoided telling you what to name the directory or how to word the links or warnings, because if everyone does their own thing when they implement it, even a smart spider is going to have a hard time avoiding the trap. I would be willing to bet dollars to donuts that Hoffman's little arachnid would quickly fail his reverse-Turing test.
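One way to "do your own thing" is to let the machine pick the name. The sketch below, again in Python, generates an unpredictable directory name, builds the matching robots.txt rule, and uses the standard library's robots.txt parser to confirm that a well-behaved crawler would stay out. The "t-" prefix, the helper names, and the "ExampleBot" user agent are only placeholders, not a recommendation.

```python
import secrets
from urllib.robotparser import RobotFileParser


def make_trap_rule(prefix="t"):
    """Return (directory, robots.txt text) using a random directory name."""
    directory = f"/{prefix}-{secrets.token_hex(4)}/"
    robots_txt = f"User-agent: *\nDisallow: {directory}\n"
    return directory, robots_txt


def compliant_crawler_may_fetch(robots_txt, path, agent="ExampleBot"):
    """Check whether a crawler that obeys robots.txt would request path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


directory, robots_txt = make_trap_rule()
print(robots_txt)
# A compliant crawler is kept out of the trap but not the rest of the site.
print(compliant_crawler_may_fetch(robots_txt, directory + "index.html"))
print(compliant_crawler_may_fetch(robots_txt, "/index.html"))
```

Note that robots.txt is public, so the random name only stops a spider from guessing the trap in advance; a spider that ignores robots.txt and follows the hidden link still walks straight in, which is the whole point.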
Posted 15 January 2006 - 11:11 AM
The man (or woman) can always beat the machine.
Posted 15 January 2006 - 12:26 PM
The bot traps I've used with an automated ban have an unban option as well, with most of the on-page text for the human user in a graphic.
I like the idea (Wired) though. Not novel but nice to see it implemented.