If you run a Web crawler you are a filthy spammer UNLESS you do the following:
* Publish a user-agent on your Website.
* Identify the user-agent in every fetch request the crawler makes.
* Honor the Robots Exclusion Standard (Cf. http://www.robotstxt.org/).
Your classification as filthy bottom-feeding scum for running a rogue crawler is not negotiable.
So the Rules for Crawlers Are Simple
Rule No. 1: Respect the Website. This means your crawler should identify itself and it should honor the Robots Exclusion Standard.
Rule No. 2: Make No Exceptions. If in your bid to become the next big tech billionaire you find that a million Websites are blocking your user-agent and IP address, STAY OFF THOSE SITES. We don’t owe you anything. You owe US.
Rule No. 3: Don’t Take it Personally. The Internet was not created so that you can make billions of dollars by stealing other people’s content and using it for your own benefit. So when you choose to do this and treat us as nothing more than ants to be crushed, we’re going to consider you to be bottom-feeding filth because that is what you are.
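For anyone who wants to see what Rule No. 1 looks like in practice, here is a minimal sketch using Python's standard urllib.robotparser module. The bot name, the example.com URLs and the helper names (allowed, request_headers) are hypothetical, made up for illustration; a real crawler would fetch each site's robots.txt itself and would need more care than this.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical identity: a real crawler would publish this exact string
# (and a page describing the bot) on its own Website.
USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot.html)"


def request_headers(user_agent: str = USER_AGENT) -> dict:
    """Headers to send with every fetch so the crawler identifies itself."""
    return {"User-Agent": user_agent}


def allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    sample = "User-agent: *\nDisallow: /private/\n"
    for url in ("https://example.com/index.html",
                "https://example.com/private/page.html"):
        verdict = "fetch" if allowed(sample, url) else "skip"
        print(verdict, url, request_headers())
```

The point of the sketch is the order of operations: check the exclusion rules first, and if the answer is no, do not make the request at all.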
While I loved his rant and generally shouted Yes!!! or Huzzah! with every sentence, I do disagree with this: "...Google honors the Robots Exclusion Standard..."
Not always. And neither do Bing and the other major search engines. They do run 'cloaked' bots, headless browsers, etc., and they fudge the 'meaning' of bot, user-agent, et al. to excuse what looks like rogue crawler and fetcher behaviour.
It still behooves webdevs to build in appropriate bot defences in depth. Unfortunately I still know of no simple plug-and-play version to recommend, but the parameters are only a search or three away, some even with basic code examples (caveat emptor). A typical website analysis shows 15-20% of its traffic is bot; some exceed 80-90%. That is a lot of scraped content, wasted bandwidth, diluted conversion stats, etc. I even block most Google and Bing bots and other critters because they do not offer value in return; your needs may differ.
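As a starting point for measuring your own bot share, here is a rough sketch in Python that classifies requests by their user-agent string. The token list and the function names (looks_like_bot, bot_share) are my own assumptions, not a recommended blocklist. Since rogue crawlers routinely fake a browser user-agent, treat the result as a lower bound and as only the first layer of any defence.

```python
# Hypothetical substrings that often appear in self-declared bots and
# scripted clients; illustrative only, not an authoritative list.
BOT_TOKENS = ("bot", "crawler", "spider", "scrapy", "python-requests", "curl")


def looks_like_bot(user_agent: str) -> bool:
    """Heuristic: an empty user-agent or a known token counts as a bot."""
    ua = user_agent.strip().lower()
    return not ua or any(token in ua for token in BOT_TOKENS)


def bot_share(user_agents) -> float:
    """Fraction of requests whose user-agent looks like a bot (0.0 to 1.0)."""
    agents = list(user_agents)
    if not agents:
        return 0.0
    return sum(looks_like_bot(ua) for ua in agents) / len(agents)


if __name__ == "__main__":
    # Made-up sample; in practice these strings come from your access log.
    sample = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/120.0",
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15",
        "",
    ]
    print(f"bot share: {bot_share(sample):.0%}")
```

Cloaked bots and headless browsers will slip straight past a check like this, which is why it has to be backed up with IP, rate and behaviour checks.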
Great read. Thanks, Michael.