I pushed my small server into publishing the AOL search database (several GB of data) and set up an interface to query it. All queries are done through parameters in the URL. All database query results are cached, since the queries can take a bit of time (usually around 10 seconds, sometimes up to a minute). The result pages are filled with links to other queries so that users can "click and play" (and wait).
If a search engine crawler were to stumble upon those pages, it could go crazy following links and push my server into a "meltdown".
I currently have the following set up to avoid that, but I'm not sure if it's the best way:
- "bad bots" (recognized by user-agent, IP or referrer) are treated to a 404 (with a small captcha-like field to authenticate real users and allow them access)
- "good bots" (+the Mediabot) are allowed access and shown the cached data.
- When "good bots" try to access pages which are not yet cached (not visited by a human visitor), they are treated to a 500 (server error; again with a captcha-like element for real users)
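
To make the three rules above concrete, here is a rough sketch of the decision logic in Python. It is only an illustration, not the actual code running on the server: the user-agent fragments, the function names and the `is_cached` flag are hypothetical placeholders, and the IP and referrer checks are left out.

```python
# Hypothetical user-agent fragments, for illustration only. A real setup
# would use longer lists and also look at IP address and referrer.
GOOD_BOTS = ("googlebot", "mediapartners-google")
BAD_BOTS = ("badbot", "scraper")


def classify(user_agent):
    """Return 'bad_bot', 'good_bot' or 'human' based on the user-agent string."""
    ua = user_agent.lower()
    if any(name in ua for name in BAD_BOTS):
        return "bad_bot"
    if any(name in ua for name in GOOD_BOTS):
        return "good_bot"
    return "human"


def status_for(user_agent, is_cached):
    """Pick the HTTP status code according to the three rules in the list."""
    kind = classify(user_agent)
    if kind == "bad_bot":
        return 404  # bad bots always get a 404 (plus the captcha-like field)
    if kind == "good_bot" and not is_cached:
        return 500  # good bots never trigger a fresh, slow database query
    return 200      # humans, and good bots asking for cached pages
```

With these placeholder lists, a Googlebot request for an uncached page would get a 500, while the same request for a cached page (or any request from a normal browser) would get a 200.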
My main question is: will the "good bots" treat the site differently if they are shown error code "500" for all the non-cached URLs they find? So far, I've noticed that the Mediabot keeps retrying the URL (with a longer wait in between each attempt) -- the Mediabot isn't that much of a problem, since it only accesses pages which a visitor has already opened anyway.
John






