Search engine bot control
Posted 17 August 2006 - 03:03 AM
I pushed my small server into publishing the AOL search database (several GB of data) and set up an interface to query it. All queries are done through parameters in the URL. All database query results are cached, since the queries can take a bit of time (usually around 10 seconds, sometimes up to a minute). The result pages are filled with links to other queries so that users can "click and play" (and wait).
If a search engine crawler were to stumble upon those pages, it could go crazy following links and push my server into a "meltdown".
I currently have the following set up to avoid that, but I'm not sure if it's the best way:
- "bad bots" (recognized by user-agent, IP or referrer) are treated to a 404 (with a small captcha-like field to authenticate real users and allow them access)
- "good bots" (+the Mediabot) are allowed access and shown the cached data.
- When "good bots" try to access pages which are not yet cached (not visited by a human visitor), they are treated to a 500 (server error; again with a captcha-like element for real users)
My main question is: will the "good bots" treat the site differently if they are shown error code "500" for all the non-cached URLs they find? So far, I've noticed that the Mediabot will keep re-trying the URL (with a longer wait in between) -- the Mediabot isn't that much of a problem, since it only accesses pages which were opened anyway (except for a small bug in their URL parsing). Could the search engines assume that the server went offline? Or will they reduce the crawl frequency in general for the whole site (e.g. should I instead move the DB to a separate domain)? Or should I remove all non-cached URLs as links from the pages when a search engine crawler comes by (that would take a bit of processing power, but if it's worth it...)?
Posted 17 August 2006 - 03:41 AM
The other thing: add a captcha to the searches themselves. This will trap spammers, too.
Posted 17 August 2006 - 04:13 AM
Because I don't mind if the (good) bots read my cached pages. It's just that I don't want them to crawl the whole database and block access for the "real" users.
Why don't you have a special search script that is blocked by robots.txt? This search script will handle all queries for you.
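For example, a robots.txt entry along these lines (the "/search.php" name is just a placeholder for whatever the query script is called) would keep compliant crawlers away from the live queries while the rest of the site stays crawlable:

User-agent: *
Disallow: /search.php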
Adding a captcha to all queries would be a possibility, but it would be hard to get it to work "right" (I think). I suppose it could work for the whole session, letting the user enter the captcha once (and only if real database queries are being done), hmm...
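A once-per-session gate could look roughly like this sketch -- is_cached(), check_captcha() and show_captcha_form() are made-up helper names, not anything that exists yet:

<?php
// Sketch of a once-per-session captcha gate for expensive queries.
// is_cached(), check_captcha() and show_captcha_form() are hypothetical helpers.
session_start();

$q          = isset($_GET['q']) ? $_GET['q'] : '';
$needsQuery = !is_cached($q);   // only gate requests that hit the database

if ($needsQuery && empty($_SESSION['captcha_passed'])) {
    if (isset($_POST['captcha']) && check_captcha($_POST['captcha'])) {
        $_SESSION['captcha_passed'] = true;  // solved once, good for the whole session
    } else {
        show_captcha_form();                 // ask before running the real query
        exit;
    }
}

// At this point the result is either cached or the captcha was solved,
// so it is safe to run the expensive database query.
?>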
Posted 17 August 2006 - 04:52 AM
Case in point: one site of mine has a guestbook. Spammers loved it for two reasons: it was easy to post to for a link from a good domain, and easy to scrape for email addresses and more links. End result? It became the most visited page on the site at one point!
So I added a captcha to even read the guestbook. That cut requests by more than 80% (I'm not kidding!) I still get spammers visiting, but I know they won't come back.
Funny thing is: I was once reading through the log files and noticed a typical scraper pattern from one of the IP addresses. A few minutes later the captcha was submitted, and the same scraping pattern started again. It seems to me that the spammer personally visited the guestbook to manually kick-start any session-based captchas, but got thwarted. It made me happy.
So, again, given how much strain your database puts on your server, I would put a captcha on every search. I know it's inconvenient, but a good explanation would go a long way toward alleviating any pain your visitors might feel. Everyone understands spam and everyone hates it. They'll understand.
Posted 17 August 2006 - 07:57 AM
The cached pages are an AdSense goldmine waiting to be mined. So start thinking of the cache as a good thing: it will grow, and as it grows, real queries will take less and less time. (You will need plenty of storage space!)
The problem is as follows:
1. Let real users through - This is not difficult with a little turing test.
2. Let good bots in - no problem, just exclude the bad bots!
3. How to exclude bad bots
"Bad bots" (recognized by user-agent, IP or referrer) are treated to a 404 (with a small captcha-like field to authenticate real users and allow them access).
Agreed this is a good strategy.
When "good bots" try to access pages which are not yet cached (not visited by a human visitor), they are treated to a 500 (server error; again with a captcha-like element for real users).
I do not think the search engines will like this. Rather, send them to a page where the query is run automatically using PHP, the page created on the fly and cached. That way the bots will also create all the cache you need! Or cross-check all the links and create the full cache yourself via your own 'bot'.
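A minimal sketch of that "fill the cache as you go" idea -- the cache path and the render_results()/run_db_query() helpers are only illustrative names:

<?php
// Sketch: serve from cache if present, otherwise run the query once,
// cache the rendered page, and serve it. Paths and helpers are illustrative.
$query     = isset($_GET['q']) ? $_GET['q'] : '';
$cacheFile = '/var/cache/aol-db/' . md5($query) . '.html';

if (file_exists($cacheFile)) {
    readfile($cacheFile);                             // already cached: cheap
} else {
    $html = render_results(run_db_query($query));     // hypothetical helpers
    file_put_contents($cacheFile, $html);             // cache for the next visitor or bot
    echo $html;
}
?>

This way every request, whether from a bot or a human, leaves a cached page behind for the next one.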
Goldmine, John, do not let it go!
Edited by yannis, 17 August 2006 - 07:58 AM.
Posted 22 August 2006 - 10:21 AM
The correct code would be 503 (Service Unavailable).
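If you go that route, a couple of lines like these (just a sketch, the one-hour value is arbitrary) tell well-behaved crawlers the condition is temporary and when to come back, rather than making the page look broken:

<?php
// Sketch: answer not-yet-cached pages with 503 + Retry-After instead of 500,
// so crawlers treat it as a temporary condition rather than a dead page.
header('HTTP/1.1 503 Service Unavailable');
header('Retry-After: 3600');   // suggest trying again in an hour (arbitrary value)
?>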
The error 500 seems to have scared the Googlebot off for a few days! Even the Mediabot is not visiting my site anymore and certainly NOT targeting those database entries. I have even changed paths in the URL and reduced it from 4 parameters to 2, moving the real query to the end. AdSense partially targets based on the URL, but the Mediabot is not visiting the pages to see what is to be found. Perhaps this is just for a short time? Perhaps there is some sort of "OMG - the server is dead" timeout (several days?) after a whole bunch of 500 errors.
The Googlebot has now returned, crawling 1-5 pages/hour from the database; the 503 errors don't seem to bother it much, it keeps going anyway. It's nowhere near the usual crawl frequency the site had, however. I bet it's adjusting the crawl speed based on the errors.
The Mediabot is, like I said, not coming at all anymore. It's visiting the other URLs on my site as visitors come, just not the AOL database ones. The funny thing is that I can provoke the Mediabot to come for a visit with the "right" parameters attached to the end of the URL (I've been playing with that and have sent some mails to Google about it). It just doesn't visit the natural URLs in the AOL database section of the site... Hmm... (or maybe the server is caching the accesses and not passing them through to the application? That would be neat -- off to check the server logs tonight, otherwise off to install a packet sniffer later on).
Posted 22 August 2006 - 03:27 PM
John :embarrassed: :embarrassed:
Posted 22 August 2006 - 06:18 PM
Posted 23 August 2006 - 01:13 AM
Once I get the time, I'll move it to a separate domain (or subdomain) and try to figure out a way to provide value on the database pages themselves: perhaps allow people to comment on them, rate them, link out to other articles, etc.
What I also want to work out is how to provide a way for people to submit "statistics" that will be run (and cached) in the off-hours. I'm sure that, given the chance, people will come up with some things I'd never think of, and it would be great to offer them a way to do that without having to import all that data (and of course great to get links from their postings about their findings).
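One way that could be set up is a simple request queue worked through by a nightly cron job; here's a rough sketch where the table, columns, connection details and render_stats_table() helper are all invented for illustration:

<?php
// Sketch of an off-hours processing queue. Users submit requests during the
// day; a nightly cron job (e.g. "php run_jobs.php") runs them and caches the output.
// Table, column and function names here are invented for illustration.
$db = new PDO('mysql:host=localhost;dbname=aoldata', 'user', 'pass');

$jobs = $db->query("SELECT id, keyword FROM stat_requests WHERE status = 'pending'");
foreach ($jobs as $job) {
    $stmt = $db->prepare('SELECT clicked_url, COUNT(*) AS hits
                            FROM searches WHERE query = ?
                           GROUP BY clicked_url ORDER BY hits DESC');
    $stmt->execute(array($job['keyword']));                     // the heavy query
    file_put_contents('/var/cache/stats/' . (int)$job['id'] . '.html',
                      render_stats_table($stmt->fetchAll()));   // hypothetical renderer
    $db->exec("UPDATE stat_requests SET status = 'done' WHERE id = " . (int)$job['id']);
}
?>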
Why can't the day have a bunch of extra hours?
Posted 23 August 2006 - 04:17 AM
Depends on what you're trying to do. Let's take your myspace example: suppose you have a page on your site about myspace, or even a series of pages, each one discussing some aspect of myspace. You can (and the gray/black hat in me says you should) add a secondary page, say on a blog or a tool's output (your case), that lists all the keywords related to myspace. One idea would be to use the Overture keywords data, or the AOL data in your case.
But regarding the "value" of it all for the search engines, I think you're probably right, Ammon. In the end, a compiled table of lots of keywords is not really worth that much -- sure, visitors might find it "on accident" when searching, but it's not like they'll find what they're looking for in a database like that (I doubt people searching for "myspace" will want to see what other people who also searched for "myspace" went on to look at).
On a good domain name, the secondary page will rank for pretty much all the keywords in the list, higher for some than for others. The end result is a very wide funnel that you can use to capture traffic to your primary pages.
So, if you think about it this way, you might have created the ultimate traffic capturing code!
As a start, I would cache the top keyword searches you have done, link them up to some articles on the subject, throw in some AdSense, and you get some money off it. Even better, get some people to write some unique articles for you in exchange for a link.
My 2-darker-than-white cents.
PS - I'm talking from experience on this one.