Restricting Googlebot -- Pros & Cons
Posted 22 April 2012 - 04:23 PM
1. googlebot is responsible for a large percentage of web traffic -- certainly on my servers googlebot hits are responsible for a high percentage of the server load!
2. googlebot can be restricted by one or a combination of robots.txt, webmaster tools and iptables/firewall rules.
So my questions are: will restricting googlebot -- limiting it to fewer hits per website per minute -- hurt me? Will it result in fewer pages getting indexed? Will it downgrade Google's view of the site's importance? Will it result in less traffic coming from Google?
I imagine webmaster tools will spit out some warning...probably the only definitive answer will come from trying it on a site or two...
Posted 22 April 2012 - 10:59 PM
I am not talking about blocking it, but restricting the frequency of googlebot visits... cutting them by 75%, for example, on a busy site... in theory this would cut my server load enormously and avoid having to add servers
Edited by nuts, 22 April 2012 - 11:08 PM.
Posted 22 April 2012 - 11:09 PM
1. Not all googlebots are really Googlebot. It is probably the most popular bot nom de guerre; user-agent strings are easily spoofed. Without automatic reverse-IP-lookup confirmation and associated blocking, Googlebot authenticity can be difficult to ascertain. Note: Google bots, including Googlebot, have been known to operate out of 'stealth' IPs, i.e. IPs not readily connected to Google. 'Tis a fun game we play.
2. I've heard both good and bad googlebot-blocking stories. I have long blocked based on how hard or repetitively a whitelisted SE bot, including Googlebot, hammers a site, without apparent harm; however, the blockage in such instances is only for an hour at a time, and while blocking such whitelisted bots I return a '503 Service Unavailable' response with a 'Retry-After: 3600' header. All by the rules.
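The "block for an hour, answer 503 with Retry-After" idea above can be sketched as a small rate limiter. This is a minimal in-process sketch, not anyone's production setup: the threshold of 600 hits per hour is a hypothetical number you would tune per site, and the per-user-agent keying assumes you have already decided which agents to throttle.

```python
import time

RETRY_AFTER = 3600          # seconds; matches the Retry-After header sent below
MAX_HITS_PER_HOUR = 600     # hypothetical threshold; tune per site

class BotThrottle:
    """Count hits per user agent in a sliding one-hour window and
    answer 503 + Retry-After once a bot exceeds the limit."""
    def __init__(self, max_hits=MAX_HITS_PER_HOUR, window=3600):
        self.max_hits = max_hits
        self.window = window
        self.hits = {}          # user_agent -> list of hit timestamps

    def check(self, user_agent, now=None):
        """Return (status_code, extra_headers) for this hit."""
        now = time.time() if now is None else now
        recent = [t for t in self.hits.get(user_agent, [])
                  if now - t < self.window]
        recent.append(now)
        self.hits[user_agent] = recent
        if len(recent) > self.max_hits:
            return 503, {'Retry-After': str(RETRY_AFTER)}
        return 200, {}
```

Because the window slides, the bot is served normally again once its hit rate drops back under the limit -- exactly the "only for an hour at a time" behaviour described above.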
Posted 22 April 2012 - 11:33 PM
I am looking at googlebot hits in the apache access_log, identified as googlebot by apache hostname lookup -- the bot identifies itself by IP, and apache finds the name. It is a consistent hammering across a large number of domains. If I had only a few domains on a server it would be no big deal, but the total volume contributes a lot to server load.
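Since spoofed user agents came up earlier in the thread: the hostname-lookup identification above can be hardened with the reverse-then-forward DNS check Google itself recommends. Here is a minimal sketch; the resolver functions are injectable parameters (an addition for testability, not part of any standard API), defaulting to the stdlib `socket` calls.

```python
import socket

def verify_googlebot(ip, reverse=None, forward=None):
    """Return True if `ip` passes the double DNS check:
    the PTR record must end in googlebot.com or google.com,
    and that hostname must resolve back to the same IP."""
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or socket.gethostbyname
    try:
        host = reverse(ip)
    except socket.herror:
        return False                      # no PTR record at all
    if not host.endswith(('.googlebot.com', '.google.com')):
        return False                      # PTR points somewhere else
    try:
        return forward(host) == ip        # forward lookup must round-trip
    except socket.gaierror:
        return False
```

The round-trip matters: anyone can publish a PTR record claiming to be `crawl-x.googlebot.com`, but only Google controls the forward zone, so a spoofer's hostname will not resolve back to the spoofer's IP.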
I am also researching caching a large number of pages, deciding between memcache (well known) and couchdb, which claims to keep a record of the cache in a file so that server restarts are much easier. This new box (new to me) has 48 GB of RAM, which should be able to cache millions of pages; rather than querying mysql or oracle, the hits would be much more innocuous as far as server load is concerned. It should also make for faster page loads. Of course I have quite a bit of architecture/programming to think through...
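Whichever backend wins (memcache or couchdb), the caching layer has the same shape: look up the rendered page by URL, and only fall through to the database on a miss. A minimal in-process sketch of that shape, with a plain dict standing in for the real cache store:

```python
import time

class PageCache:
    """In-process page cache sketch (stand-in for memcached/couchdb).
    Stores rendered pages keyed by URL with a TTL, so repeat bot hits
    skip the mysql/oracle query entirely."""
    def __init__(self, ttl=3600):
        self.ttl = ttl
        self.store = {}     # url -> (expires_at, html)

    def get(self, url, now=None):
        now = time.time() if now is None else now
        entry = self.store.get(url)
        if entry and entry[0] > now:
            return entry[1]
        return None         # missing or expired

    def fetch(self, url, render, now=None):
        """Return the cached page, calling render() (the DB hit) on a miss."""
        now = time.time() if now is None else now
        page = self.get(url, now)
        if page is None:
            page = render(url)
            self.store[url] = (now + self.ttl, page)
        return page
```

Swapping the dict for a memcached client changes `get`/`store` calls but not the fetch-or-render logic, which is the part worth settling before picking a backend.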
Posted 23 April 2012 - 09:39 AM
The learning curve of the webdev goes up and up, up and up
Of course I have quite a bit of architecture/programming to think through...
Michael Martinez wrote a still-good primer (the lasting power of basics, strategy, and theory) four years ago (time flies!): Large Web site design theory and crawl management.
Crawl refers to all aspects of search engine crawling. It includes:
1. Crawl rate (how many pages are fetched in a given timeframe)
2. Crawl frequency (how often a search engine initiates a new crawl)
3. Crawl depth (how many clicks deep a search engine goes from a crawl initiation point)
4. Crawl saturation (how many unique pages are fetched)
5. Crawl priority (which pages are used to initiate crawls)
6. Crawl redundancy (how many crawlers are used to crawl a site)
7. Crawl mapping (creating paths for crawlers)
Mapping the above metrics can be a valuable tool in deciding how to manage SE bots.
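Two of the metrics above fall straight out of the access_log once googlebot entries are extracted. A minimal sketch, assuming you have already parsed the log into (unix_timestamp, url) pairs: crawl rate is pages fetched per hour of observed crawling, and crawl saturation is the count of unique URLs fetched.

```python
def crawl_metrics(entries):
    """Compute crawl rate and crawl saturation from a list of
    (unix_timestamp, url) crawl-log entries for one bot."""
    if not entries:
        return {'rate_per_hour': 0.0, 'saturation': 0}
    times = sorted(t for t, _ in entries)
    # Avoid division by zero when all hits share one timestamp.
    span_hours = max((times[-1] - times[0]) / 3600.0, 1 / 3600.0)
    return {
        'rate_per_hour': len(entries) / span_hours,   # metric 1
        'saturation': len({url for _, url in entries}),  # metric 4
    }
```

Tracking these per domain over a few weeks gives a baseline, so after throttling you can see exactly which metric dropped -- rate alone (harmless if saturation holds) or saturation too (pages falling out of the crawl).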
Posted 23 April 2012 - 12:55 PM
Jon, I'm surprised you lost a lot of rankings by blocking Googlebot. I might test that with SEO Theory.