Cre8asiteforums

Web Site Design, Usability, SEO & Marketing Discussion and Support

nuts

Restricting Googlebot -- Pros & Cons

Recommended Posts

Ok, this may be more complex than I am imagining, but here are a couple of assumptions:

 

1. googlebot is responsible for a large percentage of web traffic -- certainly on my servers googlebot hits are responsible for a high percentage of the server load! (See the rough measurement sketch after these two assumptions.)

 

2. googlebot can be restricted by one or a combination of robots.txt, webmaster tools and iptables/firewall rules.
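
For what it's worth, a rough way to check assumption 1 on a given box is to tally what fraction of requests in the access_log claim to be googlebot. A minimal Python sketch, assuming the standard Apache combined log format and a placeholder log path (not my real setup):

# Rough tally of Googlebot's share of requests in an Apache combined log.
# Assumes the user-agent is the last double-quoted field on each line.
import re

LOG_PATH = "/var/log/apache2/access_log"   # placeholder path
ua_pattern = re.compile(r'"([^"]*)"\s*$')  # last quoted field = user-agent

total = 0
googlebot = 0
with open(LOG_PATH, errors="replace") as log:
    for line in log:
        total += 1
        match = ua_pattern.search(line)
        if match and "Googlebot" in match.group(1):
            googlebot += 1

if total:
    share = 100.0 * googlebot / total
    print(f"{googlebot}/{total} requests ({share:.1f}%) claim to be Googlebot")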

 

So my questions are: will restricting googlebot -- limiting it to fewer hits per website per minute -- hurt me? Will it result in fewer pages getting indexed? Will it downgrade google's view of the site's importance? Will it result in less traffic coming from google?

 

I imagine webmaster tools will spit out some warning...probably the only definitive answer will come from trying it on a site or two...

 

Comments welcome

 

Cheers

Mike

I blocked googlebot accidentally once and rankings plummeted pretty fast.

Well there's a contradiction...

 

I am not talking about blocking it, but about restricting the frequency of googlebot's visits... cutting it by 75%, for example, on a busy site... in theory this would cut my server load enormously and avoid having to add servers.

Edited by nuts

Two thoughts for your consideration:

1. Not all googlebots are really googlebot. It is probably the most popular bot nom de guerre; user-agent strings are easily spoofed. Without automated reverse IP lookup confirmation and associated blocking, googlebot authenticity can be difficult to ascertain (see the first sketch after these two points). Note: Google bots, including googlebot, have been known to operate out of 'stealth' IPs, i.e. IPs not readily connected to Google. 'Tis a fun game we play. :)

 

2. I've heard both good and bad googlebot blocking stories. I have long blocked based on how hard or repetitively a whitelisted SE bot, including googlebot, hammers a site, without apparent harm; however, the blockage in such instances is only for an hour at a time, and while a whitelisted bot is blocked it receives a '503 Service Unavailable' header response including 'Retry-After: 3600'. All by the rules :)
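
On point 1, the usual authenticity check is forward-confirmed reverse DNS: reverse-resolve the requesting IP, require a hostname under googlebot.com or google.com, then forward-resolve that hostname and require it to map back to the same IP. A minimal Python sketch (the IP shown is just an illustrative example):

# Forward-confirmed reverse DNS check for a claimed googlebot IP.
import socket

def is_real_googlebot(ip):
    try:
        hostname = socket.gethostbyaddr(ip)[0]               # reverse lookup
    except (socket.herror, socket.gaierror):
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]   # forward lookup
    except socket.gaierror:
        return False
    return ip in forward_ips                                  # must round-trip

print(is_real_googlebot("66.249.66.1"))  # example IP from a Googlebot range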
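
On point 2, the 503-with-Retry-After response itself is easy to emit at the application layer once you have decided a bot is over its budget. A minimal WSGI-style Python sketch; over_crawl_limit() is a hypothetical placeholder for whatever per-hour threshold check you use:

# Minimal WSGI sketch: answer an over-eager whitelisted bot with
# "503 Service Unavailable" plus "Retry-After: 3600" for the next hour.

def over_crawl_limit(environ):
    # Hypothetical placeholder: return True when the (verified) bot has
    # exceeded your per-hour request budget for this site.
    return False

def application(environ, start_response):
    if over_crawl_limit(environ):
        start_response("503 Service Unavailable", [
            ("Retry-After", "3600"),
            ("Content-Type", "text/plain"),
        ])
        return [b"Crawl limit reached; please retry in an hour.\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Normal page response.\n"]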

Hi iamlost

 

I am looking at googlebot hits in the apache access_log, identified as googlebot by apache's hostname lookup -- the bot shows up by IP, and apache resolves the name. It is a consistent hammering across a large number of domains. If I had only a few domains on a server it would be no big deal, but the total volume contributes a lot to server load.

 

I am also researching caching a large number of pages, deciding between memcached (well known) and couchdb, which claims to keep a record of the cache in a file so that server restarts are much easier. This new box (new to me) has 48GB of RAM, which should be able to cache millions of pages, so rather than querying mysql or oracle, the hits would be much more innocuous as far as server load is concerned. It should also make for faster page loads. Of course I have quite a bit of architecture/programming to think through...
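
The caching side, at least with memcached, is simple in principle: key the rendered page by its URL, serve from the cache on a hit, and only fall back to the database render on a miss. A minimal Python sketch using the pymemcache client; render_from_database() is just a stand-in for whatever actually builds a page:

# Cache rendered pages in memcached, keyed by a hash of the URL;
# fall back to the database-backed render only on a cache miss.
import hashlib
from pymemcache.client.base import Client

cache = Client(("127.0.0.1", 11211))

def render_from_database(url):
    # Stand-in for the real mysql/oracle-backed page build.
    return f"<html><body>Rendered page for {url}</body></html>"

def get_page(url):
    key = hashlib.sha1(url.encode("utf-8")).hexdigest()   # safe memcached key
    cached = cache.get(key)
    if cached is not None:
        return cached.decode("utf-8")                     # cache hit: no database work
    page = render_from_database(url)
    cache.set(key, page.encode("utf-8"), expire=3600)     # cache for an hour
    return page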

Of course I have quite a bit of architecture/programming to think through...

 

The learning curve of the webdev goes up and up, up and up :D

 

Michael Martinez wrote a (still!) good primer -- the lasting power of basics, strategy, and theory :) -- four (!!! time flies :() years ago: Large Web site design theory and crawl management.

 

Crawl refers to all aspects of search engine crawling. It includes:

 

1. Crawl rate (how many pages are fetched in a given timeframe)

2. Crawl frequency (how often a search engine initiates a new crawl)

3. Crawl depth (how many clicks deep a search engine goes from a crawl initiation point)

4. Crawl saturation (how many unique pages are fetched)

5. Crawl priority (which pages are used to initiate crawls)

6. Crawl redundancy (how many crawlers are used to crawl a site)

7. Crawl mapping (creating paths for crawlers)

 

 

Mapping the above metrics can be a valuable tool in deciding how to manage SE bots.
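
As one small example of such mapping: crawl rate (1) and a rough proxy for crawl saturation (4) can be pulled straight from the access log by bucketing googlebot fetches per hour and counting distinct URLs. A Python sketch, again assuming the Apache combined log format and an illustrative log path, and matching on user-agent only (verify the IPs separately):

# Bucket Googlebot fetches per hour (crawl rate) and count distinct URLs
# (a rough proxy for crawl saturation) from an Apache combined log.
import re
from collections import Counter

LOG_PATH = "/var/log/apache2/access_log"                 # placeholder path
stamp_re = re.compile(r'\[(\d{2}/\w{3}/\d{4}):(\d{2})')  # date and hour
request_re = re.compile(r'"(?:GET|POST|HEAD) (\S+)')     # requested URL

per_hour = Counter()
urls = set()

with open(LOG_PATH, errors="replace") as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        stamp = stamp_re.search(line)
        request = request_re.search(line)
        if stamp and request:
            per_hour[f"{stamp.group(1)} {stamp.group(2)}:00"] += 1
            urls.add(request.group(1))

# Note: lexicographic sort of the hour keys; fine for a single day's log.
for hour, hits in sorted(per_hour.items()):
    print(hour, hits)
print("distinct URLs fetched:", len(urls))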

You cannot control Googlebot's crawl rate but you SHOULD be able to control where your Web statistics data is stored. It should not be in a Web-accessible folder.

 

Jon, I'm surprised you lost a lot of rankings by blocking Googlebot. I might test that with SEO Theory.
