
Cre8asiteforums Internet Marketing and Conversion Web Design



Restricting Googlebot -- Pros & Cons


7 replies to this topic

#1 nuts


Posted 22 April 2012 - 04:23 PM

Ok, this may be more complex than I am imagining, but here are a couple of assumptions:

1. googlebot is responsible for a large percentage of web traffic -- certainly on my servers googlebot hits are responsible for a high percentage of the server load!

2. googlebot can be restricted by one or a combination of robots.txt, webmaster tools and iptables/firewall rules.

So my questions are: will restricting googlebot -- limiting it to fewer hits per website per minute -- hurt me? Will it result in fewer pages getting indexed? Will it downgrade google's view of the importance of the site? Will it result in less traffic coming from google?

I imagine webmaster tools will spit out some warning...probably the only definitive answer will come from trying it on a site or two...

Comments welcome

Cheers
Mike

#2 Michael_Martinez


Posted 22 April 2012 - 05:18 PM

I have blocked Googlebot for periods of up to 2 weeks without seeing any adverse effects. Your mileage may vary.

#3 jonbey


Posted 22 April 2012 - 06:56 PM

I blocked googlebot accidentally once and rankings plummeted pretty fast.

#4 nuts


Posted 22 April 2012 - 10:59 PM

Well there's a contradiction...

I am not talking about blocking it, but restricting the frequency of googlebot visits... cut it by 75%, for example, on a busy site... in theory this would cut my server load enormously and avoid having to add servers.

Edited by nuts, 22 April 2012 - 11:08 PM.


#5 iamlost


Posted 22 April 2012 - 11:09 PM

Two thoughts for your consideration:
1. not all googlebots are really googlebot. It is probably the most popular bot nom de guerre; user-agent strings are easily spoofed. Without automated reverse IP lookup confirmation and associated blocking, googlebot authenticity can be difficult to ascertain (a sketch of that check follows below). Note: Google bots including googlebot have been known to operate out of 'stealth' IPs, i.e. ones not readily connected to Google. 'Tis a fun game we play. :)

2. I've heard both good and bad googlebot blocking stories. I have long blocked based on how hard or repetitively a whitelisted SE bot, including googlebot, hammers a site, without apparent harm; however, the blockage in such instances is only for an hour at a time... and while blocked, such whitelisted bots are returned a '503 Service Unavailable' header response including 'Retry-After: 3600'. All by the rules :) (see the second sketch below).
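
For point 1, a minimal sketch of the forward-confirmed reverse DNS check in Python -- the googlebot.com / google.com suffixes follow Google's published verification advice, the rest is illustrative rather than a drop-in implementation:

import socket

def is_real_googlebot(ip):
    """True only if ip reverse-resolves to a Google crawl host and that
    hostname resolves forward to the same ip (forward-confirmed reverse DNS)."""
    try:
        host = socket.gethostbyaddr(ip)[0]               # reverse lookup
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False                                     # spoofed user-agent, wrong network
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]   # forward lookup
    except OSError:
        return False
    return ip in forward_ips

# A claimed "googlebot" coming from a random hosting IP fails this check,
# while a genuine crawl IP (e.g. in the 66.249.x.x ranges) passes.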
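
And for point 2, a toy sketch of the '503 + Retry-After: 3600' answer, shown here as a bare stdlib HTTP server with a deliberately simplistic per-IP counter -- the limit is an assumption, and in practice this logic would live inside the real web stack, not a standalone script:

import time
from http.server import BaseHTTPRequestHandler, HTTPServer

HITS = {}        # ip -> timestamps of recent hits
LIMIT = 60       # illustrative: max hits per rolling hour before backing the bot off

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        ip = self.client_address[0]
        now = time.time()
        recent = [t for t in HITS.get(ip, []) if now - t < 3600]
        recent.append(now)
        HITS[ip] = recent
        if len(recent) > LIMIT:
            self.send_response(503)                      # Service Unavailable
            self.send_header("Retry-After", "3600")      # please come back in an hour
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"ok")

if __name__ == "__main__":
    HTTPServer(("", 8080), Handler).serve_forever()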

#6 nuts


Posted 22 April 2012 - 11:33 PM

Hi iamlost

I am looking at googlebot hits in the apache access_log, identified as googlebot by apache hostname lookup -- the bot connects by IP, apache resolves the name. It is a consistent hammering across a large number of domains. If I had only a few domains on a server, it would be no big deal, but the total volume contributes a lot to server load.
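
A quick sketch of that kind of per-minute tally in Python -- the log path and regex are assumptions for a combined-format access_log and will need adjusting to the actual LogFormat:

import re
from collections import Counter

LOG = "/var/log/apache2/access_log"     # illustrative path
LINE = re.compile(r'^(\S+) \S+ \S+ \[([^:]+:\d+:\d+):\d+ [^\]]+\] '
                  r'"[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

per_minute = Counter()
with open(LOG) as fh:
    for line in fh:
        m = LINE.match(line)
        if not m:
            continue
        _host, minute, agent = m.groups()    # host may already be a resolved name
        if "Googlebot" in agent:
            per_minute[minute] += 1          # minute granularity: dd/Mon/yyyy:hh:mm

for minute, hits in per_minute.most_common(10):
    print(minute, hits, "Googlebot hits")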

I am also researching caching a large number of pages, deciding between memcache (well known) and couchdb, which claims to keep a record of the cache in a file so that server restarts are much easier. This new box (new to me) has 48GB of RAM, which should be able to cache millions of pages, so rather than querying mysql or oracle, the hits would be much more innocuous as far as server load is concerned. It should also make for faster page loads. Of course I have quite a bit of architecture/programming to think through...
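
As a rough shape for that, a minimal cache-aside sketch using the python-memcached client -- an assumption; the same pattern works against couchdb or anything else, and render_page() is just a placeholder for the real mysql/oracle queries plus templating:

import memcache

mc = memcache.Client(["127.0.0.1:11211"])

def render_page(path):
    # placeholder for the expensive part: database queries plus templating
    return "<html>rendered for %s</html>" % path

def get_page(path):
    key = "page:" + path
    html = mc.get(key)
    if html is None:                  # cache miss: build once, serve from RAM afterwards
        html = render_page(path)
        mc.set(key, html, time=3600)  # expire after an hour; tune to taste
    return html

print(get_page("/some/busy/page"))

With the pages sitting in RAM, googlebot's hammering mostly touches memcached instead of mysql/oracle, which is where the server-load (and page-speed) win would come from.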

#7 iamlost


Posted 23 April 2012 - 09:39 AM

"Of course I have quite a bit of architecture/programming to think through..."

The learning curve of the webdev goes up and up, up and up :D

Michael Martinez wrote a (still!) good (the lasting power of basics, strategy, and theory :)) primer four (!!!time flies:() years ago: Large Web site design theory and crawl management.

Crawl refers to all aspects of search engine crawling. It includes:

1. Crawl rate (how many pages are fetched in a given timeframe)
2. Crawl frequency (how often a search engine initiates a new crawl)
3. Crawl depth (how many clicks deep a search engine goes from a crawl initiation point)
4. Crawl saturation (how many unique pages are fetched)
5. Crawl priority (which pages are used to initiate crawls)
6. Crawl redundancy (how many crawlers are used to crawl a site)
7. Crawl mapping (creating paths for crawlers)


Mapping the above metrics can be a valuable tool in deciding how to manage SE bots.
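
A small sketch of pulling two of those numbers (crawl rate and crawl saturation) out of already-filtered googlebot log lines -- the tuple format is an assumption, plug in whatever the log parser actually emits:

from collections import Counter

def crawl_stats(googlebot_hits):
    """googlebot_hits: iterable of (timestamp, requested_path) tuples."""
    paths = Counter(path for _ts, path in googlebot_hits)
    total_fetches = sum(paths.values())    # crawl rate: fetches in the window
    unique_pages = len(paths)              # crawl saturation: distinct pages fetched
    return total_fetches, unique_pages, paths.most_common(5)

fetches, unique, hottest = crawl_stats([("t1", "/a"), ("t2", "/a"), ("t3", "/b")])
print(fetches, unique, hottest)            # 3 fetches, 2 unique pages, hottest paths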

#8 Michael_Martinez


Posted 23 April 2012 - 12:55 PM

You cannot control Googlebot's crawl rate but you SHOULD be able to control where your Web statistics data is stored. It should not be in a Web-accessible folder.

Jon, I'm surprised you lost a lot of rankings by blocking Googlebot. I might test that with SEO Theory.


