
Cre8asiteforums Internet Marketing
and Conversion Web Design


Google analyzes a billion web-pages

3 replies to this topic

#1 JohnMu


    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 25 January 2006 - 07:42 PM

Wow, I really like how Google is putting some of the statistics online - even without numbers it's interesting enough. Putting a billion documents through a statistics system is something I would love to do, but sadly my downstream won't really let me do that in a reasonable time-frame (like 100 years).

Google has it here: http://code.google.c...tats/index.html

Various people have, over the last few years, done studies into the popularity of authoring techniques. For example, looking at what HTML ids and classes are most common, and at how many sites validate (and yes, we know that we're not leading the way in terms of validation).

John Allsopp's study is the most recent one we're aware of, where he looked at class and id attribute values on 1315 sites. Before that, Marko Karppinen did a study in 2002, looking at which of the then 141 W3C members had sites that validated; in 2003 Evan Goer did a study into 119 Alpha Geeks' use of XHTML; and of course in 2004 François Briatte did a study covering trends of Web site design on 10 high-profile blogs. In addition, in the last year, microformats.org contributors have done a lot of research into the use of class and rel attributes, amongst other things, in their pursuit of bite-sized reusable semantics. We are also aware of some studies being done for the Mozilla project, covering thousands of pages.

We can now add to this data. In December 2005 we did an analysis of a sample of slightly over a billion documents, extracting information about popular class names, elements, attributes, and related metadata. The results we found are available below. We hope this is of use!

(requires an SVG-compatible browser, such as Firefox 1.5)
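The analysis Google describes — tallying which element names and class values appear most often across a corpus of HTML documents — is straightforward in miniature. Here's an illustrative sketch (not Google's actual pipeline; the sample documents and names are made up) using Python's standard-library HTML parser:

```python
# Minimal sketch of element/class frequency counting over HTML documents,
# in the spirit of Google's Web Authoring Statistics. Hypothetical sample
# data; a real study would feed in crawled pages instead.
from collections import Counter
from html.parser import HTMLParser

class FrequencyParser(HTMLParser):
    """Counts element names and individual class-attribute tokens."""
    def __init__(self):
        super().__init__()
        self.elements = Counter()
        self.classes = Counter()

    def handle_starttag(self, tag, attrs):
        self.elements[tag] += 1
        for name, value in attrs:
            if name == "class" and value:
                # class="header nav" contributes two tokens
                self.classes.update(value.split())

def analyse(documents):
    """Accumulate frequencies across a collection of HTML strings."""
    parser = FrequencyParser()
    for html in documents:
        parser.feed(html)
    return parser.elements, parser.classes

docs = [
    '<div class="header nav"><p>Hello</p></div>',
    '<div class="footer"><p class="nav">Bye</p></div>',
]
elements, classes = analyse(docs)
print(elements.most_common())  # div and p each appear twice here
print(classes.most_common())   # "nav" is the most common class token
```

Scaling this to a billion pages is of course a distributed-computing problem rather than a parsing one, which is presumably why only Google could run it.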

The really interesting stuff needs to be read between the lines. Why is Google doing this? Could it be the start of real block-level content analysis? The data they show is interesting; I can only guess at what they really wanted out of it and what they actually got out of it :).

If anyone has more detailed information (or a publication?) with information about these statistics, I would be really glad to get a link or two.


Edited by softplus, 25 January 2006 - 07:43 PM.

#2 kensplace


    Time Traveler Member

  • 1000 Post Club
  • 1497 posts

Posted 25 January 2006 - 08:07 PM

Interesting, and nice to see some examples from Google stating what they regard as "hostile" (pop-unders) and also what they say is a waste of time (like the keywords meta tag).

#3 Adrian


    Honored One Who Served Moderator Alumni

  • Invited Users For Labs
  • 5779 posts

Posted 26 January 2006 - 04:26 PM

It just shows how easy it is for them to spot certain types of dodgy coding, and how easy it is to ignore things like keyword-stuffed comment tags....

Just trying to look at the data now, should be interesting :)

#4 bwelford


    Peacekeeper Administrator

  • Admin - Top Level
  • 8995 posts

Posted 26 January 2006 - 05:03 PM

Some of the stats are very interesting, particularly on the proportion of web pages that use non-standard code. The very last section on Custom codes is particularly intriguing.
