> Supplementals across datacenters, simple tool to check a site

Hall of Famer

Group: Hall Of Fame
Joined: 3-November 05
Posts: 3,461
From: CHeeseland
post Sep 14 2006, 04:49 AM
I spotted the SEOmoz blog entry about finding supplemental pages from a site and turned it into a tool that works across datacenters:

http://oy-oy.eu/google/supplemental/

It takes your domain, passes it to Google as a "site:domain.com ***" query, and extracts the total "approximate" count as well as the effective real count (if the count is less than 1000, it goes to the last page of the results and checks).
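A minimal sketch of the two steps described above, not the tool's actual code: building the "site:domain.com ***" query URL for one datacenter and pulling the "about" count out of the result text. The function names, the `/search?q=` URL layout, and the "of about N" wording in the results line are assumptions for illustration. The sketch sends no requests and does not cover the last-page check for the real count.

```python
import re
from urllib.parse import urlencode

# Assumed wording of the count line, e.g. "Results 1 - 10 of about 724 for ..."
ABOUT_RE = re.compile(r"of about\s+([\d,]+)")


def build_query_url(domain, datacenter="www.google.com"):
    """Build the supplemental query URL for one datacenter host or IP.

    The host and the /search?q= layout are assumptions, not taken from the tool.
    """
    query = f"site:{domain} ***"
    return f"http://{datacenter}/search?" + urlencode({"q": query})


def parse_about_count(results_text):
    """Return the approximate ("about") count, or None if no count is found."""
    match = ABOUT_RE.search(results_text)
    if match is None:
        return None  # the tool would presumably show "?" in this case
    return int(match.group(1).replace(",", ""))


if __name__ == "__main__":
    print(build_query_url("example.com"))
    print(parse_about_count("Results 1 - 10 of about 1,234 for site:example.com"))
```

Returning `None` when no count can be found keeps "no number available" distinct from a real count of zero.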

What's also interesting in this regard is taking the same query and running it through the Google API (combined with my PR display): http://oy-oy.eu/google/searchpr/go.aspx -- some of those pages still show a toolbar PR value. To be honest, I have no idea what that would mean: is the displayed PR just outdated (should it be 0)? Or can a page in the supplemental index have a PR number? Or is the displayed PR just estimated from other existing pages (probably)? Ah, the joy of tools whose output you can't understand :D :D

John

Edit: looks like I tested too much, and Google doesn't want to give the server any numbers right now. You might want to wait an hour or so before playing with it...

This post has been edited by softplus: Sep 14 2006, 05:52 AM

Star Member

Group: 1000 Post Club
Joined: 22-May 06
Posts: 1,632
post Sep 14 2006, 09:05 AM
Interesting tool, John.

What is the flux-factor?

Yannis

PS Your server is overworked! :)

Moderator Alumni

Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
post Sep 14 2006, 09:11 AM
Nice idea.


From what I've seen and understand about supplemental results, they are kept in a separate database from regular results and use a different search engine. Those results are merged with the results from the main database when served to a searcher.

If the approach in the Anna Patterson patent application on multiple databases is what is being used, a lot less information is gathered and maintained about supplementals than about pages within the main index. I could see Google not maintaining PageRank information for those pages, though they might.

Hall of Famer

Group: Hall Of Fame
Joined: 3-November 05
Posts: 3,461
From: CHeeseland
post Sep 14 2006, 09:29 AM
What, overworked? Or just underpowered and underbandwidthed :D. It's good to run things like this on a local server, because then you can play with lots of variables, though that does sometimes leave the rest of the system begging for resources :D. And one day I'll get a really good web connection for the company ....

The main problem with a tool like this is that Google turns on a block when it sees lots of queries from the same server. The tool more or less goes against the Google terms of service, which is not really such a good idea, but it is the only way to get this data. Personally, I feel that I'm not doing automated queries, since I run them in real time with the user triggering them, but that argument is probably moot since Google only sees a flood of queries from the server :D. I could trick them a bit more (perhaps spread the requests over several IPs or go through proxy servers) and possibly get a few more queries out of them before they "recognize" it again, but if they need or want to block mass queries then I'll just take what I can get until then. Perhaps I should mention that on the pages in question, so that people understand what it means when it returns "?" :).

What surprised me was that the fluctuations in supplemental URLs across the datacenters were much higher than the fluctuations in indexed URLs. To me that sounds like a sign that they are playing with the supplemental index, perhaps with vastly different settings. Also, the sometimes large difference between the actual count and the "about" count (my "bad data push correction" :D) seems much higher with supplementals than with indexed pages (or links). That would make sense, however, since Google probably does not give the supplemental index high priority when it comes to generating better approximations.

The "Flux-Factor" is a rough and dirty value I calculate to determine how large the spread in numbers is. A high "flux-factor" could signify that things are in movement and that no stable equilibrium has been reached. It could also signify that Google is using / testing different settings (as I think is the case with the supplementals).

A fun toy with interesting numbers, you just have to figure out what they mean :D :D

John

Hall of Famer

Group: Hall Of Fame
Joined: 3-November 05
Posts: 3,461
From: CHeeseland
post Sep 14 2006, 09:58 AM
Just looked it up: the flux-factor is the percentage of displayed datacenters that return values more than 15% away from the average. Example: if the average is "100", +/- 15% would be 85 - 115. The flux-factor would be the percentage of datacenters queried that return either below 85 or above 115. I played around with other measures such as variance, etc., but this one seemed to return the "best" results based on the numbers I have seen so far (plus it's easy to calculate in JavaScript on the client side).

John
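
The flux-factor definition above can be sketched in a few lines. This is an illustration in Python rather than the tool's client-side JavaScript, and the function name, the `tolerance` parameter, and the choice to skip datacenters that returned no number (shown as "?" by the tool) are assumptions, not details confirmed in the thread.

```python
def flux_factor(counts, tolerance=0.15):
    """Percentage of datacenters whose count lies outside +/- tolerance of the average.

    `counts` holds one number per datacenter; None marks a datacenter that
    returned no number (assumed here to be skipped rather than counted).
    """
    values = [c for c in counts if c is not None]
    if not values:
        return None  # nothing to measure

    average = sum(values) / len(values)
    low = average * (1 - tolerance)
    high = average * (1 + tolerance)

    # Count datacenters strictly below the lower bound or above the upper bound.
    outside = sum(1 for v in values if v < low or v > high)
    return 100.0 * outside / len(values)


if __name__ == "__main__":
    # Average is 100, so the band is roughly 85 to 115; 80 and 120 fall outside.
    print(flux_factor([80, 100, 100, 120]))
```

With four datacenters reporting 80, 100, 100, and 120, two of the four fall outside the band, so the sketch reports a flux-factor of 50%.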

Quarter Grand Poster

Group: Members
Joined: 18-November 05
Posts: 256
post Sep 15 2006, 08:52 AM
Impressive, softplus. I'm going to mention your tool at SEOmoz to make your server very overworked. :-)

Untested

Group: Members
Joined: 7-September 04
Posts: 1
post Sep 17 2006, 12:56 PM
What is happening?

When I do a Google search (site:www.seroundtable.com/ ***) I get 2 results. :(

When I search using the tool (http://oy-oy.eu/google/supplemental/) for www.seroundtable.com/ I get results like 291 (about 724).

Is there any way to see a list of the supplemental pages in Google or another tool?

- Bill

Hall of Famer

Group: Hall Of Fame
Joined: 3-November 05
Posts: 3,461
From: CHeeseland
post Sep 17 2006, 02:32 PM
Hi Bill
If you click on the datacenter name/IP, it'll open the query it used in a new window/tab. That way you see "exactly" (with exceptions, as always :)) what the tool sees when it extracts that number.

John

Member

Group: Members
Joined: 18-September 06
Posts: 13
From: Ontario, Canada
post Sep 18 2006, 04:17 PM
Nice Tool!

You are right about the variation from one DC to the next... There is an incredible difference on some of them for the sites I checked.

Hall of Famer

Group: Hall Of Fame
Joined: 3-November 05
Posts: 3,461
From: CHeeseland
post Sep 26 2006, 03:37 AM
I just updated it to get around the current Google fix for "***" ;)

John

Untested

Group: Members
Joined: 27-April 06
Posts: 5
post Sep 26 2006, 08:57 PM
Interesting, softplus. I was just going to whine about the *** supplemental search not 'working' anymore when I came across your tool. Did Google decide to remove this 'search operator'?

Hall of Famer

Group: Hall Of Fame
Joined: 3-November 05
Posts: 3,461
From: CHeeseland
post Sep 27 2006, 01:17 AM
Shor - you can check the query that I'm using by clicking on the datacenter names. It's just a variation on the same theme :)

John