Cre8asiteforums Internet Marketing
and Conversion Web Design


Supplementals across datacenters


11 replies to this topic

#1 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 14 September 2006 - 04:49 AM

I spotted the SEOmoz blog entry about finding supplemental pages from a site and turned it into a tool that works across datacenters:

http://oy-oy.eu/google/supplemental/

It takes your domain, passes it as a "site:domain.com ***" query to Google, and extracts the total "approximate" count as well as the effective real count (if there are fewer than 1000 results, it goes to the last page of the results and checks).
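
For readers who want to see the shape of the query, here is a minimal sketch of how such a URL could be put together. This is not the tool's actual code: the function name `supplemental_query_url`, the `/search?q=` path, and the datacenter address `192.0.2.1` (a reserved documentation address) are assumptions for illustration only. It builds the URL and nothing more; it does not fetch or parse anything.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def supplemental_query_url(domain, datacenter_host):
    """Build a search URL for the 'site:domain ***' query on one datacenter.

    Hypothetical helper: the host and the /search?q= path are assumptions,
    not something taken from the tool itself.
    """
    query = f"site:{domain} ***"
    # urlencode percent-escapes ':' and '*' and turns the space into '+'
    return f"http://{datacenter_host}/search?" + urlencode({"q": query})


if __name__ == "__main__":
    url = supplemental_query_url("example.com", "192.0.2.1")
    print(url)
    # Decoding the query string gives back the original search text
    print(parse_qs(urlsplit(url).query)["q"][0])
```

Running the same domain against a list of datacenter hosts would give one URL per datacenter, which is the per-datacenter query the post describes.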

What's also interesting in this regard is taking the same query and running it through the Google API (combined with my PR display): http://oy-oy.eu/goog...earchpr/go.aspx -- some of those pages still show a toolbar PR value. To be honest, I have no idea what that would mean: is the displayed PR just outdated (should it be 0)? Or can a page in the supplemental index have a PR number? Or is the displayed PR just estimated from other existing pages (probably)? Ah, the joy of tools whose output you can't understand :) :)

John

Edit: looks like I tested too much, Google doesn't want to give the server any numbers :ph34r: - you might want to wait an hour or so to play with it...

Edited by softplus, 14 September 2006 - 05:52 AM.


#2 yannis

yannis

    Sonic Boom Member

  • 1000 Post Club
  • 1634 posts

Posted 14 September 2006 - 09:05 AM

Interesting tool John.

What is the flux-factor?

Yannis

PS Your server is overworked! :)

#3 bragadocchio

bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 14 September 2006 - 09:11 AM

Nice idea.


From what I've seen and understand about supplemental results, they are kept in a separate database from regular results and use a different search engine, and those results are merged with the results from the main database when served to a searcher.

From the Anna Patterson patent application on multiple databases, if that is what is being used, a lot less information is gathered and maintained about supplementals than about pages within the main index. I could see Google not maintaining PageRank information for those pages, though they might.

#4 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 14 September 2006 - 09:29 AM

What, overworked? or just underpowered, underbandwidthed :). It's good to run things like this on a local server, then you can play with lots of variables, which however does sometimes leave the rest of the system to beg for resources :D. And one day I'll get a really-good web connection for the company ....

The main problem with a tool like this is that Google turns on a block when it sees lots of queries from the same server -- the tool more or less goes against the Google terms of service, which is not really such a good idea, but is the only way to get this data. Personally, I feel that I'm not doing automated queries since I do it in real time with the user triggering them - but that argument is probably moot since Google only sees a flood of queries from the server :D. I could trick them a bit more (perhaps spread the requests over several IPs or go through proxy servers) and possibly get a few more queries out of them before they "recognize" it again, but then again if they need / want to block mass queries then I'll just take what I can get until then. Perhaps I should mention that on the pages in question, so that people understand what it means when it returns "?" :).

What surprised me was that the fluctuations in supplemental URLs across the datacenters were much higher than the fluctuations in indexed URLs. To me that sounds like a sign that they are playing with the supplemental index, with perhaps vastly different settings. Also, the sometimes large difference between the actual count and the "about" count (my "bad data push correction" :D) seems much higher with supplementals than with indexed pages (or links) - that would make sense, however, since the supplemental index is probably not something they would prioritize when it comes to generating better approximations.

The "Flux-Factor" is a rough and dirty value I calculate to determine how large the spread in numbers is. A high "flux-factor" could signify that things are in movement and that no stable equilibrium has been reached. It could also signify that Google is using / testing different settings (as I think is the case with the supplementals).

A fun toy with interesting numbers, you just have to figure out what they mean :D :D

John

#5 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 14 September 2006 - 09:58 AM

Just looked it up - the flux-factor is the percentage of displayed datacenters that return values more than 15% away from the average. Example: if the average is "100", +/- 15% would be 85 - 115. The flux-factor would be the percentage of datacenters queried that return either below 85 or above 115. I played around with other values such as variance, etc. but this one seemed to return the "best" results based on the numbers I have seen so far (+ it's easy to calculate in javascript on the client-side).
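
The calculation described above can be sketched in a few lines. This is an illustrative Python version rather than the client-side javascript the tool uses; the function name `flux_factor`, the `tolerance` parameter, and the sample counts are made up for the example. How the tool treats values that land exactly on the 85 / 115 boundary, or datacenters that return "?", is not stated, so this sketch counts boundary values as inside the band and expects only numeric counts.

```python
def flux_factor(counts, tolerance=0.15):
    """Percentage of datacenter counts more than `tolerance` away from the average.

    `counts` is a list of result counts, one per datacenter (hypothetical input).
    """
    if not counts:
        raise ValueError("need at least one datacenter count")
    average = sum(counts) / len(counts)
    low = average * (1 - tolerance)
    high = average * (1 + tolerance)
    # Count the datacenters whose value falls outside the +/- tolerance band
    outside = sum(1 for c in counts if c < low or c > high)
    return 100.0 * outside / len(counts)


if __name__ == "__main__":
    # Average is 100, so the band is 85 - 115; 80 and 120 fall outside it
    print(flux_factor([80, 100, 100, 120]))
```

With those four sample counts, two of the four datacenters are outside the band, so the flux-factor comes out at 50%.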

John

#6 Mano70

Mano70

    Mach 1 Member

  • Members
  • 256 posts

Posted 15 September 2006 - 08:52 AM

Impressive, softplus. I'm going to mention your tool at seomoz to make your server very overworked. :-)

#7 scullywp

scullywp

    New To Community

  • Members
  • 1 posts

Posted 17 September 2006 - 12:56 PM

What is happening?

When I do a Google search (site:www.seroundtable.com/ ***) I get 2 results. :)

When I search using the tool (http://oy-oy.eu/google/supplemental/) for www.seroundtable.com/ I get results like 291 (about 724).

Is there any way to see a list of the supplemental pages in Google or another tool?

- Bill

#8 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 17 September 2006 - 02:32 PM

Hi Bill
If you click on the datacenter name/ip it'll open the query it used in a new window/tab. That way you see "exactly" (with exceptions, as always :)) what the tool sees to extract that number.

John

#9 SEOMonkey

SEOMonkey

    Ready To Fly Member

  • Members
  • 13 posts

Posted 18 September 2006 - 04:17 PM

Nice Tool!

You are right about the variation from one DC to the next... There is an incredible difference on some of them for the sites I checked.

#10 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 26 September 2006 - 03:37 AM

I just updated it to get around the current Google fix for "***" :D

John

#11 shor

shor

    Unlurked Energy

  • Members
  • 5 posts

Posted 26 September 2006 - 08:57 PM

Interesting softplus. Was just going to whine about the *** supplemental search not 'working' anymore when I came across your tool. Did Google decide to remove this 'search operator'?

#12 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 27 September 2006 - 01:17 AM

Shor - you can check the query that I'm using by clicking on the datacenter names, it's just a variation of the same theme :)

John


