Star Member, Group: Members
Joined: 21-October 05
Posts: 856
From: Cheshire, England
|
Sep 23 2006, 05:28 AM |
|
|
Ammon, is that a page that was NOT spidered EVER, or just within a recent timeframe? |
||
Hall of Famer, Group: Hall Of Fame
Joined: 3-November 05
Posts: 3,482
From: CHeeseland
|
Sep 23 2006, 01:39 PM |
|
|
Could the supplemental index be seen as a parallel document index, with its own hit lists and forward and inverted indices (using "backrub" as the structure)? Assuming documents with slightly insufficient value (for the current crawl/index threshold) are moved into the supplemental index first, Google could save a lot of bandwidth by just restoring them once the URLs have more value (versus throwing them out and having to re-crawl everything to find those URLs).
This is particularly important when links are re-evaluated or thresholds are adjusted: it's not just the links to that site (the one "going supplemental") but also, to a very large extent, the links to the links (to the links, etc.) that matter. By devaluing a few major value-passing links, many full sites downstream from there could be affected. Those sites (not just single pages!) will be in a position where they used to be fully indexed (perhaps with 1000s of URLs), but now do not have enough value to merit being indexed anymore.
Google could assume that those sites will "fight" for their value and work on getting more good links in the future. By keeping the old (insufficiently valued) URLs warm in a supplemental index, Google can restore the site's complete indexing as soon as enough valuable links are found. It saves a lot of bandwidth and time. The same could be applied if a site loses value through penalties (or of course if one of the major upstream sites loses value through penalties).
I wonder how many sites' indexing (and that of the links of links of those sites) is dependent on Wikipedia or DMOZ (+clones). Imagine if Google were to devalue those two sources (and their clones)... didn't they just do exactly that a while back?
Just a guess. John |
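To make the "links to the links" point concrete, here is a rough toy model, not Google's actual algorithm. It assumes a simplified PageRank-style value flow, a made-up graph (`fan1`..`fan3`, `directory`, `siteA`..`siteC`), a hypothetical indexing threshold of 0.07, and a hypothetical `link_weight` knob for devaluing one link. Devaluing the single link from `directory` to `siteA` pushes everything downstream under the threshold.

```python
def link_value(links, link_weight=None, damping=0.85, iterations=50):
    """Toy PageRank-style value flow; link_weight scales individual links (assumption)."""
    link_weight = link_weight or {}
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        nxt = {p: (1 - damping) / n for p in pages}
        for src, targets in links.items():
            if not targets:
                continue
            share = damping * score[src] / len(targets)
            for dst in targets:
                # A devalued link passes only a fraction of its share.
                nxt[dst] += share * link_weight.get((src, dst), 1.0)
        score = nxt
    return score


# Hypothetical link graph: three pages feed a directory, which feeds a chain of sites.
links = {
    "fan1": ["directory"],
    "fan2": ["directory"],
    "fan3": ["directory"],
    "directory": ["siteA"],
    "siteA": ["siteB"],
    "siteB": ["siteC"],
    "siteC": [],
}

THRESHOLD = 0.07  # hypothetical "enough value to be indexed" cutoff


def indexed(scores):
    return {page for page, value in scores.items() if value >= THRESHOLD}


before = link_value(links)
after = link_value(links, link_weight={("directory", "siteA"): 0.1})
dropped = indexed(before) - indexed(after)
print(sorted(dropped))
```

Only one link was touched, yet `siteB` and `siteC`, which never linked to or from the directory, drop out along with `siteA`.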
||
Centenarian Poster, Group: Members
Joined: 5-March 06
Posts: 110
|
Sep 25 2006, 01:23 PM |
|
|
QUOTE A supplemental page is a page that was NOT spidered, but is known about from links on pages that were.
Ammon, I completely disagree.
First, you're describing what's known as "url only" pages blocked by robots.txt. Google doesn't spider the content of those pages but it lists the urls in the index because, according to Matt Cutts, people link to them.
Second, a page listed in the main index can also show up as supplemental depending on the query. It's quite likely that every site has an older copy of all its indexed pages in the supplemental index (I'll elaborate if you want).
QUOTE Some site owners over at WebmasterWorld have been discussing an issue where on Bigdaddy data centers, the site wouldn’t be crawled as much in the main index. That would result in Google showing more pages from the supplemental results for that site.
http://www.mattcutts.com/blog/gone-supplemental/
(Though he writes "the site wouldn't be crawled as much in the main index", which doesn't make sense, I believe he means "the site wouldn't be listed as much in the main index".)
Third, there are plenty of supplemental listings where the site's navigation link text, meta description, etc. are displayed in the SERPs, which means Google has seen the content of those pages.
Fourth, when two urls point to the same content, often one will "go supplemental." For example, www.domain.com and domain.com pointing to the same /index.html, or www.domain.com/index.htm and www.domain.com/, etc. There is no way in hell Google can judge that the two urls point to the same content unless it takes a look at the contents of both those pages. |
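The fourth point can be sketched in a few lines. This is only an illustration of why duplicate detection needs the fetched content; nobody here knows how Google actually fingerprints pages. The `pages` dict stands in for already-crawled bodies, and `group_by_content` is a made-up helper name.

```python
import hashlib
from collections import defaultdict


def group_by_content(pages):
    """Group URLs whose (whitespace-normalized) bodies hash to the same fingerprint."""
    groups = defaultdict(list)
    for url, body in pages.items():
        normalized = " ".join(body.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    # Only groups with more than one URL are duplicate sets.
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical crawl results: the first two URLs serve the same /index.html.
pages = {
    "www.domain.com/": "<html><body>Welcome to   Domain</body></html>",
    "domain.com/": "<html><body>Welcome to Domain</body></html>",
    "www.domain.com/about": "<html><body>About us</body></html>",
}

duplicates = group_by_content(pages)
print(duplicates)
```

The fingerprint is computed from the page body, so neither URL can be flagged as a duplicate of the other without both having been fetched.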
||
Moderator Alumni, Group: Hall Of Fame
Joined: 1-September 02
Posts: 9,213
From: UK
|
Sep 26 2006, 10:00 AM |
|
|
Regarding the quote from Matt Cutts you were having trouble with, Halfdeck, look at the wider context:
QUOTE on Bigdaddy data centers, the site wouldn’t be crawled as much in the main index. That would result in Google showing more pages from the supplemental results for that site.
This was during the testing of BigDaddy, when that index was on test servers, and was not the main index. A lot of sites were not being crawled as much in the test of BigDaddy as they would be (and were) on the main servers. As noted, a decrease in crawling results in supplementals showing for uncrawled pages. Matt refers to adjusting the variable for crawl priority to increase this somewhat across the majority of sites (it seems the prioritisation algorithm was just a little too harsh):
QUOTE several people responded with enough details that we identified and changed a threshold in Bigdaddy to crawl more pages from those sites.
This was also the first occasion I recall on which Matt alluded to some pages being crawled yet still not actually getting indexed:
QUOTE we can still fetch those pages, it’s just that they don’t always make it live. I believe someone checked in another threshold yesterday based on the meeting that I had some with crawl/index folks.
So there, a value different from crawl priority determines whether a crawled page will actually go into the live index. Matt was a lot clearer on this in some later blog posts: Matt's post on Google's indexing timeline is far more specific. One particular gem of insight was in the comments, where Matt said:
QUOTE it’s by design in Bigdaddy that we crawl somewhat more than we index in Bigdaddy. If you index everything that you crawl, you never know what you might be missing by crawling a little more, for example. I see at least one indexed post from your forum, so the fact that we’ve been visiting those pages is a good indicator that we’re aware of those pages, and they may be incorporated in the index in the future
John, I would certainly say that having pages suddenly go supplemental is a warning sign to SEOs. It tells you that your site has a low crawling or indexing priority that has fallen below a working threshold level. It could be that inbound links you'd built have been discounted due to discovery of their connection to a bad neighborhood. It could be that some of those inbound links have simply dropped in visibility (as with links in blog posts once the post drops from the front page, and exists only in archives). It could be that some of your own linking patterns have caused your internal site links to lose value. It is a red flag: it may not give you any real specific information, but it certainly is there to warn you of potential danger. |
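As a rough picture of the "crawl somewhat more than we index" idea, here is a minimal sketch with two separate cutoffs. The names `crawl_priority` and `index_score`, the threshold values, and the sample pages are all assumptions made up for illustration; the only thing taken from Matt's comments is that crawling and indexing are gated by different values.

```python
def classify(crawl_priority, index_score, crawl_threshold=0.3, index_threshold=0.6):
    """Return where a page ends up under two independent (hypothetical) thresholds."""
    if crawl_priority < crawl_threshold:
        return "not crawled"
    if index_score < index_threshold:
        return "crawled, not indexed"
    return "indexed"


# Hypothetical pages: (crawl_priority, index_score)
pages = {
    "/": (0.9, 0.8),
    "/forum/post-1": (0.5, 0.4),
    "/archive/old": (0.1, 0.7),
}

status = {url: classify(*values) for url, values in pages.items()}
for url, state in status.items():
    print(url, "->", state)
```

Raising or lowering either cutoff, as Matt describes engineers doing during Bigdaddy, changes which bucket a page lands in without anything on the page itself changing.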
||
Centenarian Poster, Group: Members
Joined: 5-March 06
Posts: 110
|
Sep 26 2006, 12:16 PM |
|
|
BK, I agree with your last two posts.
QUOTE Now that I haven't seen yet. Was the same datacentre involved in both queries?
Most likely. I think I ran my queries on gfe-eh.google.com. I'm dying to post an example but I can't recall exactly what query I ran. I first heard about it from gs1md on WMW, and frankly, I found it hard to believe, but seeing my root domain url displayed as supplemental for odd queries which contain text that does NOT appear in the cache is good enough proof for me to believe that 1) supplemental pages are spidered and 2) even a page listed in the main index has a copy of it in the supplemental database:
http://www.webmasterworld.com/google/3060898-3-10.htm
Speculation: I believe there are several types of supplementals, but let me talk about one scenario. Google keeps an older copy of indexed pages in the supplemental index. These pages are usually masked by pages in the main index, so as far as we can tell, they don't exist. When pages from the main index are dropped, supplemental pages become visible - but they were ALWAYS there. Your page didn't suddenly "go supplemental."
Webmasters at WMW racked their brains trying to figure out why their sites suddenly "went supplemental" during BD release. I interpret Matt's answer as: "they didn't go supplemental. Pages in the main index were just dropped (due to new trustbox or whatever you want to call it, affecting crawl priority, pickier indexing, devaluation of paid/traded links etc)" ... "That would result in Google showing more pages from the supplemental results for that site."
Example (ignore the unrealistic timeline):
1. Aug 1, 2005. Google crawls www.domain.com/index.html and stores it in the main index and a copy of it in the supplemental database.
2. Aug 2, 2005. Google crawls domain.com/ and does the same thing. So now we have 4 DB records, 2 in the main index DB and 2 in the supplemental. At this point, a site: search shows:
www.domain.com/index.html Cached (Aug 1, 2005) Similar pages
domain.com/ Cached (Aug 2, 2005) Similar pages
3. Oct 1, 2005. Google recrawls both pages and updates the main index cache (cache dates updated to Oct 1, 2005 for both urls). Records/timestamps in the supplemental database are left untouched.
4. Nov 1, 2005. Google runs a duplicate content check on the domain and finds the content on the two urls is identical. It decides to drop the url www.domain.com/index.html from the main index, which reveals a supplemental url cached on Aug 1, 2005. So now, a site: search returns:
www.domain.com/index.html Supplemental Result Cached (Aug 1, 2005) Similar pages
domain.com/ Cached (Oct 1, 2005) Similar pages |
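The four-step example can be replayed as a small simulation. This models only the speculation in this post (a main store that is refreshed on recrawl and a supplemental store that keeps the first copy), not anything confirmed about Google. The class and method names are made up.

```python
class TwoTierIndex:
    """Toy model: main index is refreshed on recrawl, supplemental keeps the first copy."""

    def __init__(self):
        self.main = {}          # url -> cache date in the main index
        self.supplemental = {}  # url -> cache date of the older, masked copy

    def crawl(self, url, date):
        self.main[url] = date
        # The supplemental record is written once and then left untouched.
        self.supplemental.setdefault(url, date)

    def drop_from_main(self, url):
        self.main.pop(url, None)

    def site_search(self):
        results = []
        for url in sorted(self.supplemental):
            if url in self.main:
                results.append((url, self.main[url], "main"))
            else:
                # No main record left to mask it, so the old copy shows.
                results.append((url, self.supplemental[url], "supplemental"))
        return results


index = TwoTierIndex()
index.crawl("www.domain.com/index.html", "2005-08-01")  # step 1
index.crawl("domain.com/", "2005-08-02")                # step 2
index.crawl("www.domain.com/index.html", "2005-10-01")  # step 3: recrawl
index.crawl("domain.com/", "2005-10-01")
index.drop_from_main("www.domain.com/index.html")       # step 4: duplicate check

results = index.site_search()
for row in results:
    print(row)
```

The dropped url comes back with its Aug 1 cache date and a supplemental label, while the surviving url shows the Oct 1 date, which matches the final site: listing in the example.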
||
Centenarian Poster, Group: Members
Joined: 5-March 06
Posts: 110
|
Oct 12 2006, 04:51 AM |
|
|
QUOTE PageRank is the primary factor determining whether a url is in the main web index vs. the supplemental results... - Matt Cutts
http://www.mattcutts.com/blog/fall-weather.../#comment-87795
I'm actually a little surprised by what he wrote. |
||
Previous Moderator/Hall of Fame, Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
|
Oct 12 2006, 07:17 PM |
|
|
QUOTE What are all the possible reasons for a page to have ‘Supplemental Result’ next to the url in the Google search results?
So that they can index more pages by capturing less information for lower ranking pages.
Nice description of a multi-stage index in this document: Multiple index based information retrieval system |
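One way to picture "capturing less information" is to compare a full positional inverted index with a lighter one that records only which documents contain a term. This is just a sketch of the storage trade-off; treating the lighter form as what a secondary index holds is an assumption, and the two sample documents are made up.

```python
from collections import defaultdict


def build_positional_index(docs):
    """term -> {doc_id: [positions]}: one stored entry per term occurrence."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


def build_lightweight_index(docs):
    """term -> {doc_id}: one stored entry per (term, document) pair, no positions."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


docs = {
    1: "supplemental index supplemental results",
    2: "main index results",
}

positional = build_positional_index(docs)
lightweight = build_lightweight_index(docs)

positional_entries = sum(
    len(positions) for postings in positional.values() for positions in postings.values()
)
lightweight_entries = sum(len(doc_ids) for doc_ids in lightweight.values())
print(positional_entries, lightweight_entries)
```

On real pages, where terms repeat far more often than in this two-line sample, the gap between the two counts grows, which is the kind of saving that would let more pages be held for the same storage.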
||
Centenarian Poster, Group: Members
Joined: 5-March 06
Posts: 110
|
Oct 12 2006, 09:08 PM |
|
|
QUOTE So that they can index more pages by capturing less information for lower ranking pages.
Nicely put, Bill.
QUOTE Another problem with conventional information retrieval systems is that they can only index a relatively small portion of the documents available on the Internet. It is currently estimated that there are over 200 billion pages on the Internet today. However, even the best search engines index only 6 to 8 billion pages, thereby missing the majority of available pages. There are several reasons for the limited indexing capability of existing systems. Most significantly, typical systems rely on a variation of an inverted index that maintains for every term (as discussed above) a list of every page on which the term occurs, along with position information identifying the exact position of each occurrence of the term on the page. The combination of indexing individual terms and indexing positional information requires a very large storage system.
And Google keeps insisting they're nowhere near out of storage space. What is clear is that Google will not index everything under the sun.
QUOTE The indexing system 110 is responsible for identifying phrases in documents, and indexing documents according to their phrases, by accessing various websites 190 and other document collections. The front end server 140 receives queries from a user of a client 170, and provides those queries to the search system 120. The search system 120 is responsible for searching for documents relevant to the search query (search results), including identifying any phrases in the search query, and then ranking the documents in the search results using the presence of phrases to influence the ranking order. The search system 120 provides the search results to the presentation system 130. The presentation system 130 is responsible for modifying the search results including removing near duplicate documents, and generating topical descriptions of documents, and providing the modified search results back to the front end server 140, which provides the results to the client 170.
Interesting. But that patent describes what might happen when someone runs a query on Google. I'm thinking more of the process involved during crawling.
QUOTE Why surprised?
Well, I dunno. Maybe I was baiting. Seriously though, Matt Cutts mentioning PageRank isn't surprising to me. I believed that to be the case for a while. In this WMW thread, I wrote:
QUOTE I also think once a site goes heavily supplemental, you need to regain some trust with Google to get pages back into the main index (i.e. organic inbounds/PageRank). It would be one way Google guards itself against 100,000,000 page spam sites sitting in the supplemental index, and preventing any periodical on-page tweaks from reinjecting the site into the main index.
I also wrote here:
QUOTE To be or not to be supplemental.. is a question that hinges on more than one factor alone. I believe PageRank is a factor, but a PR of 8 may not save you if the page is an identical copy of a page with a PR of 10. Google's algo is not a single IF/ELSE statement.
It's hard to put my reaction into words, so instead I'll quote gs1md from this thread:
QUOTE Heh, Marcia, you’re gonna love this Matt Cutts comment: >> PageRank is the primary factor determining…. Note: The “primary” factor. Jeez. No mention of Duplicate Content, and Redirects and 404 URLs at all.
QUOTE Do you think there is only one internal pagerank?
No. Imo, they need to maintain *at least* two sets of internal PageRank in case they need to backtrack. Assuming they're constantly improving the way they calculate PageRank (e.g. improving how they identify certain types of links: bought links, reciprocals, footer non-internal links, etc.), I'd assume each test DC with a different PageRank-calculating mod would have a different set of internal PageRank.
QUOTE and that it's calculated in the same way that the exported pagerank is?
If I go strictly by what Matt Cutts has said so far (I know, that's assuming a lot), TBPR (Toolbar PageRank) is exported internal PageRank translated onto a 0-10 scale. The only inaccuracies involved with TBPR are that PageRank updates continuously while TBPR updates every few months, and that internal PageRank is more granular than TBPR. I'd compare it to reading the news every three months instead of every day, and instead of being able to read full articles, you only get to read the headlines. |
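The headline analogy can be put in code. Nothing below is known about the real export: the logarithmic bucketing, the base of 6, and the sample scores are all assumptions chosen only to show the two effects described, coarser granularity and a stale snapshot.

```python
import math


def to_toolbar_scale(internal_score, base=6.0):
    """Map a fine-grained internal score to a 0-10 bucket (log scale is an assumption)."""
    if internal_score < 1:
        return 0
    return min(10, int(math.log(internal_score, base)))


# Hypothetical internal scores at export time.
internal = {"pageA": 40.0, "pageB": 200.0}
snapshot = {url: to_toolbar_scale(score) for url, score in internal.items()}

# Internal value keeps moving after the export; the toolbar snapshot does not.
internal["pageA"] = 250.0
live = {url: to_toolbar_scale(score) for url, score in internal.items()}

print(snapshot)
print(live)
```

At export time `pageA` and `pageB` share a bucket even though one has five times the internal score, and the later gain for `pageA` stays invisible until the next export.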
||
Previous Moderator/Hall of Fame, Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
|
Oct 12 2006, 10:21 PM |
|
|
QUOTE(Halfdeck) But that patent describes what might happen when someone runs a query on Google. I'm thinking more of the process involved during crawling.
It describes two stages of ranking. The first could look primarily at pagerank, to decide which documents would end up in the primary index, and which would go into a secondary, or supplemental, index.
QUOTE(Multiple index based information retrieval system) The scoring algorithm for pre-ranking the documents may be the same underlying relevance scoring algorithm used in the search system 120 to generate a relevance score. In one embodiment, the IR score is based on the page rank algorithm, as described in U.S. Pat. No. 6,285,999. Alternatively or additionally, statistics for a number of IR-relevant attributes of the document, such as the number of inlinks, outlinks, document length, may also be stored, and used alone or in combination in order to rank the documents.
A year ago, Anna Patterson (the inventor named on that patent application) told us at the official Google Blog that they expanded the size of their index by three times for their seventh birthday.
Google might not be using the process described in this patent application, or the other four that she filed over the past year on a phrase indexing system. But they are worth looking at, and considering carefully:
Multiple index based information retrieval system (20060106792)
Phrase-based searching in an information retrieval system (20060031195)
Phrase-based indexing in an information retrieval system (20060020607)
Phrase-based generation of document descriptions (20060020571)
Phrase identification in an information retrieval system (20060018551)
It does seem to provide a framework that addresses some of the behavior that we are seeing. |
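The pre-ranking stage described in the quote could look something like the sketch below. It is a loose reading, not the patent's method: the single `score` field stands in for whatever pre-rank signal is used (page rank, inlinks, and so on), and `primary_capacity` and the sample documents are made up.

```python
def split_by_prerank(docs, primary_capacity):
    """Rank documents by a pre-computed score; the top slice goes to the primary index."""
    ranked = sorted(docs, key=lambda doc: doc["score"], reverse=True)
    primary = [doc["url"] for doc in ranked[:primary_capacity]]
    secondary = [doc["url"] for doc in ranked[primary_capacity:]]
    return primary, secondary


# Hypothetical documents with a pre-rank score.
docs = [
    {"url": "a", "score": 0.9},
    {"url": "b", "score": 0.2},
    {"url": "c", "score": 0.6},
    {"url": "d", "score": 0.05},
]

primary, secondary = split_by_prerank(docs, primary_capacity=2)
print(primary, secondary)
```

Under this reading a page lands in the secondary index purely because of where it falls in the ordering, which would fit the comment that PageRank is the primary factor.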
||