Of Sandboxes and Toolbars: Google's New Patent Application
Posted 31 March 2005 - 10:00 PM
Information retrieval based on historical data
I've seen a couple of forum threads and blog posts about it (including this one started by msgraph, who seems to have been the first to spot the application - nice going), and thought that it would be a good idea to break down the patent step-by-step and see what lurks underneath all of the legal language.
In a few places, it's been called an explanation for Google's Sandbox - a place where new sites go instead of gaining PageRank and being able to rank well in Google's results. Mentions of the use of Google's toolbar and the gathering of information about a site also factor into some of the discussions I've seen.
With all of that press, it pays to take a closer look. I'm not sure that I have the time to go through the whole thing all in one sitting, but please feel free to jump in and help me dissect this patent.
Posted 31 March 2005 - 10:34 PM
I had done some searching of the patent and patent application databases not too long ago using the name of one of the Authors of this patent. But I didn't come across this one.
So, when was it filed and made part of the application database? We see two dates towards the top of the patent application. One is March 31, 2005. There's a "filed" date of December 31, 2003.
I'm guessing that the application was made available for public view today, rather than on that filing date. Why is the date important? We find an answer to that on the pages of the US Patent and Trademark Office, in their section on General Information Concerning Patents.
It is important when a patent application is filed. From that Patent Office page we see:
If the invention has been described in a printed publication anywhere, or has been in public use or on sale in this country more than one year before the date on which an application for patent is filed in this country, a patent cannot be obtained. In this connection it is immaterial when the invention was made, or whether the printed publication or public use was by the inventor himself/herself or by someone else.
So, if an invention has been in public use for more than a year before a patent application is filed, a patent "cannot be obtained," even if it was the inventor who put it to use.
So, why the two dates? Well, a patent applicant can request that an application not be published immediately, as the Patent Office explains.
From the same page:
On filing of a plant or utility application on or after November 29, 2000, an applicant may request that the application not be published, but only if the invention has not been and will not be the subject of an application filed in a foreign country that requires publication 18 months after filing (or earlier claimed priority date) or under the Patent Cooperation Treaty. Publication occurs after the expiration of an 18-month period following the earliest effective filing date or priority date claimed by an application. Following publication, the application for patent is no longer held in confidence by the Office and any member of the public may request access to the entire file history of the application.
So, while a patent application can be filed, it may not need to be published immediately. In this instance, we have a period of fifteen months from the date of filing to the time of publication.
So, it isn't a new application. Just one that has been kept quiet for a while. For the application to become an actual patent, the invention it describes shouldn't have been in use more than a year before it was filed, even by its inventor. So, that date would seem to be December 31, 2002.
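For anyone who wants to check my date arithmetic, here's a quick sketch in Python. The two dates come from the application itself; the helper names are just mine, and the one-year rule is only my reading of the Patent Office text quoted above.

```python
from datetime import date

FILED = date(2003, 12, 31)      # "filed" date shown on the application
PUBLISHED = date(2005, 3, 31)   # publication date shown on the application


def months_between(start, end):
    """Whole calendar months from start to end (ignores the day of month)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def one_year_before(d):
    """Same calendar day one year earlier."""
    try:
        return d.replace(year=d.year - 1)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap year; fall back to Feb 28.
        return d.replace(year=d.year - 1, day=28)


print(months_between(FILED, PUBLISHED))  # 15
print(one_year_before(FILED))            # 2002-12-31
```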
Posted 31 March 2005 - 10:54 PM
And it covers much more than the sandbox. Here's the "age of domain" section:
 According to an implementation consistent with the principles of the invention, information relating to a domain associated with a document may be used to generate (or alter) a score associated with the document. For example, search engine 125 may monitor information relating to how a document is hosted within a computer network (e.g., the Internet, an intranet or other network or database of documents) and use this information to score the document.
 Individuals who attempt to deceive (spam) search engines often use throwaway or "doorway" domains and attempt to obtain as much traffic as possible before being caught. Information regarding the legitimacy of the domains may be used by search engine 125 when scoring the documents associated with these domains.
 Certain signals may be used to distinguish between illegitimate and legitimate domains. For example, domains can be renewed up to a period of 10 years. Valuable (legitimate) domains are often paid for several years in advance, while doorway (illegitimate) domains rarely are used for more than a year. Therefore, the date when a domain expires in the future can be used as a factor in predicting the legitimacy of a domain and, thus, the documents associated therewith.
 Also, or alternatively, the domain name server (DNS) record for a domain may be monitored to predict whether a domain is legitimate. The DNS record contains details of who registered the domain, administrative and technical addresses, and the addresses of name servers (i.e., servers that resolve the domain name into an IP address). By analyzing this data over time for a domain, illegitimate domains may be identified. For instance, search engine 125 may monitor whether physically correct address information exists over a period of time, whether contact information for the domain changes relatively often, whether there is a relatively high number of changes between different name servers and hosting companies, etc. In one implementation, a list of known-bad contact information, name servers, and/or IP addresses may be identified, stored, and used in predicting the legitimacy of a domain and, thus, the documents associated therewith.
 Also, or alternatively, the age, or other information, regarding a name server associated with a domain may be used to predict the legitimacy of the domain. A "good" name server may have a mix of different domains from different registrars and have a history of hosting those domains, while a "bad" name server might host mainly pornography or doorway domains, domains with commercial words (a common indicator of spam), or primarily bulk domains from a single registrar, or might be brand new. The newness of a name server might not automatically be a negative factor in determining the legitimacy of the associated domain, but in combination with other factors, such as ones described herein, it could be.
Some SEO groups I know have been conjecturing for years about how they'd filter if they ran a search engine. Some have developed "spammer profiles" that are extraordinarily close to the items in the patent.
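To make the idea of a "spammer profile" concrete, here's a rough sketch of how the signals in that section might be combined. Everything specific in it is my own assumption: the thresholds, the equal weighting, the field names, and the made-up name server in the known-bad list. The application only says these kinds of signals "may be used"; it doesn't say how.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DomainRecord:
    expires: date              # registration expiry date
    name_server_changes: int   # changes observed over the monitoring period
    contact_changes: int       # contact-info changes over the same period
    name_server: str           # current name server


# Hypothetical list of known-bad name servers (placeholder value).
KNOWN_BAD_NAME_SERVERS = {"ns.bad-example.test"}


def legitimacy_signals(record, today):
    """Return one True/False flag per signal; True points toward 'legitimate'."""
    years_left = (record.expires - today).days / 365.25
    return {
        "paid_more_than_a_year_ahead": years_left > 1.0,      # assumed threshold
        "stable_name_servers": record.name_server_changes <= 2,  # assumed
        "stable_contacts": record.contact_changes <= 2,          # assumed
        "name_server_not_known_bad": record.name_server not in KNOWN_BAD_NAME_SERVERS,
    }


def legitimacy_score(record, today):
    """Fraction of signals that look legitimate (equal weights, an assumption)."""
    signals = legitimacy_signals(record, today)
    return sum(signals.values()) / len(signals)


today = date(2005, 4, 1)
established = DomainRecord(date(2010, 1, 1), 0, 0, "ns.example.test")
throwaway = DomainRecord(date(2005, 6, 1), 5, 6, "ns.bad-example.test")
print(legitimacy_score(established, today), legitimacy_score(throwaway, today))
```

A real system would presumably weigh these signals differently and combine them with many others, as the passage about new name servers suggests.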
Posted 31 March 2005 - 11:04 PM
Ideally, a search engine, in response to a given user's search query, will provide the user with the most relevant results. One category of search engines identifies relevant documents based on a comparison of the search query terms to the words contained in the documents. Another category of search engines identifies relevant documents using factors other than, or in addition to, the presence of the search query terms in the documents. One such search engine uses information associated with links to or from the documents to determine the relative importance of the documents.
That section also names two problems that can have a negative effect upon the relevance of search engine results. One is "spamming techniques" that artificially "inflate" the rankings of sites. Another is "stale" sites that are ranked higher than fresher sites that have been updated more recently and "contain more recent data."
From this introduction, it appears that this patent application is intended to address people spamming search results, and make it easier for newer sites to rank well against older, staler sites.
That sort of seems to go against the concept of a "Sandbox" effect, where newer sites seem to be penalized and unable to rank well in Google. Or does it? We probably need to delve deeper.
Posted 31 March 2005 - 11:52 PM
I also noticed that it's a good fit with one of the Google patents on duplicate pages that deals with fingerprints, and looking at snippets of information to determine whether content on multiple pages is duplicated.
Detecting duplicate and near-duplicate files
At one point in that patent on duplicate content, it lists some factors that it might use to determine which duplicate page to use, including age of the document:
In response to the detected duplicate documents, the present invention may also function to eliminate duplicate documents (e.g., keeping the one with best PageRank, with best trust of host, that is the most recent*) Alternatively, the present invention may function to generate clusters of near-duplicate documents, in which a transitive property is assumed (i.e., if document A is a near-duplicate of document B, and document B is a near-duplicate of document C, then document A is considered a near-duplicate of document C). Each document may have an identifier for identifying a cluster with which it is associated. In this alternative, in response to a search query, if two candidate result documents belong to the same cluster and if the two candidate result documents match the query equally well (e.g., have the same title and/or snippet) if both appear in the same group of results (e.g., first page), only the one deemed more likely to be relevant (e.g., by virtue of a high PageRank, being more recent*, etc.) is returned.
* My emphasis.
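The clustering idea in that passage (if A is a near-duplicate of B, and B of C, then A and C land in the same cluster) can be sketched with a small union-find. This is only my illustration of the transitive grouping and the "keep the best one" step; the function names and the tie-breaking order (PageRank first, then recency) are assumptions, not something the patent spells out.

```python
def cluster_near_duplicates(doc_ids, near_duplicate_pairs):
    """Map each document to a cluster id, treating near-duplication as transitive."""
    parent = {doc_id: doc_id for doc_id in doc_ids}

    def find(doc_id):
        # Walk up to the cluster root, shortening the path as we go.
        while parent[doc_id] != doc_id:
            parent[doc_id] = parent[parent[doc_id]]
            doc_id = parent[doc_id]
        return doc_id

    for a, b in near_duplicate_pairs:
        parent[find(a)] = find(b)

    return {doc_id: find(doc_id) for doc_id in doc_ids}


def pick_representative(cluster_docs):
    """cluster_docs holds (doc_id, pagerank, last_modified_day) tuples.

    Keep the document with the highest PageRank, breaking ties by recency.
    """
    return max(cluster_docs, key=lambda doc: (doc[1], doc[2]))[0]


clusters = cluster_near_duplicates(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
print(clusters["A"] == clusters["C"])  # True
print(pick_representative([("A", 0.4, 10), ("B", 0.4, 20), ("C", 0.1, 30)]))  # B
```

Note how the recency tie-break is exactly where the time stamp problem I describe next would bite.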
One thing that bothered me about the use of age to determine which duplicate and near-duplicate pages to return, and which to filter out, is that any document on the web can be saved and have a recent time stamp, even if it has been on the web for years.
After this newer patent, we may have a better sense of how Google determines the age of a page. There are other factors listed in the application which describe ways in which Google can do that. (Monika Henzinger is a co-inventor of both patents, which may account for some similarities.)
The sections on determining the age of a page are an important part of this newer patent application. It's probably worth looking at those closely, and trying to translate them from the legal language they are presently couched in.
Posted 01 April 2005 - 12:02 AM
Interesting. So stale will be determined, and dropped links would matter, as would new links. Might make things more timely, without the overhead of PageRank, and with perhaps different areas given different stale values. After all, stale as it relates to the works of van Gogh is different to stale as it relates to SEO!!
20. The method of claim 19, wherein the scoring the document includes: determining whether stale documents are considered favorable for a search query when the document is determined to be stale, and scoring the document based, at least in part, on whether stale documents are considered favorable for the search query when the document is determined to be stale.
22. The method of claim 1, wherein the one or more types of history data includes information relating to behavior of links over time; and wherein the generating a score includes: determining behavior of links associated with the document, and scoring the document based, at least in part, on the behavior of links associated with the document.
LMAO!! Now not only do links "bleed" PageRank, but they now also "decay". I wonder if they bleed decayed PageRank
54. A method for ranking a linked document, comprising: determining an age of linkage data associated with the linked document; and ranking the linked document based on a decaying function of the age of the linkage data.
All in all, some interesting ideas, and again, very hard to see a way to manipulate!! All I can think of would be rotating links with every Google crawl to keep links "fresh". Anyone else think of any black hat uses for all this ??
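Claim 54 only says "a decaying function of the age of the linkage data" and doesn't name the function. Purely as an illustration, here's what an exponential decay with a half-life would look like. The half-life of 180 days and the function name are my own assumptions.

```python
def decayed_link_score(link_ages_days, half_life_days=180.0):
    """Sum of per-link weights, where each link's weight halves every half-life.

    A brand-new link counts as 1.0; a link that is one half-life old counts 0.5.
    """
    return sum(0.5 ** (age / half_life_days) for age in link_ages_days)


fresh_links = [0, 10, 20]        # three links found in the last few weeks
old_links = [700, 800, 900]      # three links that are a couple of years old
print(decayed_link_score(fresh_links) > decayed_link_score(old_links))  # True
```

Under a scheme like this, rotating links to keep them "fresh" would indeed matter, which is presumably why the application also watches for unusual link churn.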
Toolbar usage.. Interesting...
 According to an implementation consistent with the principles of the invention, information relating to traffic associated with a document over time may be used to generate (or alter) a score associated with the document. For example, search engine 125 may monitor the time-varying characteristics of traffic to, or other "use" of, a document by one or more users. A large reduction in traffic may indicate that a document may be stale (e.g., no longer be updated or may be superseded by another document).
Posted 01 April 2005 - 12:26 AM
I'm plodding along, and you're asking some great questions. I hope that I can uncover some answers by being slow and methodical.
Assigning Age Rank
How do we tell the age of a document, and determine whether or not it is stale? What types of things would be used to give a score to a document based upon that age?
1. Information is gathered from a couple of different sources about the age of a document.
2. Information is gathered from a few different sources about the age of links leading to and from that document.
We'll get to those sources further along. But first...
Defining a Document
Before we look too deeply at this patent, and determine whether it has an impact on the ranking of web pages based upon the age of those pages, we have to get something else out of the way.
One of the important aspects of this patent is that a "document" isn't necessarily just a web page. A document could be a web page, or it could be "an e-mail, a web site, a file, a combination of files, one or more files with embedded links to other files, a news group posting, a blog, a web advertisement, etc."
So, the application is looking at more than just web pages. It can look at parts of pages, or even collections of pages.
The patent also notes at this point that documents can have "forward" links leading from them to other documents, and "back" links leading to them.
So, why is it important to define "documents" differently than pages? How can that make a difference?
Posted 01 April 2005 - 12:37 AM
In the following, the term "document(s)" should be broadly interpreted and may include content such as Web pages, text files, multimedia files, object features, link structure, etc.
Is that a difference that matters? I don't know yet.
Posted 01 April 2005 - 01:40 AM
document inception date
This can be determined a number of different ways (maybe based upon what type of document it is, or by what implementation of the application is being used):
When first crawled by the search engine
When first submitted to the search engine
When a link to the document is first discovered
Domain registration date
When first referenced in another document
When a document first reaches a certain number of pages
By the time stamp of the document on the server it is hosted upon.
The application tells us that under a link-based ranking system that doesn't use age-based information, a document with fewer links to and from it may rank lower than a document with more links to and from it.
But, if the document with fewer links can be determined to be newer, based upon the document inception date, it might just rank higher than an older document with more links because it has a higher rate of growth. But too many links, coming too quickly to the newer document, may also be a sign that some type of spamming is happening.
So, how is that rate determined, and how much does it influence the overall ranking of a page?
This formula is given as one way of determining that:
H = L / log(F + 2)
where H may refer to the history-adjusted link score, L may refer to the link score given to the document, which can be derived using any known link scoring technique (e.g., the scoring technique described in U.S. Pat. No. 6,285,999) that assigns a score to a document based on links to/from the document, and F may refer to elapsed time measured from the inception date associated with the document (or a window within this period).
The patent further refines this formula by negating some of the difference between the ages of the documents, in a recognition that some "older documents may be more favorable than newer ones" and that some sets of results can be fairly mature. The scores of documents can be influenced (positively or negatively) by the difference between the document's age, and the average age of documents resulting from a query.
So, a fairly new site that appears amongst a set of results that are, on the average fairly old, may find it being negatively influenced by that difference in age.
There are, however, a number of other ways to assign a score based upon age, which can influence the ranking of a site. The patent goes into those in more detail. And I will, too.
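Going back to the formula for a moment: as the application gives it, H = L / log(F + 2). Here's a small sketch to show what it does to a young page versus an old one. The natural log and the choice of months as the unit for F are my assumptions (the application doesn't pin either down), and the link scores are invented numbers.

```python
import math


def history_adjusted_link_score(link_score, elapsed):
    """H = L / log(F + 2).

    link_score (L): score from any link-based technique.
    elapsed (F): time since the document's inception date (months assumed here).
    The + 2 keeps the denominator positive even when F is 0.
    """
    return link_score / math.log(elapsed + 2)


young = history_adjusted_link_score(link_score=10, elapsed=1)    # 1 month old
old = history_adjusted_link_score(link_score=20, elapsed=60)     # 5 years old
print(young > old)  # True: half the raw link score, but the higher adjusted score
```

So a page that has gathered links quickly can come out ahead of an older page with a bigger raw link score, which is the "rate of growth" point above.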
Posted 01 April 2005 - 02:27 AM
If it's okay, I'll post the link here (I posted it in the Spider-Food forum):
This is a hypothetical position, not the official Michael Martinez interpretation of Google. But it comes close to stating some of the principles I have been working with for a long time.
The patent doesn't really tell us whether Google is doing this stuff now, but it does make it sound like they have been tinkering under the hood with these ideas -- especially given the search engine's behavior over the past year.
Posted 01 April 2005 - 07:59 AM
I'm a little torn between moving forward with a slow, plodding look at the patent, or rereading your posts first, and then going forward. I might try to keep at the patent before jumping off to seeing how your interpretation there matches up.
But I like the idea of seeing how what you wrote matches up with what Google has now released that seems to indicate that they are using age of documents as a consideration in ranking pages. If anyone else wants to bring this discussion that way, I'd say go for it.
Posted 01 April 2005 - 11:00 AM
I'm almost tempted to describe the patent as an April Fool's Joke, to be honest. It is so exhaustive I don't see how they could possibly seriously attempt all that stuff. But then, would the patent office really appreciate that? Would it be the first time someone patented nonsense as a joke?
Posted 01 April 2005 - 02:08 PM
Posted 01 April 2005 - 04:53 PM
I like that new avatar. Is it from down under?
Posted 01 April 2005 - 09:10 PM
You have a web site. It ranks well in Google, and has for years, and you are afraid of changing anything. But, you think that if you make some changes, you might get more conversions on your page.
If you update the page, will this historical data measuring make a difference?
You have a blog that you update almost everyday. You go on vacation for two weeks, and then have a family emergency that keeps you from your web site for another two weeks. Has the failure to update your site in a month influenced how your page ranks in Google?
The application recognizes that pages change. Some of them change more rapidly than others. How does that fit into Google's ranking of pages?
We are given another mathematical formula in this section.
An "Update score" (U) is calculated using frequency of change, and amount of change.
An "update frequency score" (UF) may be used to calculate how often a document (or page) changes over time. It could be determined by the average time between updates or the amount of updates over a period of time.
An "Update amount score" (UA) represents how much a document (or page) has changed over time. The update amount score looks at a number of possible changes, and gives different weights to different kinds of changes.
Kinds of UA updates:
The number of "new" or unique pages associated with a document over a period of time.
The ratio of the number of new or unique pages associated with a document over a period of time versus the total number of pages associated with that document.
The amount that the document is updated over one or more periods of time (e.g., n% of a document's visible content may change over a period t, such as recent months), which might be an average value.
The amount that the document (or page) has changed in one or more periods of time (e.g., within the last x days).
Weights of UA updates:
Unimportant if updated/changed:
boilerplate material, or
date/time tags.
These could be given little weight or even ignored altogether when determining UA.
Important if updated/changed (e.g., more often, more recently, more extensively, etc.):
anchor text associated with the forward links.
These could have a much bigger impact when determining UA.
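Here's a rough sketch of how an update score built from UF and UA might be put together. The application leaves the combining function open, so the weighted sum, the specific weights, and the "body_text" category are all my own assumptions; only the idea that boilerplate and date/time tags count for little and forward-link anchor text counts for a lot comes from the text.

```python
# Assumed weights per kind of change (only the ordering reflects the application).
CHANGE_WEIGHTS = {
    "boilerplate": 0.0,
    "date_time_tag": 0.0,
    "body_text": 1.0,     # hypothetical middle category
    "anchor_text": 3.0,
}


def update_frequency_score(update_days):
    """UF as updates per day, from the average gap between observed updates."""
    if len(update_days) < 2:
        return 0.0
    average_gap = (max(update_days) - min(update_days)) / (len(update_days) - 1)
    return 1.0 / average_gap if average_gap > 0 else 0.0


def update_amount_score(changed_fraction_by_kind, weights=CHANGE_WEIGHTS):
    """UA as a weighted sum of how much of each kind of content changed (0..1)."""
    return sum(weights.get(kind, 1.0) * fraction
               for kind, fraction in changed_fraction_by_kind.items())


def update_score(uf, ua, uf_weight=0.5, ua_weight=0.5):
    """U combining UF and UA; a weighted sum is assumed here."""
    return uf_weight * uf + ua_weight * ua


uf = update_frequency_score([0, 10, 20])   # updated every 10 days
ua = update_amount_score({"date_time_tag": 1.0, "anchor_text": 0.5})
print(update_score(uf, ua))
```

With these weights, a page that only changes its date stamp every day gets a UA of zero no matter how often it "updates".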
Posted 02 April 2005 - 01:20 AM
I think the patent implies they will look at whether ads change, too.
Posted 02 April 2005 - 01:29 AM
Rather, the application is drawing a line in the sand here. It's looking at how a web page can change over time, and deciding that some aspects of change on a page, or part of a page, or web site as a whole may be considered less important than others.
So, if someone is showing ads on their site, or using JavaScript to display an RSS feed, and so on, and these things change on a regular basis, those changes are much less important than a page title change, or a change in the anchor text of a link leading from the page.
Posted 02 April 2005 - 02:02 AM
Your dedication is impressive. Thank you for sifting through this data and writing out your thoughts.
Posted 02 April 2005 - 02:19 AM
I was going to pick up where I left off, but I noticed in another thread on another forum a valid complaint against some interpretations of this patent application.
The complaint was that the application doesn't have Google's name on it. And because of that, it's misleading to attribute many of the things listed in the patent application to Google, and to what Google is doing on the web presently.
The complaint was mainly targeted at an in-depth analysis of the patent application by Randfish, which I'm trying not to read while we pursue this much more rambling and discursive analysis. But, if you want to jump over to what he has done (yes, I've peeked), it's at:
And while this patent hasn't been legally assigned to Google, and we are making an assumption if we believe that it might be, I thought it might be a good idea to find out a little more about some of the authors of the patent application.
At this point in time and at the time of filing of the patent application, they do all appear to be Google employees. I don't know how I would feel as an employer if some of my best and brightest employees got together, and invented what appears to be a complex and fairly comprehensive application on the use of aging information to influence the rankings of documents in search engine results.
But, it doesn't have the search company's name upon it.
So, keep in mind that all of this speculation about how this aging information fits in may not describe how Google presently works. Instead, the application covers the workings of "Search Engine 125."
Here are the inventors (I'm not sure of all of their official titles at Google, but I've included what I was able to find):
Principal Engineer, Google
One of the major forces behind Google Scholar.
Anurag Acharya Helped Google’s Scholarly Leap
Software Engineer, Google
A popular speaker at Search Engine Strategies and other conferences, including one for Consumer Webwatch, where he made this statement:
Because for some searches, there's some small or — not the majority, but some percentage of searches which are commercial. Say, 30, 40 percent, where somebody's looking to buy a product. And in those sorts of searches, the advertisements can be just as useful as the search results. In fact, we often order the search results by the click-through percentage.
He also gives a number of interviews on behalf of Google, like these:
Inside the Google search machine
Interview with Matt Cutts, Google
Distinguished engineer in the Google Systems Research Lab
If you look through a number of Google patents and patent applications, you may notice his name appears more than once as a co-inventor. Here's a paper on the physical side of Google's operation, that he co-authored:
Web Search For A Planet: The Google Cluster Architecture
Senior software engineer, search quality group. (Google)
Director of Research at Google
Another name that springs up in Google patents and patent applications on a regular basis. And a few for some other search companies before Google had the good fortune to employ her.
Still Searching (After All These Years)
Google's Research Maven
Peeking Into Google
Urs Hoelzle, Vice President of Operations, Google
Google's Management Team
Google Senior Research Scientist
Steve Lawrence: List of publications from the DBLP Bibliography Server
The main developer of CiteSeer, presumably involved with Google Scholar, and the lead developer of Google's desktop search.
Google unveils desktop search
Karl Pfleger's Other Interests
Google Software Engineer
Worked on developing the cluster management system used by MapReduce.
MapReduce: Simplified Data Processing on Large Clusters
After an internship at Google, Inc., he now works full time for the California company as a software engineer.
Engineering News: School of Engineering and Applied Science at Washington University in St. Louis (Summer 2002)
Research Scientist, Google Inc.
One of the folks in the Google Labs.
alicebot-general - Fw: Fw: another way cool thing at Google
ps. On preview, you're welcome Elizabeth. I'm trying to explain as much of this in as plain language as I can. I think I'm getting better at it as I go. Hopefully, I'll get even better.
pps. Nice Avatar. It's good to be able to see the faces behind posts.
Posted 02 April 2005 - 09:54 PM
I probably should have included one more paragraph in the previous post dealing with changes. The next section of the patent talks about comparing the rate and amount of changes over more than one period of time.
The way that paragraph is worded, it appears that the patent application may often favor sites that are updated frequently, and that show an increase in a rate, and amount of change.
All of this extra monitoring creates some new challenges, like where to put all the extra data that comes with it.
The Problem of Storing Historic Data to Compare for Changes
It struck me while reading all of this information about tracking changes in documents that it would take a considerable amount of storage to house older copies of documents and allow a comparison.
While Google presently claims "8,058,044,651 web pages" on the front of the site, the Internet Archive had ten billion pages indexed in 2001, and an estimate from October of 2004 indicated 30 billion pages. The Internet Archive has copies of sites as they have changed.
So, if a search engine is tracking changes to documents over time, it needs to store information about those changes. But storing exact copies of documents could cause its database to balloon in size quickly.
And some type of strategy may be needed to keep the amount of information fairly small.
Fingerprints and Other Approaches
There was a mention of "fingerprints" in this thread from rcjordan, and the duplicate content patent I mentioned in a prior post also talks about a fingerprint strategy. That's one possible way for a Search Engine to track changes without having to keep exact copies of earlier documents for comparison.
When fingerprints are compared, the usual method is to look at only a few places on the prints, and see if there is a match in those places. There isn't an attempt to overlay one print with another and try to achieve an exact match.
Rather, information about the characteristics of a print at different places on a finger is stored in a database. When looking for a match, the information about those points is compared. So, matching fingerprints doesn't call for exact matches of prints, but rather matches at a number of predetermined points on prints.
The patent application addresses this need for storage capacity, when "monitoring the documents for content changes," and provides a number of different ways for tracking changes while not storing full copies of documents.
Representations of the documents, like the fingerprint information, can be stored and monitored for changes. Here are the strategies that the patent application mentions:
"Signatures" of documents may be stored instead of the entire documents.
Term vectors may be stored and monitored for relatively large changes.
A relatively small portion (e.g., a few terms) of the documents that are determined to be important or the most frequently occurring (excluding "stop words") may be stored and monitored.
A summary or other representation of a document may be maintained.
A similarity hash (which may be used to detect near-duplication of a document) may be stored for the document and monitored for changes. A change in a similarity hash may be considered to indicate a relatively large change in its associated document. (See the patent on duplicate content - in addition to looking for duplicates of documents in other places on the web, it can be used to compare newer and older copies of the same document.)
Other possible options may be considered and used, too.
Of course, there is the possibility that there is enough storage, and full copies of documents can be maintained and monitored (see Google's cache, for instance).
(I've kept the language of some of those list items exactly the same as they appear on the application because it's either not really clear what those words mean yet, or they are already fairly clear.)
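To show why these representations save space, here's a minimal sketch of two of them: a stored "signature" for spotting any change at all, and a handful of frequent terms for spotting a big change. Using SHA-256 as the signature, the tiny stop word list, and the overlap measure are all my choices for illustration; the application doesn't say how a signature is computed.

```python
import hashlib
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}  # tiny sample list


def signature(text):
    """A fixed-size stand-in for the full document (SHA-256 assumed here)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def top_terms(text, k=5):
    """The k most frequent terms, excluding stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return {term for term, _ in counts.most_common(k)}


def has_changed(stored_signature, new_text):
    """True if the document differs at all from the version that was signed."""
    return signature(new_text) != stored_signature


def term_overlap(stored_terms, new_text, k=5):
    """Share of top terms in common (1.0 = same terms, 0.0 = none in common)."""
    new_terms = top_terms(new_text, k)
    if not stored_terms and not new_terms:
        return 1.0
    return len(stored_terms & new_terms) / len(stored_terms | new_terms)


old_text = "Google ranks pages by links"
stored_sig, stored_terms = signature(old_text), top_terms(old_text)
print(has_changed(stored_sig, old_text + "!"))                             # True
print(term_overlap(stored_terms, "completely different words here now"))   # 0.0
```

The search engine only has to keep the 64-character signature and a few terms per document, rather than every old copy. A plain hash like this flags any change, however small; a similarity hash, as in the duplicate content patent, is meant to move only when the change is large.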
Some Change isn't Good
While it has been implied earlier, the next section (0055) shows a recognition that for some types of queries, changes may not be a good thing. For those queries, the update score could be adjusted based upon the difference from an "average date-of-change" of the results from that query.
Here's how that might work when results from a search are returned by the search engine:
Each document has a last change date,
An average is calculated for the documents in the results,
The documents' ranking scores are modified positively or negatively based on the difference between the last change date and the average date-of-change for all of those documents.
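Those three steps can be sketched like this. The per-day weight and the switch for whether fresher or older documents are favored for a query are my assumptions; the application only says scores may be modified "positively or negatively" based on the difference.

```python
from datetime import date


def adjust_for_change_date(results, fresh_is_better=True, weight_per_day=0.001):
    """results holds (doc_id, score, last_change_date) tuples.

    Each score is shifted by how far the document's last change date sits
    from the average date-of-change of the whole result set.
    """
    ordinals = [changed.toordinal() for _, _, changed in results]
    average = sum(ordinals) / len(ordinals)
    direction = 1 if fresh_is_better else -1
    return [
        (doc_id, score + direction * weight_per_day * (changed.toordinal() - average))
        for doc_id, score, changed in results
    ]


results = [
    ("older-page", 1.0, date(2004, 1, 1)),
    ("newer-page", 1.0, date(2005, 1, 1)),
]
favor_fresh = dict(adjust_for_change_date(results, fresh_is_better=True))
favor_stale = dict(adjust_for_change_date(results, fresh_is_better=False))
print(favor_fresh["newer-page"] > favor_fresh["older-page"])  # True
print(favor_stale["older-page"] > favor_stale["newer-page"])  # True
```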
Summary of the Section
Some type of age rank is created for a document based upon information collected about how old that document is and how it has changed over time.
The summary of this section of the patent adds a detail about "large" documents belonging to more than one person or organization. For those, the score may be broken down into scores for smaller sections, where the content, and changes to that content, are under the control of one person or organization.
For instance, each of the blogs hosted on Blogspot might be treated differently under this age scoring system.
Posted 02 April 2005 - 10:28 PM
The bit in bold is interesting, and really something that bears remembering. The age or otherwise of a document may go either way.
 Search engine 125 may use the inception date of a document for scoring of the document. For example, it may be assumed that a document with a fairly recent inception date will not have a significant number of links from other documents (i.e., back links). For existing link-based scoring techniques that score based on the number of links to/from a document, this recent document may be scored lower than an older document that has a larger number of links (e.g., back links). When the inception date of the documents are considered, however, the scores of the documents may be modified (either positively or negatively) based on the documents' inception dates.
(My Emphasis Added)
Just to clarify: has this patent actually been granted, or is it just an application???
Posted 02 April 2005 - 11:13 PM
It's just an application.
All we should really be doing is analyzing the patent application without drawing conclusions. It's possible that for some queries older is better, and for others fresher is better.
I would guess, for instance, that a search for a historical document, such as "The Declaration of Independence" might favor an older document that hasn't changed much over time. But a search for "physics research" might favor fresher material. But that's just a guess.
While this application is broad in its coverage, it isn't very deep. It describes a wide range of possible ways to rank documents, but it doesn't often explain the "why" behind the "how."
And, we can't be certain that Google has adopted any or all of the various permutations described. It is just an application, but parts of it could be in use by Google. It was unpublished until a few days ago, but parts of it could have been in use as early as a little less than a year before the file date. (Any earlier and the application could possibly be denied patent status.)
While the application uses a hypothetical "Search Engine 125" in its descriptions instead of Google, the nature of patents and the fact that all of the Google employees who gathered to write it are still employed by Google suggest that it is something that Google will be assigned at some point in time.
But even that doesn't mean that Google will use any of it.
Patents are exclusionary, sort of like robots.txt files are exclusionary. Those robots.txt files don't tell search spiders which pages to index, but rather which pages to not index. Patents don't grant the inventors or assignees the right to use the method described. They allow those inventors or assignees the right to stop others from using the invention described.
Posted 03 April 2005 - 01:29 AM
http://assignments.u...e=GOOGLE&page=1, 20050071741 is the third column number, and http://assignments.u...pub=20050071741 shows:
Patent #: NONE Issue Dt: (this is blank)
Posted 04 April 2005 - 02:13 PM
Bill - Bravo! Excellent stuff as I read through your thread here. Thanks.
Posted 04 April 2005 - 08:56 PM
We're getting there. Slowly, but the patent application isn't going anywhere. I was impressed that you were able to condense the patent down to something quite readable, and understandable on your pages.
Posted 04 April 2005 - 10:50 PM
Age and popularity seem to matter under this patent application when it comes to how searchers interact with search results.
1. Click Through Rates
The first paragraph in this section tells us that it can make a difference when someone chooses one result within the results of a query over another one. That choice, repeated over time by searchers, can cause a document to be ranked higher than other documents retrieved in the same query.
Regardless of this patent application, a good page title, written with the recognition that the title will be seen outside of the context of the page it is upon, can influence someone to click upon it. It may convince someone that it is a result they should visit, possibly even ahead of pages that are listed higher in search results.
Under this application, an interesting and persuasive page title that seems to fulfill the objectives of a searcher, may move a result up in rankings if it attracts click-throughs.
2. Popular Pages are Pages that Show Up in Popular Queries
Documents which show up regularly as a result of queries for "hot" topics or "breaking news events" may be scored higher than pages which don't show up for searches for popular terms (under this patent application).
We know from the Google Zeitgeist that Google can track which queries are the most popular over a period of time. Imagine a search engine which gives documents (individual pages and even sites) a boost in rankings for appearing in searches for popular terms.
3. Increasing a Document's Popularity in Similar Queries Increases that Document's Rankings
This section seems to indicate that not only will the search engine in this patent application watch which documents show up for popular queries, but will also check to see if those queries that the document shows up for are related to other popular queries it shows up in. If a document starts showing up more often in those similar and popular queries, it may gain a ranking bonus over other documents for those queries.
For some reason, I find myself thinking of the news stories that appear as the most relevant results in Google News for a query when reading this paragraph. Could Google News be using something like this?
4. Documents Become More Popular by Correctly Answering Queries with Answers that Change Over Time
This paragraph recognizes that the answers to some searches do change over time. The example that they provide is a search for "World series champion."
5. Freshness and Staleness
Some queries may be "stale" and return stale documents as search results. Factors to look for when determining if a document is stale include:
[list]document creation date,
[*]forward/back link growth,
[*]etc.[list]For some queries, there may be a presumption that more recent documents are very important (an example is given of Frequently Asked Questions pages).
But, the patent application's search engine will track whether fresher or staler results are selected as the result of queries. Will users of the search engine choose older, staler documents over new ones as a result of a search? Even if the staler site is initially ranked lower than a fresher one? If so, staler pages may end up ranking higher for those queries.
Staler results that may be showing up in the results for broad search terms rather than more specific ones may have their scores lowered based upon their staleness. A presumption seems to be that newer, updated pages should show up more frequently for broad searches on a topic.
7. Using Clickthroughs to Readjust Rankings Based upon Staleness
The presumption that a stale document may not be the best result doesn't always hold true. If people tend to pick a lower ranked, older, stale site over a newer, higher ranked site, then the staler site may be adjusted positively.
8. Spam and Queries
A site that ranks for a wide breadth of queries which may not be similar may be one spamming the search engine, and may have its rankings decreased.
It seems like a lot of information is being calculated or collected for which pages show up in rankings for which types of queries.
I'm not sure how we are supposed to interpret the phrase "discordant set of queries" in this section except possibly as a document showing up in rankings for searches that the search engine doesn't deem similar.
This section appears to pay a fair amount of attention to what people are searching for, and which documents they select as a result of those searches. It also keeps a close eye on which types of documents are chosen as a result of those queries, and if a document shows up in similar queries it benefits. If the document ranks well in nonsimilar queries, it can lose ranking.
This section seems like it could benefit sites that tend to focus on a narrow set of popular topics, and have well written titles and use keywords in well written sentences that can attract clicks out of context.
Posted 06 April 2005 - 12:41 AM
A new and sandboxed site gets a highly targeted article published on a high-traffic site, with link to new site's home page.
Does the spidering (or other factors?) from this link help the new site out of the sandbox?
If the new site has an easily spidered navigation structure, say some nice sidebars with deep text links, would spiders be more likely to follow deep links with names that are similar to the original article's keywords?
What if the deep links' linked text, anchor title, and file name were written to be search terms that would give highly targeted results on a topic? on any topic, or specifically the original article's keywords.
Would that further encourage spiders?
And would all this encouragement help the new site get itself out of the sandbox?
Posted 06 April 2005 - 05:27 AM
There are also some older new factors floating around like semantical relationships between pages and such things. These take a good amount of time to figure out, as well.
Basically, Google has always placed a high importance on how pages relate to each other. In the old days, PR was the primary factor involved in generating a value for these relationships. Nowadays, there is so much more.
So, you've got a new site with a new page on it. What does Google know about it? Well, it knows what the content is - but it still needs to validate that content and to do its voodoo to find out what that content is actually about. It knows how many links it's got, but it doesn't know anything about the relevance of those links, yet.
Determining the relevance of those links is critical, now, to how Google does everything. Age factors (as explained in this patent) or not, there's a lot here. What's the linking page about? How does it relate to your page that it links to? Under what circumstances are the pages related (i.e. what terms are common to them both)? How authoritative is the linking page - and, how authoritative is the linking page with respect to the specific topic at hand?
The more in agreement all the links are, the easier it'll be to do the math to figure out what the site/pages are about. The more authoritative and focussed the linking pages are, the fewer links you'll need before a pattern can be determined.
So, Google isn't putting a site in the sandbox for the sake of it - in fact, it doesn't "put" the site/page anywhere - it's just missing data that really helps in ranking. So, the effect is that your page won't rank very well until that data exists.
To answer your question, Elizabeth - yes, a link from an authoritative site that is on topic will probably speed up the process of getting you out of the sandbox. The reason is simple. Let's use this analogy...
You're trying to figure out which is the best search engine. You come here (where we have a bunch of people who work with search engines every day) and you ask, "What's the best search engine?" You get three answers. Two people say Google, one person says Yahoo. You ask the same question at a site that has a bunch of people who play Half Life. Six people answer and all but one of them say Google.
Which votes are you going to be most likely to believe? The ones from here? Or the ones from the gaming site? More importantly, on the authoritative site (here) there's also a vote for Yahoo. What is the significance of it? The fact that we are more of an authority on the subject than the gaming site means that that Yahoo vote has at least some importance or significance.
Now, if you never asked on an authoritative site and just focused on getting your best search engine votes from other non-related sites, you'd really have to get a lot of answers from a lot of people in order to be sure what was the best. Their opinions just don't mean as much (though their opinions on the best graphics cards for large 3D environment rendering would blow us out of the water).
Same's true with Google. They need fewer votes from authorities on a subject to figure out what's going on on your page than they would need if you only had a bunch of links from sites not dealing with your topic. They can still do it using non-authoritative links - they just need a lot more and thus, it takes more time.
Would that link get you out of the sandbox right away? Probably not - though maybe.
<shrug> This rambled on more than I meant it to. I was a bit all over the place (I hate my first post of the morning)... Anyway, there's bound to be something useful in there...
Posted 07 April 2005 - 12:25 AM
If the new site has an easily spidered navigation structure, say some nice sidebars with deep text links, would spiders be more likely to follow deep links with names that are similar to the original article's keywords?
I did tackle another section tonight on the freshness of links, which I'll post next. The section after that one talks about anchor text, and how it fits in, so I hope that you bear with me.
Would that further encourage spiders?
I've seen a couple of statements within the patent that some documents that are considered stale might have links that are ignored for determining rankings based upon age. But I'm not sure that those links will be ignored when spiders decide which links to follow to find other documents.
The patent application doesn't seem to describe the travels of a spider through the web, but rather the influence upon rankings of documents based upon historical data associated with them, including links to and from those documents, and other changes that can happen to a document.
So, Google isn't putting a site in the sandbox for the sake of it - in fact, it doesn't "put" the site/page anywhere - it's just missing data that really helps in ranking. So, the effect is that your page won't rank very well until that data exists.
That seems to sum everything up nicely about a "sandbox." Excellent post, Grumpus. It fits in well with my next attempt at a piece of the patent. Especially the part about links from authoritative sites.
Posted 07 April 2005 - 12:38 AM
1. Link-based factors - Appearance Date and Disappearance Date
Link based factors may be used to create or change a score for a document. These may be discovered during a crawl or index updating operation, and may relate to the appearance and disappearance dates of links to a document.
Appearance date of a link -
[list]a. the date a search engine first finds the link or
b. the date the document containing the link was found or
c. the date that document was last updated. [list]
Disappearance date of a link -
[list]The first date that the document containing the link either:
a. dropped the link or
b. disappeared itself. [list]
2. Time Varying Behavior of Links
Those appearance and disappearance dates are used to enable a search engine to watch a document's links change over time, and score the document based upon the changes:
[list]a. when links appear or disappear,
b. the rate of appearance or disappearance of links over time,
c. the number of links appearing or disappearing during a given time period,
d. trends involving the appearance of new links versus disappearance of existing links to the document,
e. other possible factors.[list]
3. New links to and from a document
Compare the number or rate of new links acquired recently against an older time period.
A downward trend in number or rate of new links could mean a stale document, and a decrease in score.
An upward trend in number or rate of new links, depending upon situation and implementation, could mean a fresher document.
This "freshness" is taken to mean it is more relevant, or a document whose content is "fresh" by being recently created or updated.
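One way to picture the comparison in this section is a ratio of new links in the most recent window to new links in the window before it. This is only a sketch under my own assumptions: the 30-day window, the function name, and the use of a simple ratio are not from the application, which leaves the details to the implementation.

```python
from datetime import date, timedelta

def new_link_trend(link_dates, today, window_days=30):
    """Ratio of links first seen in the latest window to links first seen
    in the window just before it.

    link_dates: dates on which links to the document first appeared.
    Returns None when the older window has no links (no baseline).
    A value below 1 is a downward trend, above 1 an upward trend.
    """
    recent_start = today - timedelta(days=window_days)
    older_start = today - timedelta(days=2 * window_days)
    recent = sum(1 for d in link_dates if recent_start < d <= today)
    older = sum(1 for d in link_dates if older_start < d <= recent_start)
    if older == 0:
        return None
    return recent / older
```

What counts as "stale enough" to lower a score would be a separate threshold, and the application doesn't give one.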
4. Changes in Back Links to a Document
Noting over time the change in number or rate of the increase or decrease of back links to a document, or a page, can give a search engine insight into how fresh the document is.
A downward trend may signal that the document may be stale:
[list]a. no longer updated,
b. diminished in importance,
c. superseded by another document,
d. others. [list]
5. Changes in New Links to a Document
These are things that can be looked at in new links to a document:
[list]a. Amount of new links in a certain number of days against new links since the document was first found, or
b. The oldest age of the most recent y % of links compared to the age of the first link found. [list]Example:
When comparing two documents, consider y=10 and that documents are web sites that were both first found 100 days ago.
Site A - 10% of the links were found less than 10 days ago,
Site B - none of the links were found less than 10 days ago.
In this case, the metric results in 0.1 for site A and 0 for site B. The metric may be scaled appropriately. [list]
c. a more detailed analysis of the distribution of link dates
Models can be used to predict if a particular distribution of new links signifies a particular type of site:
a. no longer updated,
b. increasing or decreasing in popularity,
c. superseded by another document,
d. others. [list]
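The site A / site B example in section 5 works out if the metric is read as the share of a document's links that were found in the last n days (measure a. above). Here's a small Python sketch of that reading; the function name and the idea of feeding it link ages in days are my own assumptions.

```python
def recent_link_fraction(link_ages_in_days, n_days=10):
    """Share of a document's links that were first found fewer than
    n_days ago. link_ages_in_days holds one age (in days) per link."""
    if not link_ages_in_days:
        return 0.0
    recent = sum(1 for age in link_ages_in_days if age < n_days)
    return recent / len(link_ages_in_days)

# Both sites first found 100 days ago, 100 links each (made-up counts).
site_a = [5] * 10 + [50] * 90   # 10% of links are under 10 days old
site_b = [50] * 100             # no links under 10 days old
```

With those inputs the function gives 0.1 for site A and 0.0 for site B, matching the numbers in the example. The "scaled appropriately" step is left out because the application doesn't say how.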
6. Weights in Links
Each link may be weighted by a function that increases with the link's freshness, which may be determined by the date of appearance or change of:
[list]a. the link,
b. the anchor text for the link,
c. the document containing the link. [list]The third option, the date of appearance or change of the document containing a link, may be the best indication of the freshness of a link based on the theory that a good link doesn't change when a document gets updated if the link is still relevant and good.
A link's freshness may not be updated based upon a minor edit of a tiny unrelated part of a document. Instead, updated documents may be tested for significant changes:
[list]a. Changes to a large portion of the document or
b. Changes to many different portions of the document.[list]A link's freshness may or may not be updated based upon the nature and extent of the change to the document.
7. Other Ways to Weigh Links
The freshness of links may be weighted based on:
[list]a. How trusted the documents containing the links are, such as government documents, or
b. How authoritative those documents are, or
c. The freshness of those documents. [list]
8. The Freshness of a Document can Depend upon the Freshness of Links to the Document
A search engine may raise or lower the score of a document to which there are links as a function of the sum of the weights of the links pointing to it.
This technique may be employed recursively. For example, assume that a document S is 2 years old. Document S may be considered fresh if n % of the links to S are fresh or if the documents containing forward links to S are considered fresh. The latter can be checked by using the creation date of the document and applying this technique recursively.
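The recursive idea is easier to see in code. This is a simplified sketch, not the application's method: I'm assuming a document is fresh if it was created within `max_age` days, or if at least `n_percent` of the documents linking to it are themselves fresh, and I've added a `depth` limit (my own addition) so link loops can't recurse forever.

```python
def is_fresh(doc, created_days_ago, backlinks, max_age=365, n_percent=50, depth=2):
    """created_days_ago: {doc: age in days}. backlinks: {doc: [linking docs]}.

    All thresholds here are made up for illustration.
    """
    if created_days_ago[doc] <= max_age:
        return True                      # recently created, so fresh on its own
    sources = backlinks.get(doc, [])
    if depth == 0 or not sources:
        return False                     # old, and nothing left to vouch for it
    fresh_sources = sum(
        1 for src in sources
        if is_fresh(src, created_days_ago, backlinks, max_age, n_percent, depth - 1)
    )
    return 100 * fresh_sources / len(sources) >= n_percent

# Document S is 2 years old; A is new; B is old but linked from new page C.
ages = {"S": 730, "A": 30, "B": 900, "C": 10}
links_to = {"S": ["A", "B"], "B": ["C"]}
```

Here S comes out fresh because both A and B do, and B only qualifies through C, which is the recursion at work.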
9. Age Distributions of Links to Documents
Creation dates for links pointing to a document may be input to a function determining the age distribution of those links. Assuming the age distributions of stale documents and fresh documents will be very different, a search engine may then score documents based, at least in part, on the age distributions associated with those documents.
10. Detecting Spam (Legitimate Documents Attract Links Slowly)
The dates on which links to a document appear can be used to detect "spam," where document owners, or their colleagues, create links to their own document to boost the score assigned by a search engine.
A typical, "legitimate" document attracts back links slowly.
Faster increases in back links may signal:
[list]a. a topical phenomenon (a hot topic in the news, reported upon the document), or
b. attempts to spam a search engine for increased rankings by:
[list]i. exchanging links,
ii. purchasing links, or
iii. gaining links from documents without "editorial discretion" on making links such as guest books, referrer logs, and "free for all" pages that let anyone add a link to a document. [list]
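A rough sketch of how a sudden jump in back links might be flagged follows. The weekly buckets, the `factor` of 5, and the function name are all assumptions of mine; the application doesn't say what counts as "faster."

```python
def link_growth_spike(weekly_new_links, factor=5.0):
    """weekly_new_links: counts of new back links per week, oldest first.

    Returns True when the latest week brought more than `factor` times
    the average of the earlier weeks. The baseline is floored at 1 so
    that a brand-new page picking up its first few links isn't flagged.
    """
    if len(weekly_new_links) < 2:
        return False
    *history, latest = weekly_new_links
    baseline = sum(history) / len(history)
    return latest > factor * max(baseline, 1.0)
```

Note that a spike on its own can't tell a hot news topic (case a.) from link spam (case b.). Something else, such as where the links come from, would have to separate the two.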
11. Disappearing Links
The disappearance of many links can mean that the document to which these links point is stale, and is no longer being updated or has been replaced by another document.
For example, a search engine may track:
[list]a. the date one or more links to a document disappear,
b. the number of links disappearing in a given period of time, or
c. another time-varying decrease in the number of links, or updates of links, to a document to identify documents that may be considered stale. [list]Once a document has been determined to be stale, links in that document may be discounted or ignored by a search engine when determining scores for documents pointed to by the links.
12. The Dangers of Featured Links
The dynamic nature of some links may also be considered, in addition to the age of links. Documents having a different featured link each day, despite it being a very fresh link, may be ranked differently (perhaps lower) than documents that are consistently updated and consistently link to a given target document.
13. Scores of Documents may be Influenced by the Scores of Documents Linking to Them.
A search engine may generate a score for a document based on the scores of the documents with links to the document for all versions of the documents within a period of time. The major update times of the document may also factor into this score.
Posted 07 April 2005 - 01:48 AM
So, Google isn't putting a site in the sandbox for the sake of it - in fact, it doesn't "put" the site/page anywhere - it's just missing data that really helps in ranking. So, the effect is that your page won't rank very well until that data exists.
What strikes me about this patent is not so much that it is a patent but that it exposes details of how Google works. Both black and white hat SEM can analyze Google's avowed methods and put them to the test, for any range of motivations. Exposure indicates Google's confidence in their algo. Exposure invites challenge. Whatever they've come up with behind closed doors will now be tested out in the open.
And I'm just another person trying to connect some dots. :-)
Posted 07 April 2005 - 03:31 AM
I think the Google folk have done a pretty good job in this patent. It's a pretty classic case.
Posted 07 April 2005 - 07:22 AM
If we understand why, though, we not only get a working idea of what needs to be done to help our sites rank better using this portion of the algo, but we can also anticipate what will likely happen in the future as more technologies are added to the soup. Optimizing a site for "what works now" has always been a tricky thing - it works now, but it rarely works for long. If we know the motivations though, it's fairly easy to see "what will work in the future" and optimize our sites for that ahead of time.
I've always followed this philosophy and have never once had to rework an existing site because it suddenly slipped off the radar because of one of the Winter Shuffles we've seen from Google over the past few years, nor during the days of the monthly updates.
Another thing that people tend to do when looking at stuff like this is to look at it in isolation. If we try to determine why (or even what) Google is doing by just looking at this document, we have nothing to do but to take it at face value - which we all know isn't going to be a very good thing. Looking at it this way is also dangerous because it can make this seem a lot more important than it really is.
So, let's look at the motivation here, a bit.
We know that Google's ultimate goal is to be able to look at the web just as a human would - albeit a human with a photographic memory and an advanced cartography degree. We also know that Google's ultimate goal is to index the entire web some day - and, though it doesn't say it in the patent, I suspect that this is the want that Google is moving toward with this patent.
Okay, on the surface, knowing the age of any given link does probably have some value in ranking. If a page has lots of links, but they are all more than a year old, it tells us some things. One, it's not a very timely topic. It might be a page about some popular thing from days gone by (Pet Rocks, anyone?), it might be a chronicle of an historical event, or whatever.
For example, there's a bunch of stuff going on in Italy right now with the recent passing of the Pope. So, after I visit Italy, I create a page on my site that has a photo gallery of the stuff that's going on right now. Because of the timely nature of my page, I'm likely to get a bunch of links over the coming few weeks.
Now, someone does the search "Pope's death". If they had performed that search two weeks ago, they would likely have gotten a bunch of pages listing all the Popes throughout history along with the date they died. But now if someone types that in, it's a fair assumption that they are looking for information about this current set of events. My page has lots of new links, so even without figuring out how old the page is, we can assume that there is something topical going on now in this arena of "Pope Death". And, there's suddenly a bunch more searches going on using this term, too. So, it makes sense to match up the increased searches to the page we know is most recently updated on the subject.
Over time, my page will get fewer and fewer new links as the event gets farther and farther into the past. But, then again, any other page that is on the same subject is going to get fewer and fewer new links, too. And, there will surely be new pages in the area of Papal History (or future) that will be getting newer links. Still, I should rank well for "Pope's Death" for a good long time since those other new pages will be considerably different and, unless we have a repeat of the last time, there shouldn't be another Papal death for a few decades. So, having "old links" isn't going to be a bad thing for me three years from now.
Some topics have longer lives. They are just things that are interesting or damned funny. For example, my all-time favorite web page is from back in at least 1996 - maybe earlier (though it did get redesigned once and moved once or twice). It is the Food I Fear page. It's been around so long, it doesn't likely get a bunch of new links like something that is timely would - but, it's a classic, so it probably does get several new links each year. The combination of lots of age-old links combined with a sprinkling of new links that pop up every once in a while is valuable information to Google. It tells it that this page really does have longevity. New links, old links, links links links. In fact, because of this nice consistent thread of new links, the fact that it has old links and new links might even tell Google that it's okay to treat all links equally, regardless of age. So, if that's the case, then old links aren't a bad thing and it would suit Google's needs well - since it is so consistently linked to (and that linking isn't tied into a burst of popularity for a certain term as the patent does describe tracking various user metrics and habits), it's a good all-around page on its topic.
Another thing this patent doesn't really go into is whether age is good or bad (or more importantly, whether it's good or bad in all contexts). I suspect that to suit Google's wants, new is a really good thing if there's also a surge in search term popularity for a term or set of terms that has that page in the "batch". But, if there isn't a term surge, then those new links aren't going to help as much because if there isn't a desire for that page out there, then where did those links suddenly come from? A link exchange deal? Someone adding the link to their own sites? Where? Why? No interest in the topic, but a sudden interest in the page? Google will need some old links (or wait for these to get old and see if new ones keep coming) to be able to give it a boost.
All these things, though, are relatively minor in the grand scheme of things. Figuring out whether I'm exactly right above, close to right, or completely wrong will help you, but this "Age of the Links" thing will never be the secret to ranking well. It'll help you to rank more easily - especially on time sensitive documents or historical ones - but not so much that you should really devote any substantial amount of time to it.
Finally, the most important thing that this patent does is something that the patent doesn't even describe. And it has nothing to do with ranking pages, either - using it for ranking of pages is just one little thing to add to the arsenal, since there are some useful assumptions that can be made by tracking link age.
The key here is that this is a major step toward Google's ability to index the entire web. It will never be practical to crawl each and every page on the web on a regular basis. That's pretty much a given. But, this patent gives us some serious insight into how to determine which parts of the web need to be crawled on a regular basis, and which pages don't need to be crawled much at all.
Let's use the Food I Fear page again as an example, since it's been around forever. That page has a boatload of links all over the web that have been accumulated over its decade of life. Google has a history of the page that says that it hasn't changed in any substantial way since its inception, either - sure, it moved and now has color and a graphic that it didn't have when it was first made, but essentially, it's the same.
So, why does Google need to waste resources? (Remember, our bottleneck here isn't processing speed nor storage - those things are cheap - the bottleneck is bandwidth. So spidering is more costly, in almost every way, than keeping records and accessing them). Why does Google need to crawl that page every month? In the past, it needed to do so in case the page moved, or changed, or vanished, or whatever. It had no way of knowing anything about it unless it actually hit the page.
Now, though, it's keeping a record of links to that page. So now, Google crawls a number of the pages that link to that page. Most of those pages are unchanged. A few have some new content on them. But none of those pages stopped linking to that page. With this data, it might be safe to assume that the page is still there and we don't need to crawl it. But, how can we improve our chances of making the proper assumption?
Well, we use this patent, that's how.
We crawl and we find a new link to this page. It's a new link and, thus, it drastically improves the probability that the page is still there.
We crawl and we find a bunch of new links to the page. This definitely improves the probability that it's still there, but more importantly, it increases the probability that it's been changed in some way - maybe new content. Maybe something happened in the world and the page is suddenly relevant to current events for some reason. Whatever. We should probably crawl it again and see what's up (and put it in the mixing bowl to find out why all these new links are here).
We crawl and find a link is no longer there - the linking page is there, but the link itself is gone. This is a sign that potentially means that the page is missing. True, it could mean that the guy just decided to remove the link. We don't know yet, but it decreases the probability that the page is still there. If we find more pages where the link is gone, we can modify our probability factor accordingly.
So, at some point, the value of the age of links, the frequency of update of the pages that the links are on, and the value of the historical data of those links and the document itself give you enough data to generate a number: a percentage chance of whether the page is still there or not and whether it has been updated or not. Figure out those percentages to an obscenely large decimal value and sort it. Start with the ones that are most likely to have changed and start working down the list. When the clock runs out (i.e. today is the day we rebuild the index) we've gotten through the high percentage pages. The low percentage pages, statistically, should be identical to the way they were before, and even if they've changed a bit, they should at least still be there. So, we can just leave those in the index this month and not worry about crawling them - after all, they should be there.
Run some scattered random samplings each month to see if your "this page should be the same" calculations are accurate among that sampling and, if it's within a reasonable margin of error, you don't need to tweak the aging algo. If it's beyond that margin, you need to tweak it some.
Lather, rinse, repeat.
Ranking pages is the minor revelation from this patent. The real power comes on the backend.
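For anyone who thinks better in code, the crawl-scheduling idea in this post could look something like the Python sketch below. To be clear, none of this is in the patent application: the function names, the `budget` and `margin` values, and the assumption that a change probability already exists for every page are all made up for illustration.

```python
import random

def plan_crawl(change_probability, budget, sample_size=0, rng=None):
    """change_probability: {url: estimated chance the page changed or vanished}.

    Returns (to_crawl, spot_checks): the `budget` pages most likely to have
    changed, plus a random sample of the skipped pages to check the estimates.
    """
    rng = rng or random.Random()
    ranked = sorted(change_probability, key=change_probability.get, reverse=True)
    to_crawl = ranked[:budget]          # work down the list until the clock runs out
    skipped = ranked[budget:]           # left in the index as-is this month
    spot_checks = rng.sample(skipped, min(sample_size, len(skipped)))
    return to_crawl, spot_checks

def needs_retuning(spot_check_changed, margin=0.05):
    """spot_check_changed: one boolean per spot-checked page, True if the
    page had in fact changed. True means the aging estimate needs a tweak."""
    if not spot_check_changed:
        return False
    error_rate = sum(spot_check_changed) / len(spot_check_changed)
    return error_rate > margin
```

The sort-then-truncate step is the "start with the ones most likely to have changed" part, and `needs_retuning` is the random sampling check on whether the skipped pages really stayed the same.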
Posted 07 April 2005 - 09:33 PM
There are commonly two parts to an anchor tag. One of them is the "destination anchor" or what we often refer to as a link. The other part is technically known as the "source anchor." The source anchor is the visible part - the text that we see when a link is in front of us. The destination anchor is the address of the location on the web where clicking on that source anchor will bring us.
<a href="destination anchor">source anchor</a>
The "link-based criteria" section of the patent application focused upon the destination anchor, and its power to lead surfers and search engine spiders to other pages. This patent application refers to the source anchor as "anchor text," and pinpoints the value of some meaning associated with changes in either the destination, or in the anchor text that leads to it.
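For anyone who wants to see the two parts pulled apart, here is a small sketch using Python's standard html.parser module. It is only an illustration of separating the destination (the href) from the anchor text. The class and function names are mine, and nothing here is taken from the patent application.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an <a> that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_anchors(html):
    """Return a list of (destination, anchor text) pairs found in html."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.anchors
```

Running extract_anchors over two crawls of the same page and comparing the pairs would be one simple way to notice that either the destination or the anchor text has changed.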
What happens when the anchor text of a link changes? What significance does a search engine attach to that change? What happens when the change is in the destination? Does the text associated with the link change to match changes in the destination? Does it stay the same when the destination changes?
What can changes in anchor text pointing to a document tell us about that document?
Anchor Text as Part of the Document it Points to
One of the more interesting statements made in this section is that "anchor text is often considered to be part of the document to which its associated link points." This is something that we often take for granted when optimizing pages for keyword phrases - anchor text has a great deal of value in telling a search engine what the page it points to is about. The patent application seems to be confirming that assumption.
Changes in Anchor Text and Changes in Destinations
Information collected by the search engine about changes in anchor text can influence the ranking of the document it points at. It may be able to tell us if the document has been updated, or has changed in its focus. The words used don't matter as much as the fact that there has been a change.
Another alternative: if the document changes significantly and the anchor text pointing at it doesn't, then the domain the document is associated with may have changed. This could happen when a domain expires, and has a new owner.
Since anchor text could be considered part of the document to which its destination anchor points, a significant change in the domain can make queries pointing to it based upon anchor text misleading. That's a problem that a search engine doesn't want.
To solve that problem, the date that the domain changed its purpose may be calculated by looking at the date that:
a. the text of a document changes significantly, or
b. the anchor text changes significantly.
After figuring out that date, the links and possibly the anchor text from before that date could be ignored or discounted.
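As a rough illustration of "ignored or discounted," here is a sketch in Python. The Link record, the first_seen field and the weights are assumptions of mine. The application doesn't say how a link's date is stored or how large any discount would be.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Link:
    anchor_text: str
    first_seen: date  # hypothetical: when the crawler first recorded the link


def link_weight(link, purpose_changed_on, old_link_weight=0.0):
    """Return 1.0 for links seen on or after the change date, a discount otherwise.

    purpose_changed_on is the estimated date the domain changed its purpose
    (None if no such change was detected). old_link_weight=0.0 ignores older
    links entirely; a value such as 0.25 merely discounts them.
    """
    if purpose_changed_on is None or link.first_seen >= purpose_changed_on:
        return 1.0
    return old_link_weight
```

With old_link_weight left at 0.0, anchor text pointing at the expired domain's old topic stops counting toward the new owner's pages at all.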
Freshness of Anchor Text
The freshness of anchor text may also be used to score documents and could be determined by the date of appearance or change of:
a. the anchor text,
b. the link associated with the anchor text, and/or
c. the document to which the associated link points.
In the section on links, we are told that when a document is updated in a significant manner, and the links on the page don't change, it's a good indication of the freshness of the destination of the link. This is based upon an assumption that the link wouldn't change when the rest of the document does if the link is still good.
The patent makes the same assumption for anchor text.
Posted 07 April 2005 - 10:09 PM
The thing with this patent - and the other patents we've looked at over the years - is to try to determine the motives, the why do they want to do this?, and then see how that fits in with the other stuff we know.
Yep. That's one of the things I've been asking myself as I work through the application. What assumptions are being made? Why is this section important? What can it tell me about the search engine behind the patent application? And, how does it fit in with what we know through experimentation, and from the statements of others?
We do spend a lot of time asking how? We should be probing why.
Another thing that people tend to do when looking at stuff like this is to look at it in isolation.
It makes sense to look at some of the other patents, and white papers, and interviews with search engine representatives. And while looking at those, see how this might fit with them, or how it might be different.
We also know that Google's ultimate goal is to index the entire web some day - and, though it doesn't say it in the patent, I suspect that this is what Google is moving toward with this patent.
I'm enjoying the references to spam detection, to the problems with expired domains, to changes in the purposes of web sites, to "featured links of the day" and other things we see on the web. This patent does recognize some of the realities of the web, and how they can impact upon a search engine.
The how describes methods to detect these things. The why is paying attention to changes to enable the search engine to recognize the significance of changes, in some very real life contexts, to provide relevant results.
And, there's suddenly a bunch more searches going on using this term, too. So, it makes sense to match up the increased searches to the page we know is most recently updated on the subject.
Google has been showing us the most popular searches on the Google ZeitGeist for a few years now. Instead of just an interesting curiosity, a measure of the "spirit of the times," it's been showing us that it can notice what are topical subjects, at least on a big picture level. Maybe the "why" hasn't just been a way to amuse visitors to Google, but rather a development of a way to track new answers to questions where the answers change over time.
But, this patent gives us some serious insight into how to determine which parts of the web need to be crawled on a regular basis, and which pages don't need to be crawled much at all.
I've been noticing some statements in the application about ignoring some links and some documents. We've seen those types of choices being made for at least a few years - some sites getting spidered more frequently than others, some sites getting very few visits. Googlebot goes to where the action is.
Posted 08 April 2005 - 06:17 AM
The final paragraph tries to grab even more territory and again seems too wide to me.
So it's all great stuff. However it got me thinking on whether it is necessarily the right motivation for doing all this. It seems to represent an almost god-like view of the ant-hill of web pages. I think you got it right, Bill.
It will also be apparent to one of ordinary skill in the art that aspects of the invention, as described above, may be implemented in many different forms of software, firmware, and hardware in the implementations illustrated in the figures. The actual software code or specialized control hardware used to implement aspects consistent with the principles of the invention is not limiting of the present invention. Thus, the operation and behavior of the aspects were described without reference to the specific software code--it being understood that one of ordinary skill in the art would be able to design software and control hardware to implement the aspects based on the description herein.
However relevance is a very personal thing linked to my question and what matters to my topic. The Google ZeitGeist relates to all searchers. Half of those may be teenagers who will have very different needs from my searching. So is relevance best served by seeing the mass movement. Or would it be better to try to spot that very small group of individuals who have been searching for the same thing as me and let me know how they rated things. Amazon tries that. If the world is mostly 'Long Tail' now, that seems a better way to go. How you'd compute all that, heaven only knows. However you can get the Patent before you know how to do it.
Googlebot goes to where the action is.
Posted 08 April 2005 - 10:52 PM
Or would it be better to try to spot that very small group of individuals who have been searching for the same thing as me and let me know how they rated things.
Good point. Just how closely is the patent application's search engine going to track who goes where?
Will the search engine of this patent application watch the traffic of every site carefully? Will it try to uncover meaningfulness in drops in traffic, in patterns involving traffic following advertising to a site, and in changing or repeating traffic patterns?
The application does tell us that historical data about traffic to and from a document over time plays a part in this temporal level of assigning a score to a document.
But your question assumes that some measure of relevance plays a part in this scoring based upon age data. I'm not so sure. This patent application seems like a level of analysis that may be largely independent of the determination of relevance. It might be more like smelling that milk has gone bad before you get the chance to taste it - an independent interpretation of data that contains meaning in itself.
Or is it?
This section on traffic is a smaller one than some of the ones that have gone before. It may partially explain why a search engine like Google might want to purchase a company that has employees with an expertise in web analytics. But, I'm not sure that they need that level of expertise for this section alone.
Obviously, as the patent tells us, a large drop off in traffic to a site may mean that a document is no longer being updated or may have been replaced by another document.
But, is this traffic brought to the site by a search engine, or all traffic? I don't know how they would track "all traffic" unless they used some means independent of the search engine. What if the search engine had some type of toolbar that could be used to gauge traffic to web pages? A9 using the Alexa toolbar? Google using its toolbar?
How many people have an Alexa or a Google Toolbar installed on their browser? What percentage of the online public?
I do find myself surprised to still see this on the Google Toolbar download page:
Microsoft Windows 95/98/ME/NT/2000/XP
Microsoft Internet Explorer 5.0+
Pop-up blocker requires Internet Explorer 5.5+
Aren't there a couple of Google employees working on the Firefox browser these days? And their toolbar doesn't work with Firefox?
Identifying Patterns in Traffic
A dropoff in traffic to a site may mean it's stale. But, it may also mean that the site's traffic rises and falls according to some type of pattern. For example, baseball blogs may be more popular during the baseball season. Christmas sites may enjoy more traffic towards the end of every year. If a pattern is identified by the search engine, it may adjust that time-based score.
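One simple way to picture that adjustment is a check that only treats a sharp drop as a staleness signal if the same month a year earlier wasn't just as quiet. This is only a sketch under my own assumptions: monthly visit counts, a 12 month season, and a 50% drop threshold. None of those details appear in the application.

```python
def looks_stale(monthly_visits, drop_threshold=0.5, season_length=12):
    """Guess whether a traffic drop suggests a stale site.

    monthly_visits: visit counts, oldest first, most recent month last.
    A drop counts only if the latest month fell below drop_threshold times
    the month before it, and it is excused as seasonal when the latest month
    still holds up against the same month one season earlier.
    """
    if len(monthly_visits) < 2:
        return False  # not enough history to see a drop
    current, previous = monthly_visits[-1], monthly_visits[-2]
    if current >= previous * drop_threshold:
        return False  # no large month-over-month drop
    if len(monthly_visits) > season_length:
        same_month_last_season = monthly_visits[-1 - season_length]
        if current >= same_month_last_season * drop_threshold:
            return False  # the dip repeats last season's pattern
    return True
```

A baseball blog that falls from 800 visits in October to 200 in November, just as it did the year before, would not be flagged. A site that sat at 900 visits a month for over a year and then fell to 200 would be.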
Here are the factors that the application may look at when considering advertising traffic:
(1) the extent to and rate at which advertisements are presented or updated by a given document over time;
(2) the quality of the advertisers (e.g., a document whose advertisements refer/link to documents known to search engine 125 over time to have relatively high traffic and trust, such as amazon.com, may be given relatively more weight than those documents whose advertisements refer to low traffic/untrustworthy documents, such as a pornographic site); and
(3) the extent to which the advertisements generate user traffic to the documents to which they relate (e.g., their click-through rate).
Questions about this Advertising Section:
1. Interesting that Amazon.com is named as a site that has a relatively high level of trust. Is this measure of trust a subjective, human defined thing? How does a commercial site earn trust from a search engine? Why should a commercial site gain this type of benefit? Why shouldn't one?
2. Is the advertising mentioned in this section an advertising program possibly owned by a search engine, or will the search engine carefully monitor as many different advertising programs as they can? Is there a potential problem with that type of monitoring if it involves everyone's advertising?
3. If it is only the search engine's advertising program, does this mean that using a paid advertising program can benefit a site's rankings? Is the "objectiveness" of the search engine's unpaid rankings called into question by this section on advertising traffic?
Posted 10 April 2005 - 08:03 AM
I know you say we shouldn't draw conclusions however I am drawn to these.
Say a new domain is registered and the site is launched. The site is linked to from an established site that Google recognizes, and Google therefore indexes the new site. However, it does not have enough links to show up in the search results for competitive keywords, so "naturally" you create links to as many sites as possible to gain some prominence on other sites in the serps for relevant search terms.
Too many links too quickly, and the site goes to the "sandbox" or the "lack of historical data box." The site can't get the CTR as it is not coming up in the serps, and the sites that are linking to it come up on pages 2 and 3. Long-established major competitors come up on pages 1 and 2.
How do you increase traffic and CTR? AdWords, though AdWords are not meant to help natural results. Is the CTR for AdWords included in this fingerprint of a site? With this method I'm not sure which I would prefer. Could anyone shed some light on this?
Posted 10 April 2005 - 11:44 AM
I've been trying to sneak some sleep in now and then. That's why there are still some sections of the patent left that I haven't touched upon yet.
But, I'll try to address your questions, here...
The Patent is the Search Engine's Roadmap
It may be possible to consider using this patent application as our roadmap of things to do, and things not to do when working on a site, but I would caution anyone from taking it completely as gospel. It's not a set of guidelines from the search engine on things for us to do.
The patent seems like a pretty comprehensive list of factors that may go into how a search engine can use historic data to give rankings to part of a page, or a full page, or a site, or a group of sites (subdomains joined together or sites associated in some other manner). But, we have to keep in mind the nature of patents and patent applications.
In many ways, this patent application is a roadmap that the search engine is following.
Some of the factors mentioned in the application are considerations that a search engine may already be taking into account in rankings.
Some of them are potentially alternatives amongst possible choices of things to follow, so that if they choose to do one thing, they may not do another.
Some of them are things that they may not have figured out how to do in a timely enough manner, but wanted to exclude other search engines from using by including them in the patent application.
By drawing out this roadmap, they may be using parts of it, and they may be working on how to implement other parts. Some of them may be feasible from a technical stance, but maybe not from a legal one. Or, they may come up with better ideas that aren't captured in the patent application.
Reading through the application, it sort of feels like the results of what would happen if you took a handful of very bright and knowledgeable people, and sent them off together on a weeklong retreat, and asked them to brainstorm about all of the ways that historical data could be used by a search engine.
Some Questions based upon the Application
I seem to be building a big list of questions rather than answers, when it comes to this patent application. I don't think that's a bad thing. If we know what the potential issues are behind it, we have a much better chance of learning the answers to them as we keep our eyes open and try to figure out what it means to us.
Here are some other questions I have based upon your questions, and how some of the different factors of this patent application might work:
Can a site be included in the index before there is enough historical data collected to have these time based factors included in the rankings of the site and its pages?
Is there a certain number of links to the site necessary before it will be included?
If a site has fewer links, but it has some, will the time be different if those links are from pages that are authoritative, trusted, or popular?
If the topic that the page covers is one that is timely, and very popular, will the number of links, and the rate of growth of new links be calculated differently than for a less popular and timely topic?
Is the page updated on a regular basis, and if it is, are the parts of it that are changed significant or insignificant?
What are other pages like that are typically returned in queries that this site might be returned for? Are they older and established? Are they updated regularly? What parts of those pages change?
There are a great number of factors listed in the patent application. And we have no idea how they interact with each other, whether some have been implemented or not, and what value they might have in comparison to each other.
Older Considerations to Keep in Mind
Other ranking considerations most probably haven't gone away. We still need to keep those in mind, and address them. For instance, using intelligently crafted page titles is probably a good practice to continue to use. With the possibility that some type of clickthrough effect may (or may not) be in place, it helps to make sure that the page title is something that would be persuasive and attract clicks when it shows up in search results for the queries you are aiming at. But that really isn't a change if you've been doing it all along.
Using some type of linking structure that is easy for people to follow and search engines to index also continues to make sense. A good hierarchical structure can help a search engine index a set of pages because it can get some meaning from that structure. Using anchor text that helps define those parts, using words that are meaningful to visitors (keywords or trigger words) are ways to get a search engine to index your pages well.
Making sure that pages can be indexed without potential impediments from JavaScript menus, session IDs, directory structures that may be too deep, images for links instead of words, and not enough content on a page are still primary considerations.
While we work out how historical data could possibly affect a site, we need to make sure that we address all of the other potential problems that may be keeping the pages of a site from being indexed the way that it should be.
The things I'm mentioning are on-page factors for ranking, but off page factors are also important, and looking for links from other pages is still important, and needs to be addressed, too.
What are the Assumptions behind the Application?
What this patent application potentially adds to getting pages listed in the search engine index is a set of data that covers information about a page (actually, document in its wider definition) based upon changes over time in quite a few areas.
The patent seems to be trying to find ways to measure data that its inventors like on pages, and that they think will bring pages forward that searchers will like. So, given that we've done all of those things we would do anyway to make sure that a page will rank well, like addressing on-page factors and getting some links to the site, what other things do we do to get people interested in the pages? Here are some off the top of my head, and there may be a good number of others to think about:
1. Make sure that if the site covers information that changes over time, that the pages are updated to reflect those changes. Make other changes that make the pages appear to be tended to, such as updating copyright notices, fixing links that go bad, changing the descriptions in anchor text and title attributes when a site linked to changes its focus.
2. Include pages that are timely and topical, and cover news within the area the page covers that people will be interested in. This may mean press releases, a blog with timely news and opinions, a news page that describes the current state of the industry, a specials page that shows attention to shopping trends, and things that people are searching for, and so on.
3. Build content that people will be interested in linking to, bookmarking, referring friends to, and so on. This means that you should make interesting and new content fairly visible. If you have a newsletter for your site, include an archive of old newsletters, and link to some of the recent editions from the front page of the site, possibly. Or if you write articles on a weekly or monthly basis, include links to them from a front page, or a main page, and update those links. Include keywords or trigger words in those links, and only show a handful of titles from the latest on that main page.
If a site gets a lot of links over a fairly short period of time, but a look at the site indicates fresh material that covers topical information that is also updated on a regular basis, then it will quite possibly be treated better by a search engine than a site that doesn't change at all.
The patent application also mentions advertising links as one of the factors that it may consider. Is it something they are doing now, or is it part of the roadmap for the future? I don't know, and I can't tell if it is something that they will ever implement. But, it could be. More importantly, if advertising gets traffic to your site, and increases the odds that some people will like your pages enough to link to them, then it could have a positive impact anyway.
We aren't sure yet how these different factors will influence rankings, but we do have more things to think about now that we have the patent application.
Posted 10 April 2005 - 04:16 PM