
Leading Community for Usability, Search Engine Marketing,
Social Networking, Site Planning & Web Site Development, Since 1998



Of Sandboxes and Toolbars: Google's New Patent Application


51 replies to this topic

#1 bragadocchio

bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 31 March 2005 - 10:00 PM

There's a new patent application from Google:

Information retrieval based on historical data

I've seen a couple of forum threads and blog posts about it (including this one started by msgraph, who seems to have been the first to spot the application - nice going), and thought that it would be a good idea to break down the patent step-by-step and see what lurks underneath all of the legal language.

In a few places, it's been called an explanation for Google's Sandbox - a place where new sites go instead of gaining page rank, and being able to rank well in Google's results. Mentions of the use of Google's toolbar and the gathering of information about a site also factor into some of the discussions I've seen.

With all of that press, it pays to take a closer look. I'm not sure that I have the time to go through the whole thing all in one sitting, but please feel free to jump in and help me dissect this patent.

#2 bragadocchio

bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 31 March 2005 - 10:34 PM

One of the first questions I had when I heard about this patent application was "how long has this been around?"

I had done some searching of the patent and patent application databases not too long ago using the name of one of the Authors of this patent. But I didn't come across this one.

So, when was it filed and made part of the application database? We see two dates towards the top of the patent application. One is March 31, 2005. There's a "filed" date of December 31, 2003.

I'm guessing that the patent application was placed out for public view today, rather than on that filing date. Why is the date important? We find an answer to that on the pages of the US Patent and Trademark Office, in their section on General Information Concerning Patents.

It is important when a patent application is filed. From that Patent Office page we see:

If the invention has been described in a printed publication anywhere, or has been in public use or on sale in this country more than one year before the date on which an application for patent is filed in this country, a patent cannot be obtained. In this connection it is immaterial when the invention was made, or whether the printed publication or public use was by the inventor himself/herself or by someone else.


So, if an invention is in public use more than a year before a patent is applied for, a patent upon it "cannot be obtained," even if that use was by the inventor.

So, why the two dates? Well, a patent application isn't published as soon as it is filed.

From the same page:

On filing of a plant or utility application on or after November 29, 2000, an applicant may request that the application not be published, but only if the invention has not been and will not be the subject of an application filed in a foreign country that requires publication 18 months after filing (or earlier claimed priority date) or under the Patent Cooperation Treaty. Publication occurs after the expiration of an 18-month period following the earliest effective filing date or priority date claimed by an application. Following publication, the application for patent is no longer held in confidence by the Office and any member of the public may request access to the entire file history of the application.


So, while a patent application can be filed, it may not need to be published immediately. In this instance, we have a period of fifteen months from the date of filing to the time of publication.

So, it isn't a new application. Just one that has been kept quiet for a while. For the application to become an actual patent, the invention it describes shouldn't have been in use more than a year before it was filed, even by its inventor. So, that date would seem to be December 31, 2002.

#3 rcjordan

rcjordan

    Gravity Master Member

  • Members
  • 189 posts

Posted 31 March 2005 - 10:54 PM

We're looking at a document which largely confirms many of our old assumptions about labeling and subsequent filtering via "fingerprints." None of them are earth-shattering or even particularly lethal individually but they do have a certain common-sense element to them.

And it covers much more than the sandbox. Here's the "age of domain" section:
================================
[0097] According to an implementation consistent with the principles of the invention, information relating to a domain associated with a document may be used to generate (or alter) a score associated with the document. For example, search engine 125 may monitor information relating to how a document is hosted within a computer network (e.g., the Internet, an intranet or other network or database of documents) and use this information to score the document.

[0098] Individuals who attempt to deceive (spam) search engines often use throwaway or "doorway" domains and attempt to obtain as much traffic as possible before being caught. Information regarding the legitimacy of the domains may be used by search engine 125 when scoring the documents associated with these domains.

[0099] Certain signals may be used to distinguish between illegitimate and legitimate domains. For example, domains can be renewed up to a period of 10 years. Valuable (legitimate) domains are often paid for several years in advance, while doorway (illegitimate) domains rarely are used for more than a year. Therefore, the date when a domain expires in the future can be used as a factor in predicting the legitimacy of a domain and, thus, the documents associated therewith.

[0100] Also, or alternatively, the domain name server (DNS) record for a domain may be monitored to predict whether a domain is legitimate. The DNS record contains details of who registered the domain, administrative and technical addresses, and the addresses of name servers (i.e., servers that resolve the domain name into an IP address). By analyzing this data over time for a domain, illegitimate domains may be identified. For instance, search engine 125 may monitor whether physically correct address information exists over a period of time, whether contact information for the domain changes relatively often, whether there is a relatively high number of changes between different name servers and hosting companies, etc. In one implementation, a list of known-bad contact information, name servers, and/or IP addresses may be identified, stored, and used in predicting the legitimacy of a domain and, thus, the documents associated therewith.

[0101] Also, or alternatively, the age, or other information, regarding a name server associated with a domain may be used to predict the legitimacy of the domain. A "good" name server may have a mix of different domains from different registrars and have a history of hosting those domains, while a "bad" name server might host mainly pornography or doorway domains, domains with commercial words (a common indicator of spam), or primarily bulk domains from a single registrar, or might be brand new. The newness of a name server might not automatically be a negative factor in determining the legitimacy of the associated domain, but in combination with other factors, such as ones described herein, it could be.
======================================
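
To make the signals in [0099] through [0101] a little more concrete, here is a rough Python sketch of how they might be combined. It is only an illustration: the `DomainHistory` fields, the weights, and the cutoffs are all assumptions, since the application names the signals but gives no numbers.

```python
from dataclasses import dataclass


@dataclass
class DomainHistory:
    """Hypothetical record of what a crawler might know about a domain."""
    years_until_expiry: float   # how far ahead the registration is paid
    nameserver_changes: int     # name server / host changes observed
    contact_changes: int        # registrant contact changes observed
    on_known_bad_list: bool     # contact, name server, or IP on a bad list


def legitimacy_score(d: DomainHistory) -> float:
    """Return a toy score in [0, 1]; higher means 'looks more legitimate'.

    All weights and cutoffs are invented for illustration.
    """
    if d.on_known_bad_list:
        return 0.0
    score = 0.5
    # Registrations max out at 10 years; paying further ahead counts in favor.
    score += 0.4 * min(d.years_until_expiry, 10.0) / 10.0
    # Frequent churn in hosting or contact details counts against.
    churn = d.nameserver_changes + d.contact_changes
    score -= 0.1 * min(churn, 5)
    return max(0.0, min(1.0, score))
```

Under these made-up weights, a domain paid up for ten years with a quiet history scores well above a one-year registration that keeps hopping between name servers, which is the general direction the quoted paragraphs point in.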

Some SEO groups I know have been conjecturing for years about how they'd filter if they ran a search engine. Some have developed "spammer profiles" that are extraordinarily close to the items in the patent.

#4 bragadocchio

bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 31 March 2005 - 11:04 PM

The first nine sections of the patent description take a look at the present state of the web, of search engines, and very generally how search engines work. Here's part of that description:

Ideally, a search engine, in response to a given user's search query, will provide the user with the most relevant results. One category of search engines identifies relevant documents based on a comparison of the search query terms to the words contained in the documents. Another category of search engines identifies relevant documents using factors other than, or in addition to, the presence of the search query terms in the documents. One such search engine uses information associated with links to or from the documents to determine the relative importance of the documents.


That section also names two problems that can have a negative effect upon the relevance of search engine results. One is "spamming techniques" that artificially "inflate" the rankings of sites. Another is "stale" sites that are ranked higher than fresher sites that have been updated more recently and "contain more recent data."

From this introduction, it appears that this patent application is intended to address people spamming search results, and make it easier for newer sites to rank well against older, staler sites.

That sort of seems to go against the concept of a "Sandbox" effect, where newer sites seem to be penalized and unable to rank well in Google. Or does it? We probably need to delve deeper.

#5 bragadocchio

bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 31 March 2005 - 11:52 PM

Excellent points, rcjordan.

I also noticed that it's a good fit with one of the Google patents on duplicate pages that deals with fingerprints, and with looking at snippets of information to determine whether the content on multiple pages is duplicated.

Detecting duplicate and near-duplicate files

At one point in that patent on duplicate content, it lists some factors that it might use to determine which duplicate page to use, including age of the document:

In response to the detected duplicate documents, the present invention may also function to eliminate duplicate documents (e.g., keeping the one with best PageRank, with best trust of host, that is the most recent*) Alternatively, the present invention may function to generate clusters of near-duplicate documents, in which a transitive property is assumed (i.e., if document A is a near-duplicate of document B, and document B is a near-duplicate of document C, then document A is considered a near-duplicate of document C). Each document may have an identifier for identifying a cluster with which it is associated. In this alternative, in response to a search query, if two candidate result documents belong to the same cluster and if the two candidate result documents match the query equally well (e.g., have the same title and/or snippet) if both appear in the same group of results (e.g., first page), only the one deemed more likely to be relevant (e.g., by virtue of a high PageRank, being more recent*, etc.) is returned.


* My emphasis.
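
The transitive clustering described in that passage can be sketched in a few lines of Python. This is an illustration under assumptions, not the patent's method: the function name and the union-find approach are choices made here, and the input is assumed to be a list of document pairs that some other step has already judged to be near-duplicates.

```python
def cluster_near_duplicates(pairs):
    """Group documents into clusters from pairwise near-duplicate matches.

    `pairs` is an iterable of (doc_a, doc_b) tuples. Transitivity is assumed:
    if A~B and B~C, then A, B, and C end up in one cluster.
    """
    parent = {}

    def find(x):
        # Follow parent pointers to the cluster root, shortening the path.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for doc in parent:
        clusters.setdefault(find(doc), set()).add(doc)
    return list(clusters.values())
```

Given the pairs (A, B) and (B, C), documents A and C land in the same cluster even though they were never compared directly. Picking which member of a cluster to show (highest PageRank, most recent, and so on) would be a separate step.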

One thing that bothered me about the use of age to determine which duplicate, and near duplicate pages, to return and which to filter out is that any document on the web can be saved and have a recent time stamp, even if it has been on the web for years.

After this newer patent, we may have a better sense of how Google determines the age of a page. There are other factors listed in the application which describe ways in which Google can do that. (Monika Henzinger is a co-inventor of both patents, which may account for some similarities.)

The sections on determining the age of a page are an important part of this newer patent application. It's probably worth looking at those closely, and trying to translate them from the legal language they are presently couched in.

#6 projectphp

projectphp

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3934 posts
  • Twitter:motherwell
  • Facebook:http://www.facebook.com/mmotherwell

Posted 01 April 2005 - 12:02 AM

This is quite interesting, but not really very informative in a lot of places. Are changes good, bad, indifferent? Doesn't say in many places. Except for:

20. The method of claim 19, wherein the scoring the document includes: determining whether stale documents are considered favorable for a search query when the document is determined to be stale, and scoring the document based, at least in part, on whether stale documents are considered favorable for the search query when the document is determined to be stale.

22. The method of claim 1, wherein the one or more types of history data includes information relating to behavior of links over time; and wherein the generating a score includes: determining behavior of links associated with the document, and scoring the document based, at least in part, on the behavior of links associated with the document.

Interesting. So stale will be determined, and dropped links would matter, as would new links. Might make things more timely, without the overhead of PageRank, and with perhaps different areas given different stale values. After all, stale as it relates to the works of van Gogh is different to stale as it relates to SEO!!

54. A method for ranking a linked document, comprising: determining an age of linkage data associated with the linked document; and ranking the linked document based on a decaying function of the age of the linkage data.

LMAO!! Now not only do links "bleed" PageRank, but they now also "decay". I wonder if they bleed decayed PageRank :)
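
For anyone wondering what a "decaying function of the age of the linkage data" might look like, here is a minimal Python sketch. Claim 54 doesn't say which function is used, so the exponential form, the 180-day half-life, and both function names are assumptions made purely for illustration.

```python
def decayed_link_weight(age_days: float, half_life_days: float = 180.0) -> float:
    """Weight of a link that was first seen `age_days` ago.

    Exponential decay is an assumption; claim 54 only says
    "a decaying function".
    """
    return 0.5 ** (age_days / half_life_days)


def decayed_link_score(link_ages_days) -> float:
    """Sum the decayed weights of all links pointing at a document."""
    return sum(decayed_link_weight(age) for age in link_ages_days)
```

With this choice, a brand-new link counts in full, a link one half-life old counts half, and a link two half-lives old counts a quarter, so a page whose links are all old slowly loses ground unless new links keep arriving.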

All in all, some interesting ideas, and again, very hard to see a way to manipulate!! All I can think of would be rotating links with every Google crawl to keep links "fresh". Anyone else think of any black hat uses for all this ;)??

[0088] According to an implementation consistent with the principles of the invention, information relating to traffic associated with a document over time may be used to generate (or alter) a score associated with the document. For example, search engine 125 may monitor the time-varying characteristics of traffic to, or other "use" of, a document by one or more users. A large reduction in traffic may indicate that a document may be stale (e.g., no longer be updated or may be superseded by another document).

Toolbar usage.. Interesting...

#7 bragadocchio

bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 01 April 2005 - 12:26 AM

Good stuff, projectphp. (the link decay factor seems like an interesting approach.)

I'm plodding along, and you're asking some great questions. I hope that I can uncover some answers by being slow and methodical. :)

Assigning Age Rank

How do we tell the age of a document, and determine whether or not it is stale? What types of things would be used to give a score to a document based upon that age?

1. Information is gathered from a couple of different sources about the age of a document.

2. Information is gathered from a few different sources about the age of links leading to and from that document.

We'll get to those sources further along. But first...

Defining a Document

Before we look too deeply at this patent, and determine whether it has an impact on the ranking of web pages based upon the age of those pages, we have to get something else out of the way.

One of the important aspects of this patent is that a "document" isn't necessarily just a web page. A document could be a web page, or it could be "an e-mail, a web site, a file, a combination of files, one or more files with embedded links to other files, a news group posting, a blog, a web advertisement, etc."

So, the application is looking at more than just web pages. It can look at parts of pages, or even collections of pages.

The patent also notes at this point that documents can have "forward" links leading from them to other documents, and "back" links leading to them.

So, why is it important to define "documents" differently than pages? How can that make a difference?

#8 bragadocchio

bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 01 April 2005 - 12:37 AM

Looking back at the patent on duplicate files I linked to above, it also uses a definition of the word "documents" that doesn't just apply to individual web pages:

In the following, the term "document(s)" should be broadly interpreted and may include content such as Web pages, text files, multimedia files, object features, link structure, etc..


Is that a difference that matters? I don't know yet.

#9 bragadocchio

bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 01 April 2005 - 01:40 AM

Historical data involving a document can influence ranking scores. Here is one of the things that can make a difference:

document inception date

This can be determined a number of different ways (maybe based upon what type of document it is, or by what implementation of the application is being used):

  • When first crawled by the search engine
  • When first submitted to the search engine
  • When a link to the document is first discovered
  • Domain registration date
  • When first referenced in another document
  • When a document first reaches a certain number of pages
  • By the time stamp of the document on the server it is hosted upon.

The application tells us that under a link-based ranking system not using age-based information, a document with fewer links to and from it may rank lower than a document with more links to and from it.

But, if the document with fewer links can be determined to be newer, based upon the document inception date, it might just rank higher than an older document with more links because it has a higher rate of growth. But too many links, coming too quickly to the newer document, may also be a sign that some type of spamming is happening.

So, how is that rate determined, and how much does it influence the overall ranking of a page?

This formula is given as one way of determining that:

H=L/log(F+2), 

where H may refer to the history-adjusted link score, L may refer to the link score given to the document, which can be derived using any known link scoring technique (e.g., the scoring technique described in U.S. Pat. No. 6,285,999) that assigns a score to a document based on links to/from the document, and F may refer to elapsed time measured from the inception date associated with the document (or a window within this period).
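
As a quick illustration of how that formula behaves, here is a small Python sketch. The function name is made up here, and the use of the natural logarithm and of months as the unit for F are assumptions; the application specifies neither.

```python
import math


def history_adjusted_link_score(link_score: float, elapsed: float) -> float:
    """H = L / log(F + 2).

    `elapsed` (F) is the time since the document's inception date. The unit
    (months in the examples) and the natural log are assumptions.
    """
    if elapsed < 0:
        raise ValueError("elapsed time cannot be negative")
    return link_score / math.log(elapsed + 2)
```

With the same link score of 10, a document that is one month old comes out at roughly 9.1, while one that is 24 months old comes out at roughly 3.1. The "+ 2" keeps the logarithm positive even when F is zero, so a brand-new document doesn't cause a division by zero.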


The patent further refines this formula by negating some of the difference between the ages of the documents, in a recognition that some "older documents may be more favorable than newer ones" and that some sets of results can be fairly mature. The scores of documents can be influenced (positively or negatively) by the difference between the document's age, and the average age of documents resulting from a query.

So, a fairly new site that appears amongst a set of results that are, on average, fairly old may find itself negatively influenced by that difference in age.

There are, however, a number of other ways to assign a score based upon age, which can influence the ranking of a site. The patent goes into those in more detail. And I will, too.

Tomorrow...

#10 Michael_Martinez

Michael_Martinez

    Time Traveler Member

  • 1000 Post Club
  • 1354 posts

Posted 01 April 2005 - 02:27 AM

Back in February I wrote a paper, "On the Googleness of Being", in which I shared some observations and guesswork about this TimeRank factor (that is what I called it).

If it's okay, I'll post the link here (I posted it in the Spider-Food forum):

http://forums.spider...?showtopic=2767

This is a hypothetical position, not the official Michael Martinez interpretation of Google. But it comes close to stating some of the principles I have been working with for a long time.

The patent doesn't really tell us whether Google is doing this stuff now, but it does make it sound like they have been tinkering under the hood with these ideas -- especially given the search engine's behavior over the past year.

#11 bragadocchio

bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 01 April 2005 - 07:59 AM

I remember reading your post describing the TimeRank factor last month, and thinking that it was an idea we should definitely keep in mind. :)

I'm a little torn between moving forward with a slow, plodding look at the patent, or rereading your posts first, and then going forward. I might try to keep at the patent before jumping off to seeing how your interpretation there matches up.

But I like the idea of seeing how what you wrote matches up with what Google has now released that seems to indicate that they are using age of documents as a consideration in ranking pages. If anyone else wants to bring this discussion that way, I'd say go for it.

#12 Michael_Martinez

Michael_Martinez

    Time Traveler Member

  • 1000 Post Club
  • 1354 posts

Posted 01 April 2005 - 11:00 AM

Well, their patent covers a lot of possibilities. I only covered a few. There remains no clear confirmation of any of my suggestions, but it does seem like Google is moving in the directions I have described.

I'm almost tempted to describe the patent as an April Fool's Joke, to be honest. It is so exhaustive I don't see how they could possibly seriously attempt all that stuff. But then, would the patent office really appreciate that? Would it be the first time someone patented nonsense as a joke?

#13 DianeV

DianeV

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 7213 posts

Posted 01 April 2005 - 02:08 PM

Well, it's clear enough that it's an if/else scenario, although that's probably too obvious to state.

#14 bearmugs

bearmugs

    Mach 1 Member

  • Members
  • 261 posts

Posted 01 April 2005 - 04:53 PM

A quick note to projectphp

I like that new avatar. Is it from down under?

John AH!

#15 bragadocchio

bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 01 April 2005 - 09:10 PM

Web Pages Change, Part 1

You have a web site. It ranks well in Google, and has for years, and you are afraid of changing anything. But, you think that if you make some changes, you might get more conversions on your page.

If you update the page, will this historical data measuring make a difference?

You have a blog that you update almost everyday. You go on vacation for two weeks, and then have a family emergency that keeps you from your web site for another two weeks. Has the failure to update your site in a month influenced how your page ranks in Google?


Content Updates/Changes

The application recognizes that pages change. Some of them change more rapidly than others. How does that fit into Google's ranking of pages?

We are given another mathematical formula in this section.

U=f(UF, UA)

An "Update score" (U) is calculated using frequency of change, and amount of change.

An "update frequency score" (UF) may be used to calculate how often a document (or page) changes over time. It could be determined by the average time between updates or the amount of updates over a period of time.

An "Update amount score" (UA) represents how much a document (or page) has changed over time. The update amount score looks at a number of possible changes, and gives different weights to different kinds of changes.

Kinds of UA updates:

  • The number of "new" or unique pages associated with a document over a period of time.
  • The ratio of the number of new or unique pages associated with a document over a period of time versus the total number of pages associated with that document.
  • The amount that the document is updated over one or more periods of time (e.g., n% of a document's visible content may change over a period t (e.g., last month)), which might be an average value.
  • The amount that the document (or page) has changed in one or more periods of time (e.g., within the last x days).

Weights of UA updates:

Unimportant if updated/changed:

  • JavaScript,
  • comments,
  • advertisements,
  • navigational elements,
  • boilerplate material, or
  • date/time tags.

These could be given little weight or even ignored altogether when determining UA.


Important if updated/changed (e.g., more often, more recently, more extensively, etc.):

  • title, or
  • anchor text associated with the forward links.

These could have a much bigger impact when determining UA.
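
Here is one way the pieces of U=f(UF, UA) could fit together, sketched in Python. Everything numeric is an assumption: the application names the kinds of changes that matter more or less but gives no weights, and it doesn't say what the function f is. The region names, `REGION_WEIGHTS`, and the weighted sum used for f are all hypothetical.

```python
# Hypothetical weights per page region; the application names the regions
# but gives no numbers.
REGION_WEIGHTS = {
    "title": 3.0,
    "anchor_text": 3.0,
    "body": 1.0,
    "navigation": 0.1,
    "boilerplate": 0.1,
    "advertisements": 0.0,
    "javascript": 0.0,
    "comments": 0.0,
    "date_time_tags": 0.0,
}


def update_amount(changed_fraction_by_region: dict) -> float:
    """UA: weighted sum of how much of each region changed (0.0 to 1.0)."""
    return sum(
        REGION_WEIGHTS.get(region, 1.0) * fraction
        for region, fraction in changed_fraction_by_region.items()
    )


def update_frequency(update_times_days: list) -> float:
    """UF: updates per day, from sorted timestamps measured in days."""
    if len(update_times_days) < 2:
        return 0.0
    span = update_times_days[-1] - update_times_days[0]
    return (len(update_times_days) - 1) / span if span > 0 else 0.0


def update_score(uf: float, ua: float, w_f: float = 0.5, w_a: float = 0.5) -> float:
    """U = f(UF, UA); a weighted sum is one possible choice of f."""
    return w_f * uf + w_a * ua
```

With weights like these, a page whose ads and scripts rotate daily gets an update amount of zero, while a rewritten title counts three times as much as the same proportion of changed body text.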

#16 Michael_Martinez

Michael_Martinez

    Time Traveler Member

  • 1000 Post Club
  • 1354 posts

Posted 02 April 2005 - 01:20 AM

Google has been crawling Javascript for some time now. I first noticed retrievals of my Javascript templates in 2003 (the technology was developed or at least tested out of Stanford University). Googleguy has confirmed that they do crawl Javascript, although it's unclear how well they do it.

I think the patent implies they will look at whether ads change, too.

#17 bragadocchio

bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 02 April 2005 - 01:29 AM

Right, Michael.

It's not saying that they will or won't crawl javascript. That's really outside of the scope of this patent application.

Rather, the application is drawing a line in the sand here. It's looking at how a web page can change over time, and deciding that some aspects of change on a page, or part of a page, or web site as a whole may be considered less important than others.

So, if someone is showing ads on their site, or using JavaScript to display an RSS feed, and so on, and these things change on a regular basis, those changes are much less important than a page title change, or a change in the anchor text of a link leading from the page.

#18 AbleReach

AbleReach

    Peacekeeper Administrator

  • Admin - Top Level
  • 6370 posts

Posted 02 April 2005 - 02:02 AM

Bill and others,

Your dedication is impressive. Thank you for sifting through this data and writing out your thoughts.


Elizabeth

#19 bragadocchio

bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 02 April 2005 - 02:19 AM

The Inventors of the Subject of the Patent Application

I was going to pick up where I left off, but I noticed in another thread on another forum a valid complaint against some interpretations of this patent application.

The complaint was that the application doesn't have Google's name on it. And because of that, it's misleading to attribute many of the things listed in the patent application to Google, and to what Google is doing on the web presently.

The complaint was mainly targeted at an in-depth analysis of the patent application by Randfish, which I'm trying not to read while we pursue this much more rambling and discursive analysis. But, if you want to jump over to what he has done (yes, I've peeked), it's at:

http://www.socengine...ata-patent.html

And while this patent hasn't been legally assigned to Google, and we are making an assumption if we believe that it might be, I thought it might be a good idea to find out a little more about some of the authors of the patent application.

At this point in time and at the time of filing of the patent application, they do all appear to be Google employees. I don't know how I would feel as an employer if some of my best and brightest employees got together, and invented what appears to be a complex and fairly comprehensive application on the use of aging information to influence the rankings of documents in search engine results.

But, it doesn't have the search company's name upon it.

So, keep in mind that all of this speculation about how this aging information fits in may not describe how Google presently works. Instead, the application covers the workings of "Search Engine 125."

Here are the inventors (I'm not sure of all of their official titles at Google, but I've included what I was able to find):

Anurag Acharya
Principal Engineer, Google

One of the major forces behind Google Scholar.

Anurag Acharya Helped Google’s Scholarly Leap

Scholarly pursuits


Matt Cutts
Software Engineer, Google

A popular speaker at Search Engine Strategies and other conferences, including one for Consumer Webwatch, where he made this statement:

Because for some searches, there's some small or — not the majority, but some percentage of searches which are commercial. Say, 30, 40 percent, where somebody's looking to buy a product. And in those sorts of searches, the advertisements can be just as useful as the search results. In fact, we often order the search results by the click-through percentage.



He also gives a number of interviews on behalf of Google, like these:

Inside the Google search machine

Interview with Matt Cutts, Google


Jeffrey Dean
Distinguished engineer in the Google Systems Research Lab

If you look through a number of Google patents and patent applications, you may notice his name appears more than once as a co-inventor. Here's a paper on the physical side of Google's operation, that he co-authored:

Web Search For A Planet: The Google Cluster Architecture


Paul Haahr
Senior software engineer, search quality group. (Google)
http://www.webcom.co...aahr/about.html


Monika Henzinger
Director of Research at Google

Another name that springs up in Google patents and patent applications on a regular basis. And a few for some other search companies before Google had the good fortune to employ her.

Still Searching (After All These Years)

Google's Research Maven

Urs Hoelzle
Google Fellow

Peeking Into Google

Urs Hoelzle, Vice President of Operations, Google

Google's Management Team


Steve Lawrence
Google Senior Research Scientist

Steve Lawrence: List of publications from the DBLP Bibliography Server

The main developer of CiteSeer, presumably involved with Google Scholar, and the lead developer of Google's desktop search.

Google unveils desktop search

Karl Pfleger
Google employee

Karl Pfleger's Other Interests


Olcan Sercinoglu
Google Software Engineer

Worked on developing the cluster management system used by MapReduce.

MapReduce: Simplified Data Processing on Large Clusters

After an internship at Google, Inc., he now works full time for the California company as a software engineer.

Engineering News: School of Engineering and Applied Science at Washington University in St. Louis (Summer 2002)


Simon Tong
Research Scientist, Google Inc.

One of the folks in the Google Labs.

alicebot-general - Fw: Fw: another way cool thing at Google


ps. On preview, you're welcome Elizabeth. I'm trying to explain as much of this in as plain language as I can. I think I'm getting better at it as I go. Hopefully, I'll get even better. :)

pps. Nice Avatar. It's good to be able to see the faces behind posts.

#20 bragadocchio

bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 02 April 2005 - 09:54 PM

Pages Change, Part 2

I probably should have included one more paragraph in the previous post dealing with changes. The next section of the patent talks about comparing the rate and amount of changes over more than one period of time.

The way that paragraph is worded, it appears that the patent application may often favor sites that are updated frequently, and that show an increase in the rate and amount of change.

All of this extra monitoring creates some new challenges, like where to put all the extra data that comes with it.


The Problem of Storing Historic Data to Compare for Changes

It struck me while reading all of this information about tracking changes in documents that it would take a considerable amount of storage to house older copies of documents and allow a comparison.

While Google presently claims "8,058,044,651 web pages" on the front of the site, the Internet Archive had ten billion pages indexed in 2001, and an estimate from October of 2004 indicated 30 billion pages. The Internet Archive has copies of sites as they have changed.

So, if a search engine is tracking changes to documents over time, it needs to store information about those changes. But storing exact copies of documents could cause its database to balloon in size quickly.

And some type of strategy may be needed to keep the amount of information fairly small.


Fingerprints and Other Approaches


There was a mention of "fingerprints" in this thread from rcjordan, and the duplicate content patent I mentioned in a prior post also talks about a fingerprint strategy. That's one possible way for a Search Engine to track changes without having to keep exact copies of earlier documents for comparison.

When fingerprints are compared, the method used is to look at only a few places on the prints, and see if there is a match in those places. There isn't an attempt to overlay one print with another and try to achieve an exact match.

Rather, information about the characteristics of a print at different places on a finger is stored in a database. When looking for a match, the information about those points is compared. So, matching fingerprints doesn't call for exact matches of prints, but rather matches at a number of predetermined points on prints.
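
Here's a small Python sketch of how that "compare a few points" idea might carry over to documents. To be clear, this is my own illustration and not something taken from the patent application: I'm using a bottom-k sketch of word shingles (keep only the smallest few hashes of overlapping word groups), and the function names and the `size` and `points` parameters are all hypothetical.

```python
import hashlib

def shingles(text, size=3):
    """Overlapping groups of `size` consecutive words from the text."""
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def fingerprint(text, points=8):
    """Keep only the `points` smallest shingle hashes as the stored print."""
    hashes = sorted(
        int(hashlib.sha1(s.encode("utf-8")).hexdigest(), 16)
        for s in shingles(text)
    )
    return set(hashes[:points])

def overlap(fp_a, fp_b):
    """Fraction of sampled points that two fingerprints share (0.0 to 1.0)."""
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / max(len(fp_a), len(fp_b))

old_copy = "google has filed a new patent application on historical data"
new_copy = "google has filed a new patent application on historical data"
same = overlap(fingerprint(old_copy), fingerprint(new_copy))
```

The point of the design is that only a handful of numbers per document needs to be kept, rather than the document itself, and two copies can still be compared at those sampled points.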

The patent application addresses this need for storage capacity when "monitoring the documents for content changes," and provides a number of different ways of tracking changes without storing full copies of documents.

Representations of the documents, like the fingerprint information, can be stored and monitored for changes. Here are the strategies that the patent application mentions:

  • "Signatures" of documents may be stored instead of the entire documents.

  • Term vectors may be stored and monitored for relatively large changes.

  • A relatively small portion (e.g., a few terms) of the documents that are determined to be important or the most frequently occurring (excluding "stop words") may be stored and monitored.

  • A summary or other representation of a document may be maintained.

  • A similarity hash (which may be used to detect near-duplication of a document) may be generated for the document and monitored for changes. A change in a similarity hash may be considered to indicate a relatively large change in its associated document. (See the patent on duplicate content - in addition to looking for duplicates of documents in other places on the web, it can be used to compare newer and older copies of the same document.)

  • Other possible options may be considered and used, too.

Of course, there is the possibility that there is enough storage, and full copies of documents can be maintained and monitored. (See Google's cache, for instance.)

(I've kept the language of some of those list items exactly the same as they appear on the application because it's either not really clear what those words mean yet, or they are already fairly clear.)
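
For anyone who'd like to see what a few of those strategies could look like in practice, here's a minimal Python sketch covering a signature, a term vector, and the "few important terms" approach. The application doesn't say how any of these are computed, so the SHA-256 digest, the tiny stop word list, the cosine comparison, and the 0.5 threshold are all assumptions on my part, chosen only for illustration.

```python
import hashlib
import math
import re
from collections import Counter

# A deliberately tiny, illustrative stop word list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}

def tokens(text):
    """Lowercased words and numbers, ignoring punctuation and spacing."""
    return re.findall(r"[a-z0-9']+", text.lower())

def signature(text):
    """A short digest stored in place of the document; any word edit changes it."""
    return hashlib.sha256(" ".join(tokens(text)).encode("utf-8")).hexdigest()

def term_vector(text):
    """Counts of each term in the document, excluding stop words."""
    return Counter(t for t in tokens(text) if t not in STOP_WORDS)

def top_terms(text, n=5):
    """The n most frequently occurring terms, excluding stop words."""
    return [term for term, _ in term_vector(text).most_common(n)]

def cosine_similarity(vec_a, vec_b):
    """1.0 for identical term proportions, 0.0 for no shared terms."""
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def is_large_change(old_vec, new_vec, threshold=0.5):
    """Flag a 'relatively large' change when similarity drops below threshold."""
    return cosine_similarity(old_vec, new_vec) < threshold

old_vec = term_vector("google patent on ranking and historical data")
new_vec = term_vector("cheap flights and hotel deals")
changed = is_large_change(old_vec, new_vec)
```

A signature only tells you that something changed, while a term vector gives some sense of how much, which may be why the application lists them as separate options.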

Some Change isn't Good

While it has been implied earlier, the next section (0055) shows a recognition that for some types of queries, changes may not be a good thing. For those queries, the update score could be adjusted based upon the difference from an "average date-of-change" of the results from that query.

Here's how that might work when results from a search are returned by the search engine:


  • Each document has a last change date,

  • An average is calculated for the documents in the results,

  • The documents' ranking scores are modified positively or negatively based on the difference between the last change date and the average date-of-change for all of those documents.

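
Those three steps can be sketched in a few lines of Python. The application doesn't spell out the arithmetic, so the linear adjustment, the `weight` value, and the `prefer_fresh` switch (to flip the direction for queries where older results are better) are hypothetical choices of mine.

```python
from datetime import date

def adjust_scores(results, weight=0.01, prefer_fresh=True):
    """Shift each score by how far its last change date is from the average.

    `results` is a list of (doc_id, score, last_change_date) tuples.
    With prefer_fresh=True, documents changed more recently than the
    average gain score; with prefer_fresh=False, they lose score.
    """
    if not results:
        return {}
    ordinals = [changed.toordinal() for _, _, changed in results]
    average = sum(ordinals) / len(ordinals)  # average date-of-change, in days
    sign = 1 if prefer_fresh else -1
    return {
        doc_id: score + sign * weight * (changed.toordinal() - average)
        for doc_id, score, changed in results
    }

results = [
    ("older-page", 1.0, date(2005, 1, 1)),
    ("newer-page", 1.0, date(2005, 3, 1)),
]
fresh_scores = adjust_scores(results)
stale_scores = adjust_scores(results, prefer_fresh=False)
```

With the same starting scores, the newer page comes out ahead when freshness is preferred, and the older page comes out ahead when it isn't.
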
Summary of the Section

Some type of age rank is created for a document based upon information collected about how old that document is and how it has changed over time.

The summary of this section of the patent adds a detail about "large" documents that belong to more than one person or organization. For those, the score may be broken down into scores for smaller sections, where the content, and changes to that content, are under the control of one person or organization.

For instance, each of the blogs hosted on Blogspot might be treated differently under this age scoring system.


