Moderator Alumni Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
|
Mar 31 2005, 10:00 PM |
|
|
There's a new patent application from Google:

Information retrieval based on historical data

I've seen a couple of forum threads and blog posts about it (including this one started by msgraph, who seems to have been the first to spot the application - nice going), and thought that it would be a good idea to break down the patent step-by-step and see what lurks underneath all of the legal language.

In a few places, it's been called an explanation for Google's Sandbox - a place where new sites go instead of gaining PageRank and being able to rank well in Google's results. Mentions of the use of Google's toolbar and the gathering of information about a site also factor into some of the discussions I've seen.

With all of that press, it pays to take a closer look. I'm not sure that I have the time to go through the whole thing in one sitting, but please feel free to jump in and help me dissect this patent.
Moderator Alumni Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
|
Mar 31 2005, 10:34 PM |
|
|
One of the first questions I had when I heard about this patent application was "how long has this been around?"
I had done some searching of the patent and patent application databases not too long ago using the name of one of the authors of this patent, but I didn't come across this one. So, when was it filed and made part of the application database?

We see two dates towards the top of the patent application. One is March 31, 2005. The other is a "filed" date of December 31, 2003. I'm guessing that the application was placed out for public view today, rather than on that filing date.

Why is the date important? We find an answer on the pages of the US Patent and Trademark Office, in their section on General Information Concerning Patents. It matters when a patent application is filed. From that Patent Office page we see:

QUOTE
If the invention has been described in a printed publication anywhere, or has been in public use or on sale in this country more than one year before the date on which an application for patent is filed in this country, a patent cannot be obtained. In this connection it is immaterial when the invention was made, or whether the printed publication or public use was by the inventor himself/herself or by someone else.

So, if an invention has been in public use for more than a year before a patent application on it is filed, a patent upon it "cannot be obtained," even if that use was by the inventor of the subject matter covered by the patent.

So, why the two dates? Well, a patent applicant can request that an application not be immediately published. From the same page:

QUOTE
On filing of a plant or utility application on or after November 29, 2000, an applicant may request that the application not be published, but only if the invention has not been and will not be the subject of an application filed in a foreign country that requires publication 18 months after filing (or earlier claimed priority date) or under the Patent Cooperation Treaty.
Publication occurs after the expiration of an 18-month period following the earliest effective filing date or priority date claimed by an application. Following publication, the application for patent is no longer held in confidence by the Office and any member of the public may request access to the entire file history of the application.

So, while a patent application can be filed, it may not need to be published immediately. In this instance, we have a period of fifteen months from the date of filing to the time of publication. So, it isn't a new application, just one that has been kept quiet for a while.

For the application to become an actual patent, the invention it describes shouldn't have been in public use more than a year before it was filed, even by its inventor. So, that cutoff date would seem to be December 31, 2002.
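For anyone who wants to check the date arithmetic in this post, here's a quick sketch in Python. The two dates come from the application itself; the function names are just mine for illustration, and the "one year before filing" rule is my plain-language reading of the Patent Office text quoted above, not legal advice.

```python
from datetime import date

FILED = date(2003, 12, 31)      # "filed" date shown on the application
PUBLISHED = date(2005, 3, 31)   # date the application appeared in public


def months_between(start: date, end: date) -> int:
    """Count whole calendar months from start to end."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the last month is not complete yet
    return months


def one_year_before(d: date) -> date:
    """Same calendar day one year earlier (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year - 1)
    except ValueError:
        return d.replace(year=d.year - 1, day=28)


print(months_between(FILED, PUBLISHED))  # filing-to-publication gap in months
print(one_year_before(FILED))            # public use before this date would be a problem
```

Run against the two dates above, this gives the fifteen-month gap and the December 31, 2002 cutoff mentioned in the post.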
Moderator Alumni Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
|
Mar 31 2005, 11:52 PM |
|
|
Excellent points, rcjordan.
I also noticed that it's a good fit with one of the Google patents on duplicate pages, which deals with fingerprints and with looking at snippets of information to determine whether the content on multiple pages is duplicated:

Detecting duplicate and near-duplicate files

At one point, that patent on duplicate content lists some factors that it might use to determine which duplicate page to use, including the age of the document:

QUOTE
In response to the detected duplicate documents, the present invention may also function to eliminate duplicate documents (e.g., keeping the one with best PageRank, with best trust of host, that is the most recent*) Alternatively, the present invention may function to generate clusters of near-duplicate documents, in which a transitive property is assumed (i.e., if document A is a near-duplicate of document B, and document B is a near-duplicate of document C, then document A is considered a near-duplicate of document C). Each document may have an identifier for identifying a cluster with which it is associated. In this alternative, in response to a search query, if two candidate result documents belong to the same cluster and if the two candidate result documents match the query equally well (e.g., have the same title and/or snippet) if both appear in the same group of results (e.g., first page), only the one deemed more likely to be relevant (e.g., by virtue of a high PageRank, being more recent*, etc.) is returned.

* My emphasis.

One thing that bothered me about the use of age to determine which duplicate and near-duplicate pages to return, and which to filter out, is that any document on the web can be saved again and given a recent time stamp, even if it has been on the web for years.

After this newer patent application, we may have a better sense of how Google determines the age of a page. There are other factors listed in the application which describe ways in which Google can do that. (Monika Henzinger is a co-inventor of both patents, which may account for some similarities.)

The sections on determining the age of a page are an important part of this newer patent application. It's probably worth looking at those closely, and trying to translate them from the legal language they are presently couched in.
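To make the clustering idea in that quote a little more concrete, here's a rough sketch in Python. It is only my illustration of the transitive property (A~B and B~C puts A, B and C in one cluster) and of picking one page per cluster. The `Doc` fields, the union-find approach, and the tie-breaking order are all assumptions of mine; the patent doesn't say how any of this is implemented.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    url: str
    pagerank: float   # hypothetical score, purely illustrative
    last_seen: date   # hypothetical stand-in for "most recent"


def cluster_near_duplicates(docs, near_dup_pairs):
    """Group documents so that near-duplication is treated as transitive."""
    parent = {d.url: d.url for d in docs}

    def find(u):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for a, b in near_dup_pairs:
        parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for d in docs:
        clusters.setdefault(find(d.url), []).append(d)
    return list(clusters.values())


def pick_representative(cluster):
    """Keep one document: highest PageRank first, most recent as tie-breaker."""
    return max(cluster, key=lambda d: (d.pagerank, d.last_seen))
```

Note that the "most recent" tie-breaker is exactly where a re-saved old page with a fresh time stamp could cause trouble, which is why the age signals in the newer application matter.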
Moderator Alumni Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
|
Apr 1 2005, 01:40 AM |
|
|
Historical data involving a document can influence ranking scores. Here is one of the things that can make a difference:
document inception date

This can be determined a number of different ways (maybe based upon what type of document it is, or by what implementation of the application is being used):

[list]
[*]When first crawled by the search engine
[*]When first submitted to the search engine
[*]When a link to the document is first discovered
[*]Domain registration date
[*]When first referenced in another document
[*]When a document first reaches a certain number of pages
[*]By the time stamp of the document on the server it is hosted upon.
[/list]

The application tells us that under a link-based ranking system not using age-based information, a document with fewer links to and from it may rank lower than a document with more links to and from it. But, if the document with fewer links can be determined to be newer, based upon the document inception date, it might just rank higher than an older document with more links because its links are growing at a higher rate. But too many links, coming too quickly to the newer document, may also be a sign that some type of spamming is happening.

So, how is that rate determined, and how much does it influence the overall ranking of a page? This formula is given as one way of determining that:

QUOTE
H=L/log(F+2), where H may refer to the history-adjusted link score, L may refer to the link score given to the document, which can be derived using any known link scoring technique (e.g., the scoring technique described in U.S. Pat. No. 6,285,999) that assigns a score to a document based on links to/from the document, and F may refer to elapsed time measured from the inception date associated with the document (or a window within this period).

The patent further refines this formula by negating some of the difference between the ages of the documents, in a recognition that some "older documents may be more favorable than newer ones" and that some sets of results can be fairly mature.
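To see what the basic quoted formula does before any of those refinements, here's a small Python sketch of H = L/log(F+2). The application doesn't say which logarithm base or which time unit is meant, so the natural log and "days since inception" here are my assumptions, and the link scores in the example are made up.

```python
import math


def history_adjusted_link_score(link_score: float, elapsed: float) -> float:
    """H = L / log(F + 2), following the formula quoted from the application.

    link_score: L, from any link scoring technique (made-up numbers below).
    elapsed:    F, time since the inception date (assumed here to be in days).
    The natural logarithm is an assumption; the application does not name a base.
    """
    if elapsed < 0:
        raise ValueError("elapsed time must be non-negative")
    return link_score / math.log(elapsed + 2)


old_page = history_adjusted_link_score(link_score=100.0, elapsed=1000)
new_page = history_adjusted_link_score(link_score=40.0, elapsed=10)
print(old_page, new_page)
```

With these made-up numbers the newer page, with less than half the link score, ends up with the higher history-adjusted score. The "+2" also keeps the denominator positive for a brand new document, where F is zero.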
The scores of documents can be influenced (positively or negatively) by the difference between the document's age and the average age of documents resulting from a query. So, a fairly new site that appears amongst a set of results that are, on average, fairly old may find itself negatively influenced by that difference in age.

There are, however, a number of other ways to assign a score based upon age, which can influence the ranking of a site. The patent goes into those in more detail. And I will, too. Tomorrow...
Star Member Group: Members
Joined: 24-February 05
Posts: 517
|
Apr 1 2005, 02:27 AM |
|
|
Back in February I wrote a paper, "On the Googleness of Being", in which I shared some observations and guesswork about this TimeRank factor (that is what I called it).
If it's okay, I'll post the link here (I posted it in the Spider-Food forum):

http://forums.spider-food.net/index.php?showtopic=2767

This is a hypothetical position, not the official Michael Martinez interpretation of Google. But it comes close to stating some of the principles I have been working with for a long time.

The patent doesn't really tell us whether Google is doing this stuff now, but it does make it sound like they have been tinkering under the hood with these ideas -- especially given the search engine's behavior over the past year.
Moderator Alumni Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
|
Apr 2 2005, 02:19 AM |
|
|
The Inventors of the Subject of the Patent Application
I was going to pick up where I left off, but I noticed in another thread on another forum a valid complaint against some interpretations of this patent application. The complaint was that the application doesn't have Google's name on it, and that because of that, it's misleading to attribute many of the things listed in the patent application to Google, and to what Google is doing on the web presently.

The complaint was mainly targeted at an in-depth analysis of the patent application by Randfish, which I'm trying not to read while we pursue this much more rambling and discursive analysis. But, if you want to jump over to what he has done (yes, I've peeked), it's at:

http://www.socengine.com/seo/guide/google-...ata-patent.html

And while this patent application hasn't been legally assigned to Google, and we are making an assumption if we believe that it might be, I thought it might be a good idea to find out a little more about some of the authors of the patent application. At this point in time, and at the time of filing of the patent application, they do all appear to be Google employees.

I don't know how I would feel as an employer if some of my best and brightest employees got together and invented what appears to be a complex and fairly comprehensive application on the use of aging information to influence the rankings of documents in search engine results. But, it doesn't have the search company's name upon it. So, keep in mind that all of this speculation about how this aging information fits in may not describe how Google presently works. Instead, the application covers the workings of "Search Engine 125."

Here are the inventors (I'm not sure of all of their official titles at Google, but I've included what I was able to find):

Anurag Acharya
Principal Engineer, Google
One of the major forces behind Google Scholar.
Anurag Acharya Helped Google’s Scholarly Leap
Scholarly pursuits

Matt Cutts
Software Engineer, Google
A popular speaker at Search Engine Strategies and other conferences, including one for Consumer Webwatch, where he made this statement:
QUOTE
Because for some searches, there's some small or — not the majority, but some percentage of searches which are commercial. Say, 30, 40 percent, where somebody's looking to buy a product. And in those sorts of searches, the advertisements can be just as useful as the search results. In fact, we often order the search results by the click-through percentage.
He also gives a number of interviews on behalf of Google, like these:
Inside the Google search machine
Interview with Matt Cutts, Google

Jeffrey Dean
Distinguished engineer in the Google Systems Research Lab
If you look through a number of Google patents and patent applications, you may notice his name appears more than once as a co-inventor. Here's a paper on the physical side of Google's operation that he co-authored:
Web Search For A Planet: The Google Cluster Architecture

Paul Haahr
Senior software engineer, search quality group (Google)
http://www.webcom.com/~haahr/about.html

Monika Henzinger
Director of Research at Google
Another name that springs up in Google patents and patent applications on a regular basis, along with a few for some other search companies before Google had the good fortune to employ her.
Still Searching (After All These Years)
Google's Research Maven

Urs Hoelzle
Google Fellow
Peeking Into Google
Urs Hoelzle, Vice President of Operations, Google
Google's Management Team

Steve Lawrence
Google Senior Research Scientist
Steve Lawrence: List of publications from the DBLP Bibliography Server
The main developer of CiteSeer, presumably involved with Google Scholar, and the lead developer of Google's desktop search.
Google unveils desktop search

Karl Pfleger
Google employee
Karl Pfleger's Other Interests

Olcan Sercinoglu
Google Software Engineer
Worked on developing the cluster management system used by MapReduce.
MapReduce: Simplified Data Processing on Large Clusters
After an internship at Google, Inc., he now works full time for the California company as a software engineer.
Engineering News: School of Engineering and Applied Science at Washington University in St. Louis (Summer 2002)

Simon Tong
Research Scientist, Google Inc.
One of the folks in the Google Labs.
alicebot-general - Fw: Fw: another way cool thing at Google

ps. On preview, you're welcome Elizabeth. I'm trying to explain as much of this in as plain language as I can. I think I'm getting better at it as I go. Hopefully, I'll get even better.

pps. Nice Avatar. It's good to be able to see the faces behind posts.
Moderator Alumni Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
|
Apr 2 2005, 09:54 PM |
|
|
Pages Change, Part 2
I probably should have included one more paragraph in the previous post dealing with changes. The next section of the patent application talks about comparing the rate and amount of changes over more than one period of time. The way that paragraph is worded, it appears that the patent application may often favor sites that are updated frequently and that show an increase in the rate and amount of change.

All of this extra monitoring creates some new challenges, like where to put all the extra data that comes with it.

The Problem of Storing Historic Data to Compare for Changes

It struck me while reading all of this information about tracking changes in documents that it would take a considerable amount of storage to house older copies of documents and allow a comparison. While Google presently claims "8,058,044,651 web pages" on the front of the site, the Internet Archive had ten billion pages indexed in 2001, and an estimate from October of 2004 indicated 30 billion pages. The Internet Archive has copies of sites as they have changed.

So, if a search engine is tracking changes to documents over time, it needs to store information about those changes. But storing exact copies of documents could cause its database to balloon in size quickly, and some type of strategy may be needed to keep the amount of information fairly small.

Fingerprints and Other Approaches

There was a mention of "fingerprints" in this thread from rcjordan, and the duplicate content patent I mentioned in a prior post also talks about a fingerprint strategy. That's one possible way for a search engine to track changes without having to keep exact copies of earlier documents for comparison.

When fingerprints are compared, the method used is to look at only a few places on the prints, and see if there is a match in those places. There isn't an attempt to overlay one print with another and try to achieve an exact match.
Rather, information about the characteristics of a print at different places on a finger is stored in a database. When looking for a match, the information about those points is compared. So, matching fingerprints doesn't call for exact matches of prints, but rather matches at a number of predetermined points on prints.

The patent application addresses this need for storage capacity when "monitoring the documents for content changes," and provides a number of different ways of tracking changes while not storing full copies of documents. Representations of the documents, like the fingerprint information, can be stored and monitored for changes. Here are the strategies that the patent application mentions:

[list]
[*]"Signatures" of documents may be stored instead of the entire documents.
[*]Term vectors may be stored and monitored for relatively large changes.
[*]A relatively small portion (e.g., a few terms) of the documents that are determined to be important or the most frequently occurring (excluding "stop words") may be stored and monitored.
[*]A summary or other representation of a document may be maintained.
[*]A similarity hash (which may be used to detect near-duplication of a document) for the document and monitor it for changes. A change in a similarity hash may be considered to indicate a relatively large change in its associated document. (See the patent on duplicate content - in addition to looking for duplicates on documents in other places in the web, it can be used to compare newer and older copies of the same document.)
[*]Other possible options may be considered and used, too.
[/list]

Of course, there is the possibility that there is enough storage, and full copies of documents can be maintained and monitored. (See Google's cache, for instance.)
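Here's a rough Python sketch of how two of those strategies might fit together: a "signature" to notice that anything changed at all, plus a handful of frequent non-stop-word terms to guess whether the change was large. Everything specific in it is my own assumption for illustration (SHA-256 as the signature, the tiny stop word list, five terms, and the 0.5 overlap threshold); the application doesn't name any of these.

```python
import hashlib
import re
from collections import Counter

# Tiny illustrative stop word list, not a real one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def signature(text: str) -> str:
    """Compact stand-in for the full document; any edit changes the digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def top_terms(text: str, k: int = 5) -> frozenset:
    """The k most frequent terms, excluding stop words."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]
    return frozenset(term for term, _ in Counter(words).most_common(k))


def snapshot(text: str, k: int = 5):
    """What gets stored instead of the full copy of the document."""
    return signature(text), top_terms(text, k)


def classify_change(old_snapshot, new_text: str, k: int = 5, major_below: float = 0.5) -> str:
    """Return 'none', 'minor' or 'major' by comparing stored data with a new crawl."""
    old_sig, old_terms = old_snapshot
    new_sig, new_terms = snapshot(new_text, k)
    if new_sig == old_sig:
        return "none"
    union = old_terms | new_terms
    overlap = len(old_terms & new_terms) / len(union) if union else 1.0
    return "major" if overlap < major_below else "minor"
```

The point of the sketch is the storage trade-off: the snapshot is a short digest and a few words, however long the page is, yet it is still enough to tell a typo fix from a wholesale rewrite.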
(I've kept the language of some of those list items exactly the same as they appear on the application because it's either not really clear what those words mean yet, or they are already fairly clear.)

Some Change isn't Good

While it has been implied earlier, the next section (0055) shows a recognition that for some types of queries, changes may not be a good thing. For those queries, the update score could be adjusted based upon the difference from an "average date-of-change" of the results from that query. Here's how that might work when results from a search are returned by the search engine:

[list]
[*]Each document has a last change date,
[*]An average is calculated for the documents in the results,
[*]The documents' ranking scores are modified positively or negatively based on the difference between the last change date and the average date-of-change for all of those documents.
[/list]

Summary of the Section

Some type of age rank is created for a document based upon information collected about how old that document is and how it has changed over time. The summary of this section of the patent application adds a detail about "large" documents belonging to more than one person or organization. For those, the score may be broken down into scores for smaller sections, where the content, and changes to that content, are under the control of one person or organization. For instance, each of the blogs hosted on Blogspot might be treated differently under this age scoring system.
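The three date-of-change steps listed in this post (section 0055) could be sketched in Python along these lines. The application only says scores are modified "positively or negatively" based on the difference from the average, so the straight-line adjustment, the `weight` value, and the `prefer_recent` switch are all inventions of mine to show the shape of the idea.

```python
from datetime import date


def adjust_for_change_dates(results, weight=0.001, prefer_recent=True):
    """Shift each score by its distance from the average date-of-change.

    results: list of (url, score, last_change_date) tuples for one query.
    weight:  hypothetical score change per day of difference.
    prefer_recent: True rewards fresher pages; False rewards stable ones,
                   for queries where change may not be a good thing.
    Returns a dict mapping url to adjusted score.
    """
    if not results:
        return {}
    ordinals = [changed.toordinal() for _, _, changed in results]
    average = sum(ordinals) / len(ordinals)  # average date-of-change, in days
    sign = 1.0 if prefer_recent else -1.0
    return {
        url: score + sign * weight * (changed.toordinal() - average)
        for url, score, changed in results
    }
```

Because every document is measured against the average for its own result set, the same page could be nudged up for one query and down for another, which fits the idea that freshness matters more for some kinds of searches than others.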