Moderator Alumni | Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
Feb 16 2005, 01:43 AM
It's hard to tell exactly what Google might use to determine which page to show, if it determines that two pages contain enough similar content for it to consider them duplicates.
The best words we have on the subject are in the Google patent, Detecting duplicate and near-duplicate files:

QUOTE
In response to the detected duplicate documents, the present invention may also function to eliminate duplicate documents (e.g., keeping the one with best PageRank, with best trust of host, that is the most recent). Alternatively, the present invention may function to generate clusters of near-duplicate documents, in which a transitive property is assumed (i.e., if document A is a near-duplicate of document B, and document B is a near-duplicate of document C, then document A is considered a near-duplicate of document C). Each document may have an identifier for identifying a cluster with which it is associated. In this alternative, in response to a search query, if two candidate result documents belong to the same cluster and if the two candidate result documents match the query equally well (e.g., have the same title and/or snippet) if both appear in the same group of results (e.g., first page), only the one deemed more likely to be relevant (e.g., by virtue of a high PageRank, being more recent, etc.) is returned.

While PageRank may be one consideration, relevance (assuming that there are some slight differences and one page is more relevant than the other), the "best trust of host", or the age of each document may also play some part in deciding which page appears and which is filtered.

What is the "best trust of host"? I'm not exactly sure. Maybe something to do with Authorities and Hubs?
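To make the clustering idea in that quote a bit more concrete, here is a rough sketch in Python. It is only an illustration of the transitive grouping and the "keep one per cluster" filtering the patent text describes, not a description of what Google runs. The `Doc` fields (`pagerank`, `age_days`), the function names, and the tie-breaking order (PageRank first, then recency) are all my own assumptions. It also skips the patent's extra condition that the two results match the query equally well.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    pagerank: float  # hypothetical score, higher is better
    age_days: int    # hypothetical: days since last update, lower is more recent


def cluster_near_duplicates(doc_ids, near_dup_pairs):
    """Group documents into clusters, treating near-duplication as transitive.

    Uses a simple union-find: if A~B and B~C, then A, B and C share a cluster.
    Returns a dict mapping each doc id to its cluster identifier.
    """
    parent = {d: d for d in doc_ids}

    def find(x):
        # Walk up to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in near_dup_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    return {d: find(d) for d in doc_ids}


def filter_results(results, cluster_of):
    """Keep one document per cluster, preserving the original result order.

    Assumed preference: higher PageRank wins, and more recent wins on a tie.
    """
    best = {}
    for doc in results:
        cid = cluster_of[doc.doc_id]
        current = best.get(cid)
        if current is None or (doc.pagerank, -doc.age_days) > (
            current.pagerank,
            -current.age_days,
        ):
            best[cid] = doc
    kept = {d.doc_id for d in best.values()}
    return [d for d in results if d.doc_id in kept]
```

Note that with the transitive assumption, A and C end up in the same cluster even if nobody ever compared them directly, which is one way a page could get filtered in favor of a page it only loosely resembles.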
Some other patents from search engines that cover the topic of duplicate data:

- Method for clustering closely resembling data objects
- Method and apparatus for finding mirrored hosts by analyzing connectivity and IP addresses
- Method and apparatus for finding mirrored hosts by analyzing urls
- Method for identifying related pages in a hyperlinked database
- Method for identifying near duplicate pages in a hyperlinked database

While none of those are assigned to Google, the last four I listed share an author with the Google patent. The first one is cited in the Google patent. None of these patents may describe exactly what Google is doing when it comes to duplicate content, but they may provide some insight into possible approaches.
Moderator Alumni | Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
Feb 22 2005, 12:16 AM
You're welcome, Folex.
QUOTE
He is aware that the content looks the same but that the code is completely different, G reads the code and not the text, so this should not be a problem. I was under the impression that for google to dictate your content for clarity/density etc, then it would almost certainly have to read the body text aswell as the code.

Another patent from Google on duplicate content, one that I hadn't seen before, might make him think differently: Detecting query-specific duplicate documents.

Of course, just because Google has a patent on this doesn't mean that they are using it. But they could be. It describes a different approach that might be significant.

Imagine a set of pages that are very similar in that they share header, footer, navigation, and links out, but whose main content is very different. Should they be considered duplicates?

Under a duplicate removal system that looks at the level of similarity between two pages before a query happens, one of those pages may not be displayed because there is so much similarity between the two (even if the comparison is based upon a limited amount of information about each page - sort of a fingerprinting of a page, as described in a couple of the patents linked to above).

Now imagine that the comparison kicks in after the query is conducted, and that instead of comparing whole pages, the comparison is between a relevant snippet from each page returned for the query. Or more than one relevant snippet from each page. Or snippets and page titles. Or snippets and "last update" dates. Or that relevant snippet and some other part of the page.

In that instance, even if much of the code is different, if the content is similar, one of the pages may be removed from the query results as duplicate content.
From the conclusion to the patent:

QUOTE
By limiting the portion(s) of the documents being compared based on a query, a large range of duplicate document types, including those that would be missed by conventional similarity determination techniques, may now be detected. Further, since only a portion(s) of the documents are compared, the similarity threshold can be set relatively higher, thereby decreasing the number of documents that would be falsely identified as duplicates if a lower threshold were used.
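Here is a small Python sketch of what a query-specific comparison could look like, just to illustrate the idea. The patent doesn't spell out code, so everything here is my own assumption: the function names, taking a window of words around the first query term as the "snippet", using Jaccard word overlap as the similarity measure, and the 0.9 threshold.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def query_snippet(text, query, window=2):
    """Return the tokens around the first query term found in the text.

    Assumption: the 'relevant snippet' is just `window` words on each side
    of the first matching term. Returns an empty list if nothing matches.
    """
    tokens = tokenize(text)
    terms = set(tokenize(query))
    for i, tok in enumerate(tokens):
        if tok in terms:
            start = max(0, i - window)
            return tokens[start:i + window + 1]
    return []


def jaccard(a, b):
    """Jaccard similarity of two token lists: shared words / all words."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def query_specific_duplicates(page_a, page_b, query, threshold=0.9):
    """Compare only the query-relevant snippets, with a high threshold."""
    snippet_a = query_snippet(page_a, query)
    snippet_b = query_snippet(page_b, query)
    return jaccard(snippet_a, snippet_b) >= threshold


# Two pages with different templates but the same main sentence.
page_a = "Acme Home Contact our blue widget ships free in two days Copyright Acme"
page_b = ("Bobs Store Menu Cart our blue widget ships free in two days "
          "All rights reserved")

whole_page_similarity = jaccard(tokenize(page_a), tokenize(page_b))
same_for_query = query_specific_duplicates(page_a, page_b, "widget")
```

In this toy example the whole-page comparison falls well short of a 0.9 threshold because the surrounding template words differ, while the snippet comparison for the query "widget" treats the two pages as duplicates. That is the situation Folex's client is in: different code around the same content.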