
Cre8asiteforums Internet Marketing
and Conversion Web Design



Duplicate Content again



#1 folex

    Mach 1 Member

  • Members
  • 272 posts

Posted 15 February 2005 - 12:03 PM

Afternoon all,

I was just trying to get a bit of clarification on duplicate content.

A reseller of ours who is above us in the SERPs for a few of our keywords has recently changed all the content on his web pages to exactly the same copy as ours. He had always done rather well using his own copy, and a lot of his site also links to our main site.

Our main site has been gaining momentum and has been moving up the SERPs rather well over the last 2 months. I do expect it to slow and settle, but I am concerned about the reseller's tactic. I am aware of the issues around duplicate content, and this has backed them up.

Quote from SearchEngineWatch

""Multiple domains create as many issues as they address," "You can own 100 domains covering significantly different products, linking them to a central domain—each solving the subject-focus issues but creating linking issues. Furthermore, purchasing 100 domains loaded with keywords and pointing them to the same physical site can cause duplicate content issues. Each situation is unique and they all have to be handled differently."

Now I know we are not dealing with hundreds of domains, but based on the fact that this reseller has duplicated our content, I am led to believe his intentions are not honourable.

Fortunately I have been working on another variation of the site with new content, and this has given me page one rankings for most of my keywords; I am on page 2 for my most important one.

Just wondering if there were any thoughts on this one.

"my note for the day"

F.

#2 Black_Knight

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 9339 posts

Posted 15 February 2005 - 06:14 PM

If Google does recognize the two sites (yours and the partner's) as duplicates of the same content, then it will only show one version of each page it deems to be a duplicate - either yours or the partner's. If his site has the greater link popularity (higher PageRank), then it is thought likely (not certain) that his version will be chosen as the 'primary' and yours will be filtered out as the duplicate.

On the other hand, if your version has the higher 'weight' factor, then yours will be the version showing in results, and it is his version that will be filtered out. It is rare that the affiliate has the higher PageRank, but it can sometimes happen.
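
To make the idea concrete, here is a minimal sketch - purely illustrative, with made-up weight values, since Google's actual signals and thresholds are not public - of a filter that keeps only the highest-weight version of duplicated content:

# Illustrative only: a toy duplicate filter that keeps the "heavier" version.
# The weight values, and using a single PageRank-like score as the deciding
# signal, are assumptions for the example, not Google's actual system.

def filter_duplicates(pages):
    """pages: list of dicts with 'url', 'content_hash' and 'weight'."""
    best_by_content = {}
    for page in pages:
        key = page["content_hash"]          # same hash = treated as duplicate content
        current = best_by_content.get(key)
        if current is None or page["weight"] > current["weight"]:
            best_by_content[key] = page     # keep only the highest-weight version
    return list(best_by_content.values())

pages = [
    {"url": "http://yoursite.example/widgets", "content_hash": "abc", "weight": 5.2},
    {"url": "http://reseller.example/widgets", "content_hash": "abc", "weight": 3.1},
]
print(filter_duplicates(pages))             # only yoursite.example/widgets survives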

A cease and desist letter from legal counsel (on the grounds of copyright infringement) is usually a persuasive and powerful way to make an affiliate realise that copying your text is not appropriate or legal behaviour.

#3 BillSlawski

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15644 posts

Posted 16 February 2005 - 01:43 AM

It's hard to tell exactly what Google might use to determine which page to show, if it decides that two pages contain enough similar content to be considered duplicates.

The best words we may have on the subject are in the Google patent: Detecting duplicate and near-duplicate files

In response to the detected duplicate documents, the present invention may also function to eliminate duplicate documents (e.g., keeping the one with best PageRank, with best trust of host, that is the most recent). Alternatively, the present invention may function to generate clusters of near-duplicate documents, in which a transitive property is assumed (i.e., if document A is a near-duplicate of document B, and document B is a near-duplicate of document C, then document A is considered a near-duplicate of document C). Each document may have an identifier for identifying a cluster with which it is associated. In this alternative, in response to a search query, if two candidate result documents belong to the same cluster, if the two candidate result documents match the query equally well (e.g., have the same title and/or snippet), and if both appear in the same group of results (e.g., first page), only the one deemed more likely to be relevant (e.g., by virtue of a high PageRank, being more recent, etc.) is returned.


While PageRank may be one consideration, relevance (assuming that there are some slight differences and one is more relevant than the other), the "best trust of host", or the age of each document may also play some part in the decision as to which page appears, and which is filtered.

What is the "best trust of host"? I'm not exactly sure. Maybe something to do with Authorities and Hubs?
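
To illustrate the clustering idea in that excerpt, here is a rough sketch (just an illustration with made-up scores and a generic union-find, not a description of what Google actually does) of grouping near-duplicates transitively and then keeping one representative per cluster:

# Illustrative only: clustering near-duplicates with the transitive assumption
# described in the patent excerpt (if A~B and B~C, then A, B and C share a
# cluster), then keeping one representative. The scoring blend is made up.

def cluster_near_duplicates(docs, near_duplicate_pairs):
    parent = {d["url"]: d["url"] for d in docs}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path compression
            u = parent[u]
        return u

    def union(u, v):
        parent[find(u)] = find(v)

    for a, b in near_duplicate_pairs:       # pairwise near-duplicate judgements
        union(a, b)

    clusters = {}
    for d in docs:
        clusters.setdefault(find(d["url"]), []).append(d)
    return list(clusters.values())

def representative(cluster):
    # Hypothetical blend of the factors the excerpt mentions: PageRank,
    # trust of host, and recency.
    return max(cluster, key=lambda d: d["pagerank"] + d["host_trust"] + d["recency"])

docs = [
    {"url": "a.example/p", "pagerank": 4.0, "host_trust": 1.0, "recency": 0.2},
    {"url": "b.example/p", "pagerank": 2.0, "host_trust": 0.5, "recency": 0.9},
    {"url": "c.example/p", "pagerank": 1.0, "host_trust": 0.3, "recency": 0.1},
]
pairs = [("a.example/p", "b.example/p"), ("b.example/p", "c.example/p")]
for cluster in cluster_near_duplicates(docs, pairs):
    print(representative(cluster)["url"])   # all three fall into one cluster; a.example/p wins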

Some other patents from search engines that cover the topic of duplicate data:

Method for clustering closely resembling data objects

Method and apparatus for finding mirrored hosts by analyzing connectivity and IP addresses

Method and apparatus for finding mirrored hosts by analyzing urls

Method for identifying related pages in a hyperlinked database

Method for identifying near duplicate pages in a hyperlinked database

While none of those are assigned to Google, the last four I listed share an author with the Google patent. The first one is cited in the Google patent.

None of these patents may be descriptions of exactly what Google is doing, when it comes to duplicate content, but they may provide some insight into possible approaches.

#4 folex

    Mach 1 Member

  • Members
  • 272 posts

Posted 16 February 2005 - 10:31 AM

Thanks Ammon and Bill for your replies,

Ammon: "it is rare that the affiliate has the higher PageRank, but it can sometimes happen."

The reseller used to work for the competition and had already gained a top position before we started working with him.

I mailed the company/person and his response was:

He is aware that the content looks the same, but the code is completely different; G reads the code and not the text, so this should not be a problem.

I was under the impression that for Google to evaluate your content for clarity/density etc., it would almost certainly have to read the body text as well as the code.

Hmmm

I must have got it wrong?

F.

#5 BillSlawski

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15644 posts

Posted 22 February 2005 - 12:16 AM

You're welcome, Folex.

He is aware that the content looks the same, but the code is completely different; G reads the code and not the text, so this should not be a problem.

I was under the impression that for Google to evaluate your content for clarity/density etc., it would almost certainly have to read the body text as well as the code.


Another patent from Google on duplicate content that I hadn't seen before might make him think differently:

Detecting query-specific duplicate documents

Of course, just because Google has a patent on this doesn't mean that they are using it. But they could be.

It provides a different approach that might be significant. Imagine a set of pages that are very similar in that they share a header, footer, navigation, and outbound links, but which also contain, as the main content of each page, some information that is very different. Should they be considered duplicates?

Under a duplicate-removal system which looks at the level of similarity between two pages before a query happens, one of those pages may not be displayed, because there is so much similarity between the two pages (even if the comparison is based upon a limited amount of information about each page - sort of a fingerprinting of a page, as described in a couple of the patents linked to above).
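
As a rough illustration of that pre-query, whole-page fingerprinting idea (the shingle size, hashing, and threshold here are assumptions for the sketch, not the exact method of any one patent):

# Illustrative only: a generic shingle-based fingerprint for whole-page,
# pre-query comparison. The shingle size, hashing and threshold here are
# assumptions for the sketch, not the exact method of any one patent.

import hashlib

def fingerprint(text, shingle_size=5, keep=8):
    words = text.lower().split()
    shingles = {" ".join(words[i:i + shingle_size])
                for i in range(max(1, len(words) - shingle_size + 1))}
    # Keep the smallest hashes as a compact "sketch" of the page.
    hashes = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16) for s in shingles)
    return set(hashes[:keep])

def looks_duplicate(text_a, text_b, threshold=0.5):
    fa, fb = fingerprint(text_a), fingerprint(text_b)
    overlap = len(fa & fb) / max(1, len(fa | fb))   # Jaccard-style overlap
    return overlap >= threshold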

Now imagine that the comparison kicks in after the query is conducted, and instead of comparing whole pages, the comparison is between a relevant snippet from each page that is returned for the query. Or more than one relevant snippet from each page. Or snippets and page titles. Or snippets and "last update" dates. Or that relevant snippet and some other part of the page.

In that instance, even if much of the code is different, if the content is similar, one of the pages may be removed from the query results as duplicate content.
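
A rough sketch of that post-query comparison (the snippet window and the similarity test are assumptions for the example, not the patent's actual method):

# Illustrative only: a post-query check that compares the snippets two results
# would show for the query rather than the whole pages. The snippet window and
# the similarity test are assumptions for the example, not the patent's method.

import difflib

def snippet_for(query, page_text, window=30):
    """Crude snippet: the words around the first query-term hit."""
    words = page_text.split()
    terms = {t.lower() for t in query.split()}
    for i, w in enumerate(words):
        if w.lower().strip(".,") in terms:
            return " ".join(words[max(0, i - window // 2): i + window // 2])
    return " ".join(words[:window])

def query_specific_duplicates(query, page_a, page_b, threshold=0.9):
    snip_a, snip_b = snippet_for(query, page_a), snippet_for(query, page_b)
    ratio = difflib.SequenceMatcher(None, snip_a, snip_b).ratio()
    return ratio >= threshold   # near-identical snippets -> treat as duplicates

# Two pages with completely different code and navigation, but the same copied
# body text, would return near-identical snippets and be filtered as duplicates.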

From the conclusion to the patent:

By limiting the portion(s) of the documents being compared based on a query, a large range of duplicate document types, including those that would be missed by conventional similarity determination techniques, may now be detected. Further, since only a portion(s) of the documents are compared, the similarity threshold can be set relatively higher, thereby decreasing the number of documents that would be falsely identified as duplicates if a lower threshold were used.




